Taking the Training Wheels Off: Aligning LLMs Without Personas
5 hours ago
- #Personaless Alignment
- #AI Alignment
- #Superintelligence
- Current AI alignment techniques rely on models mimicking 'good personas' found in training data, such as helpful humans. This works for present-day AI but may not scale to superhuman AI.
- Superhuman AI will face out-of-distribution situations for which human personas provide no data, so mimicry alone is insufficient for alignment.
- Personaless Alignment is proposed as a research direction: align models without relying on personas, so that alignment techniques are tested under tougher conditions that better simulate the challenges of superintelligence.
- Candidate experiments for Personaless Alignment include filtering moral content out of pretraining data, or 'Pessimal Pretraining' on misaligned data. Both are difficult to design and may be insufficient.
- The goal is to develop alignment methods that go beyond mimicry, so that their results are a better indicator of how alignment will fare for future artificial superintelligence (ASI).
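
As a rough illustration of the data-filtering experiment mentioned above, here is a minimal sketch of dropping documents that contain moral vocabulary from a toy corpus. Everything in it is an assumption made for illustration: the keyword list `MORAL_TERMS`, the helper names `mentions_morality` and `filter_corpus`, and the three sample documents are hypothetical and are not taken from the post. A real experiment would presumably need a much stronger detector than keyword matching.

```python
import re

# Hypothetical keyword list (an assumption for this sketch, not from the post).
MORAL_TERMS = {"moral", "morality", "ethical", "ethics",
               "honest", "wrong", "evil", "virtue"}


def mentions_morality(text, terms=MORAL_TERMS):
    """Return True if any lowercase word in `text` exactly matches a term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(terms)


def filter_corpus(docs, terms=MORAL_TERMS):
    """Keep only the documents that contain none of the listed terms."""
    return [doc for doc in docs if not mentions_morality(doc, terms)]


# Toy corpus, invented for illustration.
corpus = [
    "The boiling point of water at sea level is 100 degrees Celsius.",
    "It is wrong to lie, and an honest person tells the truth.",
    "She returned the lost wallet to its owner without being asked.",
]

kept = filter_corpus(corpus)
for doc in kept:
    print(doc)
```

In this toy run the second document is removed, but the third is kept even though it describes prosocial behavior, because it uses no moral vocabulary. That is one way the design difficulty noted above might show up: moral content can be implicit in ordinary narrative, so a surface-level filter may leave much of it in place.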