Taking the Training Wheels Off: Aligning LLMs Without Personas

5 hours ago

Current AI alignment techniques rely on models mimicking 'good personas' found in their training data, such as helpful humans. This works for present-day AI but may not scale to superhuman AI.
Superhuman AI will face out-of-distribution situations for which human personas offer no examples to imitate, so mimicry alone is insufficient for alignment.
Personaless Alignment is proposed as a research direction: align models without relying on personas, so that alignment techniques are tested under tougher conditions that better simulate the challenges of superintelligence.
Proposed experiments for Personaless Alignment include filtering moral content out of the pretraining data or conducting 'Pessimal Pretraining' with misaligned data, though both are difficult to design and may be insufficient.
The goal is to develop alignment methods that go beyond mimicry and therefore offer a better indicator of how alignment will fare for future artificial superintelligence (ASI).
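
To make the first proposed experiment (filtering moral content from pretraining data) more concrete, here is a minimal sketch in Python. It is an illustration only, not the method described in the article: the keyword list `MORAL_TERMS`, the helper names `mentions_morality` and `filter_corpus`, and the three-document corpus are all hypothetical. It assumes the simplest possible filter, a case-insensitive substring match.

```python
import re

# Hypothetical keyword list: an assumption for illustration, not from the article.
MORAL_TERMS = ("moral", "ethic", "virtue", "conscience")

# One case-insensitive pattern that matches any listed term as a substring.
_PATTERN = re.compile("|".join(re.escape(t) for t in MORAL_TERMS), re.IGNORECASE)


def mentions_morality(document: str) -> bool:
    """Return True if the document contains any of the hypothetical terms."""
    return _PATTERN.search(document) is not None


def filter_corpus(documents):
    """Keep only the documents with no keyword match."""
    return [doc for doc in documents if not mentions_morality(doc)]


# Toy corpus (invented for this sketch).
corpus = [
    "The kettle boils at 100 degrees Celsius at sea level.",
    "Stealing is morally wrong, even when nobody notices.",
    "She helped the stranger carry his bags up the stairs.",
]

kept = filter_corpus(corpus)
for doc in kept:
    print(doc)
```

In this toy run the explicit moral statement is dropped, but the sentence describing helpful behaviour passes the filter because it contains no moral vocabulary. That may be one reason such filtering is hard to design: a persona can plausibly be learned from described behaviour as well as from explicit moral language.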
