Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment
- #self-fulfilling misalignment
- #AI alignment
- #pretraining discourse
- Pretraining corpora contain AI discourse that can influence downstream alignment.
- Negative descriptions of AI in training data may lead models to internalize misaligned behavior, creating a self-fulfilling prophecy.
- A controlled study pretrained 6.9B-parameter LLMs on varying amounts of alignment or misalignment discourse.
- Upsampling misalignment discourse increased misaligned behavior; upsampling aligned discourse reduced misalignment scores from 45% to 9%.
- Effects persist through post-training, showing the importance of pretraining data in shaping alignment priors (alignment pretraining).
- Researchers recommend considering alignment alongside capabilities during pretraining, not just post-training.
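
To make "upsampling" in the points above concrete, here is a minimal sketch of how one category of documents might be repeated in a pretraining mix. This summary does not describe the study's actual data pipeline, so the function names (`upsample`, `label_share`), the `aligned_ai` label, the toy corpus, and the factor of 5 are all hypothetical and chosen only for illustration.

```python
def upsample(docs, label, factor):
    """Return a new corpus in which each document tagged `label` appears
    `factor` times and every other document appears once.

    Hypothetical helper: the summarized study's real method is not given here.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for doc in docs:
        copies = factor if doc["label"] == label else 1
        out.extend([doc] * copies)
    return out


def label_share(docs, label):
    """Fraction of documents in `docs` that carry `label`."""
    return sum(d["label"] == label for d in docs) / len(docs)


# Toy corpus (illustrative only): 98 general documents and
# 2 documents that discuss aligned AI behavior.
corpus = [{"text": f"general {i}", "label": "general"} for i in range(98)]
corpus += [{"text": f"aligned {i}", "label": "aligned_ai"} for i in range(2)]

mixed = upsample(corpus, "aligned_ai", factor=5)

print(label_share(corpus, "aligned_ai"))  # share before upsampling
print(label_share(mixed, "aligned_ai"))   # share after upsampling
```

In this toy case the tagged documents go from 2 of 100 to 10 of 108 documents. A real pipeline would more likely adjust sampling weights over token counts than duplicate whole documents, so treat this as a picture of the idea rather than a recipe.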