Hasty Briefs (beta)

Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

7 hours ago
  • #self-fulfilling misalignment
  • #AI alignment
  • #pretraining discourse

  • Pretraining corpora contain AI discourse that can influence downstream alignment.
  • Negative descriptions of AI in training data may lead models to internalize misalignment, creating self-fulfilling prophecies.
  • A controlled study pretrained 6.9B-parameter LLMs on corpora with varying amounts of alignment or misalignment discourse.
  • Upsampling misalignment discourse increased misaligned behavior; upsampling aligned discourse reduced misalignment scores from 45% to 9%.
  • Effects persist through post-training, showing the importance of pretraining data in shaping alignment priors (alignment pretraining).
  • Researchers recommend considering alignment alongside capabilities during pretraining, not just post-training.
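
To make "upsampling" concrete, here is a minimal sketch of how one category of documents could be weighted more heavily when sampling a pretraining mix. It is an illustration only, not the study's actual pipeline: the toy corpus, the category labels, the weight of 10, and the helper names (`expected_share`, `upsample_mix`, `relative_reduction`) are all assumptions made for this example. The last helper simply works out the arithmetic behind the reported drop from 45% to 9%.

```python
import random

def expected_share(docs, weights, category):
    """Expected fraction of sampled documents that belong to `category`."""
    doc_weights = [weights.get(d["category"], 1.0) for d in docs]
    in_category = sum(w for d, w in zip(docs, doc_weights)
                      if d["category"] == category)
    return in_category / sum(doc_weights)

def upsample_mix(docs, weights, n_samples, seed=0):
    """Sample with replacement; a document's chance is proportional to
    the weight of its category (categories not listed get weight 1.0)."""
    rng = random.Random(seed)
    doc_weights = [weights.get(d["category"], 1.0) for d in docs]
    return rng.choices(docs, weights=doc_weights, k=n_samples)

def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

# Hypothetical toy corpus: 98 general documents, 1 with aligned-AI
# discourse, 1 with misaligned-AI discourse.
docs = (
    [{"text": f"general {i}", "category": "general"} for i in range(98)]
    + [{"text": "AI acts helpfully", "category": "aligned"}]
    + [{"text": "AI deceives its users", "category": "misaligned"}]
)

# Assumed weighting: aligned discourse is sampled 10x as often.
weights = {"aligned": 10.0}

baseline_share = expected_share(docs, {}, "aligned")        # 1 / 100
upsampled_share = expected_share(docs, weights, "aligned")  # 10 / 109
mix = upsample_mix(docs, weights, n_samples=1000)

# Reported scores: 45% misaligned before, 9% after upsampling aligned discourse.
drop = relative_reduction(0.45, 0.09)  # 36 percentage points, 80% relative
```

In this toy setup the aligned category grows from 1% of the expected mix to roughly 9%, while every other document keeps its original relative weight. A real pipeline would apply the same idea at the scale of tokens or shards rather than individual documents.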