Hasty Briefsbeta

Bilingual

Steering interpretable language models with concept algebra

21 hours ago
  • #Interpretability
  • #AI Control
  • #Concept Algebra
  • Steerling-8B enables concept algebra, allowing addition, removal, and composition of human-understandable concepts at inference time without retraining or prompt engineering.
  • The model supports direct editing of internal representations for any concept, facilitating control over generation without altering the prompt.
  • Compositional control is essential in multi-turn dialogues, such as content moderation or health assistance, where multiple concepts must be managed simultaneously.
  • Current control methods like prompting and fine-tuning are either unreliable or costly, lacking composability and fine-grained control.
  • Steerling-8B's concept module provides a linear, algebraic handle on internal variables, enabling both explanation and control of model behavior.
  • Mask-aligned injection ensures reliable control during diffusion decoding by aligning concept embeddings with the training distribution.
  • Steering can redirect outputs to different domains without changing the prompt, demonstrating versatility in generation control.
  • Bottleneck intervention allows for the removal of specific concepts by wiping their contributions before generation.
  • Systematic evaluation shows steering significantly improves concept adherence while maintaining high text quality.
  • The linear architecture of the concept module ensures predictable effects from interventions, distinguishing it from other control methods.