Steering interpretable language models with concept algebra
- #Interpretability
- #AI Control
- #Concept Algebra
- Steerling-8B enables concept algebra, allowing addition, removal, and composition of human-understandable concepts at inference time without retraining or prompt engineering.
- The model supports direct editing of internal representations for any concept, facilitating control over generation without altering the prompt.
- Compositional control matters most in multi-turn settings such as content moderation or health assistance, where several concepts must be managed at once.
- Existing control methods fall short: prompting is unreliable and fine-tuning is costly, and neither offers composability or fine-grained control.
- Steerling-8B's concept module provides a linear, algebraic handle on internal variables, enabling both explanation and control of model behavior.
- Mask-aligned injection ensures reliable control during diffusion decoding by aligning concept embeddings with the training distribution.
- Steering can redirect outputs to different domains without changing the prompt, demonstrating versatility in generation control.
- Bottleneck intervention allows for the removal of specific concepts by wiping their contributions before generation.
- Systematic evaluation shows steering significantly improves concept adherence while maintaining high text quality.
- The linear architecture of the concept module ensures predictable effects from interventions, distinguishing it from other control methods.
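To make the "linear, algebraic handle" concrete, here is a minimal sketch of concept-vector steering, not the paper's actual implementation: the function name `steer`, the vector dimensions, and the `alpha` scaling are all illustrative assumptions. Because the edit is linear, composing two concepts is just adding both directions, which is what makes the intervention's effect predictable.

```python
import numpy as np

def steer(hidden, concept_vec, alpha):
    """Hypothetical sketch: nudge a hidden state along a concept direction."""
    return hidden + alpha * concept_vec

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # stand-in for an internal activation
formal = rng.normal(size=8)     # stand-in "formality" concept direction
formal /= np.linalg.norm(formal)
casual = rng.normal(size=8)     # a second, independent concept direction
casual /= np.linalg.norm(casual)

# Apply two concepts in sequence; linearity means order does not matter
# and the combined edit is just the sum of the individual edits.
h_both = steer(steer(h, formal, 2.0), casual, 1.0)
assert np.allclose(h_both, h + 2.0 * formal + 1.0 * casual)
```

The assertion illustrates the composability claim above: under a linear intervention, stacking concepts never produces interaction effects between them.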
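The bottleneck-intervention bullet (wiping a concept's contribution before generation) can likewise be sketched as projecting the hidden state onto the subspace orthogonal to the concept direction. This is a generic linear-algebra illustration under that assumption, not Steerling-8B's code; `remove_concept` is a hypothetical name.

```python
import numpy as np

def remove_concept(hidden, concept_vec):
    """Hypothetical sketch: wipe a concept's contribution by projecting
    the hidden state onto the orthogonal complement of its direction."""
    u = concept_vec / np.linalg.norm(concept_vec)
    return hidden - np.dot(hidden, u) * u

rng = np.random.default_rng(1)
h = rng.normal(size=8)
concept = rng.normal(size=8)

h_clean = remove_concept(h, concept)
# The edited state has no component along the removed direction...
assert abs(np.dot(h_clean, concept / np.linalg.norm(concept))) < 1e-9
# ...and removing it again is a no-op, since projection is idempotent.
assert np.allclose(remove_concept(h_clean, concept), h_clean)
```

Idempotence is the practical payoff: repeated or redundant removals in a multi-turn dialogue cannot over-correct, consistent with the predictability claim above.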