Steering interpretable language models with concept algebra
- #Interpretability
- #AI Control
- #Concept Algebra
- Steerling-8B enables concept algebra, allowing addition, removal, and composition of human-understandable concepts at inference time without retraining or prompt engineering.
- The model supports direct editing of internal representations for any concept, facilitating control over generation without altering the prompt.
- Compositional control matters most in multi-turn settings such as content moderation or health assistance, where several concepts must be managed at once.
- Existing control methods fall short: prompting is unreliable and fine-tuning is costly, and neither offers composability or fine-grained control.
- Steerling-8B's concept module provides a linear, algebraic handle on internal variables, enabling both explanation and control of model behavior.
- Mask-aligned injection ensures reliable control during diffusion decoding by aligning concept embeddings with the training distribution.
- Steering can redirect outputs to different domains without changing the prompt, demonstrating versatility in generation control.
- Bottleneck intervention allows for the removal of specific concepts by wiping their contributions before generation.
- Systematic evaluation shows steering significantly improves concept adherence while maintaining high text quality.
- The linear architecture of the concept module ensures predictable effects from interventions, distinguishing it from other control methods.
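To make the "linear, algebraic handle" concrete, here is a minimal sketch of concept-vector steering, not the paper's actual implementation: the function name `steer`, the vector dimensions, and the `alpha` scaling are all illustrative assumptions. Because the edit is linear, composing two concepts is just adding both directions, which is what makes the intervention's effect predictable.

```python
import numpy as np

def steer(hidden, concept_vec, alpha):
    """Hypothetical sketch: nudge a hidden state along a concept direction."""
    return hidden + alpha * concept_vec

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # stand-in for an internal activation
formal = rng.normal(size=8)     # stand-in "formality" concept direction
formal /= np.linalg.norm(formal)
casual = rng.normal(size=8)     # a second, independent concept direction
casual /= np.linalg.norm(casual)

# Apply two concepts in sequence; linearity means order does not matter
# and the combined edit is just the sum of the individual edits.
h_both = steer(steer(h, formal, 2.0), casual, 1.0)
assert np.allclose(h_both, h + 2.0 * formal + 1.0 * casual)
```

The assertion illustrates the composability claim above: under a linear intervention, stacking concepts never produces interaction effects between them.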
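The bottleneck-intervention bullet (wiping a concept's contribution before generation) can likewise be sketched as projecting the hidden state onto the subspace orthogonal to the concept direction. This is a generic linear-algebra illustration under that assumption, not Steerling-8B's code; `remove_concept` is a hypothetical name.

```python
import numpy as np

def remove_concept(hidden, concept_vec):
    """Hypothetical sketch: wipe a concept's contribution by projecting
    the hidden state onto the orthogonal complement of its direction."""
    u = concept_vec / np.linalg.norm(concept_vec)
    return hidden - np.dot(hidden, u) * u

rng = np.random.default_rng(1)
h = rng.normal(size=8)
concept = rng.normal(size=8)

h_clean = remove_concept(h, concept)
# The edited state has no component along the removed direction...
assert abs(np.dot(h_clean, concept / np.linalg.norm(concept))) < 1e-9
# ...and removing it again is a no-op, since projection is idempotent.
assert np.allclose(remove_concept(h_clean, concept), h_clean)
```

Idempotence is the practical payoff: repeated or redundant removals in a multi-turn dialogue cannot over-correct, consistent with the predictability claim above.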