Show HN: Steerling-8B, a language model that can explain any token it generates
- #Interpretability
- #AI
- #LanguageModels
- Steerling-8B is the first inherently interpretable language model capable of tracing any generated token back to its input context, human-understandable concepts, and training data.
- Trained on 1.35 trillion tokens, it performs comparably to models trained on 2–7× more data.
- Key capabilities include concept suppression/amplification at inference, training data provenance, and inference-time alignment via concept control.
- The model decomposes token embeddings into supervised concepts, discovered concepts, and a residual pathway, providing interpretability without a performance tradeoff.
- Steerling-8B achieves competitive performance on benchmarks despite lower training compute, with 84% of token-level contributions coming from the concept module.
- Upcoming releases will explore concept steering, concept discovery, alignment without fine-tuning, and memorization/training-data valuation.
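The decomposition and steering ideas above can be sketched in a few lines. This is a minimal illustration only: the dictionary sizes, variable names, and the linear concept readout are assumptions for the sketch, not Steerling-8B's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding dim, supervised and discovered concept counts.
D, K_SUP, K_DISC = 16, 4, 4

# Hypothetical concept dictionaries; rows are unit-norm concept directions.
C_sup = rng.normal(size=(K_SUP, D))
C_sup /= np.linalg.norm(C_sup, axis=1, keepdims=True)
C_disc = rng.normal(size=(K_DISC, D))
C_disc /= np.linalg.norm(C_disc, axis=1, keepdims=True)

def decompose(h):
    """Split a hidden state h into supervised-concept activations,
    discovered-concept activations, and a residual the concepts miss."""
    a_sup = C_sup @ h
    a_disc = C_disc @ h
    recon = C_sup.T @ a_sup + C_disc.T @ a_disc
    residual = h - recon  # defined so the three parts sum back to h exactly
    return a_sup, a_disc, residual

def steer(h, concept_idx, scale):
    """Amplify (scale > 1) or suppress (scale < 1) one supervised concept
    at inference time, leaving the other pathways untouched."""
    a_sup, a_disc, residual = decompose(h)
    a_sup[concept_idx] *= scale
    return C_sup.T @ a_sup + C_disc.T @ a_disc + residual

h = rng.normal(size=D)
h_steered = steer(h, concept_idx=0, scale=0.0)  # fully suppress concept 0
a_sup, a_disc, residual = decompose(h_steered)
```

Because each concept direction is unit-norm, setting `scale=0.0` drives that concept's activation to zero in the steered state, while the residual term guarantees the three components always reconstruct the original embedding.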