Show HN: Steerling-8B, a language model that can explain any token it generates
- #Interpretability
- #AI
- #LanguageModels
- Steerling-8B is the first inherently interpretable language model capable of tracing any generated token back to its input context, human-understandable concepts, and training data.
- Trained on 1.35 trillion tokens, it performs comparably to models trained on 2–7× more data.
- Key capabilities include concept suppression/amplification at inference, training data provenance, and inference-time alignment via concept control.
- The model decomposes token embeddings into supervised concepts, discovered concepts, and a residual pathway, providing interpretability without a performance tradeoff.
- Steerling-8B achieves competitive performance on benchmarks despite lower training compute, with 84% of token-level contributions coming from the concept module.
- Upcoming releases will explore concept steering, concept discovery, alignment without fine-tuning, and memorization/training-data valuation.
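The decomposition and steering ideas above can be sketched in a few lines. This is a minimal illustration only: the dictionary sizes, variable names, and the linear concept readout are assumptions for the sketch, not Steerling-8B's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding dim, supervised and discovered concept counts.
D, K_SUP, K_DISC = 16, 4, 4

# Hypothetical concept dictionaries; rows are unit-norm concept directions.
C_sup = rng.normal(size=(K_SUP, D))
C_sup /= np.linalg.norm(C_sup, axis=1, keepdims=True)
C_disc = rng.normal(size=(K_DISC, D))
C_disc /= np.linalg.norm(C_disc, axis=1, keepdims=True)

def decompose(h):
    """Split a hidden state h into supervised-concept activations,
    discovered-concept activations, and a residual the concepts miss."""
    a_sup = C_sup @ h
    a_disc = C_disc @ h
    recon = C_sup.T @ a_sup + C_disc.T @ a_disc
    residual = h - recon  # defined so the three parts sum back to h exactly
    return a_sup, a_disc, residual

def steer(h, concept_idx, scale):
    """Amplify (scale > 1) or suppress (scale < 1) one supervised concept
    at inference time, leaving the other pathways untouched."""
    a_sup, a_disc, residual = decompose(h)
    a_sup[concept_idx] *= scale
    return C_sup.T @ a_sup + C_disc.T @ a_disc + residual

h = rng.normal(size=D)
h_steered = steer(h, concept_idx=0, scale=0.0)  # fully suppress concept 0
a_sup, a_disc, residual = decompose(h_steered)
```

Because each concept direction is unit-norm, setting `scale=0.0` drives that concept's activation to zero in the steered state, while the residual term guarantees the three components always reconstruct the original embedding.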