LLMs are not the black box you were promised

6 hours ago

Mechanistic interpretability has advanced significantly, making LLMs less of a 'black box'.
Circuit tracing trains a second, more interpretable model to reproduce the original model's activations as a combination of human-interpretable features, such as a feature for 'Texas'.
Models exhibit multi-step reasoning: asked for the capital of the state containing Dallas, a model activates features for 'Dallas', then 'Texas', then 'Austin' in sequence.
This pseudo-symbolic inference resembles higher reasoning and is not unique to LLMs, as seen in AlphaZero's chess concepts.
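Understanding the algorithms a model learns internally can inform better learning methods; one example is Claude 3.5 Haiku, which adds numbers using parallel pathways.
LLMs lack metacognitive insight into their own computation: much of it is 'subconscious', so it has to be uncovered by external analysis rather than by asking the model.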
These insights aid in detecting misbehavior, steering behavior, and developing better algorithms.
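
The parallel addition pathways mentioned above can be illustrated with a toy sketch. This is not Claude 3.5 Haiku's actual circuit; it only assumes the general shape of the idea, namely that one pathway produces a rough estimate of the sum while another works out the last digit exactly, and the two are then combined. The function names (rough_estimate, last_digit, add_via_parallel_pathways) are hypothetical and chosen for illustration.

```python
def rough_estimate(a: int, b: int) -> int:
    """Low-precision pathway (assumed): add a to b rounded to the nearest 10.

    The result is within [-5, +4] of the true sum.
    """
    b_rounded = (b + 5) // 10 * 10  # round half up to a multiple of 10
    return a + b_rounded


def last_digit(a: int, b: int) -> int:
    """Precise pathway (assumed): only the ones digit of the sum."""
    return (a % 10 + b % 10) % 10


def add_via_parallel_pathways(a: int, b: int) -> int:
    """Combine both pathways for non-negative integers.

    The rough estimate narrows the answer to 10 consecutive candidates;
    exactly one of them ends in the digit the precise pathway computed.
    """
    estimate = rough_estimate(a, b)
    digit = last_digit(a, b)
    for candidate in range(estimate - 5, estimate + 5):
        if candidate % 10 == digit:
            return candidate
    raise AssertionError("unreachable: 10 consecutive integers cover every last digit")


if __name__ == "__main__":
    print(add_via_parallel_pathways(36, 59))
```

Neither pathway alone gives the answer: the estimate is imprecise and the last digit carries no magnitude. The correct sum falls out only when both are combined, which is the sense in which the computation is described as parallel.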
