LLMs are not the black box you were promised
4 hours ago
- #mechanistic interpretability
- #circuit tracing
- #AI transparency
- Mechanistic interpretability has advanced significantly, making LLMs less of a 'black box'.
- Circuit tracing trains a second model to decompose activations into human-interpretable features like 'Texas'.
- Models exhibit multi-step reasoning, such as activating features for 'Dallas', 'Texas', then 'Austin' in sequence.
- This pseudo-symbolic inference resembles higher reasoning and is not unique to LLMs, as seen in AlphaZero's chess concepts.
- Understanding internal algorithms can improve learning methods, like Claude 3.5 Haiku's parallel addition pathways.
- LLMs lack metacognitive insight, having a 'subconscious' that allows external analysis of their processes.
- These insights aid in detecting misbehavior, steering behavior, and developing better algorithms.