Hasty Briefsbeta

Bilingual

LLMs are not the black box you were promised

6 hours ago
  • #mechanistic interpretability
  • #circuit tracing
  • #AI transparency
  • Mechanistic interpretability has advanced significantly, making LLMs less of a 'black box'.
  • Circuit tracing trains a second model to decompose activations into human-interpretable features like 'Texas'.
  • Models exhibit multi-step reasoning, such as activating features for 'Dallas', 'Texas', then 'Austin' in sequence.
  • This pseudo-symbolic inference resembles higher reasoning and is not unique to LLMs, as seen in AlphaZero's chess concepts.
  • Understanding internal algorithms can improve learning methods, like Claude 3.5 Haiku's parallel addition pathways.
  • LLMs lack metacognitive insight, having a 'subconscious' that allows external analysis of their processes.
  • These insights aid in detecting misbehavior, steering behavior, and developing better algorithms.