Hasty Briefs (beta)

Natural Language Autoencoders: Inside Claude's Activations

2 days ago
  • #model auditing
  • #AI interpretability
  • #natural language autoencoders
  • Anthropic published a method called Natural Language Autoencoders (NLAs) to translate Claude's internal activations into readable English.
  • NLAs consist of two models: an Activation Verbalizer (AV) that writes a paragraph from activations, and an Activation Reconstructor (AR) that tries to recover the original vector from the paragraph.
  • The method provides causal evidence through editing experiments, e.g., changing 'rabbit' to 'mouse' in a decoded explanation and observing a corresponding change in the model's output.
  • NLAs uncovered specific model behaviors, such as Claude fixating on the belief that a user's first language is Russian, and precomputing answers while ignoring tool outputs.
  • Training involves warm-starting with supervised fine-tuning on a proxy task, then reinforcement learning with a KL penalty to keep the output readable.
  • The cost is significant: training on Gemma-3-27B took 1.5 days on two 8xH100 nodes, and decoding long transcripts is impractical for production audits.
  • Three structural problems remain: the verbalizer's voice is inherited from the warm-start data, the auditing improvement is context-dependent, and the method stays honest only while the target model is frozen.
  • NLAs confabulate at a steady rate of verifiably false claims, so their output should be read for themes rather than specifics.
  • NLAs serve as a hypothesis-generation tool for audits, complementing other interpretability methods, and raise expectations for foundation-model audit deliverables.
  • The method changes audit accessibility, allowing non-experts to read decoded activations, but outputs require skepticism due to confabulation and inherited voice.
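The AV/AR pair and the rabbit-to-mouse editing experiment above can be illustrated with a toy sketch. Everything here is hypothetical (a bag-of-words "verbalizer" and "reconstructor" over a four-word vocabulary); the real components are full language models, but the round-trip and the edit test have the same shape:

```python
import numpy as np

# Toy vocabulary: each word corresponds to one axis of a tiny activation
# space. All names and values are illustrative, not Anthropic's setup.
VOCAB = ["rabbit", "mouse", "garden", "runs"]
WORD_TO_AXIS = {w: i for i, w in enumerate(VOCAB)}

def verbalize(activation: np.ndarray) -> str:
    """Stand-in Activation Verbalizer: describe the strongest components."""
    order = np.argsort(activation)[::-1]
    words = [VOCAB[i] for i in order if activation[i] > 0]
    return "The model is thinking about: " + ", ".join(words)

def reconstruct(paragraph: str) -> np.ndarray:
    """Stand-in Activation Reconstructor: bag-of-words back to a vector."""
    vec = np.zeros(len(VOCAB))
    for word in VOCAB:
        if word in paragraph:
            vec[WORD_TO_AXIS[word]] = 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

original = np.array([1.0, 0.0, 0.6, 0.3])  # 'rabbit'-dominated activation
paragraph = verbalize(original)
recon = reconstruct(paragraph)

# Editing experiment: swap 'rabbit' for 'mouse' in the explanation and
# re-encode; the edit should show up in the reconstructed vector.
edited = reconstruct(paragraph.replace("rabbit", "mouse"))

print(cosine(recon, original))        # high: the paragraph kept the signal
print(edited[WORD_TO_AXIS["mouse"]])  # 1.0: the edit reaches the vector
print(edited[WORD_TO_AXIS["rabbit"]]) # 0.0: the original concept is gone
```

In the real method, the edited vector would then be injected back into the model to check that its behavior changes accordingly, which is what makes the evidence causal rather than correlational.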
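The KL-penalized RL step mentioned above can be sketched numerically. This is a generic RLHF-style objective, not Anthropic's published training code, and all log-probabilities below are made up: the reward for a verbalized paragraph is the task reward (how well the AR recovers the vector) minus a penalty for drifting away from the readable warm-start policy.

```python
def kl_penalized_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-sequence reward with a KL penalty toward the warm-start policy.

    The KL term is estimated on the sampled tokens as the sum of
    log p_policy(token) - log p_reference(token); beta trades off
    reconstruction quality against readability.
    """
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl

# A paragraph that reconstructs well (task_reward = 1.0) and stays close to
# the reference keeps most of its reward; one that drifts into unreadable
# "model-ese" accumulates a large log-ratio and loses it.
readable = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.1, -2.1], beta=0.1)
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-5.0, -6.0], beta=0.1)
print(readable)  # 0.98: tiny penalty
print(drifted)   # negative: drift wipes out the reconstruction reward
```

The penalty explains the "inherited voice" problem noted above: because the policy is anchored to the warm-start distribution, the verbalizer keeps sounding like its supervised training data.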