Natural Language Autoencoders: Inside Claude's Activations
- #model auditing
- #AI interpretability
- #natural language autoencoders
- Anthropic published a method called Natural Language Autoencoders (NLAs) to translate Claude's internal activations into readable English.
- NLAs pair two models: an Activation Verbalizer (AV) that writes a paragraph from an activation vector, and an Activation Reconstructor (AR) that tries to recover the original vector from that paragraph (a round-trip sketch follows this list).
- The paper offers causal evidence through editing experiments: change 'rabbit' to 'mouse' in a decoded explanation, map the edited text back through the AR, and the model's output shifts to match (an editing-loop sketch follows this list).
- NLAs surfaced concrete behaviors, such as Claude fixating on a user's first language being Russian, and precomputing answers while ignoring tool outputs.
- Training warm-starts the verbalizer with supervised fine-tuning on a proxy task, then applies reinforcement learning with a KL penalty to keep the decoded text readable (an objective sketch follows this list).
- The cost is significant: training on Gemma-3-27B took 1.5 days on two 8xH100 nodes, and decoding long transcripts is impractical for production audits.
- Three structural problems remain: the decoder's voice is inherited from the warm-start data, the auditing improvement is context-dependent, and the method is only honest while the target model stays frozen.
- NLAs confabulate at a roughly constant rate of verifiably false claims, which argues for reading decoded paragraphs for themes rather than specifics.
- NLAs work best as a hypothesis-generation tool for audits, complementing other interpretability methods, and they raise the bar for what foundation-model audit deliverables should include.
- The method widens audit accessibility, letting non-experts read decoded activations, but outputs still require skepticism given the confabulation and inherited voice.
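
A minimal sketch of the round trip, assuming reconstruction similarity in activation space is the training signal. The `Verbalizer` and `Reconstructor` bodies below are toy stand-ins (the real components are language models), and every name is a hypothetical illustration, not Anthropic's implementation:

```python
import torch
import torch.nn.functional as F

D_MODEL = 64  # toy activation width; real residual-stream dims are far larger

class Verbalizer:
    """AV stand-in: maps an activation vector to an English paragraph."""
    def __call__(self, activation: torch.Tensor) -> str:
        # A real AV is an LLM conditioned on the activation; stubbed here.
        return "the model is thinking about a rabbit in a garden"

class Reconstructor:
    """AR stand-in: maps a paragraph back to a vector in activation space."""
    def __init__(self, d_model: int):
        self.proj = torch.randn(1000, d_model)  # toy bag-of-words projection
    def __call__(self, text: str) -> torch.Tensor:
        ids = [hash(w) % 1000 for w in text.lower().split()]
        return self.proj[ids].mean(dim=0)

def reconstruction_score(original: torch.Tensor, recovered: torch.Tensor) -> float:
    """How much of the original vector survives the trip through English."""
    return F.cosine_similarity(original, recovered, dim=0).item()

av, ar = Verbalizer(), Reconstructor(D_MODEL)
activation = torch.randn(D_MODEL)   # e.g. a residual-stream vector
paragraph = av(activation)          # encode: vector -> English
recovered = ar(paragraph)           # decode: English -> vector
print(paragraph, reconstruction_score(activation, recovered))
```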
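The editing experiments can be pictured as a decode-edit-reencode-patch loop. Everything below is a hypothetical stand-in: the real loop would use the trained AV/AR and an activation-patching harness on the target model.

```python
import torch

def edit_and_patch(model_run, av, ar, activation, old: str, new: str):
    """Decode the activation, edit the English, re-encode, rerun the model."""
    paragraph = av(activation)             # read the activation out as text
    edited = paragraph.replace(old, new)   # intervene at the level of meaning
    patched = ar(edited)                   # map the edit back to vector space
    return model_run(patched)              # does the output change to match?

# Toy stubs so the sketch runs end to end:
av = lambda v: "the user asked about a rabbit"
ar = lambda t: torch.ones(8) * (2.0 if "mouse" in t else 1.0)
model_run = lambda v: "mouse" if v.mean() > 1.5 else "rabbit"

vec = torch.randn(8)
print(model_run(ar(av(vec))))                                      # baseline: "rabbit"
print(edit_and_patch(model_run, av, ar, vec, "rabbit", "mouse"))   # edited: "mouse"
```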
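Finally, a sketch of the training signal, assuming an RLHF-style reward: reconstruction similarity minus a per-token KL penalty against the frozen warm-start policy. The `beta` value, function name, and sample-based KL estimator are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def nla_reward(orig_act, recon_act, policy_logprobs, ref_logprobs, beta=0.05):
    """Reward for one sampled paragraph (hypothetical formulation).

    orig_act / recon_act: activation before verbalizing, after reconstruction.
    policy_logprobs / ref_logprobs: log-probs of the sampled tokens under the
    current verbalizer and under the frozen SFT warm-start model.
    """
    fidelity = F.cosine_similarity(orig_act, recon_act, dim=0)  # reconstruction term
    kl = (policy_logprobs - ref_logprobs).sum()   # sample estimate of KL drift
    return fidelity - beta * kl                   # penalty keeps the prose readable

# Toy rollout standing in for a real sampled paragraph:
r = nla_reward(torch.randn(16), torch.randn(16),
               torch.tensor([-1.2, -0.8]), torch.tensor([-1.5, -0.9]))
print(float(r))
```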