Natural Language Autoencoders: Inside Claude's Activations
- #model auditing
- #AI interpretability
- #natural language autoencoders
- Anthropic published a method called Natural Language Autoencoders (NLAs) to translate Claude's internal activations into readable English.
- NLAs pair two models: an Activation Verbalizer (AV) that writes a paragraph from an activation vector, and an Activation Reconstructor (AR) that tries to recover the original vector from that paragraph (a round-trip sketch follows this list).
- The paper offers causal evidence through editing experiments: change 'rabbit' to 'mouse' in a decoded explanation, map the edited text back through the AR, and the model's output shifts to match (an editing-loop sketch follows this list).
- NLAs surfaced concrete behaviors, such as Claude fixating on a user's first language being Russian, and precomputing answers while ignoring tool outputs.
- Training warm-starts the verbalizer with supervised fine-tuning on a proxy task, then applies reinforcement learning with a KL penalty to keep the decoded text readable (an objective sketch follows this list).
- The cost is significant: training on Gemma-3-27B took 1.5 days on two 8xH100 nodes, and decoding long transcripts is impractical for production audits.
- Three structural problems remain: the decoder's voice is inherited from the warm-start data, the auditing improvement is context-dependent, and the method is only honest while the target model stays frozen.
- NLAs confabulate at a roughly constant rate of verifiably false claims, which argues for reading decoded paragraphs for themes rather than specifics.
- NLAs work best as a hypothesis-generation tool for audits, complementing other interpretability methods, and they raise the bar for what foundation-model audit deliverables should include.
- The method widens audit accessibility, letting non-experts read decoded activations, but outputs still require skepticism given the confabulation and inherited voice.
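
A minimal sketch of the round trip, assuming reconstruction similarity in activation space is the training signal. The `Verbalizer` and `Reconstructor` bodies below are toy stand-ins (the real components are language models), and every name is a hypothetical illustration, not Anthropic's implementation:

```python
import torch
import torch.nn.functional as F

D_MODEL = 64  # toy activation width; real residual-stream dims are far larger

class Verbalizer:
    """AV stand-in: maps an activation vector to an English paragraph."""
    def __call__(self, activation: torch.Tensor) -> str:
        # A real AV is an LLM conditioned on the activation; stubbed here.
        return "the model is thinking about a rabbit in a garden"

class Reconstructor:
    """AR stand-in: maps a paragraph back to a vector in activation space."""
    def __init__(self, d_model: int):
        self.proj = torch.randn(1000, d_model)  # toy bag-of-words projection
    def __call__(self, text: str) -> torch.Tensor:
        ids = [hash(w) % 1000 for w in text.lower().split()]
        return self.proj[ids].mean(dim=0)

def reconstruction_score(original: torch.Tensor, recovered: torch.Tensor) -> float:
    """How much of the original vector survives the trip through English."""
    return F.cosine_similarity(original, recovered, dim=0).item()

av, ar = Verbalizer(), Reconstructor(D_MODEL)
activation = torch.randn(D_MODEL)   # e.g. a residual-stream vector
paragraph = av(activation)          # encode: vector -> English
recovered = ar(paragraph)           # decode: English -> vector
print(paragraph, reconstruction_score(activation, recovered))
```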
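The editing experiments can be pictured as a decode-edit-reencode-patch loop. Everything below is a hypothetical stand-in: the real loop would use the trained AV/AR and an activation-patching harness on the target model.

```python
import torch

def edit_and_patch(model_run, av, ar, activation, old: str, new: str):
    """Decode the activation, edit the English, re-encode, rerun the model."""
    paragraph = av(activation)             # read the activation out as text
    edited = paragraph.replace(old, new)   # intervene at the level of meaning
    patched = ar(edited)                   # map the edit back to vector space
    return model_run(patched)              # does the output change to match?

# Toy stubs so the sketch runs end to end:
av = lambda v: "the user asked about a rabbit"
ar = lambda t: torch.ones(8) * (2.0 if "mouse" in t else 1.0)
model_run = lambda v: "mouse" if v.mean() > 1.5 else "rabbit"

vec = torch.randn(8)
print(model_run(ar(av(vec))))                                      # baseline: "rabbit"
print(edit_and_patch(model_run, av, ar, vec, "rabbit", "mouse"))   # edited: "mouse"
```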
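Finally, a sketch of the training signal, assuming an RLHF-style reward: reconstruction similarity minus a per-token KL penalty against the frozen warm-start policy. The `beta` value, function name, and sample-based KL estimator are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def nla_reward(orig_act, recon_act, policy_logprobs, ref_logprobs, beta=0.05):
    """Reward for one sampled paragraph (hypothetical formulation).

    orig_act / recon_act: activation before verbalizing, after reconstruction.
    policy_logprobs / ref_logprobs: log-probs of the sampled tokens under the
    current verbalizer and under the frozen SFT warm-start model.
    """
    fidelity = F.cosine_similarity(orig_act, recon_act, dim=0)  # reconstruction term
    kl = (policy_logprobs - ref_logprobs).sum()   # sample estimate of KL drift
    return fidelity - beta * kl                   # penalty keeps the prose readable

# Toy rollout standing in for a real sampled paragraph:
r = nla_reward(torch.randn(16), torch.randn(16),
               torch.tensor([-1.2, -0.8]), torch.tensor([-1.5, -0.9]))
print(float(r))
```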