Natural Language Autoencoders: Turning Claude's Thoughts into Text
- #AI interpretability
- #neural networks
- #model safety
- Natural Language Autoencoders (NLAs) convert an AI model's internal activations into readable text, allowing researchers to inspect the model's internal "thoughts" in plain language.
- NLAs use three copies of the model: a target model whose activations are captured, an activation verbalizer that translates those activations into a natural-language explanation, and an activation reconstructor that tries to regenerate the original activations from the explanation alone, so reconstruction error measures how faithful the explanation is (a toy sketch follows this list).
- Applied to Claude models, NLAs revealed unverbalized evaluation awareness during safety testing and hidden motivations in misaligned models.
- Limitations include factual hallucinations in explanations and high computational cost, but improvements are being pursued.
- Anthropic released code and interactive demos for NLAs to support further research and practical experimentation.
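Below is a minimal, self-contained sketch of the autoencoder structure described in the second bullet. The module names, toy sizes, and the straight-through Gumbel-softmax text bottleneck are assumptions made for illustration, not Anthropic's released implementation: a verbalizer encodes a captured activation into a short token sequence, a reconstructor decodes that sequence back into activation space, and the reconstruction error provides the training signal.

```python
# Toy sketch of the NLA idea: verbalize an activation into a token sequence,
# then reconstruct the activation from those tokens and train on the
# reconstruction error. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_ACT, VOCAB, EXPL_LEN, D_EMB = 512, 1000, 16, 128  # assumed toy dimensions

class Verbalizer(nn.Module):
    """Maps one activation vector to a sequence of explanation tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, EXPL_LEN * VOCAB)

    def forward(self, act):                      # act: (batch, D_ACT)
        logits = self.proj(act).view(-1, EXPL_LEN, VOCAB)
        # Discrete token choice with a straight-through estimator so the
        # text bottleneck stays differentiable during training.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)  # (batch, EXPL_LEN, VOCAB) one-hot

class Reconstructor(nn.Module):
    """Reads the explanation tokens and predicts the original activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, D_EMB)     # one-hot tokens -> embeddings
        self.readout = nn.Linear(EXPL_LEN * D_EMB, D_ACT)

    def forward(self, tokens):                   # tokens: (batch, EXPL_LEN, VOCAB)
        h = self.embed(tokens).flatten(1)
        return self.readout(h)                   # (batch, D_ACT)

verbalizer, reconstructor = Verbalizer(), Reconstructor()
opt = torch.optim.Adam(
    [*verbalizer.parameters(), *reconstructor.parameters()], lr=1e-3
)

# Stand-in for activations captured from the frozen target model.
target_activations = torch.randn(32, D_ACT)

for step in range(100):
    explanation = verbalizer(target_activations)          # "text" bottleneck
    reconstruction = reconstructor(explanation)            # back to activation space
    loss = F.mse_loss(reconstruction, target_activations)  # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final reconstruction loss:", loss.item())
```

In the real system the verbalizer and reconstructor are language-model copies and the explanation is actual readable text; this toy keeps only the autoencoder shape, where low reconstruction error indicates the explanation retains the information in the original activation.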