Natural Language Autoencoders: Turning Claude's Thoughts into Text
- #AI interpretability
- #neural networks
- #model safety
- Natural Language Autoencoders (NLAs) convert an AI model's internal activations into readable text, allowing researchers to inspect the model's internal "thoughts" in plain language.
- NLAs use three copies of the model: a target model whose activations are captured, an activation verbalizer that translates those activations into a natural-language explanation, and an activation reconstructor that tries to regenerate the original activations from the explanation alone, so reconstruction error measures how faithful the explanation is (a toy sketch follows this list).
- Applied to Claude models, NLAs revealed unverbalized evaluation awareness during safety testing and hidden motivations in misaligned models.
- Limitations include factual hallucinations in explanations and high computational cost, but improvements are being pursued.
- Anthropic released code and interactive demos for NLAs to support further research and practical experimentation.
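Below is a minimal, self-contained sketch of the autoencoder structure described in the second bullet. The module names, toy sizes, and the straight-through Gumbel-softmax text bottleneck are assumptions made for illustration, not Anthropic's released implementation: a verbalizer encodes a captured activation into a short token sequence, a reconstructor decodes that sequence back into activation space, and the reconstruction error provides the training signal.

```python
# Toy sketch of the NLA idea: verbalize an activation into a token sequence,
# then reconstruct the activation from those tokens and train on the
# reconstruction error. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_ACT, VOCAB, EXPL_LEN, D_EMB = 512, 1000, 16, 128  # assumed toy dimensions

class Verbalizer(nn.Module):
    """Maps one activation vector to a sequence of explanation tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, EXPL_LEN * VOCAB)

    def forward(self, act):                      # act: (batch, D_ACT)
        logits = self.proj(act).view(-1, EXPL_LEN, VOCAB)
        # Discrete token choice with a straight-through estimator so the
        # text bottleneck stays differentiable during training.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)  # (batch, EXPL_LEN, VOCAB) one-hot

class Reconstructor(nn.Module):
    """Reads the explanation tokens and predicts the original activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, D_EMB)     # one-hot tokens -> embeddings
        self.readout = nn.Linear(EXPL_LEN * D_EMB, D_ACT)

    def forward(self, tokens):                   # tokens: (batch, EXPL_LEN, VOCAB)
        h = self.embed(tokens).flatten(1)
        return self.readout(h)                   # (batch, D_ACT)

verbalizer, reconstructor = Verbalizer(), Reconstructor()
opt = torch.optim.Adam(
    [*verbalizer.parameters(), *reconstructor.parameters()], lr=1e-3
)

# Stand-in for activations captured from the frozen target model.
target_activations = torch.randn(32, D_ACT)

for step in range(100):
    explanation = verbalizer(target_activations)          # "text" bottleneck
    reconstruction = reconstructor(explanation)            # back to activation space
    loss = F.mse_loss(reconstruction, target_activations)  # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final reconstruction loss:", loss.item())
```

In the real system the verbalizer and reconstructor are language-model copies and the explanation is actual readable text; this toy keeps only the autoencoder shape, where low reconstruction error indicates the explanation retains the information in the original activation.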