A 20-Year-Old Algorithm Can Help Us Understand Transformer Embeddings
- #LLM-interpretability
- #dictionary-learning
- #KSVD-algorithm
- Understanding internal states of LLMs by decomposing embeddings into interpretable concept vectors.
- Dictionary learning as a method for breaking dense embeddings down into sparse combinations of simpler, interpretable elements (see the first sketch after this list).
- The superposition hypothesis, introduced by Elhage et al. in 2022, which motivates the decomposition approach: models can encode more features than they have dimensions by representing them as sparse linear combinations of directions.
- Use of sparse autoencoders (SAEs) by Bricken et al. in 2023 to scale dictionary learning to large datasets of activations (a minimal SAE sketch follows below).
- Revival of the 20-year-old KSVD algorithm with modifications that yield significant speed improvements (DB-KSVD); the classical KSVD loop is sketched after this list.
- Competitive performance of DB-KSVD against SAEs on the SAEBench benchmark across multiple metrics.
- Discussion of the feasibility of dictionary learning, including factors such as dataset size and embedding dimensionality.
- Potential applications of dictionary learning beyond language models, such as in robotics and vision tasks.
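
To make the decomposition idea concrete, here is a minimal sketch of dictionary learning over embeddings using scikit-learn's generic dictionary learner. The choice of library, array shapes, and hyperparameters are illustrative assumptions rather than the post's actual setup: each embedding is approximated as a sparse combination of learned dictionary rows, which play the role of concept vectors.

```python
# Illustrative sketch: sparse decomposition of embeddings into "concept" directions.
# Library choice, shapes, and hyperparameters are assumptions for demonstration only.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64))  # stand-in for transformer embeddings (n_samples, d_model)

learner = MiniBatchDictionaryLearning(
    n_components=256,             # overcomplete dictionary: more "concepts" than dimensions
    transform_algorithm="omp",    # sparse coding via orthogonal matching pursuit
    transform_n_nonzero_coefs=8,  # each embedding uses at most 8 concepts
    random_state=0,
)
codes = learner.fit_transform(embeddings)  # (n_samples, n_components), mostly zeros
dictionary = learner.components_           # (n_components, d_model), rows = concept vectors

reconstruction = codes @ dictionary
error = np.linalg.norm(embeddings - reconstruction) / np.linalg.norm(embeddings)
print(f"avg non-zeros per embedding: {(codes != 0).sum(1).mean():.1f}, relative error: {error:.3f}")
```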
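
The SAE flavor of dictionary learning referenced above trains an autoencoder to reconstruct embeddings under a sparsity penalty, so the decoder weights act as the dictionary. The sketch below is a bare-bones PyTorch version with placeholder widths, penalty weight, and training loop; it is not the published recipe from Bricken et al.

```python
# Illustrative sparse autoencoder (SAE) for dictionary learning on embeddings.
# Widths, penalty weight, and training details are assumptions, not the exact
# configuration from Bricken et al. (2023).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # embedding -> feature activations
        self.decoder = nn.Linear(n_features, d_model)  # decoder columns act as dictionary atoms

    def forward(self, x):
        acts = torch.relu(self.encoder(x))   # feature activations, pushed toward sparsity by the L1 term
        recon = self.decoder(acts)
        return recon, acts

d_model, n_features, l1_coef = 64, 512, 1e-3
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

embeddings = torch.randn(4096, d_model)      # stand-in for a batch of LLM embeddings
for step in range(100):
    recon, acts = sae(embeddings)
    loss = ((recon - embeddings) ** 2).mean() + l1_coef * acts.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

dictionary = sae.decoder.weight.detach().T   # (n_features, d_model): one learned direction per row
```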
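
For context on what DB-KSVD speeds up: the classical KSVD algorithm alternates between a sparse coding step (dictionary fixed) and a per-atom rank-1 SVD update (sparse supports fixed). The NumPy sketch below implements only that textbook loop; the batching and other modifications that distinguish DB-KSVD are not shown here.

```python
# Textbook K-SVD iteration in NumPy: alternate sparse coding and per-atom SVD updates.
# Classical algorithm only; the DB-KSVD speed modifications are not included.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, n_atoms, sparsity, n_iter=10, seed=0):
    """Y: (d, n_signals) data matrix. Returns dictionary D (d, n_atoms) and codes X (n_atoms, n_signals)."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)            # unit-norm atoms

    for _ in range(n_iter):
        # Sparse coding step: code every signal with at most `sparsity` atoms.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=sparsity)     # (n_atoms, n_signals)

        # Dictionary update step: refit one atom at a time via a rank-1 SVD.
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])                      # signals that actually use atom k
            if users.size == 0:
                continue
            # Residual of those signals with atom k's own contribution added back.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                                 # best rank-1 fit: new atom direction
            X[k, users] = S[0] * Vt[0]                        # and the matching coefficients
    return D, X

# Toy usage on random data standing in for embeddings (arranged as d x n_signals).
Y = np.random.default_rng(1).standard_normal((64, 500))
D, X = ksvd(Y, n_atoms=128, sparsity=8, n_iter=5)
print("relative error:", np.linalg.norm(Y - D @ X) / np.linalg.norm(Y))
```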