A 20-Year-Old Algorithm Can Help Us Understand Transformer Embeddings
- #LLM-interpretability
- #dictionary-learning
- #KSVD-algorithm
- Understanding internal states of LLMs by decomposing embeddings into interpretable concept vectors.
- Dictionary learning as a method for breaking dense embeddings down into sparse combinations of simpler, interpretable elements (see the first sketch after this list).
- The superposition hypothesis, introduced by Elhage et al. in 2022, which motivates the decomposition approach: models can encode more features than they have dimensions by representing them as sparse linear combinations of directions.
- Use of sparse autoencoders (SAEs) by Bricken et al. in 2023 to scale dictionary learning to large datasets of activations (a minimal SAE sketch follows below).
- Revival of the 20-year-old KSVD algorithm with modifications that yield significant speed improvements (DB-KSVD); the classical KSVD loop is sketched after this list.
- Competitive performance of DB-KSVD against SAEs on the SAEBench benchmark across multiple metrics.
- Discussion of the feasibility of dictionary learning, including factors such as dataset size and embedding dimensionality.
- Potential applications of dictionary learning beyond language models, such as in robotics and vision tasks.
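
To make the decomposition idea concrete, here is a minimal sketch of dictionary learning over embeddings using scikit-learn's generic dictionary learner. The choice of library, array shapes, and hyperparameters are illustrative assumptions rather than the post's actual setup: each embedding is approximated as a sparse combination of learned dictionary rows, which play the role of concept vectors.

```python
# Illustrative sketch: sparse decomposition of embeddings into "concept" directions.
# Library choice, shapes, and hyperparameters are assumptions for demonstration only.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64))  # stand-in for transformer embeddings (n_samples, d_model)

learner = MiniBatchDictionaryLearning(
    n_components=256,             # overcomplete dictionary: more "concepts" than dimensions
    transform_algorithm="omp",    # sparse coding via orthogonal matching pursuit
    transform_n_nonzero_coefs=8,  # each embedding uses at most 8 concepts
    random_state=0,
)
codes = learner.fit_transform(embeddings)  # (n_samples, n_components), mostly zeros
dictionary = learner.components_           # (n_components, d_model), rows = concept vectors

reconstruction = codes @ dictionary
error = np.linalg.norm(embeddings - reconstruction) / np.linalg.norm(embeddings)
print(f"avg non-zeros per embedding: {(codes != 0).sum(1).mean():.1f}, relative error: {error:.3f}")
```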
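
The SAE flavor of dictionary learning referenced above trains an autoencoder to reconstruct embeddings under a sparsity penalty, so the decoder weights act as the dictionary. The sketch below is a bare-bones PyTorch version with placeholder widths, penalty weight, and training loop; it is not the published recipe from Bricken et al.

```python
# Illustrative sparse autoencoder (SAE) for dictionary learning on embeddings.
# Widths, penalty weight, and training details are assumptions, not the exact
# configuration from Bricken et al. (2023).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # embedding -> feature activations
        self.decoder = nn.Linear(n_features, d_model)  # decoder columns act as dictionary atoms

    def forward(self, x):
        acts = torch.relu(self.encoder(x))   # feature activations, pushed toward sparsity by the L1 term
        recon = self.decoder(acts)
        return recon, acts

d_model, n_features, l1_coef = 64, 512, 1e-3
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

embeddings = torch.randn(4096, d_model)      # stand-in for a batch of LLM embeddings
for step in range(100):
    recon, acts = sae(embeddings)
    loss = ((recon - embeddings) ** 2).mean() + l1_coef * acts.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

dictionary = sae.decoder.weight.detach().T   # (n_features, d_model): one learned direction per row
```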
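
For context on what DB-KSVD speeds up: the classical KSVD algorithm alternates between a sparse coding step (dictionary fixed) and a per-atom rank-1 SVD update (sparse supports fixed). The NumPy sketch below implements only that textbook loop; the batching and other modifications that distinguish DB-KSVD are not shown here.

```python
# Textbook K-SVD iteration in NumPy: alternate sparse coding and per-atom SVD updates.
# Classical algorithm only; the DB-KSVD speed modifications are not included.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, n_atoms, sparsity, n_iter=10, seed=0):
    """Y: (d, n_signals) data matrix. Returns dictionary D (d, n_atoms) and codes X (n_atoms, n_signals)."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)            # unit-norm atoms

    for _ in range(n_iter):
        # Sparse coding step: code every signal with at most `sparsity` atoms.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=sparsity)     # (n_atoms, n_signals)

        # Dictionary update step: refit one atom at a time via a rank-1 SVD.
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])                      # signals that actually use atom k
            if users.size == 0:
                continue
            # Residual of those signals with atom k's own contribution added back.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                                 # best rank-1 fit: new atom direction
            X[k, users] = S[0] * Vt[0]                        # and the matching coefficients
    return D, X

# Toy usage on random data standing in for embeddings (arranged as d x n_signals).
Y = np.random.default_rng(1).standard_normal((64, 500))
D, X = ksvd(Y, n_atoms=128, sparsity=8, n_iter=5)
print("relative error:", np.linalg.norm(Y - D @ X) / np.linalg.norm(Y))
```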