Understanding Transformers Using a Minimal Example
- #Visualization
- #LLM
- #Transformer
- The article visualizes Transformer LLM internals using a simplified model and minimal dataset.
- A minimal dataset of fruits and tastes is used to train the model.
- Tokenization is simplified with a regex-based approach, producing a small vocabulary of 19 tokens (a tokenizer sketch follows after this list).
- The model architecture is drastically scaled down to 2 layers, 2 attention heads, and 20-dimensional embeddings (see the model sketch below the list).
- Training reaches a low loss, and the model correctly predicts 'chili' as the next token for the validation input (a minimal training loop is sketched below).
- Token embeddings are visualized as stacks of boxes, showing unique and shared features.
- The article highlights how the attention mechanism refines each token's representation using context from the other tokens in the sequence.
- The final token's representation evolves to resemble the predicted next token's embedding (see the similarity check sketched below).
- The article concludes that the stripped-down setup makes core Transformer mechanisms easier to observe and understand.
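
The regex-based tokenization can be pictured with a short sketch. The corpus below is a hypothetical stand-in for the article's fruit/taste sentences (the real data and the exact regex are not reproduced here); the article reports that the resulting vocabulary has 19 tokens.

```python
import re

# Hypothetical miniature corpus in the spirit of the article's fruit/taste data.
corpus = [
    "lemon tastes sour",
    "chili tastes spicy",
    "apple tastes sweet",
]

def tokenize(text: str) -> list[str]:
    # Split lowercase words and standalone punctuation; a simplification of the
    # article's regex-based tokenizer.
    return re.findall(r"[a-z]+|[^\sa-z]", text.lower())

# Build the tiny vocabulary and a token-to-id mapping from the corpus.
vocab = sorted({tok for line in corpus for tok in tokenize(line)})
stoi = {tok: i for i, tok in enumerate(vocab)}

print(tokenize("chili tastes spicy"))  # ['chili', 'tastes', 'spicy']
print(len(vocab), vocab)
```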
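The scaled-down architecture can be expressed as a minimal decoder-only language model, implemented here with PyTorch encoder layers plus a causal mask. The layer count, head count, and embedding width match the figures in the summary; everything else (feed-forward width, learned positional embeddings, context length) is an assumption rather than the article's exact configuration.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS, MAX_LEN = 19, 20, 2, 2, 16

class TinyTransformer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)   # 20-dim token embeddings
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)      # learned positions (assumed)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS,
            dim_feedforward=4 * D_MODEL, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)      # next-token logits

    def forward(self, idx: torch.Tensor) -> torch.Tensor:  # idx: (batch, seq)
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position may only attend to itself and earlier tokens.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=idx.device),
            diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)

model = TinyTransformer()
print(model(torch.randint(0, VOCAB_SIZE, (1, 5))).shape)  # torch.Size([1, 5, 19])
```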
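A plain next-token training loop over the toy corpus is enough at this scale. The optimizer, learning rate, and step count below are assumptions; the article only reports that training reaches a low loss.

```python
import torch
import torch.nn.functional as F

# Encode each sentence as token ids (uses `tokenize`, `stoi`, and `model`
# from the sketches above).
data = [torch.tensor([stoi[t] for t in tokenize(line)]) for line in corpus]
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(500):
    total = 0.0
    for seq in data:
        # Shift by one position: predict token t+1 from tokens up to t.
        inp, target = seq[:-1].unsqueeze(0), seq[1:].unsqueeze(0)
        logits = model(inp)                                   # (1, T-1, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        total += loss.item()
    if step % 100 == 0:
        print(f"step {step}: mean loss {total / len(data):.3f}")
```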
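The claim about the final token's representation can be checked directly: run a prompt through the embedding and attention blocks, take the last position's hidden state, and compare it against every row of the token-embedding matrix. The prompt below is a hypothetical one drawn from the toy corpus; the article's own validation input is not reproduced here, and it reports 'chili' as the predicted token for that input.

```python
import torch
import torch.nn.functional as F

prompt = "chili tastes"                     # hypothetical prompt, not the article's
idx = torch.tensor([[stoi[t] for t in tokenize(prompt)]])

with torch.no_grad():
    # Mirror the model's forward pass up to (but not including) the output head.
    seq_len = idx.size(1)
    x = model.tok_emb(idx) + model.pos_emb(torch.arange(seq_len))
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    hidden = model.blocks(x, mask=mask)
    final = hidden[0, -1]                   # representation of the last position

    # Greedy next-token prediction from the logits.
    next_id = model.lm_head(hidden)[0, -1].argmax().item()
    print("predicted next token:", vocab[next_id])

    # Cosine similarity between the final hidden state and each token embedding;
    # the article's visualizations suggest it is highest for the predicted token.
    sims = F.cosine_similarity(final.unsqueeze(0), model.tok_emb.weight, dim=-1)
    print({tok: round(sims[i].item(), 2) for tok, i in stoi.items()})
```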