How LLMs Work
2 days ago
- #AI Explainability
- #Transformer Mechanisms
- #LLM Architecture
- Tokenization converts text into tokens: subword pieces that are each represented by an integer ID.
- Embeddings give meaning to tokens by mapping token IDs to learned vectors in a high-dimensional space.
- Positional encoding, such as Rotary Position Embeddings (RoPE), provides order information; RoPE rotates the query and key vectors by angles that depend on each token's position.
- Attention mechanisms allow tokens to interact by computing similarity scores between queries and keys, then using those scores to weight the values and pull in relevant information.
- Multi-head attention runs multiple attention computations in parallel, and individual heads can learn to specialize in different linguistic relationships.
- Feed-forward networks process each token independently with non-linear transformations and are thought to store much of the model's factual knowledge.
- Residual connections and layer normalization (e.g., RMSNorm) stabilize training in deep networks by allowing gradient flow and controlling vector scales.
- Next-token prediction generates text by projecting the final position's vector into logits over the vocabulary, scaling them by decoding settings such as temperature, applying softmax, and sampling from the resulting distribution.
- Model differences arise from trained weights, configurations (e.g., number of layers, MoE), and post-training techniques like instruction tuning.
- Modern LLMs share a transformer-based architecture, while inference-time techniques such as speculative decoding make generation more efficient.
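
The positional-encoding, attention, and sampling steps summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any production model implements them: the function names (`softmax`, `rope_rotate`, `causal_attention`, `sample_next_token`) are hypothetical, the RoPE base of 10000 and the adjacent-pair rotation layout are assumptions (real implementations differ in how they pair dimensions), and the causal mask is assumed because decoder-style LLMs generate text left to right.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle.

    Assumes an even-length vector. `base` is an assumed hyperparameter.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Position i may only attend to positions 0..i (assumed causal mask).
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        mixed = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


def sample_next_token(logits, temperature=1.0, rng=random):
    """Scale logits by temperature, apply softmax, and sample one token ID."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Two properties are worth checking by hand with this sketch. The dot product of a rotated query and a rotated key depends only on the difference between their positions, which is what lets RoPE encode relative order. As the temperature approaches zero, `sample_next_token` behaves more and more like picking the highest-scoring token.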