Self-Attention Solved the Sequential Bottleneck
4 hours ago
- #AI Architecture
- #LLM Internals
- #Transformer Models
- LLMs have largely moved away from recurrent neural networks (RNNs) because RNNs process tokens one at a time (a sequential bottleneck) and lose information over long ranges.
- Transformers, introduced in the 2017 paper 'Attention Is All You Need', replaced recurrence with self-attention: every token attends directly to every other token, so a sequence can be processed in parallel and distant information does not have to survive many recurrent steps.
- Decoder-only transformers (like GPT) gained popularity over encoder-only and encoder-decoder models because next-token prediction trains on raw, unlabeled text and serves as one universal objective: tasks such as translation and summarization can all be framed as text continuation.
- Tokenization, typically using Byte-Pair Encoding (BPE), converts text into integer token IDs, but it causes trouble with spelling and arithmetic and creates multilingual inequality (the same content often costs more tokens in some languages than in English).
- Embeddings map token IDs to vectors learned during training, capturing semantic relationships, while positional encodings (like RoPE) inject order information.
- Attention mechanisms compute relationships between tokens using query, key, and value vectors, with multi-head attention allowing parallel specialization.
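- Other critical transformer components include add & norm layers (residual connections plus layer normalization) that keep gradients flowing through deep stacks, feed-forward networks that process each token position independently, and a final linear layer that produces output logits over the vocabulary.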
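- Advanced techniques optimize memory, computation, and model scalability: Grouped-Query Attention (GQA) shares key/value heads across query heads to shrink the KV cache, FlashAttention computes exact attention with far less memory traffic, and Mixture of Experts (MoE) grows parameter count while activating only a few experts per token.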
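
To make the query/key/value point concrete, here is a minimal sketch of scaled dot-product attention with a causal mask, the form used in decoder-only models. It is an illustration under stated assumptions, not a production implementation: the function names `softmax` and `causal_self_attention` are hypothetical, the toy vectors are made up, and the input is used directly as queries, keys, and values, whereas a real model would first apply learned linear projections. The explicit loop is only for readability; no row depends on any other row's result, which is why real implementations compute all rows at once as matrix multiplications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask.

    queries, keys: lists of n vectors of dimension d_k
    values:        list of n vectors
    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    n = len(queries)
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: token i may only look at positions 0..i.
        scores = [
            sum(qc * kc for qc, kc in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # The output is the attention-weighted sum of the visible values.
        out = [sum(w[j] * values[j][c] for j in range(i + 1)) for c in range(d_v)]
        # Pad masked (future) positions with zero weight.
        weights.append(w + [0.0] * (n - i - 1))
        outputs.append(out)
    return outputs, weights


# Toy input: three tokens with 2-dimensional vectors (made-up numbers).
# Assumption: x stands in for Q, K, and V; real models project x three ways.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

Each row of `weights` sums to 1, and the first token can only attend to itself, so its output equals its own value vector. Multi-head attention would run several such computations side by side on different projections and concatenate the results.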