Self-Attention Solved the Sequential Bottleneck

4 hours ago

LLMs have largely moved away from recurrent neural networks (RNNs), which process tokens one at a time (a sequential bottleneck) and lose information over long ranges.
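Transformers, introduced in the 2017 paper 'Attention Is All You Need', replaced recurrence with self-attention, which lets every token attend directly to every other token; this enables parallel processing across a sequence and avoids the step-by-step information decay of RNNs.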
Decoder-only transformers (like GPT) gained popularity over encoder-only or encoder-decoder models because their next-token-prediction objective needs no labeled data and is universal: tasks such as translation and summarization can all be framed as text generation.
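To illustrate why next-token prediction needs no labels, here is a minimal sketch of how one unlabeled token sequence yields its own training targets. The function name and the token IDs are hypothetical, chosen only for illustration.

```python
def next_token_examples(token_ids):
    """Turn one unlabeled token sequence into (context, target) pairs.

    Each target is simply the token that follows the context in the raw
    text, so no human annotation is involved.
    """
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


# Hypothetical token IDs standing in for a short piece of raw text.
ids = [12, 7, 99, 3]
for context, target in next_token_examples(ids):
    print(context, "->", target)
```

In practice all positions are trained in one forward pass using a causal mask; the explicit pairs above only make the objective visible.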
Tokenization, typically using Byte-Pair Encoding (BPE), converts text into integer token IDs, but subword splitting causes problems with spelling and arithmetic and creates multilingual inequality, since some languages need more tokens for the same text.
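The core of BPE training can be sketched as repeatedly merging the most frequent adjacent pair of symbols. This is a simplified, character-level toy under stated assumptions (a hypothetical word-count corpus, no byte-level handling, no end-of-word markers), not the tokenizer of any particular model.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} corpus."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(corpus, num_merges=2)
print(merges)
print(list(segmented))
```

Because frequent character runs collapse into single tokens, the model never sees the individual letters or digits inside them, which is one plausible reason spelling and arithmetic are hard.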
Embeddings map token IDs to vectors learned during training, capturing semantic relationships, while positional encodings (like RoPE) inject order information.
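As a rough illustration, the sketch below looks up embeddings in a tiny table and applies a rotary position encoding (RoPE) by rotating adjacent pairs of dimensions. The table values are made up, and the adjacent-pair layout and base of 10000 are assumptions; real implementations differ in how they pair dimensions and apply the rotation to queries and keys inside attention.

```python
import math


def embed(token_ids, table):
    """Look up one vector per token ID (in a real model the table is learned)."""
    return [table[t] for t in token_ids]


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 4-dimensional embedding table for a 2-token vocabulary.
table = {0: [1.0, 2.0, 3.0, 4.0], 1: [0.5, -1.0, 2.0, 0.0]}
q, k = embed([0, 1], table)

# The score depends only on the relative offset (2 in both cases).
print(dot(rope(q, 3), rope(k, 1)))
print(dot(rope(q, 10), rope(k, 8)))
```

The two printed scores agree up to floating-point error, which shows the property that makes RoPE attractive: attention scores encode relative rather than absolute position.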
Attention mechanisms compute relationships between tokens using query, key, and value vectors; multi-head attention runs several attention operations in parallel so that each head can specialize in a different kind of relationship.
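A single attention head can be sketched in plain Python as scaled dot-product attention: score each query against every key, normalize with softmax, and take a weighted sum of the values. The causal mask is an assumption matching decoder-only models, and the toy vectors are invented; real models first project the inputs with learned matrices and run many heads in parallel.

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf scores become weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention for one head.

    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy inputs: three positions, two dimensions each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix is computed independently of the others, which is what allows all positions to be processed in parallel rather than one step at a time.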
Other critical transformer components include add & norm layers (residual connections followed by layer normalization) for gradient flow, feed-forward networks that process each token position independently, and a final linear layer that produces output logits.
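The add & norm and feed-forward steps can be sketched for a single token vector as follows. This is a simplified illustration: it assumes the post-norm ordering, a ReLU activation, and a layer norm without learned gain and bias, and all weights are made-up numbers.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(weight_rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight_rows]


def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to one token."""
    hidden = [max(0.0, a + b) for a, b in zip(matvec(w1, x), b1)]
    return [a + b for a, b in zip(matvec(w2, hidden), b2)]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


# Hypothetical weights: model width 2, hidden width 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]

token = [1.0, -2.0]
print(add_and_norm(token, feed_forward(token, w1, b1, w2, b2)))
```

The residual path (`x + sublayer_out`) gives gradients a direct route around each sublayer, and the feed-forward network sees only one token at a time, so mixing between tokens happens only in attention.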
Advanced techniques like Grouped-Query Attention (GQA), FlashAttention, and Mixture of Experts (MoE) optimize memory, computation, and model scalability.
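To make the memory effect of Grouped-Query Attention concrete, the sketch below maps query heads to shared key/value heads and estimates key/value cache size. The model dimensions are hypothetical, and the size formula is a back-of-the-envelope estimate that assumes 2 bytes per stored value.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Return the index of the key/value head that a query head shares."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate cache size: keys and values for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads, head dimension 128, 4096 tokens.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(full / 1024**3, "GiB with one KV head per query head")
print(grouped / 1024**3, "GiB with 8 shared KV heads")
print([kv_head_for(h, 32, 8) for h in range(8)])
```

Sharing each key/value head among four query heads cuts the cache by the same factor of four in this toy configuration, while the number of query heads is unchanged.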
