Self-Attention Solved the Sequential Bottleneck

4 hours ago
  • #AI Architecture
  • #LLM Internals
  • #Transformer Models
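  • LLMs have largely moved away from recurrent neural networks (RNNs) for two reasons: RNNs process tokens one at a time, which prevents parallel training, and information from early tokens fades as it passes through a long chain of hidden states.
  • Transformers, introduced in the 2017 paper 'Attention Is All You Need', replaced recurrence with self-attention. All positions can be processed in parallel during training, and every token gets a direct path to every other token instead of relying on a decaying hidden state.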
  • Decoder-only transformers (like GPT) gained popularity over encoder-only or encoder-decoder models because next-token prediction needs no hand-labeled data for pretraining and acts as a single objective under which tasks such as translation and summarization can be framed as text continuation.
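  • Tokenization, typically using Byte-Pair Encoding (BPE), converts text into integer token IDs. Because it splits text into subword units, it makes character-level tasks such as spelling and arithmetic harder, and it tends to spend more tokens per word on languages that are underrepresented in the tokenizer's training data.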
  • Embeddings map token IDs to vectors learned during training, capturing semantic relationships, while positional encodings (like RoPE) inject order information.
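  • Attention mechanisms compute relationships between tokens using query, key, and value vectors: each token's query is scored against the keys of the tokens it may attend to, and the scores weight a sum of value vectors. Multi-head attention runs several such operations in parallel so that different heads can specialize in different relationships.
  • Other critical transformer components include add & norm layers (residual connections plus layer normalization) that keep gradients flowing through deep stacks, feed-forward networks applied to each token position independently, and a final linear layer that projects hidden states to logits over the vocabulary.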
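  • Advanced techniques target different costs: Grouped-Query Attention (GQA) shrinks the key-value cache by sharing key and value heads across groups of query heads, FlashAttention computes exact attention without materializing the full attention matrix in GPU memory, and Mixture of Experts (MoE) routes each token to a small subset of expert feed-forward networks so that parameter count can grow without a proportional increase in compute per token.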
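
The query, key, and value computation summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from any particular model: the function names `softmax` and `causal_self_attention` are hypothetical, the toy vectors are made up, and the same vectors are used as queries, keys, and values, whereas a real transformer would first apply separate learned linear projections. The causal mask (each position attends only to itself and earlier positions) reflects the decoder-only setup.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d_k = len(keys[0])
    seq_len = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the query against the current and earlier keys only (causal mask),
        # scaling by sqrt(d_k) as in scaled dot-product attention.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row shows the masked (future) positions explicitly.
        all_weights.append(weights + [0.0] * (seq_len - i - 1))
    return outputs, all_weights


# Toy input: three tokens with 2-dimensional vectors (illustrative values only).
# Assumption: identity projections, so Q = K = V = x.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution. The zeros in the upper-right of the matrix are the masked future positions, and every row is computed independently of the others, which is what allows the whole sequence to be processed in parallel rather than step by step as in an RNN.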