Hasty Briefs (beta)

Writing an LLM from scratch, part 13 – attention heads are dumb

a year ago
  • #deep-learning
  • #LLM
  • #attention-mechanism
  • The post discusses the 'why' of self-attention in LLMs, emphasizing that individual attention heads are simpler than initially thought.
  • Multi-head attention and layering combine to create complex representations, with each layer building upon the previous one.
  • Attention heads use simple pattern matching: input embeddings are projected into a shared space as query and key vectors, and their dot products determine the attention scores.
  • The example of 'the fat cat sat on the mat' illustrates how an attention head might focus on matching articles with nouns in a simplified embedding space.
  • The post highlights the elegance of 'dumb' attention heads, which perform basic operations that collectively enable sophisticated language understanding.
  • The author plans to explore context lengths in a future post, noting the benefits and potential downsides of a hidden state that grows with input sequence length.
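The pattern matching summarized above can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions and random weights — not the post's actual numbers or embedding space — showing how one "dumb" head projects embeddings into a shared query/key space, scores every pair of tokens by dot product, and normalizes the scores with a causal softmax:

```python
import numpy as np

# Toy setup (hypothetical values for illustration): 7 tokens from the
# post's example sentence, small embedding and head dimensions.
np.random.seed(0)
tokens = ["the", "fat", "cat", "sat", "on", "the", "mat"]
d_model, d_head = 8, 4

X = np.random.randn(len(tokens), d_model)   # input embeddings (random stand-ins)
W_q = np.random.randn(d_model, d_head)      # query projection
W_k = np.random.randn(d_model, d_head)      # key projection

# Project embeddings into the shared space, then compare all pairs.
Q = X @ W_q
K = X @ W_k
scores = Q @ K.T / np.sqrt(d_head)          # scaled dot-product match scores

# Causal mask: each token attends only to itself and earlier tokens.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax turns scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Each row of `weights` is one token's attention distribution over the tokens before it; a head tuned for article–noun matching would learn `W_q`/`W_k` so that, e.g., "cat" scores highly against the preceding "the".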