Writing an LLM from scratch, part 13 – attention heads are dumb
- #deep-learning
- #LLM
- #attention-mechanism
- The post discusses the 'why' of self-attention in LLMs, emphasizing that individual attention heads are simpler than initially thought.
- Multi-head attention and stacked layers combine to create complex representations, with each layer building on the one before (see the second sketch after this list).
- Attention heads do simple pattern matching: they project input embeddings into a shared space as query and key vectors, and the dot products between those vectors determine the attention scores.
- The example of 'the fat cat sat on the mat' illustrates how an attention head might focus on matching articles with their nouns in a simplified embedding space (see the first sketch after this list).
- The post highlights the elegance of 'dumb' attention heads, which perform basic operations that collectively enable sophisticated language understanding.
- The author plans to explore context lengths in a future post, noting the benefits and potential downsides of a hidden state that grows with input sequence length.
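A minimal NumPy sketch of the single-head mechanics summarised above, applied to the post's example sentence. All concrete values here are assumptions for illustration: the 4-dimensional embeddings, the 2-dimensional head size, and the random projections `W_q` and `W_k` are invented, not taken from the post.

```python
import numpy as np

# Toy sentence from the post; embeddings and dimensions are illustrative.
tokens = ["the", "fat", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
d_model, d_head = 4, 2

# One tiny embedding per unique word, looked up per token.
vocab = {w: rng.normal(size=d_model) for w in set(tokens)}
X = np.stack([vocab[w] for w in tokens])        # (7, d_model)

# A single head's query/key projections into a shared 2-d space.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

Q = X @ W_q                                      # (7, d_head)
K = X @ W_k                                      # (7, d_head)

# Scaled dot-product scores: how strongly each token attends to
# every other token, normalised row-wise with a softmax.
scores = Q @ K.T / np.sqrt(d_head)               # (7, 7)
scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

With trained (rather than random) projections, a head like this could learn the article-to-noun matching the post describes: 'the' would land near 'cat' and 'mat' in the shared query/key space, so those pairs would get high scores.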
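And a similarly hedged sketch of how several 'dumb' heads combine: each head gets its own projections and can learn a different simple pattern, the per-head outputs are concatenated, and an output projection mixes them. The dimensions, the `softmax` helper, and the random weights are again illustrative assumptions, not the post's code.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 7, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(X):
    # Each head gets its own query/key/value projections, so each
    # can specialise in one simple pattern to match.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d_head)) @ V   # (seq_len, d_head)

# Concatenate the per-head outputs and mix them with an output
# projection; stacking such layers is what builds up the complex
# representations the post describes.
W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate([head(X) for _ in range(n_heads)], axis=-1) @ W_o
print(out.shape)  # (7, 8)
```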