Writing an LLM from scratch, part 13 – attention heads are dumb
- #deep-learning
- #LLM
- #attention-mechanism
- The post discusses the 'why' of self-attention in LLMs, emphasizing that individual attention heads are simpler than initially thought.
- Multi-head attention and stacked layers combine to create complex representations, with each layer building on the one before (see the second sketch after this list).
- Attention heads do simple pattern matching: they project input embeddings into a shared space as query and key vectors, and the dot products between those vectors determine the attention scores.
- The example of 'the fat cat sat on the mat' illustrates how an attention head might focus on matching articles with their nouns in a simplified embedding space (see the first sketch after this list).
- The post highlights the elegance of 'dumb' attention heads, which perform basic operations that collectively enable sophisticated language understanding.
- The author plans to explore context lengths in a future post, noting the benefits and potential downsides of a hidden state that grows with input sequence length.
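A minimal NumPy sketch of the single-head mechanics summarised above, applied to the post's example sentence. All concrete values here are assumptions for illustration: the 4-dimensional embeddings, the 2-dimensional head size, and the random projections `W_q` and `W_k` are invented, not taken from the post.

```python
import numpy as np

# Toy sentence from the post; embeddings and dimensions are illustrative.
tokens = ["the", "fat", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
d_model, d_head = 4, 2

# One tiny embedding per unique word, looked up per token.
vocab = {w: rng.normal(size=d_model) for w in set(tokens)}
X = np.stack([vocab[w] for w in tokens])        # (7, d_model)

# A single head's query/key projections into a shared 2-d space.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

Q = X @ W_q                                      # (7, d_head)
K = X @ W_k                                      # (7, d_head)

# Scaled dot-product scores: how strongly each token attends to
# every other token, normalised row-wise with a softmax.
scores = Q @ K.T / np.sqrt(d_head)               # (7, 7)
scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

With trained (rather than random) projections, a head like this could learn the article-to-noun matching the post describes: 'the' would land near 'cat' and 'mat' in the shared query/key space, so those pairs would get high scores.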
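And a similarly hedged sketch of how several 'dumb' heads combine: each head gets its own projections and can learn a different simple pattern, the per-head outputs are concatenated, and an output projection mixes them. The dimensions, the `softmax` helper, and the random weights are again illustrative assumptions, not the post's code.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 7, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(X):
    # Each head gets its own query/key/value projections, so each
    # can specialise in one simple pattern to match.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d_head)) @ V   # (seq_len, d_head)

# Concatenate the per-head outputs and mix them with an output
# projection; stacking such layers is what builds up the complex
# representations the post describes.
W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate([head(X) for _ in range(n_heads)], axis=-1) @ W_o
print(out.shape)  # (7, 8)
```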