Inside The Transformer: The Life of a Token

7 hours ago

The post provides a detailed walkthrough of a modern dense transformer's forward pass, using Rnj 1.5 as an example, focusing on single-GPU training while ignoring backward passes and distributed systems.
Key transformer components are explained: RMSNorm for normalization, a GeGLU MLP for the non-linear transformation, multi-head attention (with grouped-query attention), YaRN for positional embeddings in long contexts, and the core attention mechanism.
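As a rough illustration of the first two components, here is a minimal pure-Python sketch of RMSNorm and a GeGLU MLP on a single token vector. The function names, the `eps` value, the use of exact (erf-based) GELU, and the plain `weight` scaling are assumptions made for illustration; the post's model may differ in these details (some models scale by `1 + weight`, for example).

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def gelu(v):
    """Exact GELU via the Gaussian error function (assumed variant)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def geglu_mlp(x, w_gate, w_up, w_down):
    """GeGLU MLP: down_proj(gelu(gate_proj(x)) * up_proj(x))."""
    gate = [gelu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)
```

The gate and up projections have the same output width, so their elementwise product is well defined; the down projection maps that hidden vector back to the model width.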
RoPE injects positional information by rotating pairs of query and key coordinates, which makes attention scores depend on relative position; YaRN modifies RoPE so that it extrapolates better to longer context lengths.
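The relative-position property of the pairwise rotation can be checked with a small sketch. This shows plain RoPE only; the YaRN frequency rescaling is not implemented here. The adjacent-pair layout `(i, i+1)` and `base=10000.0` are assumptions, and real implementations often pair the first half of the vector with the second half instead.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent coordinate pair by an angle proportional to position."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Both pairs of positions have the same offset (query 2 steps after key),
# so the rotated dot products should agree up to floating-point error.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 102), rope(k, 100))
```

Because both vectors are rotated by angles that grow linearly with position, their dot product depends only on the difference between the two positions, not on the positions themselves.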
Core attention uses masking to prevent cross-document attention and maintain causality, with block-local and global layers differing only in mask structure for efficient long-context handling.
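A minimal sketch of how such a mask could be built for a packed sequence follows. The `doc_ids` input and the `window` parameter are hypothetical names, and treating "block-local" as a sliding window over the most recent keys is an assumption; the post may define local blocks differently.

```python
def build_mask(doc_ids, window=None):
    """Return mask[q][k] == True where query position q may attend to key position k.

    doc_ids: document index of each token in a packed sequence.
    window:  None for a global layer; an int limits attention to the
             `window` most recent positions (assumed local-layer behavior).
    """
    n = len(doc_ids)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            allowed = k <= q and doc_ids[k] == doc_ids[q]  # causal, same document
            if window is not None:
                allowed = allowed and (q - k) < window  # local layers only
            row.append(allowed)
        mask.append(row)
    return mask

doc_ids = [0, 0, 0, 1, 1]  # two documents packed into one sequence
global_mask = build_mask(doc_ids)
local_mask = build_mask(doc_ids, window=2)
```

In this sketch the global and local layers share the same attention computation and differ only in the `window` argument used to build the mask.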
Transformer math covers the KV cache for efficient autoregressive inference, parameter count estimation (focusing on the MLP and attention matrices), and FLOPs-per-token calculations for cluster sizing, including the 6N formula and the conditions under which it applies.
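These estimates reduce to a few lines of arithmetic. The sketch below uses a made-up toy configuration, not the post's model, and the function names are hypothetical. It assumes a gated MLP with three weight matrices, 2 bytes per cached value, and a parameter count that ignores embeddings and norm weights. The 6N figure is the common training approximation (about 2N FLOPs per token for the forward pass and 4N for the backward pass) and leaves out the attention-score FLOPs that grow with sequence length.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values (factor of 2) cached per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def dense_param_count(n_layers, d_model, n_heads, n_kv_heads, head_dim, d_ff):
    """Matrix parameters only: attention projections plus gated MLP."""
    attn = d_model * n_heads * head_dim           # W_q
    attn += 2 * d_model * n_kv_heads * head_dim   # W_k and W_v (shared across query groups)
    attn += n_heads * head_dim * d_model          # W_o
    mlp = 3 * d_model * d_ff                      # gate, up, and down projections
    return n_layers * (attn + mlp)

def training_flops_per_token(n_params):
    """6N approximation: forward (2N) plus backward (4N)."""
    return 6 * n_params

# Toy configuration, chosen only to keep the numbers easy to check by hand.
n_params = dense_param_count(
    n_layers=2, d_model=8, n_heads=4, n_kv_heads=2, head_dim=2, d_ff=16
)
flops = training_flops_per_token(n_params)
cache = kv_cache_bytes(n_layers=2, n_kv_heads=2, head_dim=2, seq_len=10)
```

With grouped-query attention, `n_kv_heads` is smaller than `n_heads`, which shrinks both the key/value projection matrices and the KV cache by the same ratio.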
