Autoregressive next-token prediction and the KV cache in transformers
3 days ago
- #Autoregressive Models
- #Transformer Optimization
- #KV Cache
- Autoregressive language models generate text one token at a time; the prompt is first tokenized into integer IDs, typically with a BOS token prepended.
- During prefill, the whole prompt is processed in a single parallel forward pass, which produces the first predicted token and populates the KV cache.
- Each decoder block uses multi-head self-attention and MLP layers with residual connections to refine token embeddings.
- Attention involves computing Q, K, V matrices per head, applying causal masking, and combining outputs via concatenation and projection.
- The KV cache stores the key and value tensors of every token processed so far, per layer and per head, so decode steps process only the new token without recomputing past ones.
- Decode steps generate one token at a time by appending new K and V to the cache and performing attention against the full cached history.
- This optimization reduces the attention cost of each decode step from quadratic to linear in the sequence length, making long-context generation feasible in transformers.
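
The prefill and decode steps summarized above can be illustrated with a minimal sketch. It is not taken from the original post: it assumes a single attention head, tiny random weight matrices, and pure-Python lists instead of tensors, and every name in it (`KVCache`, `attend`, `causal_attention_full`, `demo`) is hypothetical. The "generated" token embeddings are fixed in advance rather than sampled from a model, so that the cached path can be compared with full recomputation.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query against the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def causal_attention_full(xs, Wq, Wk, Wv):
    """No cache: recompute Q, K, V for every position.

    The causal mask is expressed by letting position t see only keys 0..t.
    """
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(xs))]


class KVCache:
    """Hypothetical single-head cache holding one K and one V vector per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x, Wq, Wk, Wv):
        """Append K and V for the new token, then attend over the whole history."""
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)


def demo(seed=0, d=4, prompt_len=5, new_tokens=3):
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    # Assumed stand-ins for token embeddings (prompt followed by "generated" tokens).
    xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(prompt_len + new_tokens)]

    cache = KVCache()
    # Prefill: fill the cache from the prompt. A real implementation does this
    # in one batched pass; the loop here only keeps the sketch short.
    cached_out = [cache.step(x, Wq, Wk, Wv) for x in xs[:prompt_len]]
    # Decode: each step touches only the new token plus the cached K and V.
    for x in xs[prompt_len:]:
        cached_out.append(cache.step(x, Wq, Wk, Wv))

    full_out = causal_attention_full(xs, Wq, Wk, Wv)
    max_diff = max(
        abs(a - b)
        for row_c, row_f in zip(cached_out, full_out)
        for a, b in zip(row_c, row_f)
    )
    return max_diff, len(cache.keys)
```

Calling `demo()` returns the largest absolute difference between the cached and recomputed attention outputs, together with the number of cached entries. The two paths should agree up to floating-point rounding, since each decode step computes the same attention row that full recomputation would, while projecting only one new token.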