Autoregressive next token prediction and KV Cache in transformers

3 days ago

Autoregressive language models generate text one token at a time. The prompt is first tokenized into integer IDs, typically with a BOS (beginning-of-sequence) token prepended.
During prefill, the prompt is processed in parallel to produce the first predicted token and populate the KV cache for efficiency.
Each decoder block uses multi-head self-attention and MLP layers with residual connections to refine token embeddings.
Attention involves computing Q, K, V matrices per head, applying causal masking, and combining outputs via concatenation and projection.
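
The attention steps can be sketched in plain Python. This is a minimal illustration, not the article's code: the helper names (`softmax`, `causal_attention`, `multi_head`) and the toy matrices are assumptions, and the Q, K, V inputs are taken as already projected per head.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of T vectors (lists of floats), one per position.
    """
    d = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        # Causal mask: position t only scores positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * V[s][j] for s, w in enumerate(weights))
            for j in range(len(V[0]))
        ])
    return out

def multi_head(heads_qkv, W_o):
    """Run each head, concatenate per position, then apply the output projection.

    heads_qkv: list of (Q, K, V) triples, one per head (hypothetical layout).
    W_o: projection matrix with one row per concatenated feature.
    """
    head_outs = [causal_attention(Q, K, V) for Q, K, V in heads_qkv]
    T = len(head_outs[0])
    concat = [[x for h in head_outs for x in h[t]] for t in range(T)]
    return [
        [sum(c[i] * W_o[i][j] for i in range(len(c))) for j in range(len(W_o[0]))]
        for c in concat
    ]

# Toy example: two positions, head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(Q, K, V)
# Position 0 can only attend to itself, so out[0] equals V[0].
```

Because of the mask, the first position's output is exactly its own value vector, while later positions mix the values of everything up to and including themselves.
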
The KV cache stores the key and value tensors computed for the prompt (and, later, for each generated token), allowing decode steps to process only the new token without recomputing past ones.
Decode steps generate one token at a time by appending new K and V to the cache and performing attention against the full cached history.
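
A decode step with a cache might look like the sketch below. It assumes a single head and a single layer, the `KVCache` class and `decode_step` function are hypothetical names, and the per-token query, key, and value vectors are made-up numbers standing in for real projections. The point it illustrates is that attending from the new token over the cached history gives the same result as recomputing over the whole sequence.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention of one query vector over a list of key/value vectors."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]

class KVCache:
    """Append-only store of past keys and values (one head, one layer)."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(cache, q, k, v):
    # Append the new token's K and V, then attend over the full cached history.
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

# Hypothetical per-token projections for a 3-token sequence.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
# Prefill: the first two tokens act as the prompt and populate the cache.
for k, v in zip(ks[:2], vs[:2]):
    cache.append(k, v)

# Decode: only the third token's q, k, v are computed and passed in.
cached_out = decode_step(cache, qs[2], ks[2], vs[2])

# Reference: attention for the third token recomputed over all three tokens.
full_out = attend(qs[2], ks, vs)
```

No causal mask is needed inside `decode_step`: the newest token is the last position, so every cached entry is already in its past.
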
This optimization reduces the attention cost of each decode step from quadratic to linear in the sequence length, making long-context generation feasible in transformers.
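
As a rough check on that scaling, one can count only query-key dot products for a single head and layer. The symbols here are assumptions for illustration: t is the number of tokens in context at a decode step and T is the total number of steps. Without a cache, step t recomputes attention for all t positions (about t^2 products, ignoring the constant factor the causal mask saves); with a cache, only the new token's query is scored against t keys.

```latex
\begin{aligned}
\text{without cache:}\quad & \sum_{t=1}^{T} t^{2} = \frac{T(T+1)(2T+1)}{6} \approx \frac{T^{3}}{3} \\
\text{with cache:}\quad & \sum_{t=1}^{T} t = \frac{T(T+1)}{2} \approx \frac{T^{2}}{2}
\end{aligned}
```

Under this simplified count, the per-step cost falls from about t^2 to t, and the total over a whole generation falls from cubic to quadratic in T. The count leaves out the head dimension, the number of heads and layers, and the MLP cost.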
