Autoregressive next-token prediction and the KV cache in transformers

3 days ago
  • #Autoregressive Models
  • #Transformer Optimization
  • #KV Cache
  • Autoregressive language models generate output one token at a time. The prompt is first tokenized into IDs, typically prefixed with a BOS token.
  • During prefill, all prompt tokens are processed in a single parallel forward pass, which produces the first predicted token and populates the KV cache.
  • Each decoder block uses multi-head self-attention and MLP layers with residual connections to refine token embeddings.
  • Attention involves computing Q, K, V matrices per head, applying causal masking, and combining outputs via concatenation and projection.
  • The KV cache stores the key and value tensors of every token processed so far (initially the prompt) for each layer and head, so decode steps process only the new token without recomputing past ones.
  • Decode steps generate one token at a time by appending new K and V to the cache and performing attention against the full cached history.
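  • This optimization reduces the attention cost of each decode step from quadratic to linear in the sequence length, since only the new token's query is compared against the cached keys, making long-context generation feasible in transformers.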
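
The prefill and decode steps summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's implementation: it assumes a single attention head in a single layer, random weights, and random vectors standing in for token embeddings, and the names `causal_attention` and `decode_step` are hypothetical. It checks that attending from the new token against the cached K and V gives the same result as recomputing causal attention over the whole sequence.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    """Full pass over x of shape (n, d): used for prefill and as the reference."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n = x.shape[0]
    # Causal mask: position i may not attend to positions j > i.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v, k, v

def decode_step(x_new, k_cache, v_cache, Wq, Wk, Wv):
    """One decode step for x_new of shape (1, d) using the cached K and V."""
    q = x_new @ Wq
    # Append the new token's key and value to the cache.
    k_cache = np.concatenate([k_cache, x_new @ Wk], axis=0)
    v_cache = np.concatenate([v_cache, x_new @ Wv], axis=0)
    # The new token is last, so it may attend to every cached position.
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_cache, k_cache, v_cache

rng = np.random.default_rng(0)
d = 8  # assumed head dimension, for illustration only
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
prompt = rng.normal(size=(5, d))   # stand-in for 5 prompt token embeddings
new_tok = rng.normal(size=(1, d))  # stand-in for the next token's embedding

# Prefill: process the whole prompt at once and keep K and V.
_, k_cache, v_cache = causal_attention(prompt, Wq, Wk, Wv)

# Decode: process only the new token against the cache.
out_cached, k_cache, v_cache = decode_step(new_tok, k_cache, v_cache, Wq, Wk, Wv)

# Reference: recompute attention over all 6 tokens and take the last row.
full, _, _ = causal_attention(np.concatenate([prompt, new_tok]), Wq, Wk, Wv)
matches = np.allclose(out_cached[0], full[-1])
```

In the decode step the score matrix has shape (1, n) rather than (n, n), which is where the per-step saving comes from. A real model would keep one such cache per layer and per head, and would typically preallocate it instead of concatenating on every step.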