How LLMs Work

2 days ago

LLMs begin with tokenization, which splits text into tokens: subword pieces, each represented by an integer ID.
Embeddings give meaning to tokens by mapping token IDs to learned vectors in a high-dimensional space.
Positional encoding, such as Rotary Position Embeddings (RoPE), provides order information; RoPE rotates each token's query and key vectors by an angle that depends on the token's position.
Attention mechanisms allow tokens to interact: similarity scores between queries and keys determine how heavily each token's value vector is weighted when gathering relevant information.
Multi-head attention runs multiple attention passes in parallel, which lets individual heads specialize in different linguistic relationships.
Feed-forward networks process each token independently with non-linear transformations, storing much of the model's factual knowledge.
Residual connections and layer normalization (e.g., RMSNorm) stabilize training in deep networks by allowing gradient flow and controlling vector scales.
Next-token prediction generates text by projecting the final token's vector into logits over the vocabulary, applying softmax to turn them into probabilities, and sampling according to decoding settings such as temperature.
Model differences arise from trained weights, configurations (e.g., number of layers, use of mixture-of-experts (MoE) layers), and post-training techniques like instruction tuning.
Modern LLMs share a transformer-based architecture, with innovations like speculative decoding improving efficiency in generation.
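
The attention step summarized above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any particular model's implementation: the function name `causal_attention`, the random weight matrices, the tiny dimensions, and the causal mask (each token attends only to itself and earlier tokens, as is typical for text-generating models) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention (illustrative sketch).

    x: (seq_len, d_model) token vectors; w_q, w_k, w_v: (d_model, d_head).
    Returns the attended vectors and the attention weights.
    """
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers for matching
    v = x @ w_v  # values: the information that gets mixed together
    d_head = q.shape[-1]
    # Query-key similarity, scaled by sqrt(d_head).
    scores = (q @ k.T) / np.sqrt(d_head)
    # Assumed causal mask: hide positions after the current token.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical toy sizes and random weights, chosen only for the demo.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = causal_attention(x, w_q, w_k, w_v)
```

Each row of `weights` shows how much one token draws from every earlier token; multi-head attention would run several such passes with separate weight matrices and concatenate the results.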

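The final step, turning logits into a sampled token, can be sketched the same way. The helper name `sample_next_token`, the three-entry logits vector, and the convention that a temperature of zero means greedy (argmax) decoding are assumptions for illustration; real decoders usually add further settings such as top-k or top-p filtering.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick a token ID from logits using temperature sampling (sketch)."""
    logits = np.asarray(logits, dtype=float)
    if rng is None:
        rng = np.random.default_rng()
    if temperature <= 0:
        # Assumed convention: non-positive temperature means greedy decoding.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return int(np.argmax(logits)), probs
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()     # softmax
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

# Hypothetical logits over a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
greedy_token, _ = sample_next_token(logits, temperature=0.0)
_, probs_cold = sample_next_token(logits, temperature=0.5, rng=np.random.default_rng(0))
sampled_token, probs_hot = sample_next_token(logits, temperature=2.0, rng=np.random.default_rng(0))
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across alternatives, which is why temperature is often described as a creativity or randomness knob.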