Contextualization Machines
- #LLMs
- #contextualization
- #transformers
- Transformers are viewed as contextualization machines rather than just next-token predictors.
- The residual chain is the backbone of the model: each layer reads the current hidden state and adds its own contextualization back into that chain (see the block sketch after this list).
- The tokenizer and embedding matrix provide precontextualized meanings: each token ID maps to a learned vector before any layer runs, and larger vocabularies let individual tokens carry more specific meanings (see the embedding sketch after this list).
- Increasing the tokenizer's vocabulary size improves model performance by strengthening this precontextualization.
- Attention mechanisms enable local contextualization by allowing tokens to share information within the sequence.
- Feed-forward layers perform global contextualization, integrating broader knowledge learned from the training data into each token's hidden state.
- Next-token prediction acts as speculative contextualization: later layers refine the hidden state so that, once projected through the unembedding matrix, it resembles the output distribution over next tokens (see the next-token sketch after this list).
- Multi-token prediction improves model performance by encouraging deeper speculative contextualization (see the multi-token sketch after this list).
- Evidence from published papers supports this mental model of transformers as contextualization machines.
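
A minimal embedding sketch of the precontextualization idea: before any layer runs, each token ID is looked up in the embedding matrix and becomes a learned meaning vector. The vocabulary size, model width, and token IDs below are invented for illustration.

```python
# Toy sketch (PyTorch): the embedding lookup assigns every token a meaning
# vector before any attention or feed-forward layer has run.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # hypothetical sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 2048, 911]])  # made-up token IDs for one short sequence
h = embedding(token_ids)                     # (1, 3, d_model): precontextualized hidden states
print(h.shape)

# A larger vocabulary gives rarer, more specific strings their own row in the
# embedding matrix, so their starting vector is already more specific before
# any in-context information is added.
```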
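A block sketch of one transformer layer under this framing, assuming a standard pre-norm decoder block with causal masking omitted for brevity; the class name and sizes are illustrative, not the source's code. Attention mixes information across positions (local contextualization), the feed-forward network transforms each position using knowledge stored in its weights (global contextualization), and both write their results additively onto the residual chain.

```python
import torch
import torch.nn as nn

class ContextualizationBlock(nn.Module):
    """One pre-norm transformer block, read as two contextualization steps."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Local contextualization: each position attends to other positions in
        # the same sequence and pulls in their information.
        x = self.norm1(h)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = h + attn_out                  # additive update onto the residual chain

        # Global contextualization: a per-position transformation whose weights
        # encode knowledge learned from the whole training corpus.
        h = h + self.ffn(self.norm2(h))
        return h

h = torch.randn(1, 3, 512)                # e.g. the embedded tokens from the sketch above
h = ContextualizationBlock()(h)           # hidden states, now more contextualized
```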
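A next-token sketch of speculative contextualization: the final hidden state at a position is projected through the unembedding matrix, so training pressure pushes that state toward something that already resembles a distribution over plausible next tokens. This assumes a weight-tied unembedding, which is a common but not universal choice; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)
unembed = nn.Linear(d_model, vocab_size, bias=False)
unembed.weight = embedding.weight      # weight tying (assumed here for simplicity)

h_last = torch.randn(1, d_model)       # stand-in for the last position's hidden state
logits = unembed(h_last)               # (1, vocab_size)
next_token_probs = logits.softmax(dim=-1)

# The hidden state is useful for prediction only to the extent that, under this
# projection, it resembles the distribution of likely next tokens, i.e. it has
# been contextualized toward a speculative guess about what comes next.
```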
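A multi-token sketch under the same framing. The head design below is a deliberate simplification (one small projection plus unembedding per future offset) and is not taken from the source: asking the same hidden state to speculate several tokens ahead pushes the model to contextualize more deeply than the immediate next token requires.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Predict the next k tokens from the same hidden state (simplified sketch)."""

    def __init__(self, d_model: int = 512, vocab_size: int = 50_000, k: int = 4):
        super().__init__()
        # One projection + unembedding per future offset (t+1 ... t+k).
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, vocab_size))
            for _ in range(k)
        )

    def forward(self, h: torch.Tensor) -> list[torch.Tensor]:
        # Each head asks the hidden state to speculate further into the future,
        # encouraging contextualization beyond the immediate next token.
        return [head(h) for head in self.heads]

h_last = torch.randn(1, 512)
logits_per_offset = MultiTokenHead()(h_last)   # k tensors of shape (1, vocab_size)
```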