Contextualization Machines
- #LLMs
- #contextualization
- #transformers
- Transformers are viewed as contextualization machines rather than just next-token predictors.
- The residual chain is the backbone of the model: each layer reads the current hidden state and adds its own contextualization back into that chain (see the block sketch after this list).
- The tokenizer and embedding matrix provide precontextualized meanings: each token ID maps to a learned vector before any layer runs, and larger vocabularies let individual tokens carry more specific meanings (see the embedding sketch after this list).
- Increasing the tokenizer's vocabulary size improves model performance by strengthening this precontextualization.
- Attention mechanisms enable local contextualization by allowing tokens to share information within the sequence.
- Feed-forward layers perform global contextualization, integrating broader knowledge learned from the training data into each token's hidden state.
- Next-token prediction acts as speculative contextualization: later layers refine the hidden state so that, once projected through the unembedding matrix, it resembles the output distribution over next tokens (see the next-token sketch after this list).
- Multi-token prediction improves model performance by encouraging deeper speculative contextualization (see the multi-token sketch after this list).
- Evidence from published papers supports this mental model of transformers as contextualization machines.
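
A minimal embedding sketch of the precontextualization idea: before any layer runs, each token ID is looked up in the embedding matrix and becomes a learned meaning vector. The vocabulary size, model width, and token IDs below are invented for illustration.

```python
# Toy sketch (PyTorch): the embedding lookup assigns every token a meaning
# vector before any attention or feed-forward layer has run.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # hypothetical sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 2048, 911]])  # made-up token IDs for one short sequence
h = embedding(token_ids)                     # (1, 3, d_model): precontextualized hidden states
print(h.shape)

# A larger vocabulary gives rarer, more specific strings their own row in the
# embedding matrix, so their starting vector is already more specific before
# any in-context information is added.
```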
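A block sketch of one transformer layer under this framing, assuming a standard pre-norm decoder block with causal masking omitted for brevity; the class name and sizes are illustrative, not the source's code. Attention mixes information across positions (local contextualization), the feed-forward network transforms each position using knowledge stored in its weights (global contextualization), and both write their results additively onto the residual chain.

```python
import torch
import torch.nn as nn

class ContextualizationBlock(nn.Module):
    """One pre-norm transformer block, read as two contextualization steps."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Local contextualization: each position attends to other positions in
        # the same sequence and pulls in their information.
        x = self.norm1(h)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = h + attn_out                  # additive update onto the residual chain

        # Global contextualization: a per-position transformation whose weights
        # encode knowledge learned from the whole training corpus.
        h = h + self.ffn(self.norm2(h))
        return h

h = torch.randn(1, 3, 512)                # e.g. the embedded tokens from the sketch above
h = ContextualizationBlock()(h)           # hidden states, now more contextualized
```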
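A next-token sketch of speculative contextualization: the final hidden state at a position is projected through the unembedding matrix, so training pressure pushes that state toward something that already resembles a distribution over plausible next tokens. This assumes a weight-tied unembedding, which is a common but not universal choice; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)
unembed = nn.Linear(d_model, vocab_size, bias=False)
unembed.weight = embedding.weight      # weight tying (assumed here for simplicity)

h_last = torch.randn(1, d_model)       # stand-in for the last position's hidden state
logits = unembed(h_last)               # (1, vocab_size)
next_token_probs = logits.softmax(dim=-1)

# The hidden state is useful for prediction only to the extent that, under this
# projection, it resembles the distribution of likely next tokens, i.e. it has
# been contextualized toward a speculative guess about what comes next.
```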
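A multi-token sketch under the same framing. The head design below is a deliberate simplification (one small projection plus unembedding per future offset) and is not taken from the source: asking the same hidden state to speculate several tokens ahead pushes the model to contextualize more deeply than the immediate next token requires.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Predict the next k tokens from the same hidden state (simplified sketch)."""

    def __init__(self, d_model: int = 512, vocab_size: int = 50_000, k: int = 4):
        super().__init__()
        # One projection + unembedding per future offset (t+1 ... t+k).
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, vocab_size))
            for _ in range(k)
        )

    def forward(self, h: torch.Tensor) -> list[torch.Tensor]:
        # Each head asks the hidden state to speculate further into the future,
        # encouraging contextualization beyond the immediate next token.
        return [head(h) for head in self.heads]

h_last = torch.randn(1, 512)
logits_per_offset = MultiTokenHead()(h_last)   # k tensors of shape (1, vocab_size)
```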