Hasty Briefs


Writing an LLM from scratch, part 15 – from context vectors to logits

a year ago
  • #Embeddings
  • #LLM
  • #Neural Networks
  • The post covers the transition from context vectors to logits in LLMs, highlighting the simplicity of the conversion: a single linear layer (first sketch after this list).
  • Explains weight tying, where the transpose of the embedding matrix projects context vectors back into vocabulary space, so the output head needs no weights of its own (second sketch).
  • Details how token IDs are converted to embeddings, conceptually via one-hot vectors and a matrix multiplication, emphasizing the role of embeddings in representing token meanings (third sketch).
  • Introduces logits as unnormalized scores that softmax converts into actual probabilities, serving as the LLM's output for next-token prediction (fourth sketch).
  • Clarifies that each context vector in the output corresponds to a prediction of the next token for the input sequence up to that position, so there is one prediction per input token rather than just one for the final token (fifth sketch).
  • Discusses practical reasons to avoid weight tying during training: separate trainable layers for the embeddings and the logits can yield better results, since context vectors are far more enriched than the raw token embeddings (sixth sketch).
  • Mentions perplexity as a measure of how uncertain the model is about its predictions, linking it to how spread out the distribution of logits is (final sketch).
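
A minimal PyTorch sketch of that final projection, assuming GPT-2-style dimensions purely for illustration:

```python
import torch
import torch.nn as nn

emb_dim, vocab_size = 768, 50257  # GPT-2-style sizes, assumed for illustration

# The whole output head is one linear layer: it maps each position's
# context vector from embedding space to one raw score per vocab token.
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

context_vectors = torch.randn(1, 4, emb_dim)   # (batch, seq_len, emb_dim)
logits = out_head(context_vectors)             # (batch, seq_len, vocab_size)
print(logits.shape)                            # torch.Size([1, 4, 50257])
```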
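Weight tying can be sketched the same way; here a hypothetical `tok_emb` embedding layer stands in for the model's, and its transpose doubles as the output projection:

```python
import torch
import torch.nn as nn

emb_dim, vocab_size = 768, 50257
tok_emb = nn.Embedding(vocab_size, emb_dim)    # weight shape: (vocab_size, emb_dim)

# Tied output head: project context vectors back into vocabulary space
# with the transpose of the embedding matrix instead of new weights.
context_vectors = torch.randn(1, 4, emb_dim)
logits = context_vectors @ tok_emb.weight.T    # (1, 4, vocab_size)

# Equivalent formulation: share the parameter with an nn.Linear head,
# so the two layers always update together during training.
out_head = nn.Linear(emb_dim, vocab_size, bias=False)
out_head.weight = tok_emb.weight
```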
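The token-ID-to-embedding step can be checked directly: a one-hot vector times the embedding matrix picks out the same row the lookup returns. A tiny sketch with made-up sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim = 10, 4                    # tiny made-up sizes
tok_emb = nn.Embedding(vocab_size, emb_dim)
token_ids = torch.tensor([2, 7, 7, 0])

# Multiplying a one-hot row by the embedding matrix selects one row...
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ tok_emb.weight

# ...which is exactly what the embedding lookup does, just without the
# wasteful multiplications by zero.
via_lookup = tok_emb(token_ids)
print(torch.allclose(via_matmul, via_lookup))  # True
```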
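The logits-to-probabilities step is just softmax over the vocabulary dimension; a sketch with made-up scores:

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5, 0.5])   # made-up scores, 4-token vocab

# Softmax exponentiates and normalizes, so the unnormalized logits
# become a proper probability distribution over the vocabulary.
probs = torch.softmax(logits, dim=-1)
print(probs)         # largest logit gets the largest probability
print(probs.sum())   # tensor(1.)
```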
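The one-prediction-per-position point shows up most clearly in the training loss, where the targets are the inputs shifted one token to the left; a sketch using random stand-in logits:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 1, 4, 10
token_ids = torch.randint(vocab_size, (batch, seq_len + 1))
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # targets = inputs shifted by one

logits = torch.randn(batch, seq_len, vocab_size)        # stand-in for model(inputs)

# Position i's logits are scored against token i+1: every input position
# yields a next-token prediction, not just the last one.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss)
```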
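The cost of untying is easy to see in parameter counts; a sketch contrasting the two choices, again with GPT-2-style sizes as an assumption:

```python
import torch.nn as nn

emb_dim, vocab_size = 768, 50257

tok_emb = nn.Embedding(vocab_size, emb_dim)             # input side
out_head = nn.Linear(emb_dim, vocab_size, bias=False)   # separate output side

# Tied: one shared matrix. Untied: two matrices, so the output head can
# specialize for enriched context vectors at the price of more parameters.
tied_params = tok_emb.weight.numel()
untied_params = tok_emb.weight.numel() + out_head.weight.numel()
print(tied_params, untied_params)   # 38,597,376 vs 77,194,752
```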
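Perplexity falls out of the same cross-entropy computation; a sketch, once more with random stand-in logits:

```python
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, 4, vocab_size)      # stand-in model output
targets = torch.randint(vocab_size, (1, 4))

# Perplexity is exp(mean cross-entropy): flat, spread-out logit
# distributions give high perplexity; sharp, confident (and correct)
# ones give low perplexity.
ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(torch.exp(ce))                        # high here, since the logits are random
```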