Writing an LLM from scratch, part 32d – Interventions: adding attention bias
- #attention-bias
- #GPT-2
- #LLM
- The article describes an experiment to improve the test loss of a GPT-2 small base model by adding bias terms to the query, key, and value projection layers (QKV bias).
- The author references Sebastian Raschka's book, whose GPT-2 implementation disables QKV bias by default, reflecting the common view that it is unnecessary for performance.
- Despite that common practice, the experiment shows that enabling QKV bias improves the test-set loss by 0.023, a larger gain than gradient clipping produced (0.014).
- The bias terms increase the model's parameter count by less than 0.02%, which makes the improvement notable for such a small cost.
- The author speculates that QKV bias might stabilise training, or that the extra parameters add a little modelling capacity, an effect that could matter more in smaller models.
- Future steps include re-running the baseline training to verify that the results are significant, and exploring learning-rate tweaks.
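To make the change concrete, here is a minimal PyTorch sketch of what toggling QKV bias looks like, together with the resulting parameter count. The class and argument names (`CausalSelfAttention`, `qkv_bias`) are illustrative, not the author's actual code; the dimensions assume the standard GPT-2 small configuration (embedding size 768, 12 layers).

```python
import torch.nn as nn

EMB_DIM = 768    # GPT-2 small embedding dimension
N_LAYERS = 12    # GPT-2 small transformer blocks


class CausalSelfAttention(nn.Module):
    """Illustrative attention module: only the projections are shown."""

    def __init__(self, emb_dim: int, qkv_bias: bool = False):
        super().__init__()
        # With qkv_bias=True, each projection gains a learnable
        # bias vector of length emb_dim on top of its weight matrix.
        self.W_q = nn.Linear(emb_dim, emb_dim, bias=qkv_bias)
        self.W_k = nn.Linear(emb_dim, emb_dim, bias=qkv_bias)
        self.W_v = nn.Linear(emb_dim, emb_dim, bias=qkv_bias)


# Extra parameters from QKV bias: three bias vectors per layer.
extra_params = 3 * EMB_DIM * N_LAYERS
print(extra_params)  # 27648
```

Relative to the roughly hundred-million-plus parameters of GPT-2 small, those ~27.6k extra parameters are a tiny fraction, consistent with the "under 0.02%" figure above.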