Writing an LLM from scratch, part 32d – Interventions: adding attention bias
- #attention-bias
- #GPT-2
- #LLM
- The article describes an experiment to improve the test loss of a GPT-2 small base model by adding bias terms to the query, key, and value projection layers (QKV bias).
- The author references Sebastian Raschka's book, whose GPT-2 implementation disables QKV bias by default, reflecting the common view that it is unnecessary for performance.
- Despite that common practice, the experiment shows that enabling QKV bias improves the test-set loss by 0.023, a larger gain than gradient clipping produced (0.014).
- The bias terms increase the model's parameter count by less than 0.02%, which makes the improvement notable for such a small cost.
- The author speculates that QKV bias might stabilise training, or that the extra parameters add a little modelling capacity, an effect that could matter more in smaller models.
- Future steps include re-running the baseline training to verify that the results are significant, and exploring learning-rate tweaks.
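To make the change concrete, here is a minimal PyTorch sketch of what toggling QKV bias looks like, together with the resulting parameter count. The class and argument names (`CausalSelfAttention`, `qkv_bias`) are illustrative, not the author's actual code; the dimensions assume the standard GPT-2 small configuration (embedding size 768, 12 layers).

```python
import torch.nn as nn

EMB_DIM = 768    # GPT-2 small embedding dimension
N_LAYERS = 12    # GPT-2 small transformer blocks


class CausalSelfAttention(nn.Module):
    """Illustrative attention module: only the projections are shown."""

    def __init__(self, emb_dim: int, qkv_bias: bool = False):
        super().__init__()
        # With qkv_bias=True, each projection gains a learnable
        # bias vector of length emb_dim on top of its weight matrix.
        self.W_q = nn.Linear(emb_dim, emb_dim, bias=qkv_bias)
        self.W_k = nn.Linear(emb_dim, emb_dim, bias=qkv_bias)
        self.W_v = nn.Linear(emb_dim, emb_dim, bias=qkv_bias)


# Extra parameters from QKV bias: three bias vectors per layer.
extra_params = 3 * EMB_DIM * N_LAYERS
print(extra_params)  # 27648
```

Relative to the roughly hundred-million-plus parameters of GPT-2 small, those ~27.6k extra parameters are a tiny fraction, consistent with the "under 0.02%" figure above.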