
Writing an LLM from scratch, part 32d – Interventions: adding attention bias

3 months ago
  • #attention-bias
  • #GPT-2
  • #LLM
  • The article describes an experiment to improve the test loss of a GPT-2 small base model by adding bias terms to the query, key, and value projection layers (QKV bias); a minimal sketch of the change follows the list.
  • The author references Sebastian Raschka's book, which disables QKV bias by default, since the bias terms are generally considered unnecessary for performance in modern LLMs.
  • Despite the common practice, the experiment shows that adding QKV bias improves the test set loss by 0.023, more than gradient clipping did (0.014).
  • The model's parameter count increases by less than 0.02% with QKV bias, making the improvement notable.
  • The author speculates that QKV bias might stabilize training, or that the extra parameters might add a small amount of extra capacity ("intelligence"), an effect that could matter more in smaller models.
  • Future steps include checking whether the result is significant by re-running the baseline training run, and exploring learning-rate tweaks.
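
For readers who want to see where the flag lives, here is a minimal sketch (not the article's actual code) of a PyTorch causal self-attention module in the style of Raschka's book, with a `qkv_bias` flag on the query, key, and value projections. The class and parameter names and the GPT-2-small-style config values are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with an optional bias on the QKV projections."""

    def __init__(self, d_model, n_heads, context_len, qkv_bias=False):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # The experiment simply flips qkv_bias from False to True on these three layers.
        self.W_query = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_key = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_value = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.out_proj = nn.Linear(d_model, d_model)
        # Upper-triangular mask so each position only attends to earlier positions.
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_len, context_len), diagonal=1).bool(),
        )

    def forward(self, x):
        b, t, d = x.shape
        # Project and split into heads: (batch, heads, tokens, head_dim).
        q = self.W_query(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.out_proj(out)


# With GPT-2-small-style settings (12 layers, d_model=768), enabling qkv_bias adds
# 3 * 768 bias parameters per layer:
extra_params = 12 * 3 * 768  # 27,648 extra parameters
print(extra_params)

# Example: attn = CausalSelfAttention(d_model=768, n_heads=12, context_len=1024, qkv_bias=True)
```

If the configuration is the roughly 163M-parameter untied-embedding GPT-2 small used in the book, those ~27,648 extra biases come to about 0.017% of the model, consistent with the under-0.02% increase noted above.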