
Why Can't Transformers Learn Multiplication?

  • #Long-Range Dependencies
  • #Transformers
  • #Machine Learning
  • Language models struggle with multi-digit multiplication despite their increasing capabilities.
  • Reverse-engineering a model trained with implicit chain-of-thought reveals how it learns multiplication.
  • Key findings cover evidence of long-range structure, the mechanism for encoding dependencies, and the geometry of partial products.
  • The model encodes long-range dependencies via attention to construct a directed acyclic graph for caching and retrieving partial products.
  • Attention heads compute partial products via Minkowski sums over digits represented in a Fourier basis (a toy sketch follows the list).
  • Models trained with standard fine-tuning converge to a local optimum that lacks the necessary long-range dependencies.
  • An auxiliary loss that supervises a linear-regression probe on the 'running sum' enables the model to learn multi-digit multiplication (sketched after the list).
  • The study highlights a pitfall in how Transformers learn long-range dependencies and suggests an inductive bias as a solution.
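
As a rough illustration of the Fourier-basis digit geometry mentioned above, the toy sketch below encodes each digit as angles on a few circles and combines two codes by adding angles, which realizes modular digit addition. The frequency set and the combine rule are assumptions made for illustration; the paper's Minkowski-sum construction for partial products is richer than this minimal picture.

```python
import numpy as np

# Toy sketch (assumed illustration, not the paper's code): digit d maps to
# angles 2*pi*k*d/10 at a few frequencies k, i.e. a Fourier-basis code.
FREQS = np.array([1, 2, 5])  # hypothetical frequency set

def fourier_encode(d: int) -> np.ndarray:
    """[cos, sin] features of digit d at each frequency."""
    angles = 2 * np.pi * FREQS * d / 10.0
    return np.concatenate([np.cos(angles), np.sin(angles)])

def combine(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Combine two codes by adding their angles (complex multiplication).
    The code of digit a combined with the code of digit b equals the code of
    (a + b) mod 10 -- the kind of circular geometry that lets attention heads
    assemble partial products from digit representations."""
    n = len(FREQS)
    xc, yc = x[:n] + 1j * x[n:], y[:n] + 1j * y[n:]
    zc = xc * yc
    return np.concatenate([zc.real, zc.imag])

# The combined code of 7 and 8 matches the code of (7 + 8) % 10 = 5.
assert np.allclose(combine(fourier_encode(7), fourier_encode(8)),
                   fourier_encode((7 + 8) % 10))
```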
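The auxiliary 'running sum' loss can be pictured as a linear regression probe on hidden states whose error is added to the usual next-token loss. The class name, tensor shapes, and weighting below are hypothetical and only indicate the shape of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RunningSumAux(nn.Module):
    """Hypothetical auxiliary head: a linear regression probe that predicts the
    running-sum target from hidden states; its MSE is added to the main loss."""
    def __init__(self, d_model: int):
        super().__init__()
        self.probe = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor, running_sum: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); running_sum: (batch, seq) regression targets
        pred = self.probe(hidden).squeeze(-1)
        return nn.functional.mse_loss(pred, running_sum)

# Toy usage with random tensors, just to show shapes; in training the total loss
# would be something like: loss = cross_entropy + lambda_aux * aux_loss.
B, T, D = 2, 12, 64
aux_head = RunningSumAux(D)
hidden_states = torch.randn(B, T, D)      # stand-in for a layer's activations
running_sum_targets = torch.randn(B, T)   # stand-in for the true running sums
aux_loss = aux_head(hidden_states, running_sum_targets)
print(aux_loss.item())
```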