Why Can't Transformers Learn Multiplication?
- #Long-Range Dependencies
- #Transformers
- #Machine Learning
- Language models struggle with multi-digit multiplication despite their increasing capabilities.
- The authors reverse-engineer a model trained with implicit chain-of-thought (ICoT) that does learn multiplication, and use it to explain why standard training fails.
- Key findings fall into three threads: evidence of long-range structure in the task, the mechanism by which the model encodes those dependencies, and the geometry of its partial-product representations.
- The successful model encodes long-range dependencies via attention, effectively constructing a directed acyclic graph that caches and later retrieves pairwise partial products (the identity after this list makes the dependency structure explicit).
- Attention heads implement the partial products using Minkowski sums of digit representations, with individual digits encoded in a Fourier basis (a toy sketch of this representation follows the list).
- Models trained with standard fine-tuning instead converge to a local optimum that lacks these long-range dependencies.
- Adding an auxiliary loss in which a linear regression probe predicts the 'running sum' enables the model to learn multi-digit multiplication (sketched in code after the list).
- The study highlights a broader pitfall in how Transformers learn long-range dependencies and suggests a suitable inductive bias as the remedy.
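
To make the long-range structure concrete, the schoolbook identity below spells out what the bullets summarize: the k-th output digit mixes every digit pair (a_i, b_j) with i + j = k, and those digits can sit far apart in the input sequence. This is a standard restatement, not a formula taken from the paper.

```latex
% Schoolbook multiplication with a = \sum_i a_i 10^i and b = \sum_j b_j 10^j:
% the k-th output digit depends on all digit pairs with i + j = k, plus a carry
% (with carry_{-1} = 0) -- exactly the long-range dependency in question.
a \cdot b \;=\; \sum_{k}\Big(\underbrace{\sum_{i+j=k} a_i b_j}_{s_k\ (\text{partial products})}\Big)\,10^{k},
\qquad
c_k \;=\; \big(s_k + \mathrm{carry}_{k-1}\big) \bmod 10,
\qquad
\mathrm{carry}_k \;=\; \Big\lfloor \frac{s_k + \mathrm{carry}_{k-1}}{10} \Big\rfloor .
```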
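The Minkowski-sum and Fourier-basis finding can be illustrated with a small NumPy sketch. Everything here (the names fourier_embed and pair_feature, the choice of frequencies, the equal-weight attention picture) is an illustrative assumption rather than the paper's code; it only checks that the sum of two Fourier-basis digit embeddings, roughly what an attention head attending equally to two digit tokens would output, still identifies the unordered digit pair, so a downstream readout could in principle recover the partial product.

```python
import numpy as np
from itertools import combinations_with_replacement

def fourier_embed(d: int, base: int = 10) -> np.ndarray:
    """Hypothetical Fourier-basis embedding of a base-10 digit: one (cos, sin)
    pair per frequency k = 0..base/2, i.e. the real DFT of a one-hot digit."""
    ks = np.arange(base // 2 + 1)
    angles = 2 * np.pi * ks * d / base
    return np.concatenate([np.cos(angles), np.sin(angles)])

def pair_feature(d1: int, d2: int) -> np.ndarray:
    """Toy stand-in for the Minkowski-sum picture: a head attending equally to
    two digit tokens outputs (up to scale) the sum of their embeddings, i.e. a
    point in the Minkowski sum of the two digit-embedding sets."""
    return fourier_embed(d1) + fourier_embed(d2)

# Check that the summed feature distinguishes every unordered digit pair, so a
# downstream readout could in principle decode the partial product d1 * d2.
pairs = list(combinations_with_replacement(range(10), 2))
feats = np.stack([pair_feature(a, b) for a, b in pairs])
dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
print("all unordered digit pairs distinguishable:", bool(dists.min() > 1e-6))
```

The script prints True: because the embedding is an invertible transform of the one-hot digit vector, summing two embeddings preserves the multiset of digits, so no information about the pair is lost.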
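Finally, a minimal PyTorch sketch of the kind of auxiliary 'running sum' objective described above, assuming a decoder that exposes per-position hidden states. The names (RunningSumProbe, running_sum_targets, aux_weight) are hypothetical, and this is a sketch of the idea, not the paper's implementation: a linear regression head predicts the running sum at each position, and its loss is added to the usual next-token loss.

```python
import torch
import torch.nn as nn

class RunningSumProbe(nn.Module):
    """Linear regression head that reads a running-sum value out of each
    position's hidden state (hypothetical name and shape conventions)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) -> (batch, seq_len)
        return self.linear(hidden_states).squeeze(-1)

def total_loss(lm_loss: torch.Tensor,
               hidden_states: torch.Tensor,
               running_sum_targets: torch.Tensor,
               probe: RunningSumProbe,
               aux_weight: float = 0.1) -> torch.Tensor:
    """Next-token loss plus an auxiliary regression loss that encourages each
    position's hidden state to linearly encode the running sum so far."""
    aux_loss = nn.functional.mse_loss(probe(hidden_states), running_sum_targets)
    return lm_loss + aux_weight * aux_loss

# Toy usage with random tensors standing in for a real model's outputs.
probe = RunningSumProbe(d_model=64)
h = torch.randn(2, 16, 64)            # (batch, seq_len, d_model)
targets = torch.randn(2, 16)          # per-position running-sum targets
loss = total_loss(torch.tensor(1.0), h, targets, probe)
```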