Pretraining Language Models via Neural Cellular Automata
4 days ago
- #language models
- #neural cellular automata
- #synthetic data
- Large language models demand ever-larger training corpora, and the supply of high-quality natural language data is projected to be exhausted around 2028.
- Neural cellular automata (NCA) use small neural networks as local update rules to produce diverse spatiotemporal dynamics, offering an alternative data source to natural language for training models.
- NCA trajectories are tokenized and fed to transformers, requiring models to infer latent rules in-context, enhancing reasoning capabilities.
- NCA pre-pre-training outperforms pre-pre-training on natural language and other synthetic data in both convergence speed and final perplexity across several downstream domains.
- NCA data, despite having zero linguistic content, teaches models to track long-range dependencies and infer latent rules, skills that transfer directly to language modeling.
- Optimal NCA complexity varies by domain, with simpler dynamics benefiting code and more complex dynamics preferred for math and web text.
- NCA pre-pre-training forces models to learn general mechanisms for rule inference rather than memorizing specific rules, supported by empirical findings.
- The approach opens a new axis of control for training language models, allowing tuning of synthetic data structure to match target domains.
- Long-term vision includes foundation models acquiring reasoning from synthetic data and semantics from a small, curated natural language corpus to reduce human biases.
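The core pipeline described above (roll out an NCA, tokenize the trajectory, feed tokens to a transformer) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the 1-D grid, the random linear update rule, and the uniform-bin tokenizer are all assumptions chosen for brevity.

```python
# Hypothetical sketch: generate an NCA trajectory and tokenize it for a
# transformer. Grid size, channel count, and the update rule are illustrative.
import numpy as np

rng = np.random.default_rng(0)
GRID, CHANNELS, STEPS, VOCAB = 32, 4, 8, 256

# A random "latent rule": a shared linear map over each cell's 3-neighborhood.
# A model trained on many such rules must infer the active rule in-context.
W = rng.normal(0.0, 0.5, size=(3 * CHANNELS, CHANNELS))

def nca_step(state):
    """One synchronous NCA update: gather left/center/right neighbors,
    apply the shared rule to every cell, squash with tanh."""
    left = np.roll(state, 1, axis=0)
    right = np.roll(state, -1, axis=0)
    perception = np.concatenate([left, state, right], axis=1)  # (GRID, 3*CHANNELS)
    return np.tanh(perception @ W)

def rollout(steps=STEPS):
    """Run the NCA from a random initial state, keeping every frame."""
    state = rng.normal(size=(GRID, CHANNELS))
    frames = [state]
    for _ in range(steps):
        state = nca_step(state)
        frames.append(state)
    return np.stack(frames)  # (STEPS + 1, GRID, CHANNELS)

def tokenize(traj, vocab=VOCAB):
    """Flatten the trajectory and bucket each scalar into one of `vocab`
    uniform bins over [-1, 1], yielding an integer token stream."""
    bins = np.linspace(-1.0, 1.0, vocab - 1)
    return np.digitize(traj.reshape(-1), bins)  # ids in [0, vocab - 1]

tokens = tokenize(rollout())
print(tokens.shape, int(tokens.min()), int(tokens.max()))
```

The resulting integer sequence plays the role of a text corpus: a transformer trained with a next-token objective on many rollouts, each generated by a freshly sampled `W`, can only do well by inferring the underlying rule from context rather than memorizing any single dynamic.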