Pretraining Language Models via Neural Cellular Automata
4 days ago
- #language models
- #neural cellular automata
- #synthetic data
- Large language models demand ever-larger training corpora, and the supply of high-quality natural language data is projected to be exhausted around 2028.
- Neural cellular automata (NCA) use small neural networks as local update rules to produce diverse spatiotemporal dynamics, offering an alternative data source to natural language for training models.
- NCA trajectories are tokenized and fed to transformers, requiring models to infer latent rules in-context, enhancing reasoning capabilities.
- NCA pre-pre-training outperforms pre-pre-training on natural language and other synthetic data in both convergence speed and final perplexity across several downstream domains.
- NCA data, despite having zero linguistic content, teaches models to track long-range dependencies and infer latent rules, skills that transfer directly to language modeling.
- Optimal NCA complexity varies by domain, with simpler dynamics benefiting code and more complex dynamics preferred for math and web text.
- NCA pre-pre-training forces models to learn general mechanisms for rule inference rather than memorizing specific rules, supported by empirical findings.
- The approach opens a new axis of control for training language models, allowing tuning of synthetic data structure to match target domains.
- Long-term vision includes foundation models acquiring reasoning from synthetic data and semantics from a small, curated natural language corpus to reduce human biases.
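The core pipeline described above (roll out an NCA, tokenize the trajectory, feed tokens to a transformer) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the 1-D grid, the random linear update rule, and the uniform-bin tokenizer are all assumptions chosen for brevity.

```python
# Hypothetical sketch: generate an NCA trajectory and tokenize it for a
# transformer. Grid size, channel count, and the update rule are illustrative.
import numpy as np

rng = np.random.default_rng(0)
GRID, CHANNELS, STEPS, VOCAB = 32, 4, 8, 256

# A random "latent rule": a shared linear map over each cell's 3-neighborhood.
# A model trained on many such rules must infer the active rule in-context.
W = rng.normal(0.0, 0.5, size=(3 * CHANNELS, CHANNELS))

def nca_step(state):
    """One synchronous NCA update: gather left/center/right neighbors,
    apply the shared rule to every cell, squash with tanh."""
    left = np.roll(state, 1, axis=0)
    right = np.roll(state, -1, axis=0)
    perception = np.concatenate([left, state, right], axis=1)  # (GRID, 3*CHANNELS)
    return np.tanh(perception @ W)

def rollout(steps=STEPS):
    """Run the NCA from a random initial state, keeping every frame."""
    state = rng.normal(size=(GRID, CHANNELS))
    frames = [state]
    for _ in range(steps):
        state = nca_step(state)
        frames.append(state)
    return np.stack(frames)  # (STEPS + 1, GRID, CHANNELS)

def tokenize(traj, vocab=VOCAB):
    """Flatten the trajectory and bucket each scalar into one of `vocab`
    uniform bins over [-1, 1], yielding an integer token stream."""
    bins = np.linspace(-1.0, 1.0, vocab - 1)
    return np.digitize(traj.reshape(-1), bins)  # ids in [0, vocab - 1]

tokens = tokenize(rollout())
print(tokens.shape, int(tokens.min()), int(tokens.max()))
```

The resulting integer sequence plays the role of a text corpus: a transformer trained with a next-token objective on many rollouts, each generated by a freshly sampled `W`, can only do well by inferring the underlying rule from context rather than memorizing any single dynamic.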