Hasty Briefs

Pretraining Language Models via Neural Cellular Automata

4 days ago
  • #language models
  • #neural cellular automata
  • #synthetic data
  • Large language models demand exponentially more data as they scale, and high-quality natural-language text is projected to run out by 2028.
  • Neural cellular automata (NCA) use small neural networks as local update rules, producing diverse spatiotemporal dynamics that can serve as a synthetic alternative to natural language for pretraining.
  • NCA trajectories are tokenized and fed to transformers; to predict the next token, the model must infer the latent update rule in-context, which strengthens reasoning (a minimal generation-and-tokenization sketch follows this list).
  • NCA pre-pre-training (pretraining on NCA data before ordinary pretraining) outperforms natural language and other synthetic data in both convergence speed and final perplexity across several domains.
  • Despite containing zero linguistic content, NCA data teaches models to track long-range dependencies and infer latent rules, skills that carry over to language modeling.
  • The optimal NCA complexity varies by target domain: simpler dynamics benefit code, while math and web text prefer more complex dynamics (see the second sketch after this list).
  • NCA pre-pre-training forces models to learn general mechanisms for rule inference rather than memorizing specific rules, an interpretation the paper's empirical findings support.
  • The approach opens a new axis of control for training language models: the structure of the synthetic data can be tuned to match the target domain.
  • The long-term vision is foundation models that acquire reasoning from synthetic data and semantics from a small, curated natural-language corpus, reducing exposure to human biases.
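
To make the generation-and-tokenization idea concrete, here is a minimal sketch of one way an NCA trajectory could be produced and flattened into a token stream for a decoder-only transformer. The grid size, state count, random-MLP update rule, and raster-order tokenization are all assumptions made for illustration; the paper's exact architecture and scheme may differ.

```python
# Minimal sketch: generate an NCA trajectory and tokenize it for
# next-token prediction. All hyperparameters below are illustrative
# assumptions, not the paper's reported setup.
import numpy as np

rng = np.random.default_rng(0)

GRID = 16    # side length of the 2D cell grid (assumed)
STATES = 8   # discrete cell states, doubling as vocabulary size (assumed)
HIDDEN = 32  # width of the per-cell update network (assumed)
STEPS = 12   # trajectory length in time steps (assumed)

# Randomly initialized per-cell update network: maps the one-hot states
# of a cell's 3x3 neighborhood to logits over the next state. Each draw
# of (W1, W2) defines one latent "rule" the model must infer in-context.
W1 = rng.normal(0.0, 1.0, (9 * STATES, HIDDEN))
W2 = rng.normal(0.0, 1.0, (HIDDEN, STATES))

def step(grid):
    """Apply the neural update rule to every cell (toroidal boundary)."""
    nxt = np.empty_like(grid)
    for i in range(GRID):
        for j in range(GRID):
            # Gather the 3x3 neighborhood as a one-hot feature vector.
            neigh = [grid[(i + di) % GRID, (j + dj) % GRID]
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            x = np.zeros(9 * STATES)
            for k, s in enumerate(neigh):
                x[k * STATES + s] = 1.0
            logits = np.tanh(x @ W1) @ W2
            nxt[i, j] = int(np.argmax(logits))  # deterministic update
    return nxt

# Roll out a trajectory from a random initial grid.
grid = rng.integers(0, STATES, (GRID, GRID))
frames = [grid]
for _ in range(STEPS - 1):
    frames.append(step(frames[-1]))

# Tokenize: flatten each frame in raster order and concatenate frames,
# so the transformer sees the dynamics as one long next-token stream.
tokens = np.concatenate([f.reshape(-1) for f in frames])
print(tokens.shape)  # (GRID * GRID * STEPS,) = (3072,)
```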
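
And a second sketch of the domain-dependent complexity knob. The `NCA_CONFIGS` table and `make_rule` helper are hypothetical names invented here; the specific values only echo the summary's qualitative claim that code benefits from simpler dynamics while math and web text prefer richer ones.

```python
# Sketch of tuning synthetic-data complexity per target domain.
# Names and values are hypothetical, chosen only to mirror the
# summary's qualitative claim.
import numpy as np

NCA_CONFIGS = {
    "code":     {"states": 4,  "hidden": 16, "steps": 8},   # simpler dynamics
    "math":     {"states": 16, "hidden": 64, "steps": 24},  # richer dynamics
    "web_text": {"states": 16, "hidden": 64, "steps": 24},
}

def make_rule(states, hidden, seed):
    """Draw one random NCA update network at the requested complexity."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (9 * states, hidden))  # 3x3 one-hot neighborhood in
    W2 = rng.normal(0.0, 1.0, (hidden, states))      # next-state logits out
    return W1, W2

# Drawing many rules at one complexity level yields a synthetic corpus
# whose structure is tuned to the downstream target domain.
cfg = NCA_CONFIGS["code"]
rules = [make_rule(cfg["states"], cfg["hidden"], seed=s) for s in range(4)]
print(len(rules), rules[0][0].shape)  # 4 (36, 16)
```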