Hasty Briefs (beta)

An LLM trained only on data from certain time periods to reduce modern bias

10 months ago
  • #AI
  • #Natural Language Processing
  • #Historical Simulation
  • TimeCapsule LLM is an experimental project aimed at simulating the worldview and language of specific historical eras by training exclusively on texts from those periods.
  • The current model is trained on texts from London, 1800-1850, with plans to widen the window to 1800-1875, so that modern concepts and biases never enter the training data.
  • Training involves collecting and cleaning historical texts, building a custom tokenizer, and training from scratch using Andrej Karpathy's nanoGPT.
  • Initial training used 187 MB of data (about 50 books), producing output with period-appropriate 1800s language but limited coherence. The goal is to scale to 500-600 books for better reasoning.
  • The project currently focuses on curating historical data and preparing it for training; the model so far has roughly 16 million parameters.
  • Challenges include verifying that texts are unaltered by modern interpretations and handling OCR errors and later editorial annotations.
  • Outputs so far reflect 1800s language and show no modern concepts, though sentence structure and coherence should improve as the dataset grows.
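The cleaning step described above might look something like the following minimal sketch. The specific patterns are illustrative assumptions, not the project's actual pipeline: the long-s substitution targets a character common in older typography that OCR often misreads, and the bracket regex stands in for whatever rules would strip modern editorial annotations.

```python
import re

def clean_page(text: str) -> str:
    """Illustrative cleanup of one OCR'd page of historical text."""
    # Replace the archaic long s (ſ) with a modern 's'; OCR engines
    # frequently misread it or emit it inconsistently.
    text = text.replace("ſ", "s")
    # Drop bracketed editorial insertions and footnote markers such
    # as "[1]" or "[Ed.: modern gloss]" (hypothetical patterns).
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_page("The ſame paſſage [1] as printed [Ed.: modern gloss]."))
```

Real cleanup would need many more rules (hyphenation across line breaks, page headers, Gutenberg boilerplate), but the shape is the same: small, auditable transformations that remove modern additions without rewording the period text.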
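Preparing the cleaned corpus for nanoGPT-style training from scratch can be sketched as below. This follows the character-level approach of nanoGPT's `shakespeare_char` example (vocabulary built from the corpus itself, token ids stored as `uint16`, a simple train/validation split); the tiny corpus string is a stand-in for the project's 187 MB of 1800s texts, and the project's custom tokenizer may work differently.

```python
import numpy as np

# Stand-in corpus; the real project trains on 1800-1850 London texts.
corpus = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness."
)

# Character-level vocabulary derived from the corpus, so no modern
# tokens can exist that the training data never contained.
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids) -> str:
    return "".join(itos[int(i)] for i in ids)

# Encode once, then split 90/10 into train and validation sets,
# stored compactly as uint16 ids (as nanoGPT's prepare scripts do).
ids = np.array(encode(corpus), dtype=np.uint16)
n = int(0.9 * len(ids))
train_ids, val_ids = ids[:n], ids[n:]

print(f"vocab size: {len(chars)}, train/val: {len(train_ids)}/{len(val_ids)}")
```

In nanoGPT these arrays would be written to `train.bin` and `val.bin` and consumed by the training loop; scaling from 50 to 500-600 books changes only the size of `corpus`, not this pipeline.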