An LLM trained only on data from certain time periods to reduce modern bias
- #AI
- #Natural Language Processing
- #Historical Simulation
- TimeCapsule LLM is an experimental project aimed at simulating the worldview and language of specific historical eras by training exclusively on texts from those periods.
- The model is trained on texts from 1800-1850 London, with plans to expand the window to 1800-1875, so it never absorbs modern concepts or biases.
- Training involves collecting and cleaning historical texts, building a custom tokenizer, and training from scratch with Andrej Karpathy's nanoGPT (see the data-preparation sketch after this list).
- Initial training used 187MB of data (50 books), producing output with period-appropriate 1800s language but limited coherence; the goal is to scale to 500-600 books for better reasoning.
- Most of the work so far has gone into curating historical data and preparing it for training; the current model has roughly 16 million parameters (see the sizing sketch after this list).
- Challenges include ensuring the source texts are unaltered by modern interpretation and dealing with OCR errors and editorial annotations (a cleaning sketch follows this list).
- Output so far reflects 1800s language and lacks modern concepts, though sentence structure and coherence should improve as the corpus grows.
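The write-up does not detail the cleaning pipeline, so here is a hedged illustration. Assuming Project Gutenberg-style sources (an assumption, not stated in the post), the modern additions to strip before training would be the transcriber's preamble and license blocks plus bracketed editor notes:

```python
# Hedged cleaning sketch: the project's actual pipeline is not described.
# Assumes Project Gutenberg-style files, where modern text appears as a
# start/end boilerplate block and as bracketed editor annotations.
import re

GUTENBERG_START = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG.*?\*\*\*", re.S
)
GUTENBERG_END = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG.*", re.S
)
# Crude: matches any bracketed span, e.g. "[Footnote 3: ...]"; it will
# also catch brackets original to the text, so matches deserve review.
ANNOTATION = re.compile(r"\[[^\]]*\]")

def clean(text: str) -> str:
    # Drop everything up to and including the Gutenberg start marker.
    m = GUTENBERG_START.search(text)
    if m:
        text = text[m.end():]
    # Drop the license block at the end of the file.
    text = GUTENBERG_END.sub("", text)
    # Remove bracketed modern annotations.
    text = ANNOTATION.sub("", text)
    return text.strip()
```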
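For the data-preparation step, a minimal sketch modeled on nanoGPT's data/shakespeare_char/prepare.py, assuming a character-level vocabulary; the project's custom tokenizer may differ, and the corpus directory name and 90/10 split are illustrative assumptions:

```python
# Character-level data prep in the style of nanoGPT's prepare scripts.
# A char vocabulary built from the corpus itself guarantees the model
# can only ever emit symbols that occur in the historical texts.
import os
import pickle

import numpy as np

# Concatenate all cleaned period texts into one string (path assumed).
corpus_dir = "corpus_1800_1850"
texts = []
for name in sorted(os.listdir(corpus_dir)):
    with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
        texts.append(f.read())
data = "\n".join(texts)

# Vocabulary is exactly the set of characters occurring in the corpus.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

# 90/10 train/validation split, dumped as uint16 binaries, the on-disk
# format nanoGPT's training loop reads back with np.memmap.
n = len(data)
train_ids = np.array(encode(data[: int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile("train.bin")
val_ids.tofile("val.bin")

# nanoGPT expects vocab metadata in meta.pkl next to the .bin files.
with open("meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": len(chars), "itos": itos, "stoi": stoi}, f)
```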
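The post quotes only the ~16M parameter total. Using the standard approximate parameter count for a decoder-only transformer, the sketch below shows one hypothetical configuration that lands near that size; every value here is a guess, not the project's actual config:

```python
# Sizing sketch: the write-up gives only the ~16M total, so this shows
# how such a total could arise. All hyperparameters are assumptions.

def approx_gpt_params(n_layer: int, n_embd: int,
                      vocab_size: int, block_size: int) -> int:
    """Approximate decoder-only transformer parameter count.

    Per block: 4*n_embd^2 for attention (QKV + output projection)
    plus 8*n_embd^2 for a 4x MLP, ignoring biases and layer norms.
    Token embeddings are assumed tied with the output head.
    """
    per_block = 12 * n_embd * n_embd
    embeddings = vocab_size * n_embd + block_size * n_embd
    return n_layer * per_block + embeddings

# With a character-level vocab (~100 symbols) the embedding table is
# tiny, so nearly all parameters sit in the transformer blocks.
print(approx_gpt_params(n_layer=5, n_embd=512, vocab_size=100, block_size=256))
# -> 15910912, i.e. roughly the ~16M quoted in the write-up
```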