An LLM trained only on data from certain time periods to reduce modern bias
- #AI
- #Natural Language Processing
- #Historical Simulation
- TimeCapsule LLM is an experimental project aimed at simulating the worldview and language of specific historical eras by training exclusively on texts from those periods.
- The model is trained on texts from 1800-1850 London, with plans to expand the window to 1800-1875, so it never absorbs modern concepts or biases.
- Training involves collecting and cleaning historical texts, building a custom tokenizer, and training from scratch with Andrej Karpathy's nanoGPT (see the data-preparation sketch after this list).
- Initial training used 187MB of data (50 books), producing output with period-appropriate 1800s language but limited coherence; the goal is to scale to 500-600 books for better reasoning.
- Most of the work so far has gone into curating historical data and preparing it for training; the current model has roughly 16 million parameters (see the sizing sketch after this list).
- Challenges include ensuring the source texts are unaltered by modern interpretation and dealing with OCR errors and editorial annotations (a cleaning sketch follows this list).
- Output so far reflects 1800s language and lacks modern concepts, though sentence structure and coherence should improve as the corpus grows.
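The write-up does not detail the cleaning pipeline, so here is a hedged illustration. Assuming Project Gutenberg-style sources (an assumption, not stated in the post), the modern additions to strip before training would be the transcriber's preamble and license blocks plus bracketed editor notes:

```python
# Hedged cleaning sketch: the project's actual pipeline is not described.
# Assumes Project Gutenberg-style files, where modern text appears as a
# start/end boilerplate block and as bracketed editor annotations.
import re

GUTENBERG_START = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG.*?\*\*\*", re.S
)
GUTENBERG_END = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG.*", re.S
)
# Crude: matches any bracketed span, e.g. "[Footnote 3: ...]"; it will
# also catch brackets original to the text, so matches deserve review.
ANNOTATION = re.compile(r"\[[^\]]*\]")

def clean(text: str) -> str:
    # Drop everything up to and including the Gutenberg start marker.
    m = GUTENBERG_START.search(text)
    if m:
        text = text[m.end():]
    # Drop the license block at the end of the file.
    text = GUTENBERG_END.sub("", text)
    # Remove bracketed modern annotations.
    text = ANNOTATION.sub("", text)
    return text.strip()
```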
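For the data-preparation step, a minimal sketch modeled on nanoGPT's data/shakespeare_char/prepare.py, assuming a character-level vocabulary; the project's custom tokenizer may differ, and the corpus directory name and 90/10 split are illustrative assumptions:

```python
# Character-level data prep in the style of nanoGPT's prepare scripts.
# A char vocabulary built from the corpus itself guarantees the model
# can only ever emit symbols that occur in the historical texts.
import os
import pickle

import numpy as np

# Concatenate all cleaned period texts into one string (path assumed).
corpus_dir = "corpus_1800_1850"
texts = []
for name in sorted(os.listdir(corpus_dir)):
    with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
        texts.append(f.read())
data = "\n".join(texts)

# Vocabulary is exactly the set of characters occurring in the corpus.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

# 90/10 train/validation split, dumped as uint16 binaries, the on-disk
# format nanoGPT's training loop reads back with np.memmap.
n = len(data)
train_ids = np.array(encode(data[: int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile("train.bin")
val_ids.tofile("val.bin")

# nanoGPT expects vocab metadata in meta.pkl next to the .bin files.
with open("meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": len(chars), "itos": itos, "stoi": stoi}, f)
```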
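The post quotes only the ~16M parameter total. Using the standard approximate parameter count for a decoder-only transformer, the sketch below shows one hypothetical configuration that lands near that size; every value here is a guess, not the project's actual config:

```python
# Sizing sketch: the write-up gives only the ~16M total, so this shows
# how such a total could arise. All hyperparameters are assumptions.

def approx_gpt_params(n_layer: int, n_embd: int,
                      vocab_size: int, block_size: int) -> int:
    """Approximate decoder-only transformer parameter count.

    Per block: 4*n_embd^2 for attention (QKV + output projection)
    plus 8*n_embd^2 for a 4x MLP, ignoring biases and layer norms.
    Token embeddings are assumed tied with the output head.
    """
    per_block = 12 * n_embd * n_embd
    embeddings = vocab_size * n_embd + block_size * n_embd
    return n_layer * per_block + embeddings

# With a character-level vocab (~100 symbols) the embedding table is
# tiny, so nearly all parameters sit in the transformer blocks.
print(approx_gpt_params(n_layer=5, n_embd=512, vocab_size=100, block_size=256))
# -> 15910912, i.e. roughly the ~16M quoted in the write-up
```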