Talkie: a 13B vintage language model from 1930
- #historical text training
- #vintage language model
- #AI generalization
- Introducing talkie-1930-13b, a 13-billion-parameter vintage language model trained exclusively on pre-1931 text to simulate historical perspectives.
- Vintage LMs enable contamination-free studies of AI generalization, prediction of events after the training cutoff, and generation of ideas absent from the training data.
- They also make it possible to study how training-data diversity shapes model behavior, by comparison with modern web-trained models.
- Talkie underperforms its modern counterpart on some benchmarks but shows promise in language understanding and numeracy.
- Challenges include ensuring no post-1930 data leakage, improving OCR quality, and creating era-appropriate post-training data.
- Future plans include scaling talkie to GPT-3 and GPT-3.5 levels and developing multilingual corpora.
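One of the challenges listed above, ensuring no post-1930 data leaks into the corpus, can be approached with a cheap first-pass heuristic: flag any document that mentions vocabulary coined after 1930. The sketch below is a hypothetical illustration, not the project's actual pipeline; the term list and function names are invented for this example, and a real filter would combine a much larger lexicon with date extraction and model-based classification.

```python
# Hypothetical first-pass leakage filter: reject documents containing
# terms that did not exist before 1931. Term list is illustrative only.
ANACHRONISTIC_TERMS = {
    "nylon", "jet engine", "transistor",
    "world war ii", "internet", "television network",
}

def mentions_post_1930_terms(text: str) -> bool:
    """Return True if the text contains any known post-1930 term."""
    lowered = text.lower()
    return any(term in lowered for term in ANACHRONISTIC_TERMS)

def filter_corpus(docs):
    """Keep only documents that pass the anachronism check."""
    return [d for d in docs if not mentions_post_1930_terms(d)]

docs = [
    "The aeroplane crossed the Atlantic in 1927 to great acclaim.",
    "Engineers debated the transistor's role in radio receivers.",
]
print(filter_corpus(docs))  # keeps only the 1927 aeroplane document
```

A substring check like this is intentionally aggressive: false positives (discarding clean pre-1931 text) are cheaper here than false negatives, since a single leaked modern document undermines the contamination-free evaluation the post describes.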