Hasty Briefs (beta)

Trained LLMs exclusively on pre-1913 texts

a day ago
  • #time-locked-models
  • #historical-research
  • #large-language-models
  • A family of 4-billion-parameter large language models (LLMs) based on the Qwen3 architecture, trained from scratch on 80B tokens of historical data up to each model's knowledge cutoff.
  • Models are time-locked, meaning they have no access to information published after their knowledge-cutoff date.
  • The project aims to create windows into the past for research in humanities, social sciences, and computer science.
  • Models will reproduce the historical biases and views present in their training data; the project treats this as a feature for studying historical discourse.
  • A responsible access framework is being developed to make models available to researchers while preventing misuse.
  • The project invites comments and suggestions on periods, regions, questions, validation methods, and access frameworks.
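The core idea behind a time-locked model is simple: every training document must predate the cutoff. A minimal sketch of that filtering step, assuming documents carry a publication date (the field names and example records here are illustrative, not the project's actual data format):

```python
from datetime import date

# Hypothetical knowledge cutoff for a pre-1913 model.
CUTOFF = date(1913, 1, 1)

# Illustrative corpus records; real training data would be far larger.
documents = [
    {"text": "On the Origin of Species ...", "published": date(1859, 11, 24)},
    {"text": "General relativity paper ...", "published": date(1915, 11, 25)},
]

def time_locked(docs, cutoff):
    """Keep only documents published strictly before the cutoff date."""
    return [d for d in docs if d["published"] < cutoff]

corpus = time_locked(documents, CUTOFF)
print(len(corpus))  # only the pre-1913 document survives
```

In practice the hard part is not the filter itself but dating sources reliably and excluding later editions, reprints, and annotations that would leak post-cutoff knowledge into the corpus.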