Making a vintage LLM from scratch

a day ago

The author built a vintage LLM from scratch, trained only on pre-1900 texts, and recounts the adventures along the way while open-sourcing the code on GitHub.
The model uses a Llama-based architecture with 340M parameters. The author prepared a custom dataset, tokenizer, and training scripts, and spent about $80 in GPU costs.
Data processing involved de-duplication and filtering on metrics such as compression ratio, entropy, and a custom quality score to ensure text quality.
Training consisted of two base-training stages totaling ~9B tokens, followed by fine-tuning attempts for dialogue; math tests showed limited arithmetic capability.
The project is a hobbyist effort that aims for historical accuracy and avoids alignment in order to preserve the vintage content; the author plans further fine-tuning and dataset expansion.
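
The de-duplication and metric-based filtering mentioned above can be illustrated with a minimal sketch. This is not the author's code: the function names, the thresholds, and the choice of zlib and exact-match hashing are all assumptions made for illustration, and the custom quality score is not reproduced because the summary does not describe it.

```python
import hashlib
import math
import zlib
from collections import Counter


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length.

    Highly repetitive text compresses well and yields a large ratio.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def keep_document(text: str,
                  max_ratio: float = 4.0,
                  min_entropy: float = 3.0,
                  max_entropy: float = 5.0) -> bool:
    """Keep text that is neither too repetitive nor too noisy.

    All three thresholds are hypothetical and would need tuning on real data.
    """
    if compression_ratio(text) > max_ratio:
        return False
    return min_entropy <= char_entropy(text) <= max_entropy


def dedupe(docs: list[str]) -> list[str]:
    """Exact de-duplication after lowercasing and collapsing whitespace.

    Keeps the first occurrence of each document, in original order.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

A pipeline built this way would typically run `dedupe` first and then apply `keep_document` to each surviving text. Near-duplicate detection (for example, different scans of the same book) would need a fuzzier method than the exact hashing shown here.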
