Making a vintage LLM from scratch
a day ago
- #LLM Training
- #Vintage Dataset
- #Open Source AI
- The author built a vintage LLM from scratch, trained only on pre-1900 texts, and shares the experience along with open-source code on GitHub.
- The model uses a Llama-based architecture with 340M parameters; the author built a custom dataset, tokenizer, and training scripts, and GPU time cost about $80.
- Data processing involved de-duplication and filtering on metrics such as compression ratio, entropy, and a custom quality score to ensure text quality.
- Training comprised two base-training stages totaling ~9B tokens, plus attempts at fine-tuning for dialogue; math tests showed limited arithmetic capability.
- The project is a hobbyist effort that aims for historical accuracy: the model is left unaligned to preserve the vintage content, and further fine-tuning and dataset expansion are planned.
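
The data-processing step summarized above (de-duplication, then filtering on compression ratio and entropy) can be illustrated with a minimal Python sketch. This is not the author's pipeline: the function names, the use of `zlib` and SHA-256, exact-match de-duplication, and both thresholds are assumptions chosen for illustration, and the custom quality score is not reproduced because the summary does not describe it.

```python
import hashlib
import math
import zlib
from collections import Counter


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size over zlib-compressed size; high values suggest repetition."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def dedupe_and_filter(docs, max_ratio=4.0, min_entropy=3.0):
    """Drop exact duplicates, then drop repetitive or degenerate documents.

    Duplicates are detected after lowercasing and collapsing whitespace.
    Both thresholds are illustrative assumptions, not values from the project.
    """
    seen = set()
    kept = []
    for text in docs:
        # Hash a normalized copy so trivial whitespace/case variants collide.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Highly compressible text is usually boilerplate or repeated lines.
        if compression_ratio(text) > max_ratio:
            continue
        # Very low character entropy indicates a tiny, repetitive alphabet.
        if char_entropy(text) < min_entropy:
            continue
        kept.append(text)
    return kept
```

Calling `dedupe_and_filter(list_of_page_texts)` returns the surviving documents in their original order. A real corpus of scanned old texts would likely also need near-duplicate detection and thresholds tuned on samples, which this sketch does not attempt.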