Making a vintage LLM from scratch

20 hours ago
  • #LLM Training
  • #Vintage Dataset
  • #Open Source AI
  • The author trained a vintage LLM from scratch using only pre-1900 texts, wrote up the experience, and open-sourced the code on GitHub.
  • The model uses a Llama-based architecture with 340M parameters; the dataset processing, tokenizer, and training scripts are all custom, and GPU costs came to about $80.
  • Data processing involved de-duplication and filtering on metrics such as compression ratio, entropy, and a custom quality score to keep low-quality text out of the corpus.
  • Training consisted of two base-training stages totaling ~9B tokens, plus attempts to fine-tune for dialogue; math tests showed limited arithmetic capability.
  • The project is a hobbyist effort that aims for historical accuracy: the author avoids alignment to preserve the vintage content, and plans further fine-tuning and dataset expansion.
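
The de-duplication and filtering step mentioned above can be sketched roughly as follows. This is a minimal illustration and not the author's code: the function names, the exact-match de-duplication strategy, and all thresholds are assumptions made for the example, and the custom quality score is left out because the summary does not describe it.

```python
import hashlib
import math
import zlib
from collections import Counter


def compression_ratio(text: str) -> float:
    """Raw byte size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def dedupe(docs):
    """Drop exact duplicates after lowercasing and collapsing whitespace."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical thresholds, chosen only for illustration.
MAX_COMPRESSION_RATIO = 4.0  # above this, the text is likely boilerplate or repeated runs
MIN_ENTROPY = 3.0            # below this, the text uses too few distinct characters
MAX_ENTROPY = 5.0            # above this, the text is likely OCR noise or binary junk


def keep(text: str) -> bool:
    """Return True if the text passes both illustrative filters."""
    return (
        compression_ratio(text) <= MAX_COMPRESSION_RATIO
        and MIN_ENTROPY <= char_entropy(text) <= MAX_ENTROPY
    )


def clean_corpus(docs):
    """De-duplicate first, then filter on the two metrics."""
    return [doc for doc in dedupe(docs) if keep(doc)]
```

In practice the thresholds would need tuning against samples of the actual corpus, since scanned pre-1900 texts can have unusual character distributions.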