Hasty Briefs (beta)


MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU

5 hours ago
  • #Large Language Models
  • #GPU Training
  • #Memory Optimization
  • MegaTrain is a memory-centric system for training 100B+ parameter large language models at full precision on a single GPU.
  • It stores parameters and optimizer states in CPU memory and uses the GPU as a transient compute engine, streaming parameters to the device one layer at a time.
  • Two key optimizations overcome CPU-GPU bandwidth bottlenecks: a pipelined double-buffered execution engine and stateless layer templates.
  • On a single H200 GPU with 1.5TB host memory, it trains models up to 120B parameters and achieves 1.84× the throughput of DeepSpeed ZeRO-3.
  • MegaTrain also enables training a 7B model with a 512k token context length on a single GH200.
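The pipelined double-buffered execution described above can be sketched in a few lines: while the GPU computes layer i, a background thread prefetches layer i+1's parameters from host memory into a second buffer, hiding the CPU-GPU transfer behind compute. This is a minimal illustrative sketch, not MegaTrain's implementation; `load_layer` and `compute_layer` are hypothetical stand-ins for the real host-to-device copy and forward/backward kernels.

```python
import threading
import queue

def streamed_forward(layers, activations, load_layer, compute_layer):
    """Double-buffered layer streaming (illustrative sketch).

    `load_layer` simulates the CPU->GPU parameter copy and
    `compute_layer` simulates the per-layer kernel; both are
    hypothetical placeholders, not MegaTrain APIs.
    """
    buffers = [None, None]            # two on-device parameter buffers
    prefetched = queue.Queue(maxsize=1)

    def prefetch(i):
        prefetched.put(load_layer(layers[i]))   # stand-in for H2D copy

    # Warm up: load the first layer synchronously.
    buffers[0] = load_layer(layers[0])
    out = activations
    for i in range(len(layers)):
        slot = i % 2
        t = None
        if i + 1 < len(layers):
            # Overlap the next layer's transfer with this layer's compute.
            t = threading.Thread(target=prefetch, args=(i + 1,))
            t.start()
        out = compute_layer(buffers[slot], out)
        if t is not None:
            t.join()
            buffers[(i + 1) % 2] = prefetched.get()
    return out

# Toy usage: "layers" are scalars, "compute" is addition.
result = streamed_forward(
    layers=[1, 2, 3],
    activations=0,
    load_layer=lambda w: w,
    compute_layer=lambda w, x: x + w,
)
# result == 6
```

With only two buffers resident at once, device memory holds a constant number of layers regardless of model depth, which is what lets parameter storage scale with host memory instead of GPU memory.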