MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
- #Large Language Models
- #GPU Training
- #Memory Optimization
- MegaTrain is a memory-centric system for training 100B+ parameter large language models at full precision on a single GPU.
- It stores parameters and optimizer states in CPU memory and uses GPUs as transient compute engines, streaming parameters per layer.
- Two key optimizations overcome CPU-GPU bandwidth bottlenecks: a pipelined double-buffered execution engine and stateless layer templates.
- On a single H200 GPU with 1.5TB host memory, it trains models up to 120B parameters and achieves 1.84× the throughput of DeepSpeed ZeRO-3.
- MegaTrain also enables training a 7B model with a 512k token context length on a single GH200.
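The pipelined double-buffered execution idea can be illustrated with a minimal sketch: while the compute engine works on layer i, a background worker prefetches layer i+1 from host memory, hiding transfer latency behind compute. This is an illustrative analogy, not MegaTrain's actual implementation; the `fetch` and `compute` callables here are hypothetical stand-ins for host-to-GPU parameter transfer and the layer's forward/backward work.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_forward(layers, fetch, compute):
    """Double-buffered layer streaming (hypothetical sketch).

    While `compute` runs on layer i's weights, a background worker
    prefetches layer i+1 via `fetch`, so transfer and compute overlap.
    """
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        next_buf = prefetcher.submit(fetch, layers[0])  # fill first buffer
        outputs = []
        for i in range(len(layers)):
            weights = next_buf.result()  # wait for layer i's weights
            if i + 1 < len(layers):
                # start fetching layer i+1 while we compute layer i
                next_buf = prefetcher.submit(fetch, layers[i + 1])
            outputs.append(compute(weights))
        return outputs
```

In a real system the two buffers would be preallocated GPU tensors reused across layers (the "stateless layer template" idea), so no per-layer allocation happens on the device.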