MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
- #Large Language Models
- #GPU Training
- #Memory Optimization
- MegaTrain is a memory-centric system for training 100B+ parameter large language models at full precision on a single GPU.
- It stores parameters and optimizer states in CPU memory and uses GPUs as transient compute engines, streaming parameters per layer.
- Two key optimizations overcome CPU-GPU bandwidth bottlenecks: a pipelined double-buffered execution engine and stateless layer templates.
- On a single H200 GPU with 1.5TB host memory, it trains models up to 120B parameters and achieves 1.84× the throughput of DeepSpeed ZeRO-3.
- MegaTrain also enables training a 7B model with a 512k token context length on a single GH200.
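The pipelined double-buffered execution idea can be illustrated with a minimal sketch: while the compute engine works on layer i, a background worker prefetches layer i+1 from host memory, hiding transfer latency behind compute. This is an illustrative analogy, not MegaTrain's actual implementation; the `fetch` and `compute` callables here are hypothetical stand-ins for host-to-GPU parameter transfer and the layer's forward/backward work.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_forward(layers, fetch, compute):
    """Double-buffered layer streaming (hypothetical sketch).

    While `compute` runs on layer i's weights, a background worker
    prefetches layer i+1 via `fetch`, so transfer and compute overlap.
    """
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        next_buf = prefetcher.submit(fetch, layers[0])  # fill first buffer
        outputs = []
        for i in range(len(layers)):
            weights = next_buf.result()  # wait for layer i's weights
            if i + 1 < len(layers):
                # start fetching layer i+1 while we compute layer i
                next_buf = prefetcher.submit(fetch, layers[i + 1])
            outputs.append(compute(weights))
        return outputs
```

In a real system the two buffers would be preallocated GPU tensors reused across layers (the "stateless layer template" idea), so no per-layer allocation happens on the device.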