How Unsloth and Nvidia made LLM training 25% faster on consumer GPUs
- #NVIDIA collaboration
- #fine-tuning
- #GPU optimization
- Unsloth and NVIDIA collaborated to improve GPU training speeds by about 25% through three optimizations.
- Packed-sequence metadata caching builds the packed-sequence metadata once per batch and reuses it across all transformer layers, instead of reconstructing it for every layer (first sketch below).
- Double-buffered checkpoint reload overlaps the copy of the next offloaded checkpoint back to the GPU with the current layer's backward compute, hiding the reload latency (second sketch below).
- Optimized MoE routing replaces per-expert token-count queries with a single bincount, minimizing CPU-GPU synchronization stalls (third sketch below).
- These improvements target bottlenecks in metadata handling, data movement, and routing overhead.
- Performance gains are validated on models such as Qwen3-14B and larger dense models, with minimal additional memory overhead.
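
To make the first bullet concrete, here is a minimal sketch of packed-sequence metadata caching in PyTorch. The `build_packed_metadata` helper, the cache keyed on the batch's `position_ids`, and the assumption that each packed sequence restarts its positions at 0 are all illustrative; this is not Unsloth's actual implementation or API.

```python
import torch

def build_packed_metadata(position_ids: torch.Tensor) -> dict:
    """Derive packed-sequence metadata (cumulative sequence lengths and the
    longest sequence) from position_ids, assuming each packed sequence
    restarts its positions at 0."""
    flat = position_ids.flatten()
    seq_starts = torch.nonzero(flat == 0).flatten()
    seq_lens = torch.diff(seq_starts, append=seq_starts.new_tensor([flat.numel()]))
    cu_seqlens = torch.nn.functional.pad(seq_lens.cumsum(0), (1, 0)).to(torch.int32)
    return {"cu_seqlens": cu_seqlens, "max_seqlen": int(seq_lens.max())}

class PackedMetadataCache:
    """Compute the metadata once per batch and hand the same result to every
    layer, instead of rebuilding it layer by layer."""

    def __init__(self):
        self._key = None
        self._metadata = None

    def get(self, position_ids: torch.Tensor) -> dict:
        key = position_ids.data_ptr()  # same batch tensor -> same cached metadata
        if key != self._key:
            self._key = key
            self._metadata = build_packed_metadata(position_ids)
        return self._metadata
```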
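
The double-buffering idea can be sketched as a backward loop that prefetches the previous layer's offloaded checkpoint on a separate CUDA stream while the current layer's backward pass runs. The function names, the pinned-CPU-memory assumption, and the simplified single-tensor checkpoints are illustrative, not the library's real code.

```python
import torch

def backward_with_double_buffering(layers, cpu_checkpoints, grad_output):
    """Walk layers in reverse, recomputing and backpropagating through layer i
    while layer i-1's offloaded activation is already being copied back to the
    GPU, so the copy latency hides behind backward compute. Assumes the
    checkpoints were saved to pinned CPU memory during the forward pass."""
    copy_stream = torch.cuda.Stream()

    def prefetch(cpu_activation):
        # Asynchronous host-to-device copy on a dedicated stream; it overlaps
        # with compute because the source tensor lives in pinned memory.
        with torch.cuda.stream(copy_stream):
            return cpu_activation.to("cuda", non_blocking=True)

    next_ckpt = prefetch(cpu_checkpoints[-1])  # prime the pipeline
    for i in reversed(range(len(layers))):
        torch.cuda.current_stream().wait_stream(copy_stream)  # copy must finish
        ckpt = next_ckpt
        if i > 0:
            next_ckpt = prefetch(cpu_checkpoints[i - 1])  # start the next copy
        ckpt.requires_grad_(True)
        output = layers[i](ckpt)         # recompute this layer's forward
        output.backward(grad_output)     # backward runs while the copy is in flight
        grad_output = ckpt.grad
    return grad_output
```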
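
For the MoE routing change, the contrast is between querying each expert's token count one at a time, where every `.item()` forces the CPU to wait on the GPU, and computing all counts in a single `torch.bincount` call that stays on the device. The function names below are placeholders.

```python
import torch

def tokens_per_expert_with_syncs(expert_ids: torch.Tensor, num_experts: int):
    """Slow pattern: one comparison plus one .item() per expert, each of which
    forces a CPU-GPU synchronization while the GPU sits idle."""
    return [(expert_ids == e).sum().item() for e in range(num_experts)]

def tokens_per_expert_bincount(expert_ids: torch.Tensor, num_experts: int):
    """Fast pattern: a single bincount kernel produces all per-expert counts
    on the GPU, so the host never has to block on the device."""
    return torch.bincount(expert_ids.flatten(), minlength=num_experts)
```

In the sync-free version, one kernel launch replaces `num_experts` host-device round trips per routing step.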