GitHub - unslothai/unsloth: Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.
- #fine-tuning
- #machine-learning
- #performance-optimization
- Notebooks are beginner-friendly, allowing users to add datasets, run, and deploy trained models.
- Performance comparison of various models (e.g., gpt-oss, Qwen3, Gemma 3) showing speed and memory improvements.
- Unsloth supports faster embedding fine-tuning (~1.8-3.3x) and new batching algorithms for longer context RL.
- New RoPE and MLP Triton kernels, plus padding-free training and sequence packing, deliver 3x faster training with 30% less VRAM.
- Training a 20B model with >500K context is now possible on an 80GB GPU.
- FP8 Reinforcement Learning is now supported on consumer GPUs.
- Fine-tuning DeepSeek-OCR with Unsloth improved its language understanding by 89%.
- The Unsloth Docker image simplifies setup and avoids environment/dependency issues.
- Vision RL now supports training VLMs with GRPO or GSPO.
- Quantization-Aware Training recovers ~70% of the accuracy lost to quantization.
- Memory-efficient RL delivers faster training with 50% less VRAM and 10× longer context.
- Support for Mistral 3, Gemma 3n, Qwen3, and other models.
- Dynamic 2.0 quants set new benchmarks on 5-shot MMLU & Aider Polyglot.
- Unsloth supports all model types (TTS, BERT, Mamba), full fine-tuning (FFT), and multi-GPU training.
- Long-context Reasoning (GRPO) allows training reasoning models with just 5GB VRAM.
- Unsloth Dynamic 4-bit Quantization increases accuracy with <10% more VRAM than BnB 4-bit.
- Support for Llama 4, Phi-4, Vision models, and Llama 3.3 (70B).
- Cut Cross Entropy supports 89K context for Llama 3.3 (70B) on an 80GB GPU.
- Memory usage cut by 30%, supporting 4x longer context windows.
- Installation guides for pip, Conda, and Docker.
- Example code for fine-tuning gpt-oss-20b provided.
- RL support includes GRPO, GSPO, FP8 training, DrGRPO, DAPO, PPO, and more.
- Benchmarks show Unsloth's speed, VRAM reduction, and longer context capabilities.
- Citations and acknowledgments for contributors and libraries used.
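The fine-tuning example mentioned in the list typically follows Unsloth's standard pattern: load a 4-bit base model, attach LoRA adapters, and hand the result to a TRL trainer. The sketch below is a hedged illustration of that pattern, not the repo's exact example: it assumes the `FastLanguageModel` API, a CUDA GPU, `pip install unsloth`, and illustrative choices for model name, dataset, and LoRA hyperparameters. Imports are deferred inside the function so the file can be inspected without unsloth installed.

```python
# Hedged sketch of an Unsloth QLoRA fine-tuning setup. Running it requires a
# CUDA GPU and `pip install unsloth`; all names and numbers are illustrative.

# Illustrative LoRA hyperparameters, kept at module level so they can be
# inspected without loading any model.
LORA_CONFIG = {
    "r": 16,              # LoRA rank
    "lora_alpha": 16,     # scaling factor
    "target_modules": [   # attention + MLP projections, the usual targets
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

def build_trainer(model_name="unsloth/gpt-oss-20b", max_seq_length=2048):
    """Load a 4-bit base model, attach LoRA adapters, return an SFT trainer.

    Imports are deferred so this module imports cleanly on machines
    without unsloth/trl installed.
    """
    from unsloth import FastLanguageModel            # requires CUDA
    from trl import SFTTrainer, SFTConfig
    from datasets import load_dataset

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=True,        # 4-bit quantization to cut VRAM
    )
    model = FastLanguageModel.get_peft_model(model, **LORA_CONFIG)

    dataset = load_dataset("yahma/alpaca-cleaned", split="train")  # example
    return SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(max_seq_length=max_seq_length, max_steps=60),
    )
```

Calling `build_trainer().train()` on a GPU machine would run the short fine-tune; the 60-step cap mirrors the quick-start style of the beginner notebooks.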
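The RL methods listed (GRPO, GSPO, DrGRPO, DAPO) all score sampled completions with user-supplied reward functions. As a concrete anchor for what "GRPO" means, here is a minimal self-contained sketch of its group-relative advantage: rewards are z-scored within a group of completions, so no value network is needed. The reward function is a toy stand-in, not Unsloth's.

```python
# Minimal sketch of GRPO's group-relative advantage: each sampled completion
# in a group is scored by a reward function, and its advantage is the
# z-scored reward within the group. The reward function is an illustrative toy.

from statistics import mean, pstdev

def toy_reward(completion: str) -> float:
    """Toy reward: +1 if the answer is wrapped in the expected tags."""
    return 1.0 if "<answer>" in completion and "</answer>" in completion else 0.0

def group_advantages(completions, reward_fn=toy_reward, eps=1e-6):
    """Group-relative advantages: (r_i - mean(r)) / (std(r) + eps)."""
    rewards = [reward_fn(c) for c in completions]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

group = [
    "<answer>42</answer>",   # formatted -> reward 1
    "42",                    # unformatted -> reward 0
    "<answer>41</answer>",   # formatted -> reward 1
    "no idea",               # unformatted -> reward 0
]
print(group_advantages(group))
```

In practice such reward functions are passed to a trainer (e.g. TRL's `GRPOTrainer`, which Unsloth patches); well-formatted completions get positive advantage and are reinforced, the rest are pushed down.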
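Several bullets above tie VRAM savings to longer context (10× longer context, >500K context on 80GB). A back-of-the-envelope KV-cache estimate shows why: cache size grows linearly with context length, so halving per-token memory buys proportionally longer context. All model dimensions below are illustrative assumptions, not Unsloth's measured numbers.

```python
# Back-of-the-envelope KV-cache estimate, illustrating why context length
# dominates VRAM in long-context training/RL. Model dimensions are
# illustrative assumptions, not any specific model's.

def kv_cache_gb(context_len, n_layers=32, n_kv_heads=8, head_dim=128,
                bytes_per_value=2):
    """Key+value cache size for one sequence, in GB.

    2 tensors (K and V) * layers * kv_heads * head_dim * context tokens,
    at `bytes_per_value` bytes each (2 for fp16/bf16).
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bytes_per_value / 1024**3

# The cache scales linearly with context: 10x the tokens, 10x the memory,
# so halving per-token cost doubles the context that fits in fixed VRAM.
short = kv_cache_gb(8_192)
long = kv_cache_gb(81_920)   # 10x the context
print(f"{short:.2f} GB vs {long:.2f} GB")   # -> 1.00 GB vs 10.00 GB
```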