GPT-OSS Reinforcement Learning
- #Unsloth
- #Reinforcement Learning
- #gpt-oss
- Unsloth enables training OpenAI gpt-oss with RL and GRPO, offering 3x faster inference, 50% less VRAM usage, and 8x longer context without accuracy loss.
- Unsloth's custom Transformers inference code reaches ~21 tokens/s for gpt-oss, while BF16 reaches ~30 tokens/s and uses 50% less VRAM.
- A free Colab notebook is available for gpt-oss-20b GRPO training, featuring faster matrix multiplication kernels and a new Unsloth reward function to counteract reward hacking.
- Unsloth uniquely supports 4-bit RL for gpt-oss thanks to weight sharing, Flex Attention, Standby, and custom kernels, enabling training on 15GB of VRAM and the free Colab tier (see the loading sketch after this list).
- Flash Attention 3 (FA3) is unsuitable for gpt-oss training because it lacks backward-pass support for attention sinks, producing incorrect training loss (see the sink-attention sketch after this list).
- Unsloth's rewritten Transformers inference integrates innovations like Flex Attention and torch.compile, achieving 3x faster inference without relying on vLLM.
- vLLM lacks BF16 training and LoRA support for gpt-oss, making Unsloth essential for memory-efficient, long-context training via Flex Attention.
- Unsloth's 4-bit inference is ~4x faster than alternatives, its BF16 path is also more efficient (especially in VRAM usage), and both run on any GPU.
- Flex Attention addresses masking challenges in batch generation, dynamically and efficiently handling prefill, decode, padding, and sliding windows (see the mask sketch after this list).
- FlashAttention integration issues were also found: later layers diverge significantly from expected outputs, and the cause requires further investigation.
- Reward hacking, where models exploit shortcuts to maximize reward instead of solving the task, is countered in Unsloth's notebook with concrete mitigations for code generation (a reward-function sketch follows this list).
- gpt-oss, a frontier-class architecture from OpenAI, can now be trained with RL on the free Colab tier, democratizing access to advanced AI training.
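
For orientation, here is a minimal sketch of what loading gpt-oss-20b for 4-bit GRPO training could look like, assuming Unsloth's usual `FastLanguageModel` API; the exact model id, sequence length, and LoRA hyperparameters are illustrative, not copied from the notebook.

```python
# Hypothetical sketch: load gpt-oss-20b in 4-bit with Unsloth and attach a LoRA
# adapter for GRPO. Argument values are assumptions, not the notebook's settings.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",  # assumed model id
    max_seq_length=4096,
    load_in_4bit=True,                 # 4-bit weights -> fits in ~15GB VRAM
)

# LoRA keeps the trainable parameter count small enough for RL on one GPU;
# the resulting model can then be passed to an RL trainer such as TRL's GRPOTrainer.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```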
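On the FA3 point: gpt-oss adds a learnable per-head "sink" logit to the softmax, so gradients must flow through that extra term during training. A minimal eager-mode sketch of sink attention (causal masking omitted for brevity; this is not FA3's kernel):

```python
# Minimal sketch of attention with a per-head "sink" logit, as used by gpt-oss.
# The sink adds an extra column to the softmax, so the backward pass must
# differentiate through it -- the support FA3 currently lacks.
import torch

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: (batch, heads, seq, dim); sink_logit: (heads,), learnable
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (b, h, s, s)
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # The sink column absorbs probability mass but attends to no value vector.
    return probs[..., :-1] @ v
```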
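To illustrate the masking point, PyTorch's Flex Attention lets one boolean `mask_mod` express the combined constraints batch generation needs. The window size and sequence lengths below are illustrative assumptions:

```python
# Sketch: a single Flex Attention mask combining causal ordering, a sliding
# window, and per-sequence padding for batched generation.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

device = "cuda" if torch.cuda.is_available() else "cpu"
WINDOW = 128                                      # assumed sliding-window size
seq_lens = torch.tensor([37, 80], device=device)  # real lengths per batch row

def mask_mod(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx               # tokens never attend to the future
    windowed = q_idx - kv_idx < WINDOW     # sliding-window attention layers
    not_pad = kv_idx < seq_lens[b]         # skip right-padding in the batch
    return causal & windowed & not_pad

block_mask = create_block_mask(mask_mod, B=2, H=None, Q_LEN=128, KV_LEN=128,
                               device=device)
q = k = v = torch.randn(2, 4, 128, 64, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)
```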
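Finally, a hedged sketch of the anti-reward-hacking idea for code generation: rather than rewarding any output that merely looks correct, the reward executes the candidate and checks real behavior, so shortcuts like hardcoding answers or escaping the sandbox score zero. Function and entry-point names here are illustrative, not Unsloth's actual notebook API.

```python
# Hypothetical reward function resistant to common code-generation exploits.
def code_reward(completion: str, test_cases: list[tuple[tuple, object]]) -> float:
    # Naive illustration: reject obvious escape hatches before executing.
    banned = ("import os", "import sys", "exec(", "eval(")
    if any(tok in completion for tok in banned):
        return 0.0
    namespace: dict = {}
    try:
        exec(completion, namespace)     # run the generated code in isolation
        fn = namespace.get("solution")  # assumed required entry-point name
        if fn is None:
            return 0.0
        # Fraction of held-out test cases the candidate actually passes.
        passed = sum(fn(*args) == expected for args, expected in test_cases)
        return passed / len(test_cases)
    except Exception:
        return 0.0                      # crashes and timeouts earn nothing
```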