GPT-OSS Reinforcement Learning
- #Unsloth
- #Reinforcement Learning
- #gpt-oss
- Unsloth enables training OpenAI gpt-oss with RL and GRPO, offering 3x faster inference, 50% less VRAM usage, and 8x longer context without accuracy loss.
- Unsloth's custom Transformers inference code reaches ~21 tokens/s for gpt-oss, while BF16 reaches ~30 tokens/s and uses 50% less VRAM.
- A free Colab notebook is available for gpt-oss-20b GRPO training, featuring faster matrix multiplication kernels and a new Unsloth reward function to counteract reward hacking.
- Unsloth uniquely supports 4-bit RL for gpt-oss thanks to weight sharing, Flex Attention, Standby, and custom kernels, enabling training on 15GB of VRAM and the free Colab tier (see the loading sketch after this list).
- Flash Attention 3 (FA3) is unsuitable for gpt-oss training because it lacks backward-pass support for attention sinks, producing incorrect training loss (see the sink-attention sketch after this list).
- Unsloth's rewritten Transformers inference integrates innovations like Flex Attention and torch.compile, achieving 3x faster inference without relying on vLLM.
- vLLM lacks BF16 training and LoRA support for gpt-oss, making Unsloth essential for memory-efficient, long-context training via Flex Attention.
- Unsloth's 4-bit inference is ~4x faster than alternatives, its BF16 path is also more efficient (especially in VRAM usage), and both run on any GPU.
- Flex Attention addresses masking challenges in batch generation, dynamically and efficiently handling prefill, decode, padding, and sliding windows (see the mask sketch after this list).
- FlashAttention integration issues were also found: later layers diverge significantly from expected outputs, and the cause requires further investigation.
- Reward hacking, where models exploit shortcuts to maximize reward instead of solving the task, is countered in Unsloth's notebook with concrete mitigations for code generation (a reward-function sketch follows this list).
- gpt-oss, a frontier-class architecture from OpenAI, can now be trained with RL on the free Colab tier, democratizing access to advanced AI training.
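
For orientation, here is a minimal sketch of what loading gpt-oss-20b for 4-bit GRPO training could look like, assuming Unsloth's usual `FastLanguageModel` API; the exact model id, sequence length, and LoRA hyperparameters are illustrative, not copied from the notebook.

```python
# Hypothetical sketch: load gpt-oss-20b in 4-bit with Unsloth and attach a LoRA
# adapter for GRPO. Argument values are assumptions, not the notebook's settings.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",  # assumed model id
    max_seq_length=4096,
    load_in_4bit=True,                 # 4-bit weights -> fits in ~15GB VRAM
)

# LoRA keeps the trainable parameter count small enough for RL on one GPU;
# the resulting model can then be passed to an RL trainer such as TRL's GRPOTrainer.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```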
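On the FA3 point: gpt-oss adds a learnable per-head "sink" logit to the softmax, so gradients must flow through that extra term during training. A minimal eager-mode sketch of sink attention (causal masking omitted for brevity; this is not FA3's kernel):

```python
# Minimal sketch of attention with a per-head "sink" logit, as used by gpt-oss.
# The sink adds an extra column to the softmax, so the backward pass must
# differentiate through it -- the support FA3 currently lacks.
import torch

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: (batch, heads, seq, dim); sink_logit: (heads,), learnable
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (b, h, s, s)
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # The sink column absorbs probability mass but attends to no value vector.
    return probs[..., :-1] @ v
```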
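To illustrate the masking point, PyTorch's Flex Attention lets one boolean `mask_mod` express the combined constraints batch generation needs. The window size and sequence lengths below are illustrative assumptions:

```python
# Sketch: a single Flex Attention mask combining causal ordering, a sliding
# window, and per-sequence padding for batched generation.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

device = "cuda" if torch.cuda.is_available() else "cpu"
WINDOW = 128                                      # assumed sliding-window size
seq_lens = torch.tensor([37, 80], device=device)  # real lengths per batch row

def mask_mod(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx               # tokens never attend to the future
    windowed = q_idx - kv_idx < WINDOW     # sliding-window attention layers
    not_pad = kv_idx < seq_lens[b]         # skip right-padding in the batch
    return causal & windowed & not_pad

block_mask = create_block_mask(mask_mod, B=2, H=None, Q_LEN=128, KV_LEN=128,
                               device=device)
q = k = v = torch.randn(2, 4, 128, 64, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)
```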
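Finally, a hedged sketch of the anti-reward-hacking idea for code generation: rather than rewarding any output that merely looks correct, the reward executes the candidate and checks real behavior, so shortcuts like hardcoding answers or escaping the sandbox score zero. Function and entry-point names here are illustrative, not Unsloth's actual notebook API.

```python
# Hypothetical reward function resistant to common code-generation exploits.
def code_reward(completion: str, test_cases: list[tuple[tuple, object]]) -> float:
    # Naive illustration: reject obvious escape hatches before executing.
    banned = ("import os", "import sys", "exec(", "eval(")
    if any(tok in completion for tok in banned):
        return 0.0
    namespace: dict = {}
    try:
        exec(completion, namespace)     # run the generated code in isolation
        fn = namespace.get("solution")  # assumed required entry-point name
        if fn is None:
            return 0.0
        # Fraction of held-out test cases the candidate actually passes.
        passed = sum(fn(*args) == expected for args, expected in test_cases)
        return passed / len(test_cases)
    except Exception:
        return 0.0                      # crashes and timeouts earn nothing
```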