Hasty Briefs (beta)

GPT-OSS Reinforcement Learning

11 hours ago
  • #Unsloth
  • #Reinforcement Learning
  • #gpt-oss
  • Unsloth enables training OpenAI gpt-oss with RL and GRPO, offering 3x faster inference, 50% less VRAM usage, and 8x longer context without accuracy loss.
  • Unsloth's custom Transformers inference code achieves ~21 tokens/s for gpt-oss, with BF16 reaching ~30 tokens/s and using 50% less VRAM.
  • A free Colab notebook is available for gpt-oss-20b GRPO training, featuring faster matrix multiplication kernels and a new Unsloth reward function to counteract reward-hacking.
  • Unsloth supports 4-bit RL for gpt-oss, a capability it attributes to weight sharing, Flex Attention, Standby, and custom kernels, enabling training on 15GB of VRAM and on free Colab.
  • Flash Attention 3 (FA3) is unsuitable for gpt-oss training as it lacks backward pass support for attention sinks, leading to incorrect training loss.
  • Unsloth's rewritten Transformers inference integrates innovations like Flex Attention and torch.compile, achieving 3x faster speeds without vLLM.
  • vLLM does not support bf16 training or LoRA for gpt-oss, so Unsloth's own stack is needed for efficient memory use and long-context training via Flex Attention.
  • Unsloth's 4-bit inference is ~4x faster than alternatives; its BF16 path is also more efficient, particularly in VRAM usage, and works on any GPU.
  • Flex Attention addresses masking challenges in batch generation, dynamically handling prefill, decode, padding, and sliding windows efficiently.
  • Integrating FlashAttention surfaced correctness issues: later layers diverged significantly from expected outputs, and the cause requires further investigation.
  • Reward hacking in RL, where models exploit shortcuts to maximize rewards, is countered in Unsloth's notebook with tangible solutions for code generation.
  • gpt-oss, a frontier-class architecture from OpenAI, can now be trained with RL on free Colab tiers, democratizing access to advanced AI training.
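The Flex Attention bullet above describes expressing prefill, decode, padding, and sliding windows as one masking rule rather than materialized mask tensors. A minimal pure-Python sketch of such a predicate, in the spirit of FlexAttention's `mask_mod` callbacks (the function name and signature here are illustrative, not the actual PyTorch API):

```python
def mask_mod(q_idx, kv_idx, seq_len, window=None):
    """Flex-Attention-style mask predicate: True means q_idx may attend to kv_idx.
    Combines padding, causal masking, and an optional sliding window in one rule,
    so prefill and decode can share the same logic without building mask tensors."""
    if kv_idx >= seq_len:       # padded key positions are never visible
        return False
    if kv_idx > q_idx:          # causal: no attending to future tokens
        return False
    if window is not None and q_idx - kv_idx >= window:
        return False            # sliding window: only the most recent `window` keys
    return True
```

In FlexAttention proper, a callback like this is compiled into a fused block-sparse kernel, which is how the dynamic cases (varying prompt lengths per batch row, alternating full and windowed layers) stay efficient.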
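Two bullets above mention countering reward hacking, where a model games the reward signal instead of solving the task. One tangible pattern for code-generation rewards is to zero out the reward when a completion matches known shortcut patterns. This is a hypothetical sketch of that idea; the patterns and function below are illustrative, not Unsloth's actual reward function:

```python
import re

# Illustrative shortcut patterns a code-generation policy might exploit
# to pass checks without solving the task (assumed examples, not Unsloth's list).
BANNED_PATTERNS = [
    r"\bexit\s*\(",               # killing the harness before checks run
    r"\bos\.system\b",            # shelling out to tamper with the environment
    r"unittest\.mock|MagicMock",  # mocking away the code under test
]

def anti_hack_reward(completion, base_reward):
    """Return the base reward, or 0.0 if the completion hits a shortcut pattern."""
    for pat in BANNED_PATTERNS:
        if re.search(pat, completion):
            return 0.0
    return base_reward
```

Because RL optimizes whatever signal it is given, guards like this are typically layered with execution-based checks (does the generated code actually run and pass held-out tests) rather than used alone.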
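GRPO, the RL method named above, scores each completion relative to the other completions sampled for the same prompt, replacing a learned value model with group statistics. The core advantage computation can be sketched in a few lines (a simplified illustration of the standard GRPO formula, not Unsloth's code):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's G sampled completions:
    normalize each reward by the group's mean and standard deviation."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are suppressed, which is why a single prompt needs several samples per training step.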