Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries
- #open-source
- #reinforcement-learning
- #async-training
- Async RL training addresses the bottleneck of idle GPUs during data generation by separating inference and training onto different GPU pools connected via a rollout buffer (see the rollout-buffer sketch after this list).
- A survey of 16 open-source RL libraries reveals consistent patterns: Ray dominates orchestration, NCCL broadcast is the default weight-transfer method (see the broadcast sketch below), and LoRA support remains sparse.
- Staleness management varies from dropping samples that are too many policy versions old to applying importance-sampling correction (see the staleness sketch below).
- Partial-rollout handling strategies include implicit continuation, abort-and-retry, and explicit save/resume (see the save/resume sketch below).
- LoRA training is supported in some libraries, enabling efficient adapter-only weight sync (see the LoRA sync sketch below).
- Distributed MoE support is emerging as a key differentiator for future-proofing libraries.
- Critic-free algorithms such as GRPO reduce memory usage by dropping the value network, but their larger group sizes increase weight-sync pressure (see the group-advantage sketch below).
- Process rewards introduce new synchronization barriers, requiring async reward pipelines (see the async-reward sketch below).
- Multi-agent co-evolution exacerbates the straggler problem, necessitating episode-level buffer design.
- Training-inference mismatch in MoE models requires solutions like Keep Routing and Keep Sampling Mask.
- On-policy distillation shares the same async coordination problems as RL, suggesting a unified infrastructure approach (see the distillation-loss sketch below).
- TRL's future async trainer will focus on lightweight orchestration, NCCL weight sync with packed transfers (see the packed-transfer sketch below), and partial rollout support.
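
The sketches below make several of these points concrete; everything beyond well-known PyTorch and standard-library APIs is an illustrative assumption, not code from any surveyed library. First, the rollout buffer: a minimal single-process sketch in which an inference loop produces rollouts into a bounded buffer while a training loop consumes them. Real systems place the two loops on separate GPU pools (e.g., as Ray actors).

```python
# Minimal single-process sketch of the async pattern: an inference
# "pool" produces rollouts into a bounded buffer while a training
# "pool" consumes them. All names here are illustrative.
import queue
import threading

rollout_buffer: "queue.Queue[dict]" = queue.Queue(maxsize=64)

def inference_loop(num_rollouts: int) -> None:
    # Stand-in for a vLLM-style generation loop on the inference pool.
    for step in range(num_rollouts):
        rollout = {"prompt_id": step, "tokens": [1, 2, 3], "policy_version": step // 8}
        rollout_buffer.put(rollout)  # blocks when the buffer is full

def training_loop(num_steps: int, batch_size: int = 4) -> None:
    for _ in range(num_steps):
        batch = [rollout_buffer.get() for _ in range(batch_size)]
        # Stand-in for a policy update on `batch`, followed by a weight
        # sync back to the inference pool.
        print("training on versions", [r["policy_version"] for r in batch])

producer = threading.Thread(target=inference_loop, args=(32,))
consumer = threading.Thread(target=training_loop, args=(8,))
producer.start(); consumer.start()
producer.join(); consumer.join()
```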
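Weight transfer via NCCL broadcast, sketched with plain `torch.distributed`: the trainer rank broadcasts each parameter tensor to the inference ranks over a shared process group. This assumes torchrun-style environment initialization and one GPU per rank, and elides how the inference engine maps tensors back; it is a sketch of the pattern, not any library's implementation.

```python
# Sketch of NCCL-broadcast weight sync: the trainer (rank 0) broadcasts
# each parameter to all other ranks in a shared process group.
import torch
import torch.distributed as dist

def sync_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    # Every rank iterates parameters in the same deterministic order,
    # so one broadcast per tensor keeps trainer and inference aligned.
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(16, 16).cuda()  # stand-in for the policy
    sync_weights(model)
    dist.destroy_process_group()
```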
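The two staleness strategies side by side: a hard version cutoff that drops old samples, and a truncated importance weight pi_current / pi_behavior that corrects for off-policy samples instead. The version field and truncation constant are assumptions for illustration.

```python
# Two staleness strategies: (1) drop rollouts more than MAX_STALENESS
# policy versions old; (2) keep them but reweight with a truncated
# importance ratio.
import torch

MAX_STALENESS = 2

def filter_stale(batch: list[dict], current_version: int) -> list[dict]:
    return [r for r in batch if current_version - r["policy_version"] <= MAX_STALENESS]

def is_corrected_loss(logp_current: torch.Tensor,
                      logp_behavior: torch.Tensor,
                      advantages: torch.Tensor,
                      clip: float = 2.0) -> torch.Tensor:
    # Per-token importance weight pi_current / pi_behavior, truncated
    # to bound the variance that stale (off-policy) samples introduce.
    ratio = torch.exp(logp_current - logp_behavior)
    ratio = torch.clamp(ratio, max=clip)
    return -(ratio * advantages).mean()
```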
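Explicit save/resume for partial rollouts: when a weight update lands mid-generation, the in-flight sequence is checkpointed and continued under the new policy, with a per-token version tag so the trainer can correct the mixed-policy trajectory. All field names and the `generate_fn` hook are hypothetical.

```python
# Explicit save/resume for a partial rollout: an in-flight sequence is
# checkpointed when new weights arrive and continued under the new policy.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PartialRollout:
    prompt_ids: list[int]
    generated_ids: list[int] = field(default_factory=list)
    policy_versions: list[int] = field(default_factory=list)  # one tag per generated token

def resume(rollout: PartialRollout,
           generate_fn: Callable[[list[int], int], list[int]],
           new_version: int,
           max_new_tokens: int) -> PartialRollout:
    # Continue from the full saved context under the freshly synced
    # weights; per-token version tags let the trainer apply
    # importance-sampling correction to the mixed-policy trajectory.
    new_tokens = generate_fn(rollout.prompt_ids + rollout.generated_ids, max_new_tokens)
    rollout.generated_ids += new_tokens
    rollout.policy_versions += [new_version] * len(new_tokens)
    return rollout
```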
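Adapter-only weight sync under LoRA: only tensors whose names mark them as adapter weights cross the wire, a small fraction of a full-model sync. The `lora_` naming convention follows PEFT; treat it as an assumption for other stacks.

```python
# Adapter-only sync: broadcast just the LoRA tensors, sorted by name so
# every rank walks them in the same deterministic order.
import torch
import torch.distributed as dist

def sync_lora_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    for name, param in sorted(model.named_parameters(), key=lambda kv: kv[0]):
        if "lora_" in name:
            dist.broadcast(param.data, src=src_rank)
```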
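A worked sketch of the group-relative advantage behind critic-free methods such as GRPO: with several completions per prompt, each reward is normalized against its own group's mean and standard deviation, so no value network is needed.

```python
# Group-relative advantage, as in critic-free methods like GRPO: each
# reward is normalized within its group of completions for one prompt.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # two correct out of four
                        [0.0, 0.0, 1.0, 0.0]])  # one correct out of four
print(group_advantages(rewards))
```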
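An async reward pipeline in miniature: per-step (process) reward calls are issued concurrently and awaited together instead of blocking the loop at a synchronization barrier. `score_step` is a hypothetical stand-in for a reward-model RPC.

```python
# Async reward pipeline sketch: all per-step reward calls run
# concurrently rather than serializing the generation loop.
import asyncio

async def score_step(step_text: str) -> float:
    await asyncio.sleep(0.01)  # stands in for a reward-model call
    return float(len(step_text))

async def score_trajectory(steps: list[str]) -> list[float]:
    return list(await asyncio.gather(*(score_step(s) for s in steps)))

print(asyncio.run(score_trajectory(["step 1", "step 2", "step 3"])))
```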
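Why on-policy distillation reuses the same infrastructure, sketched as a loss: the student samples the tokens, and a per-token reverse KL to the teacher is minimized at those positions, so student-side sampling and teacher-side scoring need the same async plumbing as RL.

```python
# On-policy distillation loss sketch: per-token reverse KL from the
# student to the teacher, evaluated on student-sampled sequences.
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); KL(student || teacher) per token.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
```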
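Finally, a sketch of the "packed transfer" idea: rather than one broadcast per parameter, tensors are flattened into a single contiguous buffer so the sync pays a single collective launch. This illustrates the idea only (and assumes a uniform parameter dtype); it is not TRL's actual implementation.

```python
# Packed transfer sketch: flatten all parameters into one contiguous
# buffer and pay a single NCCL broadcast instead of one per tensor.
import torch
import torch.distributed as dist

def packed_broadcast(model: torch.nn.Module, src_rank: int = 0) -> None:
    params = [p.data for p in model.parameters()]
    flat = torch.cat([p.reshape(-1) for p in params])  # one big buffer
    dist.broadcast(flat, src=src_rank)
    # Unpack the received buffer back into the parameter tensors.
    offset = 0
    for p in params:
        p.copy_(flat[offset:offset + p.numel()].view_as(p))
        offset += p.numel()
```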