Hasty Briefs


Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries

11 hours ago
  • #open-source
  • #reinforcement-learning
  • #async-training
  • Async RL training addresses the bottleneck of idle GPUs during data generation by separating inference and training onto different GPU pools connected via a rollout buffer.
  • A survey of 16 open-source RL libraries reveals recurring patterns: Ray dominates orchestration, NCCL broadcast is the default weight-transfer method, and LoRA support remains sparse.
  • Staleness management varies from dropping old samples to using importance-sampling correction.
  • Partial rollout handling strategies include implicit continuation, abort-and-retry, and explicit save/resume.
  • LoRA training is supported in some libraries, enabling efficient adapter-only weight sync.
  • Distributed MoE support is emerging as a key differentiator for future-proofing libraries.
  • Critic-free algorithms reduce memory usage but increase weight sync pressure due to larger group sizes.
  • Process rewards introduce new synchronization barriers, requiring async reward pipelines.
  • Multi-agent co-evolution exacerbates the straggler problem, necessitating episode-level buffer design.
  • Training-inference mismatch in MoE models requires solutions like Keep Routing and Keep Sampling Mask.
  • On-policy distillation shares the same async coordination problems as RL, suggesting a unified infrastructure approach.
  • TRL's future async trainer will focus on lightweight orchestration, NCCL weight sync with packed transfers, and partial rollout support.
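The inference/training split behind the first bullet can be sketched as a simple producer-consumer loop: an inference worker (a hypothetical stand-in for a vLLM- or SGLang-style pool) streams rollouts into a bounded buffer while the trainer drains it in batches, so neither GPU pool sits idle waiting for the other. The data structures here are illustrative, not any library's actual API.

```python
import queue
import threading

# Bounded buffer between the inference pool and the trainer; the size cap
# applies backpressure if the trainer falls behind.
rollout_buffer = queue.Queue(maxsize=64)

def inference_worker(num_rollouts: int) -> None:
    """Generate rollouts and push them into the shared buffer."""
    for step in range(num_rollouts):
        rollout = {"tokens": [step, step + 1], "reward": float(step % 2)}
        rollout_buffer.put(rollout)  # blocks when the buffer is full
    rollout_buffer.put(None)  # sentinel: generation finished

def trainer(batch_size: int) -> int:
    """Drain the buffer in batches; returns the number of samples consumed."""
    trained = 0
    batch = []
    while True:
        item = rollout_buffer.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            trained += len(batch)  # a real trainer would run an update here
            batch = []
    return trained + len(batch)

producer = threading.Thread(target=inference_worker, args=(10,))
producer.start()
n = trainer(batch_size=4)
producer.join()
print(n)  # 10
```

In the surveyed libraries the two sides typically live on separate GPU pools and the buffer crosses process boundaries (e.g. via Ray), but the coordination shape is the same.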
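The importance-sampling correction mentioned for staleness management can be illustrated in a few lines: samples generated under an older policy are reweighted by the probability ratio pi_new/pi_old, with clipping to bound variance. The log-probabilities and the clip bound below are illustrative values, not any library's defaults.

```python
import math

def is_weights(logp_new, logp_old, clip=2.0):
    """Per-token importance weights exp(logp_new - logp_old),
    clipped to [1/clip, clip] to keep variance bounded."""
    weights = []
    for ln, lo in zip(logp_new, logp_old):
        w = math.exp(ln - lo)
        weights.append(min(max(w, 1.0 / clip), clip))
    return weights

# Tokens the current policy now favors get weight > 1, tokens it
# disfavors get weight < 1; extreme ratios are clipped at the bounds.
print(is_weights([-1.0, -2.0, -0.5], [-1.2, -1.0, -3.0]))
```

The alternative mentioned in the bullet, simply dropping over-age samples, avoids this reweighting at the cost of throwing away generated tokens.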
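Of the three partial-rollout strategies listed, explicit save/resume is the easiest to sketch: when a weight update interrupts generation, the in-flight request's prompt and the tokens emitted so far are checkpointed, and generation later continues from that prefix under the new weights. All names and structures here are hypothetical.

```python
# Checkpoint store for interrupted generations, keyed by request id.
saved = {}

def interrupt(request_id, prompt, generated):
    """Save an in-flight rollout's prompt and partial completion."""
    saved[request_id] = {"prompt": list(prompt), "generated": list(generated)}

def resume(request_id):
    """Return the token prefix to continue generating from
    (prompt plus already-generated tokens) under the new weights."""
    state = saved.pop(request_id)
    return state["prompt"] + state["generated"]

interrupt("req-1", [101, 102], [7, 8])
print(resume("req-1"))  # [101, 102, 7, 8]
```

Implicit continuation and abort-and-retry trade this bookkeeping for, respectively, tolerating a mid-sequence policy switch or discarding the partial tokens entirely.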
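The reason LoRA enables efficient weight sync is that only the small adapter tensors need to cross from trainer to inference workers, not the full state dict. A minimal sketch, assuming the common `lora_` naming convention for adapter keys (the tensor names and sizes below are illustrative):

```python
# Illustrative state dict: large base weights plus small LoRA adapters.
full_state = {
    "model.layers.0.q_proj.weight": "16MB base tensor",
    "model.layers.0.q_proj.lora_A.weight": "64KB adapter tensor",
    "model.layers.0.q_proj.lora_B.weight": "64KB adapter tensor",
}

def adapter_only(state: dict) -> dict:
    """Keep just the LoRA adapter entries for a lightweight broadcast."""
    return {k: v for k, v in state.items() if "lora_" in k}

payload = adapter_only(full_state)
print(sorted(payload))  # only the two lora_* keys survive
```

Broadcasting kilobytes of adapters instead of gigabytes of base weights is what makes adapter-only sync attractive for the libraries that support it.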