INTELLECT-2 Release: The First 32B Model Trained Through Globally Distributed RL
- #AI
- #Decentralized Training
- #Reinforcement Learning
- INTELLECT-2 is the first 32B parameter model trained via globally distributed reinforcement learning.
- The model uses PRIME-RL, a training framework for distributed asynchronous reinforcement learning, together with components such as TOPLOC (verifiable inference) and SHARDCAST (policy weight distribution).
- Training data includes 285k verifiable tasks from NuminaMath-1.5, Deepscaler, and SYNTHETIC-1.
- The model improves upon QwQ-32B with modifications to the GRPO training recipe and advanced data filtering techniques.
- Future work includes increasing the inference-to-training compute ratio, integrating tool calls, crowdsourcing RL tasks, and model merging.
- The article also includes a detailed math problem solution involving quadratic polynomials P(x) and Q(x).
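The GRPO recipe mentioned above scores each completion relative to the other completions sampled for the same prompt. A minimal sketch of that group-relative advantage normalization follows; the function and variable names are illustrative assumptions, not taken from the PRIME-RL codebase, and INTELLECT-2 applies further modifications on top of this baseline.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against its group's
    mean and standard deviation (hypothetical helper, not PRIME-RL API)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions in the group scored identically:
        # no relative signal, so the group contributes zero gradient.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Example: four completions for one verifiable task, binary rewards.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed within each sampled group rather than from a learned value model, this style of recipe needs no critic network, which is part of what makes it attractive for distributed RL setups.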