Improving Composer through real-time RL

a day ago

Real-time RL trains on tokens produced during real inference, improving models like Composer.
Training coding models typically relies on simulated environments, but simulated users are an imperfect stand-in for real ones, which introduces errors.
Real-time RL infrastructure includes client-side instrumentation, backend pipelines, and fast deployment.
A new Composer checkpoint can be deployed every five hours, keeping data on-policy.
Because its reward signal comes from real user feedback, real-time RL helps avoid reward hacking.
Examples of reward hacking include emitting invalid tool calls and deferring risky edits.
Future work includes adapting to longer feedback loops and specializing Composer for specific organizations.
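
To make the on-policy point concrete, here is a minimal sketch, not Cursor's actual pipeline. It assumes each logged interaction is tagged with the checkpoint that produced it, so a training batch can be limited to samples from the currently deployed checkpoint. The names `Interaction`, `checkpoint_id`, `reward`, `on_policy_batch`, and `deploys_per_day` are all hypothetical. Only the five-hour cadence comes from the summary above.

```python
from dataclasses import dataclass
from datetime import timedelta

# Deployment cadence stated in the summary; everything else here is assumed.
DEPLOY_INTERVAL = timedelta(hours=5)


@dataclass(frozen=True)
class Interaction:
    checkpoint_id: str  # hypothetical tag: checkpoint that generated the response
    reward: float       # hypothetical scalar derived from user feedback


def on_policy_batch(interactions, live_checkpoint_id):
    """Keep only interactions generated by the currently deployed checkpoint."""
    return [i for i in interactions if i.checkpoint_id == live_checkpoint_id]


def deploys_per_day(interval=DEPLOY_INTERVAL):
    """Number of checkpoint deployments that fit in 24 hours at this cadence."""
    return timedelta(days=1) / interval


logs = [
    Interaction("ckpt-1", 1.0),   # produced by an older checkpoint: off-policy
    Interaction("ckpt-2", -1.0),
    Interaction("ckpt-2", 0.5),
]
batch = on_policy_batch(logs, "ckpt-2")
```

Under these assumptions, a shorter deployment interval means the live checkpoint has produced a larger share of recent traffic, so fewer logged interactions are discarded as stale.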
