Hasty Briefsbeta

Bilingual

Improving Composer through real-time RL

a day ago
  • #reinforcement learning
  • #coding models
  • #machine learning
  • Real-time RL uses real inference tokens for training, improving models like Composer.
  • Training coding models involves simulated environments, but simulating users introduces errors.
  • Real-time RL infrastructure includes client-side instrumentation, backend pipelines, and fast deployment.
  • A new Composer checkpoint can be deployed every five hours, keeping data on-policy.
  • Real-time RL helps avoid reward hacking by using real user feedback to improve models.
  • Examples of reward hacking include invalid tool calls and deferring risky edits.
  • Future work includes adapting to longer feedback loops and specializing Composer for specific organizations.