PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

3 hours ago
  • #self-play curriculum
  • #large language models
  • #reinforcement learning
  • Reinforcement learning with verifiable rewards (RLVR) trains large language models on tasks with checkable outcomes, such as code tests or math solutions.
  • Traditional methods use fixed, hand-curated task distributions, which can become too easy or too narrow as the model improves, limiting adaptive learning.
  • Synthetic tasks from generators or self-play can scale training, but single-agent self-play tends to collapse, generating easier tasks the model already solves.
  • PopuLoRA introduces co-evolving populations of teacher and student LoRA adapters: teachers generate verifiable tasks, students solve them, with rewards tied to student failure to maintain challenge.
  • The system uses a Python executor for deterministic verification; task types include predicting outputs, finding inputs, and completing functions from examples.
  • PopuLoRA's design prevents curriculum collapse by leveraging inter-population dynamics, where difficulty is measured across different models, promoting task complexity and diversity.
  • Training combines three steps: teachers and students are matched via prioritized fictitious self-play with TrueSkill, both populations receive joint policy-gradient updates, and a periodic evolution step replaces weak members in weight space.
  • Results show PopuLoRA outperforms single-agent baselines on code benchmarks like HumanEval+ and MBPP+, with gains also observed on math tasks, indicating transfer from a broader code curriculum.
  • The approach offers a scalable, adaptive autocurriculum, suggesting a path toward self-improving systems built on distributed, co-evolving populations.
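The three task types mentioned above (predicting outputs, finding inputs, completing functions from examples) can all be checked by running code and comparing results. The following is a minimal sketch of such a check, not PopuLoRA's actual executor: the function names, the convention that every task defines a function called `f`, and the use of plain `exec` without sandboxing are all assumptions made for illustration.

```python
from typing import Any, Callable, Sequence, Tuple


def load_function(source: str, entry: str = "f") -> Callable[..., Any]:
    """Run task source in a fresh namespace and return the entry function.

    Assumption: tasks define a function named `f`. A real executor would
    sandbox this call and enforce time and memory limits.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[entry]


def safe_equal(source: str, args: Tuple[Any, ...], expected: Any) -> bool:
    """Return True only if the program runs and produces `expected`."""
    try:
        return load_function(source)(*args) == expected
    except Exception:
        return False  # crashes and syntax errors count as failures


def verify_output_prediction(source: str, args: Tuple[Any, ...], predicted: Any) -> bool:
    """Student was given program and input, and predicted the output."""
    return safe_equal(source, args, predicted)


def verify_input_finding(source: str, proposed_args: Tuple[Any, ...], target: Any) -> bool:
    """Student was given program and output, and proposed an input."""
    return safe_equal(source, proposed_args, target)


def verify_function_completion(
    candidate_source: str, examples: Sequence[Tuple[Tuple[Any, ...], Any]]
) -> bool:
    """Student was given input/output examples, and wrote the program."""
    return all(safe_equal(candidate_source, args, out) for args, out in examples)


if __name__ == "__main__":
    task = "def f(xs):\n    return sorted(set(xs))[:2]\n"
    print(verify_output_prediction(task, ([3, 1, 3, 2],), [1, 2]))
    print(verify_input_finding(task, ([2, 1],), [1, 2]))
    print(verify_function_completion(task, [(([3, 1, 3, 2],), [1, 2])]))
```

Because each check reduces to executing code and comparing values, the same program and input always yield the same verdict, which is what makes the reward verifiable.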
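The idea of tying teacher rewards to student failure can be made concrete with a small sketch. The summary does not give the exact reward formula, so the version below is an assumption: the teacher earns one minus the fraction of sampled students that solved its task, and earns nothing for a task that fails verification. The names `teacher_reward` and `student_reward` are hypothetical.

```python
from typing import Sequence


def student_reward(solved: bool) -> float:
    """Binary verifiable reward for a student: 1 if the executor accepts the answer."""
    return 1.0 if solved else 0.0


def teacher_reward(student_results: Sequence[bool], task_is_valid: bool) -> float:
    """Assumed teacher reward: the failure rate of the students on this task.

    An invalid task (one the executor cannot verify) earns nothing, so the
    teacher cannot score by emitting broken or unverifiable programs.
    """
    if not task_is_valid or not student_results:
        return 0.0
    solve_rate = sum(student_results) / len(student_results)
    return 1.0 - solve_rate


if __name__ == "__main__":
    # One of four students solved the task, so the teacher is rewarded for difficulty.
    print(teacher_reward([True, False, False, False], task_is_valid=True))
    # Every student solved it, so an easy task earns the teacher nothing.
    print(teacher_reward([True, True, True, True], task_is_valid=True))
```

Under this assumed form, a teacher that drifts toward tasks the students already solve sees its reward fall to zero, which is the pressure against curriculum collapse described above.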
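Matching by prioritized fictitious self-play can be sketched as weighted sampling over rated opponents. The code below is an illustration under stated assumptions rather than the paper's procedure: it treats a teacher "winning" as the student failing its task, uses the standard Gaussian win-probability estimate for TrueSkill-style ratings with TrueSkill's default scale (mu = 25, sigma = 25/3, beta = 25/6), and weights each student by p(1 - p) so that evenly matched pairs are sampled most often. The weighting function is one common choice, and the source does not say which one PopuLoRA uses.

```python
import math
import random
from dataclasses import dataclass
from statistics import NormalDist
from typing import List, Sequence


@dataclass
class Rating:
    mu: float = 25.0          # estimated skill
    sigma: float = 25.0 / 3   # uncertainty about that skill


BETA = 25.0 / 6  # performance noise on TrueSkill's default scale


def win_probability(a: Rating, b: Rating) -> float:
    """Estimated probability that `a` beats `b` under Gaussian skill ratings."""
    denom = math.sqrt(2 * BETA**2 + a.sigma**2 + b.sigma**2)
    return NormalDist().cdf((a.mu - b.mu) / denom)


def pfsp_weights(teacher: Rating, students: Sequence[Rating]) -> List[float]:
    """Assumed priority p * (1 - p): largest when the match is close to even."""
    weights = []
    for student in students:
        p = win_probability(teacher, student)
        weights.append(p * (1.0 - p))
    return weights


def sample_student(teacher: Rating, students: Sequence[Rating], rng: random.Random) -> int:
    """Sample the index of a student in proportion to its priority weight."""
    weights = pfsp_weights(teacher, students)
    return rng.choices(range(len(students)), weights=weights, k=1)[0]


if __name__ == "__main__":
    teacher = Rating(mu=25.0, sigma=1.0)
    students = [Rating(10.0, 1.0), Rating(25.0, 1.0), Rating(40.0, 1.0)]
    print(pfsp_weights(teacher, students))
    print(sample_student(teacher, students, random.Random(0)))
```

Sampling close matches more often keeps teachers paired with students near their own level, so tasks tend to be neither trivially solved nor hopeless for the student that receives them.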