PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
3 hours ago
- #self-play curriculum
- #large language models
- #reinforcement learning
- Reinforcement learning with verifiable rewards (RLVR) trains large language models on tasks whose outcomes can be checked automatically, such as code that must pass tests or math problems with known solutions.
- Traditional methods use fixed, hand-curated task distributions, which may become too easy or narrow, limiting adaptive learning.
- Synthetic tasks from generators or self-play can scale training, but single-agent self-play tends to collapse, generating easier tasks the model already solves.
- PopuLoRA introduces co-evolving populations of teacher and student LoRA adapters: teachers generate verifiable tasks, students solve them, with rewards tied to student failure to maintain challenge.
- The system uses a Python executor for deterministic verification. Task types include predicting a program's output, finding an input that produces a given output, and completing a function from examples.
- PopuLoRA's design prevents curriculum collapse by leveraging inter-population dynamics, where difficulty is measured across different models, promoting task complexity and diversity.
- Training matches teachers and students via prioritized fictitious self-play with TrueSkill ratings, applies joint policy-gradient updates, and periodically runs an evolution step that replaces weak population members in weight space.
- Results show PopuLoRA outperforms single-agent baselines on code benchmarks like HumanEval+ and MBPP+, with gains also observed on math tasks, indicating transfer from a broader code curriculum.
- The approach offers a scalable, adaptive autocurriculum and points to a possible path toward self-improving systems built on distributed, co-evolving populations.
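
To make the verification and reward ideas in the summary concrete, here is a minimal Python sketch of the three task types checked by an executor, plus a teacher reward that grows as more students fail. Everything named here is an illustrative assumption rather than PopuLoRA's actual code: the helper names, the convention that each task defines a single-argument function `f`, and the specific reward formula (fraction of students that failed) are all hypothetical. The bare `exec` call is also not sandboxed, which a real system would need.

```python
from typing import Any, Dict, List, Tuple


def run_program(source: str, arg: Any) -> Any:
    """Execute `source` in a fresh namespace and call its function `f` on `arg`.

    Assumption: every task program defines a single-argument function named `f`.
    NOTE: plain exec() is NOT a sandbox; this is for illustration only.
    """
    namespace: Dict[str, Any] = {}
    exec(source, namespace)
    return namespace["f"](arg)


def verify_output_prediction(source: str, arg: Any, predicted: Any) -> bool:
    """Task type 1: the student predicts the output of f(arg)."""
    try:
        return run_program(source, arg) == predicted
    except Exception:
        return False


def verify_input_finding(source: str, proposed_input: Any, target: Any) -> bool:
    """Task type 2: the student proposes an input that makes f return `target`."""
    try:
        return run_program(source, proposed_input) == target
    except Exception:
        return False


def verify_function_completion(
    candidate_source: str, examples: List[Tuple[Any, Any]]
) -> bool:
    """Task type 3: the student writes `f`; it must reproduce every example."""
    try:
        return all(run_program(candidate_source, x) == y for x, y in examples)
    except Exception:
        return False


def teacher_reward(student_solved: List[bool]) -> float:
    """Hypothetical reward: fraction of students that FAILED the task.

    The source only says rewards are tied to student failure; this exact
    formula is an assumption made for the sketch.
    """
    if not student_solved:
        return 0.0
    return 1.0 - sum(student_solved) / len(student_solved)


# A teacher-generated task program (hypothetical example).
TASK = "def f(x):\n    return sorted(x)[::-1]\n"

# A student's answer to the function-completion variant of the same task.
CANDIDATE = "def f(x):\n    return sorted(x, reverse=True)\n"

if __name__ == "__main__":
    print(verify_output_prediction(TASK, [3, 1, 2], [3, 2, 1]))
    print(verify_input_finding(TASK, [2, 3, 1], [3, 2, 1]))
    print(verify_function_completion(CANDIDATE, [([1, 2], [2, 1]), ([5], [5])]))
    print(teacher_reward([True, False, False, False]))
```

Because each check reduces to running code and comparing values, the same executor can score every student in a population on the same task, which is what lets difficulty be measured across models rather than against a single solver.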