PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

3 hours ago

#self-play curriculum
#large language models
#reinforcement learning

Reinforcement learning with verifiable rewards (RLVR) trains large language models on tasks whose outcomes can be checked automatically, such as code that must pass tests or math problems with known solutions.
Traditional methods use fixed, hand-curated task distributions, which can become too easy or too narrow as the model improves, so the curriculum does not adapt to the learner.
Synthetic tasks from generators or self-play can scale training, but single-agent self-play tends to collapse, generating easier tasks the model already solves.
PopuLoRA introduces co-evolving populations of teacher and student LoRA adapters: teachers generate verifiable tasks, students solve them, with rewards tied to student failure to maintain challenge.
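The reward coupling described above can be sketched in a few lines. The summary does not give the exact formula, so everything here is an assumption for illustration: the function names are hypothetical, the teacher reward is taken to be one minus the student solve rate, and a task that no student solves is assumed to earn nothing so that teachers are not rewarded for unsolvable tasks.

```python
from statistics import mean


def student_reward(solved: bool) -> float:
    """Binary verifiable reward for one student attempt."""
    return 1.0 if solved else 0.0


def teacher_reward(student_solved: list[bool]) -> float:
    """Reward a teacher for one task, given pass/fail results from its students.

    Assumed form (not taken from the paper): 1 - solve_rate, with a guard
    for tasks that nobody solves.
    """
    if not student_solved:
        raise ValueError("need at least one student attempt")
    solve_rate = mean(student_reward(s) for s in student_solved)
    # Assumption: an unsolved task is treated as degenerate and earns nothing.
    if solve_rate == 0.0:
        return 0.0
    return 1.0 - solve_rate
```

Under this assumed shaping, a task solved by one student in four pays the teacher 0.75, while a task that every student solves, or that no student solves, pays nothing.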
The system uses a Python executor for deterministic verification. Task types include predicting a program's output, finding an input that produces a given output, and completing a function from input-output examples.
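As a rough illustration of executor-based verification, the sketch below checks each of the three task types by running code and comparing results. It is a minimal sketch under stated assumptions: the helper names and the convention that every task defines a single-argument function called `f` are hypothetical, and plain `exec()` is used with no sandbox or timeout, which a real executor for model-written code would need.

```python
def run_function(source: str, arg):
    """Execute `source` and call the function `f` it defines on `arg`.

    Assumption: tasks define a one-argument function named `f`.
    No sandboxing or timeout is applied here.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["f"](arg)


def _agrees(source: str, arg, expected) -> bool:
    """True if running `f` on `arg` returns `expected`; any error counts as failure."""
    try:
        return run_function(source, arg) == expected
    except Exception:
        return False


def check_output_prediction(task_source: str, arg, predicted_output) -> bool:
    """Student was given the program and an input, and predicted the output."""
    return _agrees(task_source, arg, predicted_output)


def check_input_finding(task_source: str, proposed_arg, target_output) -> bool:
    """Student was given the program and an output, and proposed an input."""
    return _agrees(task_source, proposed_arg, target_output)


def check_function_completion(candidate_source: str, examples) -> bool:
    """Student wrote `f` from (input, output) examples; all examples must pass."""
    return all(_agrees(candidate_source, x, y) for x, y in examples)


# Hypothetical teacher-generated task used for the checks below.
TASK = "def f(xs):\n    return sorted(set(xs))\n"
```

Because the check only compares executed results, any input that maps to the target output is accepted for the input-finding task, not just the one the teacher had in mind.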
PopuLoRA's design prevents curriculum collapse by leveraging inter-population dynamics, where difficulty is measured across different models, promoting task complexity and diversity.
Training involves matching teachers and students via prioritized fictitious self-play with TrueSkill, joint policy-gradient updates, and periodic evolution replacing weak members in weight space.
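The matching step can be illustrated with a small, standard-library sketch. It is not the paper's implementation: the `Rating` class, the function names, and the weighting `p * (1 - p)` are assumptions chosen for illustration, and the TrueSkill rating update itself is omitted. The win probability uses the usual Gaussian form for a one-on-one TrueSkill match, with the default prior of mu = 25, sigma = 25/3 and beta = 25/6.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Rating:
    """Gaussian skill belief in the style of TrueSkill (default prior values)."""
    mu: float = 25.0
    sigma: float = 25.0 / 3.0


BETA = 25.0 / 6.0  # per-game performance noise (TrueSkill default)


def win_probability(a: Rating, b: Rating) -> float:
    """Probability that `a` outperforms `b` under Gaussian skill and noise."""
    denom = math.sqrt(2 * BETA**2 + a.sigma**2 + b.sigma**2)
    return 0.5 * (1.0 + math.erf((a.mu - b.mu) / (denom * math.sqrt(2))))


def pfsp_weights(teacher: Rating, students: list[Rating]) -> list[float]:
    """Sampling weights over students for one teacher.

    Assumed prioritization: p * (1 - p), where p is the chance the student
    beats (solves the tasks of) the teacher. This favors close matchups.
    """
    raw = []
    for s in students:
        p = win_probability(s, teacher)
        raw.append(p * (1.0 - p))
    total = sum(raw)
    if total == 0.0:  # every matchup is fully one-sided; fall back to uniform
        return [1.0 / len(students)] * len(students)
    return [w / total for w in raw]


def sample_student(teacher: Rating, students: list[Rating], rng: random.Random) -> int:
    """Draw the index of a student to pair with `teacher`."""
    weights = pfsp_weights(teacher, students)
    return rng.choices(range(len(students)), weights=weights, k=1)[0]
```

With this weighting, a student whose rating is close to the teacher's is sampled far more often than one that is much weaker or much stronger, which keeps most matchups informative for both sides.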
Results show PopuLoRA outperforms single-agent baselines on code benchmarks like HumanEval+ and MBPP+, with gains also observed on math tasks, indicating transfer from a broader code curriculum.
The approach offers a scalable, adaptive autocurriculum and points to distributed, co-evolving populations as a route toward self-improving systems.
