Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
3 days ago
- #Machine Learning
- #Reinforcement Learning
- #Meta-Learning
- Investigates whether a pretrained LLM can generate an automated curriculum for problems it cannot yet solve.
- Introduces SOAR, a meta-RL self-improvement framework in which a teacher model proposes synthetic problems for a student model to train on (see the loop sketch after this list).
- SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards.
- Study conducted on the hardest subsets of mathematical benchmarks, i.e. problems on which the model has a 0/128 success rate (see the subset-selection sketch after this list).
- Key findings include the feasibility of bi-level meta-RL under sparse, binary rewards.
- Grounded rewards outperform intrinsic reward schemes, avoiding instability and diversity collapse.
- Structural quality and well-posedness of generated questions are more critical for learning progress than solution correctness.
- Suggests that a model does not need to solve the hard problems itself in order to generate useful stepping stones toward them.
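
A minimal sketch of the bi-level loop described above, assuming hypothetical `teacher` and `student` interfaces (`propose_problem`, `solve`, `rl_finetune`, `rl_update`); the exact training procedure in SOAR may differ. The point it illustrates is the grounded reward: the teacher is scored by the measured change in student success on the hard targets, not by an intrinsic proxy.

```python
def student_success_rate(student, problems, attempts=128):
    """Fraction of problems the student solves at least once in
    `attempts` samples (sparse, binary reward per attempt)."""
    solved = sum(
        any(student.solve(p) == p.answer for _ in range(attempts))
        for p in problems
    )
    return solved / len(problems)


def soar_round(teacher, student, target_problems, n_synthetic=64):
    """One outer (teacher) step of the bi-level loop:
    1. the teacher proposes synthetic stepping-stone problems,
    2. the student is RL-fine-tuned on them (inner loop),
    3. the teacher is rewarded by the measured improvement of the
       student on the held-out hard targets (grounded reward)."""
    baseline = student_success_rate(student, target_problems)

    synthetic = [teacher.propose_problem() for _ in range(n_synthetic)]
    student.rl_finetune(synthetic)                 # inner loop (hypothetical API)

    improved = student_success_rate(student, target_problems)
    learning_progress = improved - baseline        # grounded teacher reward
    teacher.rl_update(synthetic, learning_progress)  # outer loop (hypothetical API)
    return learning_progress
```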
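
And a small illustration of how a "0/128" hard subset could be selected, assuming the same hypothetical `student.solve` interface: a problem is kept only if the base model fails all 128 sampled attempts.

```python
def hardest_subset(student, benchmark, attempts=128):
    """Keep only problems the base model never solves across
    `attempts` independent samples, i.e. the 0/128 subset."""
    return [
        problem for problem in benchmark
        if not any(student.solve(problem) == problem.answer
                   for _ in range(attempts))
    ]
```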