Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate
- #Multi-Agent Debate
- #LLM Distillation
- #Activation Steering
- Proposes a framework to distill multi-agent debate into a single LLM via a two-stage fine-tuning pipeline, reducing token usage by up to 93%.
- Internalizes debate through structure learning, dynamic reward scheduling, and length clipping, matching or exceeding explicit multi-agent debate performance.
- Identifies agent-specific subspaces via activation steering, showing interpretable activation directions for different agent perspectives.
- Demonstrates a practical application: malicious agents are instilled in the model, and negative steering is then used to suppress the resulting harmful behaviors with less performance loss.
- Makes the code available and offers insights into understanding and controlling internalized reasoning in distilled models.
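
To make the activation-steering points above concrete, here is a minimal sketch of negative steering on a single hidden-state vector. It is not the paper's implementation: it assumes a steering direction can be estimated with the common difference-of-means heuristic, and the function names (`steering_direction`, `steer`, `projection`) and toy activations are hypothetical, chosen only for illustration.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(agent_acts, baseline_acts):
    """Unit-norm difference of mean activations.

    Assumption: difference-of-means is used here as a stand-in heuristic;
    the summarized work may derive its agent directions differently.
    """
    mu_agent = mean_vector(agent_acts)
    mu_base = mean_vector(baseline_acts)
    diff = [a - b for a, b in zip(mu_agent, mu_base)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("agent and baseline means coincide; no direction")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha * direction.

    alpha > 0 amplifies the agent's perspective; alpha < 0 is negative
    steering, which pushes the state away from that perspective.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar component of a hidden state along a unit direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy 2-D activations (hypothetical values, not data from the paper).
agent_acts = [[2.0, 0.0], [4.0, 0.0]]
baseline_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = steering_direction(agent_acts, baseline_acts)
hidden = [1.0, 1.0]
steered = steer(hidden, direction, alpha=-1.0)
```

In this toy setup, negative steering removes the component of `hidden` along the agent direction and leaves the orthogonal component untouched. That is the intuition behind suppressing one agent's behavior while disturbing the rest of the representation as little as possible.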