Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate

5 hours ago

Proposes a framework to distill multi-agent debate into a single LLM via a two-stage fine-tuning pipeline, reducing token usage by up to 93%.
Internalizes debate through structure learning, dynamic reward scheduling, and length clipping, matching or exceeding explicit multi-agent debate performance.
Identifies agent-specific subspaces via activation steering, showing interpretable activation directions for different agent perspectives.
Demonstrates a practical application by instilling malicious agents and using negative steering to control harmful behaviors with less performance loss.
Makes the code available and offers insights into understanding and controlling internalized reasoning in distilled models.
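
To make the steering idea in the summary concrete, here is a minimal sketch of activation steering on toy vectors. It is not the paper's implementation: it assumes a steering direction can be estimated as the difference of mean activations between prompts that elicit one agent perspective and baseline prompts, and that negative steering adds a negative multiple of that direction to a hidden state. All function names, the scale `alpha`, and the two-dimensional activations are hypothetical and chosen for illustration only.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(agent_acts: List[Vector], baseline_acts: List[Vector]) -> Vector:
    """Unit-norm difference of means (an assumed way to estimate a direction)."""
    mu_agent = mean_vector(agent_acts)
    mu_base = mean_vector(baseline_acts)
    diff = [a - b for a, b in zip(mu_agent, mu_base)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("agent and baseline activations have identical means")
    return [d / norm for d in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state.

    A positive alpha pushes toward the agent perspective; a negative alpha
    (negative steering) pushes away from it.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Vector, direction: Vector) -> float:
    """Component of the hidden state along the steering direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy activations (hypothetical): the "agent" prompts differ from the
# baseline only along the first coordinate.
agent_acts = [[2.0, 0.0], [4.0, 0.0]]
baseline_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = steering_direction(agent_acts, baseline_acts)

# A hidden state that partly expresses the agent perspective.
hidden = [1.5, 0.5]

# Negative steering: subtract the component along the agent direction.
steered = steer(hidden, direction, alpha=-projection(hidden, direction))
```

In this toy setup the steered state keeps its second coordinate while its component along the agent direction drops to zero. In a real model the same addition would be applied to a layer's hidden states during the forward pass, and the choice of layer and scale would need to be tuned empirically.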
