Hasty Briefs
LLM Alloying Improves Performance over Single Model

9 months ago
  • #AI Agents
  • #Cybersecurity
  • #LLM Optimization
  • XBOW developed a novel technique to boost its vulnerability-detection agent's performance, increasing success rates from 25% to 55%.
  • The technique, 'model alloys,' alternates between different LLMs (such as Sonnet and Gemini) within the same agent loop to combine their strengths.
  • Model alloys work best when tasks require multiple unique insights and when models have complementary strengths.
  • Alloys outperform individual models, especially when combining models from different providers (e.g., Sonnet 4.0 + Gemini 2.5 Pro).
  • A key advantage is that an alloy keeps the same number of model calls as a single-model agent while leveraging diverse model capabilities.
  • Alloys are less effective when models are too similar or when tasks require steady progress rather than bursts of insight.
  • Alternatives such as task-specific model delegation or multi-agent debate were considered but judged less efficient for XBOW's use case.
  • Data shows alloyed agents (Sonnet + Gemini) achieved a 68.8% success rate, outperforming individual models (Sonnet: 57.5%, Gemini: 46.4%).
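The alloy mechanism described above can be sketched as a single agent loop that round-robins between member models on successive turns, so every model sees the full shared transcript (including the other model's earlier moves) and the total call budget matches a single-model agent. This is a minimal illustration, not XBOW's implementation; the model names and the `call_model` client are hypothetical placeholders.

```python
import itertools

def alloy_schedule(models, max_steps):
    """Per-step model choice for an alloyed loop: a simple round-robin,
    so an alloy makes exactly as many calls as a single-model agent."""
    return list(itertools.islice(itertools.cycle(models), max_steps))

def alloyed_agent_loop(task, call_model,
                       models=("sonnet-4.0", "gemini-2.5-pro"),
                       max_steps=20):
    """One shared agent loop over alternating models.

    `call_model(model_name, transcript)` is a hypothetical LLM client
    returning the model's next action as a string. Each model sees the
    full transcript, so insights from one model inform the other.
    """
    transcript = [("user", task)]
    for model in alloy_schedule(models, max_steps):
        action = call_model(model, transcript)
        transcript.append((model, action))
        if action == "exploit-confirmed":  # agent-defined stop condition
            break
    return transcript
```

The design choice worth noting: nothing routes sub-tasks to a "best" model (that would be delegation, the alternative XBOW rejected); the alternation itself is the mechanism, letting each model's distinct bursts of insight land in a shared context.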