Will It Mythos?


Benchmark created to compare how well other AI models find security bugs relative to Mythos.
Corpus includes 9 confirmed bugs found by Mythos, verified by Opus.
Models tested with a simple harness, no hints, and access to full repositories.
Gemini's Antigravity CLI was unsuitable for security work due to guardrails.
Gemma 4 MoE detected 4/9 bugs but had high failure rates.
Qwen 3.6 27B performed well, beating some commercial models.
Chinese models like MiMo and DeepSeek are competitive and cheaper.
Mistral Medium failed completely, likely due to safety restrictions.
Results suggest Mythos may be stronger than the public models, which could improve with better tooling.
