Will It Mythos?
3 hours ago
- #Bug Detection
- #Mythos Comparison
- #AI Security Benchmark
- Benchmark created to compare AI models' ability to find security bugs against Mythos.
- Corpus includes 9 confirmed bugs found by Mythos, verified by Opus.
- Models tested with a simple harness, no hints, and access to full repositories.
- Gemini's Antigravity CLI was unsuitable for security work due to guardrails.
- Gemma 4 MoE detected 4/9 bugs but had high failure rates.
- Qwen 3.6 27B performed well, beating some commercial models.
- Chinese models like MiMo and DeepSeek are competitive and cheaper.
- Mistral Medium failed completely, likely due to safety restrictions.
- Results suggest Mythos may be better, but public models could improve with better tooling.