Hasty Briefs (beta)

Will It Mythos?

4 hours ago
  • #Bug Detection
  • #Mythos Comparison
  • #AI Security Benchmark
  • Benchmark created to compare AI models' ability to find security bugs against Mythos.
  • Corpus includes 9 confirmed bugs found by Mythos, verified by Opus.
  • Models were tested with a simple harness: no hints, and access to the full repositories.
  • Gemini's Antigravity CLI was unsuitable for security work due to guardrails.
  • Gemma 4 MoE detected 4/9 bugs but had high failure rates.
  • Qwen 3.6 27B performed well, beating some commercial models.
  • Chinese models like MiMo and DeepSeek are competitive and cheaper.
  • Mistral Medium failed completely, likely due to safety restrictions.
  • Results suggest Mythos may still be the stronger bug finder, but publicly available models could improve with better tooling.
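
The brief does not say how the harness scored each model, so the following is only a minimal sketch of how a tally such as "4/9 bugs" alongside a failure rate could be computed. Everything in it is an assumption made for illustration: the `RunResult` record, its fields, the `summarize` function, and the rule that a bug counts as detected if any completed run found it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One attempt by one model on one corpus bug (hypothetical record)."""
    bug_id: str      # hypothetical identifier for a bug in the corpus
    completed: bool  # False if the run crashed, refused, or timed out
    found: bool      # True if the model's report matched the known bug


def summarize(results, corpus_size):
    """Return (bugs detected, corpus size, failure rate) for one model.

    Assumption: a bug counts as detected if at least one completed run
    found it; the failure rate is the share of runs that did not complete.
    """
    detected = {r.bug_id for r in results if r.completed and r.found}
    failed = sum(1 for r in results if not r.completed)
    failure_rate = failed / len(results) if results else 0.0
    return len(detected), corpus_size, failure_rate


# Made-up example data, not results from the benchmark.
runs = [
    RunResult("bug-1", True, True),
    RunResult("bug-1", False, False),
    RunResult("bug-2", True, False),
    RunResult("bug-3", False, False),
]
detected, total, failure_rate = summarize(runs, corpus_size=9)
print(f"{detected}/{total} detected, {failure_rate:.0%} of runs failed")
```

Keeping failed runs separate from missed bugs matters for reading results like those above: a model blocked by guardrails or safety restrictions fails to complete, which is a different outcome from completing a run and overlooking the bug.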