Hasty Briefs

Anthropic's Argument for Mythos SWE-bench improvement contains a fatal error

8 hours ago
  • #benchmark integrity
  • #memorization detection
  • #LLM evaluation
  • The graph in Mythos' system card shows that after filtering out solutions that an LLM judge flags as memorized, at various confidence thresholds, Mythos still has a higher pass rate than Opus 4.6.
  • The authors argue that their imperfect memorization detector consistently indicates genuine gains for Mythos across thresholds and internal benchmarks, suggesting memorization does not explain its SWE-bench improvements.
  • A counterargument is presented using a Python simulation: an imperfect cheating detector can show the same consistent advantage, at every threshold, for a model whose gains are entirely due to cheating. The detector's evidence therefore carries no weight unless its imperfection is quantified.
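
The kind of simulation the counterargument describes can be sketched as follows. This is an illustrative reconstruction, not the article's code: the function name `simulate`, the Beta-distributed detector scores, and all parameter values are assumptions chosen only to make the effect visible. The hypothetical "cheater" model solves exactly the problems the baseline solves, plus every problem it has memorized, so its entire advantage comes from memorization.

```python
import random


def simulate(n_problems=20000, base_skill=0.5, memorized_frac=0.3,
             thresholds=(0.2, 0.4, 0.6, 0.8), seed=0):
    """Return {threshold: (baseline_pass_rate, cheater_pass_rate)}.

    All names and numbers here are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    problems = []
    for _ in range(n_problems):
        solved_honestly = rng.random() < base_skill
        memorized = rng.random() < memorized_frac

        baseline_pass = solved_honestly
        # The cheater's only advantage over the baseline is memorization.
        cheater_pass = solved_honestly or memorized

        # Imperfect detector: memorized problems tend to score higher,
        # but the two score distributions overlap heavily (assumed shapes).
        if memorized:
            score = rng.betavariate(3, 2)
        else:
            score = rng.betavariate(2, 3)
        problems.append((baseline_pass, cheater_pass, score))

    results = {}
    for t in thresholds:
        # Drop every problem the detector flags at this threshold.
        kept = [p for p in problems if p[2] < t]
        baseline_rate = sum(p[0] for p in kept) / len(kept)
        cheater_rate = sum(p[1] for p in kept) / len(kept)
        results[t] = (baseline_rate, cheater_rate)
    return results


if __name__ == "__main__":
    for t, (baseline, cheater) in simulate().items():
        print(f"threshold={t:.1f}  baseline={baseline:.3f}  cheater={cheater:.3f}")
```

Under these assumptions the cheater stays ahead of the baseline at every threshold, because some memorized problems always slip past the detector. A consistent gap across thresholds is therefore what a purely memorizing model would also produce, unless the detector's miss rate is known to be small.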