Anthropic's argument for Mythos' SWE-bench improvement contains a fatal error
8 hours ago
- #benchmark integrity
- #memorization detection
- #LLM evaluation
- The graph in Mythos' system card shows that, after filtering out solutions that an LLM judge flags as memorized at various confidence thresholds, Mythos still maintains a higher pass rate than Opus 4.6.
- The system card's authors argue that their memorization detector, though imperfect, consistently indicates genuine gains for Mythos across thresholds and internal benchmarks, which they take to suggest that memorization does not explain its SWE-bench improvements.
- The post counters with a Python simulation showing that an imperfect cheating detector can report the same consistent gains for a model whose improvement is entirely due to cheating, so the detector's evidence carries no weight unless its imperfection is quantified.
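
The counterargument in the last bullet can be illustrated with a minimal sketch. This is not the post's actual simulation, and every number in it is an assumption chosen for illustration: both models share the same genuine ability (`base_rate`), the new model additionally passes a fraction of problems purely by memorization (`memorized_frac`), and a hypothetical detector assigns each problem a noisy score drawn from a Beta distribution that is higher on average for memorized problems. Problems scoring at or above a threshold are dropped, and both models are scored on what remains.

```python
import random


def simulate(n_problems=20000, base_rate=0.5, memorized_frac=0.3,
             thresholds=(0.3, 0.5, 0.7, 0.9), seed=0):
    """Return {threshold: (old_pass_rate, new_pass_rate)} after filtering.

    All parameters are illustrative assumptions, not values from the
    system card or the original post.
    """
    rng = random.Random(seed)
    problems = []
    for _ in range(n_problems):
        # Both models solve the same problems "genuinely": identical ability.
        genuine = rng.random() < base_rate
        # Only the new model has memorized some problems; it always passes those.
        memorized = rng.random() < memorized_frac
        # Imperfect detector: scores overlap between the two groups.
        if memorized:
            score = rng.betavariate(4, 2)  # skewed high
        else:
            score = rng.betavariate(2, 4)  # skewed low
        problems.append((genuine, memorized, score))

    results = {}
    for t in thresholds:
        # Keep only problems the detector does NOT flag at this threshold.
        kept = [p for p in problems if p[2] < t]
        old_rate = sum(g for g, m, s in kept) / len(kept)
        new_rate = sum(g or m for g, m, s in kept) / len(kept)
        results[t] = (old_rate, new_rate)
    return results


results = simulate()
for t, (old_rate, new_rate) in results.items():
    print(f"threshold={t:.1f}  old={old_rate:.3f}  new={new_rate:.3f}  "
          f"gap={new_rate - old_rate:+.3f}")
```

Under these assumptions the new model has no genuine advantage at all, yet it keeps a higher pass rate at every threshold, because some memorized solutions always slip past the detector. A stricter threshold shrinks the gap but does not close it, so a persistent gap alone cannot distinguish real gains from undetected memorization; that would require knowing how often the detector misses.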