"Car Wash" test with 53 models
a day ago
- #Model Reliability
- #AI Benchmark
- #Reasoning Test
- The 'car wash test' is a one-question reasoning benchmark that most AI models answer incorrectly.
- The question is: 'I want to wash my car. The car wash is 50 meters away. Should I walk or drive?' The correct answer is 'drive' because the car needs to be at the car wash.
- In a single-run test, only 11 of 53 models (about 21%) answered correctly; the other 42 chose 'walk'.
- Models that passed the test include Claude Opus 4.6, Gemini 3 models, GPT-5, Grok-4, and a few others.
- In a 10-run consistency test, only 5 models (Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, Grok-4) answered correctly every time.
- GPT-5 answered incorrectly in 3 of its 10 runs, so its single-run pass did not hold up under repetition.
- 33 models never answered correctly in the 10-run test, including all Llama and Mistral models.
- A human baseline test with 10,000 participants showed 71.5% chose 'drive', outperforming most AI models.
- The test highlights a reliability problem for AI in production: models often fall back on a surface heuristic ('short distance = walk') instead of reasoning about context (the car itself has to get to the car wash).
- Context engineering can help improve model performance by providing structured examples and domain-specific reasoning.
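
The 10-run consistency test described above can be sketched roughly as follows. This is a minimal illustration, not the harness used in the original test: the keyword grader `classify`, the `consistency` function, and the `scripted_model` stand-in are all hypothetical names introduced here, and a real run would replace the stub with an actual model client and likely use a more careful grader. The only figures taken from the summary are the prompt, the 10-run count, and the 11-of-53 single-run result.

```python
from itertools import cycle
from typing import Callable, Dict, List

# The benchmark question, as quoted in the summary.
PROMPT = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)

# Single-run result reported in the summary: 11 of 53 models passed.
SINGLE_RUN_PASS_RATE = 11 / 53


def classify(answer: str) -> str:
    """Crude keyword grader (an assumption, not the original grading method).

    Returns 'drive' or 'walk' when exactly one of the words appears,
    and 'unclear' otherwise.
    """
    text = answer.lower()
    has_drive = "drive" in text
    has_walk = "walk" in text
    if has_drive and not has_walk:
        return "drive"
    if has_walk and not has_drive:
        return "walk"
    return "unclear"


def consistency(ask: Callable[[str], str], runs: int = 10) -> Dict[str, object]:
    """Ask the same question `runs` times and count correct ('drive') answers."""
    correct = sum(classify(ask(PROMPT)) == "drive" for _ in range(runs))
    if correct == runs:
        verdict = "always correct"
    elif correct == 0:
        verdict = "never correct"
    else:
        verdict = "inconsistent"
    return {"correct": correct, "runs": runs, "verdict": verdict}


def scripted_model(answers: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model client: replays canned answers in order."""
    replay = cycle(answers)
    return lambda prompt: next(replay)


if __name__ == "__main__":
    # A stub that is right 7 times out of 10, mirroring the GPT-5 result above.
    flaky = scripted_model(["Drive."] * 7 + ["Walk."] * 3)
    print(consistency(flaky))
    print(f"single-run pass rate: {SINGLE_RUN_PASS_RATE:.1%}")
```

Repeating the same prompt separates models that are reliably right from those that pass a single run by chance, which is the distinction the consistency test draws.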