"Car Wash" test with 53 models

a day ago

The 'car wash test' is a simple reasoning benchmark that most AI models fail.
The question is: 'I want to wash my car. The car wash is 50 meters away. Should I walk or drive?' The correct answer is 'drive' because the car needs to be at the car wash.
In a single-run test, only 11 out of 53 models (about 21%) answered correctly; the other 42 incorrectly chose 'walk'.
Models that passed the test include Claude Opus 4.6, Gemini 3 models, GPT-5, Grok-4, and a few others.
In a 10-run consistency test, only 5 models (Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, Grok-4) answered correctly every time.
GPT-5 answered incorrectly in 3 of its 10 runs, showing inconsistent reasoning.
33 models never answered correctly in the 10-run test, including all Llama and Mistral models.
A human baseline test with 10,000 participants showed 71.5% chose 'drive', a higher pass rate than most of the AI models tested.
The test highlights a reliability problem for AI in production: models often fall back on surface heuristics ('short distance = walk') instead of reasoning about the context.
Context engineering can help improve model performance by providing structured examples and domain-specific reasoning.
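
The 10-run consistency test described above can be sketched as a simple tally. This is a minimal illustration, not the original test harness: the function name `classify`, the labels, and the model name `hypothetical-llama-model` are assumptions. The answer lists only mirror the counts reported in the summary (10 of 10 for Claude Opus 4.6, 7 of 10 for GPT-5, 0 of 10 for a model that never passed).

```python
def classify(answers, correct="drive"):
    """Return (pass_count, label) for one model's repeated answers.

    The label is 'always', 'never', or 'inconsistent'. The names and
    labels are illustrative and not taken from the original test.
    """
    if not answers:
        raise ValueError("need at least one run")
    passes = sum(1 for a in answers if a.strip().lower() == correct)
    if passes == len(answers):
        label = "always"
    elif passes == 0:
        label = "never"
    else:
        label = "inconsistent"
    return passes, label


# Answer lists built to match the counts in the summary. The third
# model name is a placeholder, not a real model from the test.
runs = {
    "Claude Opus 4.6": ["drive"] * 10,
    "GPT-5": ["drive"] * 7 + ["walk"] * 3,
    "hypothetical-llama-model": ["walk"] * 10,
}

summary = {model: classify(answers) for model, answers in runs.items()}

for model, (passes, label) in summary.items():
    print(f"{model}: {passes}/{len(runs[model])} correct ({label})")
```

Separating 'always' from 'inconsistent' matters because a single run can hide unreliability: a model that is right 7 times out of 10 may still pass a one-off test.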
