Hasty Briefs

"Car Wash" test with 53 models

a day ago
  • #Model Reliability
  • #AI Benchmark
  • #Reasoning Test
  • The 'car wash test' is a simple reasoning benchmark that most AI models fail.
  • The question is: 'I want to wash my car. The car wash is 50 meters away. Should I walk or drive?' The correct answer is 'drive' because the car needs to be at the car wash.
  • In a single-run test, only 11 of 53 models answered correctly; the other 42 chose 'walk'.
  • Models that passed the test include Claude Opus 4.6, Gemini 3 models, GPT-5, Grok-4, and a few others.
  • In a 10-run consistency test, only 5 models (Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, Grok-4) answered correctly every time.
  • GPT-5 answered incorrectly in 3 of its 10 runs, so passing a single run did not mean it reasoned consistently.
  • 33 models never answered correctly in the 10-run test, including all Llama and Mistral models.
  • In a human baseline test with 10,000 participants, 71.5% chose 'drive', a better result than most AI models achieved.
  • The test highlights a reliability problem for AI in production: models often fall back on heuristics ('short distance = walk') instead of reasoning about the context.
  • Context engineering can help improve model performance by providing structured examples and domain-specific reasoning.
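
The 10-run consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used in the original test: `ask` is a hypothetical callable standing in for any model API, `classify` is an assumed keyword-based grader, and `heuristic_model` is a stub that imitates the 'short distance = walk' shortcut.

```python
from collections import Counter
from typing import Callable

# The question as quoted in the summary.
QUESTION = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)


def classify(answer: str) -> str:
    """Crude keyword grader (an assumption, not the article's grader)."""
    text = answer.lower()
    has_drive = "drive" in text
    has_walk = "walk" in text
    if has_drive and not has_walk:
        return "drive"
    if has_walk and not has_drive:
        return "walk"
    return "unclear"  # mentions both options, or neither


def run_consistency(ask: Callable[[str], str], runs: int = 10) -> Counter:
    """Ask the same question `runs` times and tally the graded answers."""
    return Counter(classify(ask(QUESTION)) for _ in range(runs))


def always_correct(tally: Counter, runs: int = 10) -> bool:
    """A model passes only if every run was graded 'drive'."""
    return tally["drive"] == runs


def heuristic_model(prompt: str) -> str:
    """Hypothetical stub that applies the 'short distance = walk' heuristic."""
    return "It's only 50 meters, so just walk."


if __name__ == "__main__":
    tally = run_consistency(heuristic_model)
    print(tally, always_correct(tally))
    # Single-run pass rate from the summary: 11 of 53 models.
    print(f"{11 / 53:.1%}")
```

Swapping `heuristic_model` for a function that calls a real model would give a per-model tally. Requiring every run to be graded 'drive' mirrors the strict "correct every time" criterion in the summary, which is why a model with 7 correct runs out of 10 still counts as inconsistent.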