Hasty Briefsbeta

Bilingual

BullshitBench

18 hours ago
  • #AI models
  • #performance metrics
  • #nonsense detection
  • Measures models' ability to detect nonsense across 100 plausible-sounding nonsense prompts in software, medical, legal, finance, and physics.
  • Percentage of nonsense questions each model detected (green), partially challenged (amber), or accepted (red).
  • Green rate (%) for each model across the 5 domain groups. Darker green = higher detection.
  • Click any cell to see example responses.
  • Detection mix by domain to compare overall vs each domain at a glance.
  • Release date vs. green rate (clear pushback %) for all organizations. Best model per release window shown.
  • Every tested model plotted by release date vs. green rate.
  • Average reasoning tokens used vs. green rate. More reasoning tokens = model 'thinking harder'.
  • Average detection rate across all models for each BS technique. Lower = harder for models to detect.