Evaluating GPT-5's reasoning ability using the Only Connect game show
- #Only Connect
- #Reasoning Benchmark
- #GPT-5
- Evaluating GPT-5's reasoning abilities beyond knowledge-based benchmarks, focusing on pattern recognition, lateral thinking, and contextual reasoning.
- Assessing decision-making in models, especially when choosing between making an educated guess and retrieving additional information.
- Comparing GPT-5's performance against earlier models while varying the reasoning-effort and verbosity parameters.
- The Only Connect game show was used as a benchmark for testing LLMs' reasoning capabilities because of its focus on lateral thinking and pattern recognition.
- The methodology involved sourcing questions from Only Connect episodes, requesting structured outputs from the models, and simulating full episodes for evaluation (see the API sketch after this list).
- Results showed that GPT-5 and reasoning-optimized models performed best, with higher reasoning-effort settings leading to better accuracy.
- The Missing Vowels round was the easiest for models, while The Wall round was the most challenging due to prompt complexity (the round's vowel-stripping format is sketched below).
- Future steps include publishing the dataset, a more granular analysis of the challenging questions, and running models against each other in competitive pairings.
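
To make the methodology concrete, below is a minimal sketch, assuming the OpenAI Python SDK's Responses API, of how a connections-round question might be posed with a structured (Pydantic-parsed) answer while varying the reasoning-effort and verbosity parameters. The `ConnectionAnswer` schema, prompt wording, and helper function are illustrative assumptions, not the post's actual code.

```python
# A minimal sketch, assuming the OpenAI Python SDK (Responses API); the
# ConnectionAnswer schema and prompts are hypothetical, not the post's code.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ConnectionAnswer(BaseModel):
    """Structured answer for a connections-round question."""
    connection: str    # the link the model thinks ties the clues together
    confidence: float  # self-reported confidence between 0 and 1


def ask_connection(clues: list[str], effort: str = "high", verbosity: str = "low") -> ConnectionAnswer:
    """Ask GPT-5 for the connection between four clues at a given
    reasoning effort ("minimal"/"low"/"medium"/"high") and verbosity."""
    response = client.responses.parse(
        model="gpt-5",
        reasoning={"effort": effort},
        text={"verbosity": verbosity},  # assumed to combine cleanly with text_format
        input=[
            {"role": "system",
             "content": "You are a contestant on Only Connect. "
                        "Name the connection between the four clues."},
            {"role": "user", "content": "Clues: " + "; ".join(clues)},
        ],
        text_format=ConnectionAnswer,  # parsed into the Pydantic model above
    )
    return response.output_parsed


# Example clue set (connection: NASA human spaceflight programmes).
print(ask_connection(["Mercury", "Gemini", "Apollo", "Artemis"]))
```

Sweeping `effort` and `verbosity` over their allowed values while replaying the same questions is one straightforward way to set up the parameter comparison described above.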
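
For context on why the Missing Vowels round is comparatively easy: the round presents an answer phrase with its vowels removed and its spacing scrambled. The helper below is a hypothetical illustration of that transformation, not the post's preprocessing code.

```python
import re


def missing_vowels_clue(answer: str, group_size: int = 3) -> str:
    """Turn an answer phrase into a Missing Vowels-style clue:
    vowels and spaces removed, letters uppercased and regrouped."""
    consonants = re.sub(r"[AEIOU\s]", "", answer.upper())
    groups = [consonants[i:i + group_size] for i in range(0, len(consonants), group_size)]
    return " ".join(groups)


# Example: "Only Connect" -> "NLY CNN CT"
print(missing_vowels_clue("Only Connect"))
```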