Evaluating GPT-5's reasoning ability using the Only Connect game show
- #Only Connect
- #Reasoning Benchmark
- #GPT-5
- Evaluating GPT-5's reasoning abilities beyond knowledge-based benchmarks, focusing on pattern recognition, lateral thinking, and contextual reasoning.
- Assessing decision-making in models, especially when choosing between making an educated guess and retrieving additional information.
- Comparing GPT-5's performance against earlier models while varying the reasoning-effort and verbosity parameters.
- The Only Connect game show was used as a benchmark for testing LLMs' reasoning capabilities because of its focus on lateral thinking and pattern recognition.
- The methodology involved sourcing questions from Only Connect episodes, requesting structured outputs from the models, and simulating full episodes for evaluation (see the API sketch after this list).
- Results showed that GPT-5 and reasoning-optimized models performed best, with higher reasoning-effort settings leading to better accuracy.
- The Missing Vowels round was the easiest for models, while The Wall round was the most challenging due to prompt complexity (the round's vowel-stripping format is sketched below).
- Future steps include publishing the dataset, a more granular analysis of the challenging questions, and running models against each other in competitive pairings.
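
To make the methodology concrete, below is a minimal sketch, assuming the OpenAI Python SDK's Responses API, of how a connections-round question might be posed with a structured (Pydantic-parsed) answer while varying the reasoning-effort and verbosity parameters. The `ConnectionAnswer` schema, prompt wording, and helper function are illustrative assumptions, not the post's actual code.

```python
# A minimal sketch, assuming the OpenAI Python SDK (Responses API); the
# ConnectionAnswer schema and prompts are hypothetical, not the post's code.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ConnectionAnswer(BaseModel):
    """Structured answer for a connections-round question."""
    connection: str    # the link the model thinks ties the clues together
    confidence: float  # self-reported confidence between 0 and 1


def ask_connection(clues: list[str], effort: str = "high", verbosity: str = "low") -> ConnectionAnswer:
    """Ask GPT-5 for the connection between four clues at a given
    reasoning effort ("minimal"/"low"/"medium"/"high") and verbosity."""
    response = client.responses.parse(
        model="gpt-5",
        reasoning={"effort": effort},
        text={"verbosity": verbosity},  # assumed to combine cleanly with text_format
        input=[
            {"role": "system",
             "content": "You are a contestant on Only Connect. "
                        "Name the connection between the four clues."},
            {"role": "user", "content": "Clues: " + "; ".join(clues)},
        ],
        text_format=ConnectionAnswer,  # parsed into the Pydantic model above
    )
    return response.output_parsed


# Example clue set (connection: NASA human spaceflight programmes).
print(ask_connection(["Mercury", "Gemini", "Apollo", "Artemis"]))
```

Sweeping `effort` and `verbosity` over their allowed values while replaying the same questions is one straightforward way to set up the parameter comparison described above.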
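
For context on why the Missing Vowels round is comparatively easy: the round presents an answer phrase with its vowels removed and its spacing scrambled. The helper below is a hypothetical illustration of that transformation, not the post's preprocessing code.

```python
import re


def missing_vowels_clue(answer: str, group_size: int = 3) -> str:
    """Turn an answer phrase into a Missing Vowels-style clue:
    vowels and spaces removed, letters uppercased and regrouped."""
    consonants = re.sub(r"[AEIOU\s]", "", answer.upper())
    groups = [consonants[i:i + group_size] for i in range(0, len(consonants), group_size)]
    return " ".join(groups)


# Example: "Only Connect" -> "NLY CNN CT"
print(missing_vowels_clue("Only Connect"))
```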