The famous o3 "GeoGuessr" prompt did not work

2 hours ago

The o3 model demonstrated surprising geolocation abilities, comparable to those of expert human GeoGuessr players.
A complex prompt believed to enhance o3's geolocation performance was tested against a basic prompt.
Benchmarking with 200 images showed the basic prompt performed slightly better on average.
The results suggest that elaborate prompts may not improve performance when models are already capable.
Models can mislead by generating plausible stories about their own reasoning and by claiming that a prompt improves their results.
o3's geolocation capabilities did not carry over to newer models such as GPT-5.4 and GPT-5.5.
Benchmarks, rather than subjective impressions, are essential for evaluating AI performance objectively.
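
The comparison summarized above, two prompts scored on the same set of images, can be sketched roughly as follows. The summary does not say how guesses were scored, so this sketch assumes great-circle distance error in kilometers and a paired bootstrap over images. The names `haversine_km` and `compare_prompts` are hypothetical, and no numbers from the original benchmark are reproduced here.

```python
import math
import random
from statistics import mean

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against floating-point values marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def compare_prompts(truth, guesses_a, guesses_b, n_boot=2000, seed=0):
    """Compare two prompts on the same images (hypothetical helper).

    truth, guesses_a, guesses_b: equal-length lists of (lat, lon) tuples,
    where index i refers to the same image in all three lists.
    Returns mean errors, the mean paired difference (A minus B), and a
    95% percentile-bootstrap interval for that difference.
    """
    err_a = [haversine_km(*t, *g) for t, g in zip(truth, guesses_a)]
    err_b = [haversine_km(*t, *g) for t, g in zip(truth, guesses_b)]
    diffs = [a - b for a, b in zip(err_a, err_b)]

    # Resample images with replacement, keeping each image's pair together.
    rng = random.Random(seed)
    boot = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    ci95 = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])

    return {
        "mean_a": mean(err_a),
        "mean_b": mean(err_b),
        "mean_diff": mean(diffs),
        "ci95": ci95,
    }
```

A negative `mean_diff` means prompt A had the lower average error. If the interval in `ci95` spans zero, the image set does not clearly separate the two prompts, which is the kind of result that would be described as one prompt doing only "slightly better on average".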

Hasty Briefs (beta)