UCSD: Large Language Models Pass the Turing Test

a year ago

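When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time in a Turing test, significantly more often than interrogators picked the real human participant.
LLaMa-3.1, given the same prompt, was judged to be the human 56% of the time, not significantly more or less often than the real humans it was compared against.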
The baseline systems, ELIZA and GPT-4o, were judged to be the human only 23% and 21% of the time respectively, significantly below the 50% chance level.
This study provides the first empirical evidence that an artificial system can pass a standard three-party Turing test.
The results have implications for understanding the intelligence of Large Language Models (LLMs) and their potential social and economic impacts.
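
The percentages above are easiest to read against the 50% rate expected if interrogators in a three-party test were guessing at random between the two witnesses. As a rough illustration of what "significantly above" or "below" chance means, the sketch below runs an exact two-sided binomial test on each reported rate. The trial count `HYPOTHETICAL_GAMES = 100` is an assumption made purely for illustration, not the study's actual sample size, and the function names are invented here; the study's own statistical analysis may well differ.

```python
import math

# Percent of games in which each system was judged to be the human
# (figures as reported in the summary above).
REPORTED_WIN_PERCENT = {
    "GPT-4.5": 73,
    "LLaMa-3.1": 56,
    "ELIZA": 23,
    "GPT-4o": 21,
}

# ASSUMPTION: a made-up number of games per system, for illustration only.
HYPOTHETICAL_GAMES = 100


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: total probability of every outcome
    that is no more likely than the observed count k under rate p."""
    observed = binomial_pmf(k, n, p)
    tolerance = observed * 1e-9  # guards against floating-point ties
    total = sum(
        binomial_pmf(i, n, p)
        for i in range(n + 1)
        if binomial_pmf(i, n, p) <= observed + tolerance
    )
    return min(1.0, total)


def compare_to_chance(win_percent: int, n: int, alpha: float = 0.05) -> str:
    """Classify a win rate as above, below, or not distinguishable from 50%."""
    wins = win_percent * n // 100
    if two_sided_p_value(wins, n) >= alpha:
        return "not distinguishable from chance"
    return "above chance" if wins > n / 2 else "below chance"


if __name__ == "__main__":
    for system, percent in REPORTED_WIN_PERCENT.items():
        verdict = compare_to_chance(percent, HYPOTHETICAL_GAMES)
        print(f"{system}: {percent}% judged human, {verdict}")
```

With a different number of games the verdicts could change, especially for a rate as close to 50% as LLaMa-3.1's, which is why the sample size matters as much as the percentage itself.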

Hasty Briefs (beta)