LLMs Get Lost in Multi-Turn Conversation

a year ago

Large Language Models (LLMs) are conversational interfaces that assist users in defining, exploring, and refining tasks through multi-turn conversations.
LLM evaluation has predominantly focused on single-turn, fully-specified instructions, despite frequent underspecification in user instructions.
Experiments show that LLMs perform significantly worse in multi-turn conversations than in single-turn settings, with an average performance drop of 39% across six tasks.
Performance degradation in multi-turn conversations is due to a minor loss in aptitude and a significant increase in unreliability.
LLMs often make assumptions early in conversations and prematurely generate final solutions, leading to errors they do not recover from.
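
The split between aptitude and unreliability can be made concrete with a small sketch. The summary above does not define either term, so the definitions here are assumptions: aptitude is taken as the 90th-percentile score over repeated runs of the same task, and unreliability as the gap between the 90th and 10th percentiles. The function names and sample scores are hypothetical and serve only to illustrate the arithmetic.

```python
def percentile(scores, q):
    """Return the q-th percentile (q in [0, 100]) using linear interpolation."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(scores):
    """Summarize repeated runs of one task (assumed definitions, see lead-in)."""
    aptitude = percentile(scores, 90)                   # best-case behavior
    unreliability = aptitude - percentile(scores, 10)   # best-to-worst spread
    mean = sum(scores) / len(scores)
    return {"mean": mean, "aptitude": aptitude, "unreliability": unreliability}


def relative_drop(single_turn_mean, multi_turn_mean):
    """Fractional performance drop going from single-turn to multi-turn."""
    return (single_turn_mean - multi_turn_mean) / single_turn_mean


# Hypothetical scores (0-100) from ten simulated runs of the same task.
single_turn = [90, 88, 92, 91, 89, 90, 93, 87, 90, 90]
multi_turn = [85, 30, 70, 20, 88, 45, 60, 25, 80, 50]

single = summarize(single_turn)
multi = summarize(multi_turn)

aptitude_loss = single["aptitude"] - multi["aptitude"]
unreliability_gain = multi["unreliability"] - single["unreliability"]
drop = relative_drop(single["mean"], multi["mean"])

print(f"aptitude loss:      {aptitude_loss:.1f}")
print(f"unreliability gain: {unreliability_gain:.1f}")
print(f"relative drop:      {drop:.1%}")
```

With data shaped like this, the best runs in the multi-turn setting score almost as well as the single-turn ones, while the worst runs fall far behind. That is the pattern the summary describes: a small loss in aptitude alongside a large increase in unreliability.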
