Getting an LLM to Play Text Adventures

9 months ago

Research investigates LLMs playing text adventures, with mixed results.
ChatGPT 3.5 and GPT-4o-mini show limited capability in text adventure games.
LLMs struggle with state transitions in text adventures, getting them wrong 40% of the time.
Prompt engineering is used to guide LLMs, but they still suffer from failures such as context poisoning.
LLMs often get stuck in loops or obsess over irrelevant details.
Examples include failing to place a gold watch on the floor and misusing commands.
LLMs sometimes ignore hints and revert to previous obsessions.
Performance varies by model, with Claude 3.5 Haiku showing some promise but still flawed.
Cost is a significant barrier, with $1 spent to complete an easy text adventure.
Future work includes benchmarking different LLMs on text adventure performance.
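
As a rough illustration of how two of the failure modes above could be measured in a benchmark, here is a minimal Python sketch. It is not taken from the research summarized here: the function names, the string representation of game states and commands, and the `window` and `max_repeats` defaults are all assumptions made for the example.

```python
from collections import Counter
from typing import Sequence


def transition_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of steps where the model's predicted state matched the game's state.

    Hypothetical helper: states are assumed to be comparable strings.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def is_looping(commands: Sequence[str], window: int = 6, max_repeats: int = 3) -> bool:
    """True if one command appears at least `max_repeats` times
    among the last `window` commands (thresholds are arbitrary assumptions)."""
    recent = [c.strip().lower() for c in commands[-window:]]
    if not recent:
        return False
    most_common_count = Counter(recent).most_common(1)[0][1]
    return most_common_count >= max_repeats


if __name__ == "__main__":
    # Made-up transcript data, for illustration only.
    predicted = ["hall", "kitchen", "hall", "cellar", "attic"]
    actual = ["hall", "kitchen", "garden", "cellar", "roof"]
    print(transition_accuracy(predicted, actual))
    print(is_looping(["look", "drop watch", "drop watch", "look", "drop watch"]))
```

Under this scoring, getting state transitions wrong 40% of the time corresponds to a transition accuracy of 0.6. The loop check is deliberately crude: it only counts exact repeats of a command, so a model that rephrases the same failing action would not be flagged.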
