Evaluating LLMs Playing Text Adventures
12 days ago
- #Text Adventures
- #AI Performance
- #LLM Evaluation
- LLMs are evaluated on their ability to play text adventures using a turn limit and an achievement-based scoring system.
- Achievements are defined for early-game actions so that progress can be measured within the limited number of turns.
- Models like Claude 4 Sonnet and Gemini 2.5 Flash perform well, with Gemini 2.5 Flash being cost-effective.
- Evaluation shows significant performance variation across different games, with linear games being easier to assess.
- The testing protocol reveals large score variations in some games, indicating the need for careful game selection in evaluations.
- Conclusions highlight the feasibility of using Perl to connect LLMs to text adventures, but note LLMs' limited proficiency without guidance.
- Gemini 2.5 Flash is recommended for cost-effective performance, though Claude 4 Sonnet shows strong results, possibly due to prompt calibration.
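The turn-limited, achievement-based scoring described above can be sketched roughly as follows. This is an illustrative Python sketch, not the author's actual harness (the post uses Perl): the turn limit, the achievement names, and the transcript-matching predicates are all hypothetical stand-ins.

```python
# Illustrative sketch of achievement-based scoring under a turn limit.
# TURN_LIMIT and the achievement definitions are invented for this example;
# the original post's actual values are not given in the summary.

TURN_LIMIT = 40  # hypothetical cap on turns per playthrough

# Hypothetical early-game achievements: each maps a name to a predicate
# over the list of game responses seen so far.
ACHIEVEMENTS = {
    "left_start_room": lambda log: any("You leave" in t for t in log),
    "picked_up_lamp": lambda log: any("Taken" in t for t in log),
}

def score_playthrough(transcript):
    """Count distinct achievements earned within the first TURN_LIMIT turns."""
    log = transcript[:TURN_LIMIT]
    earned = {name for name, check in ACHIEVEMENTS.items() if check(log)}
    return len(earned), sorted(earned)
```

For example, a transcript containing "You leave the cottage." and "Taken." would earn both achievements and score 2. Scoring only early-game achievements keeps the metric meaningful even when no model finishes a game within the turn budget.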