Evaluating LLMs Playing Text Adventures
12 days ago
- #Text Adventures
- #AI Performance
- #LLM Evaluation
- LLMs are evaluated on their ability to play text adventures using a turn limit and an achievement-based scoring system.
- Achievements are defined for early-game actions so that progress can be measured within the limited number of turns.
- Models like Claude 4 Sonnet and Gemini 2.5 Flash perform well, with Gemini 2.5 Flash being cost-effective.
- Evaluation shows significant performance variation across different games, with linear games being easier to assess.
- The testing protocol reveals large score variations in some games, indicating the need for careful game selection in evaluations.
- Conclusions highlight the feasibility of using Perl to connect LLMs to text adventures, but note LLMs' limited proficiency without guidance.
- Gemini 2.5 Flash is recommended for cost-effective performance, though Claude 4 Sonnet shows strong results, possibly due to prompt calibration.
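The turn-limited, achievement-based scoring described above can be sketched roughly as follows. This is an illustrative Python sketch, not the author's actual harness (the post uses Perl): the turn limit, the achievement names, and the transcript-matching predicates are all hypothetical stand-ins.

```python
# Illustrative sketch of achievement-based scoring under a turn limit.
# TURN_LIMIT and the achievement definitions are invented for this example;
# the original post's actual values are not given in the summary.

TURN_LIMIT = 40  # hypothetical cap on turns per playthrough

# Hypothetical early-game achievements: each maps a name to a predicate
# over the list of game responses seen so far.
ACHIEVEMENTS = {
    "left_start_room": lambda log: any("You leave" in t for t in log),
    "picked_up_lamp": lambda log: any("Taken" in t for t in log),
}

def score_playthrough(transcript):
    """Count distinct achievements earned within the first TURN_LIMIT turns."""
    log = transcript[:TURN_LIMIT]
    earned = {name for name, check in ACHIEVEMENTS.items() if check(log)}
    return len(earned), sorted(earned)
```

For example, a transcript containing "You leave the cottage." and "Taken." would earn both achievements and score 2. Scoring only early-game achievements keeps the metric meaningful even when no model finishes a game within the turn budget.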