Hasty Briefs

Evaluating LLMs Playing Text Adventures

12 days ago
  • #Text Adventures
  • #AI Performance
  • #LLM Evaluation
  • LLMs are evaluated on playing text adventures using a turn limit and an achievement-based scoring system.
  • Achievements are defined for early game actions to measure progress within limited turns.
  • Models like Claude 4 Sonnet and Gemini 2.5 Flash perform well, with Gemini 2.5 Flash being cost-effective.
  • Evaluation shows significant performance variation across different games, with linear games being easier to assess.
  • The testing protocol reveals large score variations in some games, indicating the need for careful game selection in evaluations.
  • Conclusions confirm that connecting LLMs to text adventures via Perl is feasible, but note that LLMs show limited proficiency without guidance.
  • Gemini 2.5 Flash is recommended for cost-effective performance, though Claude 4 Sonnet shows strong results, possibly due to prompt calibration.
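The turn-limited, achievement-based scoring described above could be sketched roughly as follows. This is a hypothetical Python illustration, not the post's Perl harness: `ask_llm` and `send_command` are placeholder hooks for the model and the game interpreter, and the achievement predicates over the transcript are assumptions about how early-game progress might be detected.

```python
def play_with_achievements(ask_llm, send_command, achievements, turn_limit=40):
    """Run an LLM through a text adventure under a turn limit.

    ask_llm:      callable(transcript) -> next game command (model hook).
    send_command: callable(command) -> game output (interpreter hook).
    achievements: dict name -> predicate(transcript) marking early-game progress.
    Returns (score in [0, 1], set of unlocked achievement names).
    """
    # Seed the transcript with the opening room description.
    transcript = send_command("look")
    unlocked = set()
    for _ in range(turn_limit):
        command = ask_llm(transcript)          # model proposes the next move
        output = send_command(command)         # game responds
        transcript += "\n> " + command + "\n" + output
        # Check every still-locked achievement against the full transcript.
        for name, check in achievements.items():
            if name not in unlocked and check(transcript):
                unlocked.add(name)
    return len(unlocked) / len(achievements), unlocked
```

Scoring unlocked achievements as a fraction, rather than relying on the game's own point total, keeps scores comparable across games of very different lengths, which matters given the large per-game variation the evaluation reports.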