Evaluating LLMs for my personal use case

18 days ago

The author evaluated various LLMs for personal use, focusing on basic questions about Rust, Python, Linux, and everyday life.
130 real prompts were categorized into Programming, Sysadmin, Technical explanations, and General knowledge/creative tasks.
Models evaluated included Claude Sonnet, DeepSeek, Gemini, Kimi, GPT-OSS-120B, Qwen, and GLM, among others.
OpenRouter was used for the evaluations because of its broad model availability, low latency, and cost-effectiveness.
Key findings: Most models performed well, with cost and latency being major differentiators. Closed models were not superior to open ones.
Gemini 2.5 Flash was notably fast, while Gemini 2.5 Pro was overpriced. Enabling reasoning rarely improved results, except in creative tasks like poetry.
The author's current setup involves querying multiple models simultaneously for quick answers and second opinions.
A favorite poem about Florida, written in the style of Shel Silverstein, was shared as a bonus.
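
The multi-model setup mentioned above (one prompt sent to several models at once, for a quick answer plus second opinions) could look roughly like the sketch below. This is not the author's code: the function names `fan_out` and `fake_ask`, the model names, and the result shape are all assumptions made for illustration. The actual network call is left as a pluggable `ask` callable, so the sketch does not depend on any particular provider API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def fan_out(
    prompt: str,
    models: List[str],
    ask: Callable[[str, str], str],
) -> Dict[str, Tuple[str, float]]:
    """Send one prompt to every model concurrently.

    `ask(model, prompt)` is a hypothetical callable that returns the model's
    answer as text. Returns {model: (answer, elapsed_seconds)}.
    """

    def timed(model: str) -> Tuple[str, str, float]:
        start = time.perf_counter()
        try:
            answer = ask(model, prompt)
        except Exception as exc:
            # One failing model should not hide the other answers.
            answer = f"[error: {exc}]"
        return model, answer, time.perf_counter() - start

    # One thread per model: the work is I/O-bound, so threads are enough.
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        results = list(pool.map(timed, models))
    return {model: (answer, seconds) for model, answer, seconds in results}


def fake_ask(model: str, prompt: str) -> str:
    """Stand-in for a real API call, used only to exercise the sketch."""
    return f"{model} says: {prompt.upper()}"


if __name__ == "__main__":
    replies = fan_out("how do I list open ports?", ["model-a", "model-b"], fake_ask)
    for model, (answer, seconds) in replies.items():
        print(f"{model} ({seconds:.3f}s): {answer}")
```

Recording the elapsed time per model matches the summary's point that latency, alongside cost, is what mostly separates otherwise comparable models.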
