Evaluating LLMs for my personal use case
18 days ago
- #AI Models
- #Open Router
- #LLM Evaluation
- The author evaluated various LLMs for personal use, focusing on basic Rust, Python, Linux, and life questions.
- 130 real prompts were categorized into Programming, Sysadmin, Technical explanations, and General knowledge/creative tasks.
- Models evaluated included Claude Sonnet, DeepSeek, Gemini, Kimi, GPT-OSS-120B, Qwen, and GLM, among others.
- Open Router was used for evaluations due to its comprehensive model availability, low latency, and cost-effectiveness.
- Key findings: Most models performed well, with cost and latency being major differentiators. Closed models were not superior to open ones.
- Gemini 2.5 Flash was notably fast, while Gemini 2.5 Pro was overpriced. Reasoning rarely improved results except in creative tasks like poetry.
- The author's current setup involves querying multiple models simultaneously for quick answers and second opinions.
- A favorite poem about Florida, written in the style of Shel Silverstein, was shared as a bonus.