Hasty Briefs (beta)

Evaluating LLMs for my personal use case

18 days ago
  • #AI Models
  • #OpenRouter
  • #LLM Evaluation
  • The author evaluated various LLMs for personal use, focusing on basic Rust, Python, Linux, and life questions.
  • 130 real prompts were categorized into Programming, Sysadmin, Technical explanations, and General knowledge/creative tasks.
  • Models evaluated included Claude Sonnet, DeepSeek, Gemini, Kimi, GPT-OSS-120B, Qwen, and GLM, among others.
  • OpenRouter was used for the evaluations because of its broad model availability, low latency, and cost-effectiveness.
  • Key findings: Most models performed well, with cost and latency being major differentiators. Closed models were not superior to open ones.
  • Gemini 2.5 Flash was notably fast, while Gemini 2.5 Pro was overpriced. Reasoning rarely improved results except in creative tasks like poetry.
  • The author's current setup involves querying multiple models simultaneously for quick answers and second opinions.
  • A favorite poem about Florida, written in the style of Shel Silverstein, was shared as a bonus.
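The multi-model setup mentioned above (one prompt sent to several models at once, for a quick answer plus second opinions) might look roughly like the sketch below. This is an illustration under stated assumptions, not the author's actual tooling: `fan_out`, `Answer`, `fake_ask`, and the model names are all hypothetical, and the backend is a stub so the example runs offline. A real backend would replace `fake_ask` with a function that calls a model provider's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, NamedTuple


class Answer(NamedTuple):
    model: str
    text: str
    seconds: float  # wall-clock time this model took to reply


def fan_out(
    prompt: str,
    models: List[str],
    ask: Callable[[str, str], str],
) -> Dict[str, Answer]:
    """Send one prompt to every model concurrently and time each reply."""

    def timed(model: str) -> Answer:
        start = time.perf_counter()
        text = ask(model, prompt)
        return Answer(model, text, time.perf_counter() - start)

    # One worker per model so a slow model does not hold up the others.
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        return {answer.model: answer for answer in pool.map(timed, models)}


def fake_ask(model: str, prompt: str) -> str:
    # Stand-in backend (assumption): echoes the prompt instead of calling an API.
    return f"[{model}] {prompt}"


# Hypothetical model names, used only for illustration.
MODELS = ["fast-model", "second-opinion-model"]

answers = fan_out("How do I list open ports on Linux?", MODELS, fake_ask)

# Show the fastest reply first, since latency was a major differentiator.
for answer in sorted(answers.values(), key=lambda a: a.seconds):
    print(f"{answer.model} ({answer.seconds:.3f}s): {answer.text}")
```

Passing the backend in as a callable keeps the concurrency logic separate from any particular provider, so the same function can be exercised with a stub or with a real client.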