Hasty Briefsbeta

Bilingual

Cache hit rates of Inference are more meaningful than the headline costs

9 hours ago
  • #agent workflows
  • #cache hit rates
  • #LLM pricing
  • Agent workflows involve many turns, leading to high context processing costs due to repeated full conversation history.
  • Cache hit rates are crucial for cost efficiency, as they reduce the need to process new tokens.
  • Analysis of 60+ providers reveals DeepSeek and other Chinese labs have highest cache rates (>75%), while some providers like io.net have 0%.
  • Cache hit rates vary significantly across providers, affecting effective input pricing; e.g., DeepSeek V4 Pro's cheapest price is $0.056 vs. $1.722 on Parasail.
  • US labs like Google show lower cache rates compared to competitors, even on their own hardware.
  • Small models can be more expensive than larger ones due to low cache rates; e.g., DeepSeek V4 Flash is cheaper than Qwen3.6 models.
  • The 'cheap' providers may not be cost-effective due to low cache hit rates and price increases.
  • Hybrid or local setups are recommended for coding agents due to rising inference costs.