Inference cache hit rates are more meaningful than headline costs
10 hours ago
- #agent workflows
- #cache hit rates
- #LLM pricing
- Agent workflows run for many turns, and each turn resends the full conversation history, so context processing accounts for much of the cost.
- Cache hit rates are crucial for cost efficiency: tokens served from cache do not have to be reprocessed as if they were new input.
- An analysis of 60+ providers shows that DeepSeek and other Chinese labs have the highest cache hit rates (>75%), while some providers, such as io.net, show 0%.
- Cache hit rates vary significantly across providers, which changes the effective input price; e.g., the cheapest effective price for DeepSeek V4 Pro is $0.056, versus $1.722 on Parasail.
- US labs such as Google show lower cache hit rates than competitors, even on their own hardware.
- Small models can end up more expensive than larger ones because of low cache hit rates; e.g., DeepSeek V4 Flash works out cheaper than the Qwen3.6 models.
- Nominally 'cheap' providers may not be cost-effective once low cache hit rates and price increases are factored in.
- Hybrid or local setups are recommended for coding agents due to rising inference costs.
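
The arithmetic behind these points can be sketched in a few lines. This is a minimal illustration, not the method used in the analysis: it assumes every turn adds a fixed number of tokens and resends the whole history, and that the effective input price is a simple blend of a cached-token price and a full list price weighted by the cache hit rate. The function names and all prices below are hypothetical.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a session that resends the full history.

    Assumption: each turn adds `tokens_per_turn` tokens, so turn k sends
    k * tokens_per_turn tokens and the total grows quadratically with turns.
    """
    return tokens_per_turn * turns * (turns + 1) // 2


def effective_input_price(list_price: float, cached_price: float, hit_rate: float) -> float:
    """Blended price per million input tokens for a given cache hit rate.

    Assumption: cache hits are billed at `cached_price`, misses at `list_price`.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * list_price


if __name__ == "__main__":
    # Hypothetical session: 50 turns, 2,000 new tokens per turn.
    tokens = cumulative_input_tokens(turns=50, tokens_per_turn=2_000)

    # Hypothetical providers (USD per million input tokens).
    # A has the higher headline price but a high cache hit rate.
    price_a = effective_input_price(list_price=1.00, cached_price=0.10, hit_rate=0.80)
    # B has the lower headline price but never hits the cache.
    price_b = effective_input_price(list_price=0.50, cached_price=0.50, hit_rate=0.0)

    for name, price in (("A", price_a), ("B", price_b)):
        cost = tokens / 1_000_000 * price
        print(f"Provider {name}: ${price:.3f}/M effective, ${cost:.2f} per session")
```

With these made-up numbers, provider A's effective price comes to about $0.28 per million tokens against $0.50 for provider B, even though A's headline price is twice as high. Real billing is often more involved (cache write surcharges, expiry windows, minimum prefix lengths), so treat this as a rough comparison tool only.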