Cache hit rates of Inference are more meaningful than the headline costs

25 days ago

Agent workflows involve many turns, leading to high context processing costs due to repeated full conversation history.
Cache hit rates are crucial for cost efficiency, as they reduce the need to process new tokens.
Analysis of 60+ providers reveals DeepSeek and other Chinese labs have highest cache rates (>75%), while some providers like io.net have 0%.
Cache hit rates vary significantly across providers, affecting effective input pricing; e.g., DeepSeek V4 Pro's cheapest price is $0.056 vs. $1.722 on Parasail.
US labs like Google show lower cache rates compared to competitors, even on their own hardware.
Small models can be more expensive than larger ones due to low cache rates; e.g., DeepSeek V4 Flash is cheaper than Qwen3.6 models.
The 'cheap' providers may not be cost-effective due to low cache hit rates and price increases.
Hybrid or local setups are recommended for coding agents due to rising inference costs.

Hasty Briefsbeta