Prompt caching: 10x cheaper LLM tokens, but how?
3 days ago
- #LLM
- #Attention Mechanism
- #Prompt Caching
- Cached input tokens are 10x cheaper than regular input tokens for OpenAI and Anthropic APIs.
- Prompt caching can reduce latency by up to 85% for long prompts.
- Cached tokens are not saved responses; the savings come from KV caching, i.e. reusing the Key and Value matrices produced by the attention mechanism.
- LLMs convert text into tokens, then embeddings, which are processed through attention mechanisms.
- Attention mechanisms compute weights that say how much each token should attend to every other token in its context (see the attention sketch after this list).
- KV caching stores the Key and Value matrices already computed for a repeated prompt prefix, so later requests don't have to recompute them (see the KV-cache sketch below).
- OpenAI and Anthropic handle caching differently: OpenAI applies it automatically, while Anthropic exposes explicit cache breakpoints for more control (request sketches below).
- Sampling parameters like temperature, top_p, and top_k affect output randomness but have no bearing on whether a prompt prefix gets cached.
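A minimal sketch of the tokens → embeddings → attention pipeline described above, using NumPy with toy dimensions (the token IDs, vocabulary size, and weight matrices here are made-up stand-ins, not any real model's): token IDs are looked up as embeddings, and scaled dot-product attention turns them into per-token weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 "tokens" already mapped to IDs by some tokenizer (hypothetical values).
token_ids = np.array([101, 2009, 2003, 102])
vocab_size, d_model = 30_000, 8

# Embedding lookup: each token ID becomes a d_model-dimensional vector.
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                      # shape: (4, d_model)

# Learned projections (random here) turn embeddings into Queries, Keys, Values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: the weights say how strongly each token attends to the others.
scores = Q @ K.T / np.sqrt(d_model)                 # shape: (4, 4)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
output = weights @ V                                # context-aware token representations

print(weights.round(2))  # each row sums to 1: the "importance" of every token to that token
```

In a real decoder-only LLM these weights are causally masked, so each token only attends to tokens before it; that is the property that makes the Keys and Values for a fixed prefix reusable.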
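Continuing the same toy setup, here is a rough illustration of why caching K and V for a shared prefix saves work (a sketch, not any provider's actual implementation): when a new token arrives, only its own Q, K, V projections are computed, and the prefix's K and V come straight from the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query against all cached keys/values."""
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V

# Pretend these are the embeddings of a long, repeated prompt prefix (e.g. a system prompt).
prefix = rng.normal(size=(1000, d_model))

# Computed once, then stored: this is what "cached tokens" actually refers to.
K_cache = prefix @ W_k
V_cache = prefix @ W_v

# A new token arrives after the same prefix: only ITS projections are computed.
new_token = rng.normal(size=(d_model,))
q, k, v = new_token @ W_q, new_token @ W_k, new_token @ W_v

# Extend the cache and attend over prefix + new token without re-projecting the prefix.
K = np.vstack([K_cache, k])
V = np.vstack([V_cache, v])
out = attend(q, K, V)
print(out.shape)  # (8,) — computed without redoing 1000 tokens' worth of K/V projections
```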
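And a hedged sketch of how the two providers expose this (request shapes paraphrased from memory; check the official docs before relying on exact field or model names): OpenAI's caching kicks in automatically once a prompt is long enough and the prefix is byte-identical across calls, while Anthropic's Messages API lets you mark where the cacheable prefix ends with a `cache_control` block.

```python
# A placeholder for the large, reused prefix you want cached across requests.
LONG_SYSTEM_PROMPT = "<several thousand tokens of reused instructions or context>"

# Anthropic: you explicitly mark the end of the cacheable prefix with cache_control.
anthropic_request = {
    "model": "claude-3-5-sonnet-latest",             # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,               # the big, reused prefix
            "cache_control": {"type": "ephemeral"},   # "cache everything up to here"
        }
    ],
    "messages": [{"role": "user", "content": "First question about the document."}],
}

# OpenAI: no caching-specific field at all; sufficiently long prompts are cached
# automatically, and the response's usage block reports how many tokens were cache hits.
openai_request = {
    "model": "gpt-4o-mini",                           # placeholder model name
    "messages": [
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # keep this prefix identical across calls
        {"role": "user", "content": "First question about the document."},
    ],
}
```

The practical upshot: with OpenAI you get caching for free by keeping the shared prefix at the very start of the prompt and unchanged between requests; with Anthropic you choose which blocks are worth caching.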