Prompt caching: 10x cheaper LLM tokens, but how?
3 days ago
- #LLM
- #Attention Mechanism
- #Prompt Caching
- Cached input tokens are 10x cheaper than regular input tokens for OpenAI and Anthropic APIs.
- Prompt caching can reduce latency by up to 85% for long prompts.
- Cached tokens are not saved responses; the savings come from KV caching, i.e. reusing the Key and Value matrices produced by the attention mechanism.
- LLMs convert text into tokens, then embeddings, which are processed through attention mechanisms.
- Attention mechanisms compute weights that say how much each token should attend to every other token in its context (see the attention sketch after this list).
- KV caching stores the Key and Value matrices already computed for a repeated prompt prefix, so later requests don't have to recompute them (see the KV-cache sketch below).
- OpenAI and Anthropic handle caching differently: OpenAI applies it automatically, while Anthropic exposes explicit cache breakpoints for more control (request sketches below).
- Sampling parameters like temperature, top_p, and top_k affect output randomness but have no bearing on whether a prompt prefix gets cached.
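A minimal sketch of the tokens → embeddings → attention pipeline described above, using NumPy with toy dimensions (the token IDs, vocabulary size, and weight matrices here are made-up stand-ins, not any real model's): token IDs are looked up as embeddings, and scaled dot-product attention turns them into per-token weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 "tokens" already mapped to IDs by some tokenizer (hypothetical values).
token_ids = np.array([101, 2009, 2003, 102])
vocab_size, d_model = 30_000, 8

# Embedding lookup: each token ID becomes a d_model-dimensional vector.
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                      # shape: (4, d_model)

# Learned projections (random here) turn embeddings into Queries, Keys, Values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: the weights say how strongly each token attends to the others.
scores = Q @ K.T / np.sqrt(d_model)                 # shape: (4, 4)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
output = weights @ V                                # context-aware token representations

print(weights.round(2))  # each row sums to 1: the "importance" of every token to that token
```

In a real decoder-only LLM these weights are causally masked, so each token only attends to tokens before it; that is the property that makes the Keys and Values for a fixed prefix reusable.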
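Continuing the same toy setup, here is a rough illustration of why caching K and V for a shared prefix saves work (a sketch, not any provider's actual implementation): when a new token arrives, only its own Q, K, V projections are computed, and the prefix's K and V come straight from the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query against all cached keys/values."""
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V

# Pretend these are the embeddings of a long, repeated prompt prefix (e.g. a system prompt).
prefix = rng.normal(size=(1000, d_model))

# Computed once, then stored: this is what "cached tokens" actually refers to.
K_cache = prefix @ W_k
V_cache = prefix @ W_v

# A new token arrives after the same prefix: only ITS projections are computed.
new_token = rng.normal(size=(d_model,))
q, k, v = new_token @ W_q, new_token @ W_k, new_token @ W_v

# Extend the cache and attend over prefix + new token without re-projecting the prefix.
K = np.vstack([K_cache, k])
V = np.vstack([V_cache, v])
out = attend(q, K, V)
print(out.shape)  # (8,) — computed without redoing 1000 tokens' worth of K/V projections
```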
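And a hedged sketch of how the two providers expose this (request shapes paraphrased from memory; check the official docs before relying on exact field or model names): OpenAI's caching kicks in automatically once a prompt is long enough and the prefix is byte-identical across calls, while Anthropic's Messages API lets you mark where the cacheable prefix ends with a `cache_control` block.

```python
# A placeholder for the large, reused prefix you want cached across requests.
LONG_SYSTEM_PROMPT = "<several thousand tokens of reused instructions or context>"

# Anthropic: you explicitly mark the end of the cacheable prefix with cache_control.
anthropic_request = {
    "model": "claude-3-5-sonnet-latest",             # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,               # the big, reused prefix
            "cache_control": {"type": "ephemeral"},   # "cache everything up to here"
        }
    ],
    "messages": [{"role": "user", "content": "First question about the document."}],
}

# OpenAI: no caching-specific field at all; sufficiently long prompts are cached
# automatically, and the response's usage block reports how many tokens were cache hits.
openai_request = {
    "model": "gpt-4o-mini",                           # placeholder model name
    "messages": [
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # keep this prefix identical across calls
        {"role": "user", "content": "First question about the document."},
    ],
}
```

The practical upshot: with OpenAI you get caching for free by keeping the shared prefix at the very start of the prompt and unchanged between requests; with Anthropic you choose which blocks are worth caching.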