- Cached input tokens are much cheaper than regular input tokens: Anthropic bills cache reads at roughly 10% of the base input price (cache writes carry a ~25% surcharge), while OpenAI discounts cached input tokens by about 50%.
- Prompt caching can cut latency for long prompts by up to 85% (Anthropic's figure; OpenAI cites up to 80%).
- Cached tokens are not saved responses: what gets stored is the KV cache, the Key and Value matrices the attention mechanism computes for the prompt prefix.
- LLMs convert text into tokens, then embeddings, which are processed through attention mechanisms.
- Attention mechanisms determine the importance of each token in context using weights.
- KV caching avoids recalculating attention weights for repeated prompt prefixes, saving computation.
- OpenAI and Anthropic handle caching differently: OpenAI caches automatically once a prompt exceeds 1,024 tokens and its prefix matches a recent request, while Anthropic gives explicit control through cache_control breakpoints.
- Parameters like temperature, top_p, and top_k affect output randomness but not prompt caching.
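To make the KV-caching bullets concrete, here is a toy Python sketch of a single-query attention step where the prefix's Key/Value vectors are computed once and reused on the next request. The kv_for projection and the embeddings are made-up illustrative numbers, not a real model's weights.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, ks, vs):
    # Scaled dot-product attention for a single query vector:
    # weights say how important each cached token is to the query.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(vs[0]))]

# Toy "projections": real models derive K and V with learned weight matrices.
def kv_for(embedding):
    return [x * 0.5 for x in embedding], [x * 2.0 for x in embedding]

prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shared prompt prefix
suffix = [[0.5, 0.5]]                          # new tokens this request

# Without caching: recompute K/V for every token on every request.
full = prefix + suffix
ks_full = [kv_for(e)[0] for e in full]
vs_full = [kv_for(e)[1] for e in full]
out_uncached = attend(suffix[0], ks_full, vs_full)

# With KV caching: the prefix's K/V were stored by a prior request;
# only the new token's K/V are computed now.
kv_cache = [kv_for(e) for e in prefix]
new_kv = [kv_for(e) for e in suffix]
ks = [k for k, _ in kv_cache + new_kv]
vs = [v for _, v in kv_cache + new_kv]
out_cached = attend(suffix[0], ks, vs)

assert out_cached == out_uncached  # same output, prefix compute skipped
```

The saving is exactly the skipped K/V computation for the repeated prefix, which is why only identical prompt prefixes can hit the cache.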
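The provider difference above can be illustrated with a request-payload sketch. The cache_control marker is Anthropic's actual mechanism for marking a cache breakpoint; the model name and prompt strings here are placeholders, not real values.

```python
import json

# Sketch of a Messages API body using Anthropic's explicit cache breakpoint.
# Everything up to and including the block carrying cache_control is cached.
payload = {
    "model": "claude-model-placeholder",  # placeholder, not a real model id
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "LONG_REFERENCE_DOCUMENT_HERE",  # placeholder prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Question about the document"}],
}

# OpenAI needs no such marker: prompts past the 1,024-token threshold are
# cached automatically when their prefix matches a recent request.
print(json.dumps(payload, indent=2))
```

The trade-off: OpenAI's automatic caching is zero-effort, while Anthropic's breakpoints let you decide exactly which stable portion of the prompt is worth the cache-write surcharge.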
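As a sketch of why sampling parameters are orthogonal to caching: temperature and top_k only reshape the distribution over the next token, after the (cacheable) prefix computation has already produced the logits. This is an illustrative implementation, not any provider's actual sampler.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    # Temperature rescales logits before softmax; top_k truncates candidates.
    # Neither touches the prompt's KV cache: cache hits depend only on the
    # prefix tokens, not on how the next token is sampled.
    items = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    m = max(l for _, l in items)
    weights = [math.exp((l - m) / temperature) for _, l in items]
    r = random.random() * sum(weights)
    acc = 0.0
    for (idx, _), w in zip(items, weights):
        acc += w
        if r <= acc:
            return idx
    return items[-1][0]

# With top_k=1 the sampler is greedy: it always returns the argmax index.
assert sample_next_token([0.1, 3.0, 0.2], top_k=1) == 1
```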