Hasty Briefs (beta)

Prompt caching: 10x cheaper LLM tokens, but how?

3 days ago
  • #LLM
  • #Attention Mechanism
  • #Prompt Caching
  • Cached input tokens can be up to 10x cheaper than regular input tokens on the OpenAI and Anthropic APIs.
  • Prompt caching can reduce latency by up to 85% for long prompts.
  • Cached tokens are not saved responses; what gets cached is the KV cache, the Key and Value matrices produced by the attention mechanism.
  • LLMs convert text into tokens, then into embeddings, which are processed through attention mechanisms.
  • Attention assigns each token a set of weights that measure how strongly it should draw on every other token in the context (see the attention sketch after this list).
  • KV caching avoids recalculating the Key and Value matrices for a repeated prompt prefix, saving computation (see the KV-cache sketch below).
  • OpenAI and Anthropic handle caching differently: OpenAI caches automatically for sufficiently long prompts, while Anthropic requires explicit cache_control breakpoints and so offers more control (see the API sketch below).
  • Parameters like temperature, top_p, and top_k affect output randomness at decode time but have no effect on prompt caching (see the sampling sketch below).
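
To make the attention bullets concrete, here is a minimal single-head, causal scaled dot-product attention sketch in NumPy. The toy embedding size, random projection matrices, and sequence length are made up for illustration; this is not any provider's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head causal attention over a sequence of token embeddings.

    Q, K, V: (seq_len, d) arrays derived from the token embeddings.
    Returns: (seq_len, d) context vectors, one per token.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # how much each token attends to every other
    mask = np.triu(np.ones_like(scores), k=1)   # causal mask: no attending to future tokens
    scores = np.where(mask == 1, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights per token
    return weights @ V                          # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings (numbers are made up).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                # (4, 8)
```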
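The KV-cache bullet can be illustrated with a simplified, hand-rolled cache: the Key and Value rows for an already-processed prefix are stored and reused, so a request that shares the prefix only projects its new suffix tokens. This is a single-head, single-layer sketch under assumed shapes, not actual inference-server code.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention for the newest token's query against all cached keys/values."""
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V

class KVCache:
    """Stores the K and V rows for an already-processed prompt prefix."""
    def __init__(self):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def extend(self, new_embeddings):
        """Project only the new tokens; previously cached rows are reused as-is."""
        self.K = np.vstack([self.K, new_embeddings @ Wk])
        self.V = np.vstack([self.V, new_embeddings @ Wv])

# Process the shared prefix (e.g. a long system prompt) once.
prefix = rng.normal(size=(100, d))      # 100 "tokens" of shared prefix embeddings
cache = KVCache()
cache.extend(prefix)

# A later request with the same prefix only projects its new suffix tokens.
suffix = rng.normal(size=(3, d))
cache.extend(suffix)                    # 3 projections instead of 103
q_last = suffix[-1] @ Wq
context = attend(q_last, cache.K, cache.V)
print(context.shape)                    # (8,)
```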
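A minimal sketch of the two caching styles, assuming the official `openai` and `anthropic` Python SDKs; the model names and the long system prompt are placeholders. OpenAI caches eligible prompt prefixes automatically and reports cached tokens in the usage object, while Anthropic needs a `cache_control` breakpoint and reports cache writes and reads separately.

```python
from openai import OpenAI
from anthropic import Anthropic

LONG_SYSTEM_PROMPT = "..." * 2000  # a long, stable prefix worth caching

# OpenAI: caching is automatic for long prompt prefixes; the response
# reports how many prompt tokens were served from cache.
openai_client = OpenAI()
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize the policy above."},
    ],
)
print(resp.usage.prompt_tokens_details.cached_tokens)

# Anthropic: caching is opt-in; you mark a breakpoint with cache_control,
# and usage reports cache writes and cache reads separately.
anthropic_client = Anthropic()
msg = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the policy above."}],
)
print(msg.usage.cache_creation_input_tokens, msg.usage.cache_read_input_tokens)
```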
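Finally, a sketch of how temperature and top_p reshape the next-token distribution at decode time. They act after the prefix's Keys and Values have already been computed, which is why changing them does not affect prompt caching. The logits and toy vocabulary here are made up.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Turn raw next-token logits into a sampled token id.

    temperature rescales the logits; top_p keeps only the smallest set of
    tokens whose cumulative probability reaches top_p (nucleus sampling).
    Neither touches the prompt's KV cache, which is computed before this step.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

# Toy vocabulary of 5 tokens: low temperature concentrates on the argmax,
# higher temperature spreads probability across more tokens.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_next_token(logits, temperature=0.2, top_p=0.9))
print(sample_next_token(logits, temperature=1.5, top_p=0.9))
```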