Hasty Briefs (beta)

From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem

2 days ago
  • #AI Memory
  • #KV Cache
  • #Transformer Architecture
  • The KV cache is the physical storage of a model's conversation state: actual bytes resident in GPU memory, not an abstraction.
  • Without a KV cache, each new token would require reprocessing all previous tokens, making generation cost quadratic in sequence length instead of linear per token.
  • Memory evolution shows a progression from full recall (multi-head attention in GPT-2) to shared key/value heads (grouped-query attention in Llama 3), compressed latent abstraction (multi-head latent attention in DeepSeek V3), and selective attention (sliding-window layers in Gemma 3).
  • State space models like Mamba eliminate KV cache entirely by using a fixed-size hidden state, filtering information in real time.
  • In practice, users experience delays when old caches are evicted, and long conversations degrade due to context rot and attention thinning.
  • There is no native medium-term memory in AI; external systems like databases and RAG fill the gap with deterministic, transparent storage.
  • Compaction summarizes context to manage memory limits, but lossy compression can discard critical details unpredictably.
  • Learned compaction, trained via reinforcement learning, shows promise in coding tasks but struggles in domains without clear reward signals.
  • External memory systems (files, databases) provide transparent, auditable storage, unlike the opaque KV cache.
  • Future AI memory design raises questions about what gets preserved, who decides, and whether AI will gain agency over its own memory management.
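The caching trade-off in the second bullet can be sketched with a toy single-head attention loop (a minimal NumPy sketch, not any real model's implementation): with a cache, each step projects only the newest token's key and value; without one, every past token is re-projected at every step, yet the outputs are identical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))  # hidden states for 5 tokens

# Incremental decoding: keep K and V for past tokens, project only the new one.
K_cache, V_cache, cached_out = [], [], []
for x in tokens:
    K_cache.append(x @ Wk)  # O(1) new projection work per step
    V_cache.append(x @ Wv)
    q = x @ Wq
    scores = softmax((q @ np.stack(K_cache).T) / np.sqrt(d))
    cached_out.append(scores @ np.stack(V_cache))

# Without the cache: re-project every past token at every step
# (quadratic total work over the sequence).
recomputed_out = []
for t in range(len(tokens)):
    K = tokens[: t + 1] @ Wk
    V = tokens[: t + 1] @ Wv
    q = tokens[t] @ Wq
    scores = softmax((q @ K.T) / np.sqrt(d))
    recomputed_out.append(scores @ V)

assert np.allclose(cached_out, recomputed_out)  # same results, different cost
```

The cache trades memory for compute: the stored K/V grows linearly with context length, which is exactly the footprint the architectures below try to shrink.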
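The headline numbers can be reproduced with back-of-envelope arithmetic. The configs below are assumptions drawn from the models' published specs (Llama 3 70B: 80 layers, 8 KV heads of dimension 128; DeepSeek V3: 61 layers caching a 512-dim compressed latent plus a 64-dim decoupled RoPE key), with 2 bytes per value for bf16:

```python
# Rough per-token KV-cache size; layer counts and dims are assumptions
# taken from publicly reported model configs, 2 bytes/value (bf16).

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_val=2):
    # K and V each hold kv_heads * head_dim values per layer, hence the * 2.
    return layers * kv_heads * head_dim * 2 * bytes_per_val

# Grouped-query attention, Llama-3-70B-style.
gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

# Multi-head latent attention, DeepSeek-V3-style: one 512-dim latent
# plus a 64-dim RoPE key per layer replaces the full K/V tensors.
mla = 61 * (512 + 64) * 2

print(f"GQA: {gqa / 1024:.0f} KB/token")  # GQA: 320 KB/token
print(f"MLA: {mla / 1024:.0f} KB/token")  # MLA: 69 KB/token
```

Under these assumptions the compressed-latent design lands at roughly 69 KB per token versus ~320 KB for grouped-query attention, which is the reduction the article's title refers to.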