From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
2 days ago
- #AI Memory
- #KV Cache
- #Transformer Architecture
- KV cache is the physical storage of conversation state in AI models: key and value tensors for every processed token, occupying actual addresses in GPU memory.
- Without the KV cache, each new token would require reprocessing all previous tokens, making the total cost of generation quadratic in sequence length instead of linear.
- Memory evolution shows a progression from full recall with per-head key/value storage (GPT-2) to shared key/value heads via grouped-query attention (Llama 3), compressed latent abstraction via multi-head latent attention (DeepSeek V3), and selective sliding-window attention (Gemma 3).
- State space models like Mamba eliminate KV cache entirely by using a fixed-size hidden state, filtering information in real time.
- In practice, users experience delays when old caches are evicted, and long conversations degrade due to context rot and attention thinning.
- There is no native medium-term memory in AI; external systems like databases and RAG fill the gap with deterministic, transparent storage.
- Compaction summarizes context to manage memory limits, but lossy compression can discard critical details unpredictably.
- Learned compaction, trained via reinforcement learning, shows promise in coding tasks but struggles in domains without clear reward signals.
- External memory systems (files, databases) provide transparent, auditable storage, unlike the opaque KV cache.
- Future AI memory design raises questions about what gets preserved, who decides, and whether AI will gain agency over its own memory management.
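The quadratic-vs-linear point above can be made concrete by counting token passes. This is a toy illustration with hypothetical counter functions, not a real model: without a cache, every generation step re-encodes the whole prefix; with one, each step encodes only the newest token.

```python
# Count how many tokens get (re)processed while generating n tokens.
# Illustrative counters only -- no actual attention is computed.

def tokens_processed_without_cache(n):
    # Step k must re-encode all k tokens of the prefix from scratch.
    return sum(step for step in range(1, n + 1))   # O(n^2) total

def tokens_processed_with_cache(n):
    # Step k encodes only the newest token; the prefix's K/V tensors
    # are read back from the cache instead of being recomputed.
    return n                                       # O(n) total

print(tokens_processed_without_cache(1000))  # 500500
print(tokens_processed_with_cache(1000))     # 1000
```

At 1,000 tokens the gap is already 500x, which is why every production transformer keeps the cache despite its memory cost.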
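The headline numbers fall out of simple arithmetic on the cache layout. The sketch below assumes bf16 (2 bytes per value) and approximate published dimensions for a Llama-3-70B-like grouped-query config and a DeepSeek-V3-like latent-attention config; the exact figures are illustrative, not authoritative.

```python
def kv_bytes_per_token(n_layers, cached_values_per_layer, dtype_bytes=2):
    """Bytes of cache written per token: one set of cached values per layer."""
    return n_layers * cached_values_per_layer * dtype_bytes

# Grouped-query attention (Llama-3-70B-like assumption: 80 layers,
# 8 shared KV heads of dim 128; keys AND values cached -> factor of 2).
gqa = kv_bytes_per_token(80, 2 * 8 * 128)

# Multi-head latent attention (DeepSeek-V3-like assumption: 61 layers,
# one 512-dim compressed latent plus a 64-dim decoupled RoPE key).
mla = kv_bytes_per_token(61, 512 + 64)

print(f"GQA: {gqa / 1024:.0f} KB/token")   # GQA: 320 KB/token
print(f"MLA: {mla / 1024:.1f} KB/token")   # MLA: 68.6 KB/token
```

The compressed latent replaces per-head keys and values entirely, which is how the cache shrinks by roughly 5x without dropping any tokens.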
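The state-space alternative can be sketched as a recurrence: instead of storing K/V for every past token, the model keeps one fixed-size state and folds each new token into it. This is a minimal scalar illustration of that idea with made-up `decay` and `gain` coefficients, not Mamba's actual selective-scan kernel (where those coefficients are input-dependent, which is what lets it filter information in real time).

```python
def ssm_step(state, x, decay=0.9, gain=0.1):
    # Shrink old information (decay) and absorb the new token (gain).
    # Memory use stays constant no matter how long the sequence gets.
    return decay * state + gain * x

state = 0.0
for token_value in [1.0, 2.0, 3.0, 4.0]:
    state = ssm_step(state, token_value)
print(round(state, 4))
```

The trade-off is visible even here: the state is a lossy blend of everything seen so far, so nothing can be recalled verbatim, which is exactly the compression-versus-recall tension the article describes.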