From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
2 days ago
- #AI Memory
- #KV Cache
- #Transformer Architecture
- KV cache is the physical storage of conversation state in AI models: key and value tensors for every processed token, occupying actual addresses in GPU memory.
- Without the KV cache, each new token would require reprocessing all previous tokens, making the total cost of generation quadratic in sequence length instead of linear.
- Memory evolution shows a progression from full recall with per-head key/value storage (GPT-2) to shared key/value heads via grouped-query attention (Llama 3), compressed latent abstraction via multi-head latent attention (DeepSeek V3), and selective sliding-window attention (Gemma 3).
- State space models like Mamba eliminate KV cache entirely by using a fixed-size hidden state, filtering information in real time.
- In practice, users experience delays when old caches are evicted, and long conversations degrade due to context rot and attention thinning.
- There is no native medium-term memory in AI; external systems like databases and RAG fill the gap with deterministic, transparent storage.
- Compaction summarizes context to manage memory limits, but lossy compression can discard critical details unpredictably.
- Learned compaction, trained via reinforcement learning, shows promise in coding tasks but struggles in domains without clear reward signals.
- External memory systems (files, databases) provide transparent, auditable storage, unlike the opaque KV cache.
- Future AI memory design raises questions about what gets preserved, who decides, and whether AI will gain agency over its own memory management.
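The quadratic-vs-linear point above can be made concrete by counting token passes. This is a toy illustration with hypothetical counter functions, not a real model: without a cache, every generation step re-encodes the whole prefix; with one, each step encodes only the newest token.

```python
# Count how many tokens get (re)processed while generating n tokens.
# Illustrative counters only -- no actual attention is computed.

def tokens_processed_without_cache(n):
    # Step k must re-encode all k tokens of the prefix from scratch.
    return sum(step for step in range(1, n + 1))   # O(n^2) total

def tokens_processed_with_cache(n):
    # Step k encodes only the newest token; the prefix's K/V tensors
    # are read back from the cache instead of being recomputed.
    return n                                       # O(n) total

print(tokens_processed_without_cache(1000))  # 500500
print(tokens_processed_with_cache(1000))     # 1000
```

At 1,000 tokens the gap is already 500x, which is why every production transformer keeps the cache despite its memory cost.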
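The headline numbers fall out of simple arithmetic on the cache layout. The sketch below assumes bf16 (2 bytes per value) and approximate published dimensions for a Llama-3-70B-like grouped-query config and a DeepSeek-V3-like latent-attention config; the exact figures are illustrative, not authoritative.

```python
def kv_bytes_per_token(n_layers, cached_values_per_layer, dtype_bytes=2):
    """Bytes of cache written per token: one set of cached values per layer."""
    return n_layers * cached_values_per_layer * dtype_bytes

# Grouped-query attention (Llama-3-70B-like assumption: 80 layers,
# 8 shared KV heads of dim 128; keys AND values cached -> factor of 2).
gqa = kv_bytes_per_token(80, 2 * 8 * 128)

# Multi-head latent attention (DeepSeek-V3-like assumption: 61 layers,
# one 512-dim compressed latent plus a 64-dim decoupled RoPE key).
mla = kv_bytes_per_token(61, 512 + 64)

print(f"GQA: {gqa / 1024:.0f} KB/token")   # GQA: 320 KB/token
print(f"MLA: {mla / 1024:.1f} KB/token")   # MLA: 68.6 KB/token
```

The compressed latent replaces per-head keys and values entirely, which is how the cache shrinks by roughly 5x without dropping any tokens.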
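The state-space alternative can be sketched as a recurrence: instead of storing K/V for every past token, the model keeps one fixed-size state and folds each new token into it. This is a minimal scalar illustration of that idea with made-up `decay` and `gain` coefficients, not Mamba's actual selective-scan kernel (where those coefficients are input-dependent, which is what lets it filter information in real time).

```python
def ssm_step(state, x, decay=0.9, gain=0.1):
    # Shrink old information (decay) and absorb the new token (gain).
    # Memory use stays constant no matter how long the sequence gets.
    return decay * state + gain * x

state = 0.0
for token_value in [1.0, 2.0, 3.0, 4.0]:
    state = ssm_step(state, token_value)
print(round(state, 4))
```

The trade-off is visible even here: the state is a lossy blend of everything seen so far, so nothing can be recalled verbatim, which is exactly the compression-versus-recall tension the article describes.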