Exploiting Local KV Cache Asymmetry for Long-Context LLMs
- #LLMs
- #KV Cache
- #Compression
- KV cache compression is crucial for efficient long-context modeling in LLMs.
- A key-value asymmetry exists: adjacent keys are locally homogeneous (highly similar to one another), while values are heterogeneous.
- Existing compression methods fail to address this asymmetry, treating keys and values uniformly.
- Proposed AsymKV framework combines key merging and lossless value compression.
- AsymKV outperforms SOTA methods, e.g., achieving 43.95 on LongBench vs. H₂O's 38.89.
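The key-merging idea above can be illustrated with a toy sketch: since adjacent keys tend to be locally homogeneous, consecutive key vectors whose cosine similarity exceeds a threshold can be collapsed into a single averaged key, while the value cache is left intact for separate lossless compression. This is an illustrative assumption-laden sketch (the function name, greedy merge rule, and threshold are mine), not the paper's actual algorithm.

```python
import numpy as np

def merge_similar_keys(keys, threshold=0.95):
    """Greedily merge consecutive key vectors whose cosine similarity
    to the current merged group's mean exceeds `threshold`.
    Returns the merged keys and the size of each merged group.
    (Illustrative sketch only, not AsymKV's exact procedure.)"""
    sums = [keys[0].copy()]   # running sums of each merged group
    counts = [1]              # number of keys in each group
    for k in keys[1:]:
        mean = sums[-1] / counts[-1]
        cos = mean @ k / (np.linalg.norm(mean) * np.linalg.norm(k) + 1e-8)
        if cos >= threshold:
            sums[-1] += k     # absorb key into the current group
            counts[-1] += 1
        else:
            sums.append(k.copy())  # start a new group
            counts.append(1)
    merged = np.stack([s / c for s, c in zip(sums, counts)])
    return merged, np.array(counts)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
# Simulate local homogeneity: each base key repeated 4 times with small noise.
keys = np.repeat(base, 4, axis=0) + 0.01 * rng.normal(size=(16, 8))
merged, counts = merge_similar_keys(keys)
print(merged.shape[0], "merged keys from", keys.shape[0])
```

On such locally homogeneous input the key cache shrinks substantially, while the group sizes in `counts` preserve how many original positions each merged key represents, which an attention kernel would need to weight the merged entries correctly.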