ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLM Inference
- #Inference Optimization
- #LLM
- #Transformer
- ChunkLLM is a lightweight, pluggable framework that accelerates LLM inference by reducing the attention computation that makes Transformer-based models inefficient on long inputs.
- The framework introduces two key components: a QK Adapter (Q-Adapter and K-Adapter) for feature compression and chunk attention, and a Chunk Adapter that detects chunk boundaries from contextual semantic information (minimal sketches of both follow this list).
- During training, only the QK Adapter and Chunk Adapter are updated while the backbone parameters remain frozen; an attention distillation method improves the recall of key chunks (a training-step sketch appears below).
- Inference is accelerated by triggering chunk selection only when a chunk boundary is detected, so the selection overhead is skipped on all other decoding steps (see the decode-step sketch below).
- ChunkLLM achieves performance comparable to the vanilla backbone on short-text benchmarks, and on long-context benchmarks it retains 98.64% of baseline performance while keeping only 48.58% of the key-value cache.
- The framework reaches a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token inputs.
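
To make the QK Adapter concrete, here is a minimal PyTorch sketch under stated assumptions: the class name `QKAdapter`, the bottleneck width `d_adapter`, the plain linear projections, and the pre-pooled per-chunk key features are all illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class QKAdapter(nn.Module):
    """Hypothetical sketch of the Q-Adapter / K-Adapter pair: small linear
    projections that compress query and key features into a low-dimensional
    space where chunk-level attention scores are computed."""

    def __init__(self, d_model: int, d_adapter: int = 64):
        super().__init__()
        self.q_adapter = nn.Linear(d_model, d_adapter)  # Q-Adapter
        self.k_adapter = nn.Linear(d_model, d_adapter)  # K-Adapter

    def forward(self, q: torch.Tensor, chunk_keys: torch.Tensor) -> torch.Tensor:
        # q:          (batch, d_model)            query feature of the current token
        # chunk_keys: (batch, n_chunks, d_model)  one pooled key feature per chunk
        q_low = self.q_adapter(q)           # (batch, d_adapter)
        k_low = self.k_adapter(chunk_keys)  # (batch, n_chunks, d_adapter)
        # Chunk-level relevance: scaled dot product in the compressed space.
        scores = torch.einsum("bd,bnd->bn", q_low, k_low) / (q_low.size(-1) ** 0.5)
        return scores.softmax(dim=-1)       # chunk-level attention distribution

# Toy usage: score 8 chunks for a single query.
adapter = QKAdapter(d_model=32)
probs = adapter(torch.randn(1, 32), torch.randn(1, 8, 32))
print(probs.shape)  # torch.Size([1, 8])
```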
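The attention distillation step can be sketched in the same spirit. Pooling the frozen backbone's token-level attention into per-chunk mass and matching it with a KL divergence is an assumption about one plausible form of the method; `attention_distillation_loss` and its arguments are hypothetical names.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(chunk_log_probs: torch.Tensor,
                                token_attn: torch.Tensor,
                                chunk_ids: torch.Tensor,
                                n_chunks: int) -> torch.Tensor:
    """Hypothetical distillation target: sum the teacher's token-level
    attention within each chunk, then pull the adapter's chunk distribution
    toward that per-chunk mass with a KL divergence.

    chunk_log_probs: (batch, n_chunks) log of adapter chunk scores
    token_attn:      (batch, seq_len)  teacher attention over context tokens
    chunk_ids:       (seq_len,)        chunk index of every context token
    """
    target = torch.zeros(token_attn.size(0), n_chunks, device=token_attn.device)
    target.scatter_add_(1, chunk_ids.unsqueeze(0).expand_as(token_attn), token_attn)
    return F.kl_div(chunk_log_probs, target, reduction="batchmean")

# Only the adapters receive gradients; the backbone stays frozen, e.g.:
# for p in backbone.parameters():
#     p.requires_grad_(False)

# Toy check with random tensors (4 chunks of 4 tokens each):
batch, seq_len, n_chunks = 2, 16, 4
token_attn = torch.rand(batch, seq_len).softmax(dim=-1)
chunk_ids = torch.arange(seq_len) // 4
chunk_log_probs = torch.log_softmax(torch.randn(batch, n_chunks), dim=-1)
print(attention_distillation_loss(chunk_log_probs, token_attn, chunk_ids, n_chunks))
```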
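Finally, a toy decode step showing how boundary detection gates chunk selection. `ChunkAdapter`, `select_chunks`, the two-way boundary classifier, and the `keep_ratio` knob are illustrative; only the trigger-on-boundary control flow comes from the summary above.

```python
import torch
import torch.nn as nn

class ChunkAdapter(nn.Module):
    """Hypothetical sketch of the Chunk Adapter: a small head over the
    backbone's hidden state that classifies whether the current token
    sits on a semantic chunk boundary."""

    def __init__(self, d_model: int):
        super().__init__()
        self.boundary_head = nn.Linear(d_model, 2)  # [no-boundary, boundary]

    def is_boundary(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.boundary_head(hidden).argmax(dim=-1) == 1

def select_chunks(chunk_scores: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the top-scoring chunks; attention then reads just their
    KV-cache entries. keep_ratio ~0.49 would echo the 48.58% cache
    retention quoted above, but the value here is purely illustrative."""
    k = max(1, int(round(chunk_scores.size(-1) * keep_ratio)))
    return chunk_scores.topk(k, dim=-1).indices

# Toy decode step: the costlier chunk selection fires only at boundaries,
# so ordinary steps reuse the previously selected chunk set.
torch.manual_seed(0)
d_model, n_chunks = 32, 10
chunk_adapter = ChunkAdapter(d_model)
hidden = torch.randn(1, d_model)
chunk_scores = torch.rand(1, n_chunks).softmax(dim=-1)
active_chunks = torch.arange(n_chunks).unsqueeze(0)  # start with all chunks
if chunk_adapter.is_boundary(hidden).item():
    active_chunks = select_chunks(chunk_scores)
print(active_chunks)
```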