ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLM Inference
- #Inference Optimization
- #LLM
- #Transformer
- ChunkLLM is a lightweight, pluggable framework that accelerates LLM inference by reducing the attention computation that makes Transformer-based models inefficient on long inputs.
- The framework introduces two key components: a QK Adapter (Q-Adapter and K-Adapter) for feature compression and chunk attention, and a Chunk Adapter that detects chunk boundaries from contextual semantic information (minimal sketches of both follow this list).
- During training, only the QK Adapter and Chunk Adapter are updated while the backbone parameters remain frozen; an attention distillation method improves the recall of key chunks (a training-step sketch appears below).
- Inference is accelerated by triggering chunk selection only when a chunk boundary is detected, so the selection overhead is skipped on all other decoding steps (see the decode-step sketch below).
- ChunkLLM achieves performance comparable to the vanilla backbone on short-text benchmarks, and on long-context benchmarks it retains 98.64% of baseline performance while keeping only 48.58% of the key-value cache.
- The framework reaches a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token inputs.
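
To make the QK Adapter concrete, here is a minimal PyTorch sketch under stated assumptions: the class name `QKAdapter`, the bottleneck width `d_adapter`, the plain linear projections, and the pre-pooled per-chunk key features are all illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class QKAdapter(nn.Module):
    """Hypothetical sketch of the Q-Adapter / K-Adapter pair: small linear
    projections that compress query and key features into a low-dimensional
    space where chunk-level attention scores are computed."""

    def __init__(self, d_model: int, d_adapter: int = 64):
        super().__init__()
        self.q_adapter = nn.Linear(d_model, d_adapter)  # Q-Adapter
        self.k_adapter = nn.Linear(d_model, d_adapter)  # K-Adapter

    def forward(self, q: torch.Tensor, chunk_keys: torch.Tensor) -> torch.Tensor:
        # q:          (batch, d_model)            query feature of the current token
        # chunk_keys: (batch, n_chunks, d_model)  one pooled key feature per chunk
        q_low = self.q_adapter(q)           # (batch, d_adapter)
        k_low = self.k_adapter(chunk_keys)  # (batch, n_chunks, d_adapter)
        # Chunk-level relevance: scaled dot product in the compressed space.
        scores = torch.einsum("bd,bnd->bn", q_low, k_low) / (q_low.size(-1) ** 0.5)
        return scores.softmax(dim=-1)       # chunk-level attention distribution

# Toy usage: score 8 chunks for a single query.
adapter = QKAdapter(d_model=32)
probs = adapter(torch.randn(1, 32), torch.randn(1, 8, 32))
print(probs.shape)  # torch.Size([1, 8])
```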
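The attention distillation step can be sketched in the same spirit. Pooling the frozen backbone's token-level attention into per-chunk mass and matching it with a KL divergence is an assumption about one plausible form of the method; `attention_distillation_loss` and its arguments are hypothetical names.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(chunk_log_probs: torch.Tensor,
                                token_attn: torch.Tensor,
                                chunk_ids: torch.Tensor,
                                n_chunks: int) -> torch.Tensor:
    """Hypothetical distillation target: sum the teacher's token-level
    attention within each chunk, then pull the adapter's chunk distribution
    toward that per-chunk mass with a KL divergence.

    chunk_log_probs: (batch, n_chunks) log of adapter chunk scores
    token_attn:      (batch, seq_len)  teacher attention over context tokens
    chunk_ids:       (seq_len,)        chunk index of every context token
    """
    target = torch.zeros(token_attn.size(0), n_chunks, device=token_attn.device)
    target.scatter_add_(1, chunk_ids.unsqueeze(0).expand_as(token_attn), token_attn)
    return F.kl_div(chunk_log_probs, target, reduction="batchmean")

# Only the adapters receive gradients; the backbone stays frozen, e.g.:
# for p in backbone.parameters():
#     p.requires_grad_(False)

# Toy check with random tensors (4 chunks of 4 tokens each):
batch, seq_len, n_chunks = 2, 16, 4
token_attn = torch.rand(batch, seq_len).softmax(dim=-1)
chunk_ids = torch.arange(seq_len) // 4
chunk_log_probs = torch.log_softmax(torch.randn(batch, n_chunks), dim=-1)
print(attention_distillation_loss(chunk_log_probs, token_attn, chunk_ids, n_chunks))
```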
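Finally, a toy decode step showing how boundary detection gates chunk selection. `ChunkAdapter`, `select_chunks`, the two-way boundary classifier, and the `keep_ratio` knob are illustrative; only the trigger-on-boundary control flow comes from the summary above.

```python
import torch
import torch.nn as nn

class ChunkAdapter(nn.Module):
    """Hypothetical sketch of the Chunk Adapter: a small head over the
    backbone's hidden state that classifies whether the current token
    sits on a semantic chunk boundary."""

    def __init__(self, d_model: int):
        super().__init__()
        self.boundary_head = nn.Linear(d_model, 2)  # [no-boundary, boundary]

    def is_boundary(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.boundary_head(hidden).argmax(dim=-1) == 1

def select_chunks(chunk_scores: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the top-scoring chunks; attention then reads just their
    KV-cache entries. keep_ratio ~0.49 would echo the 48.58% cache
    retention quoted above, but the value here is purely illustrative."""
    k = max(1, int(round(chunk_scores.size(-1) * keep_ratio)))
    return chunk_scores.topk(k, dim=-1).indices

# Toy decode step: the costlier chunk selection fires only at boundaries,
# so ordinary steps reuse the previously selected chunk set.
torch.manual_seed(0)
d_model, n_chunks = 32, 10
chunk_adapter = ChunkAdapter(d_model)
hidden = torch.randn(1, d_model)
chunk_scores = torch.rand(1, n_chunks).softmax(dim=-1)
active_chunks = torch.arange(n_chunks).unsqueeze(0)  # start with all chunks
if chunk_adapter.is_boundary(hidden).item():
    active_chunks = select_chunks(chunk_scores)
print(active_chunks)
```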