REFRAG: Rethinking RAG Based Decoding

8 months ago

REFRAG is proposed as an efficient decoding framework for RAG applications.
It addresses the trade-off between knowledge enrichment and system efficiency in LLMs.
REFRAG compresses, senses, and expands the retrieved context to reduce latency, achieving a 30.85× acceleration in time-to-first-token.
The framework extends the context size of LLMs by 16× without loss in perplexity.
Experiments across diverse long-context tasks show substantial speedups with no loss in accuracy relative to LLaMA models.
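
The summary above names the compress, sense, and expand steps without showing what they do to the decoder input, so the following is a minimal illustrative sketch rather than REFRAG's actual implementation. It assumes the retrieved context arrives as a list of token embeddings, uses mean pooling as a stand-in for a learned chunk encoder, and takes the set of chunks to expand as a hand-picked argument instead of a learned selection policy. All function names and the chunk size `k` are hypothetical.

```python
from typing import List, Sequence, Set

Vector = List[float]


def split_into_chunks(token_vecs: Sequence[Vector], k: int) -> List[List[Vector]]:
    """Split the retrieved-context token embeddings into chunks of k tokens."""
    return [list(token_vecs[i:i + k]) for i in range(0, len(token_vecs), k)]


def compress_chunk(chunk: Sequence[Vector]) -> Vector:
    """Assumed stand-in for a chunk encoder: mean-pool the token embeddings."""
    dim = len(chunk[0])
    return [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]


def build_decoder_inputs(
    token_vecs: Sequence[Vector], k: int, expand: Set[int]
) -> List[Vector]:
    """Emit one vector per compressed chunk, or all k token vectors for
    chunks whose index is in `expand` (chosen by hand in this sketch)."""
    inputs: List[Vector] = []
    for idx, chunk in enumerate(split_into_chunks(token_vecs, k)):
        if idx in expand:
            inputs.extend(chunk)                  # expand: keep full tokens
        else:
            inputs.append(compress_chunk(chunk))  # compress: one vector
    return inputs


# Toy context: 16 token embeddings of dimension 2.
context = [[float(i), float(i) * 2.0] for i in range(16)]

compressed = build_decoder_inputs(context, k=4, expand=set())
mixed = build_decoder_inputs(context, k=4, expand={1})

print(len(context), len(compressed), len(mixed))  # 16 4 7
```

With `k = 4`, the fully compressed context occupies 4 decoder positions instead of 16, and expanding one chunk brings it to 7. A shorter input sequence is the kind of saving that would reduce time-to-first-token, though the toy numbers here say nothing about the reported 30.85× or 16× figures.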
