RAGDoll: Efficient Offloading-Based Online RAG System on a Single GPU

a year ago

RAGDoll is an efficient offloading-based online retrieval-augmented generation (RAG) system designed for single-GPU deployment.
RAG enhances large language model (LLM) generation by incorporating external knowledge, and RAGDoll addresses the challenges of running it on memory-limited consumer-grade platforms.
RAGDoll decouples retrieval and generation into parallel pipelines to optimize resource usage and reduce idle times.
The system employs joint memory placement and dynamic batch scheduling strategies to adapt to diverse hardware and workloads.
Experiments show RAGDoll achieves up to a 3.6x speedup in average latency compared to serial RAG systems built on vLLM.
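
The decoupling idea can be illustrated with a minimal Python sketch. This is not RAGDoll's implementation: `retrieve` and `generate` are hypothetical stand-ins for vector search and LLM inference, and sizing each batch by how many retrieved requests are already waiting is an assumed simplification of dynamic batch scheduling. Memory placement and offloading are not modeled.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the request stream


def retrieve(query):
    # Hypothetical stand-in for a vector-database lookup.
    time.sleep(0.01)
    return f"context for {query}"


def generate(batch):
    # Hypothetical stand-in for one batched LLM generation call.
    time.sleep(0.02)
    return [f"answer({q} | {ctx})" for q, ctx in batch]


def retrieval_worker(queries, ready):
    # Retrieval pipeline: runs independently and feeds the generation queue.
    for q in queries:
        ready.put((q, retrieve(q)))
    ready.put(SENTINEL)


def generation_worker(ready, results, max_batch):
    # Generation pipeline: waits for one request, then drains whatever else
    # is already retrieved (up to max_batch), so batch size follows the load.
    done = False
    while not done:
        item = ready.get()
        if item is SENTINEL:
            break
        batch = [item]
        while len(batch) < max_batch:
            try:
                nxt = ready.get_nowait()
            except queue.Empty:
                break
            if nxt is SENTINEL:
                done = True
                break
            batch.append(nxt)
        results.extend(generate(batch))


def run_pipeline(queries, max_batch=4):
    ready = queue.Queue()
    results = []
    r = threading.Thread(target=retrieval_worker, args=(queries, ready))
    g = threading.Thread(target=generation_worker, args=(ready, results, max_batch))
    r.start()
    g.start()
    r.join()
    g.join()
    return results


if __name__ == "__main__":
    for answer in run_pipeline([f"q{i}" for i in range(10)]):
        print(answer)
```

In a serial design each request would wait for its own retrieval and then its own generation. Here retrieval for later requests overlaps with generation for earlier ones, which is the source of the reduced idle time described above.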
