RAGDoll: Efficient Offloading-Based Online RAG System on a Single GPU
a year ago
- #GPU-optimization
- #LLM
- #RAG
- RAGDoll is an efficient offloading-based online RAG system designed for single GPU deployment.
- Retrieval-augmented generation (RAG) enhances large language model (LLM) generation by incorporating external knowledge, but it is hard to deploy on memory-limited consumer-grade platforms; RAGDoll targets that setting.
- RAGDoll decouples retrieval and generation into parallel pipelines to optimize resource usage and reduce idle times.
- The system uses joint memory placement and dynamic batch scheduling to adapt to diverse hardware and workloads.
- Experiments show RAGDoll achieves up to a 3.6x speedup in average latency over serial RAG systems built on vLLM.
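
The decoupling described above can be illustrated with a minimal Python sketch. This is not RAGDoll's implementation: `retrieve`, `generate`, `run_pipeline`, and the `max_batch` parameter are hypothetical names chosen for illustration, and the retrieval and generation steps are string-building stand-ins. The sketch only shows the general idea of two stages connected by a queue, where the generation stage batches whatever requests have already finished retrieval instead of waiting on each one serially.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def retrieve(query):
    # Hypothetical stand-in for a vector-database lookup.
    return f"context for {query!r}"


def generate(batch):
    # Hypothetical stand-in for one batched LLM forward pass.
    return [f"answer({q}, {ctx})" for q, ctx in batch]


def retrieval_stage(requests, retrieved):
    # Retrieval pipeline: runs independently of generation.
    while True:
        query = requests.get()
        if query is _STOP:
            retrieved.put(_STOP)  # propagate shutdown downstream
            return
        retrieved.put((query, retrieve(query)))


def generation_stage(retrieved, results, max_batch):
    # Generation pipeline: batch whatever is already waiting, up to max_batch.
    done = False
    while not done:
        item = retrieved.get()  # block until at least one item arrives
        if item is _STOP:
            return
        batch = [item]
        while len(batch) < max_batch:
            try:
                nxt = retrieved.get_nowait()
            except queue.Empty:
                break  # nothing else is ready; run with a smaller batch
            if nxt is _STOP:
                done = True
                break
            batch.append(nxt)
        results.extend(generate(batch))


def run_pipeline(queries, max_batch=4):
    requests, retrieved = queue.Queue(), queue.Queue()
    results = []
    stages = [
        threading.Thread(target=retrieval_stage, args=(requests, retrieved)),
        threading.Thread(
            target=generation_stage, args=(retrieved, results, max_batch)
        ),
    ]
    for stage in stages:
        stage.start()
    for q in queries:
        requests.put(q)
    requests.put(_STOP)
    for stage in stages:
        stage.join()
    return results
```

Calling `run_pipeline(["a", "b", "c"], max_batch=2)` returns one answer per query in arrival order. The batch size actually used on each step depends on how many retrievals have completed by then, which is a simplified form of dynamic batching; a real system would also have to account for memory placement, which this sketch ignores.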