Hasty Briefs (beta)

RAGDoll: Efficient Offloading-Based Online RAG System on a Single GPU

a year ago
  • #GPU-optimization
  • #LLM
  • #RAG
  • RAGDoll is an efficient offloading-based online RAG system designed for single-GPU deployment.
  • Retrieval-augmented generation (RAG) enhances large language model (LLM) generation by incorporating external knowledge; RAGDoll addresses the challenge of running it on memory-limited consumer-grade platforms.
  • RAGDoll decouples retrieval and generation into parallel pipelines to optimize resource usage and reduce idle times.
  • The system employs joint memory placement and dynamic batch scheduling strategies to use resources efficiently across diverse hardware and workloads.
  • Experiments show RAGDoll achieves up to a 3.6× speedup in average latency compared to serial RAG systems based on vLLM.
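
The decoupling of retrieval and generation described above can be illustrated with a small producer/consumer sketch. This is not RAGDoll's implementation: `retrieve`, `generate`, `run_pipeline`, and `max_batch` are hypothetical names, the two stages are simulated with short sleeps, and the "dynamic batch" is simply whatever happens to be queued when the generator becomes free. Memory placement and offloading are not modeled at all.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the request stream


def retrieve(query):
    """Hypothetical stand-in for a vector-database lookup."""
    time.sleep(0.01)
    return f"context for {query}"


def generate(batch):
    """Hypothetical stand-in for one batched LLM generation call."""
    time.sleep(0.02)
    return [f"answer({q} | {ctx})" for q, ctx in batch]


def retrieval_worker(requests, ready):
    # Retrieval pipeline: runs independently of generation.
    while True:
        q = requests.get()
        if q is SENTINEL:
            ready.put(SENTINEL)
            return
        ready.put((q, retrieve(q)))


def generation_worker(ready, results, max_batch):
    # Generation pipeline: batch size adapts to how many
    # retrieved requests are waiting (assumed policy, capped at max_batch).
    finished = False
    while not finished:
        item = ready.get()  # block until at least one request is ready
        if item is SENTINEL:
            return
        batch = [item]
        while len(batch) < max_batch:
            try:
                nxt = ready.get_nowait()
            except queue.Empty:
                break
            if nxt is SENTINEL:
                finished = True
                break
            batch.append(nxt)
        results.extend(generate(batch))


def run_pipeline(queries, max_batch=4):
    requests, ready = queue.Queue(), queue.Queue()
    results = []
    workers = [
        threading.Thread(target=retrieval_worker, args=(requests, ready)),
        threading.Thread(target=generation_worker,
                         args=(ready, results, max_batch)),
    ]
    for w in workers:
        w.start()
    for q in queries:
        requests.put(q)
    requests.put(SENTINEL)
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    for line in run_pipeline(["q1", "q2", "q3", "q4", "q5"]):
        print(line)
```

Because each stage has its own thread and queue, retrieval for later requests overlaps with generation for earlier ones, which is the idle-time reduction the summary refers to. A serial system would instead finish retrieval and generation for one request before starting the next.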