Two Qwen3 models on one DGX Spark: the residency math
2 days ago
- #GPU-memory-management
- #local-LLM
- #vLLM
- The author runs an agent stack with Hermes on a workstation and models on a DGX Spark over LAN, using an HTTP proxy for communication.
- As the Hermes agent fleet grew, the added load outgrew a single-model server setup, which prompted a move from Ollama to vLLM for better memory management.
- vLLM offers PagedAttention for KV cache reclamation and gpu_memory_utilization for per-container memory budgeting, which together make it possible to keep multiple models co-resident.
- The goal was to run Qwen3-Next-80B-Instruct-FP8 for heavy tasks and Qwen3-4B-Instruct-2507 for quick tasks on one Spark, accessible via a single LiteLLM proxy endpoint.
- Initial attempts failed with out-of-memory errors: gpu_memory_utilization is a fraction of total GPU memory, not of free memory, so the values for co-resident processes must sum to below ~0.95 to avoid OOM.
- Tool calls broke with the Qwen3-Next-80B-Thinking model, which supports only thinking mode and doesn't emit tool calls automatically; swapping to an Instruct model fixed this.
- Adjusting model parameters (e.g., max_model_len) for co-residency revealed that on Qwen3-Next, KV pool demand is influenced by Mamba state alignment as well as attention KV.
- Key insight: gpu_memory_utilization is applied once, at process start, against total memory rather than free memory, so planning should be based on the residency actually measured after the process stabilizes.
- A playbook for two-model deployment: load the larger model first, measure actual residency with nvidia-smi, then size the smaller model's allocation against free memory minus overhead.
- Action item: Check current vLLM deployment with nvidia-smi to compare actual memory used vs. gpu_memory_utilization target; if divergence exceeds 10%, adjust sizing empirically.
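
The sizing arithmetic in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the author's tooling: the function names are hypothetical, the memory figures are inputs you would read from nvidia-smi yourself, and measuring the 10% divergence relative to the configured target is an assumption about how that threshold is meant.

```python
def fractions_fit(fractions, ceiling=0.95):
    """True if the gpu_memory_utilization values of all co-resident
    processes sum to below the ceiling (~0.95 in the post)."""
    return sum(fractions) < ceiling


def divergence(actual_gib, target_fraction, total_gib):
    """Relative gap between measured residency and the configured target.

    Assumption: the gap is expressed as a fraction of the target itself.
    """
    target_gib = target_fraction * total_gib
    return abs(actual_gib - target_gib) / target_gib


def second_model_fraction(total_gib, first_model_actual_gib, overhead_gib):
    """Largest gpu_memory_utilization the second model can request,
    sized against memory left after the first model's measured
    residency and a reserved overhead."""
    free_gib = total_gib - first_model_actual_gib - overhead_gib
    return max(free_gib, 0.0) / total_gib


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the post.
    total = 128.0
    print(fractions_fit([0.7, 0.2]))
    print(divergence(actual_gib=80.0, target_fraction=0.5, total_gib=total))
    print(second_model_fraction(total, first_model_actual_gib=80.0, overhead_gib=16.0))
```

Keeping the functions free of any nvidia-smi call means the same checks work whether the numbers come from a script or from reading the tool's output by hand.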