Two Qwen3 models on one DGX Spark: the residency math

2 days ago

#GPU-memory-management
#local-LLM
#vLLM

The author runs a Hermes agent stack on a workstation, with the models served from a DGX Spark over the LAN through an HTTP proxy.
As the Hermes agent fleet grew, the load outgrew a single-model server setup, prompting a move from Ollama to vLLM for better memory management.
vLLM offers PagedAttention for KV cache reclamation and gpu_memory_utilization for per-container memory budgeting, which together make co-residency of multiple models possible.
The goal was to run Qwen3-Next-80B-Instruct-FP8 for heavy tasks and Qwen3-4B-Instruct-2507 for quick tasks on one Spark, accessible via a single LiteLLM proxy endpoint.
Initial attempts failed with out-of-memory errors: gpu_memory_utilization is a fraction of total GPU memory, not of free memory, so the fractions of co-resident processes must sum to less than ~0.95 to avoid OOM.
Tool calls failed with the Qwen3-Next-80B-Thinking model, which supports only thinking mode and doesn't emit tool calls automatically; swapping to an Instruct model fixed this.
Adjusting model parameters (e.g., max_model_len) for co-residency revealed that on Qwen3-Next, KV pool demand is influenced by Mamba state alignment, not just attention KV.
Key insights: gpu_memory_utilization is evaluated once at process start against total memory, not free memory; the actual residency after the process stabilizes is what matters for planning.
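The budgeting rule above can be sketched as a small pre-flight check. This is a minimal illustration, not the author's tooling: the function name, the container labels, and the example fractions are hypothetical, and the 0.95 ceiling is the approximate figure quoted in the brief.

```python
# Hypothetical pre-flight check: gpu_memory_utilization values are fractions
# of TOTAL GPU memory, so co-resident processes must fit under one ceiling.

def check_utilization_budget(fractions, ceiling=0.95):
    """Return (total, fits) for a mapping of container name -> fraction.

    `ceiling` is an assumption taken from the ~0.95 figure in the text.
    """
    for name, frac in fractions.items():
        if not 0.0 < frac <= 1.0:
            raise ValueError(f"{name}: fraction {frac} must be in (0, 1]")
    total = sum(fractions.values())
    return total, total <= ceiling


if __name__ == "__main__":
    # Example values are illustrative only, not measurements from the post.
    plan = {"large-model": 0.70, "small-model": 0.20}
    total, fits = check_utilization_budget(plan)
    print(f"sum of fractions = {total:.2f}, fits under ceiling: {fits}")
```

Running the check before launching either container catches the case where each fraction looks reasonable on its own but the pair oversubscribes the device.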
A playbook for two-model deployment: load the larger model first, measure actual residency with nvidia-smi, then size the smaller model's allocation against free memory minus overhead.
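The sizing step of this playbook comes down to a few lines of arithmetic. The sketch below assumes the smaller model's gpu_memory_utilization is expressed as a fraction of total memory (as the brief states) and that the measured figures come from nvidia-smi after the larger model has settled. The function name, the overhead value, and the example numbers are hypothetical.

```python
# Hypothetical sizing helper for the second (smaller) model.
# Inputs are in MiB and are expected to be measured, not guessed.

def small_model_fraction(total_mib, used_after_large_mib, overhead_mib):
    """Fraction of TOTAL memory to give the smaller model.

    budget = (total - measured residency of the large model) - overhead
    fraction = budget / total
    """
    free_mib = total_mib - used_after_large_mib
    budget_mib = free_mib - overhead_mib
    if budget_mib <= 0:
        raise ValueError("no headroom left for a second model")
    return budget_mib / total_mib


if __name__ == "__main__":
    # Illustrative numbers only: 100000 MiB total, 70000 MiB measured for the
    # large model after it stabilized, 5000 MiB reserved as overhead.
    frac = small_model_fraction(100_000, 70_000, 5_000)
    print(f"gpu_memory_utilization for the small model: {frac:.2f}")
```

The point of the design is that the input is the measured residency of the first model, not the fraction it was configured with, since the two can differ.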
Action item: Check current vLLM deployment with nvidia-smi to compare actual memory used vs. gpu_memory_utilization target; if divergence exceeds 10%, adjust sizing empirically.
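One way to run this check is sketched below. It parses a single line of output from `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` and compares it with the configured target. Two assumptions are mine, not the source's: divergence is measured relative to the target, and the device-wide "used" figure is attributed to one vLLM process (with several processes on the device, per-process figures would be needed instead). The function names are hypothetical.

```python
# Hypothetical divergence check between measured memory use and the
# gpu_memory_utilization target. Values are in MiB.

def parse_used_total(csv_line):
    """Parse a 'used, total' line such as '90000, 120000' into two ints."""
    used_str, total_str = csv_line.strip().split(",")
    return int(used_str), int(total_str)


def divergence(used_mib, total_mib, target_fraction):
    """Relative gap between measured use and the target (assumed definition)."""
    target_mib = target_fraction * total_mib
    return (used_mib - target_mib) / target_mib


def needs_resizing(used_mib, total_mib, target_fraction, threshold=0.10):
    """True when the gap exceeds the 10% threshold named in the text."""
    return abs(divergence(used_mib, total_mib, target_fraction)) > threshold


if __name__ == "__main__":
    # Sample line is illustrative, not a measurement from the post.
    used, total = parse_used_total("90000, 120000")
    gap = divergence(used, total, target_fraction=0.60)
    print(f"divergence: {gap:+.1%}, resize: {needs_resizing(used, total, 0.60)}")
```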
