Hasty Briefs

Two Qwen3 models on one DGX Spark: the residency math

2 days ago
  • #GPU-memory-management
  • #local-LLM
  • #vLLM
  • The author runs an agent stack with Hermes on a workstation and models on a DGX Spark over LAN, using an HTTP proxy for communication.
  • As the Hermes agent fleet grew, load outgrew a single-model server setup, prompting a move from Ollama to vLLM for finer control over GPU memory.
  • vLLM provides PagedAttention, which lets KV cache memory be reclaimed, and gpu_memory_utilization, which sets a per-container memory budget; together these make it possible to keep multiple models co-resident.
  • The goal was to run Qwen3-Next-80B-Instruct-FP8 for heavy tasks and Qwen3-4B-Instruct-2507 for quick tasks on one Spark, accessible via a single LiteLLM proxy endpoint.
  • Initial attempts failed with out-of-memory errors: gpu_memory_utilization is a fraction of total GPU memory, not of free memory, so the values for all co-resident processes must sum to below ~0.95 to avoid OOM.
  • Tool calls failed with the Qwen3-Next-80B-Thinking model, which only supports thinking mode and doesn't emit tool calls automatically; swapping to an Instruct model fixed this.
  • Tuning parameters such as max_model_len for co-residency revealed that on Qwen3-Next, KV pool demand depends on Mamba state alignment as well as attention KV.
  • Key insight: gpu_memory_utilization is a snapshot taken at process start against total memory, not free memory, so planning should be based on the residency actually measured after the process stabilizes.
  • A playbook for two-model deployment: load the larger model first, measure actual residency with nvidia-smi, then size the smaller model's allocation against free memory minus overhead.
  • Action item: Check current vLLM deployment with nvidia-smi to compare actual memory used vs. gpu_memory_utilization target; if divergence exceeds 10%, adjust sizing empirically.
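The budgeting rule and the two-model playbook above come down to a little arithmetic, sketched here in Python. This is an illustration under stated assumptions, not the author's tooling: the function names are hypothetical, the memory figures are made-up placeholders rather than measurements from a DGX Spark, and the 0.95 ceiling is the approximate value quoted in the summary.

```python
# Illustrative sketch only: names and numbers are hypothetical.

def fits_coresident(fractions, ceiling=0.95):
    """Return (total, ok) for a list of gpu_memory_utilization values.

    Each value is a fraction of TOTAL GPU memory (not free memory),
    so the values for co-resident processes simply add up.
    """
    total = sum(fractions)
    return total, total < ceiling


def second_model_fraction(total_gib, first_resident_gib, overhead_gib):
    """Size the smaller model's budget from the larger model's measured residency.

    free = total - measured residency of the first model - reserved overhead,
    expressed as a fraction of total memory.
    """
    free_gib = total_gib - first_resident_gib - overhead_gib
    if free_gib <= 0:
        raise ValueError("no memory left for a second model")
    return free_gib / total_gib


# Placeholder figures: check a proposed pair of budgets, then derive the
# second budget from a (hypothetical) measured residency instead.
total, ok = fits_coresident([0.70, 0.20])
fraction = second_model_fraction(total_gib=100.0, first_resident_gib=70.0, overhead_gib=5.0)
print(f"sum={total:.2f} fits={ok} second_model_fraction={fraction:.2f}")
```

The second function takes the measured residency as input on purpose: per the summary, the configured target and the memory a process actually holds can diverge, so the measured value is the safer basis for sizing.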
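The action item above (compare measured memory use against the gpu_memory_utilization target) could be scripted roughly as follows. This is a hedged sketch: the helper names are hypothetical, "divergence" is assumed to mean the relative difference from the target budget, and the parser returns None for any row that nvidia-smi does not report numerically, which can happen on some systems.

```python
import subprocess

# Standard nvidia-smi query flags; output is one "used, total" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'used, total' lines into (used_mib, total_mib) tuples.

    Rows with non-numeric fields (for example "[N/A]") become None.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        try:
            rows.append((float(parts[0]), float(parts[1])))
        except (ValueError, IndexError):
            rows.append(None)
    return rows


def divergence(target_fraction, used_mib, total_mib):
    """Relative gap between measured use and the configured target budget."""
    target_mib = target_fraction * total_mib
    return abs(used_mib - target_mib) / target_mib


def read_gpu_memory():
    """Run nvidia-smi and return parsed rows (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)
```

On a machine running vLLM, one would call `read_gpu_memory()`, pass the first row and the configured target to `divergence`, and treat a result above 0.10 as the cue to resize empirically. With several containers on one GPU, the device-level "used" figure covers all of them, so per-process numbers would be needed to check a single container's target.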