4× RTX Pro 6000 Blackwell on Water, and the One Card That Wouldn't Behave
a day ago
- #Thermal Management
- #GPU Water Cooling
- #Hardware Repair
- A pilot water-cooling conversion for a GPU training rig revealed a cracked solder joint on an 85N power inductor after a week of sustained load.
- The cracked joint made the GPU drop off the PCIe bus under heavy load (Xid 79, "GPU has fallen off the bus", plus PCIe Downstream Port Containment), a failure initially mistaken for a software issue.
- Critical lesson: warm the GPU to about 90°C before disassembly so the thermal pads release cleanly instead of pulling small SMD components, such as inductors, off the board.
- The repair was a microsoldering job at a local phone repair shop: $40 and 20 minutes, with no RMA needed.
- After fixing the pilot card and converting the remaining three with the warm disassembly method, all four GPUs performed stably under full load.
- The water-cooled loop dissipates 2.4 kW of heat, so the cards train continuously at full boost without thermal throttling, which air cooling could not sustain.
- The rig now handles multi-day training jobs and doubles as an inference endpoint with balanced performance across all GPUs.
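
Because the bus drop first looked like a software problem, it can help to count Xid 79 events per PCI address in the kernel log: repeated hits on a single card point toward hardware. The following is a minimal sketch, assuming a Linux host and the usual `NVRM: Xid (PCI:...): <code>, ...` line shape, which may vary by driver version. The function name `find_xid_events` and the sample log lines are hypothetical and made up for illustration; they are not taken from the rig.

```python
import re
from collections import Counter

# Assumed shape of an NVIDIA kernel-log Xid line; wording may differ by driver version.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<code>\d+)")


def find_xid_events(log_text, code=79):
    """Return the PCI address of every Xid event with the given code, in log order."""
    hits = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match and int(match.group("code")) == code:
            hits.append(match.group("bdf"))
    return hits


# Illustrative sample lines (invented for this sketch, not real output from the rig).
SAMPLE_LOG = """\
[ 1021.5] NVRM: Xid (PCI:0000:41:00): 79, pid=4312, GPU has fallen off the bus.
[ 1021.6] pcieport 0000:40:01.1: DPC: containment event, status:0x1f01 source:0x0000
[ 2210.9] NVRM: Xid (PCI:0000:41:00): 79, pid=5120, GPU has fallen off the bus.
[ 3000.2] NVRM: Xid (PCI:0000:61:00): 13, pid=77, Graphics Exception
"""

if __name__ == "__main__":
    # Group Xid 79 events by card to see whether one GPU stands out.
    per_gpu = Counter(find_xid_events(SAMPLE_LOG))
    for bdf, count in per_gpu.items():
        print(f"{bdf}: {count} Xid 79 event(s)")
```

In practice the text would come from something like saved `dmesg` or journal output rather than an inline string. If the same address keeps appearing while the other cards stay clean, a driver or framework bug becomes a less likely explanation.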
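
To get a feel for what a 2.4 kW heat load means for the loop, the steady-state coolant temperature rise can be estimated from dT = Q / (m_dot * c_p). This is a rough sketch only: the flow rates are hypothetical, the coolant is assumed to behave like plain water (about 1 kg/L and 4186 J/(kg·K)), and `coolant_delta_t` is an illustrative helper, not something from the original build.

```python
def coolant_delta_t(heat_w, flow_l_per_min, density_kg_per_l=1.0, cp_j_per_kg_k=4186.0):
    """Steady-state coolant temperature rise in kelvin: dT = Q / (m_dot * c_p)."""
    if flow_l_per_min <= 0:
        raise ValueError("flow must be positive")
    # Convert volumetric flow (L/min) to mass flow (kg/s).
    mass_flow_kg_s = flow_l_per_min * density_kg_per_l / 60.0
    return heat_w / (mass_flow_kg_s * cp_j_per_kg_k)


HEAT_W = 2400.0  # 2.4 kW, the figure given in the summary

if __name__ == "__main__":
    # Hypothetical flow rates; the actual pump and loop flow are not stated in the post.
    for flow in (2.0, 4.0, 8.0):
        print(f"{flow:.0f} L/min -> {coolant_delta_t(HEAT_W, flow):.1f} K rise")
```

The estimate covers only the temperature difference between the coolant entering and leaving the GPU blocks. How far the coolant sits above ambient depends on radiator capacity, which this sketch does not model.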