4× RTX Pro 6000 Blackwell on Water, and the One Card That Wouldn't Behave

a day ago
  • #Thermal Management
  • #GPU Water Cooling
  • #Hardware Repair
  • A pilot water-cooling conversion for a GPU training rig revealed a cracked solder joint on an 85N power inductor after a week of sustained load.
  • The cracked joint caused the GPU to drop from the PCIe bus under heavy loads (Xid 79 + DPC containment), which was initially mistaken for a software issue.
  • Critical lesson: Warm the GPU to about 90°C before disassembly to prevent thermal pads from pulling off small SMD components like inductors.
  • A local phone repair shop microsoldered the joint in 20 minutes for $40, avoiding an RMA.
  • After fixing the pilot card and converting the remaining three with the warm disassembly method, all four GPUs performed stably under full load.
  • The water-cooling loop removes a sustained 2.4 kW of heat, so the cards train continuously at full boost without thermal throttling, which air cooling could not manage.
  • The rig now handles multi-day training jobs and doubles as an inference endpoint with balanced performance across all GPUs.
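
The Xid 79 symptom in the summary can be checked from kernel logs before blaming software. The following is a minimal Python sketch, not the author's tooling. It assumes the NVIDIA driver reports events in lines shaped like `NVRM: Xid (PCI:<address>): <code>, ...`; the exact wording can vary by driver version. The sample log lines, PCI addresses, and function names (`find_xid_events`, `count_by_gpu`) are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Assumed shape of the driver's kernel log line; wording may differ by driver version.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")


def find_xid_events(log_lines, xid=79):
    """Return the PCI address from every log line that reports the given Xid code."""
    hits = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match and int(match.group("xid")) == xid:
            hits.append(match.group("bus"))
    return hits


def count_by_gpu(log_lines, xid=79):
    """Count occurrences of the given Xid code per PCI address."""
    return Counter(find_xid_events(log_lines, xid))


# Hypothetical log excerpt, made up for this example (not taken from the rig).
SAMPLE_LOG = [
    "[ 1234.5] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "[ 1300.0] NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "[ 1400.0] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception",
    "[ 1500.0] usb 1-1: new high-speed USB device",
]

if __name__ == "__main__":
    for bus, count in count_by_gpu(SAMPLE_LOG).items():
        print(f"GPU at PCI {bus}: {count} Xid 79 event(s)")
```

In practice the lines would come from `dmesg` or the system journal rather than a hard-coded list. If the same PCI address keeps reporting Xid 79 under load while the other cards stay clean, that pattern is at least consistent with a fault on one board rather than a driver or framework problem.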