Despite the best efforts of American politicians, there are plenty of restricted Nvidia data center GPUs in China. Because the parts are smuggled in relatively limited quantities and are obviously not covered by any warranty, their owners are inclined to repair failed A100 or H100 processors rather than replace them. As such, Reuters reports that there is a booming underground industry focused on servicing high-end Nvidia AI GPUs that are officially restricted from export to the country.
Around a dozen small firms in Shenzhen now reportedly provide repair services for advanced Nvidia GPUs. Two companies confirmed that they primarily handle A100 and H100 units, which can be used to build powerful supercomputers for both AI and HPC and are restricted from being shipped to Chinese entities. One of them began offering these services in late 2024 and now handles up to 500 GPU repairs per month. These businesses have even set up facilities with server rooms to simulate real-world data center conditions for testing.
Apparently, the profitability of this gray-market repair work has prompted businesses to form dedicated offshoots that handle only AI GPUs. AI accelerators (both add-in cards and SXM modules) are complex devices that can experience several types of breakdowns due to the extreme thermal, electrical, and mechanical stresses they endure in data center environments.
Continuous heavy workloads can cause wear-related failures such as dried-out thermal paste, fan issues (on some cards), component fatigue on the PCB, and damaged or oxidized connector pins in the SXM interface. More complex problems probably include failure of the power delivery subsystem, solder joint cracks under the massive GPU or HBM packages, or even degradation of the HBM itself. Fatal failures like die cracking (when using liquid cooling) or interposer delamination are relatively rare and unrepairable, but many of the aforementioned issues can be fixed.