TensorWave just deployed the largest AMD GPU training cluster in North America: 8,192 MI325X AI accelerators tamed by direct liquid cooling

AI infrastructure company TensorWave has revealed the deployment of a massive 8,192-GPU cluster powered by AMD's Instinct MI325X accelerators, which it claims is the largest AMD-based AI training installation in North America to date. The system is also direct liquid-cooled, making it the first publicly known deployment of its kind at this scale. Announcing the milestone on X, the company showcased photos of the cluster's high-density racks with bright orange cooling loops, confirming the system is now fully operational.

The AMD Instinct MI325X, officially launched late last year, was the company's most aggressive attempt yet to challenge Nvidia in the AI accelerator space, at least until it was succeeded by the MI350X and MI355X last month. Regardless, each MI325X packs 256GB of HBM3e memory good for 6TB/s of bandwidth, along with 2.6 PFLOPS of peak FP8 compute, courtesy of a chiplet design with 19,456 stream processors clocked at up to 2.10GHz.
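
As a sanity check, those per-GPU numbers hang together arithmetically. Here's a quick back-of-envelope sketch; the ops-per-clock figure is inferred by working backwards from AMD's quoted peak (which is the structured-sparsity rate), so treat it as an assumption rather than an official spec:

```python
# Back-of-envelope check of the MI325X's quoted peak FP8 throughput.
# Assumption: the ~2.6 PFLOPS figure is AMD's structured-sparsity peak,
# which implies roughly 64 FP8 ops per stream processor per clock.
STREAM_PROCESSORS = 19_456
PEAK_CLOCK_HZ = 2.10e9
FP8_OPS_PER_SP_PER_CLOCK = 64  # inferred from the quoted peak, not an official spec

peak_fp8 = STREAM_PROCESSORS * PEAK_CLOCK_HZ * FP8_OPS_PER_SP_PER_CLOCK
print(f"Peak FP8: {peak_fp8 / 1e15:.2f} PFLOPS")  # ~2.61 PFLOPS
```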

The GPU confidently stands its ground against Nvidia's H200 while being a lot cheaper, but you pay that cost elsewhere: the MI325X platform tops out at eight GPUs per node, versus the 72 GPUs the Green Team can tie into a single coherent domain with NVL72. That's one of the primary reasons the chip didn't quite take off, and precisely what makes TensorWave's approach so interesting. Instead of trying to compete on scale per node, TensorWave focused on thermal headroom and density per rack. The entire cluster is built around a proprietary direct-to-chip liquid cooling loop, using bright orange (sometimes yellow?) tubing to circulate coolant through cold plates mounted directly on each MI325X.
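
To make that trade-off concrete, here is a deliberately simplified sketch of the choice a training job faces on this hardware. The helper function and numbers are illustrative assumptions, not TensorWave's actual scheduler; the point is only that tensor parallelism wider than the node spills onto slower inter-node links:

```python
# Illustrative sketch: why an 8-GPU node limits tensor parallelism (TP).
# TP shards individual layers across GPUs and is communication-heavy,
# so it is normally confined to the fast links inside one node.
GPUS_PER_NODE = 8  # MI325X platform, vs. 72 in Nvidia's NVL72 domain

def describe_layout(tp_degree: int) -> str:
    """Hypothetical helper: where does TP traffic land for a given degree?"""
    nodes_spanned = -(-tp_degree // GPUS_PER_NODE)  # ceiling division
    if nodes_spanned == 1:
        return f"TP={tp_degree}: all-reduces stay on in-node Infinity Fabric"
    return (f"TP={tp_degree}: each model shard straddles {nodes_spanned} nodes, "
            f"so layer traffic crosses the slower datacenter network")

print(describe_layout(8))   # the sweet spot on an 8-GPU platform
print(describe_layout(16))  # spills across two nodes per shard
```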

Don’t miss these

  • Best CPU deals 2025 — deals on AMD and Intel processors

  • Best Tech and PC Hardware deals — deals on GPUs, CPUs, SSDs, Monitors, and more

  • Best monitor deals 2025 — pixel-perfect deals for 4K, gaming, and more

At 1,000 watts per GPU, running even a fraction of this hardware takes serious engineering. Thankfully, there are no 16-pin power connectors in sight. Anyhow, a total of 8,192 GPUs adds up to over 2 petabytes of aggregate HBM3e memory, nearly 50PB/s of combined bandwidth, and an estimated 21 exaFLOPS of FP8 throughput, though, as always, sustained performance depends heavily on model parallelism (splitting the AI model across GPUs) and interconnect design. TensorWave's business model is cloud capacity for rent, so the real challenge of scaling models falls to the tenants themselves.
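
For what it's worth, those cluster-wide figures fall straight out of the per-GPU specs quoted above; a quick sketch of the arithmetic (the FP8 rating carries AMD's usual with-sparsity caveat):

```python
# Cluster totals derived from the per-GPU MI325X specs cited in this article.
GPUS = 8_192
HBM_PER_GPU_GB = 256
BW_PER_GPU_TBS = 6.0
FP8_PER_GPU_PFLOPS = 2.61  # peak rating, with sparsity
POWER_PER_GPU_W = 1_000

print(f"Total HBM3e:         {GPUS * HBM_PER_GPU_GB / 1e6:.1f} PB")          # ~2.1 PB
print(f"Aggregate bandwidth: {GPUS * BW_PER_GPU_TBS / 1e3:.0f} PB/s")        # ~49 PB/s
print(f"Peak FP8:            {GPUS * FP8_PER_GPU_PFLOPS / 1e3:.1f} EFLOPS")  # ~21.4 EFLOPS
print(f"GPU power alone:     {GPUS * POWER_PER_GPU_W / 1e6:.1f} MW")         # ~8.2 MW
```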