Data center GPUs from Nvidia have become the gold standard for AI training and inference thanks to their high performance, extreme-bandwidth HBM, fast rack-scale interconnects, and a mature CUDA software stack. However, as AI becomes more ubiquitous and models grow larger (especially at hyperscalers), it makes sense for Nvidia to disaggregate its inference stack and use specialized GPUs for the context phase of inference, the phase in which the model must process millions of input tokens to produce the first output token, rather than tying up expensive, power-hungry HBM-equipped GPUs for that job. This month, the company announced its approach to solving that problem with the Rubin CPX (Context Phase aXcelerator), which will sit next to Rubin GPUs and Vera CPUs to accelerate these specific workloads.
The shift to GDDR7 provides several benefits despite delivering significantly lower bandwidth than HBM3E or HBM4: it consumes less power, costs dramatically less per GB, and does not require expensive advanced packaging such as CoWoS, all of which should reduce the product's cost and alleviate production bottlenecks.
What is long-context inference?
Modern large language models (such as GPT-5, Gemini 2, and Grok 3) are larger, stronger at reasoning, and able to process inputs that were previously impossible, and end users take full advantage of that. The models are not only larger; they are also architecturally better at using extended context windows effectively. Inference in large-scale AI models is increasingly divided into two parts: an initial, compute-intensive context (prefill) phase that processes the entire input to generate the first output token, and a subsequent generation (decode) phase that produces additional tokens one at a time from the processed context and is bound mainly by memory bandwidth rather than compute.
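To make that split concrete, here is a minimal, hypothetical Python sketch of the two phases. The function names, shapes, and the toy "KV cache" are illustrative assumptions for this article, not Nvidia's hardware behavior or any framework's actual API.

import numpy as np

VOCAB_SIZE = 1_000
D_MODEL = 64

def prefill(input_ids: np.ndarray) -> np.ndarray:
    """Context phase: process the whole prompt in one compute-heavy pass
    and return a toy 'KV cache' with one vector per input token."""
    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((VOCAB_SIZE, D_MODEL))
    return embeddings[input_ids]              # shape: (prompt_len, D_MODEL)

def decode(kv_cache: np.ndarray, steps: int = 4) -> list[int]:
    """Generation phase: emit tokens one at a time, re-reading the cached
    context on every step (bandwidth-heavy rather than compute-heavy)."""
    generated = []
    for _ in range(steps):
        context_summary = kv_cache.mean(axis=0)            # attend over the cache
        next_token = int(np.argmax(context_summary)) % VOCAB_SIZE
        generated.append(next_token)
        kv_cache = np.vstack([kv_cache, kv_cache[-1:]])    # cache grows per token
    return generated

prompt = np.arange(512) % VOCAB_SIZE          # a long prompt
cache = prefill(prompt)                       # could run on a context accelerator
print(decode(cache))                          # could run on an HBM-equipped GPU

In a disaggregated setup of the kind Nvidia describes, the first call would land on a context-phase part such as Rubin CPX, while token-by-token generation would stay on HBM-equipped Rubin GPUs.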
As models evolve into agentic systems, long-context inference becomes essential for step-by-step reasoning, persistent memory across tasks, coherent multi-turn dialogue, and the ability to plan and revise over extended inputs, capabilities that would otherwise be constrained by short context windows. Perhaps most importantly, long-context inference matters not just because models can now handle it, but because users need AI to analyze large documents and codebases or to generate long videos.