Data center GPUs from Nvidia have become the gold standard for AI training and inference thanks to their high performance, extreme-bandwidth HBM, fast rack-scale interconnects, and a mature CUDA software stack. However, as AI becomes more ubiquitous and models grow larger (especially at hyperscalers), it makes sense for Nvidia to disaggregate its inference stack and use specialized GPUs for the context phase of inference, the phase in which the model must process millions of input tokens to produce the first output token, rather than tying up expensive, power-hungry HBM-equipped GPUs for that job. This month, the company announced its approach to solving that problem with the Rubin CPX (Context Phase aXcelerator), which will sit next to Rubin GPUs and Vera CPUs to accelerate these specific workloads.
The shift to GDDR7 provides several benefits despite delivering significantly lower bandwidth than HBM3E or HBM4: it consumes less power, costs dramatically less per GB, and does not require expensive advanced packaging such as CoWoS, all of which should reduce the product's cost and alleviate production bottlenecks.
What is long-context inference?
Modern large language models (such as GPT-5, Gemini 2, and Grok 3) are larger, stronger at reasoning, and able to process inputs that were previously impossible, and end users take full advantage of that. The models are not only larger; they are also architecturally better at using extended context windows effectively. Inference in large-scale AI models is increasingly divided into two parts: an initial, compute-intensive context (prefill) phase that processes the entire input to generate the first output token, and a subsequent generation (decode) phase that produces additional tokens one at a time from the processed context and is bound mainly by memory bandwidth rather than compute.
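To make that split concrete, here is a minimal, hypothetical Python sketch of the two phases. The function names, shapes, and the toy "KV cache" are illustrative assumptions for this article, not Nvidia's hardware behavior or any framework's actual API.

import numpy as np

VOCAB_SIZE = 1_000
D_MODEL = 64

def prefill(input_ids: np.ndarray) -> np.ndarray:
    """Context phase: process the whole prompt in one compute-heavy pass
    and return a toy 'KV cache' with one vector per input token."""
    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((VOCAB_SIZE, D_MODEL))
    return embeddings[input_ids]              # shape: (prompt_len, D_MODEL)

def decode(kv_cache: np.ndarray, steps: int = 4) -> list[int]:
    """Generation phase: emit tokens one at a time, re-reading the cached
    context on every step (bandwidth-heavy rather than compute-heavy)."""
    generated = []
    for _ in range(steps):
        context_summary = kv_cache.mean(axis=0)            # attend over the cache
        next_token = int(np.argmax(context_summary)) % VOCAB_SIZE
        generated.append(next_token)
        kv_cache = np.vstack([kv_cache, kv_cache[-1:]])    # cache grows per token
    return generated

prompt = np.arange(512) % VOCAB_SIZE          # a long prompt
cache = prefill(prompt)                       # could run on a context accelerator
print(decode(cache))                          # could run on an HBM-equipped GPU

In a disaggregated setup of the kind Nvidia describes, the first call would land on a context-phase part such as Rubin CPX, while token-by-token generation would stay on HBM-equipped Rubin GPUs.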
As models evolve into agentic systems, long-context inference becomes essential for step-by-step reasoning, persistent memory across tasks, coherent multi-turn dialogue, and the ability to plan and revise over extended inputs, capabilities that would otherwise be constrained by short context windows. Perhaps most importantly, long-context inference matters not just because models can now handle it, but because users need AI to analyze large documents and codebases or to generate long videos.