A few take-aways from GTC 2026 Keynote
What I think was significant in Jensen's keynote address at GTC 2026
I watched Jensen’s GTC 2026 keynote in a somewhat forlorn mood. I had thought GTC 2026 was the conference to be at in person this year. Sadly, it just didn’t happen for me. Nevertheless, I thought I’d talk a little about the things that piqued my interest in his keynote.
I think the biggest surprise from Jensen Huang’s GTC 2026 keynote was not that Nvidia had bought Groq earlier. It was how Nvidia chose to use Groq. A lot of people assumed the Groq move was defensive: buy the company, remove a future threat, stop another hyperscaler from acquiring it, keep the engineering prowess and forget about the product. What I don’t think many expected was Jensen openly positioning Groq LPUs alongside Nvidia GPUs inside the Vera Rubin platform itself. That is a much more interesting move.
I did not-so-technical deep-dives into both the Vera Rubin NVL72 platform and Groq LPUs. They make for a helpful read if you want to better understand the technical nuances of what’s going on.
The architecture he outlined for Vera Rubin NVL72 makes the logic pretty clear. Prefill still runs on Rubin GPUs. Decode, however, gets split. The attention part of decode stays on GPUs, while the Multi-Layer Perceptron (MLP) or Feed-Forward Network (FFN) part moves to Groq LPUs.
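To make that split concrete, here is a minimal Python sketch of a single decode step with the two halves pulled apart. Everything in it is illustrative: the function names, shapes, and device assignments are my own shorthand for what Jensen described, not any real Nvidia or Groq API.

```python
import numpy as np

# Illustrative only: tiny dimensions, toy math, made-up "device" labels.
HIDDEN = 1024

def attention_step_on_gpu(query, kv_cache):
    """Decode-time attention: dominated by reading the ever-growing KV cache.
    In the Vera Rubin split, this stays on the HBM-heavy Rubin GPUs."""
    keys, values = kv_cache                       # each of shape (seq_len, HIDDEN)
    scores = keys @ query / np.sqrt(HIDDEN)       # one new token vs all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                       # (HIDDEN,)

def mlp_step_on_lpu(x, w_up, w_down):
    """Decode-time MLP/FFN: dominated by streaming a fixed block of weights.
    In the split, this is the part handed to the SRAM-heavy Groq LPUs."""
    return np.maximum(x @ w_up, 0.0) @ w_down     # simple ReLU feed-forward block

def decode_one_token(query, kv_cache, w_up, w_down):
    # Every decode step alternates between the two memory profiles.
    attn_out = attention_step_on_gpu(query, kv_cache)   # KV-cache bound
    return mlp_step_on_lpu(attn_out, w_up, w_down)      # weight-streaming bound

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv = (rng.standard_normal((128, HIDDEN)), rng.standard_normal((128, HIDDEN)))
    q = rng.standard_normal(HIDDEN)
    w_up = 0.01 * rng.standard_normal((HIDDEN, 4 * HIDDEN))
    w_down = 0.01 * rng.standard_normal((4 * HIDDEN, HIDDEN))
    print(decode_one_token(q, kv, w_up, w_down).shape)   # -> (1024,)
```

The two halves sit in the same loop, but they lean on memory in completely different ways, which is what the next paragraph is really about.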
In simple terms, this looks like a memory hierarchy decision disguised as an inference architecture. Attention is where the system keeps touching the KV cache, and KV cache wants HBM. That favours GPUs. MLP execution, on the other hand, is much more about repeatedly applying weights as fast as possible, and that favours Groq’s SRAM-heavy design. So the split is elegant: keep the KV cache close to HBM-heavy GPUs, keep weights close to SRAM-heavy LPUs, and let each device do the part of the decode loop it is better at. That is a much more nuanced story than “GPU versus LPU.” It is really about matching memory behaviour to the right silicon.
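A rough back-of-envelope shows the asymmetry. The numbers below are made up but plausible (my assumptions, not figures from the keynote): the KV cache grows with context length and concurrency, while the MLP weights are a fixed block that gets re-read on every single generated token.

```python
# Back-of-envelope only; every number here is an assumption for illustration.
layers     = 80
hidden     = 8192
ffn_mult   = 4          # FFN inner dimension = ffn_mult * hidden
bytes_fp16 = 2
seq_len    = 32_000     # a long-context session
batch      = 64         # concurrent decode streams on the rack

# KV cache: scales with context length and batch -> wants HBM capacity and bandwidth.
kv_cache_gb = 2 * layers * seq_len * hidden * bytes_fp16 * batch / 1e9

# MLP/FFN weights: fixed size, but re-read for every token -> rewards keeping them
# resident in fast SRAM and streaming activations through.
mlp_weights_gb = 2 * layers * hidden * (ffn_mult * hidden) * bytes_fp16 / 1e9

print(f"KV cache across the batch: ~{kv_cache_gb:,.0f} GB (grows with context x batch)")
print(f"MLP weights read per token: ~{mlp_weights_gb:,.0f} GB (fixed)")
```

Growing, capacity-hungry state sits next to HBM; fixed, bandwidth-hungry weights sit next to SRAM. That is the whole argument in two numbers.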
My second takeaway is that Samsung’s role here is bigger than it may look at first glance. Jensen explicitly thanked Samsung for manufacturing the Groq LP30 chip and said the chips were already in production, with shipment planned for the second half of 2026. Reuters also reported Samsung is making the chip on its 4nm process. That matters because it shows Samsung is not just orbiting this stack as a memory supplier anymore. It is now part of the logic side of Nvidia’s inference strategy too. That is a serious commitment.
The last point that stayed with me was Jensen talking about CPUs as a future multi-billion dollar business. Some time back, it started to look like AI was pulling compute value away from CPUs and toward GPUs. People in the know called it one of the three underlying paradigm shifts brought about by AI. Training moved to accelerators. Inference moved to accelerators. The CPU looked increasingly like plumbing.
But the agentic era complicates that story. GPUs are incredible at parallel compute. But if they are starved of data, orchestration, tool results, memory lookups, sandboxing, scheduling, and environment management, they sit idle. And a lot of agentic engineering lives exactly there: in the harness around the model. The model call may hit the GPU, but the workflow around it is full of CPU-bound work. Tool execution, state management, retrieval glue, memory handling, routing, queues, session control, and sandboxed environments all lean heavily on the CPU side.
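A toy agent loop makes this concrete. The `session`, `tools`, and `model_client` objects below are hypothetical interfaces I made up for illustration; the point is simply where the work lands.

```python
# Hypothetical sketch of one agent turn. Only the model call is accelerator-bound;
# every other line is the CPU-bound harness around it.

def run_agent_turn(user_msg, session, tools, model_client):
    session.append("user", user_msg)                  # state management (CPU)
    context = session.retrieve_relevant(user_msg)     # retrieval glue / memory lookups (CPU)

    while True:
        reply = model_client.generate(session.messages + context)   # GPU/LPU-bound

        if not reply.tool_calls:
            session.append("assistant", reply.text)   # bookkeeping, session control (CPU)
            return reply.text

        for call in reply.tool_calls:                 # scheduling and routing (CPU)
            result = tools.run_sandboxed(call.name, call.args)   # sandboxed tool execution (CPU)
            session.append("tool", result)            # feeding results back (CPU)
```

One line in that loop touches an accelerator. Everything else is the harness, and the harness runs on CPUs.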
That makes Jensen’s Vera CPU strategy easier to understand. He is not arguing that CPUs beat GPUs for model compute. He is arguing that agentic systems create a much larger surrounding control plane, and that control plane needs very fast, very efficient CPUs with strong single-threaded performance and high data throughput. Nvidia is clearly trying to own that layer too. Jensen has already said he expects Vera to become a multi-billion dollar CPU business. In other words, the shift is not from CPUs to GPUs. It is from general-purpose compute to a tightly orchestrated CPU-GPU-LPU stack, where each part becomes more important as agents get more complex.
My take is that the real theme of this keynote was not just more AI compute. It was heterogeneous inference, designed around bottlenecks. GPUs are still central. But Nvidia is now openly acknowledging that one kind of silicon is not enough for where inference is going. HBM-heavy GPUs, SRAM-heavy LPUs, and agent-optimized CPUs each solve a different constraint. What changes is that Nvidia is no longer just selling the fastest accelerator. It is trying to own the entire serving path.
One final point I want to mention is the near-future importance of Optics in chip design and interoperability. Jensen hinted at it. I am going to dedicate a separate post to it.
Thank you for reading!

