Thoughts on improving inference-time compute
What it means for Nvidia to acquire Groq
A few weeks ago, a friend I used to work with and I were talking about the pressing need for inference-time compute optimizations to bring down soaring energy demands. Keep in mind that inference-time (test-time) compute is the lever everyone in the AI industry is looking at, more than training-time compute, and for good reason. Inference, as they say, is god. Out of that discussion came the idea of a specialized, unikernel-style operating system for LLMs.
My opinion was that since Nvidia (and others) have spent so much time making the tooling around GPUs and LLMs (CUDA et al.) smooth, efficient, and, importantly, cross-platform, an optimized OS would likely not work that well. Why would Nvidia not just build it themselves if they felt it would make a sizable difference in performance?
That led to a discussion of a new startup called Unconventional AI, which aims to tackle the same problem space from both the software and hardware sides. This got me thinking about Google's TPU chips, which are getting a lot of coverage, and for all the right reasons. GPUs have traditionally been built for generic purposes: first graphics and gaming, then crypto, and now LLMs. TPUs, on the other hand, are purpose-built by Google for tensor workloads, which is why they offer a better price-to-performance ratio than comparable Nvidia GPUs.
And now we have news of Nvidia acquiring Groq for their LPUs. Google continues to keep their TPUs proprietary for the most part because they don't want competitors getting access to them, although it is only a matter of time before somebody else figures out how it's done. The idea of Nvidia acquiring Groq is partly what I had in mind a few weeks ago.
Inference-time compute can likely be optimized in the following ways:
Advancements in model architectures, like the Mixture of Experts (MoE) designs popularized by DeepSeek (a minimal sketch follows this list), better data sharing between GPU threads, the model-router paradigm, and hybrid models, among others.
Improvements in how inference is executed, such as KV (token) caching, also sketched after this list.
Continuous improvement in GPU architecture in the form of newer GPU generations.
Improvements in how CPUs, instead of GPUs, are utilized for serving inference for small models; see Ampere Computing, which was acquired by SoftBank.
Continued, rarer advancements in chip design, leading to architectures like Google's TPUs and Groq's LPUs.
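To make the MoE point concrete, here is a minimal sketch of a top-k routed MoE layer in PyTorch. The expert count, sizes, and routing scheme here are illustrative assumptions of mine rather than any particular model's implementation; the point is that the router activates only a couple of experts per token, so inference FLOPs are a small fraction of total parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k Mixture of Experts layer (illustrative, not production code)."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.router(x)                                # (tokens, num_experts)
        weights, expert_idx = scores.topk(self.top_k, dim=-1)  # keep the k best experts
        weights = F.softmax(weights, dim=-1)                   # renormalize over chosen k
        out = torch.zeros_like(x)
        # Only the selected experts run for each token -- that is the compute saving.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 16 tokens of width 64; only 2 of the 8 expert MLPs run per token.
layer = MoELayer(dim=64)
y = layer(torch.randn(16, 64))
```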
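Similarly, for token caching: here is a minimal single-head sketch of a KV cache, with names and shapes of my own choosing. Instead of re-projecting keys and values for the entire prefix at every step, each decode step appends one new key/value pair and reuses the cached rest.

```python
import torch

def attend_with_cache(q, new_k, new_v, cache):
    """One single-head decode step using a KV cache (minimal sketch).

    q:            (1, head_dim) query for the newly generated token
    new_k, new_v: (1, head_dim) key/value for that same token
    cache:        dict holding all previous keys/values, or None at step 0
    """
    if cache is None:
        k, v = new_k, new_v
    else:
        # Append one row instead of recomputing K/V for the whole prefix.
        k = torch.cat([cache["k"], new_k], dim=0)  # (seq, head_dim)
        v = torch.cat([cache["v"], new_v], dim=0)
    scores = (q @ k.T) / k.shape[-1] ** 0.5        # (1, seq)
    out = torch.softmax(scores, dim=-1) @ v        # (1, head_dim)
    return out, {"k": k, "v": v}

# Decode loop: each step projects one new token, not the whole sequence.
head_dim, cache = 64, None
for _ in range(16):
    q, new_k, new_v = (torch.randn(1, head_dim) for _ in range(3))
    out, cache = attend_with_cache(q, new_k, new_v, cache)
```

Production servers take this further with paged and quantized caches, but the core reuse idea is the same.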
In future posts I may delve into each one of those. For now, my read of Nvidia's acquisition of Groq is that they'll absorb Groq's engineering prowess and learnings into continuing to improve and advance their existing GPUs.
It’s an exciting time to be alive.

