Is Moore's law dead?
And some thoughts on the Straggler Effect during model training, and why xAI's Colossus 1 cluster was at 11% GPU utilization for training.
The last podcast Dwarkesh did with Jensen Huang was explosive. It was full of interesting insights into what Jensen is looking at. I want to spend some time here on a couple of them, because they piqued my interest more than the rest.
Moore’s law is dead
Jensen said Moore’s law is dead. In 1965, Gordon Moore predicted that the number of transistors that fit on a chip would roughly double every two years. The prediction held for decades, but since the 2010s it has been slowing down. To keep fitting more transistors on a chip, you have to keep shrinking the transistor, and when transistors get too small you run into issues such as:
Quantum tunneling, where electrons begin to leak through barriers as those barriers get thinner (again, a problem as dimensions approach atomic scale).
Heat, because packing more transistors into a small area generates more heat in less space.
And finally, a hard floor: you cannot shrink features below the size of an atom.
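For a sense of scale, here is a quick back-of-the-envelope sketch of what a strict two-year doubling cadence implies. The 1971 Intel 4004 (~2,300 transistors) is just a familiar starting point I'm using for illustration; it wasn't part of the podcast.

```python
# Toy Moore's-law projection: transistor count doubling every ~2 years.
def projected_transistors(n_start: float, years_elapsed: float, doubling_period: float = 2.0) -> float:
    return n_start * 2 ** (years_elapsed / doubling_period)

# Starting from the Intel 4004's ~2,300 transistors in 1971, a strict two-year
# doubling predicts roughly 5 x 10^10 transistors by 2020, which is about the
# ballpark where 2020-era flagship accelerator dies actually landed.
print(f"{projected_transistors(2_300, 2020 - 1971):,.0f}")
```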
While chipmakers have engineered ways to work around some of these constraints, the reality is that they are physical limits that cap transistor density per chip. On top of the physics, the cost of building high-density chips keeps climbing: a cutting-edge fab now costs far more to build than fabs running larger node sizes. At the moment, 2nm nodes are in production across TSMC, Intel, and Samsung. TSMC’s 1.4nm node is slated to go into production in 2028, a cadence that already falls behind Moore’s law. And even the 2nm fabs are expensive to build and operate.
So that’s why I suspect Jensen believes Moore’s law is dead. If Moore’s law is dead and we are hitting physical limits at the transistor level, the only way for AI to keep improving, Jensen claims, is through software: model architectures and advances in algorithms. He is very bullish on that. The transformer architecture pioneered generative AI, and the improvements DeepSeek’s teams have made have been instrumental in moving the model space forward by proving that the same (or even throttled) hardware can run bigger, more advanced models more efficiently and at lower cost.

My last post on SubQ is a case in point for Jensen’s belief that software and algorithms will pave the way for the next breakthroughs in AI. If proven, SubQ’s improvements to how attention is calculated could be a big breakthrough not only in performance and cost but also in the number of context tokens a model can reliably reference. Context windows today are a bottleneck for agentic engineering: models tend to hallucinate and break down near their context limits, which are already low. Given how DeepSeek keeps innovating at the software, architecture, and algorithmic layers, it is only a matter of time before more players publish papers with techniques that push the frontier forward.
ASICs aren’t a threat
Dwarkesh pressed Jensen on why ASICs, like Google’s TPUs, aren’t a threat to Nvidia. Jensen believes ASICs are not going to pose problems for Nvidia. He claims that most ASICs come and go (fail), and that researching, developing, and operating an ASIC is too difficult and time-consuming to be viable at scale. Nvidia’s goal is to have its GPUs everywhere and have them run everything, which is in direct contrast to ASICs, which are custom-designed for specific hardware needs and for running specific models. He also thinks the co-design ASICs require is their Achilles’ heel, because one generation of a specific ASIC can only run a specific set of architectures and models efficiently.
I strongly believe hyper-scalers are betting on ASICs for several reasons:
There is a genuine shortage of GPUs, CPUs, and memory in the world. Hyper-scalers are hedging their bets to make sure they have alternatives, so they can get their hands on as many xPUs/chips as possible. For that reason, it’s not Nvidia vs. custom silicon; they simply want as much hardware as they can get.
They want to avoid the Nvidia tax as much as possible. Nvidia’s margins are hefty.
GPUs are general purpose. For more specific use-cases, a custom silicon ASIC tailored to a hyper-scaler’s workload can deliver better performance at a fraction of the cost, savings they can pass down to their customers.
Nvidia’s GPUs are used heavily everywhere, and a big part of the reason is the CUDA ecosystem that Jensen has smartly built and spread. It’s similar to vendor lock-in: Nvidia GPUs run everywhere, in all kinds of infrastructure, and most model development and training is tied to that ecosystem. So if you are trying something out, you are most likely doing it with CUDA on Nvidia GPUs. Hyper-scalers, however, don’t have to be tied to that for their own workloads, and they don’t have to make their alternatives available to a large swathe of the population. They have the means to build on custom ASICs and custom software ecosystems that compete with CUDA, if only so they can offer better margins to their customers.
The Straggler Effect
The co-design argument applies equally to Nvidia’s own GPUs and ecosystem. Even within the same ecosystem, training workloads run into issues across different generations of Nvidia GPUs. Recently, news broke that xAI’s Colossus 1 cluster was running at only 11% GPU utilization. A lot of people took that to mean different things, but the reality is simpler: Colossus 1 is a heterogeneous cluster combining Nvidia’s Hopper and Blackwell GPUs, and it was primarily being used for training workloads. In the model-training world, distributed training across different generations of GPUs is considered a disaster. At the heart of it is what is called the Straggler Effect.
In distributed model training, GPUs have to wait until the current step is completed across all GPUs before they can move on to the next step. If one generation of GPUs is slower, the faster GPUs still have to wait for the slower ones to finish before moving forward. This happens because a typical training step looks like this:
A forward pass, which runs the entire input through the model to get a prediction.
Loss calculation, which measures how wrong the prediction was (the error).
A backward pass, which computes the gradient: how much each weight in the model contributed to that error.
A weight update, which adjusts individual weights based on the computed gradients.
In data-parallel distributed training, each GPU holds a copy of the model but gets a different slice of data to work on. Each GPU computes the gradient for its slice, the gradients are then shared across all the GPUs, and the averaged gradient is applied by every GPU to update the weights. This synchronization has to complete across the entire GPU fleet before the weights can be updated properly, and it is what forces the fleet to wait on a “straggler” GPU to finish its computation.
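Here is a minimal sketch of one synchronous data-parallel step, written against PyTorch’s torch.distributed. It isn’t xAI’s actual training code; model, optimizer, loss_fn, and world_size are placeholders, and real frameworks fuse and overlap the communication. The point is the blocking all-reduce: every GPU stalls there until the slowest one arrives.

```python
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, batch, world_size):
    inputs, targets = batch

    # 1. Forward pass: run this GPU's slice of data through the model.
    preds = model(inputs)

    # 2. Loss: measure how wrong the predictions were.
    loss = loss_fn(preds, targets)

    # 3. Backward pass: compute each weight's contribution to the error.
    optimizer.zero_grad()
    loss.backward()

    # 4. Gradient sync: all_reduce is a blocking collective, so every rank
    #    waits here until the slowest GPU (the straggler) has finished its
    #    backward pass and contributed its gradients.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # average the gradients across GPUs

    # 5. Weight update: apply the averaged gradients, identically on every GPU.
    optimizer.step()
    return loss.item()
```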
This explains why Colossus 1, as a heterogeneous cluster, shows very low utilization during training. However, Elon Musk has rented out the entire cluster to Anthropic for “inference” instead of training, while focusing training workloads on xAI’s Colossus 2, a Blackwell-only cluster. So even across different generations of Nvidia GPUs, co-design is a problem for training workloads.
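To make the straggler math concrete, here is a toy calculation with made-up per-step times. These are not actual Colossus measurements, and real utilization also depends on communication overhead, parallelism strategy, and failures; the sketch only shows how a synchronous step gated by the slowest GPU leaves the faster half of a mixed fleet idle.

```python
def fleet_utilization(step_times):
    """Fraction of total GPU-time spent computing when every GPU must wait
    for the slowest one to finish the step."""
    slowest = max(step_times)
    return sum(step_times) / (len(step_times) * slowest)

# Hypothetical numbers: a newer GPU finishes its slice in 1.0s, an older one
# needs 2.5s, and the fleet is split half and half.
mixed_fleet = [1.0] * 50 + [2.5] * 50
print(f"{fleet_utilization(mixed_fleet):.0%}")  # 70%: the fast GPUs sit idle for 60% of every step
```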
When you consider that Moore’s law is dead and transistors have hit physical limits, it makes sense that breakthroughs will come from advances in the software layer. But it’s not wrong to imagine that breakthroughs will also keep coming from the hardware layer, where customized chips can be tuned to perform better for specific architectures than general-purpose chips can. It will be fascinating to see where the world takes us.
Thanks for reading.

