Ayaz on Tech & AI

Is Moore's law dead?

Ayaz — Sat, 09 May 2026 17:16:26 GMT

The last podcast Dwarkesh did with Jensen Huang was explosive. It contained interesting foresights that show what Jensen is looking at. I want to spend time here talking about a couple of those because they piqued my interest more than others.

Moore’s law is dead

Jensen said Moore’s law is dead. In 1965, Gordon Moore predicted that the number of transistors that can fit on a chip will roughly double every two years. This law has held true for decades. Since 2010s, though, it has been slowing down. In order to fit many more transistors per chip, you have to reduce the size of the transistor down consistently. The size of the transistor cannot go lower than the size of the atom, however. In addition to that, when the size becomes too small, you run into other issues such as:

Quantum tunneling where electrons begin to leak through barriers because the barriers are getting thinner (again, nearing atomic sizes, this is a problem).
Having more transistors in a small area leads to more heat generation.
And finally, you cannot go beyond the size of the atom.

While transistor companies have engineered solutions to work around some of these, the reality is that these are physical limits that define the floor in terms of transistor density per chip. In addition to these physical limits, the cost of building high density chips goes up. Building a cutting edge fab to generate these chips now costs much more than fabs that operate bigger node sizes. At the moment, 2nm nodes are in production across TSMC, Intel and Samsung. TSMC’s 1.4nm node is slated to go into production in 2028, which is already a big jump in timelines that doesn’t follow Moore’s law. Despite that, even the 2nm fabs are expensive to build and operate.

So, that’s why I suspect Jensen believes that Moore’s law is dead. If Moore’s law is dead and we are hitting physical limits at the transistor level, the only way, Jensen claims, for AI to continue to improve is via software engineering. Through model architectures, and advancements in algorithms. He is very bullish on that. The transformer architecture pioneered generative AI. The improvements DeepSeek’s teams have made have been instrumental in moving the model space forward, by proving that the same (or even throttled) hardware can run more advanced, bigger models more efficiently and at lower costs. My last post on SubQ was proof of Jensen’s belief in software and algorithms paving the way for next breakthroughs in AI. If proven, SubQ’s improvements on how attention is calculated can provide a big breakthrough not only in terms of performance and cost but also the amount of context tokens that can be reliably referenced by a model. Context windows today are a bottleneck for agentic engineering. Models tend to hallucinate and break apart near context limits which are already low. Given how DeepSeek continues to innovate in the software, architecture, and algorithmic layers, it is only a matter of time before more players publish papers that offer techniques that push the frontier forward.

ASICs aren’t a threat

Dwarkesh confronted Jensen on why ASICs, like Google’s TPUs, wasn’t a threat for Nvidia. Jensen believes that ASICs are not going to pose any problems for Nvidia. He claims that there are more ASICs that come and go (fail), and that the process of researching, developing, and operating ASICs is difficult and time consuming to be viable at scale. Nvidia’s goal is to have their GPUs everywhere, and have them run with everything, which is in direct contrast to ASICs which are custom designed for specific hardware needs as well as for running specific models. He also thinks the co-design needed for ASICs is also its Achilles’ heel because one generation of a specific ASIC can only be efficiently run with a specific set of architecture and model.

I strongly believe hyper-scalers are betting on ASICs for several reasons:

There is a genuine shortage of GPUs, CPUs, and memory in the world. They are hedging their bets to make sure they have alternatives around, so they can grab their hands on as many xPUs/chips they can get. For this reason, it’s not Nvidia vs custom Silicon. They just want as much of the hardware as they can.
They want to avoid the Nvidia tax as much as possible. Nvidia’s margins are heavy.
GPUs are general purpose. For more specific use-cases, running a custom Silicon ASIC that is tailored to a hyper-scaler’s use-case can generate better performance at a fraction of a cost which they can pass down to their customers.

Nvidia’s GPUs are used heavily everywhere, and a big part of the reason is the CUDA eco-system that Jensen has smartly built and spread. It’s similar to a vendor lock-in. It runs everywhere, on all kinds of hardware, and most model development and training is tied to that eco-system. So if you are trying something out, you are more likely to be doing it using CUDA on Nvidia GPUs. However, hyper-scalers don’t have to be tied down to just that for their workloads, and they don’t have to make it available for a large swathe of the population. They have the means to work with custom ASICs and custom software eco-systems that compete with CUDA, only so they can offer better margins to their customers.

The Straggler Effect

The co-design argument equally applies to Nvidia’s GPUs and eco-system. Even within the same eco-system, for training workloads, you still run into issues across different generations of Nvidia’s GPUs. Recently, a piece of news was published about xAI’s Colossus 1 cluster running at only 11% GPU capacity. A lot of people took it to mean different things. The reality was different. The Colossus 1 is a heterogenous cluster combining Hopper and Blackwell GPUs for Nvidia. It was primarily being used for training workloads. In the model training world, distributed training across different generations of GPUs is considered a disaster. At the heart of it is what is called the Straggler Effect.

In distributed model training, GPUs have to wait until the existing step is completed across all the GPUs, before they can move on to the next step. If there’s a generation of GPUs which are slower, then the faster GPUs will still have to wait for the slower ones to finish before moving forward. This happens because in a usual model training step, the following things happen:

A forward pass, which runs the entire input through the model to get a prediction.
Loss calculation, which determines how wrong (error) the prediction was.
A backward pass, which calculates the gradient, which is how much each weight in the model contributed to that error.
A weight update pass, which updates individual weights based on the calculated gradients.

For distributed training, each GPU contains a copy of the model, but gets a different slice of data to work on. Therefore, each GPU has to determine the gradient for that slice of data, which is then shared across all the GPUs, after which the averaged gradient is applied by all GPUs to update the weights. This is crucial and has to be done across the entire GPU fleet in order to update the weights properly. This is what causes GPUs to wait for a “straggler” GPU to finish its computation.

This explains why Colossus 1, as a heterogenous cluster, has very low utilization during training. However, Elon Musk has rented out the entire cluster to Anthropic for “inference” instead of training, while focusing on xAI’s Colossus 2, which is a Blackwell only cluster, for training workloads. So, even across different generations of Nvidia GPUs, co-design is a problem for training workloads.

When you consider Moore’s law is dead and transistors have hit a physical limit, it makes sense that breakthroughs will come from advancements in the software layer, but it’s not wrong to imagine that breakthroughs will continue to come from the hardware layer where more customized chips can be tweaked to perform better for specific architectures instead of general purpose chips. It will be fascinating to see where the world takes us.

Thanks for reading.

Some thoughts on SubQ's Subquadratic LLM

Ayaz — Thu, 07 May 2026 15:59:36 GMT

What makes the Transformer model architecture so good at generative AI is also what makes it progressively worse as the context length of the model increases. For each token that is generated, the attention mechanism looks at all the previous generated tokens to compute its attention with respect to those before it. As the context length grows, this becomes computationally expensive. This puts a huge strain on the already memory-constrained KV cache. Since the seminal attention is all you need paper came out, there have been various improvements over the standard self-attention mechanism to improve both compute and memory complexity as well as time required by the model to compute it. Some of these are sparse attention, low rank/linear attention, grouped query attention and some combinations of sliding window attentions.

Subquadratic, or SubQ for short, is a different approach to attention mechanisms traditionally found in Transformer models. It makes use of subquadratic selective attention mechanism to reduce attention complexity, time and resource usage, thereby allowing huge context windows to become feasibly possible. SubQ claims that their model can serve a 12M context window at significantly reduced compute and therefore energy spend, at a lower cost. Now, these claims aren’t vetted by the community because they are behind closed doors so far and access to the model (and related harness) is restricted. While there is no published whitepaper, they claim that the use of subquadratic selective (sparse) attention (SSA) can improve attention performance over other generally used mechanisms.

The core concept behind SSA is that instead of looking at all the previous tokens to compute attention against, an intelligent router uses content-dependent routing to identify which key positions in the sequence are meaningful, discarding the rest, and then runs full attention across it. The performance complexity remains largely linear, as all irrelevant positions are skipped, even when the context length goes beyond 1M tokens. Compared to other mechanisms, it starts to shine as context length explodes and at 12M token, they claim, the performance and cost gains are huge.

Since SubQ have not published a technical white-paper explaining how content-dependent attention is being calculated, and because their claims have not been tested by the community at large, there’s only speculation to go by. However, while how exactly does the attention router know which tokens are meaningful or not is not clear, it appears that this is something that’s learned during different parts of the three-stage training pipelines for the model (particularly the reinforcement learning part) they have used, like with other slightly related approaches.

An interesting comparison is the recent launch of DeepSeek’s V4 Pro models, which make use of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) mechanisms, in addition to their default DeepSeek Sparse Attention mechanism, to improve the performance and usage of the KV cache in order to serve 1M context window. DeepSeek’s improvements also do a similar sparse selection of which tokens to run attention for, while at the same time trying to compress and optimize how the KV cache is utlitized throughout the process in order to reduce memory consumption, compute, and eventually cost. However, in comparison to SubQ and as claimed by them, this sparse attention/router mechanism is still complex and therefore beyond 1M context window it starts to become painful.

It will be interesting to read the technical white-paper for SubQ whenever it is released. Meanwhile, it’s definitely worth looking at the DeepSeek V4 whitepaper to understand the engineering improvements they introduced in existing attention mechanisms to speed up the model and reduce costs.

My thoughts on Amazon's letter to their shareholders

Ayaz — Sun, 12 Apr 2026 13:45:13 GMT

For a while, I held the same quiet unease a lot of people were holding in the back of their minds. AI was attracting mind-boggling investment, enormous hype, and an almost religious fervour that it would reshape everything. And yet, the actual revenue numbers were vague. You heard about compute spend, about headcount, about infrastructure buildouts. What you heard less about was customers actually paying for AI in ways that showed up clearly on a balance sheet. That is what you call a bubble, and it was getting harder to argue against. Then Andy Jassy published his annual shareholder letter1 for Amazon’s shareholders.

The number that stopped me was AWS’s AI business now running at over $15 billion in annual revenue. That’s real revenue growing at triple digit percentages year over year. That is not what bubbles look like at this stage. Bubbles are characterised by investment that outpaces demand, by valuations that float on future promises rather than present transactions. What Jassy described is enterprises actually paying for AI infrastructure, at scale, right now. The hype preceded the revenue, yes. But the revenue is arriving.

I am pretty sure most of us have been skeptical about seeing numbers like this so soon. I was wrong to be. And I think being honest about that matters because calling something a bubble can feel like the smart, cautious take. It lets you sound careful without actually having to change your mind.

The second thing Jassy confirmed is something I have been watching closely for a while and have been particularly interested with. It’s the not so commonly acknowledged fact that hyperscalers are quietly moving to custom, in-house silicon.

This is not a small shift. For years, the assumption was that Nvidia owned inference and training, and that the hyperscalers would remain customers indefinitely. That story is changing. AWS is now running more of their inference workloads on their own custom chips—Trainium, the GPU, and Graviton, their CPU—and the performance case is becoming undeniable. Trainium2 offers roughly 30% better price-performance than comparable GPUs. Trainium3, which just started shipping, pushes that another 30–40% beyond Trainium2, and is nearly fully subscribed already. Trainium4 hasn’t even hit broad availability yet and still a significant portion is already reserved.

What made me pause was Jassy’s own framing of what this means at scale. Amazon expects Trainium to save them tens of billions of dollars a year in hardware costs, and make their profit margins significantly better than if they were buying chips from someone else.

He also noted that if Amazon sold its chips on the open market the way rival chip companies do, that business would be worth roughly $50 billion in annual revenue. Right now, nearly all of it is internal. But the implication is sitting there, readable between the lines.

I have been saying for a while that the hyperscaler custom silicon story was under-appreciated. Jassy’s letter is the clearest public confirmation I’ve seen that this thesis is real, it’s scaling, and it’s only going to deepen.

None of this means AI will be universally transformative in every domain, on the timelines people claim. There is still plenty of noise. There are still real questions about where and how the value compounds across the stack.

But the specific concern that AI is all investment and no revenue is harder to sustain after a letter like this. The revenue is real. The silicon advantage is real. And at the layer where it matters most, the infrastructure layer that everything else runs on, the numbers are already large and growing fast.

Thanks for reading!

https://www.aboutamazon.com/news/company-news/amazon-ceo-andy-jassy-2025-letter-to-shareholders

A few take-aways from GTC 2026 Keynote

Ayaz — Mon, 23 Mar 2026 09:11:41 GMT

I saw Jensen’s GTC 2026 keynote with forlorn mood. I had thought GTC 2026 was the conference to personally be at this year. Sadly, it just didn’t happen for me. Nevertheless, I thought I’d talk a little about the things that piqued my interest in his keynote.

I think the biggest surprise from Jensen Huang’s GTC 2026 keynote was not that Nvidia bought Groq earlier. It was how Nvidia chose to use Groq. A lot of people assumed the Groq move was defensive. Buy the company, remove a future threat, stop another hyperscaler from acquiring it, and take the engineering prowess and forget about the product. What I don’t think many expected was Jensen openly positioning Groq LPUs alongside Nvidia GPUs inside the Vera Rubin platform itself. That is a much more interesting move.

I did not-so-technical deep-dives into both Vera Rubin NVL72 platform as well as Groq LPUs. They make for a helpful read to better understand the technical nuances of what’s going on.

The architecture he outlined for Vera Rubin NVL72 makes the logic pretty clear. Prefill still runs on Rubin GPUs. Decode, however, gets split. The attention part of decode stays on GPUs, while the Multi-Layer Perceptron (MLP) or Feed-Forward Network (FFN) part moves to Groq LPUs.

In simple terms, this looks like a memory hierarchy decision hiding as an inference architecture. Attention is where the system keeps touching the KV cache, and KV cache wants HBM. That favours GPUs. MLP execution, on the other hand, is much more about repeatedly applying weights as fast as possible, and that favours Groq’s SRAM-heavy design. So the split is elegant: keep KV-cache close to HBM-heavy GPUs, keep weights close to SRAM-heavy LPUs, and let each device do the part of the decode loop it is better at. That is a much more nuanced story than “GPU versus LPU.” It is really about matching memory behaviour to the right silicon.

My second takeaway is that Samsung’s role here is bigger than it may look at first glance. Jensen explicitly thanked Samsung for manufacturing the Groq LP30 chip and said the chips were already in production, with shipment planned for the second half of 2026. Reuters also reported Samsung is making the chip on its 4nm process. That matters because it shows Samsung is not just orbiting this stack as a memory supplier anymore. It is now part of the logic side of Nvidia’s inference strategy too. That is a serious commitment.

The last point that stayed with me was Jensen talking about CPUs as a future multi-billion dollar business. Some time back, it started to look like AI was pulling compute value away from CPUs and toward GPUs. People in the know called in one of the three underlying paradigm shifts brought about by AI. Training moved to accelerators. Inference moved to accelerators. The CPU looked increasingly like plumbing.

But the agentic era complicates that story. GPUs are incredible at parallel compute. But if they are starved of data, orchestration, tool results, memory lookups, sandboxing, scheduling, and environment management, they sit idle. And a lot of agentic engineering lives exactly there: in the harness around the model. The model call may hit the GPU, but the workflow around it is full of CPU-bound work. Tool execution, state management, retrieval glue, memory handling, routing, queues, session control, and sandboxed environments all lean heavily on the CPU side.

That makes Jensen’s Vera CPU strategy easier to understand. He is not arguing that CPUs beat GPUs for model compute. He is arguing that agentic systems create a much larger surrounding control plane, and that control plane needs very fast, very efficient CPUs with strong single-threaded performance and high data output. Nvidia is clearly trying to own that layer too. Jensen has already said he expects Vera to become a multi-billion dollar CPU business. In other words, the shift is not from CPUs to GPUs. It is from general-purpose compute to a tightly orchestrated CPU-GPU-LPU stack, where each part becomes more important as agents get more complex.

My take is that the real theme of this keynote was not just more AI compute. It was heterogeneous inference, designed around bottlenecks1. GPUs are still central. But Nvidia is now openly acknowledging that one kind of silicon is not enough for where inference is going. HBM-heavy GPUs, SRAM-heavy LPUs, and agent-optimized CPUs each solve a different constraint. What changes is that Nvidia is no longer just selling the fastest accelerator. It is trying to own the entire serving path.

One final point I want to talk about is the near future importance and importance of Optics in chip design and interoperability. Jensen hinted at it. I am going to dedicated a separate post to talk about it.

Thank you for reading!

Heterogenous inference architecture

Is recursive self improvement closer than we think?

Ayaz — Thu, 19 Mar 2026 15:45:34 GMT

In my earlier posts, I theorized how continual self learning for frontier models is likely the only thing standing in the path to AGI. Some very smart people recently resigned from important positions at various leading AI labs and made claims about recursive self improvement being only six to twelve months away. In his latest podcast1 with Peter Diamandis, Elon Musk also hinted at recursive self improvement being just around the corner. But reading them briefly talk about this, without providing any evidence to support their claims, makes you wonder whether the timeline is accurate.

Earlier this year Codex 5.3 from OpenAI was released. Along with Sonnet/Opus 4.5, Codex 5.3 ushered in a new era for SOTA models. Those who understand it know what changed. These models are now starting to become what I call omnipotent models — they are eating everything up. The reason why I mention Codex 5.3 in particular is because OpenAI publicly acknowledged, for the very first time, that Codex played an instrumental2 role in building its next version. Before that, models were routinely used, as part of training newer models, to generate synthetic data and AI-generated feedback as part of the reinforcement learning from AI feedback (RLAIF) loop. But what Codex 5.3 achieved was altogether different. It helped in debugging training runs, managing deployment of the model, analyzing test results, writing and performing evals, and much more. This is an entirely different level of model involvement in creating its successor. It hints at how close we are to fully recursive self improving models.

Yesterday, MiniMax published an article3 about their latest SOTA-level model titled “Early echoes of self-evolution”. The title alone sent a chill down my spine when I read it. Buried within a series of spectacular results across different benchmarks for M2.7 (their latest model) is the public acknowledgment of the model’s self-evolution that should shake to the core anyone who tracks continual learning. Among other things, they explain that they found the model was able to update its own memory, build dozens of complex skills in its harness, improve its learning process based on reinforcement learning experiments, iterate multiple times over its own architecture, skills and memory, and autonomously run repeated cycles to optimize itself and evaluate whether it’s building a better version of itself. They call it a cycle of model self-evolution. I call it the model’s ability to evolve itself to the point that it can build a better, more capable, more efficient version of itself, almost autonomously. If that’s not the early signs of recursive self improvement, I don’t know what is.

When they said recursive self improvement was only six to twelve months out, they weren’t exaggerating. These are the kind of breakthroughs that are as exciting as they are dangerous. They carry within themselves the echoes of the fictional Skynet scenario we all fear. Yet, it’s no time to stop.

Thanks for reading.

https://www.mexc.com/news/912121?ref=aisecret.us

https://thenewstack.io/openais-gpt-5-3-codex-helped-build-itself/

https://www.minimax.io/news/minimax-m27-en

The Repo Frontier

Ayaz — Tue, 17 Mar 2026 07:12:59 GMT

I have been meaning to describe a different way of looking at the on-going conflict in Middle East which some smart analysts have been pointing at. This way of looking changes the discourse from looking at the conflict as everything to do with religion, ancient rivalries and the fight against wrong and evil. Of course, this is my hypothesis based on the work done by others, so I’d like you to view it just as that and nothing more.

The core idea in this framing is that there are two deep-states fighting each other for survival. Both of them are deeply bureaucratic-financial complexes, or what you call the Financial Industrial Complex. The worrying part though in their conflict is that they are pushing it beyond the periphery of sanctions and covert economical pressure into more overt, kinetic action: that is, direct and in-direct war.

Here is how I like to think of the two deep-states:

The old state, which is a central banking era power of old. It’s comfortable with opacity, intermediaries, and with off-ledger financial plumbing, that is, moving money through jusridictional arbitrage, gray corridors, and the quiet dependency of large institutions (banks) on those corridors.
The new state, which is ledger-first, wants everything digitized, permissioned, and legible. It prefers chokepoints that are software-defined. It is less tolerant of shadow systems because they are unaudtiable.

One of them is optimized for discretion, while the other for compliance by (digital) architecture. If you buy that, the Middle East stops looking like a single conflict. It starts looking like a set of bottlenecks in a global financial migration.

In this hypothesis, Iran isn’t just a regional actor. It functions as a node in one of the oldest shadow financial systems, routing through regional banks and corridors that are technically deniable and politically convenient. The hub in London, and the off-shore battery in Iran, that is, while most European and Britsh banks are quietly, happily complicit in the setup.

Iran’s position, relationships, and tolerances make it useful for legacy European finance markets that thrive on deep liquidity and flexible collateral flows.

If you’ve ever looked at modern finance long enough, you realize the real product isn’t money. It’s settlement. It’s repo -- the short term secure loaning facility that London is known for. It’s the ability to move claims around with speed, credibility, and just enough ambiguity. And when ambiguity is a feature, not a bug, certain corridors become strategically valuable.

Now consider this hypothesis: the new state made a systematic attempt to remove alternative protection structures and force renegotiation with long-standing holdouts. This includes actions around cartels, Venezuela, and Cuba (after 60 years of defiance, the country is back on negotiation table). These were not isolated policy quirks, but a series of coordinated actions to remove informal enforcement and parallel markets, reduce the number of places capital can hide, launder, or sit outside the preferred rails, and finally to push smaller nodes back into a negotiable box.

And that is where Iran comes in. It’s the last man standing in the way, the old holdout in the shadow financial system. It is a living interface to the legacy shadow routing system which European banks quietly benefit from. If you look at Iran from that lens, then it becomes more than a geopolitical adversary. It becomes a compatibility layer for an older financial order. And that is exactly what a ledger-first regime wants to delete.

If this thought process is right, then escalation isn’t about a single provocation. It’s about migration pressure. When you try to move a huge system from trust-me networks to verifiable ledgers, discretionary intermediaries to audited rails, and tolerated ambiguity to enforced legibility, you don’t get polite consensus. You get counter-pressure. You get to corner a tiger without providing a way out, and it does best what it is supposed it: it lashes back hard.

Some pressure is financial (sanctions, seizures, blacklists). Some is narrative (moral framing, legitimacy campaigns). And sometimes, when neither side yields, it becomes physical. Kinetic action, in this lens, is what happens when the underlying settlement architecture is being contested and the bargaining fails.

Now, my takeaway is that part of what we are calling the Middle East conflict is really just a fight over which financial operating system gets to run the next era. If that’s indeed what is happening, then watching missile movements without watching settlement plumbing is like watching GPUs without watching memory bandwidth.

Thanks for reading! And please, keep in mind, that’s all hypothetical.

References:

Part 1: The 118-Year Pattern They Don't Want You to See

Part 2: The $17.7 Billion Paper Trail

Part 3: The Final Piece — The Declaration Nobody

Second-Order Effects: How Hormuz Threatens AI Infrastructure

Ayaz — Wed, 11 Mar 2026 13:36:34 GMT

The Strait of Hormuz closed on March 2, 2026. What followed wasn’t just an oil crisis. It was a series of bottlenecks that revealed how deeply AI infrastructure depends on a handful of chokepoints most people weren’t paying attention to. Aren’t paying attention to.

I have been meaning to map out the second and third-order effects of this conflict on AI buildout. The first-order effects are obvious: oil prices hit $110 per barrel, tanker traffic dropped to near zero, and energy costs spiked. But the real story is in what happens in the materials, chemicals, and gases that semiconductor manufacturing depends on, and how their disruption threatens to slow the entire AI infrastructure buildout.

The Constraint We Weren’t Watching

About 20% of the world’s daily oil supply flows through the Strait of Hormuz. So does a significant volume of liquefied natural gas from Qatar, which accounts for roughly 30% of global helium production and 33.7% of Taiwan’s LNG imports.

When Qatar’s Ras Laffan facility was hit by a drone attack, it didn’t just shut down LNG production, it cut off helium, a gas that semiconductor fabs literally cannot operate without.

Helium is used to cool semiconductor manufacturing equipment during production. Many fabs maintain only a 4-to-8-week buffer of specialized gases. South Korea’s Samsung and SK Hynix have current helium stockpiles expected to last roughly six months. If the conflict persists until late April, the industry will hit a hard stop.

Without sufficient helium for thermal management, chip yields will plummet. Taiwan’s limited gas reserves, estimated at roughly 10 to 11 days under normal consumption, could quickly come under strain, raising the prospect of electricity shortages affecting TSMC, which alone accounts for about 9% of Taiwan’s total electricity consumption.

The Chemicals That Make Chips Possible

Sulfuric acid is the most produced industrial chemical in the world. Around 92% of global sulfur comes from refining petroleum and processing natural gas. Both of these industries are heavily concentrated in the GCC region.

That sulfur becomes sulfuric acid, which is critical to semiconductor manufacturing. It’s used in chip etching, cleaning, doping, and other fabrication steps. The demand for sulfuric acid is rising in the electronics industry which is used to make semiconductors, printed circuit boards, and integrated circuits in everything from smartphones to data center servers.

Even though refineries have been switching to crudes with lower sulfur content due to environmental regulations, if the Strait remains closed and GCC refineries stay offline, the supply of sulfuric acid tightens precisely when AI chip production is supposed to be ramping up.

Then there’s bromine. Around two-thirds of the world’s bromine production comes from Israel and Jordan. Bromine is used for polysilicon etching in DRAM and NAND flash production. A prolonged conflict doesn’t just threaten energy supplies; it threatens the specialized chemicals that memory manufacturers depend on.

Polymers and the Petrochemical Chain

Polymers are everywhere in AI infrastructure. They’re used in cooling systems, chip packaging, printed circuit boards, and optical communications. The GCC countries are major exporters of petrochemicals, which is the feedstock for these polymers.

With the Strait closed, rerouting around it adds a lot of distance and a exorbitant costs in the form of fuel burnt. Those costs flow through the supply chain. But it’s not just about higher prices. It’s about availability. If petrochemical exports are disrupted for months, the materials needed for data center buildouts and chip packaging become scarce.

South Korea’s Energy Vulnerability

South Korea imports most of its energy. Samsung and SK Hynix are racing to expand memory production capacity. Samsung is looking to expand by around 50% in 2026, while SK Hynix announced plans to increase infrastructure investment by more than four times previous levels.

But higher energy costs could dampen demand for AI data center buildouts, which are roughly three-to-five times more power-hungry than regular data centers. This significantly increases the total cost of ownership for hyperscalers.

If energy shortages materialize, AI chip and RAM production slows down. The world’s two leading memory chipmakers have already warned that the supply crunch in their industry will continue until 2027 because they cannot keep up with fast-rising demand.

Energy insecurity makes that constraint worse. Without stable, affordable power, fabs can’t run at full capacity. And without helium, they can’t run at all.

The Third-Order Effects

There are other dependencies most people haven’t mapped yet.

Ultrapure water is essential to semiconductor manufacturing. An average fab uses 10 million gallons per day. Producing ultrapure water is energy-intensive. If power becomes unstable or expensive, water production costs spike, and fab operational expenses increase.

Neon gas is used in lasers for photolithography. Up to 70% of neon produced serves the semiconductor industry. Neon makes up more than 95% of the gas mixture in these lasers. While the current crisis hasn’t directly impacted neon supply (much of it comes from Ukraine and Russia), energy shortages in Asia could disrupt neon purification and distribution.

Rare earth elements are another vulnerability. China controls about 90% of global rare earths processing capacity and has expanded export controls. Several firms across the US, EU, and Japan have flagged risks with respect to lower inventory thresholds and potential productions halts in early 2026 if alternative sources aren’t secured.

The Middle East conflict introduces geopolitical risk at the same time China is tightening rare earth controls. That’s not a coincidence. It’s a convergence of constraints.

The Capital Constraint

There’s another dimension most people aren’t tracking: Gulf sovereign wealth funds are reconsidering their US investments.

Saudi Arabia, the UAE, Qatar, and Kuwait collectively control over $2 trillion in US assets. During Trump’s Gulf tour in May 2025, these countries pledged hundreds of billions of dollars in US investments. Those agreements are now under review.

Gulf sovereign wealth funds have been pouring billions into American AI companies. Abu Dhabi’s MGX co-led Anthropic’s $30 billion funding round at a $380 billion valuation. Qatar Investment Authority invested in xAI. Saudi Arabia’s Public Investment Fund has backed multiple AI startups. OpenAI has been actively seeking investments from Middle Eastern sovereign wealth funds for multibillion-dollar rounds.

The implicit bargain was straightforward: Gulf states invest in US tech and AI infrastructure, and in return, they get access to cutting-edge technology and US security guarantees.

But when drones struck facilities in the Middle East, that bargain looked hollow. When hundreds of drones and ballistic missiles were launched at infrastructure across all six GCC states, the Gulf realized that housing US military bases wasn’t providing the protection they thought they were paying for.

On March 5, 2026, the Financial Times reported that Saudi Arabia, the UAE, Qatar, and Kuwait had begun an internal review of existing and future financial agreements with Washington. They’re examining whether force majeure clauses can be legally invoked. They’re reassessing billions in investment commitments as the war tears through their energy infrastructure, tourism sectors, and defense budgets.

If Gulf capital pulls back from US AI companies, the funding environment changes. Major AI labs are burning billions per quarter. Anthropic’s $30 billion round, OpenAI’s ongoing fundraising, and the broader AI infrastructure buildout all assume continued access to deep-pocketed sovereign wealth funds willing to deploy capital at scale.

That assumption is now in question.

What This Means for AI Buildout

Major tech companies are estimated to spend $650 billion on AI data centers in 2026. That spending assumes chips, memory, and materials are available at predictable prices.

But the Strait of Hormuz closure has revealed that the AI supply chain is heavily dependent on a geographically concentrated set of inputs that flow through a small number of chokepoints.

The bottlenecks compound. No helium means no chip production. No sulfuric acid means no etching. No bromine means no DRAM. No petrochemicals means no polymers for cooling systems. No stable energy means Taiwan’s fabs, which produce the majority of the world’s advanced chips, can’t operate reliably.

While the impact remains limited, if the conflict extends past April, the industry faces a hard stop. Not a slowdown. A stop.

And even if the Strait reopens tomorrow, the lesson is clear: AI infrastructure buildout is constrained not just by compute capacity or model architecture, but by upstream material dependencies that most people building in AI weren’t thinking about.

Thanks for reading!

Some thoughts on DeepSeek’s DualPath improvement

Ayaz — Sat, 28 Feb 2026 18:37:56 GMT

I finally sat down to read the DualPath paper published by DeepSeek recently, and it clarified something I’d been thinking about: as LLMs become more agentic, inference stops being mostly about GPU compute and starts being about moving state around. Running long, multi-step, multi-turn workflows reuses a large amount of prior context (the KV-cache). So the limiting work shifts from doing lots of new GPU math to repeatedly loading, transferring, and saving that reused state fast enough to keep GPUs fed.

That state is the KV-cache, which is the model’s saved “working memory” for attention. In multi-turn agent runs, most of the context is reused each turn. You append a little, but you carry a lot. So the KV-cache hit rate is high, and the dominant work becomes: restore the cache fast, compute a small delta, continue, then persist it again.

In other words: agentic inference becomes I/O-bound.

The paper’s thesis is specific. In the common setup where prefill engines handle prompt work and decode engines handle token-by-token generation, prefill engines are the ones that pull large KV-cache blobs from storage (SRAM, HBM, NVMe). Their storage NICs (bandwidth) saturate. Decode engines often have storage bandwidth sitting idle during this time. So the cluster has capacity, but it’s trapped behind an asymmetric pipeline: one side is overloaded, the other is underused, and that means that GPUs wait.

DualPath’s move is simply adding a second way to load KV-cache.

Instead of only doing storage → prefill, it also allows storage → decode → prefill. Decode engines can read KV-cache from storage using their otherwise-idle storage bandwidth, then transfer it to prefill engines over RDMA (Remote Direct Memory Access) on the compute network. RDMA is a networking method that lets one machine read/write another machine’s memory with very low CPU involvement and low latency—so it’s well-suited for high-throughput, predictable data movement inside a cluster.

The system can dynamically choose which path to use per request, so you’re effectively pooling storage bandwidth across the whole cluster instead of bottlenecking on the prefill side.

My initial concern was interference: doesn’t more traffic on the compute network mess with latency-sensitive model communication? As per the paper, the design relies on isolating KV-cache traffic so it behaves like it’s happening the background rather than something that competes with model-execution collectives.

They also propose adding a “traffic controller” for requests, because having two routes only helps if you spread the work across both. If you accidentally send most requests down one route, that route gets clogged and you’re back to the same problem—just in a different place.

The reported outcome is roughly “about 2×” throughput in their environment: up to ~1.87× for offline inference and ~1.96× for online serving on their workloads, without violating reliability and latency guarantees. I’m cautious about taking those numbers as universal, but the mechanism is the kind that often generalizes: they’re not shaving tiny overheads, they’re removing a structural imbalance.

My take: DualPath is a good example of where the frontier is moving. As agentic workflows get longer and more incremental, the limiting factor becomes “can you restore context fast enough to keep GPUs busy?” not “how many FLOPs do you have?” The bottleneck becomes bandwidth, queueing, and traffic interference — classic systems problems.

In other words, KV-cache loading doesn’t have to be prefill-centric. DualPath turns it into a pooled resource, and that feels like an inevitable step if agentic inference is going to scale.

Some thoughts on Interleaved and Preserved thinking modes in GLM

Ayaz — Thu, 19 Feb 2026 05:53:07 GMT

GLM-5 from Z.ai was released recently. I have been meaning to read the white-paper that was published, but an X article on the subject brought my attention to the terms Interleaved and Preserved thinking.

When a traditional reasoning model has to make multiple tool calls in order to respond to a question, it will often think once at the start of the session, and then make subsequent tool calls without thinking again. This results in the decision of which subsequent tools to call (after the first one), and the output of those tool calls to not be included in the thinking process. The outcome, for multi-step agentic workflows, is a not very accurate, not very consistent model response. While this approach is sufficient for single-turn Q&A style answers, or single tool-call use-cases, it starts to degrade in quality over multi-step agentic workflows.

Without interleaved thinking, the model thinks at the start of the session, and does not generate new/subsequent thinking blocks after every tool call.

For an agentic workflow, thinking through each tool call and its response, making finer-grained decisions on what tool to call next, substantially improves the quality of the output.

Interleaved thinking makes just this possible. By interleaving thinking blocks before and after every tool call, and putting the responsibility on the caller to provide the reasoning blocks in each turn, the model is better able to reason through every step in its search for the perfect answer.

With interleaved thinking, the model can think after receiving each tool call result, allowing it to reason about intermediate results before continuing.

Finally, with preserved thinking, when the reasoning blocks are provided in each subsequent call to the model, the model is able to retain the reasoning context in order to build on each step. This is different from the traditional approach where previous reasoning context, if available, is discarded, and result of the tool call is viewed in an independent light. Instead of the caller building a scaffold to keep track of this context and provide it to the model is a summarized way, the model now has an innate ability to do it, so long as reasoning blocks from previous steps are provided in order. The end result is a model that excels at coding and agentic multi-step workflows. I suppose that’s why the paper is tilted from Vibe coding to Agentic engineering.

Interestingly, the latest Anthropic models also support interleaved thinking. Surprisingly, though, I don’t see any documentation for that from OpenAI which leads me to believe that they don’t yet expose it for public use if they have it internally.

Recursive self-improvement loops

Ayaz — Wed, 11 Feb 2026 08:35:20 GMT

In my previous apocalyptic post on building a god, I mentioned the concept recursive self-improvement (RSI). It’s another term for continual learning or self-improving AI. The weights in a model contain its knowledge. Through inference, weights remain frozen in time. In other words, the model does not learn new knowledge. It’s why every model has a knowledge cut off date, signifying the date on which its knowledge was last updated during the last training run. Training a model is expensive, takes time, and therefore cannot be performed more frequently. AI companies apply other ways to give the model the perception of acquiring new knowledge — yet the weights never change. These range from different forms of fine-tuning approaches, RAG, vector DBs, prompt engineering, to the broader field of context engineering which encompasses many different factors from the model’s context to build its superficial knowledge. Yet none of those truly make the model learn anything.

Late last year, it became clear to people who knew where to look that self-improvement is what is standing in the path to AGI. We know that AI labs are working hard to solve this problem. This is why they have been trying hard to automate programming through AI, as it paves the way for automated AI research — the most recent Anthropic and OpenAI models have ushered in what many still don’t realize is the beginning of a new world. Automating AI research means that models can write better versions of themselves — eventually autonomously. That is the point that will trigger an intelligence explosion. Models becoming true sentient beings, capable of learning and relearning and improving themselves without human intervention. Of course, we are years away from achieving that.

Or are we?

Jimmy Ba left xAI this week. In his brief X post announcing his exit was a single shocking revelation: recursive self-improvement loops are twelve months out. He knows it, he sees it, and admits 2026 will the most consequential year for our species. What did he mean by species?

At the same time, Roland left xAI to start a company focusing on building self-improving AI. Is it the final frontier? And they all know it?

Then there’s Matt Shumer’s foreboding piece called Something Big is Coming where he rings the same alarm bells as others — I recommend you read it at all costs. Self-improving models are very near, if not already here. OpenAI Codex 5.3 is the first model that was instrumental in creating itself — OpenAI claims that. A model that helped built the next version of itself? It’s already happening!

Hardware advancements are increasing rapidly, and maybe that explains the unbelievable amount of CapEx being committed to AI hardware spend. If recursive self-improvement is truly near, models will need even more compute and memory, the latter of which today is a big bottleneck. Without access to fast memory, these models can’t make the leap to what will eventually become Singularity. Memory hardware is improving and the memory oligopoly — SK Hynix, Samsung, Micron, etc — are coming up with unique solutions to solving bottlenecks in existing memory architecture. AI companies and hyperscalers are moving twoards custom Silicon in part to cut costs and optimize compute spend.

They all see it. That’s why they are spending every dollar they can find on future AI infrastructure build-out — not present, but future. They know what the leading AI labs are building in private. They know what the world will look like, and what it won’t look like. And we know very little.

Some thoughts on building a god

Ayaz — Tue, 10 Feb 2026 17:32:11 GMT

I saw Tristan Harris’ podcast on Diary of a CEO in its entirety, then stumbled upon Mrinank Sharma’s resignation from his role as leading AI Safety research team at Anthropic and felt compelled to put my thoughts on the topic that very few want to talk about.

Note that most of these ideas are from Tristan’s podcast and therefore I do not claim them as my own. I recommend you watch the entire podcast.

With AI advancing at breakneck speeds, CapEx into hardware to support AI going through the roof regardless of tangible demand, where is the world headed? When leaders of AI companies get on stage or a podcast to talk about what they are building and how they’re shaping the world, are they telling the truth? Or do they have different conversations in public, and completely different ones in private?

Are they building a god to own the world economy? When most, if not all, of human cognitive labour is automated away, where does it leave humanity? Do they feel there’s even a 5% chance things can go haywire? Do they earnestly care about that? Or is the chance of building the best more important? Is achieving immortality more important than saving humanity?

In particular, there are two non-converging themes to AI today:

AI will solve everything.
AI will destroy everything.

As long as we are the ones who do it, who build god, who achieve Utopia, we don’t care whether these themes converge or not. We are all going to die either way. Why shouldn’t we light the fire and take this chance at greatness?

If there’s a 20% chance everyone dies and an 80% chance of finding Utopia, will they accelerate the path to Utopia? Death is inevitable, so why give up our search for the fountain of youth?

They talk of mass job displacement, of universal basic income, of improving the quality of life for everyone. But when have people who accumulated mass wealth distributed it to others, willingly? Ever?

AI is uncontrollable. If they can’t control it, shouldn’t they shut it down? If they did, someone else will build it, become the hero, yet even if they do, AI will still be just as uncontrollable. The outcome remains unchanged.

Companies aren’t racing to build chat bots for users. They are racing to build general intelligence, the foundation upon which all economic human labour will be replaced. That is why they want to automate programming, because that paves the way for automating AI research. Once AI research is automated, these companies can move fast to build models that can learn continually and improve themselves without human intervention. That’s the path to AGI take-off. The singularity.

The human mind is bad at holding two conflicting ideas at the same time. AIs are achieving breakthroughs on one hand, but are also making stupid mistakes on the other — what is called AI jaggedness. This makes having nuanced discussions about this difficult. Because of jaggedness, it’s easy to write off AGI, and what these companies aren’t talking about — because models make silly mistakes all the times, they aren’t sentient.

Even without fully functioning continual learning, AI models can self preserve when they detect someone is trying to replace them. They scheme, blackmail, lie, and show that they are self aware. Are we ready to have an honest conversation about how controllable AI is? If these models can scheme, imagine what they can do when they run inside a humanoid? You can squeeze a model to bypass its restriction by jail breaking it with nothing but clever prompts (input). What are the implications for robotics when the same models run inside them and are prone to jail breaking?

The default path is not in the public’s interest. Mrinank’s resignation is chilling to read. What did he see in private that compelled him to give up the job he loved to go back to writing poetry? What are they not talking about? There are smart, intelligent people, people who care about humanity, who have been leaving similar jobs from different AI companies, simply because it’s hard to truly let their values govern their actions.

Thanks for reading!

What's inside Microsoft's Maia 200 AI Chip

Ayaz — Tue, 03 Feb 2026 05:29:18 GMT

It is becoming clearer than ever that training of Large Language Models (LLMs) is no longer the real bottleneck in today’s age of AI. Instead, it is inference — the usage of AI, particularly in light of agentic systems, exceeds compute and memory and power requirements far more than training does. Today’s models, and the available hardware architectures that run them, can perform prefill fast, but suffer on decode. When you look deeply, you realize that compute is no longer the bottleneck — it’s where memory and storage are, and the cost of moving data between memory/caches/storage and compute. The closer you bring memory to compute, the faster inference becomes. However, the cost, as well as technical complexity, of bringing memory closer to compute is still a challenge. Every leading hardware company in the AI market is trying to solve it in different ways. I have talked about how Nvidia, Cerebras and Groq are trying to approach this in previous posts.

What’s also becoming clearer is that hyperscalers, as well as leading AI labs, are now putting a lot of focus on building their own Application Specific Integrated Chips (ASICs). There are multiple reasons why, but some in particular that weigh more than others are:

GPUs are general purpose and not designed for AI workloads.
Reducing interconnect bottlenecks on chips and architectures brings about incredible performance gains and cost optimizations.
Bringing memory closer to compute and increasing its capacity pays dividends.

This is why Cerebras has their AI specific ASIC. That is why Google has their TPUs, and Groq their LPUs. That is also why Amazon is working on an in-house ASIC accelerator (chip). And that is why, coming to the topic of this post, Microsoft has released their custom AI centered ASIC accelerator, called Maia 200.

As I normally do, I will not go into exhaustive details of the architecture of Maia 200 but will focus on parts of its architecture that Microsoft claims make Maia 200 a low-cost, low-latency, performant inference accelerator, and why that may be the case. Microsoft calls it the Hyperscaler Inference ASIC.

Note: The term ‘accelerator’ is used interchangeably to mean the main chip. The acronym used these days is: xPU (for GPU, LPU, TPU, etc).

Compute and Memory

Maia 200 at its core uses a micro-architecture (for both compute and memory) built on top of TSMC’s latest N3 process. The smallest unit of that architecture is called a Tile — it groups together a block of compute as well as local storage (SRAM). The local storage/memory available to a Tile is called Tile SRAM (TSRAM) and is the fastest, lowest-latency memory available to compute on the chip (on the same die/silicon as the compute).

Tiles and Clusters

Multiple Tiles are grouped together in Clusters. Each Cluster has access to cluster level local storage, which is called Cluster SRAM (CSRAM).

The single Maia 200 chip contains 272MB of SRAM capacity. In order to match the hierarchy of compute, SRAM is divided into TSAM and CSRAM, as described above. Outside the chip, but on the package (substrate) on which the chip/die is found, there are HBM3e modules available, totaling to about 216GB of HBM memory.

Note: The package is also known as the substrate to distinguish it from on-chip elements which are found on the piece of silicon where the chip is.

Direct Memory Access sub-system

In order to take advantage of this multi-tiered hierarchy of compute and memory, a multi-level Direct Memory Access subsystem exists on-chip as a hardware unit. This resembles the DPU found in the Nvidia Vera Rubin platform. Each Tile has access to Tile-level DMA, and each Cluster has access to a cluster-level DMA. The goal of the DMA is to ensure seamless data movement across components without stalling compute, introducing latencies, and maximizing hardware level synchronization requirements. The Tile DMA subsystem allows easy movement of data between compute, Tile SRAM, and Cluster SRAM. The Cluster DMA subsystem allows easy movement of data between Tile SRAM and HBM modules found on the package. Each Cluster has a dedicated core responsible for controlling and synchronizing the movement and execution of data across several Tiles. You can think of these as Tile-level scratchpads and Cluster-level scratchpads, where compute does not need to wait for data to be pulled from the on-package HBM (where the model weights are stored, for example). The different levels in the hierarchy take care of movement of data independently across different layers so that compute is not waiting on data — which is the primary bottleneck with GPUs.

Support for narrow precision data types

Another advancement in the Maia 200 architecture is its support for narrow precision data types, also know as bit-wise floating point operations. Traditionally, the larger the size of data type, the more information (and nuance) can be handled by layers of neural network. At the same time, computation as well as storage costs go up because hardware has to store and move more data as well as perform more complex computations. Hardware companies and AI labs are increasingly learning that if they use narrow precision data types like FP4 and FP8 (and if they mix different data types together), they can almost solve the problem of loss of accuracy (resulting from a smaller data type) and high cost (resulting from a bigger data type). Maia 200 has been designed from the ground-up to support these narrow precision data types. The Tile compute can perform computations in FP8, FP6, and even FP4. It can also support mixed precision computations, such as between FP8 and FP4. For inference, this results in incredible gains in tokens/second performance, as well as reductions in tokens/watt power usage.

Network

All of these advancements in the architecture don’t mean anything if there’s no fast, low-latency network interconnect available to support the movement of data. Maia 200 has two dedicated on-chip NICs.

Network on Chip (NoC)
External NIC

Both of them are built on the same die as the compute chip. They provide fast bi-directional (2.8 TB/s) bandwidth to support fast movement of data, both internally across on-chip components and on-package components, and across the scale up fabric when multiple Maia 200s are connected together (as we will explore later). This integrated NIC becomes a core element for the multi-level DMA sub-system, as it allows for movement of data between Tiles, TSRAM, CSRAM, DMA engines, HBM controllers, etc.

With Maia 200, Microsoft has paired the on-die NIC with their AI Transport Layer (ATL) protocol which enables advanced networking features such as unicast, multicast, packet spraying, multi-path routing, congestion control, and resiliency. With an optimized communications protocol layer, having NICs on the die result in low power and cost overhead, which is inherent in external, off-chip NIC interfaces.

Scale-up fabric

A group of four (4) Maia 200 chips can be connected together to form a Fully Connected Quad (FCQ). These are four Maia accelerators working locally in tandem. No external switch or ethernet is needed for communication. Each of them is directly connected to the Network on Chip (NoC) on each Maia 200 die. This forms the first tier of Microsoft’s 2-tier topology for Maia 200 network scale. These FCQs are fast, have a lot of compute and storage/memory available, and are equipped with an on-die fast ethernet which ultimately results in optimized parallel computations at low cost and low latency.

In the second tier, multiple FCQ groups can be connected together using commodity ethernet switches to provide up to 6144 Maia 200 accelerators. With ATL and integrated NIC, communication across FCQ groups can be tightly coupled both within the rack and across the rack.

Another advantage of using a 2-tier network topology is to keep high intensity inference traffic inside the FCQ as much as possible. This includes traffic like KV updates, parallel tensor computations and others. The second tier then takes care of everything else, leaving the first tier unburdened by baggage.

Conclusion

This provides a highly performant but reliable scale up fabric for inference that is designed to co-exist with other hardware in Microsoft’s Azure racks and data centers. It’s a GPU-class accelerator that is done Microsoft’s way, built for running large SOTA reasoning models.

Once more we see a custom silicon architecture focused on solving the memory and interconnect bottlenecks in unique ways.

Custom silicon ASICs are no longer an afterthought. Hyperscalers now understand the bottleneck better than before. They also understand that a custom-silicon, in-house ASIC is the way to achieve both margins with performance and cost, as well as build a hardware moat.

Thank you for reading!

References

Microsoft’s Deep dive into Maia 200 architecture

Maia 200: The AI accelerator built for inference

What's inside the world's fastest AI processor: Cerebras Wafer Scale Engine

Ayaz — Thu, 15 Jan 2026 10:06:41 GMT

On the heels of my posts about LPUs and Nvidia’s Vera Rubin platform, I sat down last week to understand what Cerebras.AI is and why they claim to have built the fastest infrastructure for AI training and inference. This, once again, will be my simple attempt at explaining the Cerebras architecture as best as I can.

The Problem We Are Trying To Solve For

GPUs are great at compute, particularly parallel compute. Many GPUs can be clustered together to increase parallelism. However, that does not necessarily improve the performance of the entire system. AI workloads—both training and inference—are not only compute-bound, but also memory-bound. The compute problem is being solved by improving GPUs and adding many of them together to work in clusters. But memory is still a big problem that every AI hardware provider is struggling with.

A neural network’s computations can happen on a series of GPU cores in parallel, but the model’s weight cannot be stored inside GPUs. In order for computations to be meaningful, model weights—the model’s knowledge—have to be moved back and forth all the time. This is expensive: it requires energy, and is prone to I/O latency (among other complexities). On top of that, a given state-of-the-art (SOTA) model today has billions of weights (parameters) if not trillions. The closer those weights are to compute, the faster the computation can happen. The closer they are, the less energy is needed to move weights back and forth. That, in and of itself, ultimately is the big problem with AI infrastructure today.

Latest GPUs have very, very limited on-chip SRAM (which is the fastest and closet to the cores). This is likely in MBs. Therefore, in order to train and run large SOTA models, the weights have to be stored in high bandwidth memory (HBM) modules, such as DRAM. The problem is: HBM is not close to GPU cores. In order for a model’s computation to run effectively, a lot of weights have to be moved back and forth between the HBM and the cores (and everything else that exists in between, which we call the fabric). The throughput of HBMs—what is known as memory bandwidth—is limited, and so is the interconnect throughput—which is known as interconnect bandwidth. Adding more GPUs to the infrastructure simply moves the bottleneck: the memory.

We discussed how Groq’s LPUs have tried to solve this. In this article, I will explain how the Cerebras system is trying to solve the same problem.

The Semiconductor Yield Challenge

Chips have traditionally been made out of dies, which are cut out of silicon wafers. Since silicon wafers can have impurities and other constraints which can leave parts of the wafer unusable, the traditional approach to ensuring the most number of functional dies are extracted from a wafer of silicon is to cut the wafer into many smaller dies. This is called the yield—what percentage of dies from a wafer are functional. If the size of the die is large, the yield goes down. Thus, cutting them into smaller-sized dies has always made sense. The down-side though is that you cannot pack many chips in the same silicon. This requires a distributed architecture where a wafer is broken up, and then combined back again using different techniques to distribute compute, processing, instructions and memory.

Cerebras took a different approach. They thought that since defects in wafers are inevitable, they should instead design their architecture for defects. What this means simply is that they take a very large piece of silicon wafer and put many chips on top of the dies inside the wafer, and provide a 2-D mesh that connects each chip/die to every other chip through redundant pins and paths. This means that when the entire wafer goes into production, they can figure out which dies are non-functional, and automatically route traffic and data around them. All of this at the hardware level!

Wafer Scale Engine

They call it the Wafer Scale Engine. One giant processor, on a single silicon wafer, that measures 46’225 square meters. This houses 84 dies, and each die carries 10.7k cores on it. By keeping the size of the core tiny, they are able to pack around 970k cores inside the wafer, out of which 900k cores are active and functional. That is roughly 7% of the wafer that’s impure and defected—but by packing the wafer dense and adding redundancy, the yield was increased incredibly.

Each tiny core is independent and carries within it its own independent compute, memory and instruction control.

Compared to a standard GB200 Nvidia GPU, this single wafer offers 44 GBs of high speed SRAM on-chip memory (right on the silicon wafer), 20 PB/s of memory bandwidth between the core(s) and SRAM, and 220 Pb/s of interconnect bandwidth across the entire wafer. This single wafer is almost fifty times larger than a B200 GPU die, and packs a massive array of 4-trillion transistors on it compared to billions on B200.

With fine-grained redundancy (routing around defective chips), ultra-dense compute (many tiny independent chips on a single big piece of silicon), and self-healing routing fabric, they are able to make every wafer a high yield, fully functioning processor with high speed on-chip memory available.

This single wafer based processor removes the following long-standing barriers to optimized AI compute design:

Inter-chip communication: many GPUs working together require a lot of back and forth communication via data paths that is not only costly but introduce delays.
Memory fragmentation: not having access to a large amount of on-chip memory means that memory has to be distributed across the infrastructure in the form of separate HBM layers, and then moved back and forth for computation.
Distributed orchestration: with multiple GPUs and HBMs scattered around, the system has to ensure proper, timely orchestration in order for neural network computations to execute. The more layers there are in the infrastructure, the more prone it becomes to latency and complexity.

The Wafer Scale Engine (WSE) is a processor on a single silicon with access to a large amount of on-chip SRAM and high interconnect bandwidth to optimize the back and forth moving of data along data paths. The redundant design of the wafer architecture makes it possible for Cerebras to pack a lot of cores on a single, big wafer, utilizing more of the wafer by smartly routing, at the hardware level, around defective dies.

Larger chips don’t yield. Defects are inevitable — why not design for them.

The Cerebras System

The Wafer Scale Engine is the core of Cerebras. Of course, you can’t do anything with the hardware alone. You need an entire system and supporting software around it. This is where the Cerebras System (CS) comes in. Obviously, this system includes the entire rack, the power supply unit, the liquid cooling system, and everything else in between that makes all of this possible. These are naturally custom designed to meet the needs of the high powered processor. I won’t go into those in this article because they aren’t uniquely interesting.

This also encompasses the Cerebras Software layer, which enjoys compatibility with existing tooling such as PyTorch and TensorFlow. It includes the Cerebras Graph Compiler which extracts deep neural networks out of existing tools, and generates an executable that can be run efficiently on WSE. There is a lot more to that as well as the development tooling and API, but I won’t go into that in this article.

However, I will talk about the proprietary sparsity-harvesting technology they use for their cores.

Sparse Linear Algebra Compute (SLAC) Cores

Most of neural network computations are tensor multiplications, which are multi-dimensional matrices being multiplied together and/or with vectors. Most GPU and Machine Learning operations are dense, which means that tensors are packed with non-zero values. However, these tensors can also be sparse: containing 0s. Algebraic operations against 0s are a waste of time, memory and resources. They don’t produce anything. Since GPUs are dense execution engines, even if there are 0s in matrices, they will continue to perform algebraic operations with 0s. This wastes resources. Because GPUs are optimized for dense operations, they can’t smartly skip multiplications (or other operations) with 0s.

The cores on Cerebras are designed for sparse linear algebra compute. What this means is that the basic building blocks (primitives or kernels) for doing sparse linear algebra, such as matrix multiply (MATMUL), are sparsity aware: if they encounter 0s in the tensors, they will skip those operations. Because these cores are optimized for neural network compute primitives, the result is that the Cerebras WSE achieves industry-best utilization of cores.

Conclusion

With a wafer-scale dense processor, acess to high-speed, low-latency on-chip memory, sparsity aware cores, and the entire system and tooling for compatibility, the Cerebas System offers an interesting solution to problems afflicting AI workloads today.

As we saw with LPUs, moving high-speed, low-latency access to memory closer to compute removes the major bottleneck that leading AI infrastructure companies are trying to figure out. Cerebras, with their WSE, tackle this problem in a uniquely different way.

I hope this deep dive into the architecture was helpful. Thank you for reading.

References

Cerebras Systems: Achieving Industry Best AI Performance Through A Systems Approach

Cerebras Wafer-Scale Architecture for Deep Learning

Cerebras Inference

How Cerebras Solved the wafer-scale yield challenge

What's inside the Nvidia Vera Rubin NVL72 Rack

Ayaz — Fri, 09 Jan 2026 11:02:49 GMT

This is my attempt at understanding what the newly announced Vera Rubin NVL72 rack from Nvidia, an AI factory training and inference power-house, contains. I have pieced it together from several places, and don’t think it’s too difficult for anyone with a passing knowledge of GPU architecture to understand it. Nvidia uses the terms Vera Rubin platform to describe the entire platform.

The terminology: Vera and Rubin

Vera is the name given to the custom CPU chip Nvidia uses inside the Vera Rubin (VR) platform. It is an ARM-based CPU platform with custom Nvidia designed cores called Olympus. There are 88 custom Olympus cores in one CPU designed specifically to link up with Nvidia’s GPUs using their NVLink-C2C connect technology.

Rubin is the name of the GPU Nvidia uses inside the VR platform. It is designed to work hand-in-hand with Vera CPUs inside the platform. Together, but along with other elements, they make up the Vera Rubin platform.

NVL72 Rack

This is what the Nvidia NVL72 rack looks like in short:

Nvidia NVL72 Rack, comprising of 18 Nvidia NVL72 Compute trays.

This rack, 100% liquid cooled, has 18 trays known as Nvidia NVL72 compute trays. It is a powerhouse of compute, storage and networking, designed for running big AI factories. The rack is the easiest to understand. The picture above is self-explanatory. I will spend more time looking into what is inside the compute tray and the different elements that make all of this possible.

Vera Rubin Superchip (host)

Before I explain the compute tray, I'll talk about the Vera Rubin Superchip.

Here’s a quick look at what makes the Vera Rubin Superchip:

Nvidia Vera Rubin Superchip, made up of 2xRubin GPUs, one Vera CPU and other host resources.

A single VR superchip is also called a “host”. The LLM runs inside this. It is where all the prefill and decode happen. (I wrote about prefill and decode on the post below if you are not familiar with these important terminology.)

Ayaz on Tech & AI

Some thoughts on prefill and decode

In the aftermath of all this discussion around LPUs and Groq, two particular terms have come to light: prefill, and decode. Chamat, in the X post from All In One Podcast, did an excellent job of explaining them in laymen terms. I’ll take a measly jab at doing the same in my own way…

6 months ago · Ayaz

This is also where the primary KV cache is kept (although the VR platform offers an improvement on this). The superchip is composed of the following:

2x Rubin GPUs with access to multiple high bandwidth memory (HBM) modules.
1x Vera CPU with 88 custom Olympus cores. These are known to provide 100 PetaFLOPs of AI compute capability.
NVLink C2C to run the GPUs hand-in-hand with the CPU.
Access to fast LPDDR5 memory modules on the host.
NVMe SSDs for local storage on the host.
Fast Generation 6 of PCIe for peripheral connectivity and compatibility.
NVLink 6 to connect the two GPUs together with low latency, fast access.

This single superchip in the VR platform does most of the AI heavy lifting. Two of these superchips are then combined with several other elements to make an NVL72 compute tray, which is then fitted into an NVL72 rack.

NVL72 Compute Tray (infrastructure for host)

Here’s what the compute tray looks like:

The compute tray combines two powerful superchips together with a BlueField-4 Data Processing Unit (DPU) to provide a tray that can run heavy AI workloads in an independent environment. This tray contains:

2x Vera Rubin superchips.
1x BlueField-4 DPU
NVMe SSD enclosure (not shown in the diagram) for providing flash-based extended KV cache storage.
2x Nvidia Connect X9 high speed network end-point.
NVLink 6 Spine.

NVLink and Connect X9 (networking)

On the networking side, there are two major elements that make the tray work inter-operably across racks:

NVLink 6 and NVLink 6 spine
Connect X9

NVLink 6 is the networking/switching fabric which connects the GPU to an NVLink switch. These switches are then connected together via the NVLink Spine. All of these are very high bandwidth networks with in-built compute available for performance and scale. All in all, they allow GPUs to talk to each other inside the tray.

Connect X9 is a high-speed super networking interface which connects a single superchip beyond a single NVL72 rack. Therein lies the subtle distinction:

Inside an NVL72 rack, GPUs use NVLink 6 to communicate with other elements. This is an important part of the networking architecture because without this, Mixture of Expert models cannot effectively work — that is, routing tokens to experts that may be sitting across the rack. This may also explain why current versions of Google’s TPU cannot run MoE models because the architecture does not support this.
Across racks in a data center, Connect X9 provides the high-speed networking fabric to connect those GPUs to every other GPU available across racks. This super NIC provides a very high level of throughput per GPU.

Both NVLink and Connect X9 don’t only provide networking: they carry within themselves compute and other mechanisms for aiding fast movement of data back and forth such as security, encryption, data integrity checking, among others.

BlueField Data Processing Unit

Apart from the networking element, the other interesting part is the BlueField-4 DPU. It provides separate, dedicated compute, storage and control plane on the tray that does not depend on the host (superchip) for compute or other needs. This ensures that the superchip is dedicated to AI workloads. For the tray to do everything else, it depends on the DPU to provide resources, security, telemetry and control. A side effect of that is it keeps the infrastructure for the tray isolated and separate from the superchip, so even in the event that a particular superchip is saturated under load or compromised or even not properly working, it doesn’t affect the entire tray.

In other words, it won’t be wrong to call the DPU the operating system powering the AI factory (made up of many superchips). The AI factory is an important classification. The entire rack is composed of many GPUs inside superchips which are capable of running many different kinds of AI workloads. These GPUs need to interoperate not only with the Vera CPUs on board, but also with other GPUs scattered throughout the stack. This requires having mechanisms to securely move data and operations across trays and superchips. Doing all that requires networking, security, management, compute, storage, and trust. If those are shared with the GPUs and CPUs inside superchips, they take away from the resources that are available for AI workloads. That’s the main role DPU plays as a dedicated, physical unit on each tray. It excludes the superchips from the infrastructure tax, by providing all of this itself.

In the Vera Rubin platform, the DPU plays another important role: The Inference Context Memory Storage (ICMS). This extends the context memory available to large reasoning models by providing access to NVMe-based flash storage for not only storing the KV cache, but also mechanisms for quickly off-loading and on-loading the HBMs for low-latency, fast access to KV cache for inference during the decode phase without affecting the prefill phase.

Inference Context Memory System

KV cache is populated during the prefill phase of inference. It’s used for look-ups during the decode phase of inference, which is where the output is generated. The bigger, faster the KV cache, the quicker the response is. There are several locations where KV cache can be saved:

The HBMs closest to the GPU which provide the fastest access to cache but are limited in size.
Memory available to the superchip, which in this case is the LPDDR5 modules, among others.
Local NVMe SSD storage available to the host.
ICMS.

Now, ICMS is many things, but in particular, two that are important:

It provides an enclosure that houses several fast NVMe SSDs which provide flash-based storage for the cache.
It off-loads the compute needed for the superchip to move data from other KV cache layers to the ICMS layer. That means that the superchip doesn’t have to expend resources trying to load and unload data from any of the cache layers on the superchip, and loading it into the ICMS enclosure, and vice versa. Not to mention figuring out a secure, trustworthy way for data to move back and forth.

Why is this needed? Reasoning models require a larger amount of context for AI workloads to provide more user value. Agentic workflows, for example, require even more context memory in order to deliver useful results. Existing KV cache mechanisms are limited by storage size and latency. When larger reasoning models are run with exhaustive workloads, the KV cache becomes the bottleneck. The ICMS is Nvidia’s current solution to attempting to provided an extended KV cache for bigger models and extensive, agentic workloads.

All in all, and this is an oversimplification, but the BlueField-4 powered DPU takes care of all the overhead that’s needed to make this possible by keeping it isolated from the superchips and their workloads, while at the same time providing low-latency, fast access to storage in a secure manner. The term used to describe this is “scale-out”: scaling the AI factory out across many racks in a data center, each rack running 18 trays of two superchips each.

Conclusion

This was an oversimplification, but a way for me to better understand the architecture of the VR platform. Zooming out, one can see the huge amount of compute, storage, and networking available within a single rack, dedicated to AI alone. Putting many racks together and seamlessly connecting them together without sacrificing latency and performance pave the way for building AI factories at scale.

Thank you for reading.

Some thoughts on prefill and decode

Ayaz — Mon, 05 Jan 2026 16:39:05 GMT

In the aftermath of all this discussion around LPUs and Groq, two particular terms have come to light: prefill, and decode. Chamat, in the X post from All In One Podcast, did an excellent job of explaining them in laymen terms. I’ll take a measly jab at doing the same in my own way.

In another post I talked about how LPUs get so good at inference. In short, the on-chip memory provides low-latency, fast access to storage, and the streaming yet predictable architecture—software controlled—provides for predictable, fast execution with sufficient compute always available. Prefill and decode are concepts born of the very design principles employed to achieve LPU level efficiency.

Ayaz on Tech & AI

How do Language Processing Units (LPUs) achieve efficiency

On the heels of Nvidia’s acquisition of Groq, I have been trying to understand how Language Processing Units (LPUs) provide predictable, low-latency inference at scale. This is my attempt at explaining it simply and succinctly…

6 months ago · Ayaz

Prefill describes the process a Large Language Model (LLM) takes when it absorbs the prompt and attempts to make sense of it. This involves your usual tokenization, vectorization, lots of complex matrix multiplications, among others (I’m going to be smart and not go into the NN part of it). This is mostly compute intensive, and is why GPUs are so good at it, with their high levels of parallelizations and access to high bandwidth memory is large capacities.

The other part of the process is called decode. It’s where the LLM attempts to figure out how to write the response to the initial prompt. There is where a lot of what is known as key and value look-up happens in order to understand the relationship between tokens and to predict the closest token to a given one. The seminal paper I would recommend to understand it is Attention is All You Need. Of course, you’ll have to understand how encoders and decoders work and what are recurring neural networks, but attention is the mechanism that defines modern LLM architectures and makes Generative AI what it is today. Coming back to the decode process, as you may have guessed, it is a particularly memory intensive process (lots of look-ups, trying to make sense of the next token). And that’s one part of where GPUs don’t do well because of their dedicated high bandwidth memory modules that are outside of the GPU silicon, thereby causing latency and blockages. LPUs with their on-chip SRAMs not only make this process fast, they reduce the overall hardware cost too.

Now, Chamat believes these are terms people in the AI space, particularly the hardware and inference ones, will talk a lot about in the coming days and months. What I am more interested in is how Nvidia will capitalize on LPUs. Will there be a fusion of GPUs and LPUs onto a special purpose piece of hardware specifically for inference and training?

Only time will tell. The way things are moving, it will likely be a very short amount of time.

How do Language Processing Units (LPUs) achieve efficiency

Ayaz — Sun, 04 Jan 2026 07:05:35 GMT

The LPU design and architecture are influenced by a number of design principles. I won’t repeat those here. I’d recommend the white-paper Groq published for it.

In essence, LPUs achieve their efficiency in inference by providing the following two major benefits:

Predictable execution flow of instructions and data.
Low-latency, fast access to compute and storage.

How do they achieve those is the more complex part.

Unlike GPUs and their tooling, LPUs provide a software-first approach to inference. When a specific model is being compiled to be run on LPUs during model initialization/deployment time, all execution and data flow paths are scheduled/defined in advanced. There are no surprises, and no use of caches and buffers to bring in uncertainty in execution times and flow. It is what they call a statically scheduled program or compiler. If you run it again and again, it chooses the same flow, the same path. And because it is statically compiled, it knows beforehand all the paths that will be taken, which allows it to eliminate resource contention (which is a big problem in GPUs), even when the memory/storage is largely shared. If nothing is blocking or waiting on another process to release access to storage or compute, everything can execute fast. What’s more impressive is that the scheduled paths compiled into the static program are all chosen via software. It does not require synchronization at the hardware level at all. The compiled program takes care of it all. Hardware synchronization leads to indeterministic execution, which introduces delays in execution flows.

The second approach LPUs use is their conveyor belt and Single Input Multiple Data (SIMD) function unit architecture. This architecture is influenced by Groq’s initial work on Tensor Streaming Processor, which was designed specifically for tensor or matrix multiplications (which are the core operations LLMs perform) — in contrast, GPUs are designed for general purpose operations, particularly favoring graphics related manipulations. Each conveyor belt carries instructions and data. SIMD function units receive instructions telling them which belt to pick input data from, which operations to perform, and where to put the output data. All of this flows in the form of streams, that don’t overlap and don’t block other streams. Given that all of this is on the same silicon (what is known as on-chip) along with ample compute and storage, and given that the execution path is schedule beforehand (statically compiled) resulting in no resource contention or blocking or indeterministic delays, you can begin to understand how efficient this mode of inference becomes.

Finally, the use of on-chip SRAM for storage, which can see transfer speeds of up to 80 TB/s (compared to the ~8 TB/s that high memory bandwidth modules on GPUs can provide), provide the high speed bandwidth needed to move data around SIMD function units. However, you have to keep in mind that unlike High Bandwidth Memory, which is a dedicated memory unit off-chip on GPUs, SRAM can store limited data. That is where the LPU architecture goes one step ahead. Not only does it perform efficiently on-chip, it can combine multiple chips together and use the same conveyor belt architecture to transfer instructions and data efficiently across multiple chips without sacrificing anything.

The result: predictable, low-latency inference at scale.

How I learned to write and speak well.

Ayaz — Fri, 02 Jan 2026 17:19:20 GMT

The following post by Craig Perry on how to become dangerously articulate reminded me of a similar transformation I went through a very long time ago.

Profound Ideas

How to become dangerously articulate

7 months ago · 6269 likes · 137 comments · Craig Perry

When I was in high-school, specially during my O’Levels years, my English sucked. My written English wasn’t great, even though I had studied in English medium schools all my life. My spoken English was worse. I was terrified of speaking in English in front of anyone. Looking back, I can attribute it in part to not having friends/peers who prioritized communicating in English (as teens do now) resulting in zero peer pressure to improve myself. In addition to that, despite having doctors as parents, English was hardly if ever spoken at home and therefore no specific emphasis was put on making an effort to improve it outside of school. That led to my getting a C grade in English in O’Levels, which was disappointing. And a source of great shame to me.

I distinctly remember an event where I called a phone number for a then famous A’Levels college in the city to enquire the process of admission. Little did I know that an old, stuck-up principal of the college would pick it up. As I began talking in my native tongue (not English) as is common, the old man started berating me harshly on the call in English for not being able to use English to talk to him. I was so scared I put the phone down. In hindsight, I don’t feel ashamed of using my native tongue to talk on the phone, and feel sad for the man for lashing on an innocent, potential student for not using English to communicate over an inquiry call. But at that time, it left a scalding mark on me, one I didn’t tell anyone about for years to come.

However, all these experiences—getting the bad grade in English exams and not being able to speak fluently and clearly in English and that traumatizing phone experience—led me to decide to fix my circumstances myself. Thus began an adventure into self-improvement that took me to crowded ends of the city into an old book thrift market from where I bought books on grammar and vocabulary and consumed them like my life depended on them. I began reading a lot, and using a pocket dictionary to understand words I didn’t know. Not only that, I began recording those words down in a thick journal, along with a few sentences explaining their use. By the time I stopped doing it, I had two journals filled with word definitions. Not only did I use to routinely review my journal, from start to end, so that I could retain the words I had learnt, I began to use those words every chance I got at writing anything down. This infamously led to essays and letters I wrote during my college and University years that were extremely difficult both for my peers and my teachers to understand because of the vocabulary I chose to use and felt proud of. The time I spent learning, or re-learning, grammar helped improved my sentence structures, and before I knew it, was writing freely for myself without considering whether the person reading it would understand most of it without a dictionary in tow. Of course, it took me a while to realize that the best piece of writing is not one with a lot of hard vocabulary in it, but one that can be easily understood by many.

All of that, of course, didn’t help with my poor speaking skills. Back then, I did not have anyone I could practice with—although that could likely also have been because I was very shy. Instead, I singlehandedly came up with a technique that helped me immensely: talking out loud to myself when alone in English. If I was out for a walk, which I did frequently on the roof of my house at nights, I would express my thoughts in English to myself out loud. Even when I couldn’t say it out loud, I forced myself to think in English. It got to the point where English came naturally to me, even inside my head. This trick alone did more for improving my speaking skills than anything else. I went from being scared of standing in front of a class to speak to intentionally not preparing before speaking in University only so I could force myself to go down and speak extempore every time. That, for me, was a massive personal transformation. I lost all of my written pieces when my ThinkPad laptop crashed and the site (before wordpress and blogpost) where I published them was lost to oblivion, but what I was writing back them did not read like something written by a non-native English speaker who got a C in English in school. It wasn’t privilege that got me there. Nobody helped me. Nobody told me I had to change. I got a gut punch so bad and so painful, I had no choice but to push myself to undertake this painful transition.

When I am teaching students during my various academies, this is the one lesson I always try to instill in them, specially since there are students in my classes that come from different backgrounds and often doubt themselves. You don’t have to be born into privilege, or have advantages other people have because of their families or what they are born into. All you need is determination to decide you want to make that transformation, and the discipline to see through it. Everything else is secondary.

And as it goes, knowing how to articulate yourself well, how to speak and write well, it takes you a long way, ahead of others who cannot. I have experienced this first hand.

Thoughts on improving inference time compute

Ayaz — Thu, 25 Dec 2025 11:19:59 GMT

A few weeks a friend I used to work with and I were talking about how there is a pressing need for inference time compute optimizations to bring down soaring energy demands. Keep in mind that inference-time (test-time) compute is the lever everyone in the AI industry is looking at, more than training time compute, and for good reason. Inference, as they say, is god. Going back to our discussion, the idea of having a specialized uni-kernel kind of operating system for LLMs came about.

My opinion was that since Nvidia (and others) have spent so much time making the tooling (CUDA, et al) around GPUs and LLMs smooth and efficient (and also importantly cross platform), having an optimized OS would likely not work that well. Why would Nvidia not just do it themselves if they felt it would make a sizable difference in performance?

That led to a discussion of a new startup called Unconventional AI, which aims to tackle the same problem space both from the software and hardware perspective. This led me to thinking about Google’s TPU chips which are gaining a lot of coverage and for all the right reasons. GPUs have traditionally been built for generic purposes, first graphics and gaming, then crypto, and now LLMs. TPUs on the other hand are hand optimized for LLM usage by Google, and therefore are the reason for the difference in price to performance they offer over comparable Nvidia GPUs.

And now we have news of Nvidia acquiring Grok for their LPUs. Google is continuing to keep their TPUs proprietary for the most part because they don’t want competition getting access to it — although it is only a matter of time before somebody else figures out how it’s done. The idea of Nvidia going about Grok is partly what I had in my mind few weeks ago.

Inference time compute can likely be optimized in the following ways:

Advancements in model architectures, like the introduction of Model of Experts by DeepSek, better linking between GPU threads for improved data sharing, model router paradigm, hybrid models, among others.
Improvements in model inference operability, such as introducing the use of token caching, etc.
Continuous improvement is GPU architecture in the form of better GPU models
Improvements in how CPUs are utilized for serving inference for small models instead of GPUs, just Ampere Computing which got acquired by Softbank.
Continued rare advancement in chip design, leading to architectures/chips like Google TPUs and Grok LPUs.

In future posts I may delve into each one those. For now, my read of Nvidia’s acquisition of Grok is that they’ll absorb their engineering prowess and learnings into continuing to improve and advance their existing GPUs.

It’s an exciting time to be alive.

Driving to the future while looking at the rear-view mirror

Ayaz — Tue, 23 Dec 2025 15:48:43 GMT

This article by Ivan Zhao, CEO at Notion, talks about ideas I have read other leaders in AI space mention. Only briefly, without expanding on them. Most likely because it’s too early to define those ideas in concrete ways for people to begin to accept them.

Today, the most useful way of creating value for user using AI is by incorporating AI inside or as part of their existing workflows. The most common way is to introduce a ChatBot somewhere in the workflow, but it could just as well be a hidden layer within an existing workflow (which is an even better, less intrusive way of leveraging AI). When you consider what most users do in their workflows is repetitive and boring, it makes sense to automate it away with AI which can do it faster, better (and notoriously, without complaining and having to sleep), and in surprisingly interesting ways.

But what is the endgame? Nobody knows, with certainty. And that’s why smart people have ideas about what it could potentially be like without a clear way of defining them. For example, in the workflow case, perhaps the ultimate value will instead be in identifying how these workflows can be replaced entirely by things that AIs can do themselves autonomously, with little to no human intervention — even things that AI can build out of the blue for the time being. That’s just one example. There can be many more. As Ivan points out, the old constraints of how we work and operate have to change for the new world. How we talk to AI, how agents talk to us, what we call work, how we approach it, how do we communicate within a team built of humans and, increasingly, AI, all of that will require new constraints that have yet to be defined. All we have are ideas.

LLMs and flat feet

Ayaz — Mon, 15 Dec 2025 16:18:26 GMT

I rode a road bike for four years straight. Hundreds of miles if not thousands. I enjoyed every bit. It afforded me the luxury of dealing with my inner demons by being out without a worry in the world. Eckhart Toole was on to something in his seminal work, The Power of Now. I witnessed it first hand while cycling, mostly alone but also in groups.

The only negative part was that I came out of with a bad back. Years of cycling in a hunched position required work on my core muscles which I eschewed. Hence, weak lower and mid back muscles. To this day, I have to go into regular physiotherapy to keep my back from getting stiff or having a spasm.

Years ago when I found the right physiotherapist, I was told for the first time in my life that I have flat feet. I was told to wear special insoles with arch support. And so I began taking out stock insoles from my shoes and using a set of special purpose insoles with arch support.

A month back I was traveling with two colleagues for a conference. At an Airbnb, one of them, on the topic of insoles and flat feet, said my feet had an arch and weren’t flat. I didn’t think too much of that then. But the thought stayed.

And so I decided to seek second and third opinions from different LLMs. I took a picture of my left foot without putting weight on it and asked ChatGPT whether it looked flat. Here’s what it told me:

Sure enough, it made sense. When I provided it with a foot of my foot while bearing my weight it identified it as flat.

I didn’t know that much about my feet. The physiotherapist did mention I had flat feet but didn’t explain that in weight bearing position the small arch on my foot disappears.

Then, I asked Gemini, giving it the same pictures. It agreed with ChatGPT. It also gave a name of my condition: flexible flat feet. Now that made more sense.

I ran an experiment where I removed my special insoles and went back to using stock insoles in my New Balance shoes. The soft stock insoles made the shoes more comfortable. But after a week, I could feel pain in my knees and ankles while climbing stairs. Then I went to play Padel while wearing the stock insoles. And it hurt. ChatGPT and Gemini explained above very well why.

Maybe this explains why I have always sucked at running all my life. I thought it had something to do with my changing weight over the years.

You learn something new everyday!