Some thoughts on prefill and decode
Why people in the inference space will be talking about them more.
In the aftermath of all this discussion around LPUs and Groq, two terms in particular have come to light: prefill and decode. Chamath, in the X post from the All-In Podcast, did an excellent job of explaining them in layman’s terms. I’ll take a measly jab at doing the same in my own way.
In another post I talked about how LPUs get so good at inference. In short, the on-chip memory provides low-latency, fast access to the model’s data, and the streaming, software-controlled architecture makes execution predictable and fast, with sufficient compute always available. Prefill and decode are concepts born of the very design principles employed to achieve LPU-level efficiency.
Prefill describes the phase in which a Large Language Model (LLM) absorbs the prompt and attempts to make sense of it. This involves the usual tokenization, vectorization, and lots of large matrix multiplications, among other things (I’m going to be smart and not go into the neural-network internals). It is mostly compute intensive, which is why GPUs are so good at it, with their high degree of parallelism and access to high-bandwidth memory in large capacities.
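To make that concrete, here’s a minimal sketch of what prefill roughly looks like, in plain NumPy with a single attention head. The dimensions, random weights, and single-head setup are all illustrative assumptions, not any real model’s configuration; the point is that the whole prompt goes through the big matrix multiplications in one parallel pass, and the resulting keys and values get cached for later.

```python
import numpy as np

# Toy single-head attention prefill. d_model, prompt_len, and the random
# weights are made up for illustration; a real model has many layers and heads.
d_model, prompt_len = 64, 128
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embeddings):
    """Process the entire prompt in one shot: big, parallel matmuls."""
    Q = prompt_embeddings @ Wq                    # (prompt_len, d_model)
    K = prompt_embeddings @ Wk
    V = prompt_embeddings @ Wv
    scores = Q @ K.T / np.sqrt(d_model)           # (prompt_len, prompt_len): the compute-heavy part
    mask = np.triu(np.full_like(scores, -np.inf), k=1)   # causal mask: each token attends only backwards
    out = softmax(scores + mask) @ V
    return out, (K, V)                            # keys and values are cached for decode

# Usage: a random stand-in for the embedded prompt tokens.
prompt = rng.standard_normal((prompt_len, d_model))
_, kv_cache = prefill(prompt)
```

Note that the scores matrix is prompt_len by prompt_len, so doubling the prompt quadruples the attention math; that is exactly the kind of dense, parallel work GPUs chew through well.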
The other part of the process is called decode. This is where the LLM actually figures out how to write the response to the initial prompt, one token at a time. A lot of what is known as key and value look-up happens here, in order to capture the relationships between tokens and predict the most likely next token. The seminal paper I would recommend for understanding it is Attention Is All You Need. Of course, you’ll have to understand how encoders and decoders work and what recurrent neural networks are, but attention is the mechanism that defines modern LLM architectures and makes generative AI what it is today. Coming back to decode: as you may have guessed, it is a particularly memory-intensive process (lots of look-ups to make sense of the next token). And this is one place where GPUs don’t do as well, because their high-bandwidth memory modules sit outside the GPU silicon, adding latency and causing stalls. LPUs, with their on-chip SRAM, not only make this process fast, they reduce the overall hardware cost too.
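Continuing the toy sketch from the prefill snippet above (it reuses Wq, Wk, Wv, softmax, and d_model defined there, so the same assumptions apply), here is roughly what one decode step looks like: a single token’s worth of arithmetic, plus a read over every cached key and value.

```python
def decode_step(token_embedding, kv_cache):
    """Generate one token: a sliver of compute, a lot of cache traffic.

    Reuses Wq, Wk, Wv, softmax, and d_model from the prefill sketch above.
    """
    K_cache, V_cache = kv_cache
    q = token_embedding @ Wq                               # (1, d_model): a tiny matmul
    K_cache = np.vstack([K_cache, token_embedding @ Wk])   # cache grows one row per generated token
    V_cache = np.vstack([V_cache, token_embedding @ Wv])
    scores = q @ K_cache.T / np.sqrt(d_model)              # reads every cached key: memory bound
    out = softmax(scores) @ V_cache                        # reads every cached value
    return out, (K_cache, V_cache)

# Usage: feed the latest token's embedding back in, step after step.
next_token = rng.standard_normal((1, d_model))             # stand-in for the last generated token
out, kv_cache = decode_step(next_token, kv_cache)
```

Each step does very little arithmetic but has to stream the whole, and growing, key-value cache through memory, which is why keeping that data in on-chip SRAM pays off so directly for this phase.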
Now, Chamath believes these are terms people in the AI space, particularly on the hardware and inference side, will talk a lot about in the coming days and months. What I am more interested in is how Nvidia will capitalize on LPUs. Will there be a fusion of GPUs and LPUs into a special-purpose piece of hardware built specifically for inference and training?
Only time will tell. The way things are moving, it will likely be a very short amount of time.


