Some thoughts on SubQ's Subquadratic LLM
What is subquadratic sparse attention, and how does it compare with DeepSeek's V4 improvements?
What makes the Transformer architecture so good at generative AI is also what makes it progressively worse as the context length grows. For each token that is generated, the attention mechanism looks back at all previously generated tokens and computes attention scores against them. As the context length grows, this becomes computationally expensive and puts a huge strain on the already memory-constrained KV cache. Since the seminal "Attention Is All You Need" paper came out, there have been various improvements over standard self-attention aimed at reducing its compute and memory complexity as well as its wall-clock cost. Among them are sparse attention, low-rank/linear attention, grouped-query attention, and various combinations of sliding-window attention.
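To make the quadratic cost concrete, here is a minimal single-head scaled dot-product attention in NumPy. This is just the textbook mechanism, nothing specific to SubQ or DeepSeek: the point is the (n x n) score matrix, which grows with the square of the context length.

```python
import numpy as np

def self_attention(Q, K, V):
    """Plain single-head causal attention; Q, K, V are (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): every token scored against every token
    mask = np.tril(np.ones((n, n), dtype=bool))     # causal: only look at earlier positions
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the whole prefix
    return weights @ V                              # (n, d)

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)  # (4096, 64), but the intermediate score matrix was 4096 x 4096
```

At 4,096 tokens that intermediate matrix already has roughly 16.8 million entries per head, per layer; at 1M tokens it would have a trillion, which is why long-context attention needs a different approach.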
Subquadratic, or SubQ for short, takes a different approach from the attention mechanisms traditionally found in Transformer models. It uses a subquadratic selective attention mechanism to reduce attention complexity, time, and resource usage, which makes huge context windows feasible. SubQ claims that their model can serve a 12M-token context window at significantly reduced compute, and therefore lower energy spend and cost. These claims haven't been vetted by the community, because the work is behind closed doors so far and access to the model (and the related harness) is restricted. While there is no published whitepaper, they claim that their subquadratic selective (sparse) attention (SSA) outperforms other commonly used mechanisms.
The core concept behind SSA is that instead of computing attention against all previous tokens, an intelligent router uses content-dependent routing to identify which key positions in the sequence are meaningful, discards the rest, and runs full attention only over that selected set. Because irrelevant positions are skipped, the cost stays largely linear even as the context length grows beyond 1M tokens. Compared with other mechanisms, it starts to shine as the context length explodes, and at 12M tokens the performance and cost gains are, they claim, huge.
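Since the actual SSA router is not public, the following is only a rough sketch of what content-dependent selection might look like: each query scores coarse block summaries of the keys, only the top-scoring blocks are kept, and full attention runs inside that reduced set. The mean-key routing signal, the block size, and the top_blocks parameter are all assumptions for illustration, not SubQ's method.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=128, top_blocks=4):
    """Per-query routing over coarse key blocks, then full attention inside
    the selected blocks only. Causal masking is omitted for brevity."""
    n, d = Q.shape
    n_blocks = n // block
    # Cheap routing signal: the mean key of each block stands in for its content.
    K_blocks = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)  # (n_blocks, d)
    route_scores = Q @ K_blocks.T                      # (n, n_blocks), far smaller than (n, n)
    chosen = np.argpartition(-route_scores, top_blocks, axis=-1)[:, :top_blocks]

    out = np.empty_like(Q)
    for i in range(n):
        idx = (chosen[i][:, None] * block + np.arange(block)).ravel()  # positions in the kept blocks
        s = Q[i] @ K[idx].T / np.sqrt(d)               # scores over ~top_blocks*block keys, not n
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(block_sparse_attention(Q, K, V).shape)  # (4096, 64)
```

The expensive softmax step now touches only top_blocks * block keys per query instead of all n, which is where the claimed savings at very long context would come from; how the real router scores and selects positions is exactly the part SubQ has not disclosed.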
Since SubQ have not published a technical whitepaper explaining how content-dependent attention is calculated, and since their claims have not been tested by the community at large, there is only speculation to go by. It is not clear exactly how the attention router decides which tokens are meaningful, but it appears to be something the model learns during parts of the three-stage training pipeline they used (particularly the reinforcement learning stage), as with other loosely related approaches.
An interesting comparison is the recent launch of DeepSeek's V4 Pro models, which use Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) mechanisms, in addition to their default DeepSeek Sparse Attention, to improve the performance and utilization of the KV cache and serve a 1M-token context window. DeepSeek's improvements also sparsely select which tokens to run attention over, while compressing and optimizing how the KV cache is used throughout the process, reducing memory consumption, compute, and ultimately cost. However, SubQ claims that this sparse attention/router mechanism is still complex, and that beyond a 1M-token context window it starts to become painful.
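Without the V4 whitepaper details at hand, the simplest way to see why KV-cache compression matters is back-of-envelope memory arithmetic. The model shape below is hypothetical (not DeepSeek V4's actual configuration), as is the latent_dim of the compressed cache; the numbers only illustrate how the cache grows linearly with sequence length and how much a per-token compressed latent could shave off.

```python
def kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Full cache: keys and values (hence the 2x) per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def compressed_cache_bytes(seq_len, n_layers=60, latent_dim=512, bytes_per_elem=2):
    # Hypothetical compressed cache: one low-dimensional latent per token per layer,
    # from which keys and values would be re-derived at attention time.
    return n_layers * latent_dim * seq_len * bytes_per_elem

for tokens in (128_000, 1_000_000, 12_000_000):
    full = kv_cache_bytes(tokens) / 1e9
    small = compressed_cache_bytes(tokens) / 1e9
    print(f"{tokens:>12,} tokens: ~{full:,.0f} GB full KV cache vs ~{small:,.0f} GB compressed")
```

With these made-up numbers, an uncompressed cache at 1M tokens already runs to hundreds of gigabytes per sequence, which is the pressure that both DeepSeek's compression work and SubQ's sparse selection are trying to relieve in their own ways.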
It will be interesting to read SubQ's technical whitepaper whenever it is released. Meanwhile, it's definitely worth reading the DeepSeek V4 whitepaper to understand the engineering improvements they made to existing attention mechanisms to speed up the model and reduce costs.

