Some thoughts on DeepSeek’s DualPath improvement
Exploring bottlenecks in agentic inference.
I finally sat down to read the DualPath paper published by DeepSeek recently, and it clarified something I’d been thinking about: as LLMs become more agentic, inference stops being mostly about GPU compute and starts being about moving state around. Running long, multi-step, multi-turn workflows reuses a large amount of prior context (the KV-cache). So the limiting work shifts from doing lots of new GPU math to repeatedly loading, transferring, and saving that reused state fast enough to keep GPUs fed.
That state is the KV-cache, which is the model’s saved “working memory” for attention. In multi-turn agent runs, most of the context is reused each turn. You append a little, but you carry a lot. So the KV-cache hit rate is high, and the dominant work becomes: restore the cache fast, compute a small delta, continue, then persist it again.
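A quick back-of-envelope makes the "append a little, carry a lot" point concrete. The numbers below are my own illustrative assumptions, not figures from the paper:

```python
# Hypothetical numbers: why multi-turn agent workloads become dominated by
# restoring reused context rather than computing new tokens.
context_tokens = 50_000   # tokens carried over from earlier turns (assumed)
new_tokens = 500          # tokens appended this turn (assumed)

# Fraction of the prefix whose KV-cache can be restored instead of recomputed.
reuse_fraction = context_tokens / (context_tokens + new_tokens)
print(f"cache hit rate: {reuse_fraction:.1%}")
```

At these (made-up but plausible) ratios, ~99% of each turn's attention state is reload, not fresh compute.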
In other words: agentic inference becomes I/O-bound.
The paper’s thesis is specific. In the common disaggregated setup where prefill engines handle prompt processing and decode engines handle token-by-token generation, the prefill engines are the ones pulling large KV-cache blobs from storage tiers (HBM, DRAM, NVMe). Their storage NICs saturate, while the decode engines’ storage bandwidth often sits idle during this time. So the cluster has capacity, but it’s trapped behind an asymmetric pipeline: one side is overloaded, the other is underused, and GPUs wait.
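To get a feel for the magnitudes involved, here is a back-of-envelope sketch. Every number in it is an assumption of mine (KV-cache footprint per token, NIC speed), not a figure reported by DeepSeek:

```python
# Back-of-envelope: why one prefill engine's storage NIC becomes the choke
# point when restoring a long-context KV-cache. All numbers are assumptions.
kv_bytes_per_token = 160 * 1024        # assumed KV-cache footprint per token
context_tokens = 50_000                # assumed multi-turn context length

blob_gb = kv_bytes_per_token * context_tokens / 1e9   # size of one restore
nic_gb_per_s = 400 / 8                 # one 400 Gb/s storage NIC -> 50 GB/s
restore_s = blob_gb / nic_gb_per_s

print(f"{blob_gb:.1f} GB per restore, {restore_s * 1000:.0f} ms on one NIC")
```

Roughly 8 GB per restore and ~160 ms on a single link; queue a handful of concurrent requests behind that NIC and the GPUs on the prefill side start waiting, while the decode side's identical links carry nothing.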
DualPath’s move is simple: add a second path for loading the KV-cache.
Instead of only doing storage → prefill, it also allows storage → decode → prefill. Decode engines can read KV-cache from storage using their otherwise-idle storage bandwidth, then transfer it to prefill engines over RDMA (Remote Direct Memory Access) on the compute network. RDMA is a networking method that lets one machine read/write another machine’s memory with very low CPU involvement and low latency—so it’s well-suited for high-throughput, predictable data movement inside a cluster.
The system can dynamically choose which path to use per request, so you’re effectively pooling storage bandwidth across the whole cluster instead of bottlenecking on the prefill side.
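The per-request choice can be sketched as a small routing decision. This is my own illustration of the idea, not DualPath’s actual scheduler; the engine names, fields, and the queueing-delay heuristic are all hypothetical:

```python
from dataclasses import dataclass

# Hypothetical sketch: pick the KV-cache load path per request based on
# which engine's storage NIC would finish the transfer sooner.

@dataclass
class Engine:
    name: str
    nic_inflight_gb: float    # GB already queued on this engine's storage NIC
    nic_capacity_gbps: float  # storage NIC line rate in Gb/s

def choose_path(prefill: Engine, decode: Engine, blob_gb: float) -> list[str]:
    """Return the hops for loading one KV-cache blob."""
    # Crude queueing estimate: (queued + this blob) / link bandwidth.
    direct_s = (prefill.nic_inflight_gb + blob_gb) / (prefill.nic_capacity_gbps / 8)
    relay_s = (decode.nic_inflight_gb + blob_gb) / (decode.nic_capacity_gbps / 8)
    if relay_s < direct_s:
        # Decode engine reads from storage, then forwards over RDMA.
        return ["storage", decode.name, "prefill"]
    return ["storage", "prefill"]

busy_prefill = Engine("prefill-0", nic_inflight_gb=40.0, nic_capacity_gbps=400)
idle_decode = Engine("decode-3", nic_inflight_gb=0.0, nic_capacity_gbps=400)
print(choose_path(busy_prefill, idle_decode, blob_gb=8.0))
```

When the prefill NIC is backed up, the relay path wins and the load is routed storage → decode-3 → prefill; when both are quiet, the direct path is cheaper because it skips a hop.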
My initial concern was interference: doesn’t more traffic on the compute network mess with latency-sensitive model communication? Per the paper, the design relies on isolating KV-cache traffic so it behaves like it’s happening in the background rather than competing with model-execution collectives.
They also propose adding a “traffic controller” for requests, because having two routes only helps if you spread the work across both. If you accidentally send most requests down one route, that route gets clogged and you’re back to the same problem—just in a different place.
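The balancing concern can be shown in miniature. The toy "controller" below is my illustration of the failure mode the authors are guarding against (the route names and greedy policy are hypothetical, not from the paper): assign each load to whichever route currently has the least queued work, so neither route accumulates the whole backlog:

```python
# Toy sketch of the traffic-controller idea: greedy least-loaded dispatch
# across the two KV-cache load routes. Route names are hypothetical.
routes = {"direct": 0.0, "via_decode": 0.0}  # GB currently queued per route

def dispatch(blob_gb: float) -> str:
    route = min(routes, key=routes.get)      # least-loaded route wins
    routes[route] += blob_gb
    return route

assignments = [dispatch(8.0) for _ in range(6)]
print(assignments)   # alternates between routes
print(routes)        # backlog ends up split evenly, 24 GB each
```

Replace `min` with "always pick `direct`" and one route holds 48 GB while the other holds zero, which is exactly the original prefill-side bottleneck relocated.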
The reported outcome is roughly “about 2×” throughput in their environment: up to ~1.87× for offline inference and ~1.96× for online serving on their workloads, without violating reliability and latency guarantees. I’m cautious about taking those numbers as universal, but the mechanism is the kind that often generalizes: they’re not shaving tiny overheads, they’re removing a structural imbalance.
My take: DualPath is a good example of where the frontier is moving. As agentic workflows get longer and more incremental, the limiting factor becomes “can you restore context fast enough to keep GPUs busy?” not “how many FLOPs do you have?” The bottleneck becomes bandwidth, queueing, and traffic interference — classic systems problems.
In other words, KV-cache loading doesn’t have to be prefill-centric. DualPath turns it into a pooled resource, and that feels like an inevitable step if agentic inference is going to scale.