DeepSeek V4: Don't Look at What You Don't Need

DeepSeek V4 is a 1.6 trillion parameter Mixture-of-Experts model with a 1 million token context window, trained on 33 trillion tokens, with weights on Hugging Face under an MIT license. Per the paper, it runs a 1M-token forward pass on 27% of the FLOPs and 10% of the KV cache that the previous V3.2 needed.

The interesting bit isn’t the parameter count. It’s the trade DeepSeek made to stop paying full attention cost on a million tokens. The whole architecture is built around the idea that the model shouldn’t look at things it doesn’t need to look at.

Why a million tokens is hard

Standard attention compares every new token to every previous token. At 1,000 tokens that’s a million comparisons per layer, which is fine. At a million tokens it’s a trillion comparisons per layer, which is not.

There’s a second cost that’s often worse than the FLOPs. The model stores a key-value cache of every past token so it doesn’t have to recompute them. At a million tokens that cache is gigabytes of GPU memory per request. The bottleneck is usually how much KV you can fit, not how fast the GPU is.
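To get a feel for the scale, here is a back-of-the-envelope KV-cache calculation. Every dimension below (layer count, KV heads, head size) is an illustrative placeholder, not V4's actual configuration:

  # Rough KV-cache size for a dense-attention transformer at 1M tokens.
  # All model dimensions here are illustrative placeholders, not V4's real config.
  n_layers   = 60
  n_kv_heads = 8
  head_dim   = 128
  bytes_per  = 2            # bf16
  tokens     = 1_000_000

  # Each token stores one key and one value vector per layer.
  kv_bytes = tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_per
  print(f"{kv_bytes / 2**30:.0f} GiB per request")   # ~229 GiB

Numbers in that range, per request, are why KV memory rather than arithmetic usually caps how many long-context requests a GPU can serve.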

If you want a million-token context, you have to give up something. Either accuracy on long-range retrieval, or the assumption that all tokens get equal attention. DeepSeek gave up the second one.

Compress, then ignore most of it

V4 has three attention pathways running in parallel, interleaved layer by layer. The first is Compressed Sparse Attention (CSA): chunk the past into groups of m tokens, compress each group into one entry with a learned compression weight, then attend to only the top-k most relevant compressed entries. CSA reuses the DeepSeek Sparse Attention selection mechanism that shipped in V3.2.

tokens:        [t1 t2 t3 t4][t5 t6 t7 t8][t9 t10 t11 t12]...
compressed:    [    c1     ][    c2     ][      c3      ]...
top-k subset:  [    c1     ]             [      c3      ]

The selection step is what DeepSeek calls the Lightning Indexer. It scores all the compressed blocks cheaply, picks the few that matter, and drops the rest. The model never runs full attention over the whole compressed stream, only over the selected subset plus a small sliding window for local context.

That’s already a sequence-length reduction by a factor of m, then a further reduction to top-k. The compute drops by orders of magnitude.
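Here is a minimal sketch of the compress-then-select idea in PyTorch. The mean-pooling, the dot-product scoring, and the m and top_k values all stand in for the paper's learned compression weights and Lightning Indexer; they are assumptions for illustration, not V4's actual operators:

  import torch

  def csa_sketch(q, keys, values, m=4, top_k=8):
      # q: (d,) query for the current token; keys/values: (n, d) past tokens.
      # m and top_k are illustrative, not V4's real settings.
      n, d = keys.shape
      n_blocks = n // m
      # 1. Compress each block of m tokens into one entry (mean-pooling stands
      #    in for the learned compression weights).
      ck = keys[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)
      cv = values[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)
      # 2. Lightning-Indexer stand-in: one cheap relevance score per block.
      scores = ck @ q
      idx = scores.topk(min(top_k, n_blocks)).indices
      # 3. Full attention only over the selected compressed entries.
      attn = torch.softmax(ck[idx] @ q / d ** 0.5, dim=0)
      return attn @ cv[idx]

  q = torch.randn(64)
  out = csa_sketch(q, torch.randn(4096, 64), torch.randn(4096, 64))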

Heavily compressed for the gist

Compression always loses information. CSA mitigates this by being selective: when a detail matters, the indexer pulls the relevant compressed block back into attention. But sometimes you want a low-resolution view of the whole document, not a search.

That’s the second pathway, Heavily Compressed Attention (HCA). Same compression idea, much bigger groups, something like m' = 128 instead of m = 4. After that compression the sequence is short enough that you can run dense attention over all of it cheaply. You lose detail but you keep the global shape.
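To see why dense attention becomes affordable again after heavy compression, a quick back-of-the-envelope check (m' = 128 is the figure quoted above; everything else follows from it):

  context    = 1_000_000
  m_prime    = 128                  # heavy compression group size
  compressed = context // m_prime   # 7,812 entries
  # Dense attention cost grows with the square of the sequence length.
  full_cost = context ** 2          # ~1e12 comparisons per layer
  hca_cost  = compressed ** 2       # ~6.1e7 comparisons per layer
  print(f"{full_cost / hca_cost:,.0f}x cheaper")   # ~16,000x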

Interleaving these two through the network means each layer can either zoom in on a few precise blocks (CSA) or take in everything at low resolution (HCA), depending on what the layer is for.

The last few pages, uncompressed

The third pathway is the obvious one. A sliding window over the most recent ~128 tokens at full fidelity, no compression. Whatever the model just saw, it can still see exactly.

Three views of the same document:

  • Sliding window. Last few pages, word for word.
  • CSA. Moderately compressed past, retrieved selectively when needed.
  • HCA. Heavily compressed past, attended to as a whole.

This is roughly how a person reads a long book. You remember the last few paragraphs in detail, you have a rough mental summary of the first half, and when you need a specific quote you flip back and find it. DeepSeek made that into a layered architecture and trained the model to use all three pathways at once.

Stopping a 1.6T model from blowing up

The other half of the paper is about training stability at this scale. Neural network signals can amplify from layer to layer, and at 1.6T parameters the standard fixes (residual connections, plain hyper-connections) aren't enough on their own. The training run can spike and diverge.

V4 introduces Manifold-Constrained Hyper-Connections (mHC). The residual mapping matrix at each layer is constrained to be doubly stochastic, which means every row sums to 1 and every column sums to 1. Geometrically, the matrix is forced to live on the Birkhoff polytope. Practically, it bounds the spectral norm at 1 (the spectral norm of a nonnegative matrix is at most the geometric mean of its largest row sum and largest column sum, both exactly 1 here), so the transformation never amplifies its input. The signal can't grow because the math doesn't let it.

The constraint is enforced before each layer using the Sinkhorn-Knopp algorithm, around 20 alternating row/column normalisation steps. That sounds expensive at 1.6T parameters, but with fused GPU kernels they got the overhead down to 6.7% of training runtime. A small premium for a training run that doesn’t crash.
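For concreteness, here is what those alternating normalisation steps look like in plain PyTorch. This is a generic Sinkhorn-Knopp sketch, not DeepSeek's fused kernel, and the exponential parameterisation of the matrix is an assumption:

  import torch

  def sinkhorn_knopp(logits, n_iters=20):
      # Push a square matrix of scores toward the doubly stochastic set by
      # alternating row and column normalisation (~20 steps, as in the paper).
      m = torch.exp(logits)                      # strictly positive entries
      for _ in range(n_iters):
          m = m / m.sum(dim=1, keepdim=True)     # rows sum to 1
          m = m / m.sum(dim=0, keepdim=True)     # columns sum to 1
      return m

  m = sinkhorn_knopp(torch.randn(4, 4))
  print(m.sum(dim=1), m.sum(dim=0))              # both approximately all-ones
  print(torch.linalg.matrix_norm(m, ord=2))      # spectral norm <= 1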

There’s a separate trick called anticipatory routing for routing stability in the MoE layers. It uses slightly stale parameter snapshots to make routing decisions, so the routing doesn’t react to the noisy step-by-step fluctuations. Combined with mHC, the training run self-stabilises instead of needing to be rolled back.
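The exact mechanism isn't spelled out here, but one plausible reading of "slightly stale parameter snapshots" is a slowly updated copy of the router's gate weights, for instance an exponential moving average. The sketch below shows that reading; treat the EMA formulation as an assumption, not the paper's confirmed scheme:

  import torch

  class AnticipatoryRouterSketch(torch.nn.Module):
      # Hypothetical: route tokens with a slow-moving copy of the gate
      # weights, so routing does not chase step-to-step parameter noise.
      def __init__(self, d_model, n_experts, decay=0.99):
          super().__init__()
          self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
          self.register_buffer("stale_w", self.gate.weight.detach().clone())
          self.decay = decay

      def forward(self, x, top_k=2):
          logits = x @ self.stale_w.t()          # decide with stale weights
          return logits.topk(top_k, dim=-1).indices

      @torch.no_grad()
      def refresh(self):
          # Called periodically so the snapshot drifts toward the live gate.
          self.stale_w.mul_(self.decay).add_(self.gate.weight, alpha=1 - self.decay)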

They also dropped AdamW for Muon, an optimiser that orthogonalises gradient updates via Newton-Schulz iteration. Aggressive updates first, then fine-grained ones. Faster convergence, more stability.
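Muon itself is public, so the core step can be sketched directly: replace each gradient matrix with an approximately semi-orthogonal one via a few Newton-Schulz iterations. The quintic coefficients below follow the openly published Muon reference implementation; whether V4 uses exactly this variant isn't something the article confirms:

  import torch

  def newton_schulz_orthogonalise(g, steps=5):
      # Pushes the singular values of g toward 1 while keeping its singular
      # vectors, i.e. an approximate orthogonalisation of the update.
      a, b, c = 3.4445, -4.7750, 2.0315          # public Muon coefficients
      x = g / (g.norm() + 1e-7)                  # scale so the iteration converges
      transposed = x.shape[0] > x.shape[1]
      if transposed:
          x = x.t()
      for _ in range(steps):
          s = x @ x.t()
          x = a * x + (b * s + c * s @ s) @ x
      return x.t() if transposed else x

  u = newton_schulz_orthogonalise(torch.randn(256, 128))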

Numbers that matter

The headline efficiency numbers, all measured against V3.2 at 1M-token context:

  • 27% of the inference FLOPs (3.7x reduction)
  • 10% of the KV cache memory (about 10x reduction)
  • 6.7% training overhead from the mHC constraint
  • 33T tokens of training data, curriculum-staged from 4K to 1M context

The benchmark that’s hardest to argue with is Putnam-2025, the formal math competition. V4 scored 120/120, full marks, ahead of every other model that’s tried it. On long-context retrieval at the 1M-token limit, DeepSeek-V4-Pro beats Gemini 3.1 Pro despite having a smaller training compute budget.

Why the architecture matters

V4 doesn’t beat the closed labs by spending more. It spends less per useful operation, and every architectural choice in the paper answers the same question: how do we avoid paying for tokens that don’t help?

CSA is “don’t attend to history that isn’t relevant to the current token”. HCA is “if you do need to glance at all of it, look at a tiny summary instead”. The sliding window is “if it’s already nearby, don’t bother compressing”. mHC is “don’t waste a training run on a signal explosion you could prevent by construction”.

That bias matters more than the architecture details. Closed labs have been buying their way out of the same problems with more chips. DeepSeek can’t, so they had to solve them in software. The result is a model whose marginal cost per token is small enough that the 1M context window is practical for real work.

The other thing worth noting is that DeepSeek published all of it. The paper, the weights, the kernels in TileLang, and Z3 proofs of kernel correctness. The closed labs keep this layer proprietary; DeepSeek released it.