Three Bets on Long-Context Attention

The interesting question for long-context models isn’t whether they fit a million tokens. It’s what happens when those tokens contain something the model has to recall. The architectures currently shipping in open-weight models make three different bets on how to spend attention compute, and the bet shows up as a recall pattern when the context fills.

Three recent releases sit at the three corners of that trade-off:

Gemma 4 26B (Gemma docs) - dense, sliding window attention
Qwen 3.6 35B-A3B (Qwen on Hugging Face) - MoE, gated DeltaNet
DeepSeek V3 (technical report) - MoE, Multi-head Latent Attention

I have not run a controlled benchmark across all three. The empirical half of this post is reporting from Maxime Labonne’s video, which runs Gemma 4 26B and Qwen 3.6 35B-A3B on the same reverse-engineering task: figure out the login flow of a vendor LTE modem portal from a few thousand lines of minified JavaScript, then pull live radio metrics. Gemma 4 fails. Qwen 3.6 solves it in about three hours fifty-five minutes across two days, matching the time Claude Sonnet 4.6 took. The architecture half comes from the model cards and papers.

Gemma 4 throws away the past

Gemma 4 uses sliding window attention in most of its layers. In those layers each token attends to the last few thousand tokens; everything older drops out of the per-layer view. Per-token attention cost is capped by the window size rather than growing with context length, which is what makes long context cheap at inference.
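As a minimal sketch of what the window does to the attention pattern, the snippet below builds a causal sliding-window mask. The function name, the sequence length of 8, and the window of 3 are illustrative assumptions, far smaller than anything a real model uses.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A query sees itself and the (window - 1) positions before it, never the future.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes, chosen only so the pattern is easy to read.
mask = sliding_window_mask(seq_len=8, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Number of keys each query can see: it stops growing once i >= window - 1.
visible_per_query = [sum(row) for row in mask]
print(visible_per_query)
```

Each row carries at most `window` marks, so the work per token stops growing once the sequence is longer than the window. The blank region to the left of the band is the part of the past that this layer never sees directly.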

The model can still pass information across layers, so a token can in principle reach further than its window. But the path is lossy: each layer is a bottleneck the size of the window. For tasks where the relevant information is local (continuing a paragraph, completing a function), sliding window is the right trade. For a task where the answer to “how does the login flow work” lives in chunks scattered across a hundred thousand tokens of minified code, it isn’t.
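A rough way to see both the reach and the bottleneck is to count hops. The sketch below assumes a stack made only of sliding-window layers, where information moves back at most `window - 1` tokens per layer. The layer count and window size are hypothetical placeholders, not Gemma 4’s real configuration, and any full-attention layers in the real model are ignored.

```python
import math


def max_reach(num_layers: int, window: int) -> int:
    """Furthest distance, in tokens, that information can travel back
    through a stack of sliding-window layers (one hop per layer)."""
    return num_layers * (window - 1)


def hops_needed(distance: int, window: int) -> int:
    """Minimum number of layer-to-layer hops to carry information
    `distance` tokens forward to the current position."""
    return math.ceil(distance / (window - 1))


# Hypothetical configuration, for illustration only.
NUM_LAYERS = 30
WINDOW = 4096

reach = max_reach(NUM_LAYERS, WINDOW)
hops = hops_needed(100_000, WINDOW)
print(f"theoretical reach: {reach} tokens")
print(f"hops to span 100,000 tokens: {hops}")
```

Under these assumptions the stack can nominally span the hundred thousand tokens, but only if the relevant detail survives being re-encoded at every one of those hops. That is the lossy path described above.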

Maxime’s video shows this failure mode on the modem-crawler task. Gemma 4 26B can’t reverse engineer the login flow. The architecture is making the right trade for the workloads it was tuned for; long-range recall across a hundred thousand tokens of obfuscated code is not one of them.

Qwen 3.6 compresses it

Qwen 3.6 35B-A3B uses gated DeltaNet on most layers and full attention on a smaller subset. Gated DeltaNet is a linear-attention variant: instead of growing a KV cache linearly with context, it maintains a fixed-size state matrix that absorbs past tokens. When the model needs to recall something far back, it reads from the state rather than reaching through the cache. The full-attention layers handle the cases where the compression isn’t enough.
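The fixed-size state can be sketched with the gated delta rule in its simplest single-head, token-by-token form: decay the state, read out what it currently predicts for the incoming key, then correct the state toward the new value. This is a sketch of the published recurrence, not Qwen’s implementation. In a real model `alpha` and `beta` are produced per token by the network and the update runs in chunked, parallel kernels; here they are hand-picked constants, and the helper names are hypothetical.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a d_v x d_k state matrix S.

    S_new = alpha*S - beta * ((alpha*S) @ k - v) @ k^T
    alpha in [0, 1] decays old memory; beta in [0, 1] sets the write strength.
    """
    d_v, d_k = len(S), len(k)
    decayed = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the decayed state currently returns for this key.
    predicted = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Move the stored value for this key toward v.
    return [
        [decayed[i][j] - beta * (predicted[i] - v[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]


def read(S, q):
    """Recall: multiply the state by a query vector."""
    return [sum(S[i][j] * q[j] for j in range(len(q))) for i in range(len(S))]


# Toy sizes: keys of width 4, values of width 2. The state stays 2 x 4 forever.
k1, k2, k3 = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
S = [[0.0] * 4 for _ in range(2)]

S = gated_delta_step(S, k1, [1.0, 2.0], alpha=1.0, beta=1.0)  # store under k1
S = gated_delta_step(S, k2, [3.0, 4.0], alpha=1.0, beta=1.0)  # store under k2
S = gated_delta_step(S, k1, [5.0, 6.0], alpha=1.0, beta=1.0)  # overwrite k1
print(read(S, k1), read(S, k2))

# With alpha < 1 every earlier association fades a little on each step.
S_faded = gated_delta_step(S, k3, [0.0, 0.0], alpha=0.5, beta=1.0)
print(read(S_faded, k2))
```

The state never grows, however many tokens are written into it. The price is that keys which are not orthogonal interfere with one another, which is one reason a hybrid design keeps some full-attention layers.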

From a memory-cost point of view, the end state is closer to Gemma 4 than to DeepSeek V3: the DeltaNet layers have a constant per-token cost and no growing KV cache. Recall is much better than Gemma 4’s, though, because the past is summarised rather than discarded. The compression is learned, so the state is trained to keep what matters, within the limits of its fixed size.

In Maxime’s video, Qwen 3.6 worked through the modem SDK in chunks, grepping for keywords, reading the relevant lines, then stitching the login flow together across multiple compaction cycles. Total time was about three hours fifty-five minutes, with an end state matching what Claude Sonnet 4.6 produced for him on the same task. The model has 35B parameters in total with 3B active per token (MoE). The total is small enough at Q4 to fit on a 36GB MacBook, and the low active count keeps generation fast enough to be practical locally.
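The local-hardware claim is back-of-envelope arithmetic. The sketch below assumes roughly 4.5 bits per weight, since 4-bit quantisation formats usually store scale metadata alongside the weights. That figure is an assumption, and the estimate ignores the cache, activations, and runtime overhead.

```python
def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Storage for `params` weights at the given precision, in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9    # every expert has to be resident in memory
ACTIVE_PARAMS = 3e9    # weights actually used for any one token
BITS_PER_WEIGHT = 4.5  # assumption: 4-bit weights plus quantisation scales

resident_gb = weight_gigabytes(TOTAL_PARAMS, BITS_PER_WEIGHT)
touched_gb = weight_gigabytes(ACTIVE_PARAMS, BITS_PER_WEIGHT)

print(f"weights resident in memory: {resident_gb:.1f} GB")
print(f"weights read per token:     {touched_gb:.1f} GB")
```

The first number is what has to fit in memory. The second is what has to be read for each generated token, which is what the MoE routing keeps small.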

DeepSeek V3 compresses the cache instead

DeepSeek V3 keeps full attention everywhere. Every token attends to every previous token, with no window and no linear-attention substitute. The compression mechanism is Multi-head Latent Attention (MLA), which projects each token’s keys and values for all heads into a single low-rank latent. The KV cache stores the latent; per-head K and V are reconstructed on demand at attention time.
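The store-the-latent, rebuild-on-demand idea can be sketched in a few lines. Everything here is a toy under stated assumptions: the dimensions are tiny, the random matrices stand in for learned projections, and the names (`W_down`, `W_up_k`, `W_up_v`, `append_token`, `keys_and_values`) are hypothetical. The sketch also leaves out parts of the real design, such as the separate positional (RoPE) key.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random projection standing in for learned weights."""
    return [[rng.gauss(0.0, 1.0) / cols ** 0.5 for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, D_LATENT, N_HEADS, D_HEAD = 16, 4, 4, 8  # toy sizes, not V3's

W_down = rand_matrix(D_LATENT, D_MODEL, rng)                            # shared down-projection
W_up_k = [rand_matrix(D_HEAD, D_LATENT, rng) for _ in range(N_HEADS)]   # per-head key up-projection
W_up_v = [rand_matrix(D_HEAD, D_LATENT, rng) for _ in range(N_HEADS)]   # per-head value up-projection

latent_cache = []  # the only thing kept per past token


def append_token(hidden):
    """Compress a token's hidden state and cache just the latent."""
    latent_cache.append(matvec(W_down, hidden))


def keys_and_values(head):
    """Rebuild one head's K and V for every cached token at attention time."""
    keys = [matvec(W_up_k[head], c) for c in latent_cache]
    values = [matvec(W_up_v[head], c) for c in latent_cache]
    return keys, values


for _ in range(5):
    append_token([rng.gauss(0.0, 1.0) for _ in range(D_MODEL)])

keys, values = keys_and_values(head=2)
print(len(latent_cache), len(latent_cache[0]), len(keys[0]), len(values[0]))
```

Every past token is still present in the cache and still takes part in attention. What shrinks is the width of what is stored for each token, not the number of tokens.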

The cache shrinks by a large factor that scales with the number of attention heads, because one shared latent per token replaces separate K and V tensors for every head. That is what makes long full-attention contexts economically viable. The model still does the same comparisons it would without MLA, so recall fidelity stays high. The cost moves from memory bandwidth into reconstruction compute.
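The size of the saving is simple arithmetic. The dimensions below are assumptions modelled on figures commonly cited for V3 (128 heads of width 128, a 512-wide latent, and a 64-wide positional key cached alongside it). Check them against the technical report before relying on the exact ratio.

```python
def standard_values_per_token(n_heads: int, d_head: int) -> int:
    """Values cached per token per layer with ordinary multi-head attention:
    one key and one value vector for every head."""
    return 2 * n_heads * d_head


def mla_values_per_token(d_latent: int, d_rope: int) -> int:
    """Values cached per token per layer with MLA: one shared latent
    plus one shared positional key."""
    return d_latent + d_rope


# Assumed dimensions; see the caveat above.
N_HEADS, D_HEAD = 128, 128
D_LATENT, D_ROPE = 512, 64

standard = standard_values_per_token(N_HEADS, D_HEAD)
mla = mla_values_per_token(D_LATENT, D_ROPE)
ratio = standard / mla

print(f"standard: {standard} values per token per layer")
print(f"MLA:      {mla} values per token per layer")
print(f"reduction: about {ratio:.0f}x")
```

With these numbers the reduction is a bit under half the head count. The ratio grows with the number of heads because the standard cache scales with it and the latent does not.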

I have not run V3 on the modem-crawler task. I have written up the V4 architecture before: V4 keeps MLA as the foundation and layers Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and a sliding window on top. The sliding window in V4 does the local-context job; the long-range job is split between CSA and HCA. V3 is the version of the architecture without those extra pathways.

What the three bets cost

Sliding window is cheapest at inference and worst at long-range recall. Gated DeltaNet sits in the middle on both axes. MLA is most expensive on the compute side but keeps full-attention recall behaviour.
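One way to picture the memory side of that ranking is to count what each design keeps per layer as the context grows. The constants below are hypothetical round numbers chosen to show the shape of the curves, not the configuration of any of the three models. The sketch counts stored values only, not the compute spent reconstructing or updating them.

```python
# Hypothetical per-layer sizes, for illustration only.
WINDOW = 4096               # sliding-window length in tokens
KV_WIDTH = 2048             # key + value numbers stored per token in a windowed layer
STATE_SIZE = 16 * 128 * 128 # fixed recurrent state: 16 heads, each a 128 x 128 matrix
LATENT_WIDTH = 576          # compressed latent stored per token in an MLA layer


def stored_values(kind: str, context_len: int) -> int:
    """Numbers held per layer after `context_len` tokens, under the toy sizes above."""
    if kind == "sliding_window":
        return min(context_len, WINDOW) * KV_WIDTH  # saturates at the window
    if kind == "gated_deltanet":
        return STATE_SIZE                           # independent of context length
    if kind == "mla":
        return context_len * LATENT_WIDTH           # linear, but with a small constant
    raise ValueError(f"unknown kind: {kind}")


for n in (4_096, 131_072, 1_048_576):
    row = {kind: stored_values(kind, n) for kind in ("sliding_window", "gated_deltanet", "mla")}
    print(n, row)
```

Two of the three curves go flat and one keeps climbing. The flat ones buy their cost ceiling by giving up direct access to old tokens. The climbing one pays for keeping every token addressable.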

The modem-crawler task is a useful indicator because it stresses the recall axis specifically. The relevant information is structurally far apart in the input, encoded in obfuscated code, and the model has to assemble it without the user pointing at the right lines. Gemma 4’s sliding window can’t reach across that distance. Qwen 3.6’s compressed state can. DeepSeek V3’s full attention with MLA presumably can too, though I have not seen the same task run on V3.

The comparison that matters going forward is whether the gated-DeltaNet route generalises. Compressed state is much cheaper than full attention; if the recall stays good across more workloads, full attention becomes a luxury rather than a requirement. If the compression turns out to leak in ways that don’t show up on the modem-crawler task, MLA stays the safer bet.