DeepSeek V4 on Apple Silicon: Current Status

A follow-up to the DeepSeek V4 architecture post: what the published checkpoints weigh, and how that lines up with current Apple Silicon memory.

DeepSeek released two variants. V4-Pro is 1.6T total parameters with 49B active per token. V4-Flash is 284B total with 13B active. Both ship with the same 1M-token context window and the same hybrid attention stack.

V4-Pro is too big for any Mac

V4-Pro at 16-bit precision is around 3.2TB of weights. Even at 4-bit it’s 800GB before the KV cache. The largest single Apple Silicon configuration, the M3 Ultra Mac Studio, maxes out at 512GB of unified memory.
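
The arithmetic, as a quick sanity check:

    # Weight footprint for V4-Pro (1.6T params) at different precisions,
    # against the 512GB unified-memory ceiling of the M3 Ultra Mac Studio.
    PARAMS = 1.6e12
    CEILING_GB = 512

    for label, bytes_per_param in [("FP16/BF16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
        size_gb = PARAMS * bytes_per_param / 1e9
        verdict = "fits" if size_gb <= CEILING_GB else "does not fit"
        print(f"{label}: {size_gb:,.0f}GB -> {verdict}")

Even a hypothetical 2-bit quant would be 400GB of weights before any KV cache.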

V4-Flash sizes

The native checkpoint ships MoE expert weights in FP4 and everything else (attention, norms, router) in FP8, and comes to around 146GB on disk.
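
Those two figures roughly pin down the expert/non-expert split. With FP4 at 0.5 bytes per parameter and FP8 at 1 byte, solving 0.5·e + (284B − e) ≈ 146GB gives about 276B parameters in the experts and about 8B everywhere else. A sketch of that back-of-envelope (it ignores quantisation scales and metadata, which add a few GB):

    # Expert/non-expert split implied by a 146GB FP4+FP8 checkpoint.
    TOTAL = 284e9             # total parameters
    CHECKPOINT_BYTES = 146e9
    FP4, FP8 = 0.5, 1.0       # bytes per parameter

    # FP4*e + FP8*(TOTAL - e) = CHECKPOINT_BYTES  =>  solve for e
    expert_params = (FP8 * TOTAL - CHECKPOINT_BYTES) / (FP8 - FP4)
    print(f"expert params:   ~{expert_params / 1e9:.0f}B")            # ~276B
    print(f"everything else: ~{(TOTAL - expert_params) / 1e9:.0f}B")  # ~8B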

Community GGUF builds from tecaprovn:

  • Q3_K_M ~100GB
  • Q4_K_M ~120GB
  • Q5_K_M ~138GB
  • Q8_0 ~194GB
  • BF16 ~334GB

The mlx-community has published 3-bit, 4-bit, and 8-bit builds, plus a 2-bit dynamic-quantised one.
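
A flat bits-per-weight estimate (params × bits ÷ 8) is the usual rule of thumb for these sizes. The published builds above all come in under it, which suggests they don’t store every tensor at the headline width; treat the flat estimate as an upper bound:

    # Naive flat-bit-width size estimates for a 284B-parameter model.
    TOTAL = 284e9

    for label, bpw in [("3-bit", 3), ("4-bit", 4), ("5-bit", 5), ("8-bit", 8), ("BF16", 16)]:
        print(f"{label}: ~{TOTAL * bpw / 8 / 1e9:.0f}GB")
    # 3-bit ~107GB, 4-bit ~142GB, 8-bit ~284GB, BF16 ~568GB: all of these
    # sit above the listed GGUF sizes, so the flat estimate is a ceiling.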

Apple Silicon memory ceilings

Configurations that put V4-Flash within range:

  • MacBook Pro M3 Max / M4 Max: up to 128GB unified memory
  • Mac Studio M4 Max: up to 128GB
  • Mac Studio M3 Ultra: 96GB, 256GB, or 512GB

Memory bandwidth from the same chips:

  • M3 Max (16-core, 128GB): 400GB/s
  • M4 Max (128GB): 546GB/s
  • M3 Ultra: 819GB/s

A 128GB MacBook Pro is the laptop ceiling. The 4-bit GGUF (~120GB) leaves around 8GB for the OS, the KV cache, and everything else, which is why LM Studio’s compatibility check flags Q4_K_M as “likely too large” on a 128GB machine. A 256GB or 512GB Mac Studio has real headroom: the 4-bit build with a long context fits comfortably, and the 512GB M3 Ultra can hold the Q8_0 (~194GB) or even the BF16 (~334GB) build.
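
The same check, expressed as a quick sketch; the 12GB OS reserve and the per-token KV cost here are illustrative assumptions, not LM Studio’s actual heuristic:

    # Rough "does this build fit?" check. The 12GB OS reserve and the
    # 0.1GB-per-1k-tokens KV estimate are assumptions, not measured values.
    def fits(ram_gb, model_gb, context_tokens,
             kv_gb_per_1k=0.1, os_reserve_gb=12.0):
        kv_gb = context_tokens / 1000 * kv_gb_per_1k
        return model_gb + kv_gb + os_reserve_gb <= ram_gb

    print(fits(128, 120, 8_000))     # Q4_K_M, 128GB MacBook Pro -> False
    print(fits(256, 120, 128_000))   # Q4_K_M, 256GB Mac Studio  -> True
    print(fits(512, 334, 128_000))   # BF16,   512GB M3 Ultra    -> True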

Why MoE matters for the bandwidth math

V4-Flash has 284B total parameters but only 13B are active per token. The router selects a small subset of experts and only those experts run for that token.
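
A minimal sketch of that routing step, assuming a plain softmax top-k gate; the expert count, hidden size, and k are illustrative, not V4’s published configuration:

    import numpy as np

    # Illustrative top-k gate: shapes and k are assumptions, not V4's spec.
    def route(hidden, gate_w, k=8):
        logits = hidden @ gate_w                  # (num_experts,) scores
        chosen = np.argsort(logits)[-k:]          # indices of the k winners
        w = np.exp(logits[chosen] - logits[chosen].max())
        return chosen, w / w.sum()                # softmax weights over the k

    rng = np.random.default_rng(0)
    experts, weights = route(rng.standard_normal(1024),
                             rng.standard_normal((1024, 256)))
    print(experts)    # only these expert FFNs run for this token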

On a memory-bandwidth-bound system, per-token decode time is roughly (active parameter bytes) / (memory bandwidth), so throughput is the reciprocal: bandwidth divided by the bytes read per token. With Flash, that’s the 13B active parameters plus the routing and attention paths, not the full 284B. So the per-token cost on Apple Silicon tracks closer to a dense 13-15B model than the full 284B parameter count would suggest.

A dense 70B+ model of comparable benchmark quality would have to read its full weight count per token, which is several times more memory traffic. That’s why a 284B-MoE-with-13B-active fits Apple Silicon’s bandwidth profile better than a 70B-dense model, even if both fit in memory.
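
Plugging the numbers above into that formula gives theoretical ceilings; real decode speed lands below these because attention, KV-cache reads, and routing overhead all add traffic:

    # Bandwidth-bound decode ceiling: tokens/s = bandwidth / bytes per token.
    def ceiling_tok_s(active_params, bytes_per_param, bandwidth_gb_s):
        return bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)

    for bw in (400, 819):                        # M3 Max, M3 Ultra (GB/s)
        moe = ceiling_tok_s(13e9, 0.5, bw)       # V4-Flash: 13B active, 4-bit
        dense = ceiling_tok_s(70e9, 0.5, bw)     # dense 70B, 4-bit
        print(f"{bw}GB/s: MoE ~{moe:.0f} tok/s vs dense ~{dense:.0f} tok/s")

At the same precision the gap is just the 70/13 active-parameter ratio, about 5x: the “several times more memory traffic” above, in concrete terms.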

Follow-up

I’ll write a follow-up with measured throughput, real first-token latency, and the runtime I settled on once I have those numbers in hand.