llama.cpp Merges Multi-Token Prediction: 78% Throughput Gain on Qwen3.6

llama.cpp has merged multi-token prediction (MTP) support. On Qwen3.6-27B running on an A10G, throughput rises from 25 to 45 tokens/second (a reported 78% gain) with no accuracy loss, enabled by two command-line flags. Unlike speculative decoding with a separate draft model, MTP uses a single model, so there is no second-model overhead. Models with MTP today: DeepSeek V3/V4, Nemotron 3 Super/Ultra, and the Qwen 3.5 and 3.6 dense models.

1 min read|agenticonsult Intelligence

llama.cpp has merged multi-token prediction (MTP), bringing a capability that vLLM has had for some time to the GGUF / LM Studio ecosystem. On Qwen3.6-27B on an A10G, users see 25 → 45 tokens/second (+78%) with --spec-type draft-mtp --spec-draft-n-max 2, and with zero accuracy degradation. Unlike speculative decoding with a separate draft model, MTP folds draft prediction into a single model, with no second-context overhead. Models currently shipping MTP weights include DeepSeek V3/V4 base/flash, Nemotron 3 Super and Ultra, and the Qwen 3.5 and 3.6 dense variants.
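
As a rough illustration of the figures and flags above, the following Python sketch computes the relative gain from the rounded throughput numbers and assembles a command line without running it. Only the two flags --spec-type draft-mtp and --spec-draft-n-max come from the article. The choice of the llama-server binary, the -m model argument, and the model filename are assumptions made for illustration, so check them against your own llama.cpp build.

```python
import shlex


def relative_gain(before_tps: float, after_tps: float) -> float:
    """Return the throughput gain as a percentage of the baseline."""
    return (after_tps - before_tps) / before_tps * 100.0


def mtp_command(model_path: str, draft_n_max: int = 2) -> list[str]:
    """Build (but do not run) an argv list that enables MTP drafting."""
    return [
        "llama-server",                 # assumption: the article does not name a binary
        "-m", model_path,               # hypothetical path to a GGUF that includes MTP tensors
        "--spec-type", "draft-mtp",     # flag quoted in the article
        "--spec-draft-n-max", str(draft_n_max),  # flag quoted in the article
    ]


if __name__ == "__main__":
    # Gain computed from the rounded figures in the article (25 and 45 tokens/second).
    print(f"{relative_gain(25, 45):.0f}%")
    # Hypothetical filename, used only to show the shape of the command.
    print(shlex.join(mtp_command("qwen3.6-27b-mtp.gguf")))
```

The rounded figures work out to an 80% gain. The reported 78% presumably reflects unrounded measurements, which the article does not give.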

Why It Matters

Because llama.cpp drives LM Studio and most consumer-grade local inference tooling, the merge reaches the broadest local-AI audience immediately. Users must re-download GGUFs that include the MTP tensors from Hugging Face, because existing weights do not contain them.

This breaking-news item was assembled from the cited primary source with AI assistance. It is intended for rapid situational awareness; refer to the original publication for the definitive statement.
