llama.cpp Merges Multi-Token Prediction: 78% Throughput Gain on Qwen3.6
llama.cpp has merged multi-token prediction (MTP), bringing a capability vLLM has had for some time to the GGUF / LM Studio ecosystem. On Qwen3.6-27B on an A10G, users report 25 → 45 tokens/second (+78%) with --spec-type draft-mtp --spec-draft-n-max 2, with zero accuracy degradation. Unlike speculative decoding with a separate draft model, MTP folds prediction into a single model with no second-context overhead. Models currently shipping MTP weights include DeepSeek V3/V4 base/flash, Nemotron 3 Super and Ultra, and Qwen 3.5 and 3.6 dense variants.
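To make the "faster with no accuracy loss" claim concrete, here is a toy Python sketch of draft-and-verify decoding with a draft limit of 2, mirroring --spec-draft-n-max 2. It is not llama.cpp's implementation: target_next and draft_next are hypothetical stand-ins for the main model and its MTP head, and the sketch assumes greedy decoding and the usual accept-the-matching-prefix rule. The point it illustrates is that every emitted token is the one the main model would have chosen anyway, so the output is unchanged while the number of main-model steps drops.

```python
from typing import List, Tuple

Token = int
VOCAB = 7  # tiny toy vocabulary


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the main model's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for an MTP head: usually agrees with the target.

    It is deliberately wrong whenever the context length is a multiple of 4,
    so the sketch exercises both accepted and rejected drafts.
    """
    guess = target_next(context)
    return (guess + 1) % VOCAB if len(context) % 4 == 0 else guess


def greedy_decode(prompt: List[Token], n_tokens: int) -> List[Token]:
    """Baseline: one main-model step per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def mtp_decode(
    prompt: List[Token], n_tokens: int, draft_n_max: int = 2
) -> Tuple[List[Token], int]:
    """Draft up to draft_n_max tokens, then verify them in one step.

    Returns the generated tokens and the number of verification steps.
    """
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        steps += 1
        # Draft phase: the cheap head proposes a short continuation.
        drafted: List[Token] = []
        ctx = list(out)
        for _ in range(draft_n_max):
            tok = draft_next(ctx)
            drafted.append(tok)
            ctx.append(tok)
        # Verify phase: a real engine would score all drafted positions in
        # a single batched forward pass; here we just call target_next.
        for tok in drafted:
            expected = target_next(out)
            out.append(expected)  # always keep the main model's token
            if tok != expected:
                break  # first mismatch: discard the remaining drafts
        else:
            # Every draft matched, so the same pass yields one extra token.
            out.append(target_next(out))
    return out[len(prompt):len(prompt) + n_tokens], steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 20)
    generated, steps = mtp_decode(prompt, 20)
    print("identical output:", generated == baseline)
    print("main-model steps:", steps, "vs", len(baseline))
```

In this toy the speedup depends entirely on how often the drafts match; real gains, such as the figure reported above, also depend on the cost of drafting and verifying on the actual hardware.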
Why It Matters
Because llama.cpp drives LM Studio and most consumer-grade local inference tooling, this change reaches the broadest local-AI audience immediately. Users must re-download GGUFs that include MTP tensors from Hugging Face, because existing weights do not contain them.