llama.cpp Merges Multi-Token Prediction: 78% Throughput Gain on Qwen3.6
llama.cpp merges multi-token prediction (MTP) support, boosting Qwen3.6-27B throughput by 78% on an A10G with zero accuracy loss. No second model is needed; two CLI flags activate it.
Nous Hermes Agent v0.14.0 exposes Claude Pro, ChatGPT Pro, and SuperGrok as local OpenAI-compatible endpoints, eliminating the pay-twice problem for subscription holders using coding agents.
Qwen3 35B MoE distilled from Claude Opus is available free as a quantized GGUF, offering near-frontier local inference capability at zero cost.
Shimmy v1.9.0 is a 4.8 MB single-binary, OpenAI-compatible local inference server that bundles all GPU backends and claims a 142x size advantage over Ollama.
Developers running DeepSeek V4 Flash with 2-bit selective GGUF via llama.cpp describe it as 'the first time I feel I have a frontier model running on my computer' — a milestone for local AI.
Intel releases W4A16 INT4 quantizations of DeepSeek-V4-Pro and Flash via AutoRound. No MXFP4 hardware is required, which broadens the range of hardware that can self-host DeepSeek V4 at near-full quality.
Qwen3.6-27B drops quietly under Apache 2.0: AAII score 46, optimized for M-series local inference, strong agentic coding — the best dense local model available.