7 articles

#benchmarks

MM-ToolBench: Claude Opus 4.6 Achieves Only 32% Task Success vs 94% Human Baseline

MM-ToolBench's closed-loop multimodal verification benchmark shows Claude Opus 4.6 at 32% task success versus 94% for humans across 100 tasks spanning 27 MCP servers and 324 tools.

May 20, 2026 · 1 min read

Tools · breaking

LangChain DeepAgents Harness Profiles: 10–20pt Benchmark Jump

LangChain DeepAgents Harness Profiles deliver 10–20pt tau2-bench gains via per-model system prompt and middleware overrides; harness is now a first-class versioned object.

May 7, 2026 · 1 min read

Research · breaking

Alibaba's AgenticQwen-30B (3B Active) Matches Qwen3-235B on Tool-Use

AgenticQwen-30B-A3B averages 50.2, matching Qwen3-235B on tool-use benchmarks. Dual RL flywheels flip the cost curve for production agents.

May 4, 2026 · 1 min read

Research · breaking

Harness Engineering Beats Model Upgrades: AHE Framework and 20% Terminal-Bench Gains

AHE framework lifts Pass@1 from 69.7% to 77.0%; harness-only changes yield 13–20% Terminal-Bench gains. Model upgrades are no longer the only lever.

May 4, 2026 · 1 min read

Research · breaking

Anthropic's BioMysteryBench: Claude Solves 30% of Expert-Stumping Problems

Anthropic's BioMysteryBench tested Claude on 99 bioinformatics problems; latest models solved ~30% of expert-stumping cases in open-ended research.

April 30, 2026 · 1 min read

Technology · breaking

Sakana AI Launches Fugu Beta: Multi-Agent System Hits SOTA on Three Benchmarks

Sakana AI's Fugu beta hits SOTA on SWE-Pro, GPQA-D, and ALE-Bench with dynamic frontier model orchestration via an OpenAI-compatible API.

April 24, 2026 · 1 min read

Research · report

Open-Source LLM Landscape Q1 2026: Performance, Licensing, and Deployment Economics

A comparative analysis of the open-source LLM ecosystem entering Q2 2026: benchmarking performance against proprietary alternatives, mapping the licensing landscape, and calculating total cost of ownership for self-hosted deployments.

March 15, 2026 · 16 min read
