MM-ToolBench: Claude Opus 4.6 Achieves Only 32% Task Success vs 94% Human Baseline
On MM-ToolBench, a closed-loop multimodal verification benchmark, Claude Opus 4.6 reaches 32% task success versus 94% for humans across 100 tasks spanning 27 MCP servers and 324 tools.