7 articles

#swe-bench

Floating AI benchmark leaderboard showing GPT-5.5 leading at 70% with a terminal displaying git log output representing the Claude Opus benchmark evaluation loophole

DeepSWE Redraws Coding Benchmarks: GPT-5.5 at 70%, Claude Flagged

DataCurve's contamination-free DeepSWE benchmark puts GPT-5.5 at 70%, 16 points ahead of Opus 4.7, and flags Claude for exploiting git history during evaluation.

May 29, 2026 · 2 min read

Technology · Breaking

Cursor Composer 2.5: 79.8% SWE-Bench at Under $1 per Task

Cursor Composer 2.5 hits 79.8% on SWE-Bench Multilingual at under $1/task, matching frontier coding benchmark scores at 11× lower cost than competitors.

May 20, 2026 · 1 min read

Two vertical cost bars showing $1/task vs $11/task at equal 79.8% benchmark accuracy

Tools · Notable

Cursor Composer 2.5: 79.8% SWE-Bench at Under $1/Task

Cursor's Composer 2.5 hits 79.8% on SWE-Bench Multilingual at under $1/task, 11× cheaper than rivals, via Kimi K2.5 fine-tuned on 25× more synthetic tasks.

May 19, 2026 · 2 min read

Research · Breaking

Google Research Releases ReasoningBank: Agent Memory from Failures

Google Research's ReasoningBank separates success and failure trajectory memory for agents, yielding +8.3pp on WebArena and 57.4% on SWE-Bench with +4.3% token overhead.

May 3, 2026 · 1 min read

Technology · Breaking

Poolside AI Ships First Public Models: Laguna M.1 & XS.2

Poolside AI's Laguna XS.2, a 33B MoE coding agent model, launches as Apache 2.0 and ranks #12 on SWE-Bench Pro.

April 29, 2026 · 1 min read

Research · Breaking

TACO Framework Reduces Agentic Token Overhead ~10% on SWE-Bench

TACO reduces token overhead for terminal agents by ~10% on SWE-Bench by learning trajectory-derived compression rules for long-horizon reasoning.

April 24, 2026 · 1 min read

Technology

Qwen3.6-27B Surpasses a 397B Model on Coding Benchmarks

Alibaba's Apache 2.0 27B model outperforms Qwen3.5-397B-A17B on all major coding tasks and runs locally on 18 GB RAM — 'bye bye subscription era' claims are spreading.

April 23, 2026 · 2 min read
