NanoGPT-Bench: Coding Agents Recover Only 9.3% of Human AI R&D Progress
NanoGPT-Bench finds coding agents including Codex and Claude Code achieve just 9.3% of human AI R&D progress, tuning hyperparams but missing algorithmic research breakthroughs.
NanoGPT-Bench finds coding agents including Codex and Claude Code achieve just 9.3% of human AI R&D progress, tuning hyperparams but missing algorithmic research breakthroughs.
physics-intern multi-agent framework: Gemini 3.1 Pro goes 17.7% → 31.4% on CritPt benchmark, new SOTA. Specialized teams self-correct and compute intermediate results.
GLM 5.1 tops Artificial Analysis intelligence index over all closed models. Chinese open-weights model leads SWE-Bench Pro. Intelligence index growth exceeding Moore's Law.
Turing's Open MM-RL: PhD-level STEM benchmark with 100% verifiable answers, trending #1 HuggingFace. Every prompt double-vetted by PhD specialists. 3,000 more tasks coming.
DeepSeek v4 Flash Thinking beats Gemini 3.1 Flash Lite on a scientific reasoning benchmark in all three rounds, including self-verification stability.
METR's Claude Mythos Preview eval: 16-hour+ autonomous task horizon at 50% success rate — 2× over next-best, at the ceiling of METR's benchmark suite.
Google DeepMind's AI co-mathematician hits 48% on FrontierMath Tier 4, setting a new all-AI record on the hardest formal mathematics benchmark available.
LangChain and Harvey released LAB, an open-source benchmark for AI agents on long-horizon legal tasks covering research, analysis, and document drafting.
APEX-Agents benchmark for consultant/lawyer/banker-level AI work now has a Hugging Face leaderboard for open-source model evaluation.
GPT-5.5 scores 71.4% vs Mythos Preview's 68.6% on agentic benchmarks; GPT-5.5 also completed a 12-hour expert task in 11 minutes for $1.73.

A peer-reviewed AlphaZero benchmark and a global hackathon both confirm Claude Opus 4.7 as the current frontier in agentic coding.
Memori hits 81.95% LoCoMo accuracy at just 1,294 tokens/query — 67% smaller prompts than Zep, 20x cheaper than full-context — with MCP server and multi-agent attribution model.
LlamaIndex's ParseBench launches on Kaggle: 2K enterprise pages, 167K+ test rules, 5 dimensions, with Gemini 3 Flash leading the current board.

OpenAI makes ChatGPT free for verified US medical professionals and releases HealthBench Professional — an open benchmark that GPT-5.4 outperforms on against physicians.
Curated AI insights — sent when there's something worth your inbox.