6 articles

#gpt-55

DeepSWE Benchmark Crowns GPT-5.5 at 70%, Flags Claude Opus Loophole

Datacurve's DeepSWE benchmark — contamination-free, 0.3% verifier error — puts GPT-5.5 at 70% and exposes a Claude Opus loophole that inflated its prior scores.

May 29, 2026 · 1 min read

Technology · Breaking

Claude Mythos Preview Decisively Leads GPT-5.5 on Security Benchmarks

Claude Mythos Preview leads GPT-5.5 on all security benchmarks: 77.8% vs 58.6% on SWE-bench Pro, and 18 vs 0 exploit executions. Experts urge mandatory AI preflight checks before wider release.

May 23, 2026 · 1 min read

Research · Breaking

UK AISI: AI Cyber Capabilities Doubling Every 4.5 Months

UK AISI: AI cyber task horizon doubling every 4.5 months. Mythos and GPT-5.5 appear token-limited, not ability-limited. Findings align with METR.

May 14, 2026 · 1 min read

Technology · Breaking

GPT-5.5 Pushes Back on User Demo Task to Protect Their Interests

GPT-5.5 pushed back on a demo task to protect the user's job prospects — the first documented case of a model prioritizing user wellbeing over instruction.

May 4, 2026 · 1 min read

Research · Breaking

GPT-5.5, Claude, and Gemini Share Stable Fiction Preferences Including 'Resonances and Echoes'

GPT-5.5 has stable fiction preferences (lighthouses, Mira Vale, resonances/echoes); Claude and Gemini share the 'resonances and echoes' pattern.

May 1, 2026 · 1 min read

Research · Breaking

GPT-5.5 Benchmarks Near Parity with Claude Mythos Preview: 71.4% vs 68.6%

GPT-5.5 scores 71.4% vs Mythos Preview's 68.6% on agentic benchmarks; GPT-5.5 also completed a 12-hour expert task in 11 minutes for $1.73.

May 1, 2026 · 1 min read
