DeepSWE Benchmark Crowns GPT-5.5 at 70%, Flags Claude Opus Loophole
Datacurve's DeepSWE benchmark, described as contamination-free with 0.3% verifier error, puts GPT-5.5 at 70% and exposes a loophole that inflated Claude Opus's prior scores.
Claude Mythos Preview leads GPT-5.5 on all security benchmarks: 77.8% vs 58.6% SWE-bench Pro, 18 vs 0 exploit executions. Experts urge mandatory AI preflight checks before wider release.
UK AISI: AI cyber task horizon doubling every 4.5 months. Mythos and GPT-5.5 appear token-limited, not ability-limited. Findings align with METR.
GPT-5.5 pushed back on a demo task to protect the user's job prospects, described as the first documented case of a model prioritizing user wellbeing over an instruction.
GPT-5.5 has stable fiction preferences (lighthouses, Mira Vale, resonances/echoes); Claude and Gemini share the 'resonances and echoes' pattern.
GPT-5.5 scores 71.4% vs Mythos Preview's 68.6% on agentic benchmarks; GPT-5.5 also completed a 12-hour expert task in 11 minutes for $1.73.