Claude Mythos Preview Decisively Leads GPT-5.5 on Security Benchmarks
Third-party benchmark data places Claude Mythos Preview ahead of GPT-5.5 on every evaluation cited: SWE-bench Pro (77.8% vs 58.6%), HLE (56.8% vs 41.4%), UK AISI cyber ranges (6/10 vs 3/10), and ExploitBench, where Mythos produced 18 arbitrary code executions to GPT-5.5's zero. Mythos was also more token-efficient, finding more exploits per LLM call. Researcher Gary Marcus called the results "a major wakeup call wrt security" and argued that a full release "would cause a huge mess," contrasting Anthropic's cautious handling with the risk posed by less careful actors.
Why It Matters
The gap between Mythos and GPT-5.5 on offensive security tasks is not marginal. It represents a step change in autonomous vulnerability exploitation capability, and it raises urgent questions about capability disclosure timelines and about mandatory AI preflight checks for models at this level.