Claude Mythos Preview Decisively Leads GPT-5.5 on Security Benchmarks

Benchmark data shows Claude Mythos Preview outperforming GPT-5.5 on every evaluation cited: SWE-bench Pro (77.8% vs 58.6%), ExploitBench (18 arbitrary code executions vs 0), and UK AISI cyber ranges (6/10 vs 3/10). Gary Marcus warns that a full release "would cause a huge mess."

1 min read | agenticonsult Intelligence

Third-party benchmark data places Claude Mythos Preview ahead of GPT-5.5 on every evaluation cited: SWE-bench Pro (77.8% vs 58.6%), HLE (56.8% vs 41.4%), UK AISI cyber ranges (6/10 vs 3/10), and ExploitBench, where Mythos produced 18 arbitrary code executions versus GPT-5.5's zero. Mythos was also reported to be more efficient, finding more exploits per LLM call. Researcher Gary Marcus called the results "a major wakeup call wrt security" and argued that a full release "would cause a huge mess," contrasting Anthropic's cautious handling with the risk posed by less careful actors.
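To put the reported figures side by side, the short Python sketch below computes the absolute gap and the ratio for each benchmark. The numbers are copied from the paragraph above. The names `scores` and `gap` are illustrative only, and the ratio is left undefined for ExploitBench because GPT-5.5's reported count is zero.

```python
# Scores as reported in this item: (Claude Mythos Preview, GPT-5.5).
# Units differ per benchmark, so gaps are only comparable within a row.
scores = {
    "SWE-bench Pro (%)": (77.8, 58.6),
    "HLE (%)": (56.8, 41.4),
    "UK AISI cyber ranges (of 10)": (6, 3),
    "ExploitBench (arbitrary code executions)": (18, 0),
}


def gap(mythos, gpt):
    """Return (absolute gap, ratio); ratio is None when the baseline is zero."""
    diff = round(mythos - gpt, 1)
    ratio = round(mythos / gpt, 2) if gpt else None
    return diff, ratio


gaps = {name: gap(*pair) for name, pair in scores.items()}

for name, (diff, ratio) in gaps.items():
    ratio_text = f"{ratio}x" if ratio is not None else "n/a (baseline is zero)"
    print(f"{name}: gap {diff}, ratio {ratio_text}")
```

On the percentage benchmarks the gap is in percentage points, not percent. The ExploitBench row is the one where a ratio cannot express the difference at all, since any nonzero count against zero is unbounded.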

Why It Matters

The gap between Mythos and GPT-5.5 on offensive security tasks is not marginal. The ExploitBench result in particular (18 arbitrary code executions versus zero) points to a step change in autonomous vulnerability exploitation, which raises urgent questions about capability disclosure timelines and mandatory AI preflight checks for models at this level.

This breaking-news item was assembled from the cited primary source with AI assistance. It is intended for rapid situational awareness; refer to the original publication for the definitive statement.
