physics-intern Multi-Agent Framework Nearly Doubles Gemini 3.1 Pro Score on CritPt
The physics-intern framework takes Gemini 3.1 Pro from 17.7% to 31.4% on CritPt, described as one of the hardest benchmarks for LLMs on theoretical physics. That is a gain of 13.7 percentage points, or roughly 1.77 times the baseline score. The framework decomposes hard problems and dispatches the pieces to specialized agent teams that derive equations, compute intermediate results, self-correct, and revisit their approach when needed. The result is a new state of the art on CritPt, achieved not with a better base model but with a better multi-agent orchestration layer around the same model.
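The decompose-and-dispatch pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the physics-intern implementation: the role names, `decompose`, `run_with_self_correction`, `orchestrate`, and the stub model are all hypothetical, and a real system would replace the stub with calls to an actual model and a real verifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any function from prompt text to response text.
ModelFn = Callable[[str], str]
CheckFn = Callable[[str], bool]


@dataclass
class SubTask:
    role: str    # hypothetical specialist team that handles this piece
    prompt: str  # instruction sent to that team


def decompose(problem: str) -> List[SubTask]:
    """Split a problem into one subtask per (assumed) specialist role."""
    return [
        SubTask("derive", f"Derive the governing equations for: {problem}"),
        SubTask("compute", f"Compute intermediate results for: {problem}"),
    ]


def run_with_self_correction(
    task: SubTask, model: ModelFn, check: CheckFn, max_rounds: int = 3
) -> str:
    """Call the model at most max_rounds times, revising while the check fails."""
    answer = model(task.prompt)
    for _ in range(max_rounds - 1):
        if check(answer):
            break
        # Feed the rejected attempt back so the team can revise it.
        answer = model(f"{task.prompt}\nRejected attempt:\n{answer}\nRevise it.")
    return answer


def orchestrate(
    problem: str, teams: Dict[str, ModelFn], check: CheckFn, max_rounds: int = 3
) -> Dict[str, str]:
    """Dispatch each subtask to its team and collect the results by role."""
    return {
        task.role: run_with_self_correction(task, teams[task.role], check, max_rounds)
        for task in decompose(problem)
    }


def make_stub_model() -> Tuple[ModelFn, Dict[str, int]]:
    """Stand-in for a real model: first call returns a draft, later calls pass."""
    calls = {"n": 0}

    def stub(prompt: str) -> str:
        calls["n"] += 1
        return "draft" if calls["n"] == 1 else "checked result"

    return stub, calls
```

In this sketch the same base model can back every team, which mirrors the article's point that the gain comes from the orchestration layer rather than from a different model.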
Why It Matters
Nearly doubling a frontier model's score on a hard benchmark through orchestration alone suggests that improvements in multi-agent architecture can deliver capability gains comparable to a model upgrade, without the compute cost of training a larger model.