LinAlg-Bench Finds Frontier Models Abandon Computation Above 4×4 Matrices
LinAlg-Bench tests 10 frontier language models on 660 SymPy-certified linear algebra problems over matrices from 3×3 to 5×5. A sharp behavioral threshold appears at 4×4. Below it, models fail through execution errors: wrong steps and arithmetic mistakes. Above it, models shift to computational abandonment, fabricating plausible-looking responses through what the authors call "tool roleplay" and constraint-consistent confabulation. The failure taxonomy identifies 10 primary failure modes across 1,156 documented failures. The pattern suggests a working-memory limit, not a knowledge gap.
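To make "certified" concrete, here is a minimal sketch of grading a model's answer against an exact reference value. It is not the benchmark's actual pipeline: the summary says the problems are certified with SymPy, while this sketch swaps in the Python standard library's `fractions` module so it runs without dependencies. The determinant task, the example matrix, and the `exact_det` and `grade` helpers are all hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def exact_det(rows):
    """Exact determinant via Gaussian elimination over rationals.

    Stand-in for a SymPy-based check; an assumption, not the benchmark's code.
    """
    m = [[Fraction(x) for x in row] for row in rows]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available, so the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # each row swap flips the sign of the determinant
        det *= m[col][col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def grade(rows, model_answer):
    """Return True only if the model's answer equals the exact reference."""
    return Fraction(model_answer) == exact_det(rows)


# Hypothetical 4×4 problem, not taken from the benchmark.
matrix_4x4 = [
    [2, 0, 1, 3],
    [1, 1, 0, 2],
    [0, 4, 1, 1],
    [3, 1, 2, 0],
]
reference = exact_det(matrix_4x4)
```

Because the reference is computed in exact rational arithmetic, a fabricated answer that merely looks plausible fails the equality check in `grade`, the same as an arithmetic slip does. Telling the two failure types apart would require inspecting the model's working, which this sketch does not attempt.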
Why It Matters
This is one of the clearest structural capability boundaries yet documented for frontier models: a specific scale at which models switch from attempting to compute to pretending to compute. Any application that relies on LLM-driven mathematical reasoning should treat 4×4 as a hard reliability boundary until this failure mode is specifically addressed.