LinAlg-Bench Finds Frontier Models Abandon Computation Above 4×4 Matrices

LinAlg-Bench tests 10 frontier models on 660 SymPy-certified linear algebra problems on matrices from 3×3 to 5×5 and identifies a sharp behavioral threshold at the 4×4 scale. Below it, models fail through execution errors; above it, they shift to computational abandonment, fabricating responses through tool roleplay and constraint-consistent confabulation rather than computing.

LinAlg-Bench tests 10 frontier language models on 660 SymPy-certified linear algebra problems on matrices from 3×3 to 5×5. A sharp behavioral threshold appears at 4×4. Below it, models fail via execution errors such as wrong steps and arithmetic mistakes. Above it, models shift to computational abandonment: they stop computing and instead fabricate plausible-looking responses through what the authors call "tool roleplay" and through constraint-consistent confabulation. The failure taxonomy identifies 10 primary failure modes across 1,156 documented failures. Because the kind of failure changes with matrix size, the pattern suggests a working-memory limit, not a knowledge gap.
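To illustrate what "SymPy-certified" could mean in practice, the sketch below computes an exact determinant with SymPy and compares a claimed answer against it. This is not the benchmark's actual harness: the function names `certify_determinant` and `check_claim`, the choice of the determinant as the task, and the sample matrix are all assumptions made for illustration.

```python
from sympy import Matrix, Rational

def certify_determinant(rows):
    """Return the exact determinant of a matrix with integer or rational entries."""
    # SymPy uses exact arithmetic here, so there is no floating-point rounding.
    return Matrix(rows).det()

def check_claim(rows, claimed):
    """Return True if a claimed determinant matches the exact SymPy value."""
    # `claimed` is a hypothetical model answer, given as an int or a string like "7/3".
    return certify_determinant(rows) == Rational(claimed)

# Hypothetical 4x4 problem, the size at which the article reports the threshold.
problem = [
    [2, 0, 1, 3],
    [1, 4, 0, 2],
    [0, 1, 5, 1],
    [3, 2, 1, 0],
]

if __name__ == "__main__":
    print("exact determinant:", certify_determinant(problem))
    print("claim of -192 accepted:", check_claim(problem, -192))
```

A check of this kind catches a fabricated answer regardless of how plausible the accompanying reasoning looks, since only the final value is compared against the exact result.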

Why It Matters

This is one of the clearest structural capability boundaries yet documented for frontier models: a specific scale at which models switch from attempting to compute to pretending to compute. Any application relying on LLM-driven mathematical reasoning should treat 4×4 as a hard reliability boundary until models are shown to compute, rather than fabricate, at and beyond that size.