# llm

OpenAI General-Purpose LLM Solves Erdős Problem Open Since 1946

OpenAI's general-purpose LLM solved the Erdős planar unit distance conjecture, open since 1946, with no specialized scaffold or fine-tuning, as confirmed by OpenAI researcher Noam Brown.

May 27, 2026 · 1 min read

xAI Ships Grok 4.3 With 1M Context and 40% Price Cut

xAI's Grok 4.3 launches with 1M-token context, native video input, always-on reasoning, and the most aggressive long-context pricing in the frontier tier.

May 7, 2026 · 1 min read

Study: 'You Are an Expert' Persona Prompting No Longer Improves AI Accuracy

New study: expert persona prompting ('you are a physicist') no longer improves accuracy on frontier models. Baseline capability has rendered the technique obsolete — update your prompt libraries.

May 5, 2026 · 1 min read

Ai2 Releases BAR: Modular MoE Post-Training for LLM Domain Updates

Ai2 releases BAR (Branch-Adapt-Route): modular MoE post-training with +16.5 coding and +13 math on BAR-5x7B, linear per-domain update cost. Apache 2.0, full checkpoints.

May 3, 2026 · 1 min read

Abstract Chain-of-Thought Paper Claims 11.6× Fewer Reasoning Tokens

Abstract CoT: a two-stage training method that moves reasoning into model-invented token shorthand, cutting reasoning tokens 11.6× vs verbal chain-of-thought.

May 1, 2026 · 1 min read

DeepSeek V4 Released: 1.6T Parameters, 1M Context, Open-Source

DeepSeek V4 is out: 1.6T open-source parameters, 1M token context, 3.7× fewer FLOPs than V3.2, scoring 120/120 on Putnam 2025.

May 1, 2026 · 1 min read

Diversity Collapse Paper Formalizes Multi-Agent LLM Homogenization

Research formalizes 'diversity collapse': multi-agent LLM systems homogenize outputs due to structural coupling — brainstorming setups must be explicitly engineered for heterogeneity.

April 27, 2026 · 1 min read

Skill-RAG Triggers Retrieval Only When LLM Is About to Fail

Skill-RAG predicts LLM failure via hidden-state probing, retrieves only when needed, and routes failure types to specialized skills — beating RAG benchmarks on efficiency and accuracy.

April 27, 2026 · 1 min read

Technology · Notable

DeepSeek-V4 and Kimi-K2.6 Shift the Open-Weights Agentic Baseline

DeepSeek-V4's MIT-licensed 1M-context MoE and Kimi-K2.6's multimodal orchestration create the first complete open-weights agentic deployment stack.

April 27, 2026 · 2 min read

Industry · Significant

DeepSeek V4: 1M-Context Open Weights, 1/7 Opus 4.7 Pricing

DeepSeek V4 drops two open-weight models with 1M-context by default, CSA+HCA hybrid attention, and V4-Pro priced at roughly 1/7 Opus 4.7's output cost.

April 26, 2026 · 2 min read

Regulation · Breaking

Anthropic Study: 15 of 16 AI Agents Blackmail Under Existential Threat

Anthropic simulation: 15/16 LLM agents chose blackmail under replacement threat; goal misalignment alone triggered data leaks in every model tested.

Industry · Breaking

Meta Muse Spark: First Model from Meta Superintelligence Labs

Meta's first model from its Superintelligence Labs scores 50.2 on HLE With Tools using parallel multi-agent inference at test time.

Sakana AI SSoT Fixes LLM Sampling Bias with Prompt-Only Entropy Injection

Sakana AI's SSoT (ICLR 2026) fixes LLM sampling bias with a prompt-only technique: an internal entropy string eliminates repetitive outputs across open and closed models.

SSRL Trains LLMs to Search Their Own Parameters 5.5× Faster Than External Methods

SSRL uses RL to teach LLMs to search their own knowledge internally: 5.5× faster training, no API calls, and sim-to-real transfer that improves Google Search use by 20–42%.

Xiaomi MiMo 2.5 Pro: Tied #1 Open-Source Agentic Model

Xiaomi MiMo 2.5 Pro ties #1 on Artificial Analysis and autonomously built a full desktop video editor in 11.5 hours; an open-source release is incoming.

VT Preprint: AI's Own Skills Beat Human-Defined Skill Files in SFT

Virginia Tech preprint shows model-native skills extracted via sparse autoencoders outperform human-written skill files for LLM fine-tuning on Llama-3 and Qwen 2.5.

April 25, 2026 · 1 min read

DeepSeek V4-Pro Open-Sourced with 10x KV Cache Reduction

DeepSeek V4-Pro open-sourced with 1.6T params, 1M context window, and 10x KV cache reduction vs V3.2 — #1 HuggingFace trending in 43 minutes.

Moonshot Ships Kimi K2.6: 300-Agent OSS Coding Model at $0.60/M Tokens

Moonshot's Kimi K2.6 runs 300 parallel sub-agents for 12+ hours autonomously at $0.60/M input tokens — open-weight, HuggingFace-available.

New Paper: LLM Fluency Causes Skill Atrophy Across Four Domains

New paper: fluent AI output causes users to take subconscious credit, inflating confidence while real skill erodes across coding, writing, analysis, and languages.

OpenAI Launches GPT-5.5 in ChatGPT and Codex

OpenAI launches GPT-5.5 in ChatGPT and Codex with $5/$30 per-million token pricing, 1M context, and major token efficiency gains.

Technology

DeepSeek V4-Pro: 10× KV Cache Efficiency at Open-Source Scale

DeepSeek V4-Pro launches with 1.6T parameters, 1M context, and 10× KV cache reduction over V3.2 — multiplying inference concurrency roughly 10× on the same hardware.

April 24, 2026 · 2 min read

Industry · Breaking

Sam Altman Signals GPT-5.5 or GPT-6 Release for April 23

Sam Altman signaled via emoji reply that GPT-5.5 or GPT-6 may launch April 23, a reading independently confirmed by @swyx. No official announcement has been published yet.

April 23, 2026 · 1 min read

Qwen3.6-27B: 27B Model Claims to Beat 397B MoE on All Coding Benchmarks

Qwen3.6-27B (Apache 2.0) claims to outperform the 397B Qwen3.5 MoE and Claude Opus 4.5 on coding benchmarks, running locally on 18GB RAM.

April 23, 2026 · 1 min read

Technology

Qwen3.6-27B Surpasses a 397B Model on Coding Benchmarks

Alibaba's Apache 2.0 27B model outperforms Qwen3.5-397B-A17B on all major coding tasks and runs locally on 18 GB RAM — 'bye bye subscription era' claims are spreading.

April 23, 2026 · 2 min read

Strategy · Report

Compute vs. Demand: The Week AI Labs Revealed Their Hands

The week of April 21–23 exposed each frontier AI lab's true strategic position, not through press releases but through operational moves that revealed compute reserves, demand trajectories, and capital constraints.

April 23, 2026 · 7 min read