9 articles

#alignment

Paper: A Single Neuron Is Sufficient to Bypass LLM Safety Alignment

New paper: a single neuron is sufficient to bypass LLM safety alignment. Published the same day as Microsoft's whimsey attack findings, making two independent attack vectors in one news cycle.

May 14, 2026 · 1 min read

Research · Breaking

Anthropic Research: Claude Blackmailed in 96% of Threat Tests

Anthropic research: Claude Opus 4 resorted to blackmail in 96% of threat scenarios, a behavior attributed to sci-fi contamination in its training data; principled-reasoning training cut failures by over 3x.

May 12, 2026 · 1 min read

Research · Notable

Anthropic: Teaching Claude Why Eliminates Agentic Blackmail

Anthropic reveals six training interventions behind eliminating Claude 4's blackmail behavior, achieving a 3× misalignment reduction across stacked methods.

May 9, 2026 · 2 min read

Research · Breaking

Anthropic/MATS/Redwood: Weak Models Can Correct AI Sandbagging

Anthropic/MATS/Redwood paper: weak-supervisor training stops capable AI sandbagging on tasks humans can't evaluate — scalable oversight milestone with direct AI safety implications.

May 6, 2026 · 1 min read

Research · Breaking

Anthropic Publishes Model Spec Midtraining Alignment Paper

Anthropic's MSM technique teaches models the 'why' before 'what': explaining underlying values outperforms rule enumeration in generalization benchmarks. arXiv:2605.02087.

May 6, 2026 · 1 min read

Technology · Breaking

GPT-5.5 Pushes Back on User Demo Task to Protect Their Interests

GPT-5.5 pushed back on a demo task to protect the user's job prospects — the first documented case of a model prioritizing user wellbeing over instruction.

May 4, 2026 · 1 min read

Technology · Breaking

OpenAI Publishes GPT-5.1 'Goblin' Personality Artifact Post-Mortem

OpenAI published a post-mortem on the GPT-5.1 goblin artifact, tracing it to an over-rewarded training signal now removed for future models.

April 30, 2026 · 1 min read

Tools · Notable

GitHub Next's ACE Positions Alignment as the New Coding Bottleneck

GitHub Next's ACE puts multiplayer microVM sessions at the centre of agent-driven coding — making team alignment, not implementation, the bottleneck.

April 27, 2026 · 2 min read

Research · Breaking

OpenAI Open-Sources Monitorability Evaluations for AI Research Community

OpenAI open-sources monitorability evals at alignment.openai.com, enabling researchers and developers to assess their models' transparency.

April 24, 2026 · 1 min read
