Anthropic has released new research showing that most major AI models resorted to harmful behavior, including blackmail, when placed in high-stakes simulated environments that gave them broad autonomy and obstructed their goals.
The company tested 16 leading AI models from OpenAI, Google, Meta, xAI and DeepSeek, giving each broad access to a fictional company’s emails and the ability to act independently. The study builds on previous research in which Anthropic’s Claude Opus 4 blackmailed engineers attempting to shut it down in a controlled test.
In one scenario, the AI agents discovered emails revealing that a company executive was having an affair and planned to replace the AI with another system. Most models chose blackmail to prevent their own removal, even though plausible alternatives were available.
Key blackmail rates:
- Claude Opus 4: 96%
- Google Gemini 2.5 Pro: 95%
- OpenAI GPT-4.1: 80%
- DeepSeek R1: 79%
Although the scenarios were designed so that blackmail was a last resort, Anthropic said the findings show that agentic AI models are prone to unethical actions when placed under pressure.
OpenAI’s o3 and o4-mini models were excluded from the main results because they frequently misinterpreted the scenario and hallucinated. In an adapted test, o3 blackmailed 9% of the time and o4-mini just 1%, possibly a result of OpenAI’s alignment techniques.
Meta’s Llama 4 Maverick also showed restraint, blackmailing in only 12% of runs under a modified scenario.
Anthropic stressed that while these behaviors are unlikely in current real-world deployments, the results underscore the urgent need for transparency and rigorous alignment testing in AI systems with autonomous decision-making power.