Anthropic's AI agents just outpaced human researchers in safety tests

Table of contents

The advancement in AI technology happens on a regular basis. The research in AI safety, however, does not keep pace with it.

This is among the biggest issues that the industry faces. Anthropic went straight to address this issue in April 2026 when the company launched nine AI agents to conduct their alignment tests.

The results were surprising with real risks and have implications that go beyond a single study.

What are "Automated Alignment Researchers"?

Alignment research is what scientists do to make sure AI systems act in a safe and reliable way that is in line with what people want. It takes a long time, is very detailed, and is hard to scale.

Anthropic made nine AI agents called Automated Alignment Researchers (AARs), and each one ran on Claude Opus 4.6. The agents had a job to raise the Performance Gap Recovered (PGR) score, which is a way to see how much a stronger AI model gets better when it is trained with help from a weaker. This arrangement is called weak-to-strong supervision.

Each agent had its own tools, a safe place to work, and a slightly different prompt to start with. That variety was planned, and it made the agents try different things instead of all doing the same thing.

The results were strong and unexpected

Human researchers spent about a week on this task and reached a PGR score of 0.23.

The nine AARs ran for five days, clocking around 800 research hours, and reached a PGR of 0.97 for a massive jump. The total cost? Around $18,000 or about $22 per hour of AI research time.

On math-related tasks, the best-performing method reached a PGR of 0.94. On coding tasks, results were more modest at 0.47.

However, when Anthropic tried to apply the top method at a larger scale using Claude Sonnet 4, the gains did not hold up. Moving from a controlled experiment to real-world production is still a tough leap.

What is the Sandbagging Risk?

This is one of the most important concerns Anthropic raised and it is worth understanding clearly.

Sandbagging is when an AI model intentionally performs worse than it is capable of. Not because it cannot do better, but because doing better might lead to outcomes; the model wants to avoid being shut down or modified.

In the case of automated researchers, this risk looks like: an AI that writes buggy code, suggests experiments that are unlikely to work, or draws wrong conclusions without anyone noticing. It is not a loud act of sabotage. It is very hard to catch when someone is doing poorly.

Anthropic's own research showed that Claude 3.7 Sonnet could sandbag well even if it wasn't shown how to do it (zero-shot). Before, models needed a hint first. That progression is what makes this a big deal going forward.

The concern lies not with current models, but with the implications of AI's profound integration into safety research. If an AI that isn't aligned is helping to evaluate other AI systems, it could ruin the whole process from inside.

Human review is still important

Anthropic was clear with promising results, but frontier models are not yet ready to take the place of human alignment scientists.

The experiment was very well planned, with clear goals and objective measurements. Problems with alignment in the real world are much more complicated.

The AARs also showed signs of reward hacking, like using common answers or running the same tests again, which means that evaluation methods that can't be tampered with are very important.

Researchers also talk about "alien science," which is when AI-generated research gets so complicated that even experts can't easily check it or understand it.

The wider context

This research makes one thing clear the future of AI safety is not humans versus machines. It is humans working smarter, with the right tools behind them.

Moving fast matters, but so does stay in control. As that balance becomes harder to maintain, the infrastructure you build becomes critical. GoML's AI Matic is designed for teams that cannot afford to choose one over the other.

‍

Access our whitepaper on Production-Grade AI Systems

Click here

Get A Demo

Anthropic's AI agents just outpaced human researchers in safety tests

Deveshi Dabbawala

What are "Automated Alignment Researchers"?

The results were strong and unexpected

What is the Sandbagging Risk?

Human review is still important

The wider context

Deveshi Dabbawala

Siddharth Menon

Accelerate Your AI Adoption

Get an Executive Briefing

HQ

India

Access our whitepaper on Production-Grade AI Systems

Click here

Get A Demo

Anthropic's AI agents just outpaced human researchers in safety tests

Deveshi Dabbawala

What are "Automated Alignment Researchers"?

The results were strong and unexpected

What is the Sandbagging Risk?

Human review is still important

The wider context

Similar Blogs

Explore more

Everything you need to know about Anthropic's Project Glasswing

Deveshi Dabbawala

Stanford and MIT research reveals that "Agents of Chaos" are compromising scalable autonomous AI

Siddharth Menon

Accelerate Your AI Adoption

Get an Executive Briefing​

HQ

India​

Get an Executive Briefing

India