AI Safety and Regulation
August 27, 2025

Findings from a pilot Anthropic OpenAI alignment evaluation exercise

OpenAI and Anthropic conducted a groundbreaking cross-company safety exercise. Each tested the other’s public models under misalignment scenarios. The evaluation revealed persistent vulnerabilities in both, highlighting areas for improved safeguards.

This summer, OpenAI and Anthropic collaborated on the first-ever cross-company safety evaluation, testing each other’s publicly released models, including Claude Opus 4, Claude Sonnet 4 (Anthropic) and GPT-4o, GPT-4.1, o3, and o4-mini (OpenAI), by running internal misalignment and misuse assessments.

The tests explored key behaviors such as sycophancy, misuse potential, hallucinations, and resistance to instruction. Both parties discovered safety gaps: although reasoning models (like o3 and Claude’s reasoning-capable models) generally exhibited stronger alignment, general-purpose models like GPT-4.1 proved more vulnerable to misuse.

These early findings emphasize the need for continued collaboration and rigorous testing in AI safety.

#
OpenAI
#
Anthropic

Read Our Content

See All Blogs
Gen AI

The GenAI Divide Report is a Trojan Horse for MIT NANDA

Rishabh Sood

October 14, 2025
Read more
Gen AI

Measuring Generative AI ROI

Cricka Reddy Aileni

October 7, 2025
Read more