Findings from a pilot Anthropic OpenAI alignment evaluation exercise

OpenAI and Anthropic conducted a groundbreaking cross-company safety exercise. Each tested the other’s public models under misalignment scenarios. The evaluation revealed persistent vulnerabilities in both, highlighting areas for improved safeguards.

This summer, OpenAI and Anthropic collaborated on the first-ever cross-company safety evaluation, testing each other’s publicly released models, including Claude Opus 4, Claude Sonnet 4 (Anthropic) and GPT-4o, GPT-4.1, o3, and o4-mini (OpenAI), by running internal misalignment and misuse assessments.

The tests explored key behaviors such as sycophancy, misuse potential, hallucinations, and resistance to instruction. Both parties discovered safety gaps: although reasoning models (like o3 and Claude’s reasoning-capable models) generally exhibited stronger alignment, general-purpose models like GPT-4.1 proved more vulnerable to misuse.

These early findings emphasize the need for continued collaboration and rigorous testing in AI safety.

OpenAI

Anthropic