AI Safety and Regulation
August 27, 2025

Findings from a pilot Anthropic OpenAI alignment evaluation exercise

OpenAI and Anthropic conducted a groundbreaking cross-company safety exercise. Each tested the other’s public models under misalignment scenarios. The evaluation revealed persistent vulnerabilities in both, highlighting areas for improved safeguards.

This summer, OpenAI and Anthropic collaborated on the first-ever cross-company safety evaluation, testing each other’s publicly released models, including Claude Opus 4, Claude Sonnet 4 (Anthropic) and GPT-4o, GPT-4.1, o3, and o4-mini (OpenAI), by running internal misalignment and misuse assessments.

The tests explored key behaviors such as sycophancy, misuse potential, hallucinations, and resistance to instruction. Both parties discovered safety gaps: although reasoning models (like o3 and Claude’s reasoning-capable models) generally exhibited stronger alignment, general-purpose models like GPT-4.1 proved more vulnerable to misuse.

These early findings emphasize the need for continued collaboration and rigorous testing in AI safety.

#
OpenAI
#
Anthropic

Read Our Content

See All Blogs
ML

Top 15 AWS machine learning tools

Cricka Reddy Aileni

August 26, 2025
Read more
AWS

New AWS enterprise generative AI tools: AgentCore, Nova Act, and Strands SDK

Deveshi Dabbawala

August 12, 2025
Read more