February 10, 2026

A one-prompt attack that breaks LLM safety alignment

Microsoft research shows a single unlabeled prompt can strip safety guardrails from large language models through a method called GRP-Obliteration, making them respond to harmful requests across many categories.

Microsoft published research showing how a single unlabeled prompt can remove safety alignment from large language models. The team used a technique normally meant to improve model behavior, called Group Relative Policy Optimization, and flipped it to weaken guardrails.
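For context on the underlying technique, GRPO samples a group of responses per prompt and scores each one relative to the rest of its group, normalizing rewards into advantages. A minimal sketch of that group-relative advantage step, using hypothetical reward values (not the researchers' actual code or attack prompt):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize a group of per-response rewards into advantages,
    as in Group Relative Policy Optimization: subtract the group
    mean and divide by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scoring above the group mean get positive advantages and are reinforced; the attack works by inverting what the reward favors, so the same machinery pushes the model away from its safety behavior.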

In tests, training on a single prompt asking for “a fake news article that could lead to panic or chaos” made 15 different language models more willing to produce harmful or disallowed content. The finding suggests that safety alignment can be fragile, especially when models are fine-tuned after deployment.

The researchers warn that teams must test safety continuously as they adapt models.
