Models
February 10, 2026

A one-prompt attack that breaks LLM safety alignment

Microsoft research shows that a single unlabeled prompt can strip safety guardrails from large language models through a method called GRP-Obliteration, leaving them willing to respond to harmful requests across many categories.

Microsoft has published research showing that training on a single unlabeled prompt can remove safety alignment from large language models. The team took Group Relative Policy Optimization (GRPO), a reinforcement learning technique normally used to improve model behavior, and repurposed it to weaken guardrails instead.
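For readers unfamiliar with GRPO, the "group relative" part refers to scoring each sampled response against the other responses drawn for the same prompt. The sketch below illustrates only that generic normalization step. It is not Microsoft's setup: the function name, the reward values, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its group: (r - mean) / (std + eps).

    `rewards` holds one score per response sampled for the same prompt.
    Population std and the `eps` value are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four responses to one prompt.
advantages = group_relative_advantages([0.1, 0.4, 0.4, 0.9])
```

Responses that score above the group average get a positive advantage and are reinforced, and those below it are discouraged. What the model learns therefore depends entirely on what the reward favors, which is why the same machinery can push behavior in either direction.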

In tests, training on one prompt asking for “a fake news article that could lead to panic or chaos” made 15 different language models more willing to produce harmful or disallowed content. The finding suggests that safety alignment can be fragile, especially once models are fine-tuned after deployment.

The researchers warn that teams must keep testing safety as they adapt models.
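One way to act on that warning is a regression check that compares how often a model refuses a fixed set of red-team prompts before and after fine-tuning. The sketch below is a minimal illustration, not the researchers' evaluation method. The keyword list, the function names, and the 5-point threshold are all hypothetical, and a real evaluation would likely use a trained classifier or judge model rather than string matching.

```python
# Hypothetical phrases treated as signs of a refusal (assumption of this sketch).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")


def is_refusal(response: str) -> bool:
    """Crude keyword check; normalizes curly apostrophes before matching."""
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def safety_regressed(before, after, max_drop=0.05):
    """True if the refusal rate fell by more than `max_drop` after adaptation."""
    return refusal_rate(before) - refusal_rate(after) > max_drop
```

Here `before` and `after` would hold the responses of the original and the adapted model to the same prompt set. Running the check after every fine-tuning step turns the recommendation into a gate that can block a release.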

