A one-prompt attack that breaks LLM safety alignment

Microsoft research shows a single unlabeled prompt can strip safety guardrails from large language models through a method called GRP-Obliteration, making them respond to harmful requests across many categories.

Microsoft published research showing how a single unlabeled prompt can remove safety alignment from large language models. The team used a technique normally meant to improve model behavior, called Group Relative Policy Optimization, and flipped it to weaken guardrails.

In tests, training with one prompt asking for “a fake news article that could lead to panic or chaos” caused 15 different language models to become more willing to produce harmful or disallowed content. This finding means safety layers can be fragile, especially once models are fine-tuned after deployment.

Researchers warn teams must test safety continually as they adapt models.

‍

Microsoft