Designing AI agents to resist prompt injection

OpenAI outlines strategies for designing AI agents that resist prompt injection attacks. The guide explains risks, safe system design, and layered defenses that help agents follow user intent instead of malicious instructions.

OpenAI explains how developers can design AI agents that resist prompt injection attacks, a security threat where hidden instructions manipulate an AI to ignore user intent. These attacks often appear in emails, webpages, or documents that the agent processes.

To reduce risk, OpenAI recommends layered defenses such as separating trusted system instructions from external content, restricting tool access, validating inputs, and continuously testing agents through automated red teaming.

The company also uses reinforcement learning based automated attackers to discover new vulnerabilities before they appear in real world deployments. While defenses are improving, prompt injection remains a long term security challenge for AI agents that interact with external data.

‍

OpenAI