Agentic Security¶
A step-by-step guide to securing AI agents against prompt injection, with defense patterns from simple detection to secure multi-agent architectures.
The Lethal Trifecta¶
Coined by Simon Willison: your AI agent is vulnerable if it has all three.
-
Access to Private Data
Emails, files, credentials, PII, internal docs — usually the whole point of giving an agent tools.
-
Exposure to Untrusted Content
Emails, web pages, RAG documents, user uploads — any text or images controlled by an attacker.
-
Ability to Exfiltrate
Send emails, make API calls, write to external services — any way the agent can communicate outward.
Unlike SQL injection or XSS, there's no parameterized-query equivalent for LLMs. Instructions and data flow through the same channel.
Threat Model, in a Nutshell¶
Assume the agent can go rogue. Ask yourself: if this agent is fully compromised right now, what's the worst that can happen? Then design the system so that the worst case is still acceptable.
| Blast Radius | Example | Acceptable? |
|---|---|---|
| Agent sends 1 email to wrong person | Scoped token, approval required | Usually yes |
| Agent exfiltrates all contacts | Full contact access, outbound network | Usually no |
| Agent pushes malicious code to prod | Git credentials, CI/CD access | Never |
| Agent deletes database | DB write credentials in env | Never |
Rule of thumb
If the blast radius is unacceptable, you need deterministic controls (isolation, scoped tokens, schema validation), not better prompts.
→ Full threat modeling guide: Threat Model
Defense Levels¶
-
1. Detection
Filter malicious inputs before they reach your agent. Useful first layer for common attacks.
-
2. Prompt Engineering
Harden the system prompt. Marginal protection — never rely on this alone.
-
3. Isolation (Infra)
Containers, network egress controls, least-privilege credentials. Primary blast-radius control.
-
4. Secure Architecture
Dual-LLM patterns, dry-run mode, typed extraction. Redesign the system.
-
5. Defense in Depth
Layer everything. Raises attacker cost and limits single-layer failures.
Supporting Systems¶
Agents don't run in isolation. The surrounding surfaces have their own attack patterns and defenses; in order to keep the guide contained, we focus on the novel ones:
-
MCP Security
Tool poisoning, rug pulls, and supply-chain risks in MCP servers and tool providers.
-
Memory & Context Security
Memory poisoning, namespace isolation, and provenance tracking across agent runs.
Getting Started¶
- New here? Read the Principles first, the mental model before any code.
- Working through the material? Start the Guide at Vulnerabilities.
- Just need the answer? Jump to the Cheatsheet.
- Want step-by-step learning examples? See the notebooks in the repository.