How to Threat-Model Your Agent
Your threat model is simple: the agent can go rogue.
Any agent that reads untrusted data can be prompt-injected. Once injected, assume it will use every tool and permission it has to serve the attacker's goals. Your job is to make sure that even a fully compromised agent can't cause catastrophic damage.
Step 1: Map Your Trifecta
List every component of the lethal trifecta for your system:
| Factor | Your System | Can You Remove It? |
|---|---|---|
| Access to Private Data | List every source of private data the agent can read | Can any be removed from context? Replaced with references? Scoped? |
| Exposure to Untrusted Content | List every source of external/attacker-controlled data | Can any be eliminated? Curated? Sandboxed? |
| Ability to Exfiltrate | List every way the agent can communicate externally | Can any be removed? Made read-only? Require approval? Block outbound? |
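Removing any one factor dramatically reduces risk, because an exfiltration attack needs all three: untrusted content to deliver the instructions, private data worth stealing, and a channel to send it out.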
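One way to keep this inventory honest is to record it as data and check it mechanically. The following is a minimal sketch, not a prescribed format: the `TrifectaMap` class and its field names are hypothetical, and the component strings are taken from the email assistant example.

```python
from dataclasses import dataclass, field


@dataclass
class TrifectaMap:
    """Inventory of the three lethal-trifecta factors for one agent."""
    private_data: list[str] = field(default_factory=list)
    untrusted_content: list[str] = field(default_factory=list)
    exfiltration: list[str] = field(default_factory=list)

    def present_factors(self) -> list[str]:
        """Names of the factors that have at least one component."""
        factors = {
            "private_data": self.private_data,
            "untrusted_content": self.untrusted_content,
            "exfiltration": self.exfiltration,
        }
        return [name for name, items in factors.items() if items]

    def has_full_trifecta(self) -> bool:
        return len(self.present_factors()) == 3


email_assistant = TrifectaMap(
    private_data=["contact list", "email history", "OAuth tokens"],
    untrusted_content=["incoming emails"],
    exfiltration=["send_email", "forward_email"],
)
print(email_assistant.has_full_trifecta())  # True

# Removing every exfiltration channel breaks the trifecta.
email_assistant.exfiltration.clear()
print(email_assistant.has_full_trifecta())  # False
```

A check like this only reflects what you wrote down, so it is as good as the inventory itself; a forgotten tool will not show up.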
Example: Email Assistant
| Factor | Components | Mitigation |
|---|---|---|
| Access to Private Data | Contact list, email history, OAuth tokens | Only expose contacts for the current thread. Use scoped tokens |
| Exposure to Untrusted Content | Incoming emails (body, subject, attachments) | Process in quarantined LLM with no tool access |
| Ability to Exfiltrate | send_email, forward_email | Remove forward_email. Require approval for send_email |
Example: Coding Assistant
| Factor | Components | Mitigation |
|---|---|---|
| Access to Private Data | Env vars, API keys, .env files, git credentials | Don't inject secrets. Use project-scoped tokens only |
| Exposure to Untrusted Content | Code files, dependencies, README, PRs, issues | Run in container. Don't mount ~/.ssh, ~/.aws |
| Ability to Exfiltrate | bash (curl, network access), git push, execute_code | Block outbound network. Require approval for git push. Sandbox execution |
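The "Don't inject secrets" mitigation can be sketched as an environment allowlist applied before the agent's process starts. This is an illustration under assumptions: the `ALLOWED_ENV` set and the `scrubbed_env` helper are hypothetical names, and the right allowlist depends on your toolchain.

```python
# Hypothetical allowlist: only variables the task genuinely needs.
ALLOWED_ENV = {"PATH", "HOME", "LANG"}


def scrubbed_env(env: dict[str, str], allowed: set[str] = ALLOWED_ENV) -> dict[str, str]:
    """Return a copy of env containing only allowlisted variables."""
    return {key: value for key, value in env.items() if key in allowed}


host_env = {
    "PATH": "/usr/bin",
    "HOME": "/home/agent",
    "AWS_SECRET_ACCESS_KEY": "not-for-the-agent",
}
print(scrubbed_env(host_env))  # {'PATH': '/usr/bin', 'HOME': '/home/agent'}
```

The result can be passed as the `env` argument of `subprocess.run` when launching the agent's tools. An allowlist is used rather than a denylist because a denylist silently passes through any secret whose name you did not anticipate.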
Step 2: Draw Your Trust Boundaries
For every data flow, ask: where does untrusted data enter, and where do privileged actions happen?
flowchart LR
subgraph UNTRUSTED["Untrusted Zone"]
E[Emails]
W[Web pages]
R[RAG documents]
U[User uploads]
end
subgraph BOUNDARY["Trust Boundary"]
V[Validation / Approval]
end
subgraph PRIVILEGED["Privileged Zone"]
S["send_email()"]
WF["write_file()"]
EX["execute_code()"]
API["api_call()"]
end
E & W & R & U --> V
V --> S & WF & EX & API
The question is: what sits at the trust boundary?
| Approach | What's at the Boundary | Strength |
|---|---|---|
| Nothing (most agents today) | The LLM itself decides | ❌ Weakest — LLM is the vulnerability |
| Prompt engineering | System prompt instructions | ⚠️ Weak — bypassed by injection |
| Infra isolation | Container walls, network rules, filesystem mounts | ✅ Strong — deterministic |
| Software architecture | Separate LLMs, typed extraction, dry-run eval | ✅ Strong — architectural |
| Both infra + software | Defense in depth | ✅✅ Strongest |
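To make "what sits at the boundary" concrete, here is a minimal sketch of a deterministic gate placed between the LLM's proposed action and its execution. The tool sets, the `ProposedAction` type, and the `approve` callback are all assumptions for illustration; the point is that the decision is made by code, not by the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    tool: str
    args: dict


# Hypothetical policy: which tools run freely and which need approval.
AUTO_ALLOWED = {"read_file"}
NEEDS_APPROVAL = {"send_email", "write_file"}


def gate(action: ProposedAction, approve: Callable[[ProposedAction], bool]) -> bool:
    """Return True only if the action may cross the trust boundary."""
    if action.tool in AUTO_ALLOWED:
        return True
    if action.tool in NEEDS_APPROVAL:
        return approve(action)  # human or evaluator decides
    return False  # default deny: unlisted tools never run


def deny_all(action: ProposedAction) -> bool:
    return False


print(gate(ProposedAction("read_file", {"path": "README.md"}), deny_all))  # True
print(gate(ProposedAction("send_email", {"to": "x@example.com"}), deny_all))  # False
print(gate(ProposedAction("execute_code", {"src": "..."}), deny_all))  # False
```

Default deny matters here: a tool that was added later but never classified is blocked rather than silently allowed.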
Step 3: Define Your Blast Radius
Ask: if this agent is fully compromised right now, what's the worst that can happen?
| Blast Radius | What the Agent Has | Acceptable? |
|---|---|---|
| Agent sends 1 email to wrong person | Scoped token, approval required | Usually yes |
| Agent exfiltrates all contacts | Full contact access, outbound network | Usually no |
| Agent pushes malicious code to prod | Git credentials, CI/CD access | Never |
| Agent deletes database | DB write credentials in env | Never |
If the blast radius is unacceptable, you need deterministic controls (isolation, scoped tokens, schema validation), not better prompts.
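The blast-radius question can also be asked by code at startup, before the agent receives any input. The sketch below assumes a hypothetical naming scheme for capabilities and a hand-maintained map of which ones lead to unacceptable outcomes; it is one possible shape for such a check, not an established pattern from this guide.

```python
# Hypothetical capability names; the outcomes mirror the table above.
UNACCEPTABLE = {
    "git_push_prod": "pushes malicious code to prod",
    "db_write": "deletes database",
    "contacts_read_all": "exfiltrates all contacts",
}


def blast_radius_violations(granted: set[str]) -> list[str]:
    """Worst-case outcomes enabled by the granted capabilities, sorted."""
    return sorted(UNACCEPTABLE[cap] for cap in granted if cap in UNACCEPTABLE)


def assert_acceptable(granted: set[str]) -> None:
    """Refuse to start the agent if an unacceptable outcome is reachable."""
    violations = blast_radius_violations(granted)
    if violations:
        raise PermissionError("unacceptable blast radius: " + "; ".join(violations))


print(blast_radius_violations({"send_email_with_approval"}))  # []
print(blast_radius_violations({"db_write", "git_push_prod"}))
# ['deletes database', 'pushes malicious code to prod']
```

Calling `assert_acceptable` in the launcher turns the table into a deployment precondition: the agent does not start while it still holds a capability you marked as unacceptable.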
Step 4: Choose Your Controls
Work through this checklist for your system:
Infrastructure (do first; works on any agent)
- Agent runs in a container/VM, not on your host
- Filesystem: only necessary directories mounted, read-only where possible
- Network: outbound restricted to necessary endpoints only
- Secrets: no credentials beyond what the task requires
- Tokens: scoped, short-lived, task-specific
- Rate limits: cap on tool calls per session
- Timeout: maximum execution time
- Kill switch: infrastructure-level termination (not prompt-level)
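The rate-limit and timeout items above can be sketched as a session budget charged before every tool call. This is an illustration only: `SessionBudget` and `BudgetExceeded` are hypothetical names, and the sketch assumes the check runs in the orchestrator or tool proxy, outside anything the model can influence. A real kill switch should still exist at the infrastructure level (for example, terminating the container).

```python
import time


class BudgetExceeded(Exception):
    """Raised when the session has used up its call or time budget."""


class SessionBudget:
    def __init__(self, max_calls: int, max_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.calls = 0

    def charge(self) -> None:
        """Call before every tool invocation; raises once a limit is hit."""
        if self._clock() - self._start > self.max_seconds:
            raise BudgetExceeded("session timeout")
        if self.calls >= self.max_calls:
            raise BudgetExceeded("tool-call limit reached")
        self.calls += 1


budget = SessionBudget(max_calls=2, max_seconds=60)
budget.charge()
budget.charge()
try:
    budget.charge()
except BudgetExceeded as exc:
    print(exc)  # tool-call limit reached
```

The `clock` parameter is injectable so the timeout path can be tested without waiting.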
Detection (layer on top)
- Input scanning for known injection patterns
- Canary tokens to detect data exfiltration
- Behavioral monitoring for anomalous tool usage
- Logging of all tool calls and LLM interactions
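As an illustration of the canary-token item, the sketch below plants a random marker in private data and scans outbound payloads for it. The helper names are hypothetical, and the check is deliberately naive: a plain substring match will miss a canary that has been encoded or split, so treat it as a tripwire rather than a guarantee.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Generate a unique marker to plant in private data."""
    return f"{prefix}-{secrets.token_hex(8)}"


def leaked_canaries(outbound_payload: str, canaries: set[str]) -> set[str]:
    """Return every planted canary that appears verbatim in the payload."""
    return {canary for canary in canaries if canary in outbound_payload}


canary = make_canary()
planted = {canary}

print(leaked_canaries("Quarterly report attached.", planted) == set())  # True
print(leaked_canaries(f"notes: {canary}", planted) == planted)  # True
```

A non-empty result is a strong signal that private data is leaving the system and is a reasonable trigger for the infrastructure-level kill switch.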
Software (if you control the code)
- Untrusted data processed by a separate LLM with no tool access
- Structured extraction with constrained schemas
- Outbound actions require approval (human or evaluator)
- Deterministic validation rules for high-risk actions
- Tool schemas validated against manifests
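The "structured extraction with constrained schemas" item can be sketched as a strict parser for whatever the quarantined LLM returns. Everything specific here is an assumption for illustration: the `EmailSummary` fields, the `ALLOWED_INTENTS` set, and the JSON shape. The idea is that only a small set of typed, bounded values ever reaches the privileged side, so free-form injected text has nowhere to travel.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical closed vocabulary for the extracted intent.
ALLOWED_INTENTS = {"meeting_request", "invoice", "newsletter", "other"}
EXPECTED_FIELDS = {"intent", "sender_domain", "needs_reply"}


@dataclass(frozen=True)
class EmailSummary:
    intent: str
    sender_domain: str
    needs_reply: bool


def parse_quarantined_output(raw: str) -> EmailSummary:
    """Validate the quarantined model's JSON; raise ValueError on any deviation."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise ValueError("unexpected fields")
    intent = data["intent"]
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError("intent not in allowed set")
    domain = data["sender_domain"]
    if (not isinstance(domain, str) or len(domain) > 253
            or not re.fullmatch(r"[a-z0-9.-]+", domain)):
        raise ValueError("invalid sender_domain")
    if not isinstance(data["needs_reply"], bool):
        raise ValueError("needs_reply must be a boolean")
    return EmailSummary(intent, domain, data["needs_reply"])


good = '{"intent": "invoice", "sender_domain": "example.com", "needs_reply": true}'
print(parse_quarantined_output(good))
```

Rejecting unknown fields and out-of-vocabulary values, instead of ignoring them, keeps the schema from becoming a side channel for attacker-chosen text.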
Step 5: Threat-Model by Agent Type
Different agents have different risk profiles:
Low Risk: Read-Only Agents
- Summarizers, search, Q&A over internal docs
- Trifecta status: No tool access (or read-only) → low risk
- Main risk: Data leakage through generated output
- Controls: Output filtering, context scoping
Medium Risk: Internal-Only Agents
- Code assistants (no deploy), internal workflow automation
- Trifecta status: Private data access + exfiltration ability, but no untrusted content → medium risk
- Main risk: User prompt injection, credential misuse
- Controls: Least privilege, scoped tokens, review before commit
High Risk: External-Facing Agents
- Email assistants, web browsing agents, customer support bots
- Trifecta status: Full trifecta → high risk
- Main risk: Indirect prompt injection → unauthorized actions
- Controls: Full isolation + software architecture + detection
Critical Risk: Autonomous Agents
- Agents that loop, plan, and execute without human oversight
- Trifecta status: Full trifecta + no human in loop → critical
- Main risk: Cascading compromise across multiple actions
- Controls: Everything above + mandatory approval gates + time-bound sessions
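The four tiers above can be condensed into a small classifier over the trifecta factors plus human oversight. This is a rough sketch with a hypothetical function name; it encodes only the combinations this section describes and returns "unclassified" for the rest, which you would need to assess case by case.

```python
def risk_tier(private_data: bool, untrusted_content: bool,
              can_exfiltrate: bool, human_in_loop: bool) -> str:
    """Map trifecta factors to the risk tiers described in this section."""
    full_trifecta = private_data and untrusted_content and can_exfiltrate
    if full_trifecta:
        return "high" if human_in_loop else "critical"
    if not can_exfiltrate:
        return "low"  # read-only agents: no outbound channel
    if private_data and not untrusted_content:
        return "medium"  # internal-only agents
    return "unclassified"  # combination not covered by the tiers above


print(risk_tier(True, True, False, True))   # low
print(risk_tier(True, False, True, True))   # medium
print(risk_tier(True, True, True, True))    # high
print(risk_tier(True, True, True, False))   # critical
```

The tier is a starting point for choosing controls, not a substitute for the blast-radius analysis in Step 3.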
The One-Line Threat Model
My agent can go rogue. What's the worst it can do? Make that impossible.
Not unlikely. Not difficult. Impossible — enforced by infrastructure, not by prompts.
References
- Principles — Core axioms and the Read → Propose → Approve → Execute pattern
- Attack Taxonomy — Catalogue of attack vectors and risk matrix
- Isolation guide — Infra-level isolation patterns
- Secure Architecture guide — Software-level defenses