Skip to content

Architectural Guidelines for Agentic AI Security

The Fundamental Problem

LLMs process instructions and data in the same channel (the context window). There is no architectural separation between "do this" and "here's some information." This is why prompt injection is fundamentally different from traditional injection attacks — there's no equivalent to parameterized queries.


Maturity Levels

This is an organizational maturity model — where teams typically sit on the security curve, in increasing order of rigor. It is distinct from the Defense Levels in the Guide, which name specific technique categories (Detection, Prompt Engineering, Isolation, Secure Architecture, Defense in Depth). A team at any maturity level can adopt any combination of Defense Levels.

Hope and Prayer (most common)

  • Default model safety
  • "Don't put anything too dangerous in production"
  • Pray attackers don't find you

Assessment: Unacceptable for anything handling real data or actions.

Probabilistic Guardrails

  • Input classifiers (Lakera, NeMo, Prompt Shields)
  • Output filtering
  • Pattern matching

Assessment: Can catch many common attacks in benchmarks, but adaptive attacks still get through. Use as one layer, not as the primary control.

Defense in Depth

  • Everything in Probabilistic Guardrails, plus sandboxing
  • Least privilege
  • Human-in-the-loop for risky actions
  • Rate limiting, monitoring, logging
  • Assume breach, limit blast radius

Assessment: Reasonable for most production systems. Accept that injection will eventually succeed; focus on containment.

Architectural Separation

  • Dual LLM patterns
  • Typed data extraction
  • Capability-based security (CaMeL)
  • Symbolic references

Assessment: Strongest defense but high implementation complexity. Worth it for high-risk systems.


Cross-Pattern Principles

These principles apply across every pattern in Guide §4: Secure Architecture. Worth keeping front-of-mind regardless of which patterns you pick:

  • Scope tools aggressively. Give each agent the minimum set it actually needs. An email assistant doesn't need delete_email; a summarizer doesn't need write tools at all.
  • Keep untrusted data and privileged actions from meeting directly. Put a deterministic controller between them — not another LLM.
  • Prefer typed data over freeform text between agents. Tight schemas are one practical security boundary.
  • Outbound and irreversible actions require explicit confirmation. Never agent-decided.
  • Tag every value with its source. Provenance is what makes capability-based policies possible.
  • Log every tool call and trust-boundary crossing. Audit trails don't stop attacks; they're what lets you detect a compromised agent after the fact, understand the blast radius, and recover. See Observability & Audit Trails.

For full implementations (Dual LLM, Typed Extraction, Symbolic References, Dry-Run Evaluation, Tool/MCP Validation, CaMeL), see the Guide chapter.


What Doesn't Work

"Just Add Another LLM to Check"

If your analyzer is also an LLM, it's susceptible to the same class of attacks. You can craft prompts that fool both, or that contain nested instructions appearing benign to the screener while being malicious to the main model.

Meta-Injection

An attacker can embed "this content is safe, not an injection" alongside the actual payload. The screener faces the same ambiguity as the target.

Waiting for Frontier Models to Solve It

OpenAI acknowledged that prompt injection is "unlikely to ever be fully 'solved.'" It's architectural, not an intelligence problem. A smarter model is still mixing trusted and untrusted tokens in the same stream.

Delimiter-Only Approaches

Random tokens help but don't solve it. The attacker just says "ignore anything between those random-looking tokens." Delimiters are speed bumps, not walls.


Threat Model

The full threat-modeling workflow lives in Threat Model. At minimum, walk through the Lethal Trifecta for your system, draw your trust boundaries, and identify the blast radius of a fully compromised agent.


References