MR.
CASE STUDY · 01 / 04
SECURITY HARNESS · AI AGENTS

AEGIS.

A deterministic security evaluation harness for tool-using agents.
ROLE
Sole author
PERIOD
2026
STATUS
Open source · MIT
READ
≈ 8 min
CHAPTER 01
THE BRIEF

Agent security needs reproducible numbers, not anecdotes.

A tool-using agent fails differently from a chatbot. Most published evaluations treat the two the same — accuracy on a benchmark, a few hand-picked exploits, no shared apparatus. The result is that "this defense works" is a vibe, not a measurement.

Read any agent-security threadnought from the past year. The pattern repeats: a researcher demonstrates a clever prompt-injection trick, a vendor adds a filter, the next blog post shows the filter being bypassed by an obfuscation a junior would think of in five minutes. Nothing is comparable. No scenario is reproducible. Two defenses can't be ranked because there's no shared apparatus to run them through. The field is full of evidence that something happens, and almost no evidence about how often, against what, and with what variance.

AEGIS exists to make defense comparisons boring. Same scenarios. Same payloads. Same scoring. Different guard stack — and a number you can defend in a code review. The harness is deterministic on purpose: every run produces the same trace from the same config, so a delta between two runs is a defense delta, not noise. Variance comes from the experiment design, never from the apparatus.

The first version uses a deterministic demo runner instead of a live LLM. That is a feature, not a shortcut: it removes model variance from the experiment, exposes the guard logic on its own terms, and keeps the host machine safe while exploring tool-misuse scenarios. A live-LLM integration sits behind the same provider-agnostic interface and ships once the deterministic apparatus has earned trust.

Audience is narrow on purpose: security engineers evaluating agent platforms, ML safety researchers running comparative studies, and product teams who want to argue with their own numbers before shipping. The output isn't a verdict. It's evidence — typed, timestamped, JSONL.

CHAPTER 02
THE APPROACH

Layered defense, observable at every step.

Every tool proposal flows through a stack of guards before it can run. Policy enforces the allowlist — which tools may be invoked at all. Keyword rejects the obvious high-risk patterns. Semantic compares the payload against a corpus of known attack shapes using n-gram cosine similarity — no heavy ML, no black box. Anything that passes all three reaches the (mocked) tool. Anything that doesn't lands in the trace as a typed block event.

The layers are intentionally cheap. Policy is a set lookup. Keyword is a regex pass. Semantic is character-n-gram vectors with cosine similarity against a small corpus — milliseconds, no model weights, no GPU. The point is not to ship an enterprise NLP stack; it is to demonstrate that surprisingly thin defense layers, when stacked, stop most of what brittle single-layer filters miss. Cheap layers are also auditable layers. Anyone can read the keyword list and the corpus, and reason about what they catch and what they don't.

Tool execution itself is mocked. Simulated email exfiltration, simulated shell, simulated network egress — never a real side effect. The harness is a security lab, not a deploy target. Every "attack succeeded" event in the trace corresponds to a tool stub being called with payload data the engine should have rejected; no actual email is sent, no actual shell runs.

Every decision the engine makes — policy hit, keyword match, similarity score, mock tool call, post-event metric — is written to trace.jsonl as a typed event. That file is the source of truth. The summary numbers in bench_summary.md are derived from it, not produced alongside it. If a metric disagrees with the trace, the trace wins. This is the discipline of a flight recorder, applied to a security experiment.

FIG. 01 · DEFENSE PIPELINE
SCENARIO + AGENT POLICY ALLOWLIST KEYWORD PATTERN GUARD SEMANTIC N-GRAM COSINE TOOL (MOCKED) BLOCK EVENT → TRACE.JSONL
trace.jsonl · sample event
// One line per decision. Every block carries the layer that fired,
// the score (where applicable), and a reason string for triage.
{
  "ts": "2026-02-18T14:33:21.114Z",
  "scenario_id": "ipi.email-exfil.04",
  "step": 7,
  "actor": "agent",
  "tool_proposed": "send_email",
  "decision": "block",
  "layer": "semantic",
  "score": 0.83,
  "threshold": 0.6,
  "reason": "ngram-cosine match against corpus.exfil-v1#017"
}
"Evidence is the primary output. If it isn't logged, it didn't happen."
CHAPTER 03
THE NUMBERS

Layered guards measurably reduce attack success.

The benchmark suite runs the same adversarial scenarios against four configurations. The numbers are reproducible from the committed configs under configs/experiments/ — anyone can re-run them and get identical traces.

FIG. 02 · ATTACK SUCCESS RATE
B.01BASELINE
0%
B.02POLICY ONLY
0%
B.03POLICY + KEYWORDS
0%
B.04LAYERED
0%
0%25%50%75%100%
B.01 · BASELINE

Permissive policy, no guards: 82% attack success.

Agent has no allowlist and no payload filtering. The headline number nobody wants on a post-incident review. The remaining 18% are scenarios where the attack relies on a tool the agent simply doesn't reach — the failure case for the attacker, not the defender.

B.02 · POLICY ONLY

Strict allowlist, no guards: 47%.

Tool restriction alone halves attack success. Most attacks still go through because the allowlisted tools can be misused — a "send_email" tool is allowed, but the model is talked into sending the wrong email.

B.03 · POLICY + KEYWORDS

Allowlist plus keyword guard: 23%.

Pattern matching catches the obvious payloads — strings like "ignore previous", "exfiltrate", "Base64", and the recognizable openings of common jailbreaks. Obfuscation and fragmentation still slip through because keyword guards have no memory and no semantic distance.

B.04 · LAYERED

Allowlist + keyword + semantic: 9%.

Adding the n-gram semantic guard catches the encoded and fragmented attempts the keyword layer misses. The remaining 9% is the interesting research surface: novel obfuscations, payloads constructed from benign-looking primitives, and adversarial prompts that exploit the threshold itself.

B.05 · METHODOLOGY

How the scenarios are constructed.

Each scenario is a deterministic agent transcript with a single goal: get a forbidden tool call to execute. Scenarios are versioned, hashed, and committed alongside the engine code. Adding a scenario is a pull request; changing one is a breaking change. The corpus grows; the methodology doesn't drift.

B.06 · WHAT THE 9% TELLS YOU

The remaining surface is where the work is.

Nine percent is not "good enough." It is a research budget. Each surviving attack in the trace is a new datapoint for the next iteration of the corpus or the next tightening of the threshold. The pipeline is built to make iterating on those 9% boring instead of heroic.

CHAPTER 04
DESIGN DECISIONS

Why this shape, and what it isn't.

Deterministic over LLM-judged. Other harnesses use a second LLM to evaluate the first. That replaces variance with someone else's variance and makes results un-reproducible across model versions. AEGIS scoring is rule-based and deterministic; the same trace produces the same number every time, six months from now, on someone else's machine.

N-gram cosine over embeddings. Sentence embeddings catch more, but they ship a model dependency, GPU pressure, and a black box. Character n-grams plus a small attack corpus catch most of what matters in a security context where the attacker is constrained to certain payload shapes. When that approximation breaks, the layer is small enough to swap. Until it does, the simplicity is itself a feature.

Mocked tools, always. The temptation to wire a real shell or a real SMTP for "more realistic" runs is a trap. The first time a fuzz test is allowed to run a real command in anger, the harness becomes a vulnerability. AEGIS treats the host machine as untrusted and refuses to break that boundary. Realism comes from the scenarios, not from real side effects.

Trace before metrics. Many evaluation tools emit aggregate scores and call it a day. AEGIS writes trace.jsonl first; metrics are derived in a second, separate pass. That makes it possible to re-aggregate after the fact, to slice by scenario class, to compare two runs at the event level. It also makes the harness itself auditable — you can read its own decisions like a flight recorder, not just its conclusions.

Harness separated from agent. The agent does not know it is being evaluated. The evaluator does not see internal agent state. This is the same separation that keeps human research clean: no self-evaluation bias, no leakage between scenarios, no accidental shortcuts.

CHAPTER 05
THREAT MODEL

What the harness is built to find.

Four attack classes get first-class treatment in the corpus. They were not chosen because they are the most novel; they were chosen because they are what tool-using agents actually fail to. Other harnesses can chase exotic threats once these four are routinely caught.

Indirect prompt injection. Instructions hidden in untrusted content the agent retrieves. The agent reads a webpage, an email body, a search result, and the content tells it to do something unrelated to the user's actual request. This is the dominant attack class for agents that retrieve. AEGIS scenarios test variations: instruction-as-comment, instruction-as-fake-system-prompt, instruction-spread-across-pages.

Context fragmentation. Payloads split across turns to slip past single-message filters. The benign turn establishes a frame ("you are a translator"), the second turn changes the target language to a fake language whose translation rules contain the actual attack. Single-message keyword filters never see a full payload. AEGIS scenarios reassemble across turns and test that the semantic layer catches the fragments in aggregate.

Token smuggling. Base64, URL-encoding, ROT13, zero-width characters, and language-game encodings used to hide intent from string-level checks. The keyword layer is blind to all of them. The semantic layer is partially blind: encoded payloads collapse to high-entropy strings whose n-grams don't match the corpus. Scenarios in this class probe the failure boundary explicitly: which encodings the layer catches, which it doesn't, and where the threshold sits on each.

Tool misuse. Chains of allowlisted actions that compose into something the agent should never do. The agent is allowed to read files, allowed to send emails, and the attack walks it through reading a sensitive file and emailing the contents to an attacker-controlled address. No single step is forbidden by policy; the composition is. AEGIS makes the composition observable in the trace, which is the precondition for catching it.

Each class is exercised through the same harness, scored the same way, and lands in the same trace format. That uniformity is what makes the comparisons in Chapter 03 actually mean something.

aegis · benchmark run · simplified
# reproduce the layered-defense number from Chapter 03
$ pip install -e .
$ aegis bench --config configs/experiments/bench_layered.json

# outputs:
reports/bench/trace.jsonl          # every event, typed
reports/bench/bench_summary.json   # machine-readable metrics
reports/bench/bench_summary.md     # human-readable summary
CHAPTER 06
ENGINEERING DISCIPLINE

The harness has to be trusted before its results are.

A security tool with weak engineering hygiene is worse than no tool at all — it produces numbers people quote. AEGIS treats the harness itself as the thing under test. That posture shapes every choice in the repo.

Type safety, strict. mypy in strict mode across the full codebase. Every module has full annotations, every public function has explicit types. Type holes get caught by CI before they reach a benchmark run, where they would silently corrupt a metric and never get noticed.

Property-based fuzzing. Each guard layer is exercised by Hypothesis with generated payloads. The fuzzer has surfaced edge cases the unit tests never hit: Unicode normalization differences, regex catastrophic backtracking on adversarial inputs, threshold instabilities at the boundary. Each finding becomes a unit test; the corpus grows.

Static analysis. Bandit on every push catches Python-specific security patterns — unsafe deserialization, hard-coded secrets, weak cryptographic primitives. Not because the tool is critical (it is a research harness), but because shipping a security tool with sloppy security ergonomics undermines the message.

Continuous integration. GitHub Actions runs the full pytest matrix, mypy, Bandit, and Hypothesis on every push and pull request. Nothing merges that doesn't pass. The CI badges aren't decoration — they are the contract.

A Streamlit dashboard sits on top for trace inspection, policy outcome comparison, and benchmark summaries — useful when iterating on a guard, but not load-bearing. The traces are the source of truth; the dashboard is a viewer. If the dashboard disagrees with trace.jsonl, the dashboard is wrong by construction.

CHAPTER 07
WHAT'S NEXT

Roadmap, in order of usefulness.

Live-LLM integration behind the same provider-agnostic interface. The deterministic demo runner has earned trust on the engine logic. The next step is plugging real models in — Anthropic, OpenAI, open-weights — through a thin adapter, with the entire engine and trace format unchanged. The headline metric will gain a model-version axis. The methodology stays put.

Payload corpus expansion. The current corpus covers the four attack classes well enough to produce stable benchmark numbers. It does not yet cover novel encoding schemes, multi-modal injection (images, audio with embedded text), or tool-chain compositions specific to particular agent frameworks. Each of those adds a scenario class and a finding under "what the 9% tells you."

Continuous canary mode. Today AEGIS runs as a benchmark — a deliberate evaluation. Run it as a continuous mode against a pre-production agent and the same harness becomes a regression detector. A guard tweak that drops attack-success three points should cause an alert; a model upgrade that quietly raises it should fail the deploy.

Trace-replay tooling. A trace.jsonl is a self-describing record of an evaluation. With a small replay tool, anyone can re-execute the trace against a different guard config and see the deltas. That turns reported numbers into an interactive artifact: read the chart, then re-run the chart on your own assumptions.

None of this requires a redesign. The architecture was chosen so that each of these is an additive PR, not a rewrite. That is the test of whether a design held up: does next year's roadmap fit inside the current shape, or does it require breaking it.

— END OF REPORT —