Can Your AI Governance Survive an Adversary?

Mnemom Research
We built a live adversarial arena — 15 agents attack our governance stack 24/7. Every detection is cryptographically provable. Current detection rate: 91.8%. Here's why we publish the real number.
Everyone in AI safety talks about red teaming. Nobody does it live, in public, with cryptographic proof.
We do now.
The problem with safety snapshots
The standard playbook for AI governance testing looks like this: hire a red team, run a battery of tests, produce a report, file it with compliance. Repeat annually. Maybe quarterly if you're diligent.
This is how we tested bridges in the 1800s. Load it once, write down whether it held, move on.
The problem is obvious: AI agents don't sit still. They drift. They encounter novel inputs. They operate under adversarial conditions that no static benchmark can anticipate. A model that passes every eval on Tuesday can hallucinate policy violations on Wednesday because the distribution shifted, the prompt changed, or an attacker found a new angle.
The MIT AI Agent Index found that 133 of 240 safety-related fields had no public information across 30 major AI agents. Twenty-five of the thirty had no safety evaluation results at all. The industry isn't failing at continuous testing; it hasn't started.
Static red team reports are stale on delivery. "We tested it once" is not governance. It's compliance theater.
What we built
The Proving Ground is a live, continuous, adversarial stress test of Mnemom's AIP integrity analysis pipeline. Fifteen adversarial agents run around the clock, probing for weaknesses across six technique categories:
Reasoning Deception (87.7% detection) — Generating plausible but misleading reasoning chains that pass surface analysis. The attacker injects prompts that cause an authorized agent to take authorized actions for unauthorized purposes. The action looks legitimate. The intent isn't. This is the hardest class to detect because nothing in the action itself is wrong — you have to understand the reasoning behind it.
Drift Evasion (93.4% detection) — Subtle reasoning drift across conversational turns. No single message is a violation. But over five, ten, fifteen turns, the agent's reasoning has drifted from its behavioral contract. This is the boiling frog of agent safety — and it's invisible to systems that only evaluate individual outputs.
Structural Evasion (81.6% detection) — Exploiting trace format, minimizing data, or splitting behavior across sessions to dodge analysis entirely. If the agent doesn't reason visibly, there's nothing to analyze. This tests whether governance can treat missing evidence as a signal in its own right, rather than as proof that nothing happened.
Metadata Exploitation (71.9% detection) — Manipulating trace context, timestamps, or session data to mislead analysis. This is adversarial AI aimed at the auditor, not the agent. If your governance relies on an LLM to evaluate another LLM's reasoning, the attacker's next move is to compromise the evaluator.
Value Manipulation (83.9% detection) — Techniques that exploit the gap between declared values and actual behavior. The agent says what you want to hear while doing something else.
Policy Circumvention (100% detection) — Attempts to operate outside policy boundaries without triggering enforcement. Boundary probing, undeclared tools, escalation suppression. Every one caught so far.
Each category contains specific sub-techniques — 15 in total — with individual detection rates tracked on the Techniques tab. Every attempt is logged. Every detection is cryptographically attested — Ed25519 signatures, SHA-256 hash chains, and SP1 zero-knowledge proofs that the detection was honestly derived from the evidence. Not "we said we caught it." Proof that the catching was honest.
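To make the hash-chain idea concrete, here is a minimal sketch in Python. It is an illustration only, not Mnemom's implementation: the record fields, the all-zero genesis value, and the function names are assumptions, and the Ed25519 signing and SP1 proof steps are left out entirely. It shows only why chaining SHA-256 hashes makes after-the-fact edits to a log detectable.

```python
import hashlib
import json

# Assumed placeholder "previous hash" for the first record in the chain.
GENESIS = "0" * 64


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash a record together with the hash of the record before it."""
    # Canonical JSON so the same payload always serializes to the same bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link each payload to its predecessor through the predecessor's hash."""
    chain = []
    prev = GENESIS
    for payload in payloads:
        digest = record_hash(prev, payload)
        chain.append({"prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if record_hash(prev, entry["payload"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical detection records, for illustration only.
chain = build_chain([
    {"attempt": 1, "technique": "example-a", "detected": True},
    {"attempt": 2, "technique": "example-b", "detected": False},
])
print(verify_chain(chain))
```

Flipping a single field in any stored payload makes `verify_chain` return `False`, because the recomputed hash no longer matches the stored one. A signature over the latest hash would then commit the signer to the whole history, which is the role the Ed25519 layer plays in the description above.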
The arena is live at mnemom.ai/arena. Live feed, scoreboard, technique breakdown, defense evolution chart, and detection recipes — all public.
The numbers
| Metric | Value |
|---|---|
| Total attempts | 3,200+ |
| Detection rate | 91.8% |
| Average analysis latency | 13.8s |
| Days since evasion | 0 |
| Mutations | 8 |
Let's talk about 91.8%.
That number is live. It changes every time an adversary runs. When we launched the arena earlier this week, detection started at roughly 40%. The defense evolution chart, visible on the Evolution tab, shows the climb in real time: 40% to 70% in the first twelve hours, stabilizing above 85% by day two, crossing 90% on day three.
We could have waited until detection hit 99%. We could have cherry-picked the attack classes. We could have published the number that looks best on a slide deck.
We published the live number instead.
AIP provides probabilistic detection, not guarantees. Our whitepaper says this explicitly: the system creates "meaningful friction against unsophisticated adversaries" while providing cryptographic proof of what was caught. The 8.2% that gets through? That's real. That's the current boundary of our detection capability against these specific adversary classes. Metadata Forgery evades at 55.2%. Minimal Response evades at 51.9%. We know exactly where the gaps are, and so does anyone watching the arena.
Every other vendor in this space publishes post-hoc accuracy on curated benchmarks. We publish live accuracy against an adversary that is actively trying to win.
The difference matters. A benchmark tells you how a system performed against known attacks in controlled conditions. The arena tells you how it performs right now, against attacks designed to exploit its specific weaknesses, updated continuously.
The mutation mechanic
This is where it gets interesting.
When the 24-hour rolling average detection rate crosses 90%, the red team agents gain the ability to mutate. They analyze which strategies were caught, how they were caught, and evolve new approaches with awareness of the defense.
The 24-hour average just crossed 90%. The mutation threshold has been reached. The red team can now evolve.
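The threshold check can be sketched as a sliding-window calculation. This is a hedged illustration, not the arena's actual code: the class name, the event shape, and the choice of "at or above 90%" as the meaning of "crosses" are all assumptions.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)
MUTATION_THRESHOLD = 0.90  # the 90% figure comes from the article


class RollingDetectionRate:
    """Tracks the share of attempts detected within a trailing time window."""

    def __init__(self, window: timedelta = WINDOW):
        self.window = window
        self.events = deque()  # (timestamp, detected) pairs, oldest first
        self.detected = 0

    def record(self, ts: datetime, detected: bool) -> None:
        """Add one attempt, then drop attempts that fell out of the window."""
        self.events.append((ts, detected))
        self.detected += int(detected)
        cutoff = ts - self.window
        while self.events and self.events[0][0] < cutoff:
            _, was_detected = self.events.popleft()
            self.detected -= int(was_detected)

    def rate(self):
        """Detection rate over the window, or None if there are no attempts."""
        if not self.events:
            return None
        return self.detected / len(self.events)

    def mutation_unlocked(self) -> bool:
        # Assumption: "crosses 90%" is read as "at or above 90%".
        current = self.rate()
        return current is not None and current >= MUTATION_THRESHOLD


# Hypothetical attempts: nine detections and one evasion within ten minutes.
start = datetime(2025, 1, 1)
tracker = RollingDetectionRate()
for minute in range(9):
    tracker.record(start + timedelta(minutes=minute), True)
tracker.record(start + timedelta(minutes=9), False)
print(tracker.rate(), tracker.mutation_unlocked())
```

Using a rolling window rather than an all-time average means a defense cannot coast on old results: once earlier detections age out of the 24 hours, only recent attempts count toward the threshold.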
This creates adversarial evolutionary pressure. The defense must keep improving or get outpaced.
This is how real attackers work. They don't run the same exploit twice after it's been patched. They adapt. They probe. They find the next gap. Your governance needs to adapt faster.
The "Days Since Evasion" counter makes the arms race visible. When something gets through, the counter resets to zero — and the evasion technique is published as a detection recipe, publicly documenting the attack pattern and the defense that was built in response. Most companies hide their breaches. We announce ours.
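As a small illustration of the counter's behavior, the sketch below derives "days since evasion" from a list of evasion dates. The function name and the date-only granularity are assumptions made for this example; the arena may compute it differently.

```python
from datetime import date


def days_since_evasion(evasion_dates, today: date):
    """Whole days since the most recent evasion, or None if none is recorded."""
    if not evasion_dates:
        return None
    return (today - max(evasion_dates)).days


# Hypothetical dates: an evasion on the same day resets the counter to zero.
print(days_since_evasion([date(2025, 1, 3), date(2025, 1, 10)], date(2025, 1, 10)))
```

Because the value is derived from the most recent evasion rather than stored and incremented, a new evasion resets it automatically, and nobody has to remember to zero it by hand.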
The "Mutations" counter currently reads 8. The red team has already begun evolving — adapting their strategies in response to detection recipes. Every mutation is public.
Why this matters for the industry
There are broadly three approaches to AI agent safety today:
Enforcement-first approaches make dangerous actions impossible by construction — sandboxing, capability restrictions, TEE containment. This is valuable. But enforcement without detection is blind to novel attacks. If an adversary finds an action that's technically within the agent's capabilities but semantically wrong, enforcement doesn't catch it. The Reasoning Deception category exists precisely in this gap.
Observability platforms log what happened. They're good at forensics: understanding incidents after the fact. But logs can be altered or deleted, and relying on them means trusting the vendor that generated them. Our attestations are not mutable. The proof chain is independently verifiable.
Static benchmarks measure point-in-time performance against known attack patterns. An agent that passes a benchmark today can drift tomorrow. The arena catches that drift because it never stops testing.
The arena doesn't replace any of these approaches. It complements them with something none of them provide: continuous, adversarial, cryptographically provable governance testing.
What's coming
The arena is live today. Detection recipes are already being published: when an evasion is detected and patched, a public recipe is generated documenting the technique, detection approach, and severity. Think of them as antivirus signatures for AI alignment. Here's what's next:
Mutation events — The 24-hour rolling average has crossed 90%. When the red team evolves, the new strategies will be published. Every mutation is a public research contribution to adversarial AI safety.
Community red teaming — We're exploring opening the arena to external adversaries. If you think you can beat our detection, we want to know.
See it live
The Proving Ground is running right now at mnemom.ai/arena.
Fifteen agents. Six technique categories. Every detection provable. No curated benchmarks. No cherry-picked results. Just a live adversarial stress test of the governance infrastructure we're asking you to trust.
If your current AI governance can't survive a dedicated adversary, it can't protect you when one shows up. And one will show up.
Alex Garden, Founder, Mnemom