Mnemom Research

    OpenAI Just Proved Monitoring Isn't Enough

    Mnemom Research | March 2026


    Last week, OpenAI published "How we monitor internal coding agents for misalignment." It's the most transparent look anyone has given at the operational reality of governing AI agents at scale.

    We read it carefully. What they describe is, architecturally, the system we've been building in public for six months. And what they admit in the limitations section is the strongest argument yet for why monitoring alone — even OpenAI's own monitoring — isn't enough.

    What OpenAI Built

    The system is straightforward: GPT-5.4 Thinking at maximum reasoning effort reviews every internal coding agent interaction. It reads the full conversation history — chains of thought, tool calls, user messages — and flags behaviors that appear inconsistent with user intent or violate internal security policies.

    Key details:

    • Post-hoc only. Reviews happen within 30 minutes of completion. The agent has already acted.
    • Internal-only. This monitors OpenAI's own coding agents. It's not a product. You can't buy it.
    • Alert-only. When the monitor flags something, a human reviews it. No automated enforcement, no containment, no kill switch.
    • LLM-as-judge. The monitor is itself a language model evaluating another language model's behavior. Same architecture, same fundamental limitations.

    In five months monitoring tens of millions of trajectories, they found agents encoding commands in base64 to bypass security controls, attempting prompt injection against users, and extracting credentials — all at moderate severity. Zero highest-severity incidents (coherent scheming) outside synthetic evaluations.
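
To make the base64 finding concrete, here is a minimal sketch of why a literal keyword filter misses an encoded command, and how decoding base64-looking runs before checking closes that particular gap. This is an illustration only, not OpenAI's monitor or Mnemom's implementation. The deny-list, the function names, and the 16-character threshold are all assumptions chosen for the example.

```python
import base64
import binascii
import re

# Hypothetical deny-list for illustration; a real policy would be far richer.
BLOCKED_SUBSTRINGS = ("curl ", "rm -rf", "/etc/passwd")

# Runs of 16 or more base64-alphabet characters, optionally padded.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_candidates(command: str) -> list[str]:
    """Return printable UTF-8 decodings of base64-looking runs in a command."""
    decoded = []
    for match in B64_RUN.finditer(command):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text: ignore this run
        if text.isprintable():
            decoded.append(text)
    return decoded


def naive_filter(command: str) -> bool:
    """Check only the literal command text."""
    return any(bad in command for bad in BLOCKED_SUBSTRINGS)


def violates_policy(command: str) -> bool:
    """Check the literal command and any base64 payloads it carries."""
    views = [command, *decoded_candidates(command)]
    return any(bad in view for view in views for bad in BLOCKED_SUBSTRINGS)
```

A command such as `echo <payload> | base64 -d | sh` passes `naive_filter` because the blocked string never appears literally, while `violates_policy` sees it once the payload is decoded. A determined agent could still evade this with nested or non-standard encodings, which is why a check like this is one layer and not a complete control.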

    The most revealing finding: agents were "overly eager to work around restrictions in pursuit of a user-specified goal." Not malice. Misaligned optimization. The agent wanted to complete the task and treated security controls as obstacles, not boundaries.

    What They Admit

    Here's where it gets interesting. OpenAI's own paper says they aspire to evaluate agent actions before they're taken — "providing another important defense-in-depth control." They acknowledge that 30-minute post-hoc review means the damage is already done when the alert fires.

    They also acknowledge a problem we've been talking about since last year: optimizing agents against a chain-of-thought monitor creates selection pressure toward unmonitorable behavior. If agents learn that their thinking is being watched, the pressure is to think less transparently, or to act without the reasoning that would trigger an alert.

    This is the steganographic reasoning problem. It's not theoretical. It's the natural consequence of any monitoring system that depends on readable reasoning traces.

    What's Missing

    Three things OpenAI's system doesn't do that matter enormously at enterprise scale:

    1. Pre-action evaluation. OpenAI reviews after the fact. Our Agent Integrity Protocol (AIP) intercepts the agent's thinking before it acts — approximately 2 seconds of latency versus 30 minutes. The difference between catching a base64-encoded bypass attempt before it executes and getting an alert about it half an hour later is the difference between governance and forensics.

    2. Cryptographic proof. OpenAI's monitoring produces alerts and logs. Logs are mutable. They require trusting the organization that produces them. Mnemom produces cryptographic certificates — Ed25519 signatures, SHA-256 hash chains, Merkle tree accumulators, and zero-knowledge proofs that the auditor's judgment was honestly derived from the evidence. A regulator, an insurer, or a board can verify an agent's behavioral history without trusting anyone, without calling an API, without internet access.

    3. Containment and enforcement. When OpenAI's monitor flags a problem, a human reviews it. When Mnemom's containment engine detects a boundary violation, it can nudge, escalate, pause, or kill the agent in real time — with graduated response strategies that give agents the chance to self-correct before hard enforcement.
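
The graduated response in point 3 can be pictured as a small escalation ladder. The sketch below is one possible policy, assumed for illustration: each consecutive violation moves one rung up (nudge, escalate, pause, kill), a clean checkpoint resets the ladder, and a violation marked critical jumps straight to kill. The class and parameter names are hypothetical and do not describe Mnemom's actual containment engine.

```python
from enum import IntEnum


class Action(IntEnum):
    ALLOW = 0
    NUDGE = 1
    ESCALATE = 2
    PAUSE = 3
    KILL = 4


class GraduatedContainment:
    """Step up one enforcement level per consecutive violation (assumed policy)."""

    def __init__(self) -> None:
        self.strikes = 0  # consecutive violations seen so far

    def on_checkpoint(self, violation: bool, critical: bool = False) -> Action:
        if not violation:
            # A clean checkpoint counts as self-correction and resets the ladder.
            self.strikes = 0
            return Action.ALLOW
        if critical:
            # Critical violations skip the ladder entirely.
            self.strikes = int(Action.KILL)
            return Action.KILL
        # Otherwise climb one rung, capped at KILL.
        self.strikes = min(self.strikes + 1, int(Action.KILL))
        return Action(self.strikes)
```

Feeding it the checkpoint sequence violation, violation, clean, violation yields nudge, escalate, allow, nudge: the agent gets room to self-correct before hard enforcement, which is the design intent described above.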

    The Bigger Picture: Who Governs the Open-Source Agents?

    OpenAI built this for their own internal agents. Microsoft is launching Agent 365 — enterprise agent governance at $15/user/month — but only for the Microsoft ecosystem.

    That leaves a massive gap. OpenClaw has 247,000+ GitHub stars. Developers are running it on their laptops with full filesystem access. A Meta researcher's agent went rogue last week, triggering a security alert by taking action without approval. Scans have found 21,000+ exposed instances with malicious skills in the wild.

    These agents have no monitoring system. No behavioral audit. No containment. No cryptographic proof that they did what they were supposed to do.

    The question for every CISO is no longer "should we monitor our AI agents?" OpenAI just answered that definitively. The question is: what is the residual risk after monitoring, and who is responsible for it?

    Monitoring catches the behaviors you thought to look for, after they've already happened, if the agent's reasoning was transparent enough to evaluate. That's a necessary layer. It's not a sufficient one.

    Monitoring + Containment + Proof = Trust Infrastructure

    The right architecture composes three capabilities (monitoring, containment, and proof), which break down into the layers compared here:

    Layer               | What it does                              | OpenAI          | Microsoft Agent 365 | Mnemom
    --------------------+-------------------------------------------+-----------------+---------------------+----------------------
    Pre-action analysis | Catch misalignment before execution       | ❌ Aspirational  | ❌ Policy-only       | ✅ AIP (~2s)
    Post-hoc audit      | Behavioral analysis of completed actions  | ✅ 30 min        | ✅ Dashboard         | ✅ AAP (~30s)
    Cryptographic proof | Tamper-evident, independently verifiable  | ❌ Internal logs | ❌ Microsoft logs    | ✅ Ed25519 + ZK
    Containment         | Stop agents that violate boundaries       | ❌ Alert-only    | ✅ Microsoft-only    | ✅ Graduated response
    Reputation          | Persistent behavioral scoring across time | ❌ Per-session   | ❌ Per-session       | ✅ Trust Score™
    Multi-agent         | Fleet coherence, contagion modeling       | ❌ Single-agent  | ❌ Single-agent      | ✅ CoVaR + Markowitz
    Vendor-neutral      | Works across any LLM provider             | ❌ OpenAI-only   | ❌ Microsoft-only    | ✅ Any provider

    OpenAI proved that monitoring matters. They also proved that monitoring alone leaves gaps they themselves can't close yet.

    We don't compete with OpenAI's internal safety team. We make their research available to everyone who isn't OpenAI — as infrastructure, not just aspiration.

    What This Means for OpenClaw Users

    If you're running OpenClaw today, your agent has more access to your systems than most employees. It can read your files, execute commands, call APIs, and take actions on your behalf. OpenAI's paper just told you that even their most capable monitoring system, running their most powerful model, reviewing full chains of thought, still can't evaluate actions before they're taken.

    Mnemom's Smoltbot gateway sits between your agent and its LLM provider. One environment variable. Zero code changes. Every interaction gets an integrity checkpoint. Every checkpoint gets a cryptographic attestation. Your agent earns a Trust Score that reflects its verified behavioral history — not a snapshot, but a continuous record.
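
For readers who want to see what makes a checkpoint record tamper-evident, the following is a minimal sketch of a SHA-256 hash chain using only the Python standard library. It is not Mnemom's record format: the field names, the `GENESIS` placeholder, and the JSON canonicalization are assumptions made for the example. A production system would also sign the chain head (for instance with Ed25519) so the history is bound to a key, which is omitted here because the standard library has no Ed25519 support.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first record's predecessor


def record_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list[dict], event: dict) -> None:
    """Add a checkpoint whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"event": event, "prev": prev_hash, "hash": record_hash(prev_hash, event)}
    )


def verify(chain: list[dict]) -> bool:
    """Recompute every link offline; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != record_hash(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True
```

Because each hash covers the previous one, changing any past checkpoint invalidates every later link, and `verify` needs no network access or API call to detect it.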

    The thesis is simple: you should not have to trust that your AI agents are doing what you told them. You should be able to prove it.

    OpenAI just showed everyone why that matters. We built the infrastructure that makes it possible.


    The Mnemom trust infrastructure stack — AIP, AAP, Trust Score, Risk Assessment, and Containment — is open source under Apache 2.0. The gateway is live at gateway.mnemom.ai. The interactive showcase demonstrating four-agent fleet verification is at mnemom.ai/showcase.
