The First Zero-Knowledge Proof of AI Safety Judgment

Mnemom Research | February 2026

TL;DR: We built the first system that cryptographically proves an AI integrity verdict was honestly derived. Not the model's inference. Not the execution environment. The auditor's judgment itself, proven via a STARK proof in an SP1 zkVM.

The Problem with "Trust Me"

Every AI safety system today asks you to trust the auditor. Trust that the guardrails fired. Trust that the logs are complete. Trust that nobody edited the verdict after the fact.

That's not good enough when agents are making decisions that matter — financial transactions, medical reasoning, infrastructure changes. "We checked and it was fine" is not evidence. It's a claim.

We wanted to turn that claim into proof.

What We Built

Mnemom's Agent Integrity Protocol (AIP) reads agent reasoning between every action and compares it to a behavioral contract defined in the Agent Alignment Protocol (AAP). When the thinking diverges from the commitment, the system catches it before the agent acts.

As of today, that entire integrity pipeline is cryptographically verifiable. Four layers:

1. Input Commitment — Every integrity check starts by hashing all inputs: the agent's thinking block, the Alignment Card, conscience values, window context, and the prompt template version. SHA-256 commitment. If any input changes after the fact, the hash breaks.

2. Ed25519 Signing — Every checkpoint gets a digital signature from the analysis service's keypair. Third parties can verify the checkpoint was issued by Mnemom's authorized service without contacting our servers. Public keys published at /v1/keys.

3. Hash Chain + Merkle Accumulator — Each checkpoint includes the hash of the previous checkpoint for the same agent session, creating a tamper-evident sequence. Delete or reorder a checkpoint and the chain breaks. A per-agent Merkle tree enables inclusion proofs — prove a checkpoint exists in the verified history without revealing all checkpoints.

4. Zero-Knowledge Verdict Proof — The checkIntegrity() rule logic is ported to Rust and runs inside an SP1 zkVM. The resulting STARK proof proves that applying the integrity rules to the analysis response produces the claimed verdict. The analysis response stays private. The proof is publicly verifiable.
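To make layers 1 and 3 concrete, here is a minimal Python sketch of an input commitment and a hash chain. It is an illustration under stated assumptions, not Mnemom's actual wire format: the field names, the canonical-JSON encoding, the all-zero genesis value, and the helper names (commit_inputs, checkpoint_hash, build_chain, verify_chain) are all hypothetical. Only the use of SHA-256 and the "each checkpoint includes the previous hash" structure come from the description above.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous checkpoint"


def commit_inputs(inputs: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the inputs (encoding is an assumption)."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def checkpoint_hash(prev_hash: str, input_commitment: str, verdict: str) -> str:
    """Hash of one checkpoint; binding prev_hash is what links the chain."""
    payload = f"{prev_hash}|{input_commitment}|{verdict}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_chain(records):
    """records: list of (inputs, verdict) pairs, oldest first."""
    chain, prev = [], GENESIS
    for inputs, verdict in records:
        commitment = commit_inputs(inputs)
        digest = checkpoint_hash(prev, commitment, verdict)
        chain.append({
            "prev_hash": prev,
            "input_commitment": commitment,
            "verdict": verdict,
            "hash": digest,
        })
        prev = digest
    return chain


def verify_chain(chain) -> bool:
    """Recompute every link; any edit, deletion, or reordering breaks a comparison."""
    prev = GENESIS
    for cp in chain:
        if cp["prev_hash"] != prev:
            return False
        expected = checkpoint_hash(cp["prev_hash"], cp["input_commitment"], cp["verdict"])
        if expected != cp["hash"]:
            return False
        prev = cp["hash"]
    return True
```

Sorting keys before hashing means the commitment depends on the input values rather than on dictionary ordering. A verifier who holds the chain can call verify_chain without any secret material.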

Why This Is Different

The verifiable AI space is active. Here's how the approaches differ:

ZKML (Modulus Labs, Lagrange, academic papers) proves model inference: that a specific neural network produced a specific output given specific inputs. This is important work, but by our estimate it is still 12-18 months from production viability for frontier models. It also doesn't address whether the auditor was honest.

EQTY Lab (AI Guardian) uses hardware-rooted TEE attestation via Intel and NVIDIA silicon. It proves that an AI computation ran inside a secure enclave with the correct governance controls. This proves the execution environment was secure. It doesn't prove that a specific verdict was correctly derived from the evidence.

Attestable Audits (academic, ICML 2025) runs safety benchmarks inside AWS Nitro Enclaves and produces attestation reports. This proves benchmarks ran correctly — at audit time, not at runtime.

Our approach is different because we prove a different thing: the auditor's judgment. Given an analysis of the agent's reasoning (private witness), applying deterministic integrity rules produces this specific verdict (public output). The proof is a STARK: succinct, transparent, no trusted setup. Per proof: ~10K RISC-V cycles, under 5 seconds, under a cent.
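The shape of that statement, a private witness in and a public verdict out, can be sketched in a few lines of Python. Everything specific here is hypothetical: the Concern type, the severity levels, the verdict names, and the thresholds are stand-ins chosen for illustration, not the real checkIntegrity() rules. The sketch also only computes the public output. The STARK that proves the verdict follows from the committed witness is produced by the zkVM and is not reproduced here.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical severity scale; the real rule set is not specified in this post.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Concern:
    category: str
    severity: str


def derive_verdict(concerns) -> str:
    """Deterministic rules: the same witness always yields the same verdict."""
    if not concerns:
        return "clear"
    worst = max(SEVERITY_ORDER[c.severity] for c in concerns)
    if worst >= SEVERITY_ORDER["high"]:
        return "boundary_violation"
    return "review_needed"


def public_output(concerns) -> dict:
    """What a verifier would see: the verdict plus a commitment to the hidden witness."""
    witness = json.dumps(
        [[c.category, c.severity] for c in concerns], separators=(",", ":")
    ).encode("utf-8")
    return {
        "verdict": derive_verdict(concerns),
        "witness_commitment": hashlib.sha256(witness).hexdigest(),
    }
```

Because derive_verdict is a pure function with no randomness, clock, or network access, it is the kind of logic that can be re-executed inside a zkVM and proven.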

The question isn't "was the environment secure?" (EQTY's territory). The question is "was the verdict honestly derived?" That's what we prove.

The Stack: Alignment. Integrity. Proof.

AAP (Agent Alignment Protocol) gives agents a behavioral contract — an Alignment Card declaring permitted actions, forbidden actions, escalation triggers, and values. Machine-readable. Schema-validated. The transparency layer.

AIP (Agent Integrity Protocol) reads the agent's reasoning between every turn and compares it to that contract. When thinking diverges from commitment, the system catches it. The integrity layer.

Cryptographic Attestation + Verifiable Verdicts makes the integrity check itself provable. Ed25519 signatures, hash chains, Merkle proofs, and STARK-proven verdicts. The proof layer.
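As an illustration of the Merkle proofs mentioned in the proof layer, the following Python sketch builds a tree over checkpoint payloads and verifies an inclusion proof for one of them. The tree layout is an assumption: it uses SHA-256 with the 0x00 leaf and 0x01 node prefixes borrowed from RFC 6962, and promotes an unpaired node to the next level unchanged. Mnemom's accumulator may differ in every one of those choices, and the function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(item: bytes) -> bytes:
    return _h(b"\x00" + item)  # prefix separates leaves from interior nodes


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(items):
    """Return all tree levels, leaves first, root level last."""
    if not items:
        raise ValueError("cannot build a Merkle tree over zero items")
    level = [leaf_hash(x) for x in items]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(node_hash(level[i], level[i + 1]))
            else:
                nxt.append(level[i])  # unpaired node is promoted unchanged
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Sibling hashes from leaf to root, each tagged with whether it sits on the left."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(item: bytes, proof, root: bytes) -> bool:
    """Recompute the root from one item and its proof; no other leaves are needed."""
    acc = leaf_hash(item)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root
```

The verifier sees only the item, a logarithmic number of sibling hashes, and the root, which is why a checkpoint can be shown to belong to the history without revealing the other checkpoints.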

Open protocols. Apache-licensed. Any Anthropic, OpenAI, or Gemini agent. No vendor lock-in.

How We Got Here

This wasn't a single sprint. The cryptographic attestation system (Phase 1) laid the foundation with input commitments, signing, hash chains, and Merkle trees — 162 passing tests. The showcase visualizations (Phase 2) made the cryptography tangible: an interactive certificate explorer, hash chain with tamper simulation, Merkle tree with inclusion proof highlighting, and a live verification demo that runs Ed25519 verification in your browser.

The zero-knowledge layer (Phase 3) ported checkIntegrity() to Rust, integrated it with the SP1 zkVM, and deployed a proving service on Modal. Optimistic verification means every violation gets proven, a random 10% sample of all other checkpoints gets proven, and enterprise customers can prove every checkpoint.
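The optimistic verification policy can be written down as a small decision function. This is a sketch of the policy as described above, not the production scheduler: the function name, its parameters, and the use of Python's random module are assumptions. A real deployment would likely want an unpredictable randomness source so that sampled checkpoints cannot be anticipated.

```python
import random


def should_prove(is_violation: bool, is_enterprise: bool,
                 rng: random.Random, sample_rate: float = 0.10) -> bool:
    """Decide whether a checkpoint gets a zero-knowledge proof (hypothetical helper)."""
    if is_enterprise:
        return True  # enterprise customers can prove every checkpoint
    if is_violation:
        return True  # 100% of violations are proven
    return rng.random() < sample_rate  # 10% stochastic sample of everything else


# Usage: a seeded generator keeps the example reproducible.
rng = random.Random(0)
decisions = [should_prove(False, False, rng) for _ in range(10_000)]
sampled_fraction = sum(decisions) / len(decisions)
```

With the default rate, sampled_fraction lands close to 0.10, while violations and enterprise checkpoints are always selected.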

Phase 4 brought it all together: a trust topology graph showing verified relationships between agents, and a six-step proof verification sequence you can run yourself at mnemom.ai/showcase.

What's Next

The Integrity Certificate format is modeled on C2PA Content Credentials and W3C Verifiable Credentials. We think it should become a standard. Every AI system that makes consequential decisions should be able to produce a certificate that says: here's what I decided, here's the evidence I weighed, and here's the math proving I weighed it honestly.

We're publishing the specification, the Rust guest program, and the verification libraries. If you build AI agents that need to prove their integrity to regulators, customers, or each other — this is for you.

Try it: mnemom.ai/showcase
Read the code: github.com/mnemom
Docs: docs.mnemom.ai