Securing deep frontier AI evaluations

May 8

Updated: May 13

A confidential-computing facility for trustworthy AI auditing: giving regulators visibility into frontier models without exposing the assets behind them.

Author: Alejandro Tlaie Boria

Executive Summary

Pour Demain is publishing a new technical brief proposing a confidential-computing facility that would let independent evaluators run deep safety audits on frontier AI models without forcing providers to expose the model weights, architecture details, or training artefacts they protect as trade secrets. The brief is accompanied by a pre-pilot already underway, a consortium of partners whose conditional commitments cover every role the pilot requires, and a Monte Carlo cost model across three deployment topologies.

The problem: black-box evaluation is structurally insufficient

Regulators are increasingly required to verify safety claims that black-box testing cannot reach: whether a model harbours suppressed dangerous capabilities; whether refusal behaviour reflects genuine removal of a hazardous mechanism, or merely a surface mask over an internal mechanism that remains intact; and whether a training-time backdoor is buried somewhere in the weights, waiting for a rare trigger.

For these claim classes, evaluators need analytical access to the model's internal signals (gradients, activations, attention patterns), not just its outputs. But granting that access has historically meant exposing assets worth billions, which providers reasonably refuse. The result has been a deadlock: deeper evaluations either don't happen, or are conducted internally by the very provider being evaluated.

The proposal: a de facto glass-box facility

The brief proposes a secure evaluation facility built on three integrated layers:

A confidential-computing substrate. Hardware-rooted attestation on H200-class GPUs, with single-node containment to keep the trusted computing base small. This layer reached general availability in 2025 and is already deployed in production for sensitive workloads.
An Evaluation Instrumentation Interface (EII). A standardised set of typed callback endpoints (covering activations, gradients, attention statistics, controlled interventions, and architecture metadata) that the provider implements against its own model. Auditors invoke these endpoints without ever seeing the provider's inference code.
A governed evidence pipeline. Quantitative export budgets and tiered evidence channels treat the audit report itself as part of the attack surface. Evidence is rendered server-side from bounded aggregates, accumulated against a longitudinal ledger, and bound cryptographically to the measured platform identity.
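
To make the Evaluation Instrumentation Interface described above more concrete, here is a minimal Python sketch of what "typed callback endpoints that the provider implements against its own model" could look like. Every name in it (EvaluationInstrumentation, ActivationSummary, ToyProvider, audit_layers) is a hypothetical illustration, not taken from the EII specification in the brief, and the "model" is a toy stand-in.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ActivationSummary:
    """Bounded aggregate for one layer; raw activations are never returned."""
    layer: int
    mean: float
    max_abs: float


class EvaluationInstrumentation(Protocol):
    """Hypothetical endpoint set; the actual EII is defined in the brief."""

    def architecture_metadata(self) -> dict[str, int]: ...

    def activation_summary(self, prompt: str, layer: int) -> ActivationSummary: ...


class ToyProvider:
    """Stand-in provider: a fake two-layer 'model' whose internals stay private."""

    def __init__(self) -> None:
        self._scales = [0.5, 2.0]  # private state, loosely analogous to weights

    def architecture_metadata(self) -> dict[str, int]:
        return {"num_layers": len(self._scales)}

    def _forward(self, prompt: str, layer: int) -> list[float]:
        # Placeholder for the provider's inference code (assumes a non-empty prompt).
        return [self._scales[layer] * (ord(c) % 7 - 3) for c in prompt]

    def activation_summary(self, prompt: str, layer: int) -> ActivationSummary:
        acts = self._forward(prompt, layer)
        return ActivationSummary(layer, sum(acts) / len(acts), max(abs(a) for a in acts))


def audit_layers(
    model: EvaluationInstrumentation, prompts: Sequence[str]
) -> list[ActivationSummary]:
    """Auditor-side code: depends only on the typed interface, not on the provider class."""
    num_layers = model.architecture_metadata()["num_layers"]
    return [
        model.activation_summary(prompt, layer)
        for prompt in prompts
        for layer in range(num_layers)
    ]
```

Calling audit_layers(ToyProvider(), ["some prompt"]) returns one summary per layer. The point of the design is that the auditor's code is written against the interface alone, so the provider can change or hide its inference code without the audit logic needing to see it.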

The facility supports three deployment topologies: sovereign operation, hybrid (sovereign governance on provider-trusted hyperscaler infrastructure), and provider-hosted enclaves under remote oversight. The pilot's security claims are topology-independent.
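
The governed evidence pipeline described above combines a quantitative export budget, a longitudinal ledger, and cryptographic binding to the measured platform identity. The sketch below illustrates how those three ideas fit together, under loud assumptions: the budget is counted in bits, the platform identity is an opaque byte string, and an HMAC with a shared key stands in for whatever attestation-backed signature scheme the brief actually specifies. EvidenceLedger and BudgetExceeded are hypothetical names.

```python
import hashlib
import hmac
import json


class BudgetExceeded(Exception):
    """Raised when an export would push cumulative disclosure past the budget."""


class EvidenceLedger:
    """Accumulates exported aggregates against a fixed budget (hypothetical sketch)."""

    def __init__(self, budget_bits: int, platform_measurement: bytes, key: bytes) -> None:
        self.budget_bits = budget_bits
        self.spent_bits = 0
        self._measurement = platform_measurement  # assumed to come from attestation
        self._key = key  # HMAC key is a stand-in for an attestation-backed signing key
        self.entries: list[dict] = []

    def _tag(self, payload: bytes) -> str:
        # Bind the evidence to the platform identity by authenticating both together.
        return hmac.new(self._key, self._measurement + payload, hashlib.sha256).hexdigest()

    def export(self, name: str, value: float, cost_bits: int) -> dict:
        """Release one bounded aggregate, or refuse if the budget would be exceeded."""
        if self.spent_bits + cost_bits > self.budget_bits:
            raise BudgetExceeded(f"{name}: {cost_bits} bits requested, budget exhausted")
        self.spent_bits += cost_bits
        payload = json.dumps(
            {"name": name, "value": value, "spent_bits": self.spent_bits},
            sort_keys=True,
        ).encode()
        entry = {"payload": payload.decode(), "tag": self._tag(payload)}
        self.entries.append(entry)
        return entry

    def verify(self, entry: dict) -> bool:
        """Check that an entry was produced under this platform measurement and key."""
        return hmac.compare_digest(self._tag(entry["payload"].encode()), entry["tag"])
```

In use, an operator would create one ledger per audit engagement, for example EvidenceLedger(budget_bits=64, platform_measurement=b"...", key=b"..."), and route every exported statistic through export. Because the running total is part of each authenticated payload, the ledger records not only what was disclosed but how much of the budget had been consumed at that point.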

Why this matters

For regulators: a concrete way to verify the safety claims they are increasingly required to assess, without depending on provider self-attestation.
For model providers: stronger IP protection than current audit arrangements offer, with a defensible legal record across the four jurisdictions that house essentially all current frontier providers (EU, US, UK, China). That record builds on the trade-secret floor in TRIPS Article 39 and on recent Chinese Supreme People's Court rulings recognising model weights as protectable trade secrets.
For the host jurisdiction: strategic infrastructure that future-proofs access to frontier models and positions the host as a hub for AI assurance, with multi-use potential beyond evaluation (defence, healthcare, critical infrastructure).

Where we are

The architecture and threat model are complete and published. A pre-pilot benchmarking confidential-computing overhead on a frontier-scale open-weights mixture-of-experts model is already underway, with results feeding directly into the full pilot. Conditional commitments have been secured across every role the pilot requires: a third-party evaluator, a secure-infrastructure provider, a software governance and orchestration partner, a model-fingerprinting provider, and observers from the EU AI Office. Conversations with frontier labs and additional AI safety institutes are open.

The brief proposes a four-stage participation pathway, beginning with a frontier-scale open-weights validation phase. The intent is to break the chicken-and-egg problem that has stalled previous attempts: no provider commits proprietary weights to a novel architecture before that architecture has been demonstrated at frontier-comparable scale, but no architecture can be demonstrated at frontier-comparable scale without access to such weights.

Next steps

The full brief (including the formal threat model, EII specification, empirical detection-rate analysis, legal scaffolding, and Monte Carlo cost model) is available below. We welcome engagement from regulators, providers, evaluators, and infrastructure partners interested in turning the consortium's conditional commitments into a mandated pilot.

Read the full technical brief: