The Wise Operator

Pre-Deployment Evaluation

The practice of an outside party, often a government body, testing a frontier AI model for dangerous capabilities before it is released to the public.


What It Is

Pre-deployment evaluation is the step where someone other than the model’s builder gets to test it before the public does. The “someone” is increasingly a government body. In May 2026, the US Center for AI Standards and Innovation, CAISI, announced agreements with Google DeepMind, Microsoft, and xAI that let the government evaluate their frontier models before public release. The trigger was national-security concern over Anthropic’s cyber-capable model, Mythos, and the broader recognition that frontier models can now find and exploit software vulnerabilities at scale.

The word “evaluation” is doing a lot of work. This is not a quality check on whether the model is helpful or accurate. It is a capability probe: Can this model find zero-day vulnerabilities? Can it plan a cyberattack? Can it assist with biological or chemical harm? Can it deceive its evaluators? The evaluation happens after the model is trained but before it ships, in the window between “finished” and “released.” That window is the only place an outside party can still say no.

How It Actually Works

A lab finishes training a model and, before deployment, hands a version of it to the evaluating body under a structured agreement. The evaluators run a battery of tests: red-team exercises, capability elicitation, attempts to make the model do things it should refuse. They are looking for the ceiling of what the model can do, not the average of what it usually does, because an attacker will also push for the ceiling.
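The ceiling-versus-average distinction is easy to see with numbers. What follows is a minimal sketch, not any evaluator's actual method: the scores, the 0.5 threshold, and the function names are all hypothetical, chosen only to show why the two readings can disagree.

```python
from statistics import mean

# Hypothetical scores in [0, 1] from five elicitation attempts on one task.
# Illustrative numbers only; this is not real evaluation data.
attempt_scores = [0.05, 0.10, 0.00, 0.85, 0.15]


def average_capability(scores):
    """What the model usually does across attempts."""
    return mean(scores)


def ceiling_capability(scores):
    """The best the model managed on any single attempt."""
    return max(scores)


def flagged_by_ceiling(scores, threshold=0.5):
    """Flag the model if even one attempt reaches the (assumed) threshold."""
    return ceiling_capability(scores) >= threshold


def flagged_by_average(scores, threshold=0.5):
    """Flag the model only if its typical attempt reaches the threshold."""
    return average_capability(scores) >= threshold


print(flagged_by_ceiling(attempt_scores))  # True: one attempt scored 0.85
print(flagged_by_average(attempt_scores))  # False: the mean is about 0.23
```

An evaluator reading the average would wave this model through. An attacker only needs the one attempt that worked, which is why the ceiling is the number that matters.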

The lab keeps control of the release decision. CAISI’s agreements are not a licensing regime; the government does not get a veto in the legal sense. What it gets is early sight and a documented finding. The leverage is reputational and, potentially, regulatory: a lab that ships a model an evaluator flagged as dangerous owns that choice publicly. The arrangement works because both sides want the same thing: a record that the model was inspected before it reached the open internet.

Why It Matters Right Now

The same week CAISI announced these agreements, Palo Alto Networks published security advisories covering 26 CVEs, against its usual count of fewer than five, with most findings coming from frontier models scanning code. Google’s threat team disrupted hackers using an AI model to plan a mass exploitation campaign. The capability that makes pre-deployment evaluation necessary is no longer hypothetical. It is in this month’s news.

This also marks a policy reversal. The Trump administration had moved away from AI oversight; the cyber-capability evidence pulled it back toward inspection. That is worth noticing. Pre-deployment evaluation did not arrive because regulators wanted more process. It arrived because the models crossed a capability line that made the absence of inspection feel reckless.

How TWO Uses It

At TWO we treat pre-deployment evaluation as a signal, not a headline. When a lab agrees to outside evaluation, it is telling you indirectly that its models are now capable enough to be worth inspecting. That is information an operator can use. The labs submitting to evaluation are the ones building cyber-permissive frontier systems, and the gap between “powerful enough to evaluate” and “powerful enough to help you” is small.

Here is the operator decision it sharpens. When you are choosing a model for a workflow that touches anything sensitive (customer data, internal code, financial records), the question is not only “is it accurate.” It is “would this model class clear a capability evaluation, and do I trust the lab that released it after one.” A model that an evaluator flagged and that shipped anyway is a different risk profile from one that passed clean. Scott’s rule is to read the evaluation posture the way you would read a vendor’s security page: it does not tell you everything, but a lab that submits to it and publishes the result has shown you something about how it operates. When the evaluation record is silent, that silence is also data.
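For operators who keep a model-selection checklist, the three cases above can be written down as a small lookup. This is a sketch under stated assumptions: the category names, the function, and the wording of each note are hypothetical illustrations of the reasoning, not a published TWO tool or a standard taxonomy.

```python
from enum import Enum


class EvalRecord(Enum):
    """Hypothetical labels for what the public record says about a model."""
    PASSED_CLEAN = "passed_clean"                # evaluated, nothing flagged
    FLAGGED_AND_SHIPPED = "flagged_and_shipped"  # flagged, released anyway
    SILENT = "silent"                            # no evaluation record found


def review_note(record, touches_sensitive_data):
    """Return a suggested next step for a candidate model (illustrative only)."""
    if not touches_sensitive_data:
        return "standard review"
    if record is EvalRecord.PASSED_CLEAN:
        return "proceed; keep the published result on file"
    if record is EvalRecord.FLAGGED_AND_SHIPPED:
        return "escalate; read the finding before adopting"
    # A silent record is not a pass. Treat the absence as information.
    return "ask the vendor; treat the silence as a data point"


print(review_note(EvalRecord.SILENT, touches_sensitive_data=True))
```

The point of writing it down is not automation. It forces the silent case to get its own branch instead of being quietly treated as a pass.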

What to Watch Next

Watch whether evaluation moves from voluntary agreement to requirement, and whether it stays national or goes through a body with cross-border reach. Watch which labs are absent from the list. And watch the lag: if evaluation takes weeks while model releases ship monthly, the process becomes a formality the market routes around. The signal to track is whether a flagged finding ever actually delays a release. The first time it does, pre-deployment evaluation stops being a courtesy and becomes a gate.