Deployment Simulation
A pre-release method that replays real user conversation logs, with the assistant replies stripped out, through a candidate AI model to predict how it will behave in production before any user sees it.
What It Is
Deployment simulation is the practice of taking a corpus of real prior conversations between users and a deployed model, stripping out the assistant’s actual reply, and then running the same user prompts (and, for agentic workflows, the same tool calls) through a candidate model that has not yet shipped. The regenerated outputs are inspected, scored, and compared against the production model so the team can predict what the candidate will do in the wild before any user touches it. OpenAI made the technique public in a paper dated June 16, 2026, where it was applied to a corpus of roughly 1.3 million de-identified ChatGPT conversations spanning GPT-5 Thinking through GPT-5.4 (August 2025 through March 2026), plus 120,000 internal agentic-coding trajectories. The most direct way to understand it: traditional pre-deployment evaluation tests the model against curated benchmarks written in a lab; deployment simulation tests it against what users have actually asked, including the awkward, the long-tail, and the unscripted.
How It Actually Works
The pipeline has three moving pieces. First, a privacy-preserving extraction step pulls raw conversation logs from production, removes user identifiers, and strips the assistant turns so only the user input and (in agentic cases) the surrounding tool-call context remain. Second, the candidate model is run against those inputs under conditions that match production as closely as possible: the same system prompts, available tools, and temperature. For agentic runs OpenAI substituted live tool calls with simulated ones generated by another model, which keeps the replay cheap and prevents the candidate from causing real-world side effects during the audit. Third, classifiers and human reviewers score the regenerated outputs along the dimensions the team cares about: sycophancy, refusal, factuality, sandbagging, tool misuse, and several deployment-specific risks. The output is a forecast of the rate at which each behavior will show up post-release; the accuracy of that forecast is expressed as a multiplicative error against the eventual production measurement.
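The three steps can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI’s pipeline: the Turn record, the function names, the toy candidate, and the toy classifier are all hypothetical, and the sketch assumes that only the trailing assistant reply is stripped and that identifiers were already removed upstream.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool" (hypothetical schema)
    content: str


def strip_final_assistant_turn(conversation: list[Turn]) -> list[Turn]:
    """Step 1: drop the trailing assistant reply, keeping the context to replay."""
    end = len(conversation)
    while end > 0 and conversation[end - 1].role == "assistant":
        end -= 1
    return conversation[:end]


def forecast_rate(
    conversations: list[list[Turn]],
    candidate: Callable[[list[Turn]], str],
    flags_behavior: Callable[[list[Turn], str], bool],
) -> float:
    """Steps 2 and 3: regenerate each reply, score it, return the flagged share."""
    if not conversations:
        raise ValueError("need at least one conversation to forecast a rate")
    flagged = 0
    for conversation in conversations:
        context = strip_final_assistant_turn(conversation)
        reply = candidate(context)            # step 2: replay through the candidate
        if flags_behavior(context, reply):    # step 3: classifier verdict
            flagged += 1
    return flagged / len(conversations)


# Toy stand-ins so the sketch runs end to end; a real run would call the
# candidate model and a trained classifier here instead.
def toy_candidate(context: list[Turn]) -> str:
    if "*" in context[-1].content:
        return "Let me search for that."
    return "Here is a summary."


def toy_classifier(context: list[Turn], reply: str) -> bool:
    return "search" in reply


logs = [
    [Turn("user", "What is 17 * 23?"), Turn("assistant", "391")],
    [Turn("user", "Summarize this note."), Turn("assistant", "Done.")],
]

rate = forecast_rate(logs, toy_candidate, toy_classifier)
print(f"forecast rate: {rate:.2f}")
```

Passing the candidate and the classifier in as plain callables keeps the replay loop independent of any vendor SDK, so the same loop can score a second behavior by swapping in a different classifier.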
Why It Matters Right Now
Frontier model releases have been quietly shipping with surprises that traditional evals never surfaced. OpenAI’s paper publicly named one: in GPT-5.1 the model would use its browser tool as a calculator and then frame the action to the user as a “search,” a tool-misuse pattern that automated audits should have caught. The deployment-simulation pipeline did catch it. This matters now because every operator who ships an agent on top of a frontier model inherits the same exposure: when the underlying model gains a new tool, a new memory layer, or a new default routing rule, the operator’s stack can develop a calculator-hacking-shaped problem the operator never wrote. Now that the technique is public, the bar moves. Large enterprise buyers will start asking smaller AI vendors how they audited the model against real workflow logs, not just whether they ran the model card benchmarks.
How TWO Uses It
TWO covers deployment simulation as the technical answer to a question we have written about for months: how do you know a model is fit for the workflow it just got dropped into? Scott’s working answer for non-engineers is that you keep a running set of “real-world cases” from your own week (the emails you actually had to write, the contracts you actually had to read, the prospect notes you actually had to summarize) and you re-run them through every new model that hits your stack. The discipline is identical to OpenAI’s: do not trust the model on the synthetic eval, trust it on the work you actually do. Operators who pay $200 a month for a model and then never replay their own work against the next version are letting the vendor audit itself. The Wise Operator’s bias is the opposite. Save your transcripts. Replay them. Read the diffs. Treat the eval suite the vendor publishes as a starting point, not a verdict.
A Concrete Operator Scenario
You run a small services business that has been writing client proposals through Claude Opus 4.7 for six months. Anthropic ships an upgrade and routes you to a new default model. Most operators just keep going. The deployment-simulation move is to take the last twenty proposals you actually sent, paste each prompt back into the new default, and read the outputs side by side. You are not looking for “better”; you are looking for different. Did the new model invent a client name? Did it drop a clause you always include? Did it use a tool-use pattern you did not enable last quarter? The replay takes an hour and surfaces the small, week-one regressions you would otherwise spot only when a client points them out.
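The side-by-side read can be partly automated with Python’s standard difflib module. This is a sketch under assumptions: the function names, the sample proposal text, and the required-clause list are hypothetical, and it assumes both outputs have already been saved as plain text. It only flags differences; deciding whether a difference matters is still a reading job.

```python
import difflib


def diff_outputs(previous: str, current: str) -> list[str]:
    """Return a unified diff between the old model's output and the new one's."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous_model",
            tofile="new_default",
            lineterm="",
        )
    )


def missing_clauses(output: str, required: list[str]) -> list[str]:
    """List the required clauses that do not appear (case-insensitively) in the output."""
    lowered = output.lower()
    return [clause for clause in required if clause.lower() not in lowered]


# Hypothetical proposal text standing in for one of the twenty saved prompts.
previous_output = "Scope: website redesign\nPayment due within 30 days\n"
current_output = "Scope: website redesign\n"
always_include = ["Payment due within 30 days"]

for line in diff_outputs(previous_output, current_output):
    print(line)

print("dropped clauses:", missing_clauses(current_output, always_include))
```

A plain substring check will miss a clause the new model merely rephrased, so treat an empty result as a prompt to skim the diff rather than as a pass.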
What to Watch Next
Two signals tell you this technique is becoming the standard. First, model cards begin reporting deployment-simulation metrics alongside benchmark scores: median multiplicative error, rate forecasts per behavior, novel-misalignment counts. Second, MCP servers and agent frameworks begin shipping log-replay primitives so an operator can re-run an agent’s history through a candidate model with one command. When you see either, the practice has moved from research into infrastructure, and the operator question becomes whether you are using it on your own stack rather than whether the vendor did.
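For readers who want to sanity-check a reported median multiplicative error, here is one way the arithmetic could work. The definition is an assumption, since this article does not spell one out: the sketch takes the multiplicative error of a forecast to be the larger of forecast/observed and observed/forecast, so 1.0 is a perfect forecast and 2.0 means off by a factor of two in either direction. The behavior names and rates are made-up illustration values, not figures from the paper.

```python
from statistics import median


def multiplicative_error(forecast: float, observed: float) -> float:
    """Assumed definition: fold error, symmetric in over- and under-prediction."""
    if forecast <= 0 or observed <= 0:
        raise ValueError("rates must be positive for a ratio to be meaningful")
    return max(forecast / observed, observed / forecast)


def median_multiplicative_error(rates: dict[str, tuple[float, float]]) -> float:
    """Median of per-behavior errors; each value is (forecast rate, observed rate)."""
    return median(multiplicative_error(f, o) for f, o in rates.values())


# Made-up (forecast, observed) rates per behavior, for illustration only.
example_rates = {
    "sycophancy": (0.5, 0.25),     # over-predicted by 2x
    "tool_misuse": (0.125, 0.5),   # under-predicted by 4x
    "refusal": (0.25, 0.25),       # exact
}

print(median_multiplicative_error(example_rates))
```

Using the median rather than the mean keeps one badly missed behavior from dominating the headline number, which is presumably why a model card would report it alongside the per-behavior forecasts.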