The Wise Operator

Agent Confidence Score

A calibrated reliability rating attached to an AI agent's output, used to decide whether the result is safe to ship or must be routed to a human reviewer.


What It Is

Agent Confidence Score is a reliability rating an AI agent attaches to its own output, expressed as a percentage. Microsoft introduced the framework at Build 2026 when it made Agent Mode the default in Office Copilot. Every time an agent completes a task, the score sits next to the output. If the score falls below the configured threshold, typically 95%, the agent does not ship the result. It routes the output to a human reviewer for sign-off before any action is taken.

The score is not a measure of how confident the agent feels in any colloquial sense. It is a statistical signal built from historical accuracy on similar tasks, the consistency of the agent’s intermediate reasoning steps, and the agent’s track record on the specific tool or document type involved. A 96% score is a calibrated estimate that this category of output, from this agent, on this kind of task, has been correct roughly 96 times out of 100 in prior runs.
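
One way to check whether a score is calibrated in this sense is to group past runs by the score they reported and compare each group's observed accuracy with that score. The sketch below assumes a hypothetical log of (score, was_correct) pairs; the function name, the log format, and the five-point bucket width are illustrative choices, not part of any published Copilot API.

```python
from collections import defaultdict

def calibration_table(runs, bucket_width=5):
    """Group (score_percent, was_correct) pairs into score buckets.

    Returns {bucket_floor: (run_count, observed_accuracy_percent)}.
    A well-calibrated scorer shows observed accuracy close to the
    scores that fall inside each bucket.
    """
    buckets = defaultdict(list)
    for score, was_correct in runs:
        floor = int(score // bucket_width) * bucket_width
        buckets[floor].append(1 if was_correct else 0)
    return {
        floor: (len(hits), 100.0 * sum(hits) / len(hits))
        for floor, hits in sorted(buckets.items())
    }

# Hypothetical history: 25 runs scored 96%, 24 of them judged correct.
history = [(96, True)] * 24 + [(96, False)]
table = calibration_table(history)
print(table)  # {95: (25, 96.0)}
```

Here the 95-to-99 bucket holds 25 runs with 96% observed accuracy, which matches the reported score. A bucket whose observed accuracy sits well below its scores would suggest the number is overconfident for that kind of task.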

How It Actually Works

The score is computed at the end of an agent’s reasoning loop. The orchestrator that ran the agent passes the final output and the reasoning trace through an evaluator model, sometimes a smaller and cheaper one, sometimes the same model running in a self-check mode. The evaluator compares the output against historical correct outputs for similar tasks, scores intermediate steps for consistency, and weights the result against the agent’s reliability profile on that tool. The number is rounded to a percentage and surfaced to the user.
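
The actual scoring formula is not described here, so the following is only a sketch of how an evaluator might fold the three signals into one percentage. The `Signals` type, the `confidence_score` function, and especially the weights are assumptions made for illustration; a real evaluator could combine the signals in a very different way.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signals:
    historical_accuracy: float  # 0..1, accuracy on similar past tasks
    step_consistency: float     # 0..1, consistency of intermediate steps
    tool_reliability: float     # 0..1, track record on this tool or doc type

# Illustrative weights only; the real weighting is not public.
WEIGHTS = (0.5, 0.3, 0.2)

def confidence_score(s: Signals) -> int:
    """Blend the three signals linearly and round to a whole percentage."""
    values = (s.historical_accuracy, s.step_consistency, s.tool_reliability)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("signals must be in the range 0..1")
    combined = sum(w * v for w, v in zip(WEIGHTS, values))
    return round(100 * combined)

print(confidence_score(Signals(0.96, 0.90, 1.0)))  # 95
```

A linear blend is the simplest option and makes the contribution of each signal easy to audit. It says nothing about calibration on its own, which still has to be checked against historical outcomes.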

When the score sits at or above the threshold, the agent commits the action: it writes the file, sends the message, or books the meeting. When the score sits below, the action is held. The output, the score, and the reasoning trace are surfaced to a human reviewer, who approves or rejects with one click. That decision feeds back into the agent’s reliability profile, which means future scores recalibrate and the gate effectively tunes itself over time, even though the configured threshold stays where the operator set it.
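
The commit-or-hold decision and the feedback loop can be sketched in a few lines. Everything named here (`route`, `ReliabilityProfile`, the 95 default) is hypothetical scaffolding for illustration rather than the orchestrator's real interface.

```python
from dataclasses import dataclass

@dataclass
class ReliabilityProfile:
    """Running tally of reviewer decisions for one agent on one tool."""
    approved: int = 0
    rejected: int = 0

    def record(self, approved: bool) -> None:
        """Store one reviewer decision."""
        if approved:
            self.approved += 1
        else:
            self.rejected += 1

    @property
    def approval_rate(self) -> float:
        """Share of reviewed outputs that were approved (0.0 if none yet)."""
        total = self.approved + self.rejected
        return self.approved / total if total else 0.0

def route(score: int, threshold: int = 95) -> str:
    """Return 'commit' at or above the threshold, otherwise 'hold'.

    Only scores that fall below the threshold are held for review.
    """
    return "commit" if score >= threshold else "hold"

profile = ReliabilityProfile()
if route(92) == "hold":
    profile.record(approved=False)  # reviewer rejected the held output
print(route(97), route(92), profile.approval_rate)  # commit hold 0.0
```

Keeping the profile separate from the routing function mirrors the split described above: the threshold is configuration, while the profile is data that accumulates from reviewer decisions.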

Why It Matters Right Now

The shift from synchronous AI assistance to asynchronous delegation makes the confidence number the new contract between humans and agents. When Copilot was a sidebar, the human read every output before acting. When Copilot is an agent running long tasks in the background, the human cannot read every output, so the score becomes the trigger for which outputs deserve attention. The score is also a governance artifact. It is logged, auditable, and tunable, which means a compliance officer can require that anything below 99% on financial workflows escalate to a partner.

A Concrete Operator Scenario

You run a small consultancy. You hand Agent Mode in Word a task: draft the kickoff memo for a new client engagement, pulling from the proposal, the SOW, and the discovery transcript. Twenty minutes later, the agent returns a 92% score. Below threshold. The draft sits in your reviewer queue.

You open it. The agent has misclassified one of the budget numbers as a forecast when it was an actual. You correct the line, hit approve, and the agent learns. Three weeks later, the same agent runs the same workflow on a different client and returns a 97% score. You read the draft and ship the memo. The number is now telling you when to invest your attention and when to spend it elsewhere. That is the real product. Not the agent. The dial that tells you when the agent is worth reading.

How TWO Uses It

TWO’s editorial position is that the score is the most important interface change of the year, and it is also the easiest one to misuse. The temptation is to set the threshold at 90%, accept everything above it, and bank nine times the leverage. That works until the first time a 91% output ships wrong and the cost lands on a client.

The discipline is to set the threshold by domain, not by tool. Marketing drafts can live at 90%. Financial memos cannot. Legal language cannot. Anything that touches a customer commitment or contract language should sit at 98% or require dual approval. Operators who pick one threshold across all workflows are deciding by default that the worst-case downside is the same in every workflow. It is not.
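
A per-domain threshold table is easy to express as configuration. The sketch below uses the 90% and 98% figures from the paragraph above; the domain keys, the `needs_review` helper, and the choice to treat unknown domains as strictly as the toughest known one are assumptions. Dual approval is left out to keep the example short.

```python
# Minimum score (in percent) an output needs to ship without review.
THRESHOLDS = {
    "marketing": 90,
    "financial": 98,
    "legal": 98,
}

# Assumption: an unrecognized domain gets the strictest configured threshold.
STRICTEST = max(THRESHOLDS.values())

def needs_review(domain: str, score: int) -> bool:
    """Return True when the score falls below the domain's threshold."""
    threshold = THRESHOLDS.get(domain, STRICTEST)
    return score < threshold

print(needs_review("marketing", 91))  # False
print(needs_review("financial", 91))  # True
```

The same 91% output ships as a marketing draft and is held as a financial memo, which is the point of setting the threshold by domain rather than by tool.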

The other discipline is to read the rejection rate, not the acceptance rate. When the agent’s outputs cluster at 91% to 94%, just under a 95% threshold, nearly every run lands in the review queue, and you are running the wrong agent for that task. The score is not just a gate. It is a diagnostic. Read it as one.
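
Reading the score as a diagnostic can be as simple as tracking how often outputs are held and how many of them land just under the line. This is a minimal sketch over a hypothetical list of recent scores; the function names and the four-point band are illustrative.

```python
def hold_rate(scores, threshold=95):
    """Share of outputs that fell below the threshold and were held."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < threshold) / len(scores)

def near_miss_rate(scores, threshold=95, band=4):
    """Share of outputs within `band` points below the threshold."""
    if not scores:
        return 0.0
    low = threshold - band
    return sum(1 for s in scores if low <= s < threshold) / len(scores)

# Hypothetical recent scores for one agent on one workflow.
recent = [91, 92, 93, 94, 94, 93, 92, 96, 91, 94]
print(hold_rate(recent))       # 0.9
print(near_miss_rate(recent))  # 0.9
```

When both rates are high, almost everything is being held and almost all of it is a near miss, which is the clustering pattern described above and a sign that the agent is a poor fit for the task.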

The score sits inside the broader managed-agent stack, the cloud-hosted plumbing that runs the agent, evaluates it, and routes the output. It is the layer that makes agentic-coding safe enough to ship into Office workflows in the first place. The thresholding decision touches every long-horizon-agent deployment, because the longer the task, the more the score has to carry on its own.