The Wise Operator

Computer Use

An AI capability in which a model perceives a computer screen and operates its interface directly, clicking, typing, scrolling, and navigating apps to take actions on a user's behalf.


What It Is

Computer use is the capability that lets an AI model do what a person does at a keyboard and mouse: look at the screen, decide what to click, type into fields, scroll, open menus, and move between applications. Instead of returning an answer for a human to act on, the model takes the action itself. It is the skill underneath the phrase “computer-using agent,” and it is the difference between a model that can describe how to file an expense report and one that actually fills the form.

The capability moved into the mainstream this week when Google made computer use a built-in tool inside Gemini 3.5 Flash, its main developer model. Until then, the capability had lived in a separate, slower Gemini 2.5 computer-use model that developers called only when they needed an agent to operate software. Folding it into the default model means a single call can both reason about a task and carry it out. On OSWorld, a benchmark that measures agents operating real software, Gemini 3.5 Flash scored 78.4, just behind GPT-5.5 at 78.7 and further behind Anthropic’s Opus 4.8 at 83.4.

A useful distinction: computer use is the capability, while a computer-use model is a model purpose-built for it. The news this week is that the capability no longer requires the specialty model. It now ships as a tool on a general one.

How It Actually Works

Most computer-use systems combine three parts. A vision component reads the screen as pixels, since the interface was built for human eyes, not for an API. An action component decides the next move: a click at specific coordinates, a keystroke, or a scroll. A planning loop holds the goal, observes what changed after each action, and decides what to do next. The model is not reading the application’s underlying code. It is looking at the rendered screen and acting on what it sees, which is why it can drive software that was never designed to be automated.
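
The observe, decide, act cycle described above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's API: `FakeScreen`, `decide`, `run_agent`, and the Submit button at (200, 300) are all hypothetical stand-ins, and a real system would pass screenshots to a model instead of a dictionary to a hand-written rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    kind: str       # "click" or "type" in this sketch
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a real display: one form field and a Submit button."""
    field_value: str = ""
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real system would return pixels; here a dict describes what is visible.
        return {"field_value": self.field_value, "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.field_value += action.text
        elif action.kind == "click" and (action.x, action.y) == (200, 300):
            self.submitted = True  # (200, 300) is the assumed Submit button location


def decide(goal: str, observation: dict) -> Optional[Action]:
    """Stub policy standing in for the model: picks the next move from what it sees."""
    if observation["submitted"]:
        return None                        # goal reached, stop
    if observation["field_value"] != goal:
        return Action("type", text=goal)   # fill the field
    return Action("click", x=200, y=300)   # press Submit


def run_agent(goal: str, screen: FakeScreen, max_steps: int = 10) -> list:
    """Planning loop: observe, decide, act, then observe again."""
    history = []
    for _ in range(max_steps):
        observation = screen.screenshot()
        action = decide(goal, observation)
        if action is None:
            break
        screen.apply(action)
        history.append(action)
    return history


screen = FakeScreen()
history = run_agent("42.50", screen)
```

The `max_steps` cap is a deliberate choice in this sketch: a loop that acts on what it sees needs a hard stop in case the screen never reaches the expected state.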

That same property is what makes computer use both broadly useful and genuinely risky. Because it operates the visible interface, it can work across legacy software, government portals, and vendor tools that expose no clean integration. But a model with a cursor and your credentials can also be steered by text it reads on the screen. That attack is called indirect prompt injection, and it is why Google ships Gemini 3.5 Flash with adversarial training plus two optional enterprise safeguards. A model that can act on what it sees can also be tricked by what it sees.

Why It Matters Right Now

A large share of professional work is still gated by interfaces built for humans: ERP screens, claims systems, CAD packages, billing portals. Anything an AI agent can navigate by sight becomes a candidate for delegation, which changes what counts as automatable almost overnight. It also reshapes two markets at once. Traditional scripted automation, which replays a fixed sequence of clicks, breaks whenever a vendor moves a button; a model that adapts to interface changes on the fly reprices that category. And software vendors lose some of the lock-in that came from owning the only usable interface to their data.

How TWO Uses It

TWO treats computer use as the moment the agent question stops being theoretical. When a model can only talk, the worst case is a bad sentence. When a model can act, the worst case is a real action taken in a real system under your name, and the gap between those two is the whole story. The capability does not change whether you should adopt agents. It changes how you are obligated to supervise them.

Scott’s Take: The day your model grows hands is the day your job becomes watching the hands, not admiring them.

The concrete operator decision comes the first time you point a computer-use agent at live software. Do not start with breadth. Pick one bounded, repetitive task gated by a clunky interface, time how long a person takes today, then have the agent do the same task end to end on real data with its logs visible. Ask three questions: did it finish without intervention, where did it ask for help, and is there an audit trail you would show a compliance officer. Only after it passes on one task do you look for the next two where the same skill applies. The operator who knows exactly where computer use works, and where it fails silently, is in a far stronger position than the one whose vendor sold a license that assumes it works everywhere.
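
One way to keep that pilot honest is to record each run and compute the answers to the three questions rather than eyeball them. The sketch below is only an assumption about how you might structure the record: `PilotRun`, `summarize`, and every field name are hypothetical, and the numbers are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class PilotRun:
    seconds: float          # wall-clock time for the agent to finish the task
    interventions: int      # times a person had to step in
    help_requests: list     # steps where the agent asked for help
    has_audit_log: bool     # whether a reviewable log exists for this run


def summarize(runs: list, human_baseline_seconds: float) -> dict:
    """Answer the three pilot questions, plus a comparison against the human timing."""
    finished_alone = sum(1 for r in runs if r.interventions == 0)
    help_steps = sorted({step for r in runs for step in r.help_requests})
    mean_seconds = sum(r.seconds for r in runs) / len(runs)
    return {
        "unassisted_rate": finished_alone / len(runs),          # did it finish without intervention
        "help_steps": help_steps,                               # where did it ask for help
        "all_runs_audited": all(r.has_audit_log for r in runs), # is there an audit trail
        "speedup_vs_human": human_baseline_seconds / mean_seconds,
    }


# Placeholder data for illustration only.
runs = [
    PilotRun(seconds=120, interventions=0, help_requests=[], has_audit_log=True),
    PilotRun(seconds=180, interventions=1, help_requests=["login"], has_audit_log=True),
]
summary = summarize(runs, human_baseline_seconds=300)
```

A task is ready to expand from only when the unassisted rate is high, the help steps are understood, and every run has a log.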

What to Watch Next

The signal that this category is maturing is not a higher OSWorld score. It is the arrival of standard, auditable ways to gate, log, and revoke an agent’s access to a screen, the same controls you already expect for a human employee’s accounts. When the safeguards stop being optional add-ons and become the default a serious vendor ships, computer use will have grown up. Until then, the oversight is yours to build, and building it before you hand over the screen is the entire discipline.
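
Until vendors ship those controls by default, a minimal version of gate, log, and revoke might look like the sketch below. It is an illustration under stated assumptions, not a standard or a product feature: `ScreenAccess`, its fields, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ScreenAccess:
    """Hypothetical access grant for one agent: an allowlist, a kill switch, and a log."""
    agent_id: str
    allowed_actions: set
    revoked: bool = False
    audit_log: list = field(default_factory=list)

    def request(self, action: str, target: str) -> bool:
        # Gate: the action must be on the allowlist and the grant must still be live.
        allowed = (not self.revoked) and action in self.allowed_actions
        # Log: every request is recorded, including the ones that were denied.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_id,
            "action": action,
            "target": target,
            "allowed": allowed,
        })
        return allowed

    def revoke(self) -> None:
        # Revoke: after this call, every request is denied but still logged.
        self.revoked = True


access = ScreenAccess("pilot-agent", {"click", "type"})
first = access.request("click", "Submit")            # on the allowlist
second = access.request("delete_file", "report.pdf") # not on the allowlist
access.revoke()
third = access.request("click", "Submit")            # denied after revocation
```

Logging denied requests as well as allowed ones is the point of the design: the audit trail should show what the agent tried to do, not only what it was permitted to do.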

Computer use sits next to the computer-use model it grew out of, depends on tool use as the mechanism by which a model invokes actions, and turns a language model into an AI agent that operates rather than only advises.