What If the Model Could Use a Computer Like You Do?

In the previous articles, we gave models the ability to call functions, reason about multi-step plans, and discover tools dynamically through MCP. But all of those mechanisms share a common assumption: that the target system exposes a programmatic interface — an API endpoint, a function signature, a structured protocol. The model outputs JSON, and a runtime executes the call. That works beautifully when APIs exist. But most of the work humans do on computers doesn't happen through APIs. It happens through graphical user interfaces: web browsers, desktop applications, forms, dropdown menus, checkboxes, and buttons.

Consider how much of the digital world is GUI-only. Legacy enterprise systems — payroll software from 2005, hospital records portals, government filing websites — have no API and never will. Internal tools built with drag-and-drop form builders expose no endpoints. Many SaaS products offer a web dashboard but not an API for the specific workflow you need. Even when an API exists, it often covers only a fraction of what the GUI can do. If you've ever had to manually click through a web portal to do something that should have been automated, you've felt this pain.

Computer use solves this by giving the model the same interface a human has: a screen to look at and a keyboard and mouse to interact with. The model sees a screenshot of the current screen state, decides what to click or type, and the action is executed on its behalf. Then it sees the resulting new screenshot and continues. No API needed. No structured protocol. If a human can accomplish a task by looking at a screen and moving a mouse, the model can attempt it too.

This is the most general form of tool use. Function calling requires that someone has written an API. MCP requires that someone has built a server. Code execution requires a runtime and the right libraries. But computer use requires only what every application already provides: a visual interface. It is, in a sense, the universal tool — the one that works with everything, because everything has a screen.

💡 Computer use is to tool use what natural language is to programming: a less precise but far more universal interface. You trade speed and reliability for the ability to interact with anything that has a GUI, including systems that were never designed for automation.

The Screenshot-Action Loop

The architecture of a computer-use agent follows a loop that should feel familiar from the agent loop we introduced in article 1, but with one crucial difference: the observation is visual rather than textual. Instead of reading API responses or tool output as structured data, the model looks at an image of the screen and must understand what it sees.

The loop works as follows:

  1. Screenshot: capture the current state of the screen as an image (typically a PNG at the display's resolution).
  2. Observe + Reason: send the screenshot along with the task description (and the history of previous actions) to a vision-language model. The model must understand the visual layout — where buttons are, what text fields contain, which tab is active — and decide what action to take next.
  3. Act: the model outputs a structured action: click(x, y), type("hello"), scroll(down), key_press("Enter"), or similar.
  4. Execute: a runtime (controlling a real or virtual machine) executes the action — actually moving the mouse cursor and clicking, or injecting keystrokes.
  5. Repeat: take a new screenshot of the resulting screen state and go back to step 2. Continue until the task is complete or a stopping condition is reached.

def computer_use_agent(task: str, environment: VirtualDesktop):
    """Run a task by seeing and interacting with a screen.

    `VirtualDesktop`, `vlm`, and `task_complete` are placeholders for a
    sandboxed desktop environment, a vision-language model client, and a
    completion check — a sketch of the loop, not a specific library.
    """
    action_history = []

    while not task_complete(action_history):
        # 1. Observe: capture what's on screen
        screenshot = environment.take_screenshot()

        # 2. Reason: ask the vision-language model what to do
        action = vlm.generate_action(
            task=task,
            screenshot=screenshot,     # image input
            history=action_history     # what we've done so far
        )

        # 3. Act: execute the action in the environment
        environment.execute(action)    # click, type, scroll, etc.
        action_history.append(action)

    return action_history

Notice how this mirrors the observe-reason-act loop from article 1, except that observation is a screenshot (an image) rather than text. This means the model needs both strong vision capabilities (to parse the visual layout of a GUI — recognising buttons, reading text rendered in arbitrary fonts, understanding spatial relationships between elements) and strong reasoning capabilities (to plan multi-step interactions and recover from unexpected states). This is why computer use became practical only after the emergence of powerful vision-language models (VLMs) that combine visual understanding with language-based reasoning.

The visual understanding component draws heavily on the same vision transformer (ViT) architecture used in other multimodal tasks: the screenshot is divided into patches, each patch is embedded as a token, and the model attends over both the image tokens and the text tokens describing the task. The difference is that here, the model must output not just a text answer but a precise spatial coordinate (where to click) or a sequence of keystrokes (what to type).
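
To make that concrete, here is a minimal sketch of the patchification step, ignoring the learned projection into embedding space. The patch size is illustrative, not any particular model's value.

import numpy as np

PATCH = 28  # illustrative patch size; real VLMs use model-specific values

def patchify(screenshot: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) screenshot into flattened patch vectors.

    Each patch becomes one "image token"; a learned linear projection
    (not shown) would map it into the model's embedding space, where it
    is attended over jointly with the text tokens of the task.
    """
    h, w, c = screenshot.shape
    h, w = h - h % PATCH, w - w % PATCH          # crop to a patch multiple
    return (
        screenshot[:h, :w]
        .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
        .transpose(0, 2, 1, 3, 4)                # (rows, cols, PATCH, PATCH, c)
        .reshape(-1, PATCH * PATCH * c)          # one flat vector per patch
    )

# A 1080p screenshot yields (1080 // 28) * (1920 // 28) = 38 * 68 = 2584 tokens
tokens = patchify(np.zeros((1080, 1920, 3), dtype=np.uint8))
print(tokens.shape)  # (2584, 2352)

The token count scales with the screenshot's resolution, which is one reason screen size feeds directly into per-step cost — and into the latency discussed next.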

One practical consequence of this architecture is latency. Each iteration of the loop involves taking a screenshot, encoding it, sending it to the model, waiting for inference (which processes a large image alongside conversation history), and then executing the action. A single step typically takes 1-5 seconds. A complex task like filling out a multi-page form or navigating through several web pages might require 20-50 steps, meaning the total time can be measured in minutes rather than seconds. This is dramatically slower than API-based tool use, where each call completes in milliseconds.

💡 The latency cost is real but often acceptable. Tasks that take a model minutes to complete through computer use might take a human 10-15 minutes of tedious clicking. The comparison isn't computer use vs. API calls — it's computer use vs. a human doing the same GUI task manually.

Claude Computer Use

Anthropic's computer use capability (Anthropic, 2024) enables Claude to see screenshots and output structured computer actions. Rather than adding computer use as a separate system, Anthropic integrated it directly into Claude's tool-use framework: the model receives a screenshot as an image in its conversation context and outputs tool calls that represent mouse and keyboard actions.

The available actions form a compact but complete set for interacting with a desktop environment:

  • mouse_move(x, y) — move the cursor to pixel coordinates (x, y)
  • left_click(x, y) — click at the specified position
  • right_click(x, y) — open a context menu
  • double_click(x, y) — double-click (e.g. to select a word or open a file)
  • type("text") — type a string of text
  • key("Enter") — press a specific key or key combination (Enter, Ctrl+C, Alt+Tab, etc.)
  • screenshot() — request a fresh screenshot of the current screen state

At each step, the model examines the screenshot, reasons about the current state of the interface ("I see a login form with a username field and a password field; the username field is already filled; I need to click on the password field and type the password"), and outputs the appropriate action as a tool call. The runtime executes the action, captures a new screenshot, and feeds it back for the next step.

import anthropic

client = anthropic.Anthropic()

# The model receives a screenshot and outputs computer actions.
# Computer use is a beta feature, enabled via the `betas` flag.
response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    betas=["computer-use-2025-01-24"],
    tools=[
        {
            "type": "computer_20250124",  # computer use tool
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Open Firefox and search for 'MCP protocol specification'"
                },
                {
                    "type": "image",    # current screenshot
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_base64
                    }
                }
            ]
        }
    ]
)

# The model responds with a tool call like:
# {
#     "type": "tool_use",
#     "name": "computer",
#     "input": {
#         "action": "left_click",
#         "coordinate": [512, 738]   # Firefox icon position
#     }
# }
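
The conversation then continues as a standard tool-use loop: the runtime performs the click, captures a fresh screenshot, and returns it as a tool_result image. A sketch of the follow-up call, assuming the tool list and message history from above are kept in `tools` and `messages` variables, and that `execute_action` and `take_screenshot_base64` wrap the sandbox (both are sketched after the next paragraph):

# Execute the model's action, then feed back the resulting screen state
tool_use = next(b for b in response.content if b.type == "tool_use")
execute_action(tool_use.input)

followup = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    betas=["computer-use-2025-01-24"],
    tools=tools,                        # same computer tool as above
    messages=messages + [
        {"role": "assistant", "content": response.content},
        {
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": [{
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": take_screenshot_base64(),  # fresh screenshot
                    },
                }],
            }],
        },
    ],
)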

Computer use typically runs in a sandboxed environment — a Docker container with a virtual desktop (e.g. a Linux desktop with a VNC server). This isolation is critical for safety: the model's actions are confined to the sandbox, so a misguided click can't accidentally delete files on your real machine or send emails from your actual account. The trade-off is that the sandbox may not have the applications and accounts a task needs, so some upfront setup (installing software, logging in) is often required.
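
To ground the `execute_action` and `take_screenshot_base64` helpers assumed above — and the `environment.execute` step from the loop earlier — here is a minimal sketch of an executor for such a sandbox. It assumes a Linux/X11 virtual desktop with the xdotool and scrot utilities installed, a common arrangement in VNC-based containers; the action dictionary mirrors the tool-call output shown above:

import base64
import subprocess
import tempfile

def execute_action(action: dict) -> None:
    """Translate a model action into xdotool commands inside the sandbox."""
    kind = action["action"]
    if kind == "mouse_move":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
    elif kind == "left_click":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"], check=True)
    elif kind == "type":
        subprocess.run(["xdotool", "type", "--delay", "50", action["text"]], check=True)
    elif kind == "key":
        # xdotool key syntax, e.g. "Return" or "ctrl+c"
        subprocess.run(["xdotool", "key", action["text"]], check=True)
    else:
        raise ValueError(f"unsupported action: {kind}")

def take_screenshot_base64() -> str:
    """Capture the sandbox display as a base64-encoded PNG."""
    with tempfile.NamedTemporaryFile(suffix=".png") as f:
        subprocess.run(["scrot", "--overwrite", f.name], check=True)
        return base64.b64encode(f.read()).decode()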

Practical use cases include filling forms on websites that lack APIs, navigating web portals (insurance claims, government filings, travel booking), software testing (clicking through UI flows to verify behaviour), and data entry from unstructured sources (reading a PDF and typing its contents into a form). All of these are tasks where a human would sit at a computer and click through a GUI — and where no API shortcut exists.

The limitations are equally important to understand. Each action takes 1-5 seconds, so complex workflows are slow. The model can click the wrong element — especially when buttons are small, closely packed, or ambiguously labelled. It struggles with dynamic content: animations, loading spinners, auto-completing dropdown menus, and drag-and-drop interfaces are all difficult because the screen changes between the moment the model processes the screenshot and the moment the action executes. And it can get stuck in loops: if it clicks the wrong button and the resulting screen looks unfamiliar, it may not know how to recover.

OpenAI Operator and Google Mariner

Anthropic was not alone in pursuing computer use. Multiple major labs released GUI agent products in late 2024 and early 2025, all converging on the same core architecture: screenshot in, action out, repeat.

OpenAI Operator (January 2025) is a browser-based agent that can navigate websites autonomously. Built on top of GPT-4o's vision capabilities, Operator runs in a sandboxed browser environment hosted by OpenAI. Users describe a task in natural language ("Book me the cheapest round-trip flight from NYC to London for next weekend"), and Operator navigates the web to accomplish it: opening travel sites, entering search criteria, comparing prices, and completing the booking flow. Because it runs in a sandboxed browser (not on the user's machine), it's isolated from the user's local data, but this also means it may need the user to log in to their accounts within the sandbox.

Google Project Mariner (December 2024) takes a different architectural approach. Mariner operates as a Chrome extension that runs directly in the user's browser, using Gemini models for visual understanding and action planning. Because it operates within the user's existing browser session, it has access to the user's cookies, logins, and browsing context — no need to re-authenticate in a sandbox. This makes it more powerful for tasks that require the user's existing accounts, but it also requires a higher degree of trust, since the agent acts with the user's identity and can see everything in their browser.

These products, along with Claude's computer use, share the same underlying paradigm: the Computer-Using Agent (CUA) pattern. The differences between them lie along three axes:

  • Visual understanding quality: how accurately the model parses the screen. Can it distinguish a clickable button from a decorative image? Can it read small text? Does it understand complex layouts like nested tables or multi-column forms?
  • Action accuracy: how precisely the model clicks. A few pixels off might mean clicking "Cancel" instead of "Confirm", or selecting the wrong item in a dropdown menu.
  • Sandboxing approach: sandboxed environments (Claude computer use, OpenAI Operator) are safer but less convenient (limited access to user accounts and local state). User-browser approaches (Google Mariner) are more capable but require trusting the agent with the user's full browsing context.

The convergence of all three major labs on the same screenshot-action architecture is significant. It suggests that this pattern — not some alternative like DOM parsing or accessibility-tree reading — is the primary path to general GUI automation. The screenshot is the universal interface: every application renders one, and a sufficiently capable vision model can parse it.

💡 Some systems do supplement screenshots with structured data from the DOM (the HTML tree underlying a web page) or the operating system's accessibility tree (which labels UI elements for screen readers). This structured data can improve accuracy — it's easier to click a button if you know its exact bounding box from the DOM rather than estimating it from pixels. But screenshots remain the primary input because they work universally, including on remote desktops, images of screens, and applications that don't expose a DOM.
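
For web pages specifically, this hybrid is easy to sketch with a browser-automation library such as Playwright: read the element's exact bounding box from the DOM, then click its centre with pixel coordinates, just as a screenshot-based agent would. The URL and selector below are hypothetical:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL

    # DOM grounding: read the exact bounding box instead of estimating
    # coordinates from pixels. "button#submit" is a hypothetical selector.
    box = page.locator("button#submit").bounding_box()
    if box is not None:
        # Click the centre of the element, as a screenshot agent would.
        page.mouse.click(box["x"] + box["width"] / 2,
                         box["y"] + box["height"] / 2)
    browser.close()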

How Models Learn to Use GUIs

A model doesn't come out of pre-training knowing how to click buttons. Computer use is a learned skill that requires specialised training data and poses challenges that don't arise in standard language or vision tasks.

The core training data consists of (screenshot, action) pairs collected from human demonstrations. A human performs a task on a computer — say, booking a flight or configuring a software setting — while the system records the screen at each step and the action the human took (click at position (x, y), type "departure: March 15", press Enter). This produces a supervised dataset: given this visual state, the correct action is this. The model learns to imitate the human's decision-making process.
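
A single record in such a dataset might look like the following sketch; the field names are illustrative, not any published dataset's schema:

from dataclasses import dataclass

@dataclass
class DemonstrationStep:
    """One (screenshot, action) pair from a recorded human demonstration."""
    screenshot_png: bytes                 # screen capture before the action
    action: str                           # e.g. "left_click", "type", "key"
    coordinate: tuple[int, int] | None    # (x, y) for mouse actions
    text: str | None                      # payload for type/key actions
    task: str                             # natural-language task description

# e.g. the human clicked the departure-date field while booking a flight:
step = DemonstrationStep(
    screenshot_png=b"...",                # elided
    action="left_click",
    coordinate=(642, 310),                # hypothetical coordinates
    text=None,
    task="Book a flight from NYC to London departing March 15",
)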

But training a model to use GUIs is harder than it might seem. Three challenges dominate:

Grounding is the problem of locating specific UI elements in pixel space. When the model decides "I need to click the Submit button," it must find that button's exact coordinates in the screenshot. This requires understanding layout, not just content — the model must distinguish the Submit button from other buttons, read its label rendered in whatever font the application uses, and estimate the correct (x, y) coordinate to click. If a page has three similarly-styled buttons in a row, the model needs precise spatial understanding to hit the right one. Grounding is especially difficult when elements are small, when text is rendered at unusual sizes or angles, or when the UI uses icons rather than text labels.
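
Grounding is typically scored by checking whether the predicted click lands inside the target element's bounding box. A minimal sketch of that check, with hypothetical boxes for two adjacent buttons:

def click_hits_target(click: tuple[float, float],
                      bbox: tuple[float, float, float, float]) -> bool:
    """True if a predicted click (x, y) falls inside the target's
    bounding box, given as (left, top, width, height)."""
    x, y = click
    left, top, width, height = bbox
    return left <= x <= left + width and top <= y <= top + height

# Similarly-styled buttons in a row: only precise grounding tells them apart.
assert click_hits_target((150, 412), (120, 400, 80, 24))      # "Submit"
assert not click_hits_target((150, 412), (210, 400, 80, 24))  # "Cancel"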

State tracking is the problem of maintaining awareness across multiple steps. A multi-step task requires remembering what you've already done: which form fields have been filled, which tabs have been visited, what search results have already been reviewed. The model sees a new screenshot at each step, but it must connect that snapshot to the full history of its actions to understand the current state of the task. If the model fills in a form's first three fields and then scrolls down, the next screenshot won't show those fields anymore — the model must remember they're already filled rather than scrolling back up to fill them again.

Error recovery is the problem of adapting when something goes wrong. If a click misses its target and opens the wrong menu, the model needs to recognise the unexpected state and take corrective action (close the menu, go back, try again). If a page loads a cookie consent banner that obscures the form, the model must dismiss it before continuing. If a website has changed its layout since the training data was collected, the model must generalise. Error recovery is what separates a brittle automation script from a robust agent — scripts break when the UI changes; agents (ideally) adapt.
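
One simple mitigation for the stuck-in-a-loop failure is a repetition guard in the agent loop: if the last few actions are identical, force the model to re-assess rather than clicking the same spot again. A sketch, with an arbitrary threshold and a hypothetical `hint` parameter on the VLM call from earlier:

def is_stuck(action_history: list, window: int = 3) -> bool:
    """Heuristic: the agent repeated the same action `window` times in a row."""
    if len(action_history) < window:
        return False
    recent = action_history[-window:]
    return all(a == recent[0] for a in recent)

# Inside the agent loop:
# if is_stuck(action_history):
#     action = vlm.generate_action(
#         task, screenshot, action_history,
#         hint="The previous action had no effect; try a different approach.")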

How well are current models doing? Two standardised benchmarks provide concrete measurements:

OSWorld (Xie et al., 2024) evaluates computer use agents across full desktop operating systems (Ubuntu, Windows, macOS). The benchmark includes real-world tasks like installing software, configuring system settings, editing documents in LibreOffice, and managing files. Each task is run in a real virtual machine, and success is measured by checking whether the final system state matches the expected outcome. Current best models achieve roughly 30-40% success rates on these tasks — far from human performance, which is above 90%. The gap is largest on tasks that require many steps, precise spatial interaction, or recovery from unexpected intermediate states.
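
The state-based scoring can be pictured as a checker that runs against the virtual machine after the agent finishes — success depends on the final state, not on which clicks produced it. The helpers and task below are a generic sketch, not OSWorld's actual API:

def evaluate_task(vm) -> bool:
    """Hypothetical check for the task 'rename the report to Q3 Report'."""
    # Inspect the final VM state directly, e.g. via a file listing,
    # rather than replaying or judging the agent's individual actions.
    files = vm.list_files("/home/user/Documents")   # assumed VM helper
    return "Q3 Report.odt" in files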

WebArena (Zhou et al., 2023) focuses specifically on web-based tasks. It sets up realistic self-hosted web applications — an e-commerce site, a forum, a content management system, a project tracker — and asks agents to accomplish tasks within them: find a product and add it to the cart, post a reply to a forum thread, update a CMS page. WebArena is particularly challenging because the web applications are fully functional replicas with realistic complexity, not simplified toy versions.

These benchmarks reveal that while computer use is functional, it is far from solved. The most common failure modes are grounding errors (clicking the wrong element), getting stuck (repeating the same ineffective action), and failing to recover from unexpected pages or pop-ups. Improving these failure modes is an active area of research.

The Spectrum of Tool Use

Now that we've seen the full range — from structured function calls in article 2, through MCP in article 4, to computer use in this article — it's worth placing these approaches on a spectrum. They form a clear progression from most structured and reliable to most general and flexible:

  • API / function calling: the model outputs structured JSON that maps directly to a function signature. Fast (milliseconds per call), reliable (well-defined inputs and outputs), and deterministic (same input always calls the same function). But it requires someone to have built an API for the specific operation you need.
  • MCP tools: the same structured approach, but tools are discovered dynamically through a standard protocol. This solves the integration problem (one protocol to rule them all) but is still fundamentally API-based: every capability must be explicitly implemented as an MCP server.
  • Code execution: the model writes and runs code. Far more flexible than pre-defined functions, because the model can compose arbitrary logic. But it requires a runtime environment, and the model must generate syntactically and semantically correct code.
  • Computer use: the model sees a screen and interacts via mouse and keyboard. The most general approach — anything with a GUI is accessible. But it's the slowest (seconds per action), least reliable (pixel-level precision is hard), and most brittle (the same website can look different on different days).
More structured                              More general
More reliable                                Less reliable
Faster                                       Slower
────────────────────────────────────────────────────────────

  Function       MCP          Code          Computer
  Calling       Tools       Execution         Use

  JSON →        JSON →      code →         screenshot →
  function      server      runtime        click/type

  ~100ms        ~100ms      ~1-10s         ~1-5s/step
  per call      per call    per run        (minutes total)

────────────────────────────────────────────────────────────
Requires API    Requires    Requires       Requires only
to exist        MCP server  runtime        a screen

The right choice depends on what's available. If the system you're integrating with has a well-documented API, use function calling or MCP — it will be faster, more reliable, and easier to debug. If you need custom logic that no single API call can express, code execution lets the model compose arbitrary solutions. And if the only interface is a GUI, computer use is your only option.

But the most capable production systems don't choose just one. They combine multiple levels. An agent might use API calls for well-supported operations (send an email via the Gmail API), code execution for data processing (parse a CSV and compute statistics), and computer use for the one step that has no programmatic interface (navigate an internal HR portal to submit a time-off request). The model dynamically selects the most appropriate tool type for each sub-task, preferring structured methods when available and falling back to computer use only when necessary.
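
A sketch of that selection logic — a hypothetical dispatcher in which every name is illustrative; the point is the preference order, not the API:

def run_subtask(subtask, interfaces):
    """Route a sub-task to the cheapest interface that can handle it."""
    if subtask.api_operation and "api" in interfaces:
        # Structured and fast: a documented endpoint exists for this step.
        return interfaces["api"].call(subtask.api_operation)
    if subtask.needs_custom_logic and "sandbox" in interfaces:
        # Flexible: let the model compose code and run it in a sandbox.
        code = model.write_code(subtask)
        return interfaces["sandbox"].run(code)
    # Last resort: no programmatic interface exists, so fall back to
    # the screenshot-action loop from earlier in this article.
    return computer_use_agent(subtask.description, interfaces["desktop"])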

The frontier of this work is models that make this transition seamlessly: within a single task, the model might call an API, write some code to process the results, and then switch to computer use to paste the output into a GUI form — all without the user needing to configure which approach to use for which step. The model itself decides, based on what tools are available and what the task requires.

💡 Think of it like a human worker's approach: you'd use a command-line tool if one exists, write a quick script for data processing, and only open the browser for tasks that require a GUI. The most effective agents mirror this pragmatism, using the most efficient interface available for each sub-task.

Quiz

Test your understanding of computer use and GUI agents.

What is the fundamental observation type that distinguishes computer use from other forms of tool use?

What is 'grounding' in the context of GUI agents?

Why is computer use significantly slower than API-based tool use?

When should an agent prefer computer use over API-based tool calling?