AutoGUI: A Vendor-Neutral Desktop Automation Agent for LLMs
Most LLM agents can read files, call APIs, and run shell commands, but they have no reliable way to operate a graphical desktop. They cannot click a button in a running application, verify that a dialog appeared, fill a form field, or observe what is currently on screen. AutoGUI is a research prototype that fills that gap. It connects any OpenAI-compatible LLM — including models served locally through OpenWebUI or directly through Ollama — to a full suite of OS-level desktop controls via a ReAct-style agentic loop.
Experimental software — use at your own risk. AutoGUI is a research prototype. It is not intended for, and has not been evaluated or deemed suitable for, any particular purpose, production use, or critical workload. No warranty is provided, express or implied. The agent operates at OS level and can run shell commands, click anything, type anywhere, read and write files, and take screenshots. Run it only in a sandbox, VM, or container that you are willing to reset. Restrict the REST API to loopback (AUTOGUI_API_HOST=127.0.0.1) and consider disabling shell access ("allowed_shell": false) if you do not fully trust the task or the model driving it.
Two Delivery Modes
AutoGUI ships in two forms. The standalone Python CLI/TUI agent connects any OpenWebUI instance (or any OpenAI-compatible endpoint) to your desktop. The native TypeScript Pi extension brings the same tools into the Pi terminal harness behind a single /autogui command, with no dependency on the Python agent or OpenWebUI.
Both share the same tool surface — shell execution, filesystem access, pixel and accessibility-tree clicking, Playwright browser automation — but they differ in how the LLM loop is owned. The standalone agent runs its own ReAct loop; the Pi extension delegates loop ownership to Pi.
Architecture
The standalone Python agent is organized around five main components.
main.py Entry point — validation, component wiring, TUI/CLI dispatch
│
├── agent.py ReAct loop + typed-plan controller
│ └─ controller Preflight, predicate checks, replan-on-block, budget ceilings
│
├── tools.py Tool registry
│ ├─ shell_run / fs_read / fs_write / fs_list
│ ├─ desktop_screenshot / click / type / hotkey / scroll / launch
│ ├─ desktop_click_element (a11y-first: UIAutomation / AT-SPI)
│ ├─ desktop_click_mark (Set-of-Mark grounding)
│ ├─ browser_navigate / click / fill / eval (Playwright)
│ ├─ skill_save / skill_list / skill_run
│ └─ memory_get / memory_note
│
├── backends/ Platform-specific automation backends
│ ├─ windows.py UIAutomation + SendInput (user32)
│ ├─ macos.py screencapture + osascript
│ ├─ linux_x11.py xdotool + wmctrl
│ └─ linux_wayland.py grim + ydotool + swaymsg
│
├── api.py FastAPI REST server (auto-started in background)
│ ├─ POST /api/task Submit task
│ ├─ GET /api/task/{id}/stream SSE live event stream
│ └─ GET /api/healthz
│
└── tui.py Textual TUI (status bar, model picker, tool visibility toggle)
The agentic loop in agent.py follows a standard ReAct pattern — append the user message to history, POST the history and tool schemas to the LLM, receive either a tool call or a stop response, execute the tool, append the result, and repeat. On top of that loop sits the controller, which adds:
- Planner — one extra LLM call up front that produces a numbered, high-level plan injected as a [PLAN] block into the executor's context
- Plan critique — a second LLM call that reviews the plan and returns a revised version when issues are found
- Preflight — before the first state-changing action, verifies that apps are on PATH, files exist, URLs are TCP-reachable, and named tools are registered
- Predicate checks — when a plan step declares a typed post-condition (window_title_contains, file_exists, text_visible, etc.), the controller verifies it deterministically after each step completes
- Replan-on-block — when a step is classified as BLOCKED, the controller re-invokes the planner with the failure reason as context
- Visual diff — perceptual hash of pre/post screenshots flags silent no-ops where a state-changing action left the screen unchanged
- Watchdog — detects when the loop is stuck by hashing the per-iteration signature and routing repeated matches through the BLOCKED path
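To make the control flow concrete, here is a heavily simplified sketch of a ReAct loop with one controller feature, the deterministic predicate check, wired in. The function names (call_llm, execute_tool), the message shapes, and the predicate registry are illustrative stand-ins, not AutoGUI's actual internals.

```python
import os

# Typed post-conditions the controller can verify deterministically.
# Only file_exists is implemented here; the others would query a backend.
PREDICATES = {
    "file_exists": lambda arg: os.path.exists(arg),
}

def run_task(task, call_llm, execute_tool, max_iters=30):
    """Sketch of a ReAct loop; call_llm and execute_tool are stand-ins."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_iters):                      # budget ceiling
        reply = call_llm(history)                   # POST history + tool schemas
        if reply.get("done"):                       # stop response: task complete
            return reply["content"]
        call = reply["tool_call"]
        result = execute_tool(call["name"], call["args"])
        history.append({"role": "tool", "content": str(result)})
        post = call.get("post")                     # typed post-condition, if any
        if post and not PREDICATES[post["kind"]](post["arg"]):
            # Predicate failed: mark the step BLOCKED so the planner can replan
            history.append({"role": "user",
                            "content": f"BLOCKED: post-condition {post} not met"})
    raise RuntimeError("iteration budget exhausted")
```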
Platform Support
The correct backend is detected automatically at startup via platform_detect.detect(). No configuration is required.
| Platform | Screenshot | Click/Type | Accessibility Tree |
|---|---|---|---|
| Windows | pyautogui | SendInput (user32) | UIAutomation |
| WSL | pyautogui | pyautogui | PowerShell UIAutomation |
| macOS | screencapture | pyautogui | osascript |
| Linux X11 | pyautogui | xdotool | AT-SPI |
| Linux Wayland | grim | ydotool | AT-SPI |
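The detection logic itself is not shown in this post. As a rough sketch, the usual signals are sys.platform, the kernel release string for WSL, and the session-type environment variables for Wayland; something along these lines (an assumption, not AutoGUI's actual platform_detect):

```python
import os
import platform
import sys

def detect() -> str:
    # Illustrative detection order; the real platform_detect may differ.
    if sys.platform == "win32":
        return "windows"
    if sys.platform == "darwin":
        return "macos"
    # WSL reports itself as Linux but mentions Microsoft in the kernel string
    if "microsoft" in platform.uname().release.lower():
        return "wsl"
    if os.environ.get("XDG_SESSION_TYPE") == "wayland" or os.environ.get("WAYLAND_DISPLAY"):
        return "linux_wayland"
    return "linux_x11"
```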
On Windows, click and type operations go through user32.SendInput directly via ctypes, producing real INPUT events indistinguishable from a physical keyboard and mouse, with correct per-monitor DPI handling and full Unicode support. On Linux, the desktop_click_element tool talks to the accessibility tree via AT-SPI, letting the agent click real UI controls by name and role rather than guessing pixel positions.
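For readers who have not used SendInput from Python, the sketch below shows the general ctypes pattern for a synthetic left click. It is deliberately simplified (primary monitor only, no per-monitor DPI awareness, no Unicode typing) and is not AutoGUI's windows.py:

```python
import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32

INPUT_MOUSE = 0
MOUSEEVENTF_MOVE, MOUSEEVENTF_ABSOLUTE = 0x0001, 0x8000
MOUSEEVENTF_LEFTDOWN, MOUSEEVENTF_LEFTUP = 0x0002, 0x0004
ULONG_PTR = ctypes.c_size_t  # pointer-sized unsigned integer

class MOUSEINPUT(ctypes.Structure):
    _fields_ = [("dx", wintypes.LONG), ("dy", wintypes.LONG),
                ("mouseData", wintypes.DWORD), ("dwFlags", wintypes.DWORD),
                ("time", wintypes.DWORD), ("dwExtraInfo", ULONG_PTR)]

class INPUT(ctypes.Structure):
    class _U(ctypes.Union):
        _fields_ = [("mi", MOUSEINPUT)]
    _anonymous_ = ("u",)
    _fields_ = [("type", wintypes.DWORD), ("u", _U)]

def left_click(x: int, y: int) -> None:
    # SendInput absolute coordinates are normalized to the 0..65535 range
    w, h = user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)
    nx, ny = x * 65535 // w, y * 65535 // h
    events = (INPUT * 3)()
    for ev, (dx, dy, flags) in zip(events, [
        (nx, ny, MOUSEEVENTF_MOVE | MOUSEEVENTF_ABSOLUTE),
        (0, 0, MOUSEEVENTF_LEFTDOWN),
        (0, 0, MOUSEEVENTF_LEFTUP),
    ]):
        ev.type = INPUT_MOUSE
        ev.mi = MOUSEINPUT(dx, dy, 0, flags, 0, 0)
    user32.SendInput(3, events, ctypes.sizeof(INPUT))
```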
A11y-First Clicking and Set-of-Mark Grounding
AutoGUI implements two layers above raw pixel clicking.
desktop_click_element(name, control_type) resolves the target UI control through the OS accessibility API and clicks it by logical identity rather than screen position. This survives window moves, DPI scaling, and async UI redraws that would break a coordinate-based click.
For cases where the accessibility tree is sparse or unavailable, Set-of-Mark grounding overlays numbered boxes on detected UI elements in a screenshot. The model selects an element by ID via desktop_click_mark(mark_id) rather than guessing coordinates. The fallback ladder is: desktop_click_element → desktop_click_text (OCR anchor) → desktop_click_mark → desktop_click(x, y).
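Expressed as code, the ladder is a chain of guarded attempts. The wrapper below is a hypothetical illustration; the dict-based dispatch, boolean return convention, and exception handling are assumptions, not AutoGUI's internals:

```python
from typing import Callable, Optional

def click_robust(tools: dict[str, Callable], target: str,
                 xy: Optional[tuple[int, int]] = None) -> bool:
    """Fallback ladder: a11y element -> OCR text -> Set-of-Mark -> raw pixels.

    `tools` maps the tool names from the post to callables.
    """
    for name in ("desktop_click_element", "desktop_click_text", "desktop_click_mark"):
        try:
            if tools[name](target):
                return True
        except Exception:
            continue  # grounding layer unavailable or target not found
    if xy is not None:
        return tools["desktop_click"](*xy)  # last resort: raw coordinates
    return False
```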
Skill Library and App Memory
Two optional persistence layers accumulate knowledge across tasks.
The skill library records successful tool sequences with skill_save and retrieves them by keyword with skill_list and skill_run. At the start of each task the planner receives the top-3 matching skills as few-shot exemplars, so recurring workflows get faster and more reliable over time. Skill creation is gated by agent.skills_enabled (default false); reads always work.
The app-memory store records per-app quirks, failure counts, and free-form notes via memory_note. At the start of each task the planner receives app memory hints for any visible applications, biasing plans toward strategies that worked before. Memory creation is gated by agent.memory.enabled (default false); reads always work.
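Neither store's schema is documented in this post. As a rough mental model, a skill library can be as simple as a JSON file with keyword matching; everything below (file name, field names, scoring) is assumed for illustration:

```python
import json
import pathlib
import time

STORE = pathlib.Path("skills.json")  # hypothetical location

def skill_save(name: str, keywords: list[str], steps: list[dict]) -> None:
    # Append a successful tool sequence (gated by agent.skills_enabled).
    skills = json.loads(STORE.read_text()) if STORE.exists() else []
    skills.append({"name": name, "keywords": keywords,
                   "steps": steps, "saved_at": time.time()})
    STORE.write_text(json.dumps(skills, indent=2))

def skill_list(query: str, top_k: int = 3) -> list[dict]:
    # Naive keyword overlap; the planner would receive the top matches
    # as few-shot exemplars at the start of a task.
    skills = json.loads(STORE.read_text()) if STORE.exists() else []
    words = set(query.lower().split())
    scored = [(len(words & set(s["keywords"])), s) for s in skills]
    return [s for score, s in sorted(scored, key=lambda t: -t[0])[:top_k] if score]
```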
REST API and TUI
The FastAPI REST server starts automatically in the background whenever you run main.py. It exposes task submission, SSE live event streaming, and a liveness probe, making AutoGUI accessible to web UIs, scripts, and CI pipelines without the terminal UI.
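A script might drive it like this; the port and the request/response field names are assumptions, since only the endpoints themselves are documented above:

```python
import json
import requests

BASE = "http://127.0.0.1:8000"  # port is an assumption; check your config

# Submit a task (payload shape is a guess; the post only names the endpoint)
resp = requests.post(f"{BASE}/api/task", json={"task": "open the calculator"})
task_id = resp.json()["id"]  # "id" field assumed

# Follow the SSE live event stream line by line
with requests.get(f"{BASE}/api/task/{task_id}/stream", stream=True) as stream:
    for line in stream.iter_lines():
        if line.startswith(b"data:"):
            print(json.loads(line[5:]))
```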
The Textual TUI provides a scrollable conversation pane with a status bar showing the current model name, conversation length, and tool visibility state. The live model picker (Ctrl+P → “Change Model”) fetches the current model list from the server; selecting a model takes effect immediately and can optionally be persisted to config.json.
Experimental Nature and Safety Considerations
AutoGUI is a research prototype, not a production tool. A few properties of the current design are worth understanding before running it.
The destructive command guard in shell_run blocks patterns like rm -rf, DROP TABLE, and dd if=, but it is a regex filter, not a sandbox. For untrusted tasks, set "allowed_shell": false or run the agent inside a container with a disposable filesystem.
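To see why a regex filter is not a sandbox, consider an illustrative guard (the exact patterns and the bypass are invented for this example; AutoGUI's deny-list is not shown here):

```python
import re

# Illustrative deny-list built from the patterns named above.
DESTRUCTIVE = [
    r"\brm\s+-rf\b",
    r"\bDROP\s+TABLE\b",
    r"\bdd\s+if=",
]

def is_blocked(command: str) -> bool:
    return any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE)

# A regex filter is trivially bypassable, e.g. by encoding the command:
assert is_blocked("rm -rf /tmp/x")
assert not is_blocked('sh -c "$(echo cm0gLXJmIC8= | base64 -d)"')  # same effect, no match
```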
The REST API has no authentication and binds to 0.0.0.0 by default. Set AUTOGUI_API_HOST=127.0.0.1 for loopback-only access or AUTOGUI_DISABLE_API=1 to disable the background API entirely.
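For example:

```sh
AUTOGUI_API_HOST=127.0.0.1 python main.py   # bind the API to loopback only
AUTOGUI_DISABLE_API=1 python main.py        # or skip the background API entirely
```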
The agent operates at OS level. It can click anything, type anywhere, read and write files, and take screenshots. The safety countdown (safety.command_confirm_delay_seconds, default 5 seconds) gives a visible window before each tool dispatch, with Escape-to-cancel, but it is not a substitute for running in an environment you are prepared to reset.
For the same reason, AutoGUI should not be pointed at tasks that involve sensitive data or credentials unless the environment is appropriately isolated. Screen content captured during observation passes through the model’s context window; if you are using a cloud-hosted model, treat everything visible on screen as potentially logged.
Getting Started
Clone the repository from https://github.com/BillJr99/AutoGUI, install the Python dependencies, copy config.json.example to config.json, and set your OpenWebUI URL, API key, and model. Run python main.py --check to verify connectivity before launching the TUI or issuing single-command tasks.
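In concrete terms, assuming a standard pip-based setup (the exact install step may differ; check the repository README):

```sh
git clone https://github.com/BillJr99/AutoGUI
cd AutoGUI
pip install -r requirements.txt    # install step assumed; see the README
cp config.json.example config.json # then set OpenWebUI URL, API key, and model
python main.py --check             # verify connectivity
python main.py                     # launch the TUI
```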
The optional install script at scripts/install-dependencies.sh (or .cmd/.ps1 on Windows) adds Tesseract for OCR-anchored clicking, Playwright for browser automation, AT-SPI on Linux for accessibility-tree element clicking, and ImageMagick for Set-of-Mark overlays and failure GIF recording.
A pytest suite under tests/ exercises the controller, predicates, budget ceilings, preflight, and visual diff modules with no live model and no desktop required — useful for validating changes to the orchestration logic without burning real API calls.
