AutoGUI: A Vendor-Neutral Desktop Automation Agent for LLMs
Most LLM agents can read files, call APIs, and run shell commands, but they have no reliable way to operate a graphical desktop. They cannot click a button in a running application, verify that a dialog appeared, fill a form field, or observe what is currently on screen. AutoGUI is a research prototype that fills that gap. It connects any OpenAI-compatible LLM — including models served locally through OpenWebUI or directly through Ollama — to a full suite of OS-level desktop controls via a ReAct-style agentic loop.
Experimental software — use at your own risk. AutoGUI is a research prototype. It is not intended for, and has not been evaluated or deemed suitable for, any particular purpose, production use, or critical workload. No warranty is provided, express or implied. The agent operates at OS level and can run shell commands, click anything, type anywhere, read and write files, and take screenshots. Run it only in a sandbox, VM, or container that you are willing to reset. Restrict the REST API to loopback (AUTOGUI_API_HOST=127.0.0.1) and consider disabling shell access ("allowed_shell": false) if you do not fully trust the task or the model driving it.
Two Delivery Modes
AutoGUI ships in two forms. The standalone Python CLI/TUI agent connects any OpenWebUI instance (or any OpenAI-compatible endpoint) to your desktop. The native TypeScript Pi extension brings the same tools into the Pi terminal harness behind a single /autogui command, with no dependency on the Python agent or OpenWebUI.
Both share the same tool surface — shell execution, filesystem access, pixel and accessibility-tree clicking, Playwright browser automation — but they differ in how the LLM loop is owned. The standalone agent runs its own ReAct loop; the Pi extension delegates loop ownership to Pi.
Architecture
The standalone Python agent is organized around five main components.
main.py Entry point — validation, component wiring, TUI/CLI dispatch
│
├── agent.py ReAct loop + typed-plan controller
│ └─ controller Preflight, predicate checks, replan-on-block, budget ceilings
│
├── tools.py Tool registry
│ ├─ shell_run / fs_read / fs_write / fs_list
│ ├─ desktop_screenshot / click / type / hotkey / scroll / launch
│ ├─ desktop_click_element (a11y-first: UIAutomation / AT-SPI)
│ ├─ desktop_click_mark (Set-of-Mark grounding)
│ ├─ browser_navigate / click / fill / eval (Playwright)
│ ├─ skill_save / skill_list / skill_run
│ └─ memory_get / memory_note
│
├── backends/ Platform-specific automation backends
│ ├─ windows.py UIAutomation + SendInput (user32)
│ ├─ macos.py screencapture + osascript
│ ├─ linux_x11.py xdotool + wmctrl
│ └─ linux_wayland.py grim + ydotool + swaymsg
│
├── api.py FastAPI REST server (auto-started in background)
│ ├─ POST /api/task Submit task
│ ├─ GET /api/task/{id}/stream SSE live event stream
│ └─ GET /api/healthz
│
└── tui.py Textual TUI (status bar, model picker, tool visibility toggle)
The agentic loop in agent.py follows a standard ReAct pattern — append the user message to history, POST the history and tool schemas to the LLM, receive either a tool call or a stop response, execute the tool, append the result, and repeat. On top of that loop sits the controller, which adds:
- Planner — one extra LLM call up front that produces a numbered, high-level plan injected as a [PLAN] block into the executor's context
- Plan critique — a second LLM call that reviews the plan and returns a revised version when issues are found
- Preflight — before the first state-changing action, verifies that apps are on PATH, files exist, URLs are TCP-reachable, and named tools are registered
- Predicate checks — when a plan step declares a typed post-condition (window_title_contains, file_exists, text_visible, etc.), the controller verifies it deterministically after each step completes
- Replan-on-block — when a step is classified as BLOCKED, the controller re-invokes the planner with the failure reason as context
- Visual diff — perceptual hash of pre/post screenshots flags silent no-ops where a state-changing action left the screen unchanged
- Watchdog — detects when the loop is stuck by hashing the per-iteration signature and routing repeated matches through the BLOCKED path
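To make the control flow concrete, here is a heavily simplified sketch of a ReAct loop with one controller feature, the deterministic predicate check, wired in. The function names (call_llm, execute_tool), the message shapes, and the predicate registry are illustrative stand-ins, not AutoGUI's actual internals.

```python
import os

# Typed post-conditions the controller can verify deterministically.
# Only file_exists is implemented here; the others would query a backend.
PREDICATES = {
    "file_exists": lambda arg: os.path.exists(arg),
}

def run_task(task, call_llm, execute_tool, max_iters=30):
    """Sketch of a ReAct loop; call_llm and execute_tool are stand-ins."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_iters):                      # budget ceiling
        reply = call_llm(history)                   # POST history + tool schemas
        if reply.get("done"):                       # stop response: task complete
            return reply["content"]
        call = reply["tool_call"]
        result = execute_tool(call["name"], call["args"])
        history.append({"role": "tool", "content": str(result)})
        post = call.get("post")                     # typed post-condition, if any
        if post and not PREDICATES[post["kind"]](post["arg"]):
            # Predicate failed: mark the step BLOCKED so the planner can replan
            history.append({"role": "user",
                            "content": f"BLOCKED: post-condition {post} not met"})
    raise RuntimeError("iteration budget exhausted")
```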
Platform Support
The correct backend is detected automatically at startup via platform_detect.detect(). No configuration is required.
| Platform | Screenshot | Click/Type | Accessibility Tree |
|---|---|---|---|
| Windows | pyautogui | SendInput (user32) | UIAutomation |
| WSL | pyautogui | pyautogui | PowerShell UIAutomation |
| macOS | screencapture | pyautogui | osascript |
| Linux X11 | pyautogui | xdotool | AT-SPI |
| Linux Wayland | grim | ydotool | AT-SPI |
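The detection logic itself is not shown in this post. As a rough sketch, the usual signals are sys.platform, the kernel release string for WSL, and the session-type environment variables for Wayland; something along these lines (an assumption, not AutoGUI's actual platform_detect):

```python
import os
import platform
import sys

def detect() -> str:
    # Illustrative detection order; the real platform_detect may differ.
    if sys.platform == "win32":
        return "windows"
    if sys.platform == "darwin":
        return "macos"
    # WSL reports itself as Linux but mentions Microsoft in the kernel string
    if "microsoft" in platform.uname().release.lower():
        return "wsl"
    if os.environ.get("XDG_SESSION_TYPE") == "wayland" or os.environ.get("WAYLAND_DISPLAY"):
        return "linux_wayland"
    return "linux_x11"
```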
On Windows, click and type operations go through user32.SendInput directly via ctypes, producing real INPUT events indistinguishable from a physical keyboard and mouse, with correct per-monitor DPI handling and full Unicode support. On Linux, the desktop_click_element tool talks to the accessibility tree via AT-SPI, letting the agent click real UI controls by name and role rather than guessing pixel positions.
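For readers who have not used SendInput from Python, the sketch below shows the general ctypes pattern for a synthetic left click. It is deliberately simplified (primary monitor only, no per-monitor DPI awareness, no Unicode typing) and is not AutoGUI's windows.py:

```python
import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32

INPUT_MOUSE = 0
MOUSEEVENTF_MOVE, MOUSEEVENTF_ABSOLUTE = 0x0001, 0x8000
MOUSEEVENTF_LEFTDOWN, MOUSEEVENTF_LEFTUP = 0x0002, 0x0004
ULONG_PTR = ctypes.c_size_t  # pointer-sized unsigned integer

class MOUSEINPUT(ctypes.Structure):
    _fields_ = [("dx", wintypes.LONG), ("dy", wintypes.LONG),
                ("mouseData", wintypes.DWORD), ("dwFlags", wintypes.DWORD),
                ("time", wintypes.DWORD), ("dwExtraInfo", ULONG_PTR)]

class INPUT(ctypes.Structure):
    class _U(ctypes.Union):
        _fields_ = [("mi", MOUSEINPUT)]
    _anonymous_ = ("u",)
    _fields_ = [("type", wintypes.DWORD), ("u", _U)]

def left_click(x: int, y: int) -> None:
    # SendInput absolute coordinates are normalized to the 0..65535 range
    w, h = user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)
    nx, ny = x * 65535 // w, y * 65535 // h
    events = (INPUT * 3)()
    for ev, (dx, dy, flags) in zip(events, [
        (nx, ny, MOUSEEVENTF_MOVE | MOUSEEVENTF_ABSOLUTE),
        (0, 0, MOUSEEVENTF_LEFTDOWN),
        (0, 0, MOUSEEVENTF_LEFTUP),
    ]):
        ev.type = INPUT_MOUSE
        ev.mi = MOUSEINPUT(dx, dy, 0, flags, 0, 0)
    user32.SendInput(3, events, ctypes.sizeof(INPUT))
```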
A11y-First Clicking and Set-of-Mark Grounding
AutoGUI implements two layers above raw pixel clicking.
desktop_click_element(name, control_type) resolves the target UI control through the OS accessibility API and clicks it by logical identity rather than screen position. This survives window moves, DPI scaling, and async UI redraws that would break a coordinate-based click.
For cases where the accessibility tree is sparse or unavailable, Set-of-Mark grounding overlays numbered boxes on detected UI elements in a screenshot. The model selects an element by ID via desktop_click_mark(mark_id) rather than guessing coordinates. The fallback ladder is: desktop_click_element → desktop_click_text (OCR anchor) → desktop_click_mark → desktop_click(x, y).
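Expressed as code, the ladder is a chain of guarded attempts. The wrapper below is a hypothetical illustration; the dict-based dispatch, boolean return convention, and exception handling are assumptions, not AutoGUI's internals:

```python
from typing import Callable, Optional

def click_robust(tools: dict[str, Callable], target: str,
                 xy: Optional[tuple[int, int]] = None) -> bool:
    """Fallback ladder: a11y element -> OCR text -> Set-of-Mark -> raw pixels.

    `tools` maps the tool names from the post to callables.
    """
    for name in ("desktop_click_element", "desktop_click_text", "desktop_click_mark"):
        try:
            if tools[name](target):
                return True
        except Exception:
            continue  # grounding layer unavailable or target not found
    if xy is not None:
        return tools["desktop_click"](*xy)  # last resort: raw coordinates
    return False
```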
Skill Library and App Memory
Two optional persistence layers accumulate knowledge across tasks.
The skill library records successful tool sequences with skill_save and retrieves them by keyword with skill_list and skill_run. At the start of each task the planner receives the top-3 matching skills as few-shot exemplars, so recurring workflows get faster and more reliable over time. Skill creation is gated by agent.skills_enabled (default false); reads always work.
The app-memory store records per-app quirks, failure counts, and free-form notes via memory_note. At the start of each task the planner receives app memory hints for any visible applications, biasing plans toward strategies that worked before. Memory creation is gated by agent.memory.enabled (default false); reads always work.
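Neither store's schema is documented in this post. As a rough mental model, a skill library can be as simple as a JSON file with keyword matching; everything below (file name, field names, scoring) is assumed for illustration:

```python
import json
import pathlib
import time

STORE = pathlib.Path("skills.json")  # hypothetical location

def skill_save(name: str, keywords: list[str], steps: list[dict]) -> None:
    # Append a successful tool sequence (gated by agent.skills_enabled).
    skills = json.loads(STORE.read_text()) if STORE.exists() else []
    skills.append({"name": name, "keywords": keywords,
                   "steps": steps, "saved_at": time.time()})
    STORE.write_text(json.dumps(skills, indent=2))

def skill_list(query: str, top_k: int = 3) -> list[dict]:
    # Naive keyword overlap; the planner would receive the top matches
    # as few-shot exemplars at the start of a task.
    skills = json.loads(STORE.read_text()) if STORE.exists() else []
    words = set(query.lower().split())
    scored = [(len(words & set(s["keywords"])), s) for s in skills]
    return [s for score, s in sorted(scored, key=lambda t: -t[0])[:top_k] if score]
```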
REST API and TUI
The FastAPI REST server starts automatically in the background whenever you run main.py. It exposes task submission, SSE live event streaming, and a liveness probe, making AutoGUI accessible to web UIs, scripts, and CI pipelines without the terminal UI.
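A script might drive it like this; the port and the request/response field names are assumptions, since only the endpoints themselves are documented above:

```python
import json
import requests

BASE = "http://127.0.0.1:8000"  # port is an assumption; check your config

# Submit a task (payload shape is a guess; the post only names the endpoint)
resp = requests.post(f"{BASE}/api/task", json={"task": "open the calculator"})
task_id = resp.json()["id"]  # "id" field assumed

# Follow the SSE live event stream line by line
with requests.get(f"{BASE}/api/task/{task_id}/stream", stream=True) as stream:
    for line in stream.iter_lines():
        if line.startswith(b"data:"):
            print(json.loads(line[5:]))
```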
The Textual TUI provides a scrollable conversation pane with a status bar showing the current model name, conversation length, and tool visibility state. The live model picker (Ctrl+P → “Change Model”) fetches the current model list from the server; selecting a model takes effect immediately and can optionally be persisted to config.json.
Experimental Nature and Safety Considerations
AutoGUI is a research prototype, not a production tool. A few properties of the current design are worth understanding before running it.
The destructive command guard in shell_run blocks patterns like rm -rf, DROP TABLE, and dd if=, but it is a regex filter, not a sandbox. For untrusted tasks, set "allowed_shell": false or run the agent inside a container with a disposable filesystem.
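To see why a regex filter is not a sandbox, consider an illustrative guard (the exact patterns and the bypass are invented for this example; AutoGUI's deny-list is not shown here):

```python
import re

# Illustrative deny-list built from the patterns named above.
DESTRUCTIVE = [
    r"\brm\s+-rf\b",
    r"\bDROP\s+TABLE\b",
    r"\bdd\s+if=",
]

def is_blocked(command: str) -> bool:
    return any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE)

# A regex filter is trivially bypassable, e.g. by encoding the command:
assert is_blocked("rm -rf /tmp/x")
assert not is_blocked('sh -c "$(echo cm0gLXJmIC8= | base64 -d)"')  # same effect, no match
```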
The REST API has no authentication and binds to 0.0.0.0 by default. Set AUTOGUI_API_HOST=127.0.0.1 for loopback-only access or AUTOGUI_DISABLE_API=1 to disable the background API entirely.
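For example:

```sh
AUTOGUI_API_HOST=127.0.0.1 python main.py   # bind the API to loopback only
AUTOGUI_DISABLE_API=1 python main.py        # or skip the background API entirely
```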
The agent operates at OS level. It can click anything, type anywhere, read and write files, and take screenshots. The safety countdown (safety.command_confirm_delay_seconds, default 5 seconds) gives a visible window before each tool dispatch, with Escape-to-cancel, but it is not a substitute for running in an environment you are prepared to reset.
For the same reason, AutoGUI should not be pointed at tasks that involve sensitive data or credentials unless the environment is appropriately isolated. Screen content captured during observation passes through the model’s context window; if you are using a cloud-hosted model, treat everything visible on screen as potentially logged.
Getting Started
Clone the repository from https://github.com/BillJr99/AutoGUI, install the Python dependencies, copy config.json.example to config.json, and set your OpenWebUI URL, API key, and model. Run python main.py --check to verify connectivity before launching the TUI or issuing single-command tasks.
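In concrete terms, assuming a standard pip-based setup (the exact install step may differ; check the repository README):

```sh
git clone https://github.com/BillJr99/AutoGUI
cd AutoGUI
pip install -r requirements.txt    # install step assumed; see the README
cp config.json.example config.json # then set OpenWebUI URL, API key, and model
python main.py --check             # verify connectivity
python main.py                     # launch the TUI
```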
The optional install script at scripts/install-dependencies.sh (or .cmd/.ps1 on Windows) adds Tesseract for OCR-anchored clicking, Playwright for browser automation, AT-SPI on Linux for accessibility-tree element clicking, and ImageMagick for Set-of-Mark overlays and failure GIF recording.
A pytest suite under tests/ exercises the controller, predicates, budget ceilings, preflight, and visual diff modules with no live model and no desktop required — useful for validating changes to the orchestration logic without burning real API calls.
