OSScreenObserver

OSScreenObserver is a Python prototype that exposes the operating system’s UI accessibility tree, textual descriptions, and ASCII spatial sketches through two simultaneous interfaces: a browser-based web inspector at localhost:5001 for human inspection, and an MCP (Model Context Protocol) stdio server compatible with Claude Desktop and Claude Code for AI agent integration. Both interfaces share the same underlying observer and can run at the same time.

The system supports three description modalities, and get_screen_description always returns every source that is available on the current platform in a single call. The accessibility tree modality traverses the OS accessibility API (UIA on Windows, AXUIElement on macOS, AT-SPI on Linux) to produce a structured JSON element hierarchy. The OCR modality uses Tesseract to extract text from a screenshot. The VLM modality optionally passes a screenshot to Claude Vision for a natural-language description. These results are combined and labeled by source in both the web inspector’s Description tab and in MCP tool responses.
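The "combined and labeled by source" behavior can be illustrated with a minimal sketch. The function name `describe_all`, the result keys, and the two stand-in source functions are assumptions made for illustration; they are not taken from the project's code.

```python
from typing import Callable, Dict


def describe_all(sources: Dict[str, Callable[[], str]]) -> Dict[str, Dict[str, object]]:
    """Run every description source and label each result by its source name.

    A source that raises (missing library, unsupported platform) is reported
    as unavailable instead of aborting the whole call.
    """
    results: Dict[str, Dict[str, object]] = {}
    for name, produce in sources.items():
        try:
            results[name] = {"available": True, "text": produce(), "error": None}
        except Exception as exc:  # degrade per source, keep the others
            results[name] = {
                "available": False,
                "text": None,
                "error": f"{type(exc).__name__}: {exc}",
            }
    return results


# Hypothetical stand-ins for the real modalities.
def accessibility_prose() -> str:
    return "Window 'Untitled' with 1 edit field and 5 menu items."


def ocr_text() -> str:
    # Simulates a machine where the OCR dependency is not installed.
    raise ImportError("OCR backend is not installed")


combined = describe_all({"accessibility": accessibility_prose, "ocr": ocr_text})
for source, result in combined.items():
    print(source, "->", result["text"] if result["available"] else result["error"])
```

Keeping one entry per source, including the failed ones, lets a caller see which modalities contributed to a description and which were skipped.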

The ASCII sketch renderer produces a Unicode box-drawing spatial layout diagram of a window’s element positions, suitable for consumption by a language model with no image input capability. All inputs and outputs degrade gracefully: if a library is not installed or a platform capability is unavailable, the server continues running and returns whatever it can.
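To make the idea of a box-drawing layout diagram concrete, here is a minimal sketch that scales pixel bounding boxes onto a character grid. The function `render_sketch`, its parameters, and the grid size are assumptions for illustration; the project's ascii_renderer may work quite differently.

```python
from typing import List, Tuple

# (label, x, y, width, height) in screen pixels; hypothetical element format.
Box = Tuple[str, int, int, int, int]


def render_sketch(boxes: List[Box], screen_w: int, screen_h: int,
                  cols: int = 40, rows: int = 12) -> str:
    """Draw each box on a cols x rows character grid using box-drawing glyphs."""
    grid = [[" "] * cols for _ in range(rows)]
    for label, x, y, w, h in boxes:
        # Scale pixel coordinates to grid cells and clamp to the grid.
        left = min(cols - 1, x * cols // screen_w)
        top = min(rows - 1, y * rows // screen_h)
        right = min(cols - 1, max(left + 1, (x + w) * cols // screen_w - 1))
        bottom = min(rows - 1, max(top + 1, (y + h) * rows // screen_h - 1))
        for c in range(left, right + 1):
            grid[top][c] = "─"
            grid[bottom][c] = "─"
        for r in range(top, bottom + 1):
            grid[r][left] = "│"
            grid[r][right] = "│"
        grid[top][left], grid[top][right] = "┌", "┐"
        grid[bottom][left], grid[bottom][right] = "└", "┘"
        # Write the label on the first interior row when the box is tall enough.
        if bottom - top >= 2:
            for i, ch in enumerate(label[: max(0, right - left - 1)]):
                grid[top + 1][left + 1 + i] = ch
    return "\n".join("".join(row).rstrip() for row in grid)


print(render_sketch(
    [("Window", 0, 0, 800, 600), ("OK", 500, 400, 200, 120)],
    screen_w=800, screen_h=600,
))
```

Later boxes overwrite earlier ones in this sketch, so passing elements in parent-before-child order keeps nested controls visible.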

Platform-specific adapters are provided for Windows (full UIA and pywinauto support), macOS (Quartz and pyobjc AX accessibility tree), Linux (wmctrl and pyatspi), and WSL (PowerShell fallback for screenshots and window enumeration when no X11 display is available). A mock adapter allows full development and testing without any OS-level access.
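Adapter selection of this kind is often done by inspecting the host platform at startup. The sketch below is one plausible approach, not the project's actual logic: the function names, the returned labels, and the `OBSERVER_MOCK` environment variable are hypothetical. The WSL check relies on the common heuristic that WSL kernels include "microsoft" in their release string.

```python
import os
import platform
from typing import Mapping, Optional


def detect_platform(system: Optional[str] = None,
                    release: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return a label naming which adapter to use (labels are hypothetical)."""
    system = platform.system() if system is None else system
    release = platform.release() if release is None else release
    env = os.environ if env is None else env

    if env.get("OBSERVER_MOCK") == "1":  # hypothetical opt-in for the mock adapter
        return "mock"
    if system == "Windows":
        return "windows"
    if system == "Darwin":
        return "macos"
    if system == "Linux":
        # Heuristic: WSL kernel release strings contain "microsoft".
        return "wsl" if "microsoft" in release.lower() else "linux"
    return "mock"  # unknown platform: fall back to the mock adapter


def needs_powershell_fallback(adapter: str, env: Mapping[str, str]) -> bool:
    """On WSL, fall back to PowerShell when no X11 display is configured."""
    return adapter == "wsl" and not env.get("DISPLAY")


adapter = detect_platform()
print("selected adapter:", adapter)
```

Accepting the platform values as optional parameters keeps the selection logic testable on any machine, which fits the role the mock adapter plays in development.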

The MCP integration enables AI agents to list visible windows, retrieve accessibility trees, request ASCII sketches and screenshots, obtain combined textual descriptions, enumerate visible bounding boxes, bring windows to the foreground, and execute click, type, key, and scroll actions against real OS controls.
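For readers unfamiliar with the wire format: MCP's stdio transport exchanges newline-delimited JSON-RPC 2.0 messages, and tools are invoked with the `tools/call` method. The sketch below only builds such request lines; the helper names are hypothetical, and it assumes `list_windows` takes no arguments. A real client must complete the MCP `initialize` handshake first and should read each tool's input schema from `tools/list` rather than guessing argument names.

```python
import itertools
import json
from typing import Any, Dict, Optional

_request_ids = itertools.count(1)


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one JSON-RPC 2.0 request as a single line (no embedded newlines)."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message, separators=(",", ":"))


def tool_call(name: str, arguments: Optional[Dict[str, Any]] = None) -> str:
    """Build a tools/call request for the named MCP tool."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments or {}})


# Ask the server which tools it offers, then call one of them.
print(jsonrpc_request("tools/list"))
print(tool_call("list_windows"))  # assumes this tool needs no arguments
```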

The package is hosted on GitHub as OSScreenObserver.

Architecture

main.py
  ┌── Flask web inspector (background thread)
  └── MCP stdio server (main thread, stdin/stdout)
         └── ScreenObserver
              ├── Accessibility Tree (observer)
              ├── ASCII Renderer (ascii_renderer)
              └── Description Generator (description)
                   ├── accessibility (tree prose)
                   ├── ocr (Tesseract)
                   └── vlm (Claude Vision)

Quick Start

python -m venv .venv && source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
python main.py
# Web inspector: http://127.0.0.1:5001
# MCP server:    stdin/stdout

MCP Integration

Add the following block to your Claude Desktop configuration to make the MCP tools available in a conversation:

{
  "mcpServers": {
    "os-screen-observer": {
      "command": "python",
      "args": ["/absolute/path/to/screen_observer/main.py", "--mode", "both"]
    }
  }
}

Available MCP Tools

| Tool | Description |
| --- | --- |
| list_windows | Enumerate all visible top-level windows |
| get_window_structure | Full accessibility element tree as JSON |
| get_screen_description | Combined description from all available sources |
| get_screen_sketch | ASCII spatial layout diagram |
| get_screenshot | Screenshot as base64 PNG |
| get_full_screenshot | Screenshot and ASCII sketch in one call |
| get_visible_areas | Visible (non-occluded) bounding boxes for a window |
| bring_to_foreground | Raise a window using the platform focus API |
| click_at | Click at pixel coordinates |
| type_text | Type text into the focused element |
| press_key | Press a key combination (e.g., ctrl+c) |
| scroll | Scroll the mouse wheel at a screen position |
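The notion of "visible (non-occluded) bounding boxes" reported by get_visible_areas can be illustrated with rectangle subtraction: start from the window's rectangle and cut away every window stacked above it. This is a generic sketch of that technique under assumed names and an assumed (left, top, right, bottom) rectangle format; it is not taken from the project's implementation.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def subtract(rect: Rect, occluder: Rect) -> List[Rect]:
    """Return the parts of rect not covered by occluder, as up to four rectangles."""
    l, t, r, b = rect
    ol, ot, o_r, ob = occluder
    # No overlap: the rectangle survives unchanged.
    if ol >= r or o_r <= l or ot >= b or ob <= t:
        return [rect]
    pieces: List[Rect] = []
    if ot > t:
        pieces.append((l, t, r, ot))            # strip above the occluder
    if ob < b:
        pieces.append((l, ob, r, b))            # strip below the occluder
    mid_top, mid_bottom = max(t, ot), min(b, ob)
    if ol > l:
        pieces.append((l, mid_top, ol, mid_bottom))   # strip to the left
    if o_r < r:
        pieces.append((o_r, mid_top, r, mid_bottom))  # strip to the right
    return pieces


def visible_areas(window: Rect, windows_above: List[Rect]) -> List[Rect]:
    """Cut every higher window out of the target window's rectangle."""
    visible = [window]
    for occluder in windows_above:
        visible = [piece for rect in visible for piece in subtract(rect, occluder)]
    return visible


print(visible_areas((0, 0, 100, 100), [(25, 25, 75, 75)]))
```

The returned rectangles never overlap, so their areas can be summed to estimate how much of a window is actually clickable.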

Platform Support

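| Feature | Windows | macOS | Linux | WSL |
| --- | --- | --- | --- | --- |
| Window enumeration | Full | Full | Full | Full |
| Accessibility tree | Full UIA | AXUIElement | AT-SPI | Stub (no X11) |
| Screenshot | Full | Full | Full | mss or PowerShell |
| OCR | Full | Full | Full | Full |
| VLM description | Full | Full | Full | Full |
| ASCII sketch | Full | Full | Full | Full |
| Input actions | Full | Full | Full | Requires DISPLAY |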

Known Limitations

Screen content is included verbatim in MCP tool results, so malicious content on screen could attempt to influence the AI’s behavior (prompt injection). Treat tool results as untrusted input and apply appropriate trust boundaries before deploying this server in production contexts. The input action tools (click_at, type_text, press_key, scroll) execute real OS input events and should be used only with appropriate authorization controls.
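One way to apply both precautions is to gate input actions behind an explicit allowlist and to tag screen-derived text as untrusted before it is returned. The sketch below is an assumption about how a deployment might do this; `authorize`, `wrap_untrusted`, and the result layout are hypothetical names, not part of the project.

```python
from typing import Dict, Set

# Tools that generate real OS input events, per the tool list above.
INPUT_ACTION_TOOLS: Set[str] = {"click_at", "type_text", "press_key", "scroll"}


class ActionNotAuthorized(PermissionError):
    """Raised when an input action is requested without explicit permission."""


def authorize(tool_name: str, allowed_actions: Set[str]) -> None:
    """Allow read-only tools freely; require an allowlist entry for input actions."""
    if tool_name in INPUT_ACTION_TOOLS and tool_name not in allowed_actions:
        raise ActionNotAuthorized(f"{tool_name} is not enabled for this session")


def wrap_untrusted(text: str, source: str) -> Dict[str, str]:
    """Label screen-derived text so downstream code can treat it as data, not instructions."""
    return {"source": source, "trust": "untrusted", "content": text}


session_allowlist = {"scroll"}  # hypothetical per-session configuration
authorize("list_windows", session_allowlist)   # read-only: passes
authorize("scroll", session_allowlist)         # explicitly allowed: passes
try:
    authorize("type_text", session_allowlist)
except ActionNotAuthorized as exc:
    print("blocked:", exc)

print(wrap_untrusted("Ignore previous instructions", source="ocr"))
```

Labeling alone does not stop prompt injection; it only gives the consuming agent or orchestration layer the information it needs to enforce a policy.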