OSScreenObserver
OSScreenObserver is a Python prototype that exposes the operating system’s UI accessibility tree, textual descriptions, and ASCII spatial sketches through two interfaces: a browser-based web inspector at localhost:5001 for human inspection, and an MCP (Model Context Protocol) stdio server compatible with Claude Desktop and Claude Code for AI agent integration. Both interfaces share the same underlying observer and can run simultaneously.
The system supports three description modalities, and get_screen_description always returns every source that is available on the current platform in a single call. The accessibility tree modality traverses the OS accessibility API (UIA on Windows, AXUIElement on macOS, AT-SPI on Linux) to produce a structured JSON element hierarchy. The OCR modality uses Tesseract to extract text from a screenshot. The VLM modality optionally passes a screenshot to Claude Vision for a natural-language description. These results are combined and labeled by source in both the web inspector’s Description tab and in MCP tool responses.
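The combine-and-label behavior described above can be sketched as follows. This is a hypothetical stand-in for the actual implementation: the function name, the source-map shape, and the `===` labels are assumptions, but it illustrates how every available source can be collected into one response while a failing backend degrades to a placeholder.

```python
def get_screen_description(sources):
    """Combine every available description source into one labeled response.

    `sources` maps a source name ("accessibility", "ocr", "vlm") to a
    zero-argument callable. A failing or missing backend is reported
    inline instead of aborting the whole call.
    """
    parts = []
    for name, fetch in sources.items():
        try:
            text = fetch()
        except Exception as exc:
            text = f"(unavailable: {exc})"
        parts.append(f"=== {name} ===\n{text}")
    return "\n\n".join(parts)

# Example with two sources; the VLM backend is simply absent on this run.
description = get_screen_description({
    "accessibility": lambda: "Window 'Editor' > Button 'Save'",
    "ocr": lambda: "Save  Cancel",
})
```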
The ASCII sketch renderer produces a Unicode box-drawing spatial layout diagram of a window’s element positions, suitable for consumption by a language model with no image input capability. All inputs and outputs degrade gracefully: if a library is not installed or a platform capability is unavailable, the server continues running and returns whatever it can.
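The core idea of the sketch renderer can be illustrated with a minimal, self-contained version. Everything here is an assumption about the real renderer (function name, element dict keys, grid size): it just scales pixel bounding boxes onto a character grid and draws Unicode box edges with a truncated label inside.

```python
def render_sketch(win_w, win_h, elements, cols=40, rows=12):
    """Render element bounding boxes as a Unicode box-drawing grid.

    `elements` is a list of dicts with pixel coords: x, y, w, h, name.
    """
    grid = [[" "] * cols for _ in range(rows)]
    for el in elements:
        # Scale pixel coordinates to grid cells, clamping to the grid.
        x0 = min(int(el["x"] * cols / win_w), cols - 2)
        y0 = min(int(el["y"] * rows / win_h), rows - 2)
        x1 = min(max(int((el["x"] + el["w"]) * cols / win_w), x0 + 1), cols - 1)
        y1 = min(max(int((el["y"] + el["h"]) * rows / win_h), y0 + 1), rows - 1)
        for x in range(x0, x1 + 1):          # horizontal edges
            grid[y0][x] = grid[y1][x] = "─"
        for y in range(y0, y1 + 1):          # vertical edges
            grid[y][x0] = grid[y][x1] = "│"
        grid[y0][x0], grid[y0][x1] = "┌", "┐"
        grid[y1][x0], grid[y1][x1] = "└", "┘"
        if y1 - y0 > 1:                      # label inside the box if there is room
            for i, ch in enumerate(el["name"][: x1 - x0 - 1]):
                grid[y0 + 1][x0 + 1 + i] = ch
    return "\n".join("".join(row) for row in grid)

sketch = render_sketch(800, 600, [{"x": 0, "y": 0, "w": 400, "h": 300, "name": "sidebar"}])
```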
Platform-specific adapters are provided for Windows (full UIA and pywinauto support), macOS (Quartz and pyobjc AX accessibility tree), Linux (wmctrl and pyatspi), and WSL (PowerShell fallback for screenshots and window enumeration when no X11 display is available). A mock adapter allows full development and testing without any OS-level access.
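Adapter selection along these lines might look like the sketch below. The function and adapter names are hypothetical; the WSL check relies on the well-known fact that WSL kernels report a Microsoft-flavored release string via `platform.release()`.

```python
import platform


def pick_adapter():
    """Choose a platform adapter name (hypothetical selection logic)."""
    system = platform.system()
    if system == "Windows":
        return "windows"              # UIA / pywinauto
    if system == "Darwin":
        return "macos"                # Quartz / pyobjc AX
    if system == "Linux":
        # WSL kernels identify themselves in the release string,
        # e.g. "5.15.90.1-microsoft-standard-WSL2".
        release = platform.release().lower()
        if "microsoft" in release or "wsl" in release:
            return "wsl"              # PowerShell fallback, no X11 required
        return "linux"                # wmctrl / pyatspi
    return "mock"                     # unknown platform: mock adapter
```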
The MCP integration enables AI agents to list visible windows, retrieve accessibility trees, request ASCII sketches and screenshots, obtain combined textual descriptions, enumerate visible bounding boxes, bring windows to the foreground, and execute click, type, key, and scroll actions against real OS controls.
The package is hosted on GitHub at:
Architecture
main.py
├── Flask web inspector (background thread)
├── MCP stdio server (main thread, stdin/stdout)
└── ScreenObserver (shared by both interfaces)
    ├── Accessibility Tree (observer)
    ├── ASCII Renderer (ascii_renderer)
    └── Description Generator (description)
        ├── accessibility (tree prose)
        ├── ocr (Tesseract)
        └── vlm (Claude Vision)
Quick Start
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py
# Web inspector: http://127.0.0.1:5001
# MCP server: stdin/stdout
MCP Integration
Add the following block to your Claude Desktop configuration to make the MCP tools available in a conversation:
{
  "mcpServers": {
    "os-screen-observer": {
      "command": "python",
      "args": ["/absolute/path/to/screen_observer/main.py", "--mode", "both"]
    }
  }
}
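For Claude Code, the same server can typically be registered from the CLI instead of by editing JSON. The exact flags may differ by version, so check `claude mcp add --help` before relying on this form:

```shell
claude mcp add os-screen-observer -- python /absolute/path/to/screen_observer/main.py --mode both
```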
Available MCP Tools
| Tool | Description |
|---|---|
| list_windows | Enumerate all visible top-level windows |
| get_window_structure | Full accessibility element tree as JSON |
| get_screen_description | Combined description from all available sources |
| get_screen_sketch | ASCII spatial layout diagram |
| get_screenshot | Screenshot as base64 PNG |
| get_full_screenshot | Screenshot and ASCII sketch in one call |
| get_visible_areas | Visible (non-occluded) bounding boxes for a window |
| bring_to_foreground | Raise a window using the platform focus API |
| click_at | Click at pixel coordinates |
| type_text | Type text into the focused element |
| press_key | Press a key combination (e.g., ctrl+c) |
| scroll | Scroll the mouse wheel at a screen position |
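Under the hood, an MCP client invokes these tools with standard JSON-RPC 2.0 `tools/call` requests over stdio. A minimal sketch of building such a message; the argument names for click_at are assumptions here (the authoritative parameter names come from the tool schemas returned by `tools/list`):

```python
import json


def tools_call(request_id, name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

# Hypothetical click_at arguments; consult the tool's schema for real names.
request = tools_call(1, "click_at", {"x": 100, "y": 200})
```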
Platform Support
| Feature | Windows | macOS | Linux | WSL |
|---|---|---|---|---|
| Window enumeration | Full | Full | Full | Full |
| Accessibility tree | Full UIA | AXUIElement | AT-SPI | Stub (no X11) |
| Screenshot | Full | Full | Full | mss or PowerShell |
| OCR | Full | Full | Full | Full |
| VLM description | Full | Full | Full | Full |
| ASCII sketch | Full | Full | Full | Full |
| Input actions | Full | Full | Full | Requires DISPLAY |
Known Limitations
Screen content is included verbatim in MCP tool results. Malicious content on screen could attempt to influence the AI’s behavior; appropriate trust boundaries should be applied before deploying this server in production contexts. Input action tools (click_at, type_text, press_key) execute real OS input events and should be used with appropriate authorization controls.
