vrr/airi

mirror of https://github.com/moeru-ai/airi.git synced 2026-05-17 04:20:26 +00:00

duyua9 5ab8b0e33e	2026-05-13 12:47:57 +08:00

fix(computer-use-mcp): bound terminal output capture (#1802)

## Summary
- cap local `terminal_exec` stdout/stderr capture at a fixed per-stream limit
- report whether each stream was truncated, along with the original captured length
- add runner coverage for commands that emit large stdout and stderr payloads

## Why
The local shell runner currently appends stdout/stderr without a boundary before returning `TerminalCommandResult`. Large command output can grow the MCP response and stored terminal state far beyond what is useful for the agent. This keeps command execution semantics the same while bounding the returned text and making truncation explicit to callers.

## Tests
- `pnpm -F @proj-airi/computer-use-mcp test -- src/terminal/runner.test.ts`
- `pnpm -F @proj-airi/computer-use-mcp typecheck`
- `git diff --check origin/main...HEAD`

Co-authored-by: 刘梓恒 <160735726+3361559784@users.noreply.github.com>
chrome-extension	feat(computer-use-mcp): add read-only DOM tools parity to extension bridge (#1733 )	2026-04-27 15:24:27 +08:00
fixtures	feat(stage-tamagotchi,stage-ui,computer-use-mcp): intro agent-owned session and ghost pointer phases (#1649 )	2026-04-25 04:14:33 +08:00
src	fix(computer-use-mcp): bound terminal output capture (#1802 )	2026-05-13 12:47:57 +08:00
AGENTS.md	chore(agent): add computer-use-mcp agent governance (#1737 )	2026-04-27 15:31:02 +08:00
airi-cli-architecture.md	chore(computer-use-mcp): added architecture docs outlining airi cli `chafa` (#1735 )	2026-05-13 11:55:33 +08:00
coding-plast-mem-bridge-contract.md	chore(computer-use-mcp): added docs defining plast-mem plugin (#1779 )	2026-05-13 11:59:06 +08:00
FEASIBILITY.md	feat(mcp-computer-use): lay down the foundation of computer use (#1380 )	2026-04-11 00:54:01 +08:00
mimic-baseline-training-boundary.md	chore(computer-use-mcp): added docs defining mimic baseline training boundary (#1777 )	2026-05-13 11:57:18 +08:00
package.json	test(computer-use-mcp): add desktop v3 smoke coverage (#1780 )	2026-05-13 12:08:38 +08:00
planning-orchestration-contract.md	test(computer-use-mcp): define planning orchestration contract (#1778 )	2026-05-13 11:58:35 +08:00
README.md	[1/3] feat(desktop): add desktop observation and overlay baseline (#1647 )	2026-04-24 23:43:40 +08:00
tsconfig.json	style(computer-use-mcp): house keeping, lint fixes, type missing, and many more	2026-04-11 04:59:13 +08:00
tsdown.config.ts	feat(mcp-computer-use): lay down the foundation of computer use (#1380 )	2026-04-11 00:54:01 +08:00
vitest.config.ts	feat(mcp-computer-use): lay down the foundation of computer use (#1380 )	2026-04-11 00:54:01 +08:00

README.md

computer-use-mcp

AIRI-specific macOS desktop orchestration MCP service.

Why This Exists

This package exists because AIRI already has many useful pieces in the monorepo — providers, chat UX, MCP attachment, desktop app surfaces, browser integrations, tool bridges, and workflow-related logic — but those pieces are still too easy to use as isolated features instead of one coherent agent system.

computer-use-mcp is the missing execution substrate for that gap.

The current goal is not to add "another computer use demo". The goal is to give AIRI a unified way to:

observe the current desktop or browser state
choose the right execution surface for the task
run deterministic actions through tools and terminal commands
keep approvals, trace, and audit artifacts attached to the run
compose those actions into repeatable workflows instead of one-off demos

In short:

AIRI remains the control plane and agent shell
computer-use-mcp is the local execution and workflow substrate
the value is in orchestration, not in cursor movement by itself

What It Is

This package is no longer positioned as a generic remote computer-use experiment. The current v1 shape is:

AIRI keeps the control plane:
- MCP tool surface
- approval queue protocol
- audit log
- trace history
- screenshot persistence
computer-use-mcp provides a local macOS execution layer:
- window observation
- screenshots
- app open/focus
- mouse/keyboard injection
- background terminal command execution
AIRI desktop adds a native approval adapter:
- approval_required still comes from MCP
- Electron shows a native dialog
- AIRI then calls approve/reject automatically to relay the user's choice

The intended story is:

AIRI uses tools first
visual observation is supplementary, not the primary execution path
terminal commands are executed by a background shell runner, not by scripting Terminal tabs
desktop/Electron/native apps and browser DOM are treated as different execution surfaces

Why It Is Not "Just A Mouse Toy"

This package should not be understood as a coordinate-replay automation toy.

What makes it different:

it exposes an MCP tool surface instead of a one-off macro recorder
it keeps action policy, approval, trace history, and audit output per run
it distinguishes between desktop control and browser DOM control instead of forcing everything through blind clicks
it prefers deterministic execution paths (terminal_exec, workflows, browser_dom_*) before raw coordinate actions
it is designed to be called by AIRI automatically as part of a task flow, not merely driven by a human demo operator

That means the package is useful only when it helps AIRI turn scattered local capabilities into one observable, controllable task system.

Current Executor Modes

dry-run
- default
- never injects input
- still captures best-effort local screenshots for debugging
macos-local
- current primary backend
- window observation via NSWorkspace + CGWindowList
- input injection via Swift + Quartz CGEvent
- app open/focus via open -a and activate
linux-x11
- retained as a legacy experimental backend
- not the main v1 story anymore
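
The mode list above can be read as a small resolution rule. The following is a minimal sketch, not the package's actual code: `resolveExecutorMode`, `injectsInput`, and the `Env` type are hypothetical names, and throwing on an unknown value is an assumption (the real service may fall back instead).

```typescript
// Illustrative sketch: these names are hypothetical, not the package's exports.
type ExecutorMode = 'dry-run' | 'macos-local' | 'linux-x11'
type Env = Record<string, string | undefined>

const EXECUTOR_MODES: readonly ExecutorMode[] = ['dry-run', 'macos-local', 'linux-x11']

function resolveExecutorMode(env: Env): ExecutorMode {
  const raw = env['COMPUTER_USE_EXECUTOR']?.trim()
  // dry-run is the documented default, so an unset variable never injects input.
  if (!raw)
    return 'dry-run'
  const mode = EXECUTOR_MODES.find(candidate => candidate === raw)
  // Assumption: unknown values fail loudly rather than silently falling back.
  if (mode === undefined)
    throw new Error(`Unsupported COMPUTER_USE_EXECUTOR: ${raw}`)
  return mode
}

function injectsInput(mode: ExecutorMode): boolean {
  // Only dry-run is documented as never injecting input.
  return mode !== 'dry-run'
}
```

In a Node process this would typically be called as `resolveExecutorMode(process.env)`.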

Tool Surface

Desktop observation and control:

desktop_get_capabilities
desktop_observe_windows
desktop_screenshot
desktop_open_app
desktop_focus_app
desktop_click
desktop_type_text
desktop_press_keys
desktop_scroll
desktop_wait

Terminal orchestration:

terminal_exec
terminal_get_state
terminal_reset_state

Secret and clipboard bridge:

secret_read_env_value
clipboard_read_text
clipboard_write_text

Browser DOM bridge:

browser_agent_get_status
browser_agent_run
browser_dom_get_bridge_status
browser_dom_get_active_tab
browser_dom_read_page
browser_dom_find_elements
browser_dom_click
browser_dom_read_input_value
browser_dom_set_input_value
browser_dom_check_checkbox
browser_dom_select_option
browser_dom_wait_for_element
browser_dom_get_element_attributes
browser_dom_get_computed_styles
browser_dom_trigger_event

Approval and audit helpers:

desktop_list_pending_actions
desktop_approve_pending_action
desktop_reject_pending_action
desktop_get_session_trace

Workflow orchestration:

workflow_open_workspace
- reveals a workspace in Finder and opens it in the configured IDE
workflow_validate_workspace
- opens the workspace, confirms pwd, inspects local changes, and runs a validation command such as pnpm typecheck
workflow_run_tests
- runs a test command from the workspace root
workflow_inspect_failure
- focuses the IDE and re-runs or inspects a failing command path
workflow_browse_and_act
- generic browse-and-act flow for app observation and follow-up actions
workflow_resume
- resumes a workflow that paused on approval_required

Policy Model

The current macOS v1 boundary is intentionally narrow and explicit:

global screen coordinates are allowed for UI actions
allowApps is not used as a hard gate for click/type/scroll
denyApps still blocks sensitive foreground apps
COMPUTER_USE_OPENABLE_APPS only gates desktop_open_app and desktop_focus_app
AIRI itself is in the default deny list to avoid self-operation
terminal commands always require approval
app open/focus always require approval
click/type/press/scroll still use per-action approval
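
One way to read the rules above is as a two-stage check: hard gates first, then approval. The sketch below is a hedged illustration only. All type and function names are hypothetical, the case-insensitive exact match on app names is an assumption, and so is applying the deny list to open/focus targets. It models the documented defaults and does not model `COMPUTER_USE_APPROVAL_MODE`.

```typescript
// Illustrative sketch: type and function names are hypothetical.
type ActionKind =
  | 'desktop_open_app' | 'desktop_focus_app'
  | 'desktop_click' | 'desktop_type_text' | 'desktop_press_keys' | 'desktop_scroll'
  | 'terminal_exec'

interface PolicyConfig {
  denyApps: readonly string[]
  openableApps: readonly string[]
}

interface ActionRequest {
  kind: ActionKind
  targetApp?: string // app named by desktop_open_app / desktop_focus_app
  foregroundApp?: string // app currently in front, for UI actions
}

type Decision =
  | { outcome: 'deny', reason: string }
  | { outcome: 'approval_required' }

// Assumption: app names are matched case-insensitively and exactly.
function listed(app: string | undefined, apps: readonly string[]): boolean {
  if (app === undefined)
    return false
  const needle = app.toLowerCase()
  return apps.some(entry => entry.toLowerCase() === needle)
}

function evaluate(request: ActionRequest, config: PolicyConfig): Decision {
  const opensApp = request.kind === 'desktop_open_app' || request.kind === 'desktop_focus_app'
  if (opensApp) {
    if (listed(request.targetApp, config.denyApps))
      return { outcome: 'deny', reason: 'target app is in denyApps' }
    // The openable list gates only open/focus, never click/type/scroll.
    if (!listed(request.targetApp, config.openableApps))
      return { outcome: 'deny', reason: 'target app is not in COMPUTER_USE_OPENABLE_APPS' }
  }
  else if (request.kind !== 'terminal_exec' && listed(request.foregroundApp, config.denyApps)) {
    return { outcome: 'deny', reason: 'foreground app is in denyApps' }
  }
  // Everything that survives the gates still goes through approval.
  return { outcome: 'approval_required' }
}
```

Note that there is no allow-list branch for UI actions, which matches the statement that the safety boundary is approval plus audit rather than app isolation.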

Environment Variables

Core:

COMPUTER_USE_EXECUTOR
- dry-run (default), macos-local, or linux-x11
COMPUTER_USE_APPROVAL_MODE
- actions (default), all, never
COMPUTER_USE_SESSION_ROOT
- local output directory for screenshots and audit.jsonl
COMPUTER_USE_TIMEOUT_MS
COMPUTER_USE_DEFAULT_CAPTURE_AFTER
COMPUTER_USE_MAX_OPERATIONS
COMPUTER_USE_MAX_OPERATION_UNITS
COMPUTER_USE_MAX_PENDING_ACTIONS

macOS orchestration:

COMPUTER_USE_OPENABLE_APPS
- default Terminal,Cursor,Google Chrome
COMPUTER_USE_DENY_APPS
- default includes 1Password, Keychain, System Settings, Activity Monitor, AIRI
COMPUTER_USE_DENY_WINDOW_TITLES
COMPUTER_USE_TERMINAL_SHELL
- defaults to the current shell, falling back to /bin/zsh
COMPUTER_USE_ALLOWED_BOUNDS
- optional global coordinate clamp
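
As a rough illustration of what a global coordinate clamp could look like, here is a minimal sketch. The `Bounds` shape, the inclusive pixel range, and the choice to clamp rather than reject out-of-range points are all assumptions; the README does not specify the format of `COMPUTER_USE_ALLOWED_BOUNDS`.

```typescript
// Illustrative sketch: the Bounds shape is an assumption, not the documented format.
interface Point {
  x: number
  y: number
}

interface Bounds {
  x: number
  y: number
  width: number
  height: number
}

function clampToBounds(point: Point, bounds?: Bounds): Point {
  // The clamp is optional: without bounds, coordinates pass through unchanged.
  if (bounds === undefined)
    return point
  // Assumption: the last valid pixel is origin + size - 1.
  const maxX = bounds.x + bounds.width - 1
  const maxY = bounds.y + bounds.height - 1
  return {
    x: Math.min(Math.max(point.x, bounds.x), maxX),
    y: Math.min(Math.max(point.y, bounds.y), maxY),
  }
}
```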

Browser DOM bridge:

COMPUTER_USE_BROWSER_DOM_BRIDGE_ENABLED
- default true
COMPUTER_USE_BROWSER_DOM_BRIDGE_HOST
- default 127.0.0.1
COMPUTER_USE_BROWSER_DOM_BRIDGE_PORT
- default 8765
COMPUTER_USE_BROWSER_DOM_BRIDGE_TIMEOUT_MS
- default 10000
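
The four bridge variables and their defaults can be sketched as one resolution function. This is an assumption-laden example, not the package's parser: `resolveBrowserDomBridge` and `parsePositiveInt` are hypothetical names, treating only the literal string `false` as disabled is a guess, and falling back to the default on an invalid number is also a guess.

```typescript
// Illustrative sketch: function names and parsing rules are hypothetical.
type Env = Record<string, string | undefined>

interface BrowserDomBridgeConfig {
  enabled: boolean
  host: string
  port: number
  timeoutMs: number
  url: string
}

function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '')
    return fallback
  const value = Number(raw)
  // Assumption: invalid or non-positive numbers fall back to the default.
  return Number.isInteger(value) && value > 0 ? value : fallback
}

function resolveBrowserDomBridge(env: Env): BrowserDomBridgeConfig {
  const host = env['COMPUTER_USE_BROWSER_DOM_BRIDGE_HOST']?.trim() || '127.0.0.1'
  const port = parsePositiveInt(env['COMPUTER_USE_BROWSER_DOM_BRIDGE_PORT'], 8765)
  const timeoutMs = parsePositiveInt(env['COMPUTER_USE_BROWSER_DOM_BRIDGE_TIMEOUT_MS'], 10000)
  // Assumption: only the literal string "false" disables the bridge.
  const enabled = env['COMPUTER_USE_BROWSER_DOM_BRIDGE_ENABLED']?.trim().toLowerCase() !== 'false'
  return { enabled, host, port, timeoutMs, url: `ws://${host}:${port}` }
}
```

With an empty environment the resulting `url` is `ws://127.0.0.1:8765`, which is the documented default endpoint.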

Autonomous browser agent:

COMPUTER_USE_BROWSER_AGENT_ROOT
- optional override for the embedded browser-agent workspace under src/bin/computer_use
COMPUTER_USE_PYTHON
- optional python executable override for browser_agent_run; defaults to the embedded .venv/bin/python when present, otherwise python3
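
The Python lookup order described above (explicit override, then the embedded virtualenv, then `python3`) can be sketched as a pure function. `resolvePythonExecutable` is a hypothetical name, and locating `.venv` directly under the browser-agent root is an assumption. The file-existence check is passed in so the sketch stays free of filesystem imports.

```typescript
// Illustrative sketch: names and the .venv location are assumptions.
type Env = Record<string, string | undefined>

function resolvePythonExecutable(
  env: Env,
  agentRoot: string,
  fileExists: (path: string) => boolean,
): string {
  // 1. An explicit override always wins.
  const override = env['COMPUTER_USE_PYTHON']?.trim()
  if (override)
    return override
  // 2. Otherwise prefer the embedded virtualenv when it is present.
  const embedded = `${agentRoot.replace(/\/+$/, '')}/.venv/bin/python`
  if (fileExists(embedded))
    return embedded
  // 3. Fall back to whatever python3 is on PATH.
  return 'python3'
}
```

In Node, `existsSync` from `node:fs` is a natural candidate for the `fileExists` argument.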

Legacy remote runner:

COMPUTER_USE_REMOTE_SSH_HOST
COMPUTER_USE_REMOTE_SSH_USER
COMPUTER_USE_REMOTE_SSH_PORT
COMPUTER_USE_REMOTE_RUNNER_COMMAND
COMPUTER_USE_REMOTE_DISPLAY_SIZE
COMPUTER_USE_REMOTE_OBSERVATION_BASE_URL
COMPUTER_USE_REMOTE_OBSERVATION_SERVE_PORT
COMPUTER_USE_REMOTE_OBSERVATION_TOKEN

Binary overrides:

COMPUTER_USE_SWIFT_BINARY
COMPUTER_USE_OSASCRIPT_BINARY
COMPUTER_USE_SCREENSHOT_BINARY
COMPUTER_USE_OPEN_BINARY
COMPUTER_USE_SSH_BINARY
COMPUTER_USE_TAR_BINARY

AIRI Integration

AIRI still connects through mcp.json. Example local macOS entry:

{
  "mcpServers": {
    "computer_use": {
      "command": "pnpm",
      "args": [
        "-F",
        "@proj-airi/computer-use-mcp",
        "start"
      ],
      "cwd": "/path/to/your/airi/repo",
      "env": {
        "COMPUTER_USE_EXECUTOR": "macos-local",
        "COMPUTER_USE_APPROVAL_MODE": "actions",
        "COMPUTER_USE_OPENABLE_APPS": "Terminal,Cursor,Google Chrome"
      }
    }
  }
}
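
If the entry is generated rather than written by hand, the same structure can be built with a small typed helper. This is a sketch under stated assumptions: `McpServerEntry`, `McpConfig`, and `buildComputerUseConfig` are hypothetical names that only mirror the fields shown in the example above, not an official schema.

```typescript
// Illustrative sketch: these types mirror the example entry, not an official schema.
interface McpServerEntry {
  command: string
  args: string[]
  cwd?: string
  env?: Record<string, string>
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>
}

function buildComputerUseConfig(repoRoot: string, openableApps: readonly string[]): McpConfig {
  return {
    mcpServers: {
      computer_use: {
        command: 'pnpm',
        args: ['-F', '@proj-airi/computer-use-mcp', 'start'],
        cwd: repoRoot,
        env: {
          COMPUTER_USE_EXECUTOR: 'macos-local',
          COMPUTER_USE_APPROVAL_MODE: 'actions',
          // The variable is a comma-separated list, as in the example entry.
          COMPUTER_USE_OPENABLE_APPS: openableApps.join(','),
        },
      },
    },
  }
}
```

`JSON.stringify(buildComputerUseConfig(root, apps), null, 2)` then yields text in the same shape as the example entry.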

On the AIRI desktop side, approvals are handled like this:

model calls a computer_use::* tool
MCP returns approval_required
Electron shows a native approval dialog
AIRI automatically relays the user's choice by calling desktop_approve_pending_action or desktop_reject_pending_action
terminal/app approvals can be reused for the current run only
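
The approval steps above can be sketched as a pure planning function that decides which follow-up MCP call the desktop adapter would make. Everything except the two tool names is hypothetical: the result shape, the `scope` field, the `id` argument, and the `RunApprovals` class are illustrative assumptions. The dialog callback is synchronous here only for brevity; a real native dialog would be asynchronous.

```typescript
// Illustrative sketch: result shapes and argument names are assumptions.
interface PendingApproval {
  status: 'approval_required'
  pendingActionId: string
  scope: 'terminal' | 'app' | 'ui'
}

type ToolResult = { status: 'ok' } | PendingApproval

interface FollowUp {
  tool: 'desktop_approve_pending_action' | 'desktop_reject_pending_action'
  args: { id: string }
}

// Approvals remembered for one run only; a new run starts with a new instance.
class RunApprovals {
  private readonly reusable = new Set<string>()

  remember(scope: PendingApproval['scope']): void {
    // Only terminal/app approvals are reusable; UI actions stay per-action.
    if (scope === 'terminal' || scope === 'app')
      this.reusable.add(scope)
  }

  covers(scope: PendingApproval['scope']): boolean {
    return this.reusable.has(scope)
  }
}

function planFollowUp(
  result: ToolResult,
  approvals: RunApprovals,
  askUser: (pending: PendingApproval) => boolean,
): FollowUp | null {
  if (result.status !== 'approval_required')
    return null
  // Skip the dialog when an earlier approval in this run already covers the scope.
  const approved = approvals.covers(result.scope) || askUser(result)
  if (approved)
    approvals.remember(result.scope)
  return {
    tool: approved ? 'desktop_approve_pending_action' : 'desktop_reject_pending_action',
    args: { id: result.pendingActionId },
  }
}
```

Keeping the decision separate from the dialog and the MCP transport makes the reuse rule easy to test without Electron.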

For browser DOM automation, computer-use-mcp also exposes a local WebSocket bridge that matches the user's Chrome extension bridge pattern:

computer-use-mcp listens on ws://127.0.0.1:8765 by default
the unpacked browser extension background service worker connects to that socket
AIRI can then call browser_dom_* MCP tools against the active browser tab

If you override COMPUTER_USE_BROWSER_DOM_BRIDGE_HOST or COMPUTER_USE_BROWSER_DOM_BRIDGE_PORT, mirror the same endpoint in the Chrome extension via chrome.storage.local.set({ browserDomBridgeHost, browserDomBridgePort }) so the background worker reconnects to the correct socket.

Use the two surfaces differently:

desktop_* for AIRI itself, native macOS apps, Electron windows, Finder, Terminal, VS Code
browser_dom_* for real browser pages, cross-frame DOM reads, form filling, selector-based interaction, and iframe-heavy flows
browser_agent_run for goal-driven browser tasks where AIRI should delegate the web exploration loop instead of manually hard-coding each browser step
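
The split above amounts to a small routing rule. A minimal sketch follows, assuming a hypothetical `TaskHint` that the caller fills in; the target categories and the `goalDriven` flag are illustrative and not part of the MCP tool surface.

```typescript
// Illustrative sketch: TaskHint and its fields are hypothetical.
type Target = 'airi' | 'native-app' | 'electron-window' | 'browser-page'
type Surface = 'desktop_*' | 'browser_dom_*' | 'browser_agent_run'

interface TaskHint {
  target: Target
  // True when AIRI should delegate the whole web exploration loop.
  goalDriven?: boolean
}

function chooseSurface(hint: TaskHint): Surface {
  // Anything outside a real browser page stays on the desktop surface.
  if (hint.target !== 'browser-page')
    return 'desktop_*'
  // Inside a page, prefer explicit DOM tools unless the task is goal-driven.
  return hint.goalDriven ? 'browser_agent_run' : 'browser_dom_*'
}
```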

Validation Commands

pnpm -F @proj-airi/computer-use-mcp typecheck
pnpm -F @proj-airi/computer-use-mcp test
pnpm -F @proj-airi/computer-use-mcp smoke:stdio
pnpm -F @proj-airi/computer-use-mcp smoke:macos
pnpm -F @proj-airi/computer-use-mcp e2e:airi-chat
pnpm -F @proj-airi/computer-use-mcp e2e:airi-discord

Legacy remote validation remains available:

pnpm -F @proj-airi/computer-use-mcp bootstrap:remote
pnpm -F @proj-airi/computer-use-mcp smoke:remote

Demo Story To Record

If you want to record a convincing demo, show the system as an orchestrated task runner instead of a flashy cursor dance.

Recommended recording structure:

Show the AIRI desktop window, a terminal, and the generated report directory.
Start the local AIRI desktop app and the computer-use-mcp service.
Show that AIRI can call the MCP tools automatically instead of only listing them.
Demonstrate one short task that exercises the full loop:

- observe state
- execute a tool or workflow
- produce a visible result
- persist trace / audit / screenshots

End by opening the generated report.json, audit.jsonl, or screenshots so the demo finishes with evidence rather than just screen motion.

Good first demos:

open a workspace, confirm pwd, inspect local changes, and run pnpm typecheck
create and run a Python hello-world project through terminal_exec
use desktop control for AIRI or native apps and use browser_dom_* only when the task truly moves into a browser page

Discord integration demo

For a management-readable AIRI demo, the Discord settings flow is more representative than a generic hello-world reply:

start AIRI desktop and services/discord-bot
open /settings/modules/messaging-discord
enable the module and save settings
verify that the Discord bot receives the forwarded config from AIRI and reconnects itself
finish by opening report.json, screenshots, audit log, and discord-bot.log

Notes:

for a pure local-secret run, set AIRI_E2E_DISCORD_TOKEN
for a more agentic run, set AIRI_E2E_DISCORD_TOKEN_SOURCE=portal or auto and let AIRI retrieve the token from the live browser / Discord Developer Portal session
clipboard_read_text / clipboard_write_text are the intended bridge when AIRI must move a copied token from the browser back into AIRI settings
the observable harness keeps the token out of the desktop audit trail by applying the secret through the renderer instead of typing it through Quartz key events
if you only want to validate the AIRI → Discord bot configuration plumbing without a real token, set AIRI_E2E_DISCORD_ALLOW_LOGIN_FAILURE=true

Less convincing demos:

long videos of coordinate clicking with no trace output
browser form-filling done only by screen coordinates when DOM tools were available
tasks that cannot explain afterwards what the agent observed, executed, or verified

Known Limits

macOS only for the main v1 path
no accessibility tree grounding yet
PTY/TUI terminal support is product-supported on the self-acquire mainline; legacy outward terminal reroute remains secondary
no multi-monitor orchestration policy yet
global coordinates are allowed, so the safety boundary is approval + audit, not strict app isolation