|
Some checks failed
CI / Lint (push) Waiting to run
CI / Build Test (stage-tamagotchi) (push) Waiting to run
CI / Build Test (stage-tamagotchi-godot) (push) Waiting to run
CI / Build Test (stage-web) (push) Waiting to run
CI / Build Test (ui-loading-screens) (push) Waiting to run
CI / Build Test (ui-transitions) (push) Waiting to run
CI / Type Check (push) Waiting to run
CI / Check Provenance (push) Waiting to run
Cloudflare Workers / Deploy - stage-web (push) Waiting to run
Update Nix assets Hash / update (push) Has been cancelled
Update Nix pnpmDeps Hash / update (push) Has been cancelled
## Summary - cap local `terminal_exec` stdout/stderr capture at a fixed per-stream limit - report whether each stream was truncated, along with the original captured length - add runner coverage for commands that emit large stdout and stderr payloads ## Why The local shell runner currently appends stdout/stderr without a boundary before returning `TerminalCommandResult`. Large command output can grow the MCP response and stored terminal state far beyond what is useful for the agent. This keeps command execution semantics the same while bounding the returned text and making truncation explicit to callers. ## Tests - `pnpm -F @proj-airi/computer-use-mcp test -- src/terminal/runner.test.ts` - `pnpm -F @proj-airi/computer-use-mcp typecheck` - `git diff --check origin/main...HEAD` Co-authored-by: 刘梓恒 <160735726+3361559784@users.noreply.github.com> |
||
|---|---|---|
| .. | ||
| chrome-extension | ||
| fixtures | ||
| src | ||
| AGENTS.md | ||
| airi-cli-architecture.md | ||
| coding-plast-mem-bridge-contract.md | ||
| FEASIBILITY.md | ||
| mimic-baseline-training-boundary.md | ||
| package.json | ||
| planning-orchestration-contract.md | ||
| README.md | ||
| tsconfig.json | ||
| tsdown.config.ts | ||
| vitest.config.ts | ||
computer-use-mcp
AIRI-specific macOS desktop orchestration MCP service.
Why This Exists
This package exists because AIRI already has many useful pieces in the monorepo — providers, chat UX, MCP attachment, desktop app surfaces, browser integrations, tool bridges, and workflow-related logic — but those pieces are still too easy to use as isolated features instead of one coherent agent system.
computer-use-mcp is the missing execution substrate for that gap.
The current goal is not to add "another computer use demo". The goal is to give AIRI a unified way to:
- observe the current desktop or browser state
- choose the right execution surface for the task
- run deterministic actions through tools and terminal commands
- keep approvals, trace, and audit artifacts attached to the run
- compose those actions into repeatable workflows instead of one-off demos
In short:
- AIRI remains the control plane and agent shell
computer-use-mcpis the local execution and workflow substrate- the value is in orchestration, not in cursor movement by itself
What It Is
This package is no longer positioned as a generic remote computer-use experiment. The current v1 shape is:
- AIRI keeps the control plane:
- MCP tool surface
- approval queue protocol
- audit log
- trace history
- screenshot persistence
computer-use-mcpprovides a local macOS execution layer:- window observation
- screenshots
- app open/focus
- mouse/keyboard injection
- background terminal command execution
- AIRI desktop adds a native approval adapter:
approval_requiredstill comes from MCP- Electron shows a native dialog
- AIRI automatically calls approve/reject on the user's behalf
The intended story is:
- AIRI uses tools first
- visual observation is supplementary, not the primary execution path
- terminal commands are executed by a background shell runner, not by scripting Terminal tabs
- desktop/Electron/native apps and browser DOM are treated as different execution surfaces
Why It Is Not "Just A Mouse Toy"
This package should not be understood as a coordinate-replay automation toy.
What makes it different:
- it exposes an MCP tool surface instead of a one-off macro recorder
- it keeps action policy, approval, trace history, and audit output per run
- it distinguishes between desktop control and browser DOM control instead of forcing everything through blind clicks
- it prefers deterministic execution paths (
terminal_exec, workflows,browser_dom_*) before raw coordinate actions - it is designed to be called by AIRI automatically as part of a task flow, not merely driven by a human demo operator
That means the package is useful only when it helps AIRI turn scattered local capabilities into one observable, controllable task system.
Current Executor Modes
dry-run- default
- never injects input
- still captures best-effort local screenshots for debugging
macos-local- current primary backend
- window observation via
NSWorkspace + CGWindowList - input injection via Swift + Quartz
CGEvent - app open/focus via
open -aandactivate
linux-x11- retained as a legacy experimental backend
- not the main v1 story anymore
Tool Surface
Desktop observation and control:
desktop_get_capabilitiesdesktop_observe_windowsdesktop_screenshotdesktop_open_appdesktop_focus_appdesktop_clickdesktop_type_textdesktop_press_keysdesktop_scrolldesktop_wait
Terminal orchestration:
terminal_execterminal_get_stateterminal_reset_state
Clipboard bridge:
secret_read_env_valueclipboard_read_textclipboard_write_text
Browser DOM bridge:
browser_agent_get_statusbrowser_agent_runbrowser_dom_get_bridge_statusbrowser_dom_get_active_tabbrowser_dom_read_pagebrowser_dom_find_elementsbrowser_dom_clickbrowser_dom_read_input_valuebrowser_dom_set_input_valuebrowser_dom_check_checkboxbrowser_dom_select_optionbrowser_dom_wait_for_elementbrowser_dom_get_element_attributesbrowser_dom_get_computed_stylesbrowser_dom_trigger_event
Approval and audit helpers:
desktop_list_pending_actionsdesktop_approve_pending_actiondesktop_reject_pending_actiondesktop_get_session_trace
Workflow orchestration:
workflow_open_workspace- reveals a workspace in Finder and opens it in the configured IDE
workflow_validate_workspace- opens the workspace, confirms
pwd, inspects local changes, and runs a validation command such aspnpm typecheck
- opens the workspace, confirms
workflow_run_tests- runs a test command from the workspace root
workflow_inspect_failure- focuses the IDE and re-runs or inspects a failing command path
workflow_browse_and_act- generic browse-and-act flow for app observation and follow-up actions
workflow_resume- resumes a workflow that paused on
approval_required
- resumes a workflow that paused on
Policy Model
The current macOS v1 boundary is intentionally narrow and explicit:
- global screen coordinates are allowed for UI actions
allowAppsis not used as a hard gate for click/type/scrolldenyAppsstill blocks sensitive foreground appsCOMPUTER_USE_OPENABLE_APPSonly gatesdesktop_open_appanddesktop_focus_app- AIRI itself is in the default deny list to avoid self-operation
- terminal commands always require approval
- app open/focus always require approval
- click/type/press/scroll still use per-action approval
Environment Variables
Core:
COMPUTER_USE_EXECUTORdry-run,macos-local, orlinux-x11
COMPUTER_USE_APPROVAL_MODEactions(default),all,never
COMPUTER_USE_SESSION_ROOT- local output directory for screenshots and
audit.jsonl
- local output directory for screenshots and
COMPUTER_USE_TIMEOUT_MSCOMPUTER_USE_DEFAULT_CAPTURE_AFTERCOMPUTER_USE_MAX_OPERATIONSCOMPUTER_USE_MAX_OPERATION_UNITSCOMPUTER_USE_MAX_PENDING_ACTIONS
macOS orchestration:
COMPUTER_USE_OPENABLE_APPS- default
Terminal,Cursor,Google Chrome
- default
COMPUTER_USE_DENY_APPS- default includes
1Password,Keychain,System Settings,Activity Monitor,AIRI
- default includes
COMPUTER_USE_DENY_WINDOW_TITLESCOMPUTER_USE_TERMINAL_SHELL- default current shell, otherwise
/bin/zsh
- default current shell, otherwise
COMPUTER_USE_ALLOWED_BOUNDS- optional global coordinate clamp
Browser DOM bridge:
COMPUTER_USE_BROWSER_DOM_BRIDGE_ENABLED- default
true
- default
COMPUTER_USE_BROWSER_DOM_BRIDGE_HOST- default
127.0.0.1
- default
COMPUTER_USE_BROWSER_DOM_BRIDGE_PORT- default
8765
- default
COMPUTER_USE_BROWSER_DOM_BRIDGE_TIMEOUT_MS- default
10000
- default
Autonomous browser agent:
COMPUTER_USE_BROWSER_AGENT_ROOT- optional override for the embedded browser-agent workspace under
src/bin/computer_use
- optional override for the embedded browser-agent workspace under
COMPUTER_USE_PYTHON- optional python executable override for
browser_agent_run; defaults to the embedded.venv/bin/pythonwhen present, otherwisepython3
- optional python executable override for
Legacy remote runner:
COMPUTER_USE_REMOTE_SSH_HOSTCOMPUTER_USE_REMOTE_SSH_USERCOMPUTER_USE_REMOTE_SSH_PORTCOMPUTER_USE_REMOTE_RUNNER_COMMANDCOMPUTER_USE_REMOTE_DISPLAY_SIZECOMPUTER_USE_REMOTE_OBSERVATION_BASE_URLCOMPUTER_USE_REMOTE_OBSERVATION_SERVE_PORTCOMPUTER_USE_REMOTE_OBSERVATION_TOKEN
Binary overrides:
COMPUTER_USE_SWIFT_BINARYCOMPUTER_USE_OSASCRIPT_BINARYCOMPUTER_USE_SCREENSHOT_BINARYCOMPUTER_USE_OPEN_BINARYCOMPUTER_USE_SSH_BINARYCOMPUTER_USE_TAR_BINARY
AIRI Integration
AIRI still connects through mcp.json.
Example local macOS entry:
{
"mcpServers": {
"computer_use": {
"command": "pnpm",
"args": [
"-F",
"@proj-airi/computer-use-mcp",
"start"
],
"cwd": "/path/to/your/airi/repo",
"env": {
"COMPUTER_USE_EXECUTOR": "macos-local",
"COMPUTER_USE_APPROVAL_MODE": "actions",
"COMPUTER_USE_OPENABLE_APPS": "Terminal,Cursor,Google Chrome"
}
}
}
}
On the AIRI desktop side, approvals are handled like this:
- model calls a
computer_use::*tool - MCP returns
approval_required - Electron shows a native approval dialog
- AIRI automatically calls
desktop_approve_pending_actionordesktop_reject_pending_action - terminal/app approvals can be reused for the current run only
For browser DOM automation, computer-use-mcp also exposes a local WebSocket bridge that matches the user's Chrome extension bridge pattern:
computer-use-mcplistens onws://127.0.0.1:8765by default- the unpacked browser extension background service worker connects to that socket
- AIRI can then call
browser_dom_*MCP tools against the active browser tab
If you override COMPUTER_USE_BROWSER_DOM_BRIDGE_HOST or
COMPUTER_USE_BROWSER_DOM_BRIDGE_PORT, mirror the same endpoint in the Chrome
extension via chrome.storage.local.set({ browserDomBridgeHost, browserDomBridgePort })
so the background worker reconnects to the correct socket.
Use the two surfaces differently:
desktop_*for AIRI itself, native macOS apps, Electron windows, Finder, Terminal, VS Codebrowser_dom_*for real browser pages, cross-frame DOM reads, form filling, selector-based interaction, and iframe-heavy flowsbrowser_agent_runfor goal-driven browser tasks where AIRI should delegate the web exploration loop instead of manually hard-coding each browser step
Validation Commands
pnpm -F @proj-airi/computer-use-mcp typecheckpnpm -F @proj-airi/computer-use-mcp testpnpm -F @proj-airi/computer-use-mcp smoke:stdiopnpm -F @proj-airi/computer-use-mcp smoke:macospnpm -F @proj-airi/computer-use-mcp e2e:airi-chatpnpm -F @proj-airi/computer-use-mcp e2e:airi-discord
Legacy remote validation remains available:
pnpm -F @proj-airi/computer-use-mcp bootstrap:remotepnpm -F @proj-airi/computer-use-mcp smoke:remote
Demo Story To Record
If you want to record a convincing demo, show the system as an orchestrated task runner instead of a flashy cursor dance.
Recommended recording structure:
- Show the AIRI desktop window, a terminal, and the generated report directory.
- Start the local AIRI desktop app and the
computer-use-mcpservice. - Show that AIRI can call the MCP tools automatically instead of only listing them.
- Demonstrate one short task that exercises the full loop:
- observe state
- execute a tool or workflow
- produce a visible result
- persist trace / audit / screenshots
- End by opening the generated
report.json,audit.jsonl, or screenshots so the demo finishes with evidence rather than just screen motion.
Good first demos:
- open a workspace, confirm
pwd, inspect local changes, and runpnpm typecheck - create and run a Python hello-world project through
terminal_exec - use desktop control for AIRI or native apps and use
browser_dom_*only when the task truly moves into a browser page
Discord integration demo
For a management-readable AIRI demo, the Discord settings flow is more representative than a generic hello-world reply:
- start AIRI desktop and
services/discord-bot - open
/settings/modules/messaging-discord - enable the module and save settings
- verify that the Discord bot receives the forwarded config from AIRI and reconnects itself
- finish by opening
report.json, screenshots, audit log, anddiscord-bot.log
Notes:
- for a pure local-secret run, set
AIRI_E2E_DISCORD_TOKEN - for a more agentic run, set
AIRI_E2E_DISCORD_TOKEN_SOURCE=portalorautoand let AIRI retrieve the token from the live browser / Discord Developer Portal session clipboard_read_text/clipboard_write_textare the intended bridge when AIRI must move a copied token from the browser back into AIRI settings- the observable harness keeps the token out of the desktop audit trail by applying the secret through the renderer instead of typing it through Quartz key events
- if you only want to validate the AIRI → Discord bot configuration plumbing without a real token, set
AIRI_E2E_DISCORD_ALLOW_LOGIN_FAILURE=true
Less convincing demos:
- long videos of coordinate clicking with no trace output
- browser form-filling done only by screen coordinates when DOM tools were available
- tasks that cannot explain afterwards what the agent observed, executed, or verified
Known Limits
- macOS only for the main v1 path
- no accessibility tree grounding yet
- PTY/TUI terminal support is product-supported on the self-acquire mainline; legacy outward terminal reroute remains secondary
- no multi-monitor orchestration policy yet
- global coordinates are allowed, so the safety boundary is approval + audit, not strict app isolation