mirror of
https://github.com/agent0ai/agent-zero.git
synced 2026-05-30 20:25:23 +00:00
Keep backend computer-use actions out of generic guidance
Move explicit AX action names and argument details out of the always-loaded computer_use_remote prompt and generic host-computer-use skill. The generic guidance now only explains backend discovery and skill loading, while host-computer-use-macos remains the detailed home for macOS structural targeting. Also soften the old Super+H hide-window guidance so window actions are chosen from the reported backend and verified visually.
This commit is contained in:
parent
e7cb3aa3fa
commit
a931759868
4 changed files with 30 additions and 27 deletions
|
|
@ -10,9 +10,9 @@ If the tool reports no CLI, disabled computer use, or `COMPUTER_USE_REARM_REQUIR
|
|||
|
||||
Call `start_session` before screen-driven tasks. Use `status` for state only, `capture` for screenshots without an action, and `stop_session` when the desktop task is complete. Interactive coordinate actions should use normalized global-screen coordinates from the most recent capture.
|
||||
|
||||
Some actions are backend-specific. Use `ax_snapshot` and `ax_action` only when `status` or `start_session` reports a macOS backend with Accessibility-tree features such as `accessibility-tree-snapshot` or `accessibility-structural-targeting`. For macOS structural targeting details, load and follow skill `host-computer-use-macos`; do not apply AX guidance to non-macOS backends.
|
||||
Some actions are backend-specific and intentionally documented only in backend skills. If `status` or `start_session` reports backend-specific features or tells you to load a backend skill, load and follow that skill before using those backend-only actions. For macOS structural targeting details, load and follow skill `host-computer-use-macos`; do not apply macOS guidance to non-macOS backends.
|
||||
|
||||
State-changing actions automatically attach a fresh screen after they run. Treat key presses, clicks, scrolling, typing, and window-manager shortcuts as attempts, not success: inspect the latest attached screen, or one explicit `capture` if it is unclear or unchanged, before saying the requested outcome happened. If the tool says a screen was attached but you cannot actually inspect the image, stop and report that visual verification is unavailable; do not continue by assuming the host state. For Ubuntu/GNOME/Wayland hide-window tasks, prefer `Super+H` (`{"action":"key","keys":["Super","H"]}`) for the active window; do not use `Alt+F9` as the primary hide/minimize shortcut because it often leaves the window visible. A `type` result only proves keystrokes were sent; it does not prove the window was hidden or that text landed in the intended place.
|
||||
State-changing actions automatically attach a fresh screen after they run. Treat key presses, clicks, scrolling, and typing as attempts, not success: inspect the latest attached screen, or one explicit `capture` if it is unclear or unchanged, before saying the requested outcome happened. If the tool says a screen was attached but you cannot actually inspect the image, stop and report that visual verification is unavailable; do not continue by assuming the host state. A `type` result only proves keystrokes were sent; it does not prove that text landed in the intended place.
|
||||
|
||||
```json
|
||||
{
|
||||
|
|
@ -24,18 +24,13 @@ State-changing actions automatically attach a fresh screen after they run. Treat
|
|||
```
|
||||
|
||||
Required argument:
|
||||
- `action`: one of `start_session`, `status`, `capture`, `ax_snapshot`, `ax_action`, `move`, `click`, `scroll`, `key`, `type`, `stop_session`
|
||||
- `action`: one of `start_session`, `status`, `capture`, `move`, `click`, `scroll`, `key`, `type`, `stop_session`; backend skills may document additional backend-only action values
|
||||
|
||||
Optional arguments by action:
|
||||
- `session_id`: session returned by `start_session`
|
||||
- `x`, `y`: normalized `[0,1]` global-screen coordinates for `move` and `click`
|
||||
- `button`: `left`, `right`, or `middle` for `click`
|
||||
- `count`: click count for `click`
|
||||
- `max_depth`, `max_nodes`: optional bounds for backend-gated `ax_snapshot`
|
||||
- `path`: element path from `ax_snapshot` for backend-gated `ax_action`
|
||||
- `target`: semantic element target for backend-gated `ax_action`, for example role/title/description/value/identifier
|
||||
- `operation` or `ax_action`: backend-gated AX operation such as `press`, `focus`, or `set_value`
|
||||
- `value` or `text`: value for backend-gated `ax_action` with `set_value`
|
||||
- `dx`, `dy`: scroll amounts for `scroll`
|
||||
- `key` or `keys`: key press value for `key`
|
||||
- `text`: text to type for `type`
|
||||
|
|
|
|||
|
|
@ -13,8 +13,6 @@ triggers:
|
|||
- "host screen"
|
||||
- "local screen"
|
||||
- "Ubuntu Wayland desktop"
|
||||
- "hide window"
|
||||
- "minimize window"
|
||||
---
|
||||
|
||||
# Host Computer Use
|
||||
|
|
@ -57,9 +55,9 @@ Use:
|
|||
|
||||
Arguments:
|
||||
|
||||
- `action`: `start_session`, `status`, `capture`, `ax_snapshot`, `ax_action`, `move`, `click`, `scroll`, `key`, `type`, `stop_session`
|
||||
- `action`: `start_session`, `status`, `capture`, `move`, `click`, `scroll`, `key`, `type`, `stop_session`
|
||||
- `session_id`: optional after `start_session`
|
||||
- `ax_snapshot`, `ax_action`: backend-gated structural accessibility actions; use only when backend metadata advertises matching support, and load the backend-specific skill first
|
||||
- backend skills may document additional backend-only action values; use them only when backend metadata advertises matching support and after loading the backend-specific skill
|
||||
- `move`: `x`, `y` normalized to `[0,1]`
|
||||
- `click`: optional `x`, `y`, optional `button` (`left`, `right`, `middle`), optional `count`
|
||||
- `scroll`: `dx`, `dy`
|
||||
|
|
@ -75,30 +73,28 @@ If any tool result contains `COMPUTER_USE_REARM_REQUIRED` or `status=rearm requi
|
|||
1. Call `start_session` first.
|
||||
2. Read the returned `backend_id`, `backend_family`, and `features`; load a backend-specific Computer Use skill when the task needs backend-only affordances.
|
||||
3. Decide final success from the latest screenshot, not from memory.
|
||||
4. Interactive actions (`ax_action`, `move`, `click`, `scroll`, `key`, `type`) already attach a fresh screenshot after they run; inspect it before claiming the requested outcome succeeded.
|
||||
4. Interactive actions already attach a fresh screenshot after they run; inspect it before claiming the requested outcome succeeded.
|
||||
5. Use `status` for state without starting a session.
|
||||
6. Use `capture` only when you need another screenshot without taking an action.
|
||||
|
||||
## Backend Skills
|
||||
|
||||
- If the backend is macOS or features include `accessibility-tree-snapshot` / `accessibility-structural-targeting`, load `host-computer-use-macos` before using `ax_snapshot` or `ax_action`.
|
||||
- If the backend is macOS or features include `accessibility-tree-snapshot` / `accessibility-structural-targeting`, load `host-computer-use-macos` before using macOS structural Accessibility actions.
|
||||
- Do not use backend-specific actions just because their argument names exist in the generic contract. Treat them as unavailable unless the connected CLI advertises the matching feature.
|
||||
|
||||
## Operating Rules
|
||||
|
||||
- Only the latest screenshot or a definitive tool result counts as evidence.
|
||||
- If a tool result says a screenshot was attached but you cannot actually see the image, stop and report that visual verification is unavailable. Do not continue with another action from an assumed host state.
|
||||
- Outside advertised AX support, use normalized global screen coordinates; do not assume window ids, element indexes, background-safe input, or semantic click targets unless the runtime explicitly advertises them.
|
||||
- Outside advertised structural accessibility support, use normalized global screen coordinates; do not assume window ids, element indexes, background-safe input, or semantic click targets unless the runtime explicitly advertises them.
|
||||
- Prefer accessibility and semantic UI paths first: shortcuts, command palettes, menu accelerators, address/search bars, focus traversal, and other keyboard-accessible controls.
|
||||
- Prefer `key` and `type` over pointer actions whenever a reliable keyboard path exists.
|
||||
- When a menu or popup is open, treat it as the active UI and prefer keyboard navigation over clicking small transient rows by coordinate.
|
||||
- If a click dismisses a menu or popup without producing the expected next UI, treat that attempt as failed.
|
||||
- If the same approach has already failed twice without visible progress, switch strategy instead of repeating it.
|
||||
- Do not infer focus or task completion from chat logs, sidebars, tool summaries, or status text.
|
||||
- Never claim a window was hidden, minimized, moved, text was submitted, or navigation completed until the latest screenshot visibly confirms it.
|
||||
- On Ubuntu/GNOME/Wayland, use `Super+H` (`{"action":"key","keys":["Super","H"]}`) to hide the active window. Do not use `Alt+F9` as the primary hide/minimize shortcut on this environment; it often leaves the window visible.
|
||||
- After any hide/minimize shortcut, inspect the fresh screenshot. If the target window or focused composer is still visible, treat the attempt as failed and do not type follow-up text into the active field.
|
||||
- A `type` tool result only confirms keystrokes were sent. It is not evidence that the text landed in the intended application, nor evidence that a window was hidden first.
|
||||
- Never claim a state-changing action succeeded until the latest screenshot visibly confirms it.
|
||||
- A `type` tool result only confirms keystrokes were sent. It is not evidence that the text landed in the intended application.
|
||||
- For browser-navigation tasks done through this tool, only claim success if the browser content area visibly shows the destination page or result.
|
||||
- If the attached screenshot appears unchanged after a state-changing action, use one explicit `capture` to verify before repeating the same action.
|
||||
- Use `type(..., submit=true)` only for URL or navigation-style entry where Enter should fire immediately after typing.
|
||||
|
|
|
|||
|
|
@ -134,19 +134,22 @@ def test_computer_use_remote_prompt_requires_visual_verification_after_actions()
|
|||
/ "SKILL.md"
|
||||
).read_text(encoding="utf-8")
|
||||
|
||||
assert "Treat key presses, clicks, scrolling, typing" in prompt
|
||||
assert "Treat key presses, clicks, scrolling, and typing" in prompt
|
||||
assert "attempts, not success" in prompt
|
||||
assert "visual verification is unavailable" in prompt
|
||||
assert "do not continue by assuming the host state" in prompt
|
||||
assert "Super+H" in prompt
|
||||
assert '["Super","H"]' in prompt
|
||||
assert "Alt+F9" in prompt
|
||||
assert "Super+H" not in prompt
|
||||
assert "Alt+F9" not in prompt
|
||||
assert "hide" not in prompt.lower()
|
||||
assert "minimize" not in prompt.lower()
|
||||
assert "window-manager" not in prompt
|
||||
assert "cannot actually see the image" in skill
|
||||
assert "Do not use `Alt+F9` as the primary hide/minimize shortcut" in skill
|
||||
assert "A `type` tool result only confirms keystrokes were sent" in skill
|
||||
assert "do not type follow-up text into the active field" in skill
|
||||
assert "visibly confirms" in skill
|
||||
assert "Ubuntu/GNOME/Wayland" in skill
|
||||
assert "hide window" not in skill
|
||||
assert "minimize window" not in skill
|
||||
assert "hide/minimize" not in skill
|
||||
assert "window-manager" not in skill
|
||||
|
||||
|
||||
def test_remote_file_and_exec_tools_are_standard_tool_prompts_independent_from_context():
|
||||
|
|
@ -398,11 +401,14 @@ def test_remote_tool_stubs_are_self_contained_and_reference_per_tool_skills():
|
|||
assert '"tool_name": "computer_use_remote"' in computer_stub
|
||||
assert "load and follow skill `host-computer-use`" in computer_stub
|
||||
assert "host-computer-use-macos" in computer_stub
|
||||
assert "ax_snapshot" not in computer_stub
|
||||
assert "ax_action" not in computer_stub
|
||||
assert "Do not substitute the `linux-desktop` skill" in computer_stub
|
||||
assert '"tool_name": "computer_use_remote"' in computer_skill
|
||||
assert '"tool_name": "computer_use_remote"' in macos_computer_skill
|
||||
assert "ax_snapshot" in macos_computer_skill
|
||||
assert "ax_snapshot`/`ax_action` are structural Accessibility targeting" not in computer_skill
|
||||
assert "ax_snapshot" not in computer_skill
|
||||
assert "ax_action" not in computer_skill
|
||||
assert "Availability, backend support, and trust mode are checked when the tool runs" in computer_stub
|
||||
assert "not `code_execution_tool`" in exec_stub
|
||||
assert "not to" in exec_stub
|
||||
|
|
|
|||
|
|
@ -618,8 +618,14 @@ def test_computer_use_remote_is_runtime_checked_standard_tool():
|
|||
assert "checked when the tool runs" in standard_prompt_text
|
||||
assert "visual verification is unavailable" in standard_prompt_text
|
||||
assert "host-computer-use-macos" in standard_prompt_text
|
||||
assert "ax_snapshot" not in standard_prompt_text
|
||||
assert "ax_action" not in standard_prompt_text
|
||||
assert '"tool_name": "computer_use_remote"' in skill_text
|
||||
assert "ax_snapshot" not in skill_text
|
||||
assert "ax_action" not in skill_text
|
||||
assert '"tool_name": "computer_use_remote"' in macos_skill_text
|
||||
assert "ax_snapshot" in macos_skill_text
|
||||
assert "ax_action" in macos_skill_text
|
||||
assert "Backend-specific macOS guidance" in macos_skill_text
|
||||
assert "Beta desktop control" in skill_text
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue