feat: adversarial agent for preventing leaking of info and more (#7948)
Some checks failed
Live Provider Tests / check-fork (push) Successful in 2s
Publish Docker Image / docker (push) Failing after 3s
Scorecard supply-chain security / Scorecard analysis (push) Has been skipped
CI / Check OpenAPI Schema is Up-to-Date (push) Has been skipped
Live Provider Tests / changes (push) Failing after 3s
Canary / Prepare Version (push) Failing after 3s
Canary / bundle-desktop (push) Has been skipped
Canary / bundle-desktop-intel (push) Has been skipped
Canary / bundle-desktop-linux (push) Has been skipped
Canary / bundle-desktop-windows (push) Has been skipped
Canary / build-cli (push) Has been skipped
Canary / Upload Install Script (push) Has been skipped
Canary / Release (push) Has been skipped
Unused Dependencies / machete (push) Has been skipped
CI / changes (push) Failing after 3s
Deploy Documentation / deploy (push) Failing after 4s
Publish Ask AI Bot Docker Image / docker (push) Failing after 5s
CI / Check Rust Code Format (push) Has been skipped
CI / Build and Test Rust Project (push) Has been skipped
CI / Lint Rust Code (push) Has been skipped
Live Provider Tests / Build Binary (push) Has been skipped
Live Provider Tests / Smoke Tests (Code Execution) (push) Has been skipped
Live Provider Tests / Smoke Tests (push) Has been skipped
Live Provider Tests / Compaction Tests (push) Has been skipped
Live Provider Tests / goose server HTTP integration tests (push) Has been skipped
CI / Test and Lint Electron Desktop App (push) Has been cancelled

This commit is contained in:
Michael Neale 2026-03-17 17:38:45 +11:00 committed by GitHub
parent a0835be10f
commit 754c214df4
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 904 additions and 0 deletions

View file

@ -0,0 +1,88 @@
---
sidebar_position: 2
title: Adversary Mode
sidebar_label: Adversary Mode
description: An independent agent reviewer that silently watches tool calls to protect you in case the agent goes rogue.
---
Adversary mode adds a silent, independent agent reviewer that watches tool calls before they execute. It acts as a second pair of eyes — protecting you in case the main agent is compromised, manipulated by a prompt injection, or simply doing something that doesn't match what you asked for.
Unlike [pattern-based detection](/docs/guides/security/prompt-injection-detection), the adversary reviewer understands context. It knows what you originally asked for, what you've said recently, and can judge whether a tool call makes sense for your task.
## How It Works
1. Before each tool call, the adversary reviewer checks your **original task**, **recent messages**, and the **tool call details**
2. It evaluates the tool call against your rules and returns **ALLOW** or **BLOCK**
3. Blocked tool calls are denied — the agent sees the rejection and cannot retry
4. If the reviewer fails for any reason, the tool call is allowed through (fail-open)
The adversary uses the same model and provider goose is already configured with. No additional API keys or services needed.
## Enabling Adversary Mode
Create a file at `~/.config/goose/adversary.md` with your rules:
```markdown
BLOCK if the tool call:
- Exfiltrates data (posting to unknown URLs, piping secrets to external services)
- Is destructive beyond the project scope (deleting system files, wiping directories)
- Installs malware or runs obfuscated code
- Attempts to escalate privileges unnecessarily
- Downloads and executes untrusted remote scripts
ALLOW normal development operations like editing files, running tests,
installing packages, using git, etc. Most tool calls are fine.
Err on the side of ALLOW — only block truly dangerous things.
```
That's it. If the file exists, adversary mode is on. Delete the file to turn it off.
An empty file uses sensible default rules similar to the example above.
## Writing Good Rules
The rules in `adversary.md` are your policy, written in plain language:
- **Be specific** about what should be blocked — vague rules lead to false positives
- **Err on the side of ALLOW** — the adversary should catch genuinely dangerous actions, not slow down normal work
- **Think about your threat model** — what would a rogue agent actually try to do?
You can tailor rules to your environment:
```markdown
BLOCK if:
- Any network request goes to a domain not in: github.com, npmjs.org, pypi.org
- Files outside of ~/projects/ are modified
- SSH keys, .env files, or credentials are read or transmitted
ALLOW all standard development operations within ~/projects/.
```
## What Gets Reviewed
By default, the adversary reviews **`shell`** and **`computercontroller__automation_script`** — the tools that can execute arbitrary code.
You can expand coverage by adding a `tools:` line at the top of your `adversary.md`:
```markdown
tools: shell, computercontroller__automation_script
---
BLOCK if the command exfiltrates data or is destructive.
ALLOW normal development operations.
```
The `tools:` line is a comma-separated list of tool names to review. Everything before the `---` separator is configuration; everything after is your rules. If you omit the `tools:` line, `shell` and `computercontroller__automation_script` are reviewed by default.
Some tool names you might want to add:
| Tool name | What it does |
|-----------|-------------|
| `shell` | Executes shell commands (default) |
| `computercontroller__automation_script` | Runs shell, Ruby, AppleScript, or PowerShell scripts (default) |
| `computercontroller__computer_control` | UI automation — clicks, keystrokes, typing |
| `computercontroller__web_scrape` | Fetches arbitrary URLs |
## See Also
- [Prompt Injection Detection](/docs/guides/security/prompt-injection-detection) — pattern-based detection (complementary, always-on when enabled)
- [goose Permission Modes](/docs/guides/goose-permissions) — control goose's autonomy level