# Pulse AI Pulse Patrol is available to everyone on the Community plan with BYOK (your own AI provider). Pro, Pro+, and Cloud unlock auto-fix and advanced analysis. Learn more at or see [PULSE_PRO.md](PULSE_PRO.md). --- ## Overview Pulse includes two AI-powered systems: 1. **Pulse Assistant** — An interactive chat interface for ad-hoc troubleshooting, investigations, and infrastructure control. 2. **Pulse Patrol** — A scheduled, context-aware analysis service that continuously monitors your infrastructure, learns what's normal, predicts issues, and generates actionable findings. Both systems are built on the same tool-driven architecture where the LLM acts as a proposer and Go code enforces safety gates. ### Not Just Another Chatbot Pulse Assistant is a **protocol-driven, safety-gated agentic system** that: - **Proactively gathers context** — understands resources before you ask (context prefetcher) - **Learns within sessions** — extracts and caches facts to avoid redundant queries (knowledge accumulator) - **Enforces workflow invariants** — FSM prevents dangerous state transitions - **Supports parallel tool execution** — efficient batch operations with concurrency control - **Detects and prevents hallucinations** — phantom execution detection - **Auto-recovers from errors** — structured error envelopes enable self-correction 📖 **For a deep technical dive into the Assistant architecture, see [architecture/pulse-assistant-deep-dive.md](architecture/pulse-assistant-deep-dive.md).** ### Not Just Another Alerting System Pulse Patrol is a **multi-layered intelligence platform** that: - **Learns** what's normal for your environment (baseline engine) - **Predicts** issues before they become critical (pattern detection + forecasting) - **Correlates** events across your entire infrastructure (root cause analysis) - **Remembers** past incidents and successful remediations (incident memory) - **Investigates** issues autonomously when configured (investigation orchestrator) - **Verifies** fixes and tracks remediation effectiveness (verification loops) All while running entirely on your infrastructure with BYOK for complete privacy. 📖 **For a deep technical dive into the intelligence subsystems, see [architecture/pulse-patrol-deep-dive.md](architecture/pulse-patrol-deep-dive.md).** See [architecture/pulse-assistant.md](architecture/pulse-assistant.md) for the original safety architecture documentation. --- ## Pulse Patrol Patrol is a scheduled analysis pipeline that builds a rich, system-wide snapshot and produces actionable findings. ### How Patrol Works ``` Scheduled/Event Trigger │ ▼ buildSeedContext() ── infrastructure snapshot │ ▼ LLM analysis (with tools) ← pulse_storage, pulse_metrics, pulse_alerts, etc. │ ▼ DetectSignals() ── deterministic signal detection from tool outputs │ ▼ createFinding() ── validated, deduplicated findings stored │ ▼ (if configured) MaybeInvestigateFinding() ── automatic investigation + remediation ``` ### What Patrol Sees Every patrol run passes the LLM comprehensive context about your environment: | Data Category | What's Included | |---------------|-----------------| | **Proxmox Nodes** | Status, CPU%, memory%, uptime, 24h/7d trend analysis | | **VMs & Containers** | Full metrics, backup status, OCI images, historical trends, anomaly flags | | **Storage Pools** | Usage %, capacity predictions, type (ZFS/LVM/Ceph), growth rates | | **Docker/Podman** | Container counts, health states, unhealthy container lists | | **Kubernetes** | Nodes, pods, deployments, services, DaemonSets, StatefulSets, namespaces | | **TrueNAS** | Pools, datasets, disk health, SMART status, replication, alerts | | **PBS/PMG** | Datastore status, backup jobs, job failures, verification status | | **Ceph** | Cluster health, OSD states, PG status | | **Agent Hosts** | Load averages, memory, disk, RAID status, temperatures | ### Enriched Context Beyond raw metrics, Patrol enriches the context with intelligence: - **Trend analysis** — 24h and 7d patterns showing `growing`, `stable`, `declining`, or `volatile` behavior - **Learned baselines** — Z-score anomaly detection based on what's *normal for your environment* - **Capacity predictions** — "Storage pool will be full in 12 days at current growth rate" - **Infrastructure changes** — Detected config changes, VM migrations, new deployments - **Resource correlations** — Pattern detection across related resources - **User notes** — Your annotations explaining expected behavior - **Dismissed findings** — Respects your feedback and suppressed alerts - **Incident memory** — Learns from past investigations and successful remediations ### Deterministic Signal Detection Patrol doesn't rely solely on LLM judgment. It parses tool call outputs and fires deterministic signals for known problems: | Signal Type | Trigger | Default Threshold | |------------|---------|-------------------| | `smart_failure` | SMART health status not OK/PASSED | N/A | | `high_cpu` | Average CPU usage | 70% | | `high_memory` | Average memory usage | 80% | | `high_disk` | Storage pool usage | 75% (warning), 95% (critical) | | `backup_failed` | Recent backup task with error status | Within 48h | | `backup_stale` | No backup completed for VM/CT | 48+ hours | | `active_alert` | Critical/warning alert in list | N/A | Thresholds can be configured via alert settings to match user-defined values. ### Examples of What Patrol Catches | Issue | Severity | Example | |-------|----------|---------| | **Node offline** | Critical | Proxmox node not responding | | **Disk approaching capacity** | Warning/Critical | Storage at 85%+, or growing toward full | | **Backup failures** | Warning | PBS job failed, no backup in 48+ hours | | **Service down** | Critical | Docker container crashed, agent offline | | **High resource usage** | Warning | Sustained memory >90%, CPU >85% | | **Storage issues** | Critical | PBS datastore errors, ZFS pool degraded | | **Ceph problems** | Warning/Critical | Degraded OSDs, unhealthy PGs | | **Kubernetes issues** | Warning | Pods stuck in Pending/CrashLoopBackOff | | **SMART failures** | Critical | Disk health check failed | ### What Patrol Ignores (by design) Patrol is **intentionally conservative** to avoid noise: - Small baseline deviations ("CPU at 15% vs typical 10%") - Low utilization that's "elevated" but fine (disk at 40%) - Stopped VMs/containers that were intentionally stopped - Brief spikes that resolve on their own - Anything that doesn't require human action > **Philosophy**: If a finding wouldn't be worth waking someone up at 3am, Patrol won't create it. ### Finding Severity - **Critical**: Immediate attention required (service down, data at risk) - **Warning**: Should be addressed soon (disk filling, backup stale) Note: `info` and `watch` level findings are filtered out to reduce noise. ### Managing Findings Findings can be managed via the UI or API: - **Get help**: Chat with AI to troubleshoot the issue - **Resolve**: Mark as fixed (finding will reappear if the issue resurfaces) - **Dismiss**: Mark as expected behavior (creates suppression rule) Dismissed and resolved findings persist across Pulse restarts. --- ## Autonomy Levels Patrol supports three autonomy modes that control how much action it can take: | Mode | Behavior | Plan | |------|----------|------| | **Monitor** | Detect issues only. No investigation or fixes. | Community (BYOK) | | **Investigate** | Investigates findings and proposes fixes. All fixes require approval before execution. | Community (BYOK) | | **Auto-fix** | Automatically fixes issues and verifies results. Critical findings still require approval by default. | Pro / Pro+ / Cloud | ### Investigation Flow When a finding is created in Investigate or Auto-fix mode: ``` Finding created │ ▼ MaybeInvestigateFinding() │ ├─ Has orch + chatService? │ │ │ ▼ │ InvestigateFinding() │ │ │ ▼ │ Create chat session │ │ │ ▼ │ AI analysis (with tools) │ │ │ ▼ │ [Fix proposed?] ──Yes──► Queue approval (or auto-execute in full mode) │ │ │ No │ ▼ │ Update finding with outcome │ └─ Skip investigation ``` ### Investigation Configuration | Setting | Default | Description | |---------|---------|-------------| | `MaxTurns` | 15 | Maximum agentic turns per investigation | | `Timeout` | 10 min | Maximum duration per investigation | | `MaxConcurrent` | 3 | Maximum concurrent investigations | | `MaxAttemptsPerFinding` | 3 | Maximum investigation attempts per finding | | `CooldownDuration` | 1 hour | Cooldown before re-investigating | | `TimeoutCooldownDuration` | 10 min | Shorter cooldown for timeout failures | | `VerificationDelay` | 30 sec | Wait before verifying fix | ### Investigation Outcomes | Outcome | Meaning | |---------|---------| | `resolved` | Issue resolved during investigation | | `fix_queued` | Fix proposed, awaiting approval | | `fix_executed` | Fix auto-executed successfully | | `fix_failed` | Fix attempted but failed | | `fix_verified` | Fix worked, issue confirmed resolved | | `fix_verification_failed` | Fix ran but issue persists | | `needs_attention` | Requires human intervention | | `cannot_fix` | Issue cannot be automatically fixed | | `timed_out` | Investigation timed out (will retry sooner) | --- ## Pulse Assistant (Chat) Pulse Assistant is a **tool-driven** chat interface. It does not "guess" system state — it calls live tools and reports their outputs. ### The Model's Workflow (Discover → Investigate → Act) 1. **Discover**: Uses `pulse_query` or `pulse_discovery` to find real resources and IDs 2. **Investigate**: Uses `pulse_read` to run bounded, read-only commands and check status/logs 3. **Act** (optional): Uses `pulse_control` for changes, then verifies with a read ### Available Tools | Tool | Classification | Purpose | |------|---------------|---------| | `pulse_query`, `pulse_discovery` | Resolve | Resource discovery and query | | `pulse_read` | Read | Read-only operations: exec, file, find, tail, logs | | `pulse_metrics` | Read | Performance metrics and baselines | | `pulse_storage` | Read | Storage pools, backups, snapshots, Ceph, RAID, disk health | | `pulse_kubernetes` | Read | Kubernetes cluster info | | `pulse_pmg` | Read | Proxmox Mail Gateway stats | | `pulse_alerts` | Read/Write | Alert management (resolve/dismiss are writes) | | `pulse_docker` | Read/Write | Docker operations (control/update are writes) | | `pulse_knowledge` | Read/Write | Knowledge persistence (remember/note/save are writes) | | `pulse_file_edit` | Read/Write | File operations (write/append are writes) | | `pulse_control` | Write | Guest control, service management | | `patrol_report_finding` | Patrol | Report a new finding (patrol runs only) | | `patrol_resolve_finding` | Patrol | Resolve an active finding (patrol runs only) | | `patrol_get_findings` | Patrol | List active findings (patrol runs only) | ### Safety Gates The assistant enforces multiple safety gates: 1. **Discovery Before Action** — Action tools cannot operate on resources that weren't first discovered 2. **Verification After Write** — After any write, the model must perform a read/status check before providing a final answer 3. **Read/Write Separation** — Read operations route through `pulse_read` (stays in READING state); write operations route through `pulse_control` (enters VERIFYING state) 4. **Phantom Detection** — Detects when the model claims execution without tool calls 5. **Approval Mode** — In Controlled mode, every write requires explicit user approval 6. **Execution Context Binding** — Commands execute within the resolved resource's context, not on parent hosts ### Control Levels | Level | Behavior | Plan | |-------|----------|---------| | **Read-only** | AI can observe and query data only | Community | | **Controlled** | AI asks for approval before executing commands | Community | | **Autonomous** | AI executes actions without prompting | Pro / Pro+ / Cloud | ### Using Approvals (Controlled Mode) When control level is **Controlled**, write actions pause for approval: 1. Tool returns `APPROVAL_REQUIRED: { approval_id, command, ... }` 2. Agentic loop emits `approval_needed` SSE event 3. UI shows approval card with the proposed command 4. **Approve** to execute and verify, or **Deny** to cancel 5. Only users with admin privileges can approve/deny --- ## Configuration Configure in the UI: **Settings → System → AI Assistant** ### Supported Providers - **Anthropic** (API key or OAuth) - **OpenAI** - **OpenRouter** - **DeepSeek** - **Google Gemini** - **Ollama** (self-hosted, with tool/function calling support) - **OpenAI-compatible base URL** (for providers that implement the OpenAI API shape) ### Models Pulse uses model identifiers in the form: `provider:model-name` You can set separate models for: - Chat (`chat_model`) - Patrol (`patrol_model`) - Auto-fix remediation (`auto_fix_model`) ### Storage AI settings are stored encrypted at rest in `ai.enc` under the Pulse config directory. Related files: | File | Purpose | |------|---------| | `ai.enc` | Encrypted AI configuration and credentials | | `ai_findings.json` | Patrol findings | | `ai_patrol_runs.json` | Patrol run history | | `ai_usage_history.json` | Token usage data | | `ai_chat_sessions.json` | Legacy chat sessions (UI sync) | | `baselines.json` | Learned resource baselines | | `ai_correlations.json` | Resource correlation data | | `ai_patterns.json` | Detected patterns | Config directory: `/etc/pulse` (systemd) or `/data` (Docker/Kubernetes) ### Testing - Test provider connectivity: `POST /api/ai/test` and `POST /api/ai/test/{provider}` - List available models: `GET /api/ai/models` --- ## Schedule and Triggers Patrol runs on a configurable schedule: | Interval | Description | |----------|-------------| | Disabled | Patrol runs only when manually triggered | | 10 min – 7 days | Configurable interval (default: 6 hours) | Patrol can also be triggered by: - **Manual run**: Click "Run Patrol" in the UI - **Alert-triggered analysis (Pro and above)**: Runs when an alert fires - **API call**: `POST /api/ai/patrol/run` --- ## AI Intelligence Layer Pulse includes a unified intelligence system that aggregates data from all AI subsystems: ### Components | Component | Purpose | |-----------|---------| | **Baseline Engine** | Learns normal behavior, detects anomalies via z-score | | **Pattern Detector** | Identifies recurring issues and trends | | **Correlation Engine** | Links related issues across resources | | **Incident Memory** | Tracks past incidents and successful remediations | | **Knowledge Store** | Persists user annotations and learned preferences | | **Forecast Engine** | Predicts capacity issues and resource exhaustion | ### Health Scoring Each resource receives a health score (A–F) based on: - Current metrics vs baseline - Active findings and alerts - Recent incidents - Trend direction (improving/stable/declining) --- ## Model Matrix (Pulse Assistant) This table summarizes the most recent **Pulse Assistant** eval runs per model. Update the table from eval reports: ``` EVAL_REPORT_DIR=tmp/eval-reports go run ./cmd/eval -scenario matrix -auto-models python3 scripts/eval/render_model_matrix.py tmp/eval-reports --write-doc docs/AI.md ``` Or use the helper script: ``` scripts/eval/run_model_matrix.sh ``` | Model | Smoke | Read-only | Time (matrix) | Tokens (matrix) | Last run (UTC) | | --- | --- | --- | --- | --- | --- | | anthropic:claude-3-haiku-20240307 | ✅ | ❌ | 2m 42s | — | 2026-01-29 | | anthropic:claude-haiku-4-5-20251001 | ✅ | ✅ | 8s | 18,923 | 2026-01-29 | | anthropic:claude-opus-4-5-20251101 | ✅ | ✅ | 9m 31s | 1,120,530 | 2026-01-29 | | gemini:gemini-3-flash-preview | ✅ | ✅ | 7m 4s | — | 2026-01-29 | | gemini:gemini-3-pro-preview | ✅ | ✅ | 3m 54s | 1,914 | 2026-01-29 | | openai:gpt-5.2 | ✅ | ✅ | 5s | 12,363 | 2026-01-29 | | openai:gpt-5.2-chat-latest | ✅ | ✅ | 8s | 12,595 | 2026-01-29 | --- ## Safety Controls Pulse includes settings that control how "active" AI features are: - **Autonomous mode (Pro and above)**: When enabled, AI may execute safe commands without approval - **Patrol auto-fix (Pro and above)**: Allows patrol to attempt automatic remediation - **Alert-triggered analysis (Pro and above)**: Limits AI to analyzing specific events when alerts occur - **Full autonomy unlock (Pro and above)**: Enables auto-fix for critical findings without approval (requires explicit toggle) If you enable execution features, ensure agent tokens and scopes are appropriately restricted. ### Advanced Network Restrictions Pulse blocks AI tool HTTP fetches to loopback and link-local addresses by default. For local development: - `PULSE_AI_ALLOW_LOOPBACK=true` Use this only in trusted environments. --- ## Privacy Patrol runs on your server and only sends the minimal context needed for analysis to the configured provider (when AI is enabled). Anonymous telemetry (counts and feature flags only — no hostnames or credentials) is enabled by default and can be disabled any time — see [Privacy](PRIVACY.md) for details. --- ## Why Patrol Is Different From Traditional Alerts Alerts are threshold-based and narrow. Patrol is context-based and cross-system. - **Alerts**: "Disk > 90%" - **Patrol**: "ZFS pool is 86% but trending +4%/day; projected to hit 95% within a week. Largest consumer is datastore X. Recommend prune or expand." --- ## Cost Tracking Pulse tracks token usage and costs: - View usage summary: `GET /api/ai/cost/summary` - Reset counters: `POST /api/ai/cost/reset` (admin) - Set monthly budget limits in AI settings --- ## Troubleshooting | Issue | Solution | |-------|----------| | AI not responding | Verify provider credentials in **Settings → System → AI Assistant** | | No execution capability | Confirm at least one agent is connected | | Findings not persisting | Check Pulse has write access to `ai_findings.json` in the config directory | | Too many findings | This shouldn't happen — please report if it does | | Investigation stuck | Check circuit breaker status at `/api/ai/circuit/status`; may auto-reset after cooldown | | Model not available | Ensure provider API key is valid and model ID matches provider format | ## Related Documentation ### Deep Dives (Recommended for Technical Audiences) - **[Pulse Assistant Deep Dive](architecture/pulse-assistant-deep-dive.md)** — Complete technical breakdown of the agentic architecture: context prefetching, knowledge accumulation, FSM enforcement, parallel execution, phantom detection, auto-recovery - **[Pulse Patrol Deep Dive](architecture/pulse-patrol-deep-dive.md)** — Full intelligence layer documentation: baseline learning, pattern detection, forecasting, correlation analysis, incident memory, investigation orchestration ### Reference Documentation - [Architecture: Pulse Assistant (Safety Gates)](architecture/pulse-assistant.md) — Detailed FSM states, tool protocol, and invariants - [API Reference](API.md) — Complete API endpoint documentation - [Plans and entitlements](PULSE_PRO.md) — Community/Relay/Pro/Pro+/Cloud features and licensing