# Pulse AI
Pulse AI adds an optional assistant for troubleshooting and proactive monitoring. It is **off by default** and can be enabled per instance.
## What Makes AI Patrol Different
Unlike a generic AI chat, where you describe your infrastructure by hand, Patrol runs automatically and sees **your entire infrastructure at once**: every node, VM, container, storage pool, backup job, and Kubernetes cluster. It is not a static checklist; an LLM analyzes real-time data enriched with historical context.
### Context Patrol Receives (That Generic LLMs Can't See)
Every patrol run passes the LLM comprehensive context about your environment:
| Data Category | What's Included |
|---------------|-----------------|
| **Proxmox Nodes** | Status, CPU%, memory%, uptime, 24h/7d trend analysis |
| **VMs & Containers** | Full metrics, backup status, OCI images, historical trends, anomaly flags |
| **Storage Pools** | Usage %, capacity predictions, type (ZFS/LVM/Ceph), growth rates |
| **Docker/Podman** | Container counts, health states, unhealthy container lists |
| **Kubernetes** | Nodes, pods, deployments, services, DaemonSets, StatefulSets, namespaces |
| **PBS/PMG** | Datastore status, backup jobs, job failures, verification status |
| **Ceph** | Cluster health, OSD states, PG status |
| **Agent Hosts** | Load averages, memory, disk, RAID status, temperatures |
### Enriched Context (The Real Differentiator)
Beyond raw metrics, Patrol adds derived context to each run so the LLM can judge what actually needs attention:
- **Trend analysis** - 24h and 7d patterns showing `growing`, `stable`, `declining`, or `volatile` behavior
- **Learned baselines** - Z-score anomaly detection based on what's *normal for your environment*
- **Capacity predictions** - "Storage pool will be full in 12 days at current growth rate"
- **Infrastructure changes** - Detected config changes, VM migrations, new deployments
- **Resource correlations** - Pattern detection across related resources (e.g., containers on same host)
- **User notes** - Your annotations explaining expected behavior ("runs hot for transcoding")
- **Dismissed findings** - Respects your feedback and suppressed alerts
- **Incident memory** - Learns from past investigations and successful remediations
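
Two of the enrichments above, learned baselines and capacity predictions, can be illustrated with a minimal sketch. This is not Pulse's implementation: the function names, the use of a population standard deviation, and the linear growth model are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def z_score(history: Sequence[float], current: float) -> float:
    """Return how many standard deviations `current` is from the history's mean."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat history cannot be scored; this sketch returns 0.0.
        return 0.0
    return (current - baseline) / spread


def days_until_full(used_pct: float, growth_pct_per_day: float) -> Optional[float]:
    """Linear estimate of the days left until usage reaches 100%."""
    if growth_pct_per_day <= 0:
        return None  # not growing, so there is no projected fill date
    return (100.0 - used_pct) / growth_pct_per_day


# Hypothetical samples: CPU% history for one guest, and a pool growing 2%/day.
cpu_z = z_score([10, 10, 12, 8], 14)
pool_days = days_until_full(76.0, 2.0)
```

A pool at 76% that grows two percentage points per day has 12 days left under this linear model, which is the kind of statement the capacity prediction bullet describes.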
### Examples of What Patrol Catches
Because it is an LLM with full context, Patrol can catch issues that static threshold-based alerting misses, in addition to the standard ones:
| Issue | Severity | Example |
|-------|----------|---------|
| **Node offline** | Critical | Proxmox node not responding |
| **Disk approaching capacity** | Warning/Critical | Storage at 85%+, or growing toward full |
| **Backup failures** | Warning | PBS job failed, no backup in 48+ hours |
| **Service down** | Critical | Docker container crashed, agent offline |
| **High resource usage** | Warning | Sustained memory >90%, CPU >85% |
| **Storage issues** | Critical | PBS datastore errors, ZFS pool degraded |
| **Ceph problems** | Warning/Critical | Degraded OSDs, unhealthy PGs |
| **Kubernetes issues** | Warning | Pods stuck in Pending/CrashLoopBackOff |
| **Restart loops** | Warning | VMs that keep restarting without errors |
| **Clock drift** | Warning | Node time drift affecting Ceph/HA |
| **Unusual patterns** | Varies | Any anomaly the LLM identifies as unusual for your setup |
### What Patrol Ignores (by design)
Patrol is **intentionally conservative** to avoid noise:
- Small baseline deviations ("CPU at 15% vs typical 10%")
- Low utilization that's "elevated" but fine (disk at 40%)
- Stopped VMs/containers that were intentionally stopped
- Brief spikes that resolve on their own
- Anything that doesn't require human action
> **Philosophy**: If a finding wouldn't be worth waking someone up at 3am, Patrol won't create it.
## Features
- **Interactive chat**: Ask questions about current cluster state and get AI-assisted troubleshooting.
- **Patrol**: Periodic background checks (default: every 6 hours) that generate findings. The interval is configurable down to a minimum of 15 minutes.
- **Alert analysis**: Optional token-efficient analysis when alerts fire.
- **Command execution**: When enabled, AI can run commands via connected agents.
- **Finding management**: Dismiss, resolve, or suppress findings to prevent recurrence.
- **Cost tracking**: Tracks token usage and supports monthly budget limits.
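
The Patrol interval rule above (6-hour default, 15-minute floor) can be sketched as follows. Whether Pulse clamps a too-short interval or rejects it is not stated here, so the clamping behavior and the `effective_interval` name are assumptions.

```python
from datetime import timedelta
from typing import Optional

DEFAULT_INTERVAL = timedelta(hours=6)   # documented default
MIN_INTERVAL = timedelta(minutes=15)    # documented lower bound


def effective_interval(requested: Optional[timedelta]) -> timedelta:
    """Pick the interval to use; clamping to the floor is an assumption."""
    if requested is None:
        return DEFAULT_INTERVAL
    return max(requested, MIN_INTERVAL)
```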
## Configuration
Configure in the UI: **Settings → AI**
AI settings are stored encrypted at rest in `ai.enc` under the Pulse config directory. The discovered findings and their history are stored in `ai_findings.enc` (or `ai_findings.json` if encryption is disabled). These files are located in `/etc/pulse` for systemd installs, or `/data` for Docker/Kubernetes.
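
When checking permissions or backing up these files, it can help to compute the expected location. The sketch below only encodes the paths listed above; the `install` keys and the `findings_path` helper are hypothetical names, not part of Pulse.

```python
from pathlib import PurePosixPath

# Config directories as documented above; the dictionary keys are illustrative.
CONFIG_DIRS = {
    "systemd": "/etc/pulse",
    "docker": "/data",
    "kubernetes": "/data",
}


def findings_path(install: str, encrypted: bool = True) -> PurePosixPath:
    """Return the expected findings file path for an install type."""
    name = "ai_findings.enc" if encrypted else "ai_findings.json"
    return PurePosixPath(CONFIG_DIRS[install]) / name
```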
### Supported Providers
- **Anthropic** (API key or OAuth)
- **OpenAI**
- **DeepSeek**
- **Google Gemini**
- **Ollama** (self-hosted, with tool/function calling support)
- **OpenAI-compatible base URL** (for providers that implement the OpenAI API shape)
### Models
Pulse uses model identifiers in the form: `provider:model-name`
You can set separate models for:
- Chat (`chat_model`)
- Patrol (`patrol_model`)
- Auto-fix remediation (`auto_fix_model`)
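
A minimal sketch of splitting a `provider:model-name` identifier is shown below. Splitting on the first colon only is an assumption, made so that model names that themselves contain a colon stay intact; the example identifier is hypothetical and Pulse's own parsing may differ.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    provider: str
    model: str


def parse_model_id(identifier: str) -> ModelRef:
    """Split 'provider:model-name' on the first colon (an assumed rule)."""
    provider, sep, model = identifier.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model-name', got {identifier!r}")
    return ModelRef(provider, model)


# Hypothetical identifier whose model name contains a colon of its own.
ref = parse_model_id("ollama:llama3:8b")
```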
### Testing
- Test provider connectivity: `POST /api/ai/test` and `POST /api/ai/test/{provider}`
- List available models: `GET /api/ai/models`
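
The endpoints above can be exercised from a script. The sketch below only builds the requests using Python's standard library; the base URL and the `ollama` path segment are placeholders, and authentication is omitted because it is not described here. Add whatever credentials your instance requires before sending.

```python
import urllib.request
from typing import Optional


def provider_test_request(base_url: str, provider: Optional[str] = None) -> urllib.request.Request:
    """Build POST /api/ai/test or POST /api/ai/test/{provider}."""
    path = "/api/ai/test" if provider is None else f"/api/ai/test/{provider}"
    return urllib.request.Request(base_url.rstrip("/") + path, method="POST")


def list_models_request(base_url: str) -> urllib.request.Request:
    """Build GET /api/ai/models."""
    return urllib.request.Request(base_url.rstrip("/") + "/api/ai/models", method="GET")


# Placeholder host; replace it with the address of your Pulse instance.
req = provider_test_request("http://pulse.example:7655/", "ollama")
```

Once authentication headers are attached, a request built this way can be sent with `urllib.request.urlopen(req)`.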
## Patrol Service (Pro Feature)
Patrol runs automated health checks on a configurable schedule (default: every 6 hours). It passes comprehensive infrastructure context to the LLM (see "Context Patrol Receives" above) and generates findings when issues are detected.
Pulse Pro users get full LLM-powered analysis. Free users still benefit from **Heuristic Patrol**, which uses local rule-based logic to detect common issues (offline nodes, disk exhaustion, etc.) without requiring an external AI provider. Free users also get full access to the AI Chat assistant with their own provider API key (BYOK, "bring your own key").
### Finding Severity
- **Critical**: Immediate attention required (service down, data at risk)
- **Warning**: Should be addressed soon (disk filling, backup stale)
Note: `info` and `watch` level findings are filtered out to reduce noise.
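
The filtering rule above can be pictured as a simple severity allowlist. The dictionary shape of a finding and the `visible_findings` helper are hypothetical; only the four severity names come from this page.

```python
from typing import Dict, List

# Severities that are surfaced; `info` and `watch` are dropped as noise.
REPORTED_SEVERITIES = {"critical", "warning"}


def visible_findings(findings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only findings whose severity is critical or warning."""
    return [f for f in findings if f["severity"].lower() in REPORTED_SEVERITIES]


# Hypothetical findings for illustration.
sample = [
    {"title": "Node offline", "severity": "critical"},
    {"title": "CPU slightly above baseline", "severity": "info"},
    {"title": "Backup stale", "severity": "warning"},
    {"title": "Disk trend worth watching", "severity": "watch"},
]
shown = visible_findings(sample)
```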
### Managing Findings
Findings can be managed via the UI or API:
- **Get help**: Chat with AI to troubleshoot the issue
- **Resolve**: Mark as fixed (finding will reappear if the issue resurfaces)
- **Dismiss**: Mark as expected behavior (creates a suppression rule so the finding does not recur)
Dismissed and resolved findings persist across Pulse restarts.
### AI-Assisted Remediation
When chatting with AI about a patrol finding, the AI can:
- Run diagnostic commands on connected agents
- Propose fixes with explanations
- Automatically resolve findings after successful remediation
## Safety Controls
Pulse includes settings that control how "active" AI features are:
- **Autonomous mode**: When enabled, AI may execute safe commands without approval.
- **Patrol auto-fix**: Allows patrol to attempt automatic remediation.
- **Alert-triggered analysis**: Limits AI to analyzing specific events when alerts occur.
If you enable execution features, ensure agent tokens and scopes are appropriately restricted.
## Troubleshooting
| Issue | Solution |
|-------|----------|
| AI not responding | Verify provider credentials in **Settings → AI** |
| No execution capability | Confirm at least one agent is connected |
| Findings not persisting | Check Pulse has write access to `ai_findings.enc` in the config directory |
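| Too many findings | Patrol is designed to be conservative, so this shouldn't happen. Dismiss findings that reflect expected behavior, and please report it if the noise persists |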