Commit graph

15 commits

Author SHA1 Message Date
rcourtman
95a0d7a6bd feat(backend): implement AI Patrol, Investigation, and system-wide refactors 2026-01-30 19:02:14 +00:00
rcourtman
27f1a11acb feat: add AI Intelligence system with investigation and forecasting
Major new AI capabilities for infrastructure monitoring:

Investigation System:
- Autonomous finding investigation with configurable autonomy levels
- Investigation orchestrator with rate limiting and guardrails
- Safety checks for read-only mode enforcement
- Chat-based investigation with approval workflows

Forecasting & Remediation:
- Trend forecasting for resource capacity planning
- Remediation engine for generating fix proposals
- Circuit breaker for AI operation protection

Unified Findings:
- Unified store bridging alerts and AI findings
- Correlation and root cause analysis
- Incident coordinator with metrics recording

New Frontend:
- AI Intelligence page with patrol controls
- Investigation drawer for finding details
- Unified findings panel with actions

Supporting Infrastructure:
- Learning store for user preference tracking
- Proxmox event ingestion and correlation
- Enhanced patrol with investigation triggers
2026-01-24 22:41:43 +00:00
rcourtman
3fdf753a5b Enhance devcontainer and CI workflows
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 22:29:15 +00:00
rcourtman
46a7b7d10a docs: Add prominent VERSION file warning to GEMINI.md
- Add critical release section at top of file
- Make VERSION file requirement impossible to miss
- Explain that version_guard job will fail if VERSION doesn't match
2025-12-29 10:11:38 +00:00
rcourtman
b65f42acb5 chore: AI patrol and baseline improvements
- Enhanced patrol finding display in Alerts.tsx
- Improved baseline store with better error handling
- Added clean thinking test coverage
- Updated patrol logic for better finding management
2025-12-22 23:12:11 +00:00
rcourtman
185f1ef682 Improve AI test coverage
- baseline/store_test.go: Add tests for CheckResourceAnomalies, formatAnomalyDescription, formatRatio, GetAllAnomalies, floatToStr (67.9% -> 92.2%)
- memory/incidents_test.go: Add tests for RecordAlertUnacknowledged, RecordRunbook, ListIncidentsByResource, FormatForAlert, FormatForResource, FormatForPatrol (66.8% -> 81.1%)
- intelligence_test.go: Add tests for SetStateProvider, FormatGlobalContext, RecordLearning, severityOrder, CheckBaselinesForResource with baselines (61.4% -> 63.1%)
2025-12-21 20:22:47 +00:00
rcourtman
b90076f086 feat(ai): implement metric-specific anomaly thresholds
Smarter anomaly detection to reduce false positives:

**Learning Window:** 7 days → 14 days
- Captures weekly patterns (weekday vs weekend)

**Metric-Specific Thresholds:**

CPU:
- Only report if usage >70% AND >2x baseline
- Low CPU variance (5% vs 10%) is not actionable

Memory:
- Report if >80% OR (>1.5x baseline AND >60%)
- Memory is more stable, lower threshold makes sense

Disk:
- Report if >85% usage OR +15 percentage points growth
- Disk problems are critical, use absolute thresholds

Other metrics:
- Use 2x threshold as default

This dramatically reduces 'noise' anomalies while catching
actual problems that need operator attention.
2025-12-21 17:31:30 +00:00
rcourtman
3c4999ea48 fix(ai): raise anomaly threshold to 2x, filter 'Ran diagnostic' noise
More aggressive noise filtering:

1. Anomaly threshold raised from 1.5x to 2x
   - 1.5x is too borderline to be actionable
   - Now requires genuinely significant deviation

2. Filter out 'Ran diagnostic' and 'Executed command' fallback items
   - These are generic summaries that provide no value
   - Only show remediations with specific, meaningful descriptions

Goal: If something shows in AI Intelligence, it should demand attention.
2025-12-21 17:19:32 +00:00
rcourtman
4b25b84678 fix(ai): filter out noise from anomalies and status changes
Critical changes to surface only actionable insights:

1. Anomalies now require at least 50% deviation from baseline
   - '1.0x baseline' values filtered out (statistically significant but not actionable)
   - Must be >1.5x above OR <0.5x below baseline to report

2. Status changes filter out startup noise
   - 'unknown → running' is just system starting, not a real state change
   - Backups removed from main list (they have dedicated section)

3. Only show genuinely interesting changes:
   - Config changes, migrations, restarts, deletions
   - Things that require operator attention

This massively reduces noise while keeping high-signal alerts.
2025-12-21 17:15:44 +00:00
rcourtman
b752773924 test(baseline): add tests for trend prediction
Add comprehensive tests for CalculateTrend function:
- TestCalculateTrend_InsufficientData: <5 samples returns nil
- TestCalculateTrend_IncreasingTrend: detects critical/warning trends
- TestCalculateTrend_DecreasingTrend: correctly identifies declining usage
- TestCalculateTrend_StableTrend: stable patterns return DaysToFull=-1
- TestFormatDays: human-readable time formatting
2025-12-21 11:31:58 +00:00
rcourtman
6bb46eeb34 feat(ai): enhance intelligence status and add trend prediction
AIStatusIndicator:
- Now shows BOTH patrol findings AND baseline anomalies
- Displays even when only anomaly detection is active (no patrol)
- Badge count includes both findings + anomalies
- Tooltip provides detailed breakdown by severity

Trend Prediction (backend):
- Add TrendPrediction struct for resource exhaustion forecasting
- CalculateTrend() uses linear regression on sample history
- Predicts days until resource is full (or if declining/stable)
- Severity: critical (<7 days), warning (<30 days), info (>30 days)
- Human-readable descriptions like 'full in ~2 weeks (+0.5% per day)'

This creates a more cohesive intelligence experience where anomaly
detection works independently of the pro/patrol features, making
value visible immediately to all users.
2025-12-21 11:29:44 +00:00
rcourtman
d9f1f7accd feat(ai): add real-time anomaly detection endpoint
Add /api/ai/intelligence/anomalies endpoint that compares live metrics
against learned baselines to surface deviations - all deterministic
(no LLM required).

Backend:
- Add AnomalyReport struct with severity classification
- Add CheckResourceAnomalies method to baseline store
- Add HandleGetAnomalies API handler
- Add GetStateProvider getter to AI service

Frontend:
- Add AnomalyReport and AnomaliesResponse types
- Add getAnomalies API function
- Add AnomalySeverity type

This is the first step toward surfacing deterministic intelligence
directly in the UI without requiring LLM interaction.
2025-12-21 10:52:54 +00:00
rcourtman
b79d04f734 Add comprehensive AI test coverage
- Add integration tests for Ollama provider (17 tests against real API)
- Add unit tests for baseline, correlation, patterns, memory, knowledge, cost packages
- Add context formatter and builder tests
- Add factory tests for provider initialization
- Add Makefile targets: test-integration, test-all
- Clean up test theatre (removed struct field tests)

Integration tests require Ollama at OLLAMA_URL (default: 192.168.0.124:11434)
Run with: make test-integration
2025-12-16 12:33:06 +00:00
rcourtman
6f0379f879 feat(api): Add AI intelligence API endpoints
Expose learned AI intelligence data via REST API:

New endpoints:
- GET /api/ai/intelligence/patterns - Detected failure patterns
- GET /api/ai/intelligence/predictions - Failure predictions
- GET /api/ai/intelligence/correlations - Resource correlations
- GET /api/ai/intelligence/changes - Recent infrastructure changes
- GET /api/ai/intelligence/baselines - Learned baselines

All endpoints support ?resource_id filter for per-resource queries.
Changes endpoint supports ?hours filter (default: 24).

Backend additions:
- ai_intelligence_handlers.go - Handler implementations
- baseline.Store.GetAllBaselines() - Flat baseline export
- patrol.GetChangeDetector() - Access change detector

This enables frontend to display:
- 'OOM expected in 3 days based on pattern'
- 'When storage-1 is full, database VM restarts'
- 'VM memory baseline: 60-75%'

All tests passing.
2025-12-12 14:49:46 +00:00
rcourtman
5a77fab633 feat(ai): Add baseline learning and anomaly detection (Phase 2)
Phase 2 of Pulse AI differentiation:

- Create internal/ai/baseline package for learned baselines
- Implement statistical baseline learning with mean, stddev, percentiles
- Add z-score based anomaly detection with severity classification
  (low, medium, high, critical based on standard deviations)
- Integrate baseline provider into context builder
- Wire baseline store into patrol service with adapters
- Add anomaly enrichment to resource contexts

Key features:
- Learn computes baseline from historical metric data points
- IsAnomaly and CheckAnomaly detect deviations from normal
- Persists baselines to disk as JSON for durability
- Formatted anomaly descriptions for AI consumption
  Example: 'Memory is high above normal (85.2% vs typical 42.1% ± 8.3%)'

The baseline store needs to be initialized and triggered to learn
from metrics history. Next step is adding the learning loop.

All tests passing.
2025-12-12 11:26:31 +00:00