Commit graph

4826 commits

Author SHA1 Message Date
rcourtman
f67e7b3e97 fix: clean up debug logging and fix flaky encryption test
1. Fixed TestNewConfigPersistenceFailsWhenEncryptedDataPresentWithoutKey
   - Test was picking up real encryption key from /etc/pulse during migration
   - Now temporarily moves system key during test for proper isolation
   - Uses t.Cleanup to ensure key is restored even on failure

2. Cleaned up console.log statements in production code
   - Dashboard.tsx: replaced console.log with logger.debug for metadata events
   - CompleteStep.tsx: removed verbose agent detection debug logs

These changes reduce log noise in production while maintaining debug
capability in development mode.
2025-12-22 14:35:48 +00:00
rcourtman
28ac86c8ab fix: reduce WebSocket reconnection log noise in host agent
Addresses #866 - agents were logging 'WebSocket connection failed' warnings
even during normal reconnection scenarios (server restart, network blip, etc).

Changes:
- Normal close errors (1000, 1001, connection reset) now log at Debug level
- Only log Warning after 3+ consecutive failures
- Changed 'Connecting to Pulse' from Info to Debug to reduce noise
- Successful connections still log at Info level

The WebSocket is only used for AI command execution, not metrics, so
transient disconnections don't affect monitoring functionality.
2025-12-22 14:11:23 +00:00
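Note on 28ac86c8ab: the log-level rules above can be sketched roughly as below. This is a minimal illustration and not the agent's actual code; `LogLevel`, `levelForFailure`, and `warnAfter` are hypothetical names, and treating a normal close as Debug regardless of the failure count is an assumption.

```go
package main

import "fmt"

// LogLevel is a hypothetical stand-in for the agent's logger levels.
type LogLevel string

const (
	Debug   LogLevel = "debug"
	Warning LogLevel = "warning"
)

// warnAfter is the number of consecutive failures at which logging escalates.
const warnAfter = 3

// levelForFailure picks a log level for a failed connection attempt.
// normalClose is true for close codes 1000/1001 or a connection reset.
func levelForFailure(normalClose bool, consecutiveFailures int) LogLevel {
	if normalClose {
		// Assumption: normal closes never escalate.
		return Debug
	}
	if consecutiveFailures >= warnAfter {
		return Warning
	}
	return Debug
}

func main() {
	fmt.Println(levelForFailure(true, 5))
	fmt.Println(levelForFailure(false, 1))
	fmt.Println(levelForFailure(false, 3))
}
```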
rcourtman
59a4843f20 fix: persist finding dismissal state across restarts
User feedback fields (DismissedReason, UserNote, TimesRaised, Suppressed, Source)
were not being saved to disk, causing 'expected behavior' dismissals to be lost
after Pulse restarted.

- Add missing fields to AIFindingRecord in persistence.go
- Update FindingsPersistenceAdapter to save/load these fields
- Add comprehensive tests for dismissal persistence round-trip

Fixes issue where Frigate storage warning kept reappearing despite being
marked as expected behavior.
2025-12-22 11:18:43 +00:00
rcourtman
2c95961feb fix: Add missing guest filtering props to ThresholdsTable tests 2025-12-22 10:26:53 +00:00
rcourtman
0837c46f5a test: Add unit tests for guest tag filtering 2025-12-22 10:24:39 +00:00
rcourtman
71d0401c80 feat: Add guest filtering by tag and name prefix via Alert Configuration. Resolves #863 2025-12-22 10:03:12 +00:00
rcourtman
c9fc827f4c fix: Prevent buffering and log actionable error for host agent 403s. Related to discussion #845 2025-12-22 09:51:27 +00:00
rcourtman
fc300de328 fix: Prevent ignored container inputs from removing trailing newlines. Related to #865 2025-12-22 09:48:40 +00:00
rcourtman
a0580db62d fix: exclude watch from patrol status summary
Since watch/info findings are filtered from the UI and never shown
to users, don't include them in the patrol run status summary.
This makes the summary consistent with what users actually see.
2025-12-21 23:31:21 +00:00
rcourtman
78c3434061 fix: include VMID in AI context to prevent incorrect references
The LLM was confusing VMIDs because they weren't included in the
context. Now the formatted context shows:

  ### Container: ollama (VMID 200) on minipc

This prevents the AI from referencing the wrong VMID when generating
findings and recommendations.
2025-12-21 23:13:47 +00:00
rcourtman
8ee315eee4 perf: skip initial patrol if one ran recently
When the service restarts, it now checks if a patrol ran within the
last hour. If so, it skips the initial patrol to avoid wasting API
tokens during development/maintenance when the service is restarted
frequently.

The scheduled patrol runs (every 6 hours) are not affected.
2025-12-21 23:03:41 +00:00
rcourtman
9c58bfa127 perf: reduce MetricSamples from 100 to 24 points
100 samples was causing 326k+ input tokens which is expensive.
24 samples (hourly resolution) still provides good pattern visibility
while significantly reducing token cost.

Estimated reduction: ~75% fewer metric tokens.
2025-12-21 22:56:19 +00:00
rcourtman
c85814345d feat: surface AI patrol errors as findings
When AI patrol fails due to API issues like insufficient balance, invalid
API key, or rate limiting, we now create a finding that appears in the
AI Insights tab. This makes the issue visible to users rather than hidden
in logs.

The finding includes:
- Clear description of the issue (e.g., 'Insufficient API credits')
- Recommendation for how to fix it
- Evidence showing the actual error message
2025-12-21 22:45:29 +00:00
rcourtman
901f04a7b2 fix: don't show 'All healthy' when patrol run had errors
When a patrol run encounters errors (e.g., LLM call failed), don't
display 'All healthy' in the summary as that's misleading - the
analysis didn't complete properly.

Now shows 'Analysis incomplete (N errors)' instead, which correctly
explains why the status badge shows red/error.
2025-12-21 22:35:39 +00:00
rcourtman
c15f260280 feat: increase MetricSamples to 100 points (~15 min resolution)
Modern LLMs have 100k+ token contexts. 100 samples over 24h gives
~15 minute resolution while adding minimal token overhead.

This lets the LLM see fine-grained patterns, short spikes, and
accurately distinguish anomalies from normal behavior.
2025-12-21 22:25:54 +00:00
rcourtman
d23f1c78de fix: increase MetricSamples to 24 points for hourly resolution
12 samples was too coarse (2-hour intervals could miss spikes).
24 samples gives ~hourly resolution while still being compact.
2025-12-21 22:24:02 +00:00
rcourtman
5877ce00c3 fix: use 24h window for MetricSamples (matches in-memory retention)
The in-memory MetricsHistory only retains 24 hours of data, not 7 days.
Changed computeGuestMetricSamples to use trendWindow24h instead of
trendWindow7d, and reduced sample count from 24 to 12 points.

This ensures the LLM actually receives metric samples in the context,
which wasn't happening before because the 7-day query returned empty data.
2025-12-21 22:19:40 +00:00
rcourtman
f6b1414ed6 debug: add logging to verify MetricSamples population for LLM context 2025-12-21 22:14:54 +00:00
rcourtman
b1baab4c63 feat: add 'Alert' badge to findings triggered by alert-triggered analysis
Shows a purple 'Alert' badge on findings that were discovered through
alert-triggered analysis rather than scheduled patrol runs. This gives
users visibility into how findings were discovered without cluttering
the patrol run history table.
2025-12-21 22:04:32 +00:00
rcourtman
4e893117cd fix: correct patrol interval logging
The log was showing QuickCheckInterval (deprecated, always 0) instead of
the actual Interval field. This caused confusing 'interval: 0' logs.
2025-12-21 21:52:57 +00:00
rcourtman
07c5880b0a fix: AI settings persistence and UI improvements
Bug Fixes:
- Fix boolean fields with 'omitempty' not persisting false values
  - AlertTriggeredAnalysis, PatrolAnalyzeNodes/Guests/Docker/Storage
  - omitempty causes Go to skip false (zero value) when marshaling JSON
  - On reload, NewDefaultAIConfig() sets true, and missing field stays true

- Fix model dropdown losing selection after save (SolidJS reactivity issue)
  - Added explicit 'selected' attribute to option elements
  - Ensures browser maintains selection with optgroups during re-renders

Improvements:
- Change patrol type label from 'Quick' to 'Patrol' in history table
- Add chat_model and patrol_model to AI settings update log
- Add alert_triggered_analysis to AI config load log for debugging
2025-12-21 21:48:09 +00:00
rcourtman
2928fad643 feat(ai): pass raw metric samples to LLM for pattern interpretation
Instead of relying on pre-computed trend heuristics (which can be misleading
for edge cases like step changes vs continuous growth), we now pass downsampled
raw data points to the LLM so it can interpret patterns directly.

Changes:
- Add MetricSamples field to ResourceContext
- Add DownsampleMetrics() to reduce data points for LLM consumption
- Add formatMetricSamples() to format data compactly (e.g., 'Disk: 26→26→31%')
- Add computeGuestMetricSamples() to gather 7-day sampled history
- Populate MetricSamples for VMs and containers during context build
- Add History section to formatted context output

The LLM now sees actual patterns like 'stable for 6 days then jumped' rather
than just '45.8%/day growth rate' - allowing for much more nuanced interpretation.

This approach:
- Leverages LLM's pattern recognition instead of hard-coded heuristics
- Provides 7 days of data (~24 samples) for context on normal behavior
- Uses minimal tokens due to compact formatting with deduplication
- Is more future-proof as LLMs improve

Example output:
  **History (7d sampled, oldest→newest)**: Disk: 26→26→26→26→26→31%

Refs: Frigate disk usage false positive investigation
2025-12-21 21:09:24 +00:00
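Note on 2928fad643: a rough sketch of the downsample-then-format idea. The function names echo the commit message, but the signatures and the even-stride sampling are assumptions, and the deduplication step the commit mentions is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// downsample picks n evenly spaced values, always keeping the first and last.
func downsample(values []float64, n int) []float64 {
	if n <= 0 || len(values) == 0 {
		return nil
	}
	if len(values) <= n {
		return append([]float64(nil), values...)
	}
	if n == 1 {
		return []float64{values[len(values)-1]}
	}
	out := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, values[i*(len(values)-1)/(n-1)])
	}
	return out
}

// formatSamples renders values oldest to newest, e.g. "Disk: 26→26→31%".
func formatSamples(label string, values []float64) string {
	if len(values) == 0 {
		return ""
	}
	parts := make([]string, len(values))
	for i, v := range values {
		parts[i] = fmt.Sprintf("%.0f", v)
	}
	return label + ": " + strings.Join(parts, "→") + "%"
}

func main() {
	history := []float64{26, 26, 26, 26, 26, 26, 31}
	fmt.Println(formatSamples("Disk", downsample(history, 3)))
}
```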
rcourtman
e604e4bb8a Add more AI test coverage
- findings_test.go: Add edge case tests for Acknowledge, Dismiss, SetUserNote, Suppress, Resolve, DeleteSuppressionRule, GetSummary, GetDismissedForContext (+20 tests)
- intelligence_test.go: Add tests for calculateResourceHealth with anomalies/predictions/notes, FormatContext with various subsystems, generateHealthPrediction, GetSummary with patterns/learning (+17 tests)

Coverage improvements:
- internal/ai: 63.1% -> 64.4%
- Overall AI module coverage now averages >80%
2025-12-21 20:31:24 +00:00
rcourtman
185f1ef682 Improve AI test coverage
- baseline/store_test.go: Add tests for CheckResourceAnomalies, formatAnomalyDescription, formatRatio, GetAllAnomalies, floatToStr (67.9% -> 92.2%)
- memory/incidents_test.go: Add tests for RecordAlertUnacknowledged, RecordRunbook, ListIncidentsByResource, FormatForAlert, FormatForResource, FormatForPatrol (66.8% -> 81.1%)
- intelligence_test.go: Add tests for SetStateProvider, FormatGlobalContext, RecordLearning, severityOrder, CheckBaselinesForResource with baselines (61.4% -> 63.1%)
2025-12-21 20:22:47 +00:00
rcourtman
92ccb35e73 test: update GetAllFindings test to match filtering behavior
GetAllFindings now filters out info/watch severity findings,
only returning critical and warning. Update test expectation
from 3 findings to 2.
2025-12-21 19:20:27 +00:00
rcourtman
0d72690fc7 fix: remove unused runbook UI code to fix TypeScript errors
- Remove unused correlations state and constants from AIOverviewTable
- Remove unused runbook-related imports, state, and functions from Alerts
- Add type annotation to Set() to fix type error
- Removes dead code left over from runbook UI removal
2025-12-21 19:12:20 +00:00
rcourtman
8e6dc18d6f security: allow rm on /var/tmp and /tmp with approval
Updated command policy to be more nuanced:

BLOCKED (hard block, never allowed):
- rm -rf / (root)
- rm -rf /* (root wildcard)
- rm -rf /home, /etc, /usr, /var/lib, /boot, /root, /bin, /sbin, /lib, /opt

REQUIRE APPROVAL (user must click 'Run'):
- rm -rf /var/tmp/* (Proxmox vzdump temp files)
- rm -rf /tmp/*

This allows AI to suggest cleaning up vzdump temp files while still
protecting against destructive operations on critical paths.
2025-12-21 18:53:08 +00:00
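Note on 8e6dc18d6f: the two tiers above could be modeled as a path classifier along these lines. This is only a sketch under assumptions: `classifyRmTarget` and the `Decision` values are hypothetical, it inspects a single target path rather than a full command line, and paths the commit does not mention are reported as unspecified rather than guessed.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

type Decision string

const (
	Blocked         Decision = "blocked"
	RequireApproval Decision = "require-approval"
	Unspecified     Decision = "unspecified" // not covered by the commit message
)

var protectedRoots = []string{"/home", "/etc", "/usr", "/var/lib", "/boot",
	"/root", "/bin", "/sbin", "/lib", "/opt"}

var approvalRoots = []string{"/var/tmp", "/tmp"}

// under reports whether p is root itself or a path inside root.
func under(p, root string) bool {
	return p == root || strings.HasPrefix(p, root+"/")
}

// classifyRmTarget classifies the target of an `rm -rf <target>` command.
func classifyRmTarget(target string) Decision {
	clean := path.Clean(strings.TrimSuffix(target, "*"))
	if clean == "/" {
		return Blocked
	}
	for _, root := range protectedRoots {
		if under(clean, root) {
			return Blocked
		}
	}
	for _, root := range approvalRoots {
		if under(clean, root) {
			return RequireApproval
		}
	}
	return Unspecified
}

func main() {
	for _, t := range []string{"/", "/*", "/etc", "/var/tmp/*", "/tmp/*"} {
		fmt.Println(t, classifyRmTarget(t))
	}
}
```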
rcourtman
1dd1867e9d refactor: only show critical/warning patrol findings
Filter out 'watch' and 'info' severity findings from the API response.
These lower-severity findings were mostly noise:
- 'watch': CPU is 35% instead of 11% (who cares)
- 'info': Stopped container exists (knew that)

Now only showing actionable findings:
- critical: Something is broken NOW
- warning: Something needs attention soon

Users prefer silence to noise.
2025-12-21 18:34:51 +00:00
rcourtman
c3720fe92f refactor: remove runbook UI from frontend
Removed:
- Runbook button from finding action buttons
- Runbook Execution panel
- Fix Receipts panel (already removed)

Just showing 'Get Help' and 'I Fixed It' buttons now.
2025-12-21 18:08:47 +00:00
rcourtman
0e2f58900b refactor: remove Fix Receipts UI section
Fix Receipts was showing 'No fixes logged' most of the time since:
- Runbooks were removed
- Remediation logging was inconsistent

Just adds visual clutter without value. Removed ~100 lines of UI code.
2025-12-21 18:02:35 +00:00
rcourtman
586bf96e03 refactor: remove runbooks feature entirely
Runbooks were a half-built feature that provided no value:
- Only 3 runbooks existed
- AI dynamic remediation already covers the same ground
- Added UI complexity without benefit

Removed:
- runbooks.go and runbooks_test.go
- Handler functions in ai_handlers.go
- Routes in router.go
- Test cases in ai_handlers_test.go
- Auto-fix call in patrol.go

Kept (dead code but harmless):
- Frontend types/API calls (will 404)
- RecordIncidentRunbook function (unused)

Less code = easier to maintain.
2025-12-21 17:48:07 +00:00
rcourtman
cda930901b feat(ai): make patrol prompt stricter to reduce noise
Updated LLM prompt with explicit guidance on what NOT to report:
- Small baseline deviations (7% vs 4% is normal variance)
- Low utilization (under 50% CPU or 60% memory is fine)
- Stopped containers that aren't autostart
- 'Elevated' metrics still well under limits

Severity guidelines made more specific:
- CRITICAL: disk >95%, service down, data loss
- WARNING: disk >85%, memory >90%, failures
- WATCH: Only for trends projected to hit critical in <7 days
- INFO: Context/observations

Key message to LLM: 'Users prefer silence to noise'
Only flag things that require operator action.
2025-12-21 17:35:36 +00:00
rcourtman
b90076f086 feat(ai): implement metric-specific anomaly thresholds
Smarter anomaly detection to reduce false positives:

**Learning Window:** 7 days → 14 days
- Captures weekly patterns (weekday vs weekend)

**Metric-Specific Thresholds:**

CPU:
- Only report if usage >70% AND >2x baseline
- Low CPU variance (5% vs 10%) is not actionable

Memory:
- Report if >80% OR (>1.5x baseline AND >60%)
- Memory is more stable, lower threshold makes sense

Disk:
- Report if >85% usage OR +15 percentage points growth
- Disk problems are critical, use absolute thresholds

Other metrics:
- Use 2x threshold as default

This dramatically reduces 'noise' anomalies while catching
actual problems that need operator attention.
2025-12-21 17:31:30 +00:00
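Note on b90076f086: the per-metric rules above translate fairly directly into a predicate. A minimal sketch, assuming `current` and `baseline` are usage percentages; `shouldReport` is a hypothetical name, and measuring disk growth against the learned baseline is an assumption.

```go
package main

import "fmt"

// shouldReport applies the per-metric rules listed in the commit message.
// current and baseline are usage percentages (0-100).
func shouldReport(metric string, current, baseline float64) bool {
	switch metric {
	case "cpu":
		return current > 70 && current > 2*baseline
	case "memory":
		return current > 80 || (current > 1.5*baseline && current > 60)
	case "disk":
		// Assumption: "growth" is measured against the learned baseline.
		return current > 85 || current-baseline >= 15
	default:
		return current > 2*baseline
	}
}

func main() {
	fmt.Println(shouldReport("cpu", 10, 5))     // low CPU variance
	fmt.Println(shouldReport("cpu", 75, 30))    // high and more than 2x baseline
	fmt.Println(shouldReport("memory", 65, 40)) // more than 1.5x baseline and above 60%
	fmt.Println(shouldReport("disk", 50, 30))   // +20 percentage points
}
```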
rcourtman
763d04821c fix(ui): remove AI Intelligence Summary - patrol findings are sufficient
The AI Intelligence Summary was adding noise rather than value:
- Predictions duplicated patrol findings
- Correlations were not actionable
- 'Fixed' items were vague diagnostics
- Status changes were startup noise

The real value is in the patrol findings section which shows:
- Actual issues found (critical/warning/watch/info)
- Actionable recommendations
- Suppression rules

Keeping the patrol findings, removing the redundant summary.
2025-12-21 17:23:25 +00:00
rcourtman
3c4999ea48 fix(ai): raise anomaly threshold to 2x, filter 'Ran diagnostic' noise
More aggressive noise filtering:

1. Anomaly threshold raised from 1.5x to 2x
   - 1.5x is too borderline to be actionable
   - Now requires genuinely significant deviation

2. Filter out 'Ran diagnostic' and 'Executed command' fallback items
   - These are generic summaries that provide no value
   - Only show remediations with specific, meaningful descriptions

Goal: If something shows in AI Intelligence, it should demand attention.
2025-12-21 17:19:32 +00:00
rcourtman
4b25b84678 fix(ai): filter out noise from anomalies and status changes
Critical changes to surface only actionable insights:

1. Anomalies now require at least 50% deviation from baseline
   - '1.0x baseline' values filtered out (statistically significant but not actionable)
   - Must be >1.5x above OR <0.5x below baseline to report

2. Status changes filter out startup noise
   - 'unknown → running' is just system starting, not a real state change
   - Backups removed from main list (they have dedicated section)

3. Only show genuinely interesting changes:
   - Config changes, migrations, restarts, deletions
   - Things that require operator attention

This massively reduces noise while keeping high-signal alerts.
2025-12-21 17:15:44 +00:00
rcourtman
e8f18ff0bb fix(ui): hide correlations - they're not actionable yet
Correlations currently show 'A and B alert together' which isn't useful:
- Bidirectional correlations (A→B AND B→A) are just coincidence
- 'experiences alert' is too vague to be actionable
- No root cause identification - just shows correlated things correlate

Hidden until we can properly identify:
- Root cause chains (A CAUSES B, not just 'A and B happen together')
- Specific trigger types (what kind of alert?)
- Direction of causality

Other improvements:
- Stopped VMs/containers filtered from anomaly detection
- Lower noise, more signal
2025-12-21 12:46:57 +00:00
rcourtman
fdacb60969 fix(ai): filter out noise from AI intelligence display
Critical fixes to show only actionable insights:

1. Skip stopped VMs/containers from anomaly detection
   - '0.0x baseline' for stopped resources is expected, not an anomaly
   - Only check anomalies for status='running'

2. Filter correlations by confidence (>=70%)
   - Low confidence correlations are likely coincidental
   - Only show high-confidence, actionable dependencies

This reduces noise and surfaces genuinely useful intelligence.
2025-12-21 12:41:27 +00:00
rcourtman
c94b6c1904 fix(ui): use Promise.allSettled for resilient API loading
Changed AIOverviewTable to use Promise.allSettled instead of
Promise.all so that one failing endpoint (e.g., anomalies 404)
doesn't break the entire component.

Each API result now has a fallback for failed requests, allowing
the table to gracefully degrade when endpoints are unavailable.
2025-12-21 12:31:09 +00:00
rcourtman
f8e42990b7 fix(ui): make anomalies fetch resilient to failures
Separate anomalies API call from Promise.all so that a failure
in the anomalies endpoint doesn't break the entire AI Overview.

This fixes 'Failed to load AI overview data' error when the
anomalies endpoint isn't available (e.g., patrol not started).
2025-12-21 12:27:53 +00:00
rcourtman
5931d240df feat(ui): reduce AI Intelligence table length with 'Show more' buttons
Added collapsible sections to prevent overwhelming list:
- Dependencies limited to top 5 (sorted by confidence)
- Actions limited to top 5
- Changes limited to top 5
- 'Show more' buttons appear at bottom when items are hidden
- Clicking expands to show all items in that category

This addresses user feedback about excessive scrolling when
there are many dependency correlations or remediation actions.
2025-12-21 12:25:43 +00:00
rcourtman
ad61f1809b feat(ui): show anomalies in AI Intelligence Summary table
Adds real-time anomaly detection results to the AI Overview Table:
- Anomalies appear at TOP of list (before predictions) since they're real-time
- Severity-based color coding (critical=red, high=orange, medium=amber, low=blue)
- Shows resource name, metric, and deviation ratio (e.g., 'CPU at 2.5x baseline')
- Subtitle shows current vs baseline values
- Timestamp shows 'Now' since anomalies are current state

This integrates the FREE anomaly detection feature directly alongside
the Pro patrol insights, providing immediate value to all users.
2025-12-21 11:50:54 +00:00
rcourtman
53cc9ee5a9 feat(ui): add learning status hook and enhance AI indicator visibility
New useLearningStatus hook:
- Polls /api/ai/intelligence/learning every 60 seconds
- Provides resourceCount(), metricCount(), learningState()
- Convenience accessors: isActive(), isLearning(), isWaiting()

Enhanced AIStatusIndicator:
- Now shows when ANY baselines exist (not just when Patrol enabled)
- Tooltip shows 'X resources baselined' for transparency
- Healthy state shows '45 resources baselined'
- Works even without Pro license since baselines are FREE

This makes the AI presence visible from the moment Pulse starts
learning, providing immediate value feedback to all users.
2025-12-21 11:45:55 +00:00
rcourtman
9aa266a615 feat(ai): make anomaly detection FREE and add learning status endpoint
Free Features (no license required):
- Anomaly detection - removed license gating, purely statistical analysis
- Learning status endpoint - GET /api/ai/intelligence/learning

Learning Status Response:
- resources_baselined: count of resources with learned baselines
- total_metrics: total metric baselines (cpu + memory + disk)
- metric_breakdown: {cpu: X, memory: Y, disk: Z}
- status: 'waiting' | 'learning' | 'active'
- message: human-readable description

This makes the AI intelligence features visible to all users,
encouraging upgrades for the full LLM-powered patrol experience.
2025-12-21 11:36:54 +00:00
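Note on 9aa266a615: the response fields listed above map onto a struct like the following. The JSON keys come from the commit message; the Go type name, field types, and sample values are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LearningStatus is a hypothetical shape for the learning status response.
type LearningStatus struct {
	ResourcesBaselined int            `json:"resources_baselined"`
	TotalMetrics       int            `json:"total_metrics"`
	MetricBreakdown    map[string]int `json:"metric_breakdown"`
	Status             string         `json:"status"` // "waiting", "learning", or "active"
	Message            string         `json:"message"`
}

// encode renders the status as a JSON string.
func encode(s LearningStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(LearningStatus{
		ResourcesBaselined: 45,
		TotalMetrics:       135,
		MetricBreakdown:    map[string]int{"cpu": 45, "memory": 45, "disk": 45},
		Status:             "active",
		Message:            "baselines ready",
	}))
}
```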
rcourtman
b752773924 test(baseline): add tests for trend prediction
Add comprehensive tests for CalculateTrend function:
- TestCalculateTrend_InsufficientData: <5 samples returns nil
- TestCalculateTrend_IncreasingTrend: detects critical/warning trends
- TestCalculateTrend_DecreasingTrend: correctly identifies declining usage
- TestCalculateTrend_StableTrend: stable patterns return DaysToFull=-1
- TestFormatDays: human-readable time formatting
2025-12-21 11:31:58 +00:00
rcourtman
6bb46eeb34 feat(ai): enhance intelligence status and add trend prediction
AIStatusIndicator:
- Now shows BOTH patrol findings AND baseline anomalies
- Displays even when only anomaly detection is active (no patrol)
- Badge count includes both findings + anomalies
- Tooltip provides detailed breakdown by severity

Trend Prediction (backend):
- Add TrendPrediction struct for resource exhaustion forecasting
- CalculateTrend() uses linear regression on sample history
- Predicts days until resource is full (or if declining/stable)
- Severity: critical (<7 days), warning (<30 days), info (>30 days)
- Human-readable descriptions like 'full in ~2 weeks (+0.5% per day)'

This creates a more cohesive intelligence experience where anomaly
detection works independently of the pro/patrol features, making
value visible immediately to all users.
2025-12-21 11:29:44 +00:00
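Note on 6bb46eeb34: a least-squares sketch of the days-until-full forecast. The minimum of 5 samples and the `DaysToFull = -1` convention come from the test commit b752773924; everything else (type names, time measured in days, treating any non-positive slope as not filling, and classing exactly 30 days as info) is an assumption.

```go
package main

import "fmt"

// Sample is one usage reading: Day is time in days, Value is percent used.
type Sample struct {
	Day   float64
	Value float64
}

// Trend is a hypothetical result type for the forecast.
type Trend struct {
	SlopePerDay float64
	DaysToFull  float64 // -1 when usage is flat or declining
	Severity    string
}

// calculateTrend fits a line through the samples and projects it to 100%.
func calculateTrend(samples []Sample) *Trend {
	if len(samples) < 5 {
		return nil
	}
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for _, s := range samples {
		sumX += s.Day
		sumY += s.Value
		sumXY += s.Day * s.Value
		sumXX += s.Day * s.Day
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return nil // all samples share one timestamp
	}
	slope := (n*sumXY - sumX*sumY) / denom
	t := &Trend{SlopePerDay: slope, DaysToFull: -1, Severity: "info"}
	if slope <= 0 {
		return t
	}
	t.DaysToFull = (100 - samples[len(samples)-1].Value) / slope
	switch {
	case t.DaysToFull < 7:
		t.Severity = "critical"
	case t.DaysToFull < 30:
		t.Severity = "warning"
	}
	return t
}

func main() {
	s := []Sample{{0, 80}, {1, 82}, {2, 84}, {3, 86}, {4, 88}}
	t := calculateTrend(s)
	fmt.Printf("%+.1f%%/day, full in %.0f days (%s)\n", t.SlopePerDay, t.DaysToFull, t.Severity)
}
```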
rcourtman
ee15a5626d feat(ui): wire memory and disk anomaly indicators
Complete the anomaly indicator integration for all three metrics:
- CPU: EnhancedCPUBar (already done)
- Memory: StackedMemoryBar (new)
- Disk: StackedDiskBar (new)

All three metric bars now show a pulsing indicator (e.g., '2.5x↑')
when the current value is significantly above the learned baseline.

Severity colors:
- Critical (>4σ): red
- High (3-4σ): orange
- Medium (2.5-3σ): yellow
- Low (2-2.5σ): blue

This is 100% deterministic - no LLM involved. The indicators appear
automatically based on statistical deviation from learned baselines.
2025-12-21 11:24:42 +00:00
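Note on ee15a5626d: the severity bands above can be read as a mapping from a z-score to a label. A minimal sketch; `zScore` and `severityForSigma` are hypothetical names, and how the shared band edges (2.5σ, 3σ, 4σ) are assigned is an assumption since the commit does not say.

```go
package main

import "fmt"

// zScore returns how many standard deviations current sits above the mean.
// A zero or negative stddev yields 0, since no deviation can be computed.
func zScore(current, mean, stddev float64) float64 {
	if stddev <= 0 {
		return 0
	}
	return (current - mean) / stddev
}

// severityForSigma maps a z-score to the bands in the commit message.
// An empty string means no indicator is shown.
func severityForSigma(z float64) string {
	switch {
	case z > 4:
		return "critical"
	case z >= 3:
		return "high"
	case z >= 2.5:
		return "medium"
	case z >= 2:
		return "low"
	default:
		return ""
	}
}

func main() {
	z := zScore(50, 20, 10) // 3 standard deviations above the mean
	fmt.Println(z, severityForSigma(z))
}
```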
rcourtman
e374164c8c feat(ui): wire CPU anomaly indicator to dashboard
Connect anomaly data to the EnhancedCPUBar component in GuestRow.
When a VM/container's CPU is significantly above its learned baseline,
a pulsing indicator (e.g., '2.5x') appears directly on the CPU bar.

This provides real-time baseline deviation feedback without any LLM
involvement - purely deterministic statistical analysis.

Memory and disk anomaly hooks are prepared but not yet wired to their
respective bar components (TODO for follow-up).
2025-12-21 11:06:23 +00:00
rcourtman
869a88a800 feat(ui): add anomaly indicator components and hooks
Add frontend infrastructure for displaying baseline anomalies:
- useAnomalies hook for fetching and caching anomaly data
- AnomalyCell component for displaying multiple anomalies
- AnomalyIndicator/AnomalyBadge components for inline display
- Update EnhancedCPUBar to accept optional anomaly prop

The anomaly endpoint is polled every 30 seconds and cached.
Anomaly badges show severity (color) and deviation ratio (e.g., '2.5x').

This prepares the UI for displaying real-time baseline deviations
without requiring LLM interaction.
2025-12-21 11:04:18 +00:00
rcourtman
d9f1f7accd feat(ai): add real-time anomaly detection endpoint
Add /api/ai/intelligence/anomalies endpoint that compares live metrics
against learned baselines to surface deviations - all deterministic
(no LLM required).

Backend:
- Add AnomalyReport struct with severity classification
- Add CheckResourceAnomalies method to baseline store
- Add HandleGetAnomalies API handler
- Add GetStateProvider getter to AI service

Frontend:
- Add AnomalyReport and AnomaliesResponse types
- Add getAnomalies API function
- Add AnomalySeverity type

This is the first step toward surfacing deterministic intelligence
directly in the UI without requiring LLM interaction.
2025-12-21 10:52:54 +00:00