1. Fixed TestNewConfigPersistenceFailsWhenEncryptedDataPresentWithoutKey
- Test was picking up real encryption key from /etc/pulse during migration
- Now temporarily moves system key during test for proper isolation
- Uses t.Cleanup to ensure key is restored even on failure
2. Cleaned up console.log statements in production code
- Dashboard.tsx: replaced console.log with logger.debug for metadata events
- CompleteStep.tsx: removed verbose agent detection debug logs
These changes reduce log noise in production while maintaining debug
capability in development mode.
Addresses #866 - agents were logging 'WebSocket connection failed' warnings
even during normal reconnection scenarios (server restart, network blip, etc).
Changes:
- Normal close errors (1000, 1001, connection reset) now log at Debug level
- Only log Warning after 3+ consecutive failures
- Changed 'Connecting to Pulse' from Info to Debug to reduce noise
- Successful connections still log at Info level
The WebSocket is only used for AI command execution, not metrics, so
transient disconnections don't affect monitoring functionality.
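A minimal sketch of the level selection described above. The `Level` type, `levelForFailure`, and the string match on "connection reset" are hypothetical names and heuristics for illustration, not the agent's actual code. The sketch also assumes routine closes never escalate to Warning, which the description does not state.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Level is a hypothetical stand-in for the agent's real log levels.
type Level string

const (
	Debug   Level = "debug"
	Warning Level = "warning"
)

// warnAfter is the number of consecutive failures before a Warning is logged.
const warnAfter = 3

// isNormalClose reports whether a disconnect looks routine: close code 1000
// or 1001, or a connection reset. The string match is illustrative only.
func isNormalClose(closeCode int, err error) bool {
	if closeCode == 1000 || closeCode == 1001 {
		return true
	}
	return err != nil && strings.Contains(err.Error(), "connection reset")
}

// levelForFailure picks the log level for one failed connection attempt.
func levelForFailure(closeCode int, err error, consecutiveFailures int) Level {
	if isNormalClose(closeCode, err) {
		return Debug // assumption: routine closes never escalate
	}
	if consecutiveFailures >= warnAfter {
		return Warning
	}
	return Debug
}

func main() {
	fmt.Println(levelForFailure(1001, nil, 1))
	fmt.Println(levelForFailure(1006, errors.New("dial tcp: timeout"), 3))
}
```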
User feedback fields (DismissedReason, UserNote, TimesRaised, Suppressed, Source)
were not being saved to disk, causing 'expected behavior' dismissals to be lost
after Pulse restarted.
- Add missing fields to AIFindingRecord in persistence.go
- Update FindingsPersistenceAdapter to save/load these fields
- Add comprehensive tests for dismissal persistence round-trip
Fixes issue where Frigate storage warning kept reappearing despite being
marked as expected behavior.
Since watch/info findings are filtered from the UI and never shown
to users, don't include them in the patrol run status summary.
This makes the summary consistent with what users actually see.
The LLM was confusing VMIDs because they weren't included in the
context. Now the formatted context shows:
### Container: ollama (VMID 200) on minipc
This prevents the AI from referencing the wrong VMID when generating
findings and recommendations.
When the service restarts, it now checks if a patrol ran within the
last hour. If so, it skips the initial patrol to avoid wasting API
tokens during development/maintenance when the service is restarted
frequently.
The scheduled patrol runs (every 6 hours) are not affected.
100 samples were producing 326k+ input tokens, which is expensive.
24 samples (hourly resolution) still provides good pattern visibility
while significantly reducing token cost.
Estimated reduction: ~75% fewer metric tokens.
When AI patrol fails due to API issues like insufficient balance, invalid
API key, or rate limiting, we now create a finding that appears in the
AI Insights tab. This makes the issue visible to users rather than hidden
in logs.
The finding includes:
- Clear description of the issue (e.g., 'Insufficient API credits')
- Recommendation for how to fix it
- Evidence showing the actual error message
When a patrol run encounters errors (e.g., LLM call failed), don't
display 'All healthy' in the summary as that's misleading - the
analysis didn't complete properly.
Now shows 'Analysis incomplete (N errors)' instead, which correctly
explains why the status badge shows red/error.
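A sketch of the summary selection, assuming errors take precedence over any findings. `patrolSummary` and the "N critical, N warning" wording are hypothetical; only 'All healthy' and 'Analysis incomplete (N errors)' come from the description above.

```go
package main

import "fmt"

// patrolSummary builds the one-line status for a patrol run.
// Only critical and warning counts are passed in, since watch/info
// findings are not shown to users.
func patrolSummary(criticals, warnings, errorCount int) string {
	if errorCount > 0 {
		// The analysis did not finish, so "All healthy" would mislead.
		return fmt.Sprintf("Analysis incomplete (%d errors)", errorCount)
	}
	if criticals == 0 && warnings == 0 {
		return "All healthy"
	}
	// Hypothetical wording for runs that produced findings.
	return fmt.Sprintf("%d critical, %d warning", criticals, warnings)
}

func main() {
	fmt.Println(patrolSummary(0, 0, 2))
	fmt.Println(patrolSummary(0, 0, 0))
	fmt.Println(patrolSummary(1, 3, 0))
}
```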
Modern LLMs have 100k+ token contexts. 100 samples over 24h gives
~15 minute resolution while adding minimal token overhead.
This lets the LLM see fine-grained patterns, short spikes, and
accurately distinguish anomalies from normal behavior.
The in-memory MetricsHistory only retains 24 hours of data, not 7 days.
Changed computeGuestMetricSamples to use trendWindow24h instead of
trendWindow7d, and reduced sample count from 24 to 12 points.
This ensures the LLM actually receives metric samples in the context,
which wasn't happening before because the 7-day query returned empty data.
Shows a purple '⚡ Alert' badge on findings that were discovered through
alert-triggered analysis rather than scheduled patrol runs. This gives
users visibility into how findings were discovered without cluttering
the patrol run history table.
Bug Fixes:
- Fix boolean fields with 'omitempty' not persisting false values
- AlertTriggeredAnalysis, PatrolAnalyzeNodes/Guests/Docker/Storage
- omitempty causes Go to skip false (zero value) when marshaling JSON
- On reload, NewDefaultAIConfig() sets the field to true, and the missing field stays true
- Fix model dropdown losing selection after save (SolidJS reactivity issue)
- Added explicit 'selected' attribute to option elements
- Ensures browser maintains selection with optgroups during re-renders
Improvements:
- Change patrol type label from 'Quick' to 'Patrol' in history table
- Add chat_model and patrol_model to AI settings update log
- Add alert_triggered_analysis to AI config load log for debugging
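The omitempty problem from the first bug fix can be reproduced in isolation. The struct names and JSON tag below are illustrative, not the real AI config types. The `true` pre-set before unmarshaling stands in for what NewDefaultAIConfig() does.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withOmitEmpty mirrors the buggy tag: a false value is dropped on marshal.
type withOmitEmpty struct {
	AlertTriggeredAnalysis bool `json:"alert_triggered_analysis,omitempty"`
}

// withoutOmitEmpty always writes the field, so false survives a reload.
type withoutOmitEmpty struct {
	AlertTriggeredAnalysis bool `json:"alert_triggered_analysis"`
}

// reloadWithOmitEmpty saves false, then loads onto a default of true.
func reloadWithOmitEmpty() (bool, error) {
	data, err := json.Marshal(withOmitEmpty{AlertTriggeredAnalysis: false})
	if err != nil {
		return false, err
	}
	loaded := withOmitEmpty{AlertTriggeredAnalysis: true} // default value
	err = json.Unmarshal(data, &loaded)
	return loaded.AlertTriggeredAnalysis, err
}

// reloadWithoutOmitEmpty does the same round trip with the fixed tag.
func reloadWithoutOmitEmpty() (bool, error) {
	data, err := json.Marshal(withoutOmitEmpty{AlertTriggeredAnalysis: false})
	if err != nil {
		return false, err
	}
	loaded := withoutOmitEmpty{AlertTriggeredAnalysis: true} // default value
	err = json.Unmarshal(data, &loaded)
	return loaded.AlertTriggeredAnalysis, err
}

func main() {
	buggy, _ := reloadWithOmitEmpty()
	fixed, _ := reloadWithoutOmitEmpty()
	fmt.Println("with omitempty:", buggy)
	fmt.Println("without omitempty:", fixed)
}
```

With omitempty the saved JSON is `{}`, so the default of true is never overwritten on load.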
Instead of relying on pre-computed trend heuristics (which can be misleading
for edge cases like step changes vs continuous growth), we now pass downsampled
raw data points to the LLM so it can interpret patterns directly.
Changes:
- Add MetricSamples field to ResourceContext
- Add DownsampleMetrics() to reduce data points for LLM consumption
- Add formatMetricSamples() to format data compactly (e.g., 'Disk: 26→26→31%')
- Add computeGuestMetricSamples() to gather 7-day sampled history
- Populate MetricSamples for VMs and containers during context build
- Add History section to formatted context output
The LLM now sees actual patterns like 'stable for 6 days then jumped' rather
than just '45.8%/day growth rate' - allowing for much more nuanced interpretation.
This approach:
- Leverages LLM's pattern recognition instead of hard-coded heuristics
- Provides 7 days of data (~24 samples) for context on normal behavior
- Uses minimal tokens due to compact formatting with deduplication
- Is more future-proof as LLMs improve
Example output:
**History (7d sampled, oldest→newest)**: Disk: 26→26→26→26→26→31%
Refs: Frigate disk usage false positive investigation
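A rough sketch of the downsample-then-format idea. `downsample` and `formatSamples` are simplified stand-ins for DownsampleMetrics() and formatMetricSamples(), not their actual implementations. Even-stride sampling is an assumption, and the deduplication mentioned above is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// downsample keeps up to target evenly spaced points, always including
// the first and last value so the overall range stays visible.
func downsample(values []float64, target int) []float64 {
	if target <= 0 {
		return nil
	}
	if len(values) <= target {
		return values
	}
	if target == 1 {
		return []float64{values[len(values)-1]}
	}
	out := make([]float64, 0, target)
	for i := 0; i < target; i++ {
		idx := i * (len(values) - 1) / (target - 1)
		out = append(out, values[idx])
	}
	return out
}

// formatSamples renders values oldest to newest as "Label: a→b→c%".
func formatSamples(label string, values []float64) string {
	if len(values) == 0 {
		return ""
	}
	parts := make([]string, len(values))
	for i, v := range values {
		parts[i] = fmt.Sprintf("%.0f", v)
	}
	return label + ": " + strings.Join(parts, "→") + "%"
}

func main() {
	disk := []float64{26, 26, 26, 26, 26, 26, 31}
	fmt.Println(formatSamples("Disk", downsample(disk, 3)))
}
```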
- Remove unused correlations state and constants from AIOverviewTable
- Remove unused runbook-related imports, state, and functions from Alerts
- Add type annotation to Set() to fix type error
- Remove dead code left over from runbook UI removal
Filter out 'watch' and 'info' severity findings from the API response.
These lower-severity findings were mostly noise:
- 'watch': CPU is 35% instead of 11% (who cares)
- 'info': Stopped container exists (knew that)
Now only showing actionable findings:
- critical: Something is broken NOW
- warning: Something needs attention soon
Users prefer silence to noise.
Fix Receipts was showing 'No fixes logged' most of the time since:
- Runbooks were removed
- Remediation logging was inconsistent
The section only added visual clutter without value, so ~100 lines of UI code were removed.
Runbooks were a half-built feature that provided no value:
- Only 3 runbooks existed
- AI dynamic remediation already covers the same ground
- Added UI complexity without benefit
Removed:
- runbooks.go and runbooks_test.go
- Handler functions in ai_handlers.go
- Routes in router.go
- Test cases in ai_handlers_test.go
- Auto-fix call in patrol.go
Kept (dead code but harmless):
- Frontend types/API calls (will 404)
- RecordIncidentRunbook function (unused)
Less code = easier to maintain.
Updated LLM prompt with explicit guidance on what NOT to report:
- Small baseline deviations (7% vs 4% is normal variance)
- Low utilization (under 50% CPU or 60% memory is fine)
- Stopped containers that aren't autostart
- 'Elevated' metrics still well under limits
Severity guidelines made more specific:
- CRITICAL: disk >95%, service down, data loss
- WARNING: disk >85%, memory >90%, failures
- WATCH: Only for trends projected to hit critical in <7 days
- INFO: Context/observations
Key message to LLM: 'Users prefer silence to noise'
Only flag things that require operator action.
Smarter anomaly detection to reduce false positives:
**Learning Window:** 7 days → 14 days
- Captures weekly patterns (weekday vs weekend)
**Metric-Specific Thresholds:**
CPU:
- Only report if usage >70% AND >2x baseline
- Low CPU variance (5% vs 10%) is not actionable
Memory:
- Report if >80% OR (>1.5x baseline AND >60%)
- Memory is more stable, so a lower threshold makes sense
Disk:
- Report if >85% usage OR +15 percentage points growth
- Disk problems are critical, use absolute thresholds
Other metrics:
- Use 2x threshold as default
This dramatically reduces 'noise' anomalies while catching
actual problems that need operator attention.
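The metric-specific rules above can be sketched as a single predicate. `shouldReport` is a hypothetical name. Measuring disk growth as current minus baseline, and treating exactly +15 points as reportable, are assumptions.

```go
package main

import "fmt"

// shouldReport applies the per-metric thresholds to a current value and
// its learned baseline, both expressed as percentages.
func shouldReport(metric string, current, baseline float64) bool {
	switch metric {
	case "cpu":
		// High in absolute terms AND well above normal.
		return current > 70 && current > 2*baseline
	case "memory":
		return current > 80 || (current > 1.5*baseline && current > 60)
	case "disk":
		// Absolute usage, or growth of 15 percentage points over baseline.
		return current > 85 || current-baseline >= 15
	default:
		// Other metrics: 2x baseline; skip when no baseline is known.
		return baseline > 0 && current > 2*baseline
	}
}

func main() {
	fmt.Println(shouldReport("cpu", 10, 5))     // low CPU, not actionable
	fmt.Println(shouldReport("cpu", 75, 30))    // high and >2x baseline
	fmt.Println(shouldReport("memory", 65, 40)) // >1.5x baseline and >60%
	fmt.Println(shouldReport("disk", 50, 30))   // +20 points growth
}
```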
The AI Intelligence Summary was adding noise rather than value:
- Predictions duplicated patrol findings
- Correlations were not actionable
- 'Fixed' items were vague diagnostics
- Status changes were startup noise
The real value is in the patrol findings section which shows:
- Actual issues found (critical/warning/watch/info)
- Actionable recommendations
- Suppression rules
Keeping the patrol findings, removing the redundant summary.
More aggressive noise filtering:
1. Anomaly threshold raised from 1.5x to 2x
- 1.5x is too borderline to be actionable
- Now requires genuinely significant deviation
2. Filter out 'Ran diagnostic' and 'Executed command' fallback items
- These are generic summaries that provide no value
- Only show remediations with specific, meaningful descriptions
Goal: If something shows in AI Intelligence, it should demand attention.
Critical changes to surface only actionable insights:
1. Anomalies now require at least 50% deviation from baseline
- '1.0x baseline' values filtered out (statistically significant but not actionable)
- Must be >1.5x or <0.5x of baseline to report
2. Status changes filter out startup noise
- 'unknown → running' is just system starting, not a real state change
- Backups removed from main list (they have dedicated section)
3. Only show genuinely interesting changes:
- Config changes, migrations, restarts, deletions
- Things that require operator attention
This massively reduces noise while keeping high-signal alerts.
Correlations currently show 'A and B alert together' which isn't useful:
- Bidirectional correlations (A→B AND B→A) are just coincidence
- 'experiences alert' is too vague to be actionable
- No root cause identification - just shows correlated things correlate
Hidden until we can properly identify:
- Root cause chains (A CAUSES B, not just 'A and B happen together')
- Specific trigger types (what kind of alert?)
- Direction of causality
Other improvements:
- Stopped VMs/containers filtered from anomaly detection
- Lower noise, more signal
Critical fixes to show only actionable insights:
1. Skip stopped VMs/containers from anomaly detection
- '0.0x baseline' for stopped resources is expected, not an anomaly
- Only check anomalies for status='running'
2. Filter correlations by confidence (>=70%)
- Low confidence correlations are likely coincidental
- Only show high-confidence, actionable dependencies
This reduces noise and surfaces genuinely useful intelligence.
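Both filters are simple predicates over the input lists. The `Resource` and `Correlation` types below are hypothetical, and the sketch assumes confidence is stored as a 0.0 to 1.0 fraction.

```go
package main

import "fmt"

// Resource and Correlation are illustrative types, not the real models.
type Resource struct {
	Name   string
	Status string
}

type Correlation struct {
	Source, Target string
	Confidence     float64 // 0.0 to 1.0
}

// minConfidence is the 70% cutoff for showing a correlation.
const minConfidence = 0.70

// runningOnly drops stopped resources before anomaly checks.
func runningOnly(resources []Resource) []Resource {
	var out []Resource
	for _, r := range resources {
		if r.Status == "running" {
			out = append(out, r)
		}
	}
	return out
}

// highConfidence keeps correlations at or above the cutoff.
func highConfidence(correlations []Correlation) []Correlation {
	var out []Correlation
	for _, c := range correlations {
		if c.Confidence >= minConfidence {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	rs := []Resource{{"web", "running"}, {"backup", "stopped"}}
	cs := []Correlation{{"db", "web", 0.92}, {"web", "cache", 0.41}}
	fmt.Println(runningOnly(rs))
	fmt.Println(highConfidence(cs))
}
```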
Changed AIOverviewTable to use Promise.allSettled instead of
Promise.all so that one failing endpoint (e.g., anomalies 404)
doesn't break the entire component.
Each API result now has a fallback for failed requests, allowing
the table to gracefully degrade when endpoints are unavailable.
Separate anomalies API call from Promise.all so that a failure
in the anomalies endpoint doesn't break the entire AI Overview.
This fixes 'Failed to load AI overview data' error when the
anomalies endpoint isn't available (e.g., patrol not started).
Added collapsible sections to keep the list from becoming overwhelming:
- Dependencies limited to top 5 (sorted by confidence)
- Actions limited to top 5
- Changes limited to top 5
- 'Show more' buttons appear at bottom when items are hidden
- Clicking expands to show all items in that category
This addresses user feedback about excessive scrolling when
there are many dependency correlations or remediation actions.
Adds real-time anomaly detection results to the AI Overview Table:
- Anomalies appear at TOP of list (before predictions) since they're real-time
- Severity-based color coding (critical=red, high=orange, medium=amber, low=blue)
- Shows resource name, metric, and deviation ratio (e.g., 'CPU at 2.5x baseline')
- Subtitle shows current vs baseline values
- Timestamp shows 'Now' since anomalies are current state
This integrates the FREE anomaly detection feature directly alongside
the Pro patrol insights, providing immediate value to all users.
New useLearningStatus hook:
- Polls /api/ai/intelligence/learning every 60 seconds
- Provides resourceCount(), metricCount(), learningState()
- Convenience accessors: isActive(), isLearning(), isWaiting()
Enhanced AIStatusIndicator:
- Now shows when ANY baselines exist (not just when Patrol enabled)
- Tooltip shows 'X resources baselined' for transparency
- Healthy state shows '45 resources baselined'
- Works even without Pro license since baselines are FREE
This makes the AI presence visible from the moment Pulse starts
learning, providing immediate value feedback to all users.
Free Features (no license required):
- Anomaly detection - removed license gating, purely statistical analysis
- Learning status endpoint - GET /api/ai/intelligence/learning
Learning Status Response:
- resources_baselined: count of resources with learned baselines
- total_metrics: total metric baselines (cpu + memory + disk)
- metric_breakdown: {cpu: X, memory: Y, disk: Z}
- status: 'waiting' | 'learning' | 'active'
- message: human-readable description
This makes the AI intelligence features visible to all users,
encouraging upgrades for the full LLM-powered patrol experience.
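For reference, the response shape could be modelled roughly like this. The JSON field names come from the list above. The Go type, the constructor, and the sample values are hypothetical, and the rule for choosing between 'waiting', 'learning', and 'active' is left to the caller because it is not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LearningStatus mirrors the documented response fields.
type LearningStatus struct {
	ResourcesBaselined int            `json:"resources_baselined"`
	TotalMetrics       int            `json:"total_metrics"`
	MetricBreakdown    map[string]int `json:"metric_breakdown"`
	Status             string         `json:"status"` // waiting | learning | active
	Message            string         `json:"message"`
}

// newLearningStatus derives total_metrics by summing the breakdown.
func newLearningStatus(resources int, breakdown map[string]int, status, message string) LearningStatus {
	total := 0
	for _, n := range breakdown {
		total += n
	}
	return LearningStatus{
		ResourcesBaselined: resources,
		TotalMetrics:       total,
		MetricBreakdown:    breakdown,
		Status:             status,
		Message:            message,
	}
}

func main() {
	s := newLearningStatus(45,
		map[string]int{"cpu": 45, "memory": 45, "disk": 40},
		"active", "45 resources baselined")
	out, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```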
AIStatusIndicator:
- Now shows BOTH patrol findings AND baseline anomalies
- Displays even when only anomaly detection is active (no patrol)
- Badge count includes both findings + anomalies
- Tooltip provides detailed breakdown by severity
Trend Prediction (backend):
- Add TrendPrediction struct for resource exhaustion forecasting
- CalculateTrend() uses linear regression on sample history
- Predicts days until resource is full (or if declining/stable)
- Severity: critical (<7 days), warning (<30 days), info (>30 days)
- Human-readable descriptions like 'full in ~2 weeks (+0.5% per day)'
This creates a more cohesive intelligence experience where anomaly
detection works independently of the pro/patrol features, making
value visible immediately to all users.
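A minimal sketch of the forecasting idea, using ordinary least squares over (day, percent) samples. The function names are hypothetical and this is not the actual CalculateTrend() code. The description leaves exactly 30 days unassigned, and the sketch treats it as info.

```go
package main

import "fmt"

// slopePerDay fits a least-squares line through the samples and returns
// its slope in percentage points per day. ok is false when the inputs
// are too short, mismatched, or have no spread in time.
func slopePerDay(days, values []float64) (slope float64, ok bool) {
	if len(days) < 2 || len(days) != len(values) {
		return 0, false
	}
	n := float64(len(days))
	var sumX, sumY, sumXY, sumXX float64
	for i := range days {
		sumX += days[i]
		sumY += values[i]
		sumXY += days[i] * values[i]
		sumXX += days[i] * days[i]
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0, false
	}
	return (n*sumXY - sumX*sumY) / denom, true
}

// daysUntilFull projects when usage reaches 100%. ok is false when usage
// is stable or declining, so no exhaustion date exists.
func daysUntilFull(current, slope float64) (days float64, ok bool) {
	if slope <= 0 {
		return 0, false
	}
	if current >= 100 {
		return 0, true
	}
	return (100 - current) / slope, true
}

// trendSeverity maps the projection onto the documented severity bands.
func trendSeverity(days float64) string {
	switch {
	case days < 7:
		return "critical"
	case days < 30:
		return "warning"
	default:
		return "info"
	}
}

func main() {
	slope, _ := slopePerDay([]float64{0, 1, 2, 3}, []float64{50, 52, 54, 56})
	days, _ := daysUntilFull(56, slope)
	fmt.Printf("+%.1f%% per day, full in ~%.0f days (%s)\n", slope, days, trendSeverity(days))
}
```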
Complete the anomaly indicator integration for all three metrics:
- CPU: EnhancedCPUBar (already done)
- Memory: StackedMemoryBar (new)
- Disk: StackedDiskBar (new)
All three metric bars now show a pulsing indicator (e.g., '2.5x↑')
when the current value is significantly above the learned baseline.
Severity colors:
- Critical (>4σ): red
- High (3-4σ): orange
- Medium (2.5-3σ): yellow
- Low (2-2.5σ): blue
This is 100% deterministic - no LLM involved. The indicators appear
automatically based on statistical deviation from learned baselines.
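The severity bands map onto a z-score roughly as follows. `zScore` and `severityForSigma` are hypothetical names. Which band owns the exact boundary values (3σ, 2.5σ, 2σ) is an assumption, since the ranges above overlap at their edges.

```go
package main

import "fmt"

// zScore is how many standard deviations current sits above the baseline
// mean. It returns 0 when the baseline has no spread.
func zScore(current, mean, stddev float64) float64 {
	if stddev <= 0 {
		return 0
	}
	return (current - mean) / stddev
}

// severityForSigma returns the severity and badge color for a deviation,
// or empty strings when the value is within 2σ of the baseline.
func severityForSigma(sigma float64) (severity, color string) {
	switch {
	case sigma > 4:
		return "critical", "red"
	case sigma >= 3:
		return "high", "orange"
	case sigma >= 2.5:
		return "medium", "yellow"
	case sigma >= 2:
		return "low", "blue"
	default:
		return "", ""
	}
}

func main() {
	sigma := zScore(50, 20, 10) // 3 standard deviations above the mean
	severity, color := severityForSigma(sigma)
	fmt.Println(sigma, severity, color)
}
```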
Connect anomaly data to the EnhancedCPUBar component in GuestRow.
When a VM/container's CPU is significantly above its learned baseline,
a pulsing indicator (e.g., '2.5x') appears directly on the CPU bar.
This provides real-time baseline deviation feedback without any LLM
involvement - purely deterministic statistical analysis.
Memory and disk anomaly hooks are prepared but not yet wired to their
respective bar components (TODO for follow-up).
Add frontend infrastructure for displaying baseline anomalies:
- useAnomalies hook for fetching and caching anomaly data
- AnomalyCell component for displaying multiple anomalies
- AnomalyIndicator/AnomalyBadge components for inline display
- Update EnhancedCPUBar to accept optional anomaly prop
The anomaly endpoint is polled every 30 seconds and cached.
Anomaly badges show severity (color) and deviation ratio (e.g., '2.5x').
This prepares the UI for displaying real-time baseline deviations
without requiring LLM interaction.
Add /api/ai/intelligence/anomalies endpoint that compares live metrics
against learned baselines to surface deviations - all deterministic
(no LLM required).
Backend:
- Add AnomalyReport struct with severity classification
- Add CheckResourceAnomalies method to baseline store
- Add HandleGetAnomalies API handler
- Add GetStateProvider getter to AI service
Frontend:
- Add AnomalyReport and AnomaliesResponse types
- Add getAnomalies API function
- Add AnomalySeverity type
This is the first step toward surfacing deterministic intelligence
directly in the UI without requiring LLM interaction.