Bug Fixes:
- Fix boolean fields with 'omitempty' not persisting false values
- AlertTriggeredAnalysis, PatrolAnalyzeNodes/Guests/Docker/Storage
- omitempty causes Go to skip false (zero value) when marshaling JSON
- On reload, NewDefaultAIConfig() sets true, and missing field stays true
- Fix model dropdown losing selection after save (SolidJS reactivity issue)
- Added explicit 'selected' attribute to option elements
- Ensures browser maintains selection with optgroups during re-renders
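The omitempty issue described above can be reproduced in a few lines. This is a minimal sketch, not the actual Pulse config code: the struct names and helper functions are hypothetical, and only the alert_triggered_analysis field name is taken from these notes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical config types for illustration only.
type configWithOmitempty struct {
	AlertTriggeredAnalysis bool `json:"alert_triggered_analysis,omitempty"`
}

type configFixed struct {
	AlertTriggeredAnalysis bool `json:"alert_triggered_analysis"`
}

// saveAndReloadBuggy marshals with omitempty, then unmarshals onto a
// struct whose default is true (standing in for NewDefaultAIConfig()).
func saveAndReloadBuggy(value bool) bool {
	data, _ := json.Marshal(configWithOmitempty{AlertTriggeredAnalysis: value})
	loaded := configWithOmitempty{AlertTriggeredAnalysis: true}
	_ = json.Unmarshal(data, &loaded)
	return loaded.AlertTriggeredAnalysis
}

// saveAndReloadFixed does the same without omitempty, so false is written out.
func saveAndReloadFixed(value bool) bool {
	data, _ := json.Marshal(configFixed{AlertTriggeredAnalysis: value})
	loaded := configFixed{AlertTriggeredAnalysis: true}
	_ = json.Unmarshal(data, &loaded)
	return loaded.AlertTriggeredAnalysis
}

func main() {
	fmt.Println("with omitempty, saved false, reloaded:", saveAndReloadBuggy(false))
	fmt.Println("without omitempty, saved false, reloaded:", saveAndReloadFixed(false))
}
```

With omitempty the false value never reaches the JSON, so the default of true survives the reload. Without it the explicit false is restored.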
Improvements:
- Change patrol type label from 'Quick' to 'Patrol' in history table
- Add chat_model and patrol_model to AI settings update log
- Add alert_triggered_analysis to AI config load log for debugging
Instead of relying on pre-computed trend heuristics (which can be misleading
for edge cases like step changes vs continuous growth), we now pass downsampled
raw data points to the LLM so it can interpret patterns directly.
Changes:
- Add MetricSamples field to ResourceContext
- Add DownsampleMetrics() to reduce data points for LLM consumption
- Add formatMetricSamples() to format data compactly (e.g., 'Disk: 26→26→31%')
- Add computeGuestMetricSamples() to gather 7-day sampled history
- Populate MetricSamples for VMs and containers during context build
- Add History section to formatted context output
The LLM now sees actual patterns like 'stable for 6 days then jumped' rather
than just '45.8%/day growth rate' - allowing for much more nuanced interpretation.
This approach:
- Leverages LLM's pattern recognition instead of hard-coded heuristics
- Provides 7 days of data (~24 samples) for context on normal behavior
- Uses minimal tokens due to compact formatting with deduplication
- Is more future-proof as LLMs improve
Example output:
**History (7d sampled, oldest→newest)**: Disk: 26→26→26→26→26→31%
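A rough sketch of the downsample-then-format idea follows. The function names here are hypothetical stand-ins, not the real DownsampleMetrics() or formatMetricSamples(). The even-stride sampling strategy is an assumption, and the deduplication step mentioned above is deliberately left out because its rules are not described in these notes.

```go
package main

import (
	"fmt"
	"strings"
)

// downsampleSketch keeps at most maxPoints evenly spaced values,
// always including the first and last sample.
func downsampleSketch(values []float64, maxPoints int) []float64 {
	if maxPoints <= 0 || len(values) == 0 {
		return nil
	}
	if len(values) <= maxPoints {
		return append([]float64(nil), values...)
	}
	if maxPoints == 1 {
		return []float64{values[len(values)-1]}
	}
	last := len(values) - 1
	out := make([]float64, 0, maxPoints)
	for i := 0; i < maxPoints; i++ {
		out = append(out, values[i*last/(maxPoints-1)])
	}
	return out
}

// formatSamplesSketch renders values oldest to newest, e.g. "Disk: 26→26→31%".
func formatSamplesSketch(label string, values []float64) string {
	if len(values) == 0 {
		return ""
	}
	parts := make([]string, len(values))
	for i, v := range values {
		parts[i] = fmt.Sprintf("%.0f", v)
	}
	return label + ": " + strings.Join(parts, "→") + "%"
}

func main() {
	raw := []float64{26, 26, 26, 26, 26, 26, 26, 26, 26, 31}
	fmt.Println(formatSamplesSketch("Disk", downsampleSketch(raw, 4)))
}
```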
Refs: Frigate disk usage false positive investigation
- Remove unused correlations state and constants from AIOverviewTable
- Remove unused runbook-related imports, state, and functions from Alerts
- Add type annotation to Set() to fix type error
- Removes dead code left over from runbook UI removal
Filter out 'watch' and 'info' severity findings from the API response.
These lower-severity findings were mostly noise:
- 'watch': CPU is 35% instead of 11% (who cares)
- 'info': Stopped container exists (knew that)
Now only showing actionable findings:
- critical: Something is broken NOW
- warning: Something needs attention soon
Users prefer silence to noise.
Fix Receipts was showing 'No fixes logged' most of the time since:
- Runbooks were removed
- Remediation logging was inconsistent
The section just added visual clutter without value. Removed ~100 lines of UI code.
Runbooks were a half-built feature that provided no value:
- Only 3 runbooks existed
- AI dynamic remediation already covers the same ground
- Added UI complexity without benefit
Removed:
- runbooks.go and runbooks_test.go
- Handler functions in ai_handlers.go
- Routes in router.go
- Test cases in ai_handlers_test.go
- Auto-fix call in patrol.go
Kept (dead code but harmless):
- Frontend types/API calls (will 404)
- RecordIncidentRunbook function (unused)
Less code = easier to maintain.
Updated LLM prompt with explicit guidance on what NOT to report:
- Small baseline deviations (7% vs 4% is normal variance)
- Low utilization (under 50% CPU or 60% memory is fine)
- Stopped containers that aren't autostart
- 'Elevated' metrics still well under limits
Severity guidelines made more specific:
- CRITICAL: disk >95%, service down, data loss
- WARNING: disk >85%, memory >90%, failures
- WATCH: Only for trends projected to hit critical in <7 days
- INFO: Context/observations
Key message to LLM: 'Users prefer silence to noise'
Only flag things that require operator action.
Smarter anomaly detection to reduce false positives:
**Learning Window:** 7 days → 14 days
- Captures weekly patterns (weekday vs weekend)
**Metric-Specific Thresholds:**
CPU:
- Only report if usage >70% AND >2x baseline
- Low CPU variance (5% vs 10%) is not actionable
Memory:
- Report if >80% OR (>1.5x baseline AND >60%)
- Memory is more stable, so a lower threshold makes sense
Disk:
- Report if >85% usage OR +15 percentage points growth
- Disk problems are critical, so use absolute thresholds
Other metrics:
- Use 2x threshold as default
This dramatically reduces 'noise' anomalies while catching
actual problems that need operator attention.
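The metric-specific rules above can be sketched as a single predicate. This is illustrative only: the function name and signature are hypothetical, values are assumed to be percentages, and "+15 percentage points growth" is interpreted here as current minus baseline, which the notes do not state explicitly.

```go
package main

import "fmt"

// shouldReportSketch applies the thresholds listed above.
// current and baseline are assumed to be usage percentages (0-100).
func shouldReportSketch(metric string, current, baseline float64) bool {
	switch metric {
	case "cpu":
		// Needs both high absolute usage and a large jump over baseline.
		return current > 70 && current > 2*baseline
	case "memory":
		return current > 80 || (current > 1.5*baseline && current > 60)
	case "disk":
		// Assumption: growth is measured against the learned baseline.
		return current > 85 || current-baseline >= 15
	default:
		// Guard against a zero baseline so every non-zero value is not flagged.
		return baseline > 0 && current > 2*baseline
	}
}

func main() {
	fmt.Println(shouldReportSketch("cpu", 10, 5))     // low CPU, 2x baseline
	fmt.Println(shouldReportSketch("cpu", 80, 30))    // high CPU, >2x baseline
	fmt.Println(shouldReportSketch("memory", 65, 40)) // >1.5x baseline and >60%
	fmt.Println(shouldReportSketch("disk", 45, 26))   // +19 points
}
```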
The AI Intelligence Summary was adding noise rather than value:
- Predictions duplicated patrol findings
- Correlations were not actionable
- 'Fixed' items were vague diagnostics
- Status changes were startup noise
The real value is in the patrol findings section which shows:
- Actual issues found (critical/warning/watch/info)
- Actionable recommendations
- Suppression rules
Keeping the patrol findings, removing the redundant summary.
More aggressive noise filtering:
1. Anomaly threshold raised from 1.5x to 2x
- 1.5x is too borderline to be actionable
- Now requires genuinely significant deviation
2. Filter out 'Ran diagnostic' and 'Executed command' fallback items
- These are generic summaries that provide no value
- Only show remediations with specific, meaningful descriptions
Goal: If something shows in AI Intelligence, it should demand attention.
Critical changes to surface only actionable insights:
1. Anomalies now require at least 50% deviation from baseline
- '1.0x baseline' values filtered out (statistically significant but not actionable)
- Must be >1.5x OR <0.5x baseline to report
2. Status changes filter out startup noise
- 'unknown → running' is just system starting, not a real state change
- Backups removed from main list (they have dedicated section)
3. Only show genuinely interesting changes:
- Config changes, migrations, restarts, deletions
- Things that require operator attention
This massively reduces noise while keeping high-signal alerts.
Correlations currently show 'A and B alert together' which isn't useful:
- Bidirectional correlations (A→B AND B→A) are just coincidence
- 'experiences alert' is too vague to be actionable
- No root cause identification - just shows correlated things correlate
Hidden until we can properly identify:
- Root cause chains (A CAUSES B, not just 'A and B happen together')
- Specific trigger types (what kind of alert?)
- Direction of causality
Other improvements:
- Stopped VMs/containers filtered from anomaly detection
- Lower noise, more signal
Critical fixes to show only actionable insights:
1. Skip stopped VMs/containers from anomaly detection
- '0.0x baseline' for stopped resources is expected, not an anomaly
- Only check anomalies for status='running'
2. Filter correlations by confidence (>=70%)
- Low confidence correlations are likely coincidental
- Only show high-confidence, actionable dependencies
This reduces noise and surfaces genuinely useful intelligence.
Changed AIOverviewTable to use Promise.allSettled instead of
Promise.all so that one failing endpoint (e.g., anomalies 404)
doesn't break the entire component.
Each API result now has a fallback for failed requests, allowing
the table to gracefully degrade when endpoints are unavailable.
Separate anomalies API call from Promise.all so that a failure
in the anomalies endpoint doesn't break the entire AI Overview.
This fixes 'Failed to load AI overview data' error when the
anomalies endpoint isn't available (e.g., patrol not started).
Added collapsible sections to prevent an overwhelming list:
- Dependencies limited to top 5 (sorted by confidence)
- Actions limited to top 5
- Changes limited to top 5
- 'Show more' buttons appear at bottom when items are hidden
- Clicking expands to show all items in that category
This addresses user feedback about excessive scrolling when
there are many dependency correlations or remediation actions.
Adds real-time anomaly detection results to the AI Overview Table:
- Anomalies appear at TOP of list (before predictions) since they're real-time
- Severity-based color coding (critical=red, high=orange, medium=amber, low=blue)
- Shows resource name, metric, and deviation ratio (e.g., 'CPU at 2.5x baseline')
- Subtitle shows current vs baseline values
- Timestamp shows 'Now' since anomalies are current state
This integrates the FREE anomaly detection feature directly alongside
the Pro patrol insights, providing immediate value to all users.
New useLearningStatus hook:
- Polls /api/ai/intelligence/learning every 60 seconds
- Provides resourceCount(), metricCount(), learningState()
- Convenience accessors: isActive(), isLearning(), isWaiting()
Enhanced AIStatusIndicator:
- Now shows when ANY baselines exist (not just when Patrol enabled)
- Tooltip shows 'X resources baselined' for transparency
- Healthy state shows '45 resources baselined'
- Works even without Pro license since baselines are FREE
This makes the AI presence visible from the moment Pulse starts
learning, providing immediate value feedback to all users.
Free Features (no license required):
- Anomaly detection - removed license gating, purely statistical analysis
- Learning status endpoint - GET /api/ai/intelligence/learning
Learning Status Response:
- resources_baselined: count of resources with learned baselines
- total_metrics: total metric baselines (cpu + memory + disk)
- metric_breakdown: {cpu: X, memory: Y, disk: Z}
- status: 'waiting' | 'learning' | 'active'
- message: human-readable description
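For a client consuming this response, a decoding sketch might look like the following. The JSON field names come from the list above. The Go type name, the field types (ints and a string-to-int map), and every value in the sample payload are assumptions made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// learningStatusSketch mirrors the response fields listed above.
// The Go types are assumptions; only the JSON keys are from the notes.
type learningStatusSketch struct {
	ResourcesBaselined int            `json:"resources_baselined"`
	TotalMetrics       int            `json:"total_metrics"`
	MetricBreakdown    map[string]int `json:"metric_breakdown"`
	Status             string         `json:"status"`
	Message            string         `json:"message"`
}

// Made-up sample payload for illustration only.
const sampleLearningStatus = `{
  "resources_baselined": 45,
  "total_metrics": 135,
  "metric_breakdown": {"cpu": 45, "memory": 45, "disk": 45},
  "status": "active",
  "message": "example message"
}`

// decodeLearningStatus parses the body and rejects unknown status values.
func decodeLearningStatus(body []byte) (learningStatusSketch, error) {
	var st learningStatusSketch
	if err := json.Unmarshal(body, &st); err != nil {
		return st, err
	}
	switch st.Status {
	case "waiting", "learning", "active":
		return st, nil
	default:
		return st, fmt.Errorf("unexpected status %q", st.Status)
	}
}

func main() {
	st, err := decodeLearningStatus([]byte(sampleLearningStatus))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%d resources baselined (%s)\n", st.ResourcesBaselined, st.Status)
}
```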
This makes the AI intelligence features visible to all users,
encouraging upgrades for the full LLM-powered patrol experience.
AIStatusIndicator:
- Now shows BOTH patrol findings AND baseline anomalies
- Displays even when only anomaly detection is active (no patrol)
- Badge count includes both findings + anomalies
- Tooltip provides detailed breakdown by severity
Trend Prediction (backend):
- Add TrendPrediction struct for resource exhaustion forecasting
- CalculateTrend() uses linear regression on sample history
- Predicts days until resource is full (or if declining/stable)
- Severity: critical (<7 days), warning (<30 days), info (>30 days)
- Human-readable descriptions like 'full in ~2 weeks (+0.5% per day)'
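As a sketch of the approach, an ordinary least-squares fit over (day, percent) samples gives a growth rate, from which days until full and a severity follow. The names below are hypothetical and this is not the actual CalculateTrend() code. How exactly 30 days is classified is not specified above, so it falls into info here.

```go
package main

import "fmt"

// usageSample is one observation: a day offset and usage in percent.
type usageSample struct {
	Day float64
	Pct float64
}

// slopePerDay fits a least-squares line and returns percent growth per day.
func slopePerDay(samples []usageSample) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for _, s := range samples {
		sumX += s.Day
		sumY += s.Pct
		sumXY += s.Day * s.Pct
		sumXX += s.Day * s.Day
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0, false // all samples share the same day
	}
	return (n*sumXY - sumX*sumY) / denom, true
}

// daysUntilFull extrapolates from the latest sample to 100%.
// It returns false when usage is stable or declining.
func daysUntilFull(samples []usageSample) (float64, bool) {
	slope, ok := slopePerDay(samples)
	if !ok || slope <= 0 {
		return 0, false
	}
	latest := samples[len(samples)-1].Pct
	return (100 - latest) / slope, true
}

// trendSeverity maps days until full onto the severities listed above.
func trendSeverity(days float64) string {
	switch {
	case days < 7:
		return "critical"
	case days < 30:
		return "warning"
	default:
		return "info"
	}
}

func main() {
	samples := []usageSample{{0, 50}, {1, 50.5}, {2, 51}, {3, 51.5}, {4, 52}}
	if days, ok := daysUntilFull(samples); ok {
		fmt.Printf("full in ~%.0f days (%s)\n", days, trendSeverity(days))
	}
}
```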
This creates a more cohesive intelligence experience where anomaly
detection works independently of the pro/patrol features, making
value visible immediately to all users.
Complete the anomaly indicator integration for all three metrics:
- CPU: EnhancedCPUBar (already done)
- Memory: StackedMemoryBar (new)
- Disk: StackedDiskBar (new)
All three metric bars now show a pulsing indicator (e.g., '2.5x↑')
when the current value is significantly above the learned baseline.
Severity colors:
- Critical (>4σ): red
- High (3-4σ): orange
- Medium (2.5-3σ): yellow
- Low (2-2.5σ): blue
This is 100% deterministic - no LLM involved. The indicators appear
automatically based on statistical deviation from learned baselines.
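The severity bands above amount to a z-score lookup. A minimal sketch, with hypothetical names: the treatment of exact boundary values (for example, exactly 3σ) is an assumption, and only deviations above the baseline are classified, matching the description.

```go
package main

import "fmt"

// zScoreSketch returns how many standard deviations current is above mean.
// A non-positive stddev yields 0 so no anomaly is raised from a flat baseline.
func zScoreSketch(current, mean, stddev float64) float64 {
	if stddev <= 0 {
		return 0
	}
	return (current - mean) / stddev
}

// severityForSigma maps a z-score onto the bands listed above.
// An empty string means the deviation is below the reporting floor.
func severityForSigma(z float64) string {
	switch {
	case z > 4:
		return "critical"
	case z >= 3:
		return "high"
	case z >= 2.5:
		return "medium"
	case z >= 2:
		return "low"
	default:
		return ""
	}
}

func main() {
	z := zScoreSketch(50, 20, 10)
	fmt.Printf("z=%.1f severity=%q\n", z, severityForSigma(z))
}
```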
Connect anomaly data to the EnhancedCPUBar component in GuestRow.
When a VM/container's CPU is significantly above its learned baseline,
a pulsing indicator (e.g., '2.5x') appears directly on the CPU bar.
This provides real-time baseline deviation feedback without any LLM
involvement - purely deterministic statistical analysis.
Memory and disk anomaly hooks are prepared but not yet wired to their
respective bar components (TODO for follow-up).
Add frontend infrastructure for displaying baseline anomalies:
- useAnomalies hook for fetching and caching anomaly data
- AnomalyCell component for displaying multiple anomalies
- AnomalyIndicator/AnomalyBadge components for inline display
- Update EnhancedCPUBar to accept optional anomaly prop
The anomaly endpoint is polled every 30 seconds and cached.
Anomaly badges show severity (color) and deviation ratio (e.g., '2.5x').
This prepares the UI for displaying real-time baseline deviations
without requiring LLM interaction.
Add /api/ai/intelligence/anomalies endpoint that compares live metrics
against learned baselines to surface deviations - all deterministic
(no LLM required).
Backend:
- Add AnomalyReport struct with severity classification
- Add CheckResourceAnomalies method to baseline store
- Add HandleGetAnomalies API handler
- Add GetStateProvider getter to AI service
Frontend:
- Add AnomalyReport and AnomaliesResponse types
- Add getAnomalies API function
- Add AnomalySeverity type
This is the first step toward surfacing deterministic intelligence
directly in the UI without requiring LLM interaction.
- Create Intelligence struct that aggregates all AI subsystems
- Add /api/ai/intelligence endpoint for system-wide and per-resource insights
- Wire Intelligence into PatrolService as a facade (not replacement)
- Add TypeScript types and API client for frontend
- Add unit tests for Intelligence orchestrator
- Fix pre-existing test failures using diagnostic commands instead of actionable ones
The Intelligence orchestrator provides:
- System-wide health scoring (A-F grades)
- Aggregated findings, predictions, correlations
- Per-resource context generation for AI prompts
- Learning progress tracking
This unifies access to AI subsystems without replacing existing code paths.
- KubernetesClusters.tsx: Escape -> as → in JSX text to fix parsing error
- Settings.tsx: Remove unused HostProxySummary interface (deprecated in v5)
- AIOverviewTable.tsx: Prefix unused summarizeAction with underscore
The pulse-sensor-proxy feature was deprecated in v5 and disabled by default.
The frontend was still calling /api/temperature-proxy/host-status which
returned 410 Gone, causing console errors.
Removed:
- HostProxyStatusResponse interface
- _hostProxyStatus signal (was never read)
- refreshHostProxyStatus function
- Polling interval that called the deprecated endpoint
The temperature monitoring now uses pulse-agent instead.
IMPORTANT: This disables the encryption key deletion during migration.
Previously, when migrating from /etc/pulse to a new data directory, the code
would DELETE the original key after copying it. This was causing mysterious
key loss bugs in dev environments.
Changes:
- Commented out the os.Remove() call that deletes the encryption key
- Keep both copies of the key for safety (old location is just unused)
- Updated test to skip when production key exists (test isolation issue)
The old key at /etc/pulse will now be preserved even after migration.
This is safe because:
1. The new key location is checked first
2. Having a backup is better than risking data loss
3. Users can manually clean up the old key if desired
Added extensive logging to crypto.go to trace when the encryption key
migration code runs and when it deletes the key. This is to diagnose
a recurring bug where the encryption key mysteriously disappears.
The logs will show:
- When migration is being considered (dataDir != /etc/pulse)
- When migration is skipped (dataDir == /etc/pulse)
- CRITICAL log when key is about to be deleted
- CRITICAL log when key has been deleted
This will help identify whether it's the Go code or something external
deleting the key.
Backend:
- Enhanced buildEnrichedResourceContext to ALWAYS show learned baselines with
status indicators (normal/elevated/anomaly) instead of only when anomalous
- This makes Pulse Pro's 'moat' visible - users can see the AI understands
their infrastructure's normal behavior patterns
- Added baseline import to service.go
Frontend (user changes):
- Added incident event type filtering with toggle buttons
- Added resource incident panel to view all incidents for a resource
- Added timeline expand/collapse functionality in alert history
- Added incident note saving with proper incidentId tracking
- Added startedAt parameter for proper incident timeline loading
Multiple frontend components were using a hyphen-joined ID without the node
as a fallback when guest.id was falsy. Because this format drops the node
component, it is unsafe in clustered setups, where the same VMID can exist
on different nodes.
Changes:
- GuestDrawer.tsx: Updated guestId() and handleAskAI() to use canonical format
- GuestRow.tsx: Updated buildGuestId() to use canonical format
- Dashboard.tsx: Updated handleGuestRowClick() and guest rendering loop,
also fixed legacy metadata fallback to use consistent keying
- ThresholdsTable.tsx: Updated guestsGroupedByNode() to use canonical format
Backend changes:
- Removed temporary debug logging added during investigation
- Added alert history section to AI buildEnrichedResourceContext() function
The backend generates VM/Container IDs in instance:node:vmid format (e.g.,
delly:delly:101) via makeGuestID(). This format is now consistently used
across all frontend fallbacks to prevent AI context, metadata, overrides,
and metrics from colliding or desyncing in clustered environments.
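To illustrate why the node belongs in the ID, here is a small sketch of the instance:node:vmid format. The helper name and its parameters are hypothetical and do not reflect the real makeGuestID() signature.

```go
package main

import "fmt"

// canonicalGuestIDSketch builds an ID in the instance:node:vmid format.
func canonicalGuestIDSketch(instance, node string, vmid int) string {
	return fmt.Sprintf("%s:%s:%d", instance, node, vmid)
}

func main() {
	// Matches the example given above.
	fmt.Println(canonicalGuestIDSketch("delly", "delly", 101))

	// The same VMID on two nodes stays distinct because the node is part of the key.
	a := canonicalGuestIDSketch("cluster", "node1", 101)
	b := canonicalGuestIDSketch("cluster", "node2", 101)
	fmt.Println(a != b)
}
```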
- Fixed normalizeStorageDefaults to allow Trigger=0
- Fixed normalizeNodeDefaults (Temperature) to allow Trigger=0
- Added comprehensive tests for all threshold normalization patterns
- Updated existing test that expected old behavior
Related to #864
- Login.tsx: Use apiClient.fetch with skipAuth to avoid auth loops
- router.go: Skip CSRF validation for /api/login endpoint
- hot-dev.sh: Detect encrypted files before generating new key to prevent data loss
When the GitHub API returns 403 (rate limited), Pulse now falls back
to parsing the releases.atom feed which doesn't count against API
rate limits. This ensures users can still check for updates even
when rate limited.
The feed parser:
- Extracts version tags from Atom feed entries
- Filters prereleases for stable channel users
- Returns the first matching release
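The three parser steps above could look roughly like this with encoding/xml. This is a sketch under assumptions, not the actual Pulse parser: the type and function names are hypothetical, the tag is assumed to be the last path segment of each entry's link, a "-" in the tag is assumed to mark a prerelease, and the sample feed (URLs and versions) is made up.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Minimal Atom shapes; only the fields this sketch needs.
type atomFeedSketch struct {
	Entries []atomEntrySketch `xml:"entry"`
}

type atomEntrySketch struct {
	Title string `xml:"title"`
	Link  struct {
		Href string `xml:"href,attr"`
	} `xml:"link"`
}

// Made-up sample feed for illustration only.
const sampleFeed = `<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>v5.1.0-rc.1</title>
    <link rel="alternate" href="https://example.invalid/owner/repo/releases/tag/v5.1.0-rc.1"/>
  </entry>
  <entry>
    <title>v5.0.3</title>
    <link rel="alternate" href="https://example.invalid/owner/repo/releases/tag/v5.0.3"/>
  </entry>
</feed>`

// latestFromFeedSketch returns the first entry's tag that fits the channel.
func latestFromFeedSketch(data []byte, allowPrerelease bool) (string, error) {
	var feed atomFeedSketch
	if err := xml.Unmarshal(data, &feed); err != nil {
		return "", err
	}
	for _, e := range feed.Entries {
		href := e.Link.Href
		tag := href[strings.LastIndex(href, "/")+1:]
		if tag == "" {
			continue
		}
		// Assumption: prerelease tags carry a "-" suffix such as "-rc.1".
		if !allowPrerelease && strings.Contains(tag, "-") {
			continue
		}
		return tag, nil
	}
	return "", fmt.Errorf("no matching release in feed")
}

func main() {
	stable, err := latestFromFeedSketch([]byte(sampleFeed), false)
	fmt.Println(stable, err)
}
```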
Fixes #840
When offline_access scope is configured, Pulse now stores and uses
OIDC refresh tokens to automatically extend sessions. Sessions remain
valid as long as the IdP allows token refresh (typically 30-90 days).
Changes:
- Store OIDC tokens (refresh token, expiry, issuer) alongside sessions
- Automatically refresh tokens when access token nears expiry
- Invalidate session if IdP revokes access (forces re-login)
- Add background token refresh with concurrency protection
- Persist OIDC tokens across restarts
Related to #854
Header and action buttons now stack vertically on narrow screens
instead of overflowing. Button labels are shortened on mobile.
Related to discussion #845 (feedback from @MDE186)
When the user logged out, the code would immediately set needsAuth=true
and return WITHOUT first fetching /api/security/status. This meant the
securityStatus signal was null, causing shouldShowLocalLogin() in Login.tsx
to return true (since !undefined === true).
Now we always fetch security status before showing the login form, even
in the just_logged_out path. This ensures hideLocalLogin, oidcEnabled,
and other OIDC settings are properly available to the Login component.
When 'Hide local login form' was toggled in Settings, the change
was saved to disk but not applied to the in-memory config until
restart. Now reloadSystemSettings() also updates config.HideLocalLogin
so the setting takes effect immediately.