vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-25 22:57:22 +00:00

Author	SHA1	Message	Date
rcourtman	f67e7b3e97	fix: clean up debug logging and fix flaky encryption test 1. Fixed TestNewConfigPersistenceFailsWhenEncryptedDataPresentWithoutKey - Test was picking up real encryption key from /etc/pulse during migration - Now temporarily moves system key during test for proper isolation - Uses t.Cleanup to ensure key is restored even on failure 2. Cleaned up console.log statements in production code - Dashboard.tsx: replaced console.log with logger.debug for metadata events - CompleteStep.tsx: removed verbose agent detection debug logs These changes reduce log noise in production while maintaining debug capability in development mode.	2025-12-22 14:35:48 +00:00
rcourtman	28ac86c8ab	fix: reduce WebSocket reconnection log noise in host agent Addresses #866 - agents were logging 'WebSocket connection failed' warnings even during normal reconnection scenarios (server restart, network blip, etc). Changes: - Normal close errors (1000, 1001, connection reset) now log at Debug level - Only log Warning after 3+ consecutive failures - Changed 'Connecting to Pulse' from Info to Debug to reduce noise - Successful connections still log at Info level The WebSocket is only used for AI command execution, not metrics, so transient disconnections don't affect monitoring functionality.	2025-12-22 14:11:23 +00:00
rcourtman	59a4843f20	fix: persist finding dismissal state across restarts User feedback fields (DismissedReason, UserNote, TimesRaised, Suppressed, Source) were not being saved to disk, causing 'expected behavior' dismissals to be lost after Pulse restarted. - Add missing fields to AIFindingRecord in persistence.go - Update FindingsPersistenceAdapter to save/load these fields - Add comprehensive tests for dismissal persistence round-trip Fixes issue where Frigate storage warning kept reappearing despite being marked as expected behavior.	2025-12-22 11:18:43 +00:00
rcourtman	2c95961feb	fix: Add missing guest filtering props to ThresholdsTable tests	2025-12-22 10:26:53 +00:00
rcourtman	0837c46f5a	test: Add unit tests for guest tag filtering	2025-12-22 10:24:39 +00:00
rcourtman	71d0401c80	feat: Add guest filtering by tag and name prefix via Alert Configuration. Resolves #863	2025-12-22 10:03:12 +00:00
rcourtman	c9fc827f4c	fix: Prevent buffering and log actionable error for host agent 403s. Related to discussion #845	2025-12-22 09:51:27 +00:00
rcourtman	fc300de328	fix: Prevent ignored container inputs from removing trailing newlines. Related to #865	2025-12-22 09:48:40 +00:00
rcourtman	a0580db62d	fix: exclude watch from patrol status summary Since watch/info findings are filtered from the UI and never shown to users, don't include them in the patrol run status summary. This makes the summary consistent with what users actually see.	2025-12-21 23:31:21 +00:00
rcourtman	78c3434061	fix: include VMID in AI context to prevent incorrect references The LLM was confusing VMIDs because they weren't included in the context. Now the formatted context shows: ### Container: ollama (VMID 200) on minipc This prevents the AI from referencing the wrong VMID when generating findings and recommendations.	2025-12-21 23:13:47 +00:00
rcourtman	8ee315eee4	perf: skip initial patrol if one ran recently When the service restarts, it now checks if a patrol ran within the last hour. If so, it skips the initial patrol to avoid wasting API tokens during development/maintenance when the service is restarted frequently. The scheduled patrol runs (every 6 hours) are not affected.	2025-12-21 23:03:41 +00:00
rcourtman	9c58bfa127	perf: reduce MetricSamples from 100 to 24 points 100 samples was causing 326k+ input tokens which is expensive. 24 samples (hourly resolution) still provides good pattern visibility while significantly reducing token cost. Estimated reduction: ~75% fewer metric tokens.	2025-12-21 22:56:19 +00:00
rcourtman	c85814345d	feat: surface AI patrol errors as findings When AI patrol fails due to API issues like insufficient balance, invalid API key, or rate limiting, we now create a finding that appears in the AI Insights tab. This makes the issue visible to users rather than hidden in logs. The finding includes: - Clear description of the issue (e.g., 'Insufficient API credits') - Recommendation for how to fix it - Evidence showing the actual error message	2025-12-21 22:45:29 +00:00
rcourtman	901f04a7b2	fix: don't show 'All healthy' when patrol run had errors When a patrol run encounters errors (e.g., LLM call failed), don't display 'All healthy' in the summary as that's misleading - the analysis didn't complete properly. Now shows 'Analysis incomplete (N errors)' instead, which correctly explains why the status badge shows red/error.	2025-12-21 22:35:39 +00:00
rcourtman	c15f260280	feat: increase MetricSamples to 100 points (~15 min resolution) Modern LLMs have 100k+ token contexts. 100 samples over 24h gives ~15 minute resolution while adding minimal token overhead. This lets the LLM see fine-grained patterns, short spikes, and accurately distinguish anomalies from normal behavior.	2025-12-21 22:25:54 +00:00
rcourtman	d23f1c78de	fix: increase MetricSamples to 24 points for hourly resolution 12 samples was too coarse (2-hour intervals could miss spikes). 24 samples gives ~hourly resolution while still being compact.	2025-12-21 22:24:02 +00:00
rcourtman	5877ce00c3	fix: use 24h window for MetricSamples (matches in-memory retention) The in-memory MetricsHistory only retains 24 hours of data, not 7 days. Changed computeGuestMetricSamples to use trendWindow24h instead of trendWindow7d, and reduced sample count from 24 to 12 points. This ensures the LLM actually receives metric samples in the context, which wasn't happening before because the 7-day query returned empty data.	2025-12-21 22:19:40 +00:00
rcourtman	f6b1414ed6	debug: add logging to verify MetricSamples population for LLM context	2025-12-21 22:14:54 +00:00
rcourtman	b1baab4c63	feat: add 'Alert' badge to findings triggered by alert-triggered analysis Shows a purple '⚡ Alert' badge on findings that were discovered through alert-triggered analysis rather than scheduled patrol runs. This gives users visibility into how findings were discovered without cluttering the patrol run history table.	2025-12-21 22:04:32 +00:00
rcourtman	4e893117cd	fix: correct patrol interval logging The log was showing QuickCheckInterval (deprecated, always 0) instead of the actual Interval field. This caused confusing 'interval: 0' logs.	2025-12-21 21:52:57 +00:00
rcourtman	07c5880b0a	fix: AI settings persistence and UI improvements Bug Fixes: - Fix boolean fields with 'omitempty' not persisting false values - AlertTriggeredAnalysis, PatrolAnalyzeNodes/Guests/Docker/Storage - omitempty causes Go to skip false (zero value) when marshaling JSON - On reload, NewDefaultAIConfig() sets true, and missing field stays true - Fix model dropdown losing selection after save (SolidJS reactivity issue) - Added explicit 'selected' attribute to option elements - Ensures browser maintains selection with optgroups during re-renders Improvements: - Change patrol type label from 'Quick' to 'Patrol' in history table - Add chat_model and patrol_model to AI settings update log - Add alert_triggered_analysis to AI config load log for debugging	2025-12-21 21:48:09 +00:00
rcourtman	2928fad643	feat(ai): pass raw metric samples to LLM for pattern interpretation Instead of relying on pre-computed trend heuristics (which can be misleading for edge cases like step changes vs continuous growth), we now pass downsampled raw data points to the LLM so it can interpret patterns directly. Changes: - Add MetricSamples field to ResourceContext - Add DownsampleMetrics() to reduce data points for LLM consumption - Add formatMetricSamples() to format data compactly (e.g., 'Disk: 26→26→31%') - Add computeGuestMetricSamples() to gather 7-day sampled history - Populate MetricSamples for VMs and containers during context build - Add History section to formatted context output The LLM now sees actual patterns like 'stable for 6 days then jumped' rather than just '45.8%/day growth rate' - allowing for much more nuanced interpretation. This approach: - Leverages LLM's pattern recognition instead of hard-coded heuristics - Provides 7 days of data (~24 samples) for context on normal behavior - Uses minimal tokens due to compact formatting with deduplication - Is more future-proof as LLMs improve Example output: History (7d sampled, oldest→newest): Disk: 26→26→26→26→26→31% Refs: Frigate disk usage false positive investigation	2025-12-21 21:09:24 +00:00
rcourtman	e604e4bb8a	Add more AI test coverage - findings_test.go: Add edge case tests for Acknowledge, Dismiss, SetUserNote, Suppress, Resolve, DeleteSuppressionRule, GetSummary, GetDismissedForContext (+20 tests) - intelligence_test.go: Add tests for calculateResourceHealth with anomalies/predictions/notes, FormatContext with various subsystems, generateHealthPrediction, GetSummary with patterns/learning (+17 tests) Coverage improvements: - internal/ai: 63.1% -> 64.4% - Overall AI module coverage now averages >80%	2025-12-21 20:31:24 +00:00
rcourtman	185f1ef682	Improve AI test coverage - baseline/store_test.go: Add tests for CheckResourceAnomalies, formatAnomalyDescription, formatRatio, GetAllAnomalies, floatToStr (67.9% -> 92.2%) - memory/incidents_test.go: Add tests for RecordAlertUnacknowledged, RecordRunbook, ListIncidentsByResource, FormatForAlert, FormatForResource, FormatForPatrol (66.8% -> 81.1%) - intelligence_test.go: Add tests for SetStateProvider, FormatGlobalContext, RecordLearning, severityOrder, CheckBaselinesForResource with baselines (61.4% -> 63.1%)	2025-12-21 20:22:47 +00:00
rcourtman	92ccb35e73	test: update GetAllFindings test to match filtering behavior GetAllFindings now filters out info/watch severity findings, only returning critical and warning. Update test expectation from 3 findings to 2.	2025-12-21 19:20:27 +00:00
rcourtman	0d72690fc7	fix: remove unused runbook UI code to fix TypeScript errors - Remove unused correlations state and constants from AIOverviewTable - Remove unused runbook-related imports, state, and functions from Alerts - Add type annotation to Set() to fix type error - Removes dead code left over from runbook UI removal	2025-12-21 19:12:20 +00:00
rcourtman	8e6dc18d6f	security: allow rm on /var/tmp and /tmp with approval Updated command policy to be more nuanced: BLOCKED (hard block, never allowed): - rm -rf / (root) - rm -rf /* (root wildcard) - rm -rf /home, /etc, /usr, /var/lib, /boot, /root, /bin, /sbin, /lib, /opt REQUIRE APPROVAL (user must click 'Run'): - rm -rf /var/tmp/* (Proxmox vzdump temp files) - rm -rf /tmp/* This allows AI to suggest cleaning up vzdump temp files while still protecting against destructive operations on critical paths.	2025-12-21 18:53:08 +00:00
rcourtman	1dd1867e9d	refactor: only show critical/warning patrol findings Filter out 'watch' and 'info' severity findings from the API response. These lower-severity findings were mostly noise: - 'watch': CPU is 35% instead of 11% (who cares) - 'info': Stopped container exists (knew that) Now only showing actionable findings: - critical: Something is broken NOW - warning: Something needs attention soon Users prefer silence to noise.	2025-12-21 18:34:51 +00:00
rcourtman	c3720fe92f	refactor: remove runbook UI from frontend Removed: - Runbook button from finding action buttons - Runbook Execution panel - Fix Receipts panel (already removed) Just showing 'Get Help' and 'I Fixed It' buttons now.	2025-12-21 18:08:47 +00:00
rcourtman	0e2f58900b	refactor: remove Fix Receipts UI section Fix Receipts was showing 'No fixes logged' most of the time since: - Runbooks were removed - Remediation logging was inconsistent Just adds visual clutter without value. Removed ~100 lines of UI code.	2025-12-21 18:02:35 +00:00
rcourtman	586bf96e03	refactor: remove runbooks feature entirely Runbooks were a half-built feature that provided no value: - Only 3 runbooks existed - AI dynamic remediation already covers the same ground - Added UI complexity without benefit Removed: - runbooks.go and runbooks_test.go - Handler functions in ai_handlers.go - Routes in router.go - Test cases in ai_handlers_test.go - Auto-fix call in patrol.go Kept (dead code but harmless): - Frontend types/API calls (will 404) - RecordIncidentRunbook function (unused) Less code = easier to maintain.	2025-12-21 17:48:07 +00:00
rcourtman	cda930901b	feat(ai): make patrol prompt stricter to reduce noise Updated LLM prompt with explicit guidance on what NOT to report: - Small baseline deviations (7% vs 4% is normal variance) - Low utilization (under 50% CPU or 60% memory is fine) - Stopped containers that aren't autostart - 'Elevated' metrics still well under limits Severity guidelines made more specific: - CRITICAL: disk >95%, service down, data loss - WARNING: disk >85%, memory >90%, failures - WATCH: Only for trends projected to hit critical in <7 days - INFO: Context/observations Key message to LLM: 'Users prefer silence to noise' Only flag things that require operator action.	2025-12-21 17:35:36 +00:00
rcourtman	b90076f086	feat(ai): implement metric-specific anomaly thresholds Smarter anomaly detection to reduce false positives: Learning Window: 7 days → 14 days - Captures weekly patterns (weekday vs weekend) Metric-Specific Thresholds: CPU: - Only report if usage >70% AND >2x baseline - Low CPU variance (5% vs 10%) is not actionable Memory: - Report if >80% OR (>1.5x baseline AND >60%) - Memory is more stable, lower threshold makes sense Disk: - Report if >85% usage OR +15 percentage points growth - Disk problems are critical, use absolute thresholds Other metrics: - Use 2x threshold as default This dramatically reduces 'noise' anomalies while catching actual problems that need operator attention.	2025-12-21 17:31:30 +00:00
rcourtman	763d04821c	fix(ui): remove AI Intelligence Summary - patrol findings are sufficient The AI Intelligence Summary was adding noise rather than value: - Predictions duplicated patrol findings - Correlations were not actionable - 'Fixed' items were vague diagnostics - Status changes were startup noise The real value is in the patrol findings section which shows: - Actual issues found (critical/warning/watch/info) - Actionable recommendations - Suppression rules Keeping the patrol findings, removing the redundant summary.	2025-12-21 17:23:25 +00:00
rcourtman	3c4999ea48	fix(ai): raise anomaly threshold to 2x, filter 'Ran diagnostic' noise More aggressive noise filtering: 1. Anomaly threshold raised from 1.5x to 2x - 1.5x is too borderline to be actionable - Now requires genuinely significant deviation 2. Filter out 'Ran diagnostic' and 'Executed command' fallback items - These are generic summaries that provide no value - Only show remediations with specific, meaningful descriptions Goal: If something shows in AI Intelligence, it should demand attention.	2025-12-21 17:19:32 +00:00
rcourtman	4b25b84678	fix(ai): filter out noise from anomalies and status changes Critical changes to surface only actionable insights: 1. Anomalies now require at least 50% deviation from baseline - '1.0x baseline' values filtered out (statistically significant but not actionable) - Must be >1.5x above OR <0.5x below baseline to report 2. Status changes filter out startup noise - 'unknown → running' is just system starting, not a real state change - Backups removed from main list (they have dedicated section) 3. Only show genuinely interesting changes: - Config changes, migrations, restarts, deletions - Things that require operator attention This massively reduces noise while keeping high-signal alerts.	2025-12-21 17:15:44 +00:00
rcourtman	e8f18ff0bb	fix(ui): hide correlations - they're not actionable yet Correlations currently show 'A and B alert together' which isn't useful: - Bidirectional correlations (A→B AND B→A) are just coincidence - 'experiences alert' is too vague to be actionable - No root cause identification - just shows correlated things correlate Hidden until we can properly identify: - Root cause chains (A CAUSES B, not just 'A and B happen together') - Specific trigger types (what kind of alert?) - Direction of causality Other improvements: - Stopped VMs/containers filtered from anomaly detection - Lower noise, more signal	2025-12-21 12:46:57 +00:00
rcourtman	fdacb60969	fix(ai): filter out noise from AI intelligence display Critical fixes to show only actionable insights: 1. Skip stopped VMs/containers from anomaly detection - '0.0x baseline' for stopped resources is expected, not an anomaly - Only check anomalies for status='running' 2. Filter correlations by confidence (>=70%) - Low confidence correlations are likely coincidental - Only show high-confidence, actionable dependencies This reduces noise and surfaces genuinely useful intelligence.	2025-12-21 12:41:27 +00:00
rcourtman	c94b6c1904	fix(ui): use Promise.allSettled for resilient API loading Changed AIOverviewTable to use Promise.allSettled instead of Promise.all so that one failing endpoint (e.g., anomalies 404) doesn't break the entire component. Each API result now has a fallback for failed requests, allowing the table to gracefully degrade when endpoints are unavailable.	2025-12-21 12:31:09 +00:00
rcourtman	f8e42990b7	fix(ui): make anomalies fetch resilient to failures Separate anomalies API call from Promise.all so that a failure in the anomalies endpoint doesn't break the entire AI Overview. This fixes 'Failed to load AI overview data' error when the anomalies endpoint isn't available (e.g., patrol not started).	2025-12-21 12:27:53 +00:00
rcourtman	5931d240df	feat(ui): reduce AI Intelligence table length with 'Show more' buttons Added collapsible sections to prevent overwhelming list: - Dependencies limited to top 5 (sorted by confidence) - Actions limited to top 5 - Changes limited to top 5 - 'Show more' buttons appear at bottom when items are hidden - Clicking expands to show all items in that category This addresses user feedback about excessive scrolling when there are many dependency correlations or remediation actions.	2025-12-21 12:25:43 +00:00
rcourtman	ad61f1809b	feat(ui): show anomalies in AI Intelligence Summary table Adds real-time anomaly detection results to the AI Overview Table: - Anomalies appear at TOP of list (before predictions) since they're real-time - Severity-based color coding (critical=red, high=orange, medium=amber, low=blue) - Shows resource name, metric, and deviation ratio (e.g., 'CPU at 2.5x baseline') - Subtitle shows current vs baseline values - Timestamp shows 'Now' since anomalies are current state This integrates the FREE anomaly detection feature directly alongside the Pro patrol insights, providing immediate value to all users.	2025-12-21 11:50:54 +00:00
rcourtman	53cc9ee5a9	feat(ui): add learning status hook and enhance AI indicator visibility New useLearningStatus hook: - Polls /api/ai/intelligence/learning every 60 seconds - Provides resourceCount(), metricCount(), learningState() - Convenience accessors: isActive(), isLearning(), isWaiting() Enhanced AIStatusIndicator: - Now shows when ANY baselines exist (not just when Patrol enabled) - Tooltip shows 'X resources baselined' for transparency - Healthy state shows '45 resources baselined' - Works even without Pro license since baselines are FREE This makes the AI presence visible from the moment Pulse starts learning, providing immediate value feedback to all users.	2025-12-21 11:45:55 +00:00
rcourtman	9aa266a615	feat(ai): make anomaly detection FREE and add learning status endpoint Free Features (no license required): - Anomaly detection - removed license gating, purely statistical analysis - Learning status endpoint - GET /api/ai/intelligence/learning Learning Status Response: - resources_baselined: count of resources with learned baselines - total_metrics: total metric baselines (cpu + memory + disk) - metric_breakdown: {cpu: X, memory: Y, disk: Z} - status: 'waiting' | 'learning' | 'active' - message: human-readable description This makes the AI intelligence features visible to all users, encouraging upgrades for the full LLM-powered patrol experience.	2025-12-21 11:36:54 +00:00
rcourtman	b752773924	test(baseline): add tests for trend prediction Add comprehensive tests for CalculateTrend function: - TestCalculateTrend_InsufficientData: <5 samples returns nil - TestCalculateTrend_IncreasingTrend: detects critical/warning trends - TestCalculateTrend_DecreasingTrend: correctly identifies declining usage - TestCalculateTrend_StableTrend: stable patterns return DaysToFull=-1 - TestFormatDays: human-readable time formatting	2025-12-21 11:31:58 +00:00
rcourtman	6bb46eeb34	feat(ai): enhance intelligence status and add trend prediction AIStatusIndicator: - Now shows BOTH patrol findings AND baseline anomalies - Displays even when only anomaly detection is active (no patrol) - Badge count includes both findings + anomalies - Tooltip provides detailed breakdown by severity Trend Prediction (backend): - Add TrendPrediction struct for resource exhaustion forecasting - CalculateTrend() uses linear regression on sample history - Predicts days until resource is full (or if declining/stable) - Severity: critical (<7 days), warning (<30 days), info (>30 days) - Human-readable descriptions like 'full in ~2 weeks (+0.5% per day)' This creates a more cohesive intelligence experience where anomaly detection works independently of the pro/patrol features, making value visible immediately to all users.	2025-12-21 11:29:44 +00:00
rcourtman	ee15a5626d	feat(ui): wire memory and disk anomaly indicators Complete the anomaly indicator integration for all three metrics: - CPU: EnhancedCPUBar (already done) - Memory: StackedMemoryBar (new) - Disk: StackedDiskBar (new) All three metric bars now show a pulsing indicator (e.g., '2.5x↑') when the current value is significantly above the learned baseline. Severity colors: - Critical (>4σ): red - High (3-4σ): orange - Medium (2.5-3σ): yellow - Low (2-2.5σ): blue This is 100% deterministic - no LLM involved. The indicators appear automatically based on statistical deviation from learned baselines.	2025-12-21 11:24:42 +00:00
rcourtman	e374164c8c	feat(ui): wire CPU anomaly indicator to dashboard Connect anomaly data to the EnhancedCPUBar component in GuestRow. When a VM/container's CPU is significantly above its learned baseline, a pulsing indicator (e.g., '2.5x') appears directly on the CPU bar. This provides real-time baseline deviation feedback without any LLM involvement - purely deterministic statistical analysis. Memory and disk anomaly hooks are prepared but not yet wired to their respective bar components (TODO for follow-up).	2025-12-21 11:06:23 +00:00
rcourtman	869a88a800	feat(ui): add anomaly indicator components and hooks Add frontend infrastructure for displaying baseline anomalies: - useAnomalies hook for fetching and caching anomaly data - AnomalyCell component for displaying multiple anomalies - AnomalyIndicator/AnomalyBadge components for inline display - Update EnhancedCPUBar to accept optional anomaly prop The anomaly endpoint is polled every 30 seconds and cached. Anomaly badges show severity (color) and deviation ratio (e.g., '2.5x'). This prepares the UI for displaying real-time baseline deviations without requiring LLM interaction.	2025-12-21 11:04:18 +00:00
rcourtman	d9f1f7accd	feat(ai): add real-time anomaly detection endpoint Add /api/ai/intelligence/anomalies endpoint that compares live metrics against learned baselines to surface deviations - all deterministic (no LLM required). Backend: - Add AnomalyReport struct with severity classification - Add CheckResourceAnomalies method to baseline store - Add HandleGetAnomalies API handler - Add GetStateProvider getter to AI service Frontend: - Add AnomalyReport and AnomaliesResponse types - Add getAnomalies API function - Add AnomalySeverity type This is the first step toward surfacing deterministic intelligence directly in the UI without requiring LLM interaction.	2025-12-21 10:52:54 +00:00
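
Commit 07c5880b0a above describes boolean fields tagged 'omitempty' failing to persist false values: the field is skipped on save, so on reload the default of true survives. The sketch below reproduces that behaviour with Go's standard encoding/json package. The struct names, helper functions, and the JSON key are hypothetical and chosen for illustration; only the field name AlertTriggeredAnalysis comes from the commit message, and this is not the project's actual config code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buggyConfig is a hypothetical stand-in for a config struct whose
// boolean field carries omitempty (the situation the commit describes).
type buggyConfig struct {
	AlertTriggeredAnalysis bool `json:"alert_triggered_analysis,omitempty"`
}

// fixedConfig drops omitempty so that false is written explicitly.
type fixedConfig struct {
	AlertTriggeredAnalysis bool `json:"alert_triggered_analysis"`
}

// reloadBuggy marshals the value, then unmarshals it over a struct
// pre-filled with the default (true), mimicking a save/restart cycle.
func reloadBuggy(value bool) (string, bool) {
	data, err := json.Marshal(buggyConfig{AlertTriggeredAnalysis: value})
	if err != nil {
		panic(err)
	}
	loaded := buggyConfig{AlertTriggeredAnalysis: true} // assumed default
	if err := json.Unmarshal(data, &loaded); err != nil {
		panic(err)
	}
	return string(data), loaded.AlertTriggeredAnalysis
}

// reloadFixed performs the same cycle without omitempty.
func reloadFixed(value bool) (string, bool) {
	data, err := json.Marshal(fixedConfig{AlertTriggeredAnalysis: value})
	if err != nil {
		panic(err)
	}
	loaded := fixedConfig{AlertTriggeredAnalysis: true} // assumed default
	if err := json.Unmarshal(data, &loaded); err != nil {
		panic(err)
	}
	return string(data), loaded.AlertTriggeredAnalysis
}

func main() {
	saved, got := reloadBuggy(false)
	fmt.Printf("omitempty:    saved %s, reloaded %v\n", saved, got)

	saved, got = reloadFixed(false)
	fmt.Printf("no omitempty: saved %s, reloaded %v\n", saved, got)
}
```

With omitempty the saved document is an empty object, so the user's false never reaches disk and the reloaded value is true. Without the tag option the false is written out and survives the round trip.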

4826 commits