Bug Fixes:
- Fix boolean fields with 'omitempty' not persisting false values
- AlertTriggeredAnalysis, PatrolAnalyzeNodes/Guests/Docker/Storage
- omitempty causes Go to skip false (zero value) when marshaling JSON
- On reload, NewDefaultAIConfig() sets true, and missing field stays true
- Fix model dropdown losing selection after save (SolidJS reactivity issue)
- Added explicit 'selected' attribute to option elements
- Ensures browser maintains selection with optgroups during re-renders
Improvements:
- Change patrol type label from 'Quick' to 'Patrol' in history table
- Add chat_model and patrol_model to AI settings update log
- Add alert_triggered_analysis to AI config load log for debugging
Runbooks were a half-built feature that provided no value:
- Only 3 runbooks existed
- AI dynamic remediation already covers the same ground
- Added UI complexity without benefit
Removed:
- runbooks.go and runbooks_test.go
- Handler functions in ai_handlers.go
- Routes in router.go
- Test cases in ai_handlers_test.go
- Auto-fix call in patrol.go
Kept (dead code but harmless):
- Frontend types/API calls (will 404)
- RecordIncidentRunbook function (unused)
Less code = easier to maintain.
Critical fixes to show only actionable insights:
1. Skip stopped VMs/containers from anomaly detection
- '0.0x baseline' for stopped resources is expected, not an anomaly
- Only check anomalies for status='running'
2. Filter correlations by confidence (>=70%)
- Low confidence correlations are likely coincidental
- Only show high-confidence, actionable dependencies
This reduces noise and surfaces genuinely useful intelligence.
Free Features (no license required):
- Anomaly detection - removed license gating, purely statistical analysis
- Learning status endpoint - GET /api/ai/intelligence/learning
Learning Status Response:
- resources_baselined: count of resources with learned baselines
- total_metrics: total metric baselines (cpu + memory + disk)
- metric_breakdown: {cpu: X, memory: Y, disk: Z}
- status: 'waiting' | 'learning' | 'active'
- message: human-readable description
This makes the AI intelligence features visible to all users,
encouraging upgrades for the full LLM-powered patrol experience.
Add /api/ai/intelligence/anomalies endpoint that compares live metrics
against learned baselines to surface deviations - all deterministic
(no LLM required).
Backend:
- Add AnomalyReport struct with severity classification
- Add CheckResourceAnomalies method to baseline store
- Add HandleGetAnomalies API handler
- Add GetStateProvider getter to AI service
Frontend:
- Add AnomalyReport and AnomaliesResponse types
- Add getAnomalies API function
- Add AnomalySeverity type
This is the first step toward surfacing deterministic intelligence
directly in the UI without requiring LLM interaction.
- Create Intelligence struct that aggregates all AI subsystems
- Add /api/ai/intelligence endpoint for system-wide and per-resource insights
- Wire Intelligence into PatrolService as a facade (not replacement)
- Add TypeScript types and API client for frontend
- Add unit tests for Intelligence orchestrator
- Fix pre-existing test failures using diagnostic commands instead of actionable ones
The Intelligence orchestrator provides:
- System-wide health scoring (A-F grades)
- Aggregated findings, predictions, correlations
- Per-resource context generation for AI prompts
- Learning progress tracking
This unifies access to AI subsystems without replacing existing code paths.
Backend:
- Enhanced buildEnrichedResourceContext to ALWAYS show learned baselines with
status indicators (normal/elevated/anomaly) instead of only when anomalous
- This makes Pulse Pro's 'moat' visible - users can see the AI understands
their infrastructure's normal behavior patterns
- Added baseline import to service.go
Frontend (user changes):
- Added incident event type filtering with toggle buttons
- Added resource incident panel to view all incidents for a resource
- Added timeline expand/collapse functionality in alert history
- Added incident note saving with proper incidentId tracking
- Added startedAt parameter for proper incident timeline loading
- Login.tsx: Use apiClient.fetch with skipAuth to avoid auth loops
- router.go: Skip CSRF validation for /api/login endpoint
- hot-dev.sh: Detect encrypted files before generating new key to prevent data loss
When offline_access scope is configured, Pulse now stores and uses
OIDC refresh tokens to automatically extend sessions. Sessions remain
valid as long as the IdP allows token refresh (typically 30-90 days).
Changes:
- Store OIDC tokens (refresh token, expiry, issuer) alongside sessions
- Automatically refresh tokens when access token nears expiry
- Invalidate session if IdP revokes access (forces re-login)
- Add background token refresh with concurrency protection
- Persist OIDC tokens across restarts
Related to #854
When 'Hide local login form' was toggled in Settings, the change
was saved to disk but not applied to the in-memory config until
restart. Now reloadSystemSettings() also updates config.HideLocalLogin
so the setting takes effect immediately.
- Add HandleLicenseFeatures handler that was missing from license_handlers.go
- Add /api/license/features route to router
- Update AI service and metadata provider
- Update frontend license API and components
- Fix CI build failure caused by tests referencing unimplemented method
The 'Removed Docker Hosts' section was not appearing in Settings -> Agents
even when hosts were blocked from re-enrolling. This prevented users from
using the 'Allow re-enroll' button to unblock their Docker agents.
Root cause: The WebSocket store was missing:
1. The 'removedDockerHosts' property in its initial state
2. A handler to process removedDockerHosts data from WebSocket messages
This meant the backend was correctly sending the data, but the frontend
was completely ignoring it.
Changes:
- Add removedDockerHosts to WebSocket store initial state and message handler
- Add removedDockerHosts to App.tsx fallback state for consistency
- Add missing BroadcastState call after AllowDockerHostReenroll succeeds
Also includes previous fixes from this session:
- Add PULSE_AGENT_URL as alias for PULSE_AGENT_CONNECT_URL (config.go)
- Add runtime Docker/Podman auto-detection in pulse-agent (main.go)
Fixes issue reported by darthrater78 in discussion #845
- Add AgentConnectURL config option to override public URL for agents
- Improve install.sh to diagnose docker detection failures
- Update router to prioritize AgentConnectURL for agent install commands
The /ws endpoint was rate limited to 30 connections/minute. After
prolonged use with WebSocket reconnections (network hiccups, browser
tab throttling, etc.), users with many Docker containers would hit
this limit and get stuck with a 'Connecting...' UI.
WebSocket connections are already authenticated via session/API token
and reconnections are normal behavior, so rate limiting is not needed.
Fixes#859 (second report about WebSocket rate limiting after hours of use).
Fixes issue where /api/security/status reports hasHTTPS=false when accessed
via HTTPS through a reverse proxy like Caddy.
Resolves feedback from discussion #845 (clar2242).
- Create reusable UrlEditPopover component with fixed positioning
- Add createUrlEditState hook for managing editing state
- Update DockerHostSummaryTable to use new popover
- Update DockerUnifiedTable (containers & services) to use new popover
- Update GuestRow (Proxmox VMs/containers) to use new popover
- Update HostsOverview (Proxmox hosts) to use new popover
- Add Docker host metadata API for custom URLs
- Consistent styling with save, delete, cancel buttons and keyboard shortcuts
Fixes#858
The patrol interval setting was not being properly applied due to:
1. ReconfigurePatrol() was setting the deprecated QuickCheckInterval field
instead of the preferred Interval field
2. SetConfig() was comparing raw field values instead of using GetInterval()
to compare effective intervals, causing change detection to fail
3. The API response was missing interval_ms, preventing the frontend from
displaying the correct interval
Changes:
- Update StartPatrol() and ReconfigurePatrol() to use the Interval field
- Fix SetConfig() to use GetInterval() for interval comparison
- Add IntervalMs to PatrolStatusResponse and include it in the API response
Adds IncludeAllDeployments option to show all deployments, not just
problem ones (where replicas don't match desired). This provides parity
with the existing --kube-include-all-pods flag.
- Add IncludeAllDeployments to kubernetesagent.Config
- Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var
- Update collectDeployments to respect the new flag
- Add test for IncludeAllDeployments functionality
- Update UNIFIED_AGENT.md documentation
Addresses feedback from PR #855
The issue was a SolidJS reactivity problem in the Dashboard component.
When guestMetadata signal was accessed inside a For loop callback and
assigned to a plain variable, SolidJS lost reactive tracking.
Changed from:
const metadata = guestMetadata()[guestId] || ...
customUrl={metadata?.customUrl}
To:
const getMetadata = () => guestMetadata()[guestId] || ...
customUrl={getMetadata()?.customUrl}
This ensures SolidJS properly tracks the signal dependency when the
getter function is called directly in JSX props.
When a specific architecture is requested (e.g., linux-arm64), don't fall
back to the generic pulse-agent binary if the requested arch isn't found.
This was causing ARM64 machines to receive x86-64 binaries that can't run.
Now returns 404 with helpful error message if requested architecture
binary is not available.
Reverts overly strict alert ID validation that was rejecting valid
alert IDs containing special characters. Docker host IDs can contain
user-supplied data like hostnames which may include parentheses,
brackets, or other printable ASCII characters.
The previous validation only allowed alphanumeric + limited punctuation,
which caused 400 errors when acknowledging alerts from Docker hosts
with special characters in their identifiers.
Related to #852
Previously the Retry-After header was hardcoded to "60" seconds
regardless of the rate limiter's actual window duration. Now uses
the limiter's configured window (e.g., 600 seconds for recovery
endpoints, 300 for exports).
Related to #579
- Replace verbose info banner with streamlined layout
- Add collapsible 'Advanced Model Selection' accordion for Chat/Patrol models
- Make AI Patrol Settings section collapsible with inline summary badges
- Compact Cost Controls into single-row inline layout
- Reduce form spacing for tighter presentation
- Remove unused formHelpText import
Also includes:
- OpenAI provider fixes for max_tokens parameters
- Security setup CSRF and 401 fixes
- Minor UI tweaks
- Add setup modal that appears when enabling AI without configured provider
- Modal allows selecting provider (Anthropic, OpenAI, DeepSeek, Ollama)
- Enter API key/URL and enable AI in one smooth flow
- Reorder backend to apply API keys before enabled check
- Fix Ollama to strip 'ollama:' prefix from model names
- Simplify backend error message for unconfigured providers
The enable validation was using the legacy single-provider model which
checked settings.Provider and settings.APIKey. Users configuring Ollama
via the new multi-provider UI (setting ollama_base_url) couldn't enable
AI because settings.Provider defaulted to "anthropic" which required an
API key.
Now checks GetConfiguredProviders() first - if any provider is configured
(Anthropic, OpenAI, DeepSeek, or Ollama), AI can be enabled.
Related to #847
- Add cluster-aware guest ID generation (clusterName-VMID instead of instanceName-VMID)
to prevent duplicate VMs/containers when multiple cluster nodes are monitored
- Add cluster deduplication at registration time - when a node is added that belongs
to an already-configured cluster, merge as endpoint instead of creating duplicate
- Add startup consolidation to automatically merge duplicate cluster instances
- Change host agent token binding from agent GUID to hostname, allowing:
- Multiple host agents to share a token (each bound by hostname)
- Agent reinstalls on same host without token conflicts
- Remove 12-character password minimum requirement
- Remove emoji from auto-registration success message
- Fix grouped view node lookup to support both cluster-aware node IDs
(clusterName-nodeName) and legacy guest grouping keys (instance-nodeName)
Fixes duplicate guests appearing when agents are installed on multiple
cluster nodes. Also improves multi-agent UX by allowing shared tokens.
When no auth is configured (fresh install), CheckAuth allows all requests.
This creates a race condition where existing agents from a previous setup
can report data before the wizard completes security configuration.
This fix clears all host agents and docker hosts when /api/security/quick-setup
is called, ensuring the wizard shows a clean state after security is configured.
Added:
- State.ClearAllHosts() - removes all host agents
- State.ClearAllDockerHosts() - removes all docker hosts
- Monitor.ClearUnauthenticatedAgents() - clears both and resets token bindings
- Call to ClearUnauthenticatedAgents() in handleQuickSecuritySetupFixed()
- Add GET /api/metrics-store/history endpoint for querying SQLite-backed metrics
- Support flexible time ranges: 1h, 6h, 12h, 24h, 7d, 30d, 90d
- Return aggregated data with min/max values for longer time ranges
- Add TypeScript types and ChartsAPI.getMetricsHistory() client method
This enables frontend charts to visualize long-term trends using the
tiered retention system (raw → minute → hourly → daily averages).
- Add DOMPurify sanitization for AI chat markdown rendering (XSS fix)
- Configure DOMPurify to add target=_blank and rel=noopener to links
- Update system prompt to align with command approval policy
- Clarify safe vs destructive commands in prompt
- Improve patrol auto-fix mode guidance with safe operation list
- Add verification requirements for auto-fix actions
- Update observe-only mode to be clearer about read-only restrictions
Add configurable model specifically for automatic remediation actions:
Backend (internal/config/ai.go):
- Add AutoFixModel field to AIConfig
- Add GetAutoFixModel() getter with fallback chain:
AutoFixModel -> PatrolModel -> Model
Frontend (AISettings.tsx, types/ai.ts):
- Add auto_fix_model to AISettings types
- Add Auto-Fix Model dropdown (only shows when patrol_auto_fix enabled)
- Falls back to patrol model if not set
API (ai_handlers.go):
- Add auto_fix_model to response and update request
- Handle saving/loading the new field
Rationale:
- Auto-fix takes real actions, may warrant a more capable model
- Patrol observation can use cheaper models for cost savings
- Gives users granular control over model costs vs reliability
- Model hierarchy: Chat > AutoFix > Patrol > Default
Create internal/ai/correlation package:
1. Correlation Detector (detector.go):
- Tracks events across resources
- Detects when events on one resource follow events on another
- Calculates average delay between correlated events
- Confidence scoring based on occurrence count
- Persists to ai_correlations.json
2. Features:
- GetCorrelations() - All detected relationships
- GetCorrelationsForResource() - Relationships for one resource
- GetDependencies() - What resources depend on this one
- GetDependsOn() - What this resource depends on
- PredictCascade() - Predict what will be affected
- FormatForContext() - AI-consumable summary
3. Integration:
- Wire to alert history in router startup
- Map alert types to correlation event types
- Add correlation context to enriched AI context
Example AI context now includes:
'When local-zfs experiences high usage, database often follows within 5 minutes'
This enables the AI to understand infrastructure dependencies
and predict cascade failures.
All tests passing.
Connect alert system to failure prediction:
1. Add AlertCallback to HistoryManager:
- OnAlert() method to register callbacks
- Callbacks invoked when alerts are added
- Called outside lock to prevent deadlocks
2. Expose OnAlertHistory() on alerts.Manager:
- Pass-through to HistoryManager.OnAlert()
- Enables external systems to track alerts
3. Wire pattern detector in router startup:
- Register callback when pattern detector is created
- Convert alert types to trackable events
- Pattern detector now learns from production alerts
Now every alert (memory_warning, cpu_critical, etc.) is recorded as
a historical event for pattern analysis. The AI can predict:
'High memory usage typically occurs every ~3 days (next expected in ~1 day)'
All tests passing.
Create internal/ai/patterns package:
1. Pattern Detector (detector.go):
- Records historical events (high memory, OOM, restarts, etc.)
- Detects recurring failure patterns
- Calculates average interval between occurrences
- Computes confidence based on pattern consistency
- Predicts when failures will occur again
- Persists to ai_patterns.json
2. Event types tracked:
- high_memory, high_cpu, disk_full
- oom, restart, unresponsive
- backup_failed
3. Integration:
- Wire PatternDetector into router startup
- Add to AI context in buildEnrichedContext
- FormatForContext generates failure predictions
Example AI context now includes:
'OOM events typically occurs every ~10 days (next expected in ~3 days)'
This enables proactive alerts before problems recur.
All tests passing.
Complete Phase 3 integration:
- Initialize ChangeDetector and RemediationLog in StartPatrol
- Add SetChangeDetector/SetRemediationLog to handler chain:
Router -> AISettingsHandler -> Service -> PatrolService
- Persist change history to ai_changes.json
- Persist remediation log to ai_remediations.json
- Both use the Pulse config directory for storage
Operational memory is now fully integrated:
- Change detector tracks infrastructure changes on each patrol
- Recent changes (24h) are appended to AI context
- Remediation log ready for command execution logging
All tests passing.
Complete Phase 2 baseline integration:
- Add baseline_exports.go for clean type aliasing
- Wire baseline store initialization into StartPatrol
- Implement startBaselineLearning background loop
- Runs initial learning after 5 min delay
- Updates baselines every hour from metrics history
- Learns from 7 days of data for nodes, VMs, containers
- Add SetBaselineStore methods throughout the chain
(Router -> AIHandler -> Service -> PatrolService)
- Persists baselines to data directory as JSON
The baseline learning loop:
1. Starts automatically when AI patrol starts
2. Queries metrics history for all resources
3. Computes mean, stddev, percentiles for cpu/memory/disk
4. Saves baselines to disk for durability
5. Anomaly detection uses these baselines in context builder
All tests passing.