173 commits

Author SHA1 Message Date
rcourtman
b3285c05c8 Consolidate pending changes
- Add Docker metadata test comment
- Update alerts configuration and thresholds
- Enhance config file watcher
- Update documentation
- Refine settings UI
2025-10-28 23:20:44 +00:00
rcourtman
99b11760ac Implement Docker metadata API endpoints
Add backend support for storing and managing Docker resource metadata:

- Create DockerMetadataStore for managing Docker container/service metadata
- Implement DockerMetadataHandler with GET/PUT/DELETE operations
- Register /api/docker/metadata routes with proper authentication
- Store metadata in docker_metadata.json file
- Validate custom URLs (http/https scheme, valid host)
- Support resource IDs in the format {hostId}:container:{containerId}

Enables the frontend Docker URL editing feature to persist data.
2025-10-28 22:56:53 +00:00
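The custom URL validation listed in the commit above (http/https scheme, valid host) could look roughly like the sketch below. It is an illustration only: `validateCustomURL` is a hypothetical name, not a function confirmed to exist in the Pulse codebase, and the real handler may apply further checks.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateCustomURL is a hypothetical helper: it accepts only absolute
// http/https URLs that carry a non-empty host.
func validateCustomURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("scheme must be http or https")
	}
	if u.Hostname() == "" {
		return errors.New("url must include a host")
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"https://example.com:8080/app",
		"ftp://example.com",
		"http:///path",
	} {
		fmt.Printf("%q -> %v\n", raw, validateCustomURL(raw))
	}
}
```

Checking `Hostname()` rather than `Host` means a value such as `http://:8080` (port but no host) is also rejected.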
rcourtman
f2acdd59af Normalize docker agent version handling 2025-10-28 08:42:58 +00:00
rcourtman
e07336dd9f refactor: remove legacy DISABLE_AUTH flag and enhance authentication UX
Major authentication system improvements:

- Remove deprecated DISABLE_AUTH environment variable support
- Update all documentation to remove DISABLE_AUTH references
- Add auth recovery instructions to docs (create .auth_recovery file)
- Improve first-run setup and Quick Security wizard flows
- Enhance login page with better error messaging and validation
- Refactor Docker hosts view with new unified table and tree components
- Add useDebouncedValue hook for better search performance
- Improve Settings page with better security configuration UX
- Update mock mode and development scripts for consistency
- Add ScrollableTable persistence and improved responsive design

Backend changes:
- Remove DISABLE_AUTH flag detection and handling
- Improve auth configuration validation and error messages
- Enhance security status endpoint responses
- Update router integration tests

Frontend changes:
- New Docker components: DockerUnifiedTable, DockerTree, DockerSummaryStats
- Better connection status indicator positioning
- Improved authentication state management
- Enhanced CSRF and session handling
- Better loading states and error recovery

This completes the migration away from the insecure DISABLE_AUTH pattern
toward proper authentication with recovery mechanisms.
2025-10-27 19:46:51 +00:00
rcourtman
68ce8e7520 feat: finalize swarm service monitoring (#598) 2025-10-26 09:35:49 +00:00
rcourtman
8e83eaf823 Add container state filtering to Docker agent 2025-10-25 21:40:59 +00:00
rcourtman
334ed3aedc Improve setup script auth usability 2025-10-25 19:08:48 +00:00
rcourtman
5a2d808aa1 Harden setup token flow and enforce encrypted persistence 2025-10-25 16:00:37 +00:00
rcourtman
a279e6720e Add auth enforcement integration tests 2025-10-25 15:02:48 +00:00
rcourtman
7075cef326 Harden API auth and token handling 2025-10-25 14:54:03 +00:00
rcourtman
77282bd3a6 Implement Pulse tag overrides and alert clear persistence 2025-10-25 14:28:32 +00:00
rcourtman
d643dcf0bc perf: reduce polling allocations and guest metadata load 2025-10-25 13:12:47 +00:00
rcourtman
138d8facd2 Improve host agent onboarding flow 2025-10-25 09:37:29 +00:00
rcourtman
b4247fc095 feat: add server-side support for agent installation improvements
API Enhancements:
- Add SHA256 checksum endpoint for binary downloads
  - Computes checksum on-the-fly when .sha256 suffix is requested
  - Example: /download/pulse-host-agent?platform=linux&arch=amd64.sha256
  - Enables installer scripts to verify binary integrity
- Add /uninstall-host-agent.sh endpoint for Linux/macOS uninstall script
- Add endpoint to public paths (no auth required)

Checksum Implementation:
- New serveChecksum() function computes SHA256 hash using crypto/sha256
- Returns plain text checksum in hex format
- Supports all binary download endpoints
- Zero performance impact (only computed when requested)

Install Script Updates:
- Add --force/-f flag to skip all interactive prompts
  - URL/token prompts skipped with --force
  - Reinstall confirmation skipped with --force
  - Checksum mismatch still aborts (security first)
- Force mode auto-accepts updates and reinstalls
- Usage: ./install-host-agent.sh --url $URL --token $TOKEN --force

Security Notes:
- Checksum verification protects against:
  - Corrupted downloads due to network issues
  - Man-in-the-middle binary tampering
  - Storage corruption on server
- Force mode maintains security by aborting on checksum mismatch
- No bypass for security-critical validations

These improvements enable:
- Automated deployments (--force flag)
- Binary integrity verification (checksums)
- Better security posture (tamper detection)
- Standardized uninstall process (endpoint)

The /api/version endpoint already exists and returns version info
for update checks (no changes needed).
2025-10-23 22:27:02 +00:00
rcourtman
6333a445e9 feat: add native Windows service support and expandable host details
Windows Host Agent Enhancements:
- Implement native Windows service support using golang.org/x/sys/windows/svc
- Add Windows Event Log integration for troubleshooting
- Create professional PowerShell installation/uninstallation scripts
- Add process termination and retry logic to handle Windows file locking
- Register uninstall endpoint at /uninstall-host-agent.ps1

Host Agent UI Improvements:
- Add expandable drawer to Hosts page (click row to view details)
- Display system info, network interfaces, disks, and temperatures in cards
- Replace status badges with subtle colored indicators
- Remove redundant master-detail sidebar layout
- Add search filtering for hosts

Technical Details:
- service_windows.go: Windows service lifecycle management with graceful shutdown
- service_stub.go: Cross-platform compatibility for non-Windows builds
- install-host-agent.ps1: Full Windows installation with validation
- uninstall-host-agent.ps1: Clean removal with process termination and retries
- HostsOverview.tsx: Expandable row pattern matching Docker/Proxmox pages

Files Added:
- cmd/pulse-host-agent/service_windows.go
- cmd/pulse-host-agent/service_stub.go
- scripts/install-host-agent.ps1
- scripts/uninstall-host-agent.ps1
- frontend-modern/src/components/Hosts/HostsOverview.tsx
- frontend-modern/src/components/Hosts/HostsFilter.tsx

The Windows service now starts reliably with automatic restart on failure,
and the uninstall script handles file locking gracefully without requiring reboots.
2025-10-23 22:11:56 +00:00
rcourtman
5c54685f04 Add API token scopes and standalone host agent
Introduces granular permission scopes for API tokens (docker:report, docker:manage, host-agent:report, monitoring:read/write, settings:read/write) allowing tokens to be restricted to minimum required access. Legacy tokens default to full access until scopes are explicitly configured.

Adds standalone host agent for monitoring Linux, macOS, and Windows servers outside Proxmox/Docker estates. New Servers workspace in UI displays uptime, OS metadata, and capacity metrics from enrolled agents.

Includes comprehensive token management UI overhaul with scope presets, inline editing, and visual scope indicators.
2025-10-23 11:40:31 +00:00
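The scope behavior described in the commit above (tokens restricted to listed scopes, legacy tokens defaulting to full access) can be illustrated with a small sketch. The `apiToken` type and `allows` method are hypothetical; only the scope strings come from the commit message.

```go
package main

import "fmt"

// apiToken is a hypothetical stand-in for a stored API token record.
type apiToken struct {
	Name   string
	Scopes []string
}

// allows reports whether the token may perform an action needing scope.
// Assumption: a token with no scopes configured is a legacy token and
// keeps full access, as the commit message describes.
func (t apiToken) allows(scope string) bool {
	if len(t.Scopes) == 0 {
		return true
	}
	for _, s := range t.Scopes {
		if s == scope {
			return true
		}
	}
	return false
}

func main() {
	agent := apiToken{Name: "docker-agent", Scopes: []string{"docker:report"}}
	legacy := apiToken{Name: "legacy"}

	fmt.Println(agent.allows("docker:report"))   // scoped token, matching scope
	fmt.Println(agent.allows("settings:write"))  // scoped token, missing scope
	fmt.Println(legacy.allows("settings:write")) // legacy token, full access
}
```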
rcourtman
e76ab5eec0 Strip IPv6 scopes from container metadata (#596) 2025-10-23 08:55:18 +00:00
rcourtman
a885fb5472 Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
rcourtman
b95c01066e Capture dynamic LXC IP metrics (#596) 2025-10-23 07:50:45 +00:00
rcourtman
be85459db2 Add LXC config metadata for guest drawers (#596) 2025-10-23 07:30:32 +00:00
rcourtman
f4ead79c82 Ensure LXC drawers populate without metrics (#596) 2025-10-22 22:27:19 +00:00
rcourtman
aac3dacd63 Improve LXC guest metrics visibility (#596) 2025-10-22 22:24:33 +00:00
rcourtman
cdba742884 Stabilize diagnostics test VM selection 2025-10-22 19:48:56 +00:00
rcourtman
dd2beffc8c Stop legacy temperature SSH retries when auth fails (#595) 2025-10-22 19:35:51 +00:00
rcourtman
fe1533ea13 Improve PMG metric ingestion refs #551 2025-10-22 18:15:27 +00:00
rcourtman
d813f2396f Respect custom ports when discovering Proxmox clusters 2025-10-22 17:42:52 +00:00
rcourtman
20ff56aceb Add coverage for PVE memused fallback #553 2025-10-22 17:14:12 +00:00
rcourtman
3a3e0e080c Add replication monitoring plumbing and UI
Refs #395
2025-10-22 16:10:15 +00:00
rcourtman
7ae393c8ec Refine Proxmox node memory fallback (#582) 2025-10-22 15:36:26 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
cdf80dbab5 Align backup node column with current host (#577) 2025-10-22 13:47:51 +00:00
rcourtman
77108abc65 Propagate config updates to settings nodes (#588) 2025-10-22 13:45:13 +00:00
rcourtman
be26f957c0 Add snapshot size alert thresholds (#585) 2025-10-22 13:30:40 +00:00
rcourtman
30879c3b7b Handle AMD Tctl temperature readings (refs #586) 2025-10-22 12:58:34 +00:00
rcourtman
f83caf8933 Add collision-safe Docker host identifiers (#590) 2025-10-22 12:30:25 +00:00
rcourtman
bc479643e4 release: prepare v4.25.0 2025-10-22 10:46:18 +00:00
rcourtman
4eb8bed9b5 Fix initial setup caching and container discovery defaults 2025-10-22 07:34:32 +00:00
rcourtman
ff4dc49ae4 Update Pulse install flow and related components 2025-10-21 19:58:53 +00:00
rcourtman
999da6d900 feat: production-ready import/export with API tokens and transactional rollback
Export/import payload bumped to v4.1 to include API tokens alongside existing
config bundle, eliminating blind spots in disaster recovery scenarios.

## Key Features

**API Tokens in Exports (v4.1)**
- Exports now include API token metadata (ID, name, hash, prefix, suffix, timestamps)
- Export format version bumped from 4.0 to 4.1
- Fixes gap where API tokens were lost during config migrations

**Transactional Atomic Imports**
- New importTransaction helper stages all writes before committing
- On failure, automatic rollback restores original configs
- Prevents partial/corrupted imports that could break running systems
- All config writes (nodes, alerts, email, webhooks, apprise, system, OIDC, API tokens, guest metadata) now transaction-aware

**Backward Compatibility**
- Version 4.0 exports (without API tokens) still import successfully
- System logs notice but proceeds, leaving existing API tokens untouched
- No breaking changes to existing export/import workflows

## Implementation

**Files Added:**
- internal/config/import_transaction.go - Transaction helper with staging/rollback

**Files Modified:**
- internal/config/export.go - v4.1 export, transactional ImportConfig wrapper
- internal/config/persistence.go - Transaction-aware Save* methods, beginTransaction/endTransaction helpers
- internal/config/persistence_test.go - 4 comprehensive unit tests

**Testing:**
- TestExportConfigIncludesAPITokens - Verifies API tokens in v4.1 exports
- TestImportConfigTransactionalSuccess - Validates atomic import success path
- TestImportConfigRollbackOnFailure - Confirms rollback on mid-import failure
- TestImportAcceptsVersion40Bundle - Ensures backward compatibility with v4.0

All tests passing

## Migration Notes

- No manual migration required
- Users can re-export to generate v4.1 bundles with API tokens
- Existing 4.0 bundles remain valid for import
- Recommended: Re-run export after upgrade to ensure API tokens are captured

Co-authored-by: Codex (implementation)
Co-authored-by: Claude (coordination and testing)
2025-10-21 14:37:44 +00:00
rcourtman
2786afdff0 feat: comprehensive diagnostics and observability improvements
Upgrade diagnostics infrastructure from 5/10 to 8/10 production readiness
with enhanced metrics, logging, and request correlation capabilities.

**Request Correlation**
- Wire request IDs through context in middleware
- Return X-Request-ID header in all API responses
- Enable downstream log correlation across request lifecycle

**HTTP/API Metrics** (18 new Prometheus metrics)
- pulse_http_request_duration_seconds - API latency histogram
- pulse_http_requests_total - request counter by method/route/status
- pulse_http_request_errors_total - error counter by type
- Path normalization to control label cardinality

**Per-Node Poll Metrics**
- pulse_monitor_node_poll_duration_seconds - per-node timing
- pulse_monitor_node_poll_total - success/error counts per node
- pulse_monitor_node_poll_errors_total - error breakdown per node
- pulse_monitor_node_poll_last_success_timestamp - freshness tracking
- pulse_monitor_node_poll_staleness_seconds - age since last success
- Enables multi-node hotspot identification

**Scheduler Health Metrics**
- pulse_scheduler_queue_due_soon - ready queue depth
- pulse_scheduler_queue_depth - by instance type
- pulse_scheduler_queue_wait_seconds - time in queue histogram
- pulse_scheduler_dead_letter_depth - failed task tracking
- pulse_scheduler_breaker_state - circuit breaker state
- pulse_scheduler_breaker_failure_count - consecutive failures
- pulse_scheduler_breaker_retry_seconds - time until retry
- Enable alerting on DLQ spikes, breaker opens, queue backlogs

**Diagnostics Endpoint Caching**
- pulse_diagnostics_cache_hits_total - cache performance
- pulse_diagnostics_cache_misses_total - cache misses
- pulse_diagnostics_refresh_duration_seconds - probe timing
- 45-second TTL prevents thundering herd on /api/diagnostics
- Thread-safe with RWMutex
- X-Diagnostics-Cached-At header shows cache freshness

**Debug Log Performance**
- Gate high-frequency debug logs behind IsLevelEnabled() checks
- Reduces CPU waste in production when debug disabled
- Covers scheduler loops, poll cycles, API handlers

**Persistent Logging**
- File logging with automatic rotation
- LOG_FILE, LOG_MAX_SIZE, LOG_MAX_AGE, LOG_COMPRESS env vars
- MultiWriter sends logs to both stderr and file
- Gzip compression support for rotated logs

Files modified:
- internal/api/diagnostics.go (caching layer)
- internal/api/middleware.go (request IDs, HTTP metrics)
- internal/api/http_metrics.go (NEW - HTTP metric definitions)
- internal/logging/logging.go (file logging with rotation)
- internal/monitoring/metrics.go (node + scheduler metrics)
- internal/monitoring/monitor.go (instrumentation, debug gating)

Impact: Dramatically improved production troubleshooting with per-node
visibility, scheduler health metrics, persistent logs, and cached
diagnostics. Fast incident response now possible for multi-node deployments.
2025-10-21 12:37:39 +00:00
rcourtman
bd13b966d0 feat: complete API token export/import with version handling
Complete the API token export/import feature with proper version
handling and backward compatibility:

- Bump export format to version 4.1 to indicate API token support
- Import API tokens when loading v4.1 exports
- Handle version compatibility gracefully:
  - v4.1: Full support including API tokens
  - v4.0: Notice that tokens weren't included (backward compatible)
  - Other: Warning but best-effort import
- Initialize empty array instead of nil for cleaner JSON

This ensures API tokens are properly preserved when migrating or
restoring Pulse instances while maintaining backward compatibility
with older exports.
2025-10-21 11:38:23 +00:00
rcourtman
59cd456428 feat: improve request ID handling in middleware
Enhance request ID middleware to support distributed tracing:

- Honor incoming X-Request-ID headers from upstream proxies/load balancers
- Use logging.WithRequestID() for consistent ID generation across codebase
- Return X-Request-ID in response headers for client correlation
- Include request_id in panic recovery logs for debugging

This enables better request tracing across multiple Pulse instances
and integrates with standard distributed tracing practices.
2025-10-21 11:37:57 +00:00
rcourtman
cdbc6057b0 feat: export API tokens in config export
Add API tokens to the export data so they are included when
exporting/backing up configuration. This ensures API tokens are
preserved when migrating or restoring Pulse instances.

Changes:
- Add APITokens field to ExportData struct
- Load API tokens during export process
- Include tokens in exported JSON (omitempty if none exist)
2025-10-21 11:37:25 +00:00
rcourtman
ad371bf412 feat: improve alert system performance, UX, and edge case handling
Implement 5 medium/low priority improvements identified in systematic review:

UX IMPROVEMENTS:
- Notify existing critical alerts when activating from pending_review state
  Previously: critical alerts during observation window would never notify
  Now: users receive notifications for active critical alerts after activation
  Implementation: Added NotifyExistingAlert() method and logic in ActivateAlerts()

PERFORMANCE OPTIMIZATIONS:
- Replace per-alert cleanup goroutines with periodic batch cleanup
  Prevents spawning 1000s of goroutines during alert flapping
  recentlyResolved entries now cleaned up once per minute instead of 1 goroutine per alert
- Simplify GetActiveAlerts() implementation
  Removed intermediate map copy, holds lock slightly longer but operation is fast
  Cleaner code with reduced memory allocation

CONFIGURATION VALIDATION:
- Validate timezone in quiet hours configuration
  Invalid timezones now disable quiet hours with error log instead of silent fallback
  Prevents unexpected behavior when timezone is typo'd or invalid

GRACEFUL SHUTDOWN:
- Add 100ms delay in Stop() for background goroutine cleanup
  Reduces risk of state corruption during shutdown
  Allows escalation checker and periodic save to exit cleanly

Technical details:
- internal/alerts/alerts.go: Added NotifyExistingAlert(), optimized cleanup patterns
- internal/api/alerts.go: Enhanced ActivateAlerts() to notify existing critical alerts
- Removed ~20 lines of goroutine spawning code
- Added periodic cleanup for recentlyResolved map
- All changes preserve backward compatibility

Testing: Verified compilation with 'go build -o /dev/null ./...'
2025-10-21 11:05:45 +00:00
rcourtman
06b5d5153b fix: resolve critical alert system bugs preventing crashes and memory leaks
Fix 5 critical bugs identified through systematic code review:

CRITICAL FIXES (prevent service crashes):
- Add panic recovery to all alert callbacks (onAlert, onResolved, onEscalate)
- Clone alerts before passing to escalation callback to prevent data races
- Make clearAlertNoLock callback async to prevent deadlock

HIGH PRIORITY FIXES (prevent memory leaks):
- Add cleanup for stale pendingAlerts entries (deleted resources)
- Add cleanup for dockerRestartTracking (ephemeral containers in CI/CD)

MEDIUM PRIORITY FIXES (prevent stuck alerts):
- Validate hysteresis thresholds (ensure clear < trigger)
- Auto-fix invalid configurations with warning logs

Impact:
- Service stability: Malformed webhook URLs or email configs can no longer crash Pulse
- Memory management: Prevents unbounded growth in dynamic environments
- Alert reliability: Prevents alerts that never clear due to invalid thresholds
- Concurrency safety: Eliminates data races in escalation path

Technical details:
- Created safeCallResolvedCallback() and safeCallEscalateCallback() wrappers
- Added ensureValidHysteresis() validation helper
- Extended Cleanup() with pendingAlerts and dockerRestartTracking pruning
- All callbacks now have defer/recover panic handlers with detailed logging

Testing: Verified compilation with 'go build -o /dev/null ./...'
2025-10-21 10:55:57 +00:00
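The defer/recover wrapping mentioned in the commit above follows a common Go pattern, sketched here. `safeCall` is a hypothetical generic wrapper; the commit's own `safeCallResolvedCallback()` and `safeCallEscalateCallback()` are not reproduced because their signatures are not shown.

```go
package main

import (
	"fmt"
	"log"
)

// safeCall runs fn and converts a panic into a log line instead of letting
// it crash the process. It reports whether a panic was recovered.
func safeCall(name string, fn func()) (recovered bool) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("alert callback %s panicked: %v", name, r)
			recovered = true
		}
	}()
	fn()
	return false
}

func main() {
	ok := safeCall("onAlert", func() { fmt.Println("notification sent") })
	bad := safeCall("onResolved", func() { panic("malformed webhook URL") })
	fmt.Println(ok, bad)
}
```

The named return value is what lets the deferred function change the result after the panic has unwound `fn`.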
rcourtman
85ffe10aed docs: add Mermaid diagrams to improve visual documentation
Enhance documentation with six Mermaid diagrams to better explain
complex system implementations:

- Adaptive polling lifecycle flowchart showing enqueue→execute→feedback
  cycle with scheduler, priority queue, and worker interactions
- Circuit breaker state machine diagram illustrating Closed↔Open↔Half-open
  transitions with triggers and recovery paths
- Temperature proxy architecture diagram highlighting trust boundaries,
  security controls, and data flow between host/container/cluster
- Sensor proxy request flow sequence diagram showing auth, rate limiting,
  validation, and SSH execution pipeline
- Alert webhook pipeline flowchart detailing template resolution, URL
  rendering, HTTP dispatch, and retry logic
- Script library workflow diagram illustrating dev→test→bundle→distribute
  lifecycle emphasizing modular design

These visualizations make it easier for operators and contributors to
understand Pulse's sophisticated architectural patterns.
2025-10-21 10:40:33 +00:00
rcourtman
66b97333f7 fix: skip update check for source builds and show appropriate UI message
Source builds use commit hashes (main-c147fa1) not semantic versions
(v4.23.0), so update checks would always fail or show misleading
"Update Available" banners.

Changes:
- Add IsSourceBuild flag to VersionInfo struct
- Detect source builds via BUILD_FROM_SOURCE marker file
- Skip update check for source builds (like Docker)
- Update frontend to show "Built from source" message
- Disable manual update check button for source builds
- Return "source" deployment type for source builds

Backend:
- internal/updates/version.go: Add isSourceBuildEnvironment() detection
- internal/updates/manager.go: Skip check with appropriate message
- internal/api/types.go: Add isSourceBuild to API response
- internal/api/router.go: Include isSourceBuild in version endpoint

Frontend:
- src/api/updates.ts: Add isSourceBuild to VersionInfo type
- src/stores/updates.ts: Don't poll for updates on source builds
- src/components/Settings/Settings.tsx: Show "Built from source" message

Fixes the confusing "Update Available" banner for users who explicitly
chose --source to get latest main branch code.

Co-authored-by: Codex AI
2025-10-21 10:08:00 +00:00
rcourtman
56c6c0cc0c feat: improve discovery with progress tracking, validation, and structured errors
Significantly enhance the network discovery feature to eliminate false positives,
provide real-time progress updates, and improve error reporting.

Key improvements:
- Require positive Proxmox identification (version data, auth headers, or certificates)
  instead of reporting any service on ports 8006/8007
- Add real-time progress tracking with phase/target counts and completion percentage
- Implement structured error reporting with IP, phase, type, and timestamp details
- Fix TLS timeout handling to prevent hangs on unresponsive hosts
- Expose progress and structured errors via WebSocket for UI consumption
- Reduce log verbosity by moving discovery logs to debug level
- Fix duplicate IP counting to ensure progress reaches 100%

Breaking changes: None (backward compatible with legacy API methods)
2025-10-20 22:29:30 +00:00
rcourtman
8194ce9e7a feat: add containerization detection to version endpoint
Added containerized and containerId fields to /api/version endpoint
to enable automatic temperature proxy installation for LXC containers.

Changes:
- Added Containerized bool field to VersionResponse
- Added ContainerId string field to VersionResponse
- Detect containerization by checking /run/systemd/container file
- Extract container ID from hostname for LXC containers
- Set deployment type from container type (lxc/docker)

This allows the PVE setup script to:
1. Detect that Pulse is running in a container
2. Find the container ID by matching IPs
3. Automatically install pulse-sensor-proxy on the host
4. Configure bind mount for secure socket communication

Fixes the issue where setup script showed 'Proxy not available'
even when Pulse was containerized.
2025-10-20 22:14:03 +00:00
rcourtman
d430efcecb fix: correct fmt.Sprintf argument alignment in PVE setup script
Critical bug fix: The setup script's format string had 33 placeholders
but was only receiving 27 arguments, causing:
- INSTALLER_URL to receive authToken instead of pulseURL
- This made curl try to resolve the token value as a hostname
- Error: 'curl: (6) Could not resolve host: N7AE3P'
- Token ID showed '%!s(MISSING)' in manual setup instructions

Fixed by:
- Added missing tokenName at position 7
- Added literal '%s' strings for version_ge printf placeholders
- Added authToken arguments for Authorization headers (positions 29, 31)
- Ensured all 33 format placeholders have corresponding arguments

Now generates correct URLs:
- INSTALLER_URL: http://192.168.0.160:7655/api/install/install-sensor-proxy.sh
- --pulse-server: http://192.168.0.160:7655
- Token ID: pulse-monitor@pam!pulse-192-168-0-160-[timestamp]
2025-10-20 21:58:37 +00:00
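A mismatch like the one fixed in the commit above (33 placeholders, 27 arguments) can be caught by a simple guard in a test. The sketch below is an assumption about how such a check could look, not code from the repository. `countVerbs` and `checkFormat` are hypothetical names, and the counter deliberately ignores explicit argument indexes (`%[1]s`) and `*` widths.

```go
package main

import "fmt"

// countVerbs counts formatting verbs in a Go format string. Each '%' that is
// not part of an escaped "%%" is treated as one verb consuming one argument.
func countVerbs(format string) int {
	n := 0
	for i := 0; i < len(format); i++ {
		if format[i] != '%' {
			continue
		}
		if i+1 < len(format) && format[i+1] == '%' {
			i++ // skip the escaped percent sign
			continue
		}
		n++
	}
	return n
}

// checkFormat returns an error when the verb count and argument count differ.
func checkFormat(format string, nargs int) error {
	if want := countVerbs(format); want != nargs {
		return fmt.Errorf("format has %d placeholders but %d arguments", want, nargs)
	}
	return nil
}

func main() {
	script := "URL=%s\nTOKEN=%s\nprintf '%%s' done\n"
	fmt.Println(countVerbs(script))
	fmt.Println(checkFormat(script, 1))
	fmt.Println(checkFormat(script, 2))
}
```

Note the `%%s` in the sample: shell `printf` placeholders inside a Go template must be escaped or supplied as literal arguments, which is the situation the commit describes for `version_ge`.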