Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-04 22:40:14 +00:00

Author	SHA1	Message	Date
rcourtman	45d4d68127	fix: Add debug logging and response format handling for replication status - Add comprehensive debug logging to diagnose replication status fetch failures - Handle both array and single-object response formats from Proxmox API - Log raw response body for easier debugging - Log success/failure for each enrichment step This helps diagnose issue #992 where replication last/next sync times aren't showing. The logging will reveal if the API call is failing, returning empty data, or returning data in an unexpected format. Related to #992	2026-01-04 15:01:32 +00:00
rcourtman	43b5fad12c	fix: Add main host URL as fallback for remote cluster access When a Proxmox cluster is discovered, Pulse now includes the user-provided main host URL as a fallback endpoint. This handles scenarios where Proxmox reports internal IPs that aren't reachable from Pulse's network (e.g., monitoring a remote cluster across different networks). Previously, if all cluster endpoint IPs were unreachable, the connection would fail with no fallback. Now the ClusterClient will fall back to the main host URL, allowing Proxmox to route API calls internally. Related to #1028	2026-01-04 14:54:03 +00:00
rcourtman	5d4e911298	feat: improve test coverage for pulse-sensor-proxy	2026-01-03 21:42:19 +00:00
rcourtman	ed78509f92	Fix flaky tests and improve coverage across alerts, api, and config packages - Fix deadlock and race conditions in internal/alerts - Add comprehensive error path tests for internal/config - Fix 401 handling in internal/api - Fix Docker Swarm task filtering test logic	2026-01-03 18:36:17 +00:00
rcourtman	9e339957c6	fix: Update runtime config when toggling Docker update actions setting The DisableDockerUpdateActions setting was being saved to disk but not updated in h.config, causing the UI toggle to appear to revert on page refresh since the API returned the stale runtime value. Related to #1023	2026-01-03 11:14:17 +00:00
rcourtman	277aca3e4e	fix: Only log 'Migration complete' when inline allowed_nodes actually migrated. Related to Discussion #946 The sensor proxy self-heal script runs every 5 minutes and calls migrate-to-file. Previously it would print 'Migration complete' every time, even when already in file mode with nothing to migrate. Now migrateInlineToFile returns a boolean indicating if migration actually occurred, and the CLI only prints the message when work was done.	2025-12-29 14:15:57 +00:00
rcourtman	2b48b0a459	feat: add --kube-include-all-deployments flag for Kubernetes agent Adds IncludeAllDeployments option to show all deployments, not just problem ones (where replicas don't match desired). This provides parity with the existing --kube-include-all-pods flag. - Add IncludeAllDeployments to kubernetesagent.Config - Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var - Update collectDeployments to respect the new flag - Add test for IncludeAllDeployments functionality - Update UNIFIED_AGENT.md documentation Addresses feedback from PR #855	2025-12-18 20:58:30 +00:00
rcourtman	30f01771ac	Add meaningful tests for host agent and exec websocket	2025-12-17 17:02:01 +00:00
rcourtman	5a15a1820b	fix(sensor-proxy): Make nodeGate.acquire() context-aware to prevent goroutine leaks The acquire() function blocked indefinitely without respecting context cancellation. When clients disconnect while waiting for the per-node lock, goroutines would remain blocked forever, connections accumulate in CLOSE_WAIT state, and rate limiter semaphores are never released. Added acquireContext() that respects context cancellation and updated both HTTP and RPC handlers to use it. This prevents: - Goroutine leaks from cancelled requests - CLOSE_WAIT connection accumulation - Cascading failures from filled semaphores Related to #832	2025-12-10 20:14:28 +00:00
rcourtman	8948e84fe5	feat: AI features, agent improvements, and host monitoring enhancements AI Chat Integration: - Multi-provider support (Anthropic, OpenAI, Ollama) - Streaming responses with markdown rendering - Agent command execution for remote troubleshooting - Context-aware conversations with host/container metadata Agent Updates: - Add --enable-proxmox flag for automatic PVE/PBS token setup - Improve auto-update with semver comparison (prevents downgrades) - Add updatedFrom tracking to report previous version after update - Reduce initial update check delay from 30s to 5s - Add agent version column to Hosts page table Host Metrics: - Add DiskIO stats collection (read/write bytes, ops, time) - Improve disk filtering to exclude Docker overlay mounts - Add RAID array monitoring via mdadm - Enhanced temperature sensor parsing Frontend: - New Agent Version column on Hosts overview table - Improved node modal with agent-first installation flow - Add DiskIO display in host drawer - Better responsive handling for metric bars	2025-12-05 10:37:02 +00:00
rcourtman	59b713d176	fix: Force tcp4 network for IPv4 addresses in sensor-proxy HTTP mode On dual-stack systems with net.ipv6.bindv6only=1 (like some Proxmox 8 configurations), Go's net.Listen("tcp", "0.0.0.0:8443") may still bind to IPv6-only. This caused IPv4 localhost connections to hang while IPv6 worked. Fix by detecting IPv4 addresses and explicitly using "tcp4" network type when creating the listener. Related to #805	2025-12-04 20:09:37 +00:00
rcourtman	7d733db3a8	fix: Default sensor-proxy HTTP to 0.0.0.0:8443 for IPv4 binding On systems with net.ipv6.bindv6only=1 (including some Proxmox 8 configurations), using ":8443" results in IPv6-only binding. Users reported curl to 127.0.0.1:8443 hanging while [::1]:8443 worked. Changed default from ":8443" to "0.0.0.0:8443" to explicitly bind IPv4. Related to #805	2025-12-03 20:25:08 +00:00
rcourtman	4f824ab148	style: Apply gofmt to 37 files Standardize code formatting across test files and monitor.go. No functional changes.	2025-12-02 17:21:48 +00:00
rcourtman	97672b4701	Add unit tests for validation utility functions (pulse-sensor-proxy) Add 42 test cases for security-critical validation utility functions: - TestStripNodeDelimiters (10 cases): IPv6 bracket handling, edge cases - TestParseNodeIP (10 cases): IPv4/IPv6 parsing with bracket support - TestNormalizeAllowlistEntry (11 cases): case normalization, whitespace handling, IPv6 full form compression - TestIPAllowed (11 cases): CIDR matching, hosts map lookup, nil handling These functions are used for node allowlist validation to prevent SSRF attacks in the sensor proxy.	2025-11-30 17:04:43 +00:00
rcourtman	786a78af85	Add unit tests for dedupeUint32 and dedupeStrings (pulse-sensor-proxy)	2025-11-30 15:26:42 +00:00
rcourtman	48af5615b9	Add unit tests for pulse-sensor-proxy utility functions - hashIPToUID: 11 test cases covering IP hashing for rate limiting (determinism, range bounds, collision detection, boundary values) - extractNodesFromYAML: 17 test cases covering YAML node list parsing (map format, list format, mixed types, edge cases) First test files for config_cmd.go and http_server.go utilities.	2025-11-30 14:19:43 +00:00
rcourtman	9476de40a6	Add pagination to backup list for large backup counts - Add pagination (100 items per page) to prevent UI lockup with 2500+ backups - Show year in date labels for non-current year backups - Reset to page 1 when filters change - Add First/Previous/Next/Last navigation controls Fixes #541	2025-11-30 09:55:01 +00:00
rcourtman	f9a4df2e5a	Add unit tests for sanitizeNodeLabel (pulse-sensor-proxy/metrics.go)	2025-11-30 04:50:51 +00:00
rcourtman	deb5c3cd23	Add unit tests for pulse-sensor-proxy capability functions Test Capability.Has, parseCapabilityList, capabilityNames, and constant values. 54 test cases covering bitmask operations, parsing, case insensitivity, whitespace handling, unknown values, and round-trip consistency.	2025-11-30 03:32:53 +00:00
rcourtman	6a8258be14	chore: remove dead code and unused exports Remove ~900 lines of unused code identified by static analysis: Go: - internal/logging: Remove 10 unused functions (InitFromConfig, New, FromContext, WithLogger, etc.) that were built but never integrated - cmd/pulse-sensor-proxy: Remove 7 dead validation functions for a removed command execution feature - internal/metrics: Remove 8 unused notification metric functions and 10 Prometheus metrics that were never wired up Frontend: - Delete ActivationBanner.tsx stub component - Remove unused exports: stopMetricsSampler, getSamplerStatus, formatSpeedCompact, parseMetricKey, getResourceAlerts	2025-11-27 13:17:39 +00:00
rcourtman	6ff345fb6b	chore: fix staticcheck SA warnings - Fix SA4006 unused value issues in ssh.go, validation.go, generator.go - Replace deprecated ioutil with io/os in config.go - Replace deprecated tar.TypeRegA with tar.TypeReg - Remove deprecated rand.Seed calls (auto-seeded in Go 1.20+) - Fix always-true nil check in main.go - Fix impossible nil comparison in tempproxy/client.go - Add nil check for config in monitor.New()	2025-11-27 09:16:53 +00:00
rcourtman	bc9e89696b	chore: fix staticcheck U1000 unused code warnings - Remove unused ipv6Regex from validation.go - Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use) - Remove unused apiLimiter rate limiter - Remove unused stopOnce fields from csrf_store.go and session_store.go - Remove unused lastBroadcast field from hub.go - Remove unused lastUsedIndex field from cluster_client.go	2025-11-27 09:12:17 +00:00
rcourtman	3fce14469c	chore: remove legacy proxy handlers and unused functions Remove legacy V1 handlers replaced by V2 versions: - sendError (replaced by sendErrorV2) - handleGetStatus (replaced by handleGetStatusV2) - handleEnsureClusterKeys (replaced by handleEnsureClusterKeysV2) - handleRegisterNodes (replaced by handleRegisterNodesV2) - handleGetTemperature (replaced by handleGetTemperatureV2) Also remove related unused functions: - getPublicKey wrapper (only getPublicKeyFrom is used) - pushSSHKey wrapper (only pushSSHKeyFrom is used) - nodeValidator.ipAllowed method (standalone ipAllowed is used) - validateConfigFile (never called) - runServiceDebug (Windows debug mode, never called)	2025-11-27 08:41:28 +00:00
rcourtman	01f7d81d38	style: fix gofmt formatting inconsistencies Run gofmt -w to fix tab/space inconsistencies across 33 files.	2025-11-26 23:44:36 +00:00
rcourtman	6853a0ffd1	feat: serve install scripts from GitHub releases instead of main branch Scripts like install.sh and install-sensor-proxy.sh are now attached as release assets and downloaded from releases/latest/download/ URLs. This ensures users always get scripts compatible with their installed version, even while development continues on main. Changes: - build-release.sh: copy install scripts to release directory - create-release.yml: upload scripts as release assets - Updated all documentation and code references to use release URLs - Scripts reference each other via release URLs for consistency	2025-11-26 08:59:59 +00:00
courtmanr@gmail.com	c91add36d2	fix: filter out qdevice from cluster node discovery	2025-11-24 22:54:58 +00:00
courtmanr@gmail.com	1ae34285c5	fix(sensor-proxy): relax pvecm status parsing to support decimal node IDs Fixes an issue where pvecm status output using decimal node IDs (e.g. '1' instead of '0x1') caused node discovery to fail. Added test case for this format.	2025-11-23 08:21:58 +00:00
courtmanr@gmail.com	a5fbe52a59	Fix pvecm status parsing for QDevice flags (#738)	2025-11-22 23:44:01 +00:00
rcourtman	429f9c45bb	Ensure sensor proxy wrapper delivers SMART temps locally	2025-11-21 10:07:42 +00:00
courtmanr@gmail.com	37b1517bd8	feat: implement atomic config management in sensor proxy	2025-11-20 19:01:24 +00:00
courtmanr@gmail.com	c8b4d4a0d8	Implement sensor proxy installation and configuration updates	2025-11-20 13:23:21 +00:00
rcourtman	b72fc2ab79	docs: align sensor proxy config with current defaults	2025-11-20 12:40:01 +00:00
courtmanr@gmail.com	d8e2b40086	Fix macOS build for sensor-proxy and improve hot-dev script	2025-11-20 12:28:01 +00:00
rcourtman	d554c9dbb2	fix(sensor-proxy): eliminate all uncoordinated config writers Remove all code paths that manipulate config files without Phase 2 locking: 1. Installer: Remove ensure_allowed_nodes_file_reference() call (line 1674) - Migration now handled exclusively by config migrate-to-file 2. Installer: Make migration failures fatal in update_allowed_nodes() - Prevents fallback to unsafe Python manipulation 3. Daemon sanitizer: Remove os.WriteFile() call - Now only sanitizes in-memory copy, doesn't write back to disk - Logs warning instructing admin to run `config migrate-to-file` 4. Self-heal script: Replace 132 lines of Python with CLI call - sanitize_allowed_nodes() now calls `config migrate-to-file` - Eliminates uncoordinated Python-based config rewriting All config mutations now flow exclusively through Phase 2 CLI with atomic operations and file locking. No code paths remain that can create duplicate allowed_nodes blocks. Addresses Codex review feedback on Phase 2 gaps.	2025-11-19 10:55:01 +00:00
rcourtman	4419d8be87	fix(sensor-proxy): sanitize duplicate blocks before migration The migrate-to-file command now calls sanitizeDuplicateAllowedNodesBlocks before parsing the config, allowing it to handle corrupted configs with duplicate allowed_nodes blocks. This ensures migration works even on hosts that were affected by the original corruption issue.	2025-11-19 10:38:04 +00:00
rcourtman	28cd487889	feat(sensor-proxy): complete Phase 2 with CLI-based config migration Add `config migrate-to-file` command and update installer to eliminate all shell/Python config manipulation, ensuring atomic operations throughout. Changes: - Add `config migrate-to-file` command to atomically migrate inline allowed_nodes blocks to file-based configuration - Update installer's update_allowed_nodes() to call CLI exclusively - Simplify migrate_inline_allowed_nodes_to_file() to use CLI - Remove dependency on Python/sed for config manipulation - Implement dual-file locking (config.yaml + allowed_nodes.yaml) to prevent race conditions during migration All config mutations now flow through the Phase 2 CLI with: - File locking (flock) - Atomic writes (temp + rename + fsync) - Proper YAML parsing/generation This completes Phase 2 architecture and eliminates the root cause of config corruption issues. Related to prior commits: `53dec6010`, `3dc073a28`, `804a638ea`, `131666bc1`	2025-11-19 10:35:49 +00:00
rcourtman	e39c6a3660	docs(sensor-proxy): comprehensive config management documentation Adds complete documentation for the new sensor-proxy config management CLI implemented in Phase 2. Addresses user-facing aspects of the corruption fix. New Documentation: - docs/operations/sensor-proxy-config-management.md (469 lines) - Complete operations runbook for config management - Full CLI reference with examples - Migration guide from inline config - Architecture explanation - Common operational tasks - Troubleshooting guide - Best practices and automation Updated Documentation: - cmd/pulse-sensor-proxy/README.md - Configuration Management CLI section - Allowed Nodes File format - Enhanced troubleshooting - Config corruption recovery - docs/TEMPERATURE_MONITORING.md - Config validation failure troubleshooting - Configuration Management quick reference - Cross-links to detailed docs - docs/TROUBLESHOOTING.md - Sensor proxy config validation errors - Comprehensive diagnosis steps - Automatic and manual recovery - README.md & docs/README.md - Added new runbook to operations index - Positioned for discoverability Coverage: - Both CLI commands fully documented - Phase 1 & Phase 2 architecture explained - Migration path from pre-v4.31.1 - Config corruption recovery procedures - Safe config editing practices - Automation examples - Troubleshooting all failure modes Documentation Quality: - Cross-linked from 5 different documents - Clear examples for common use cases - Target audience: system administrators - Follows project documentation style - Production-ready This completes the sensor-proxy config corruption fix by providing users with comprehensive guidance for the new config management system. Related to Phase 2 commits `3dc073a28`, `804a638ea`, `131666bc1`	2025-11-19 10:01:33 +00:00
rcourtman	d99a855ee7	fix(sensor-proxy): lock file permissions and deadlock prevention Final security hardening based on second Codex review: Lock File Permission Fix (Security) - Lock file now created with 0600 instead of 0644 - Prevents unprivileged users from opening lock and holding LOCK_EX - Without this, any local user could DoS the installer/self-heal - Added f.Chmod(0600) to fix permissions on existing lock files Deadlock Prevention (Future-Proofing) - Added documentation for future multi-file locking scenarios - Specifies consistent lock ordering requirement (config.yaml.lock before allowed_nodes.yaml.lock) - Prevents potential deadlocks if future commands modify multiple files - Current implementation only locks one file, so no immediate issue Testing: ✅ Lock file created as `-rw-------` (0600) ✅ Existing lock files with wrong perms get fixed ✅ Unprivileged users can no longer DoS the lock Codex Validation: - Locking is now correct (persistent .lock file, held during entire operation) - Atomic writes complete while lock is held - Validation honors actual config paths - Empty lists supported for operational flexibility - Error propagation prevents silent failures - No remaining race conditions or security issues Phase 2 is now complete and Codex-verified as secure. Related to Phase 2 fixes commit `804a638ea`	2025-11-19 09:51:20 +00:00
rcourtman	1162a208cc	fix(sensor-proxy): critical Phase 2 locking and validation fixes Fixes critical issues found by Codex code review: 1. Fixed file locking race condition (CRITICAL) - Lock file was being replaced by atomic rename, invalidating the lock - New approach: lock a separate `.lock` file that persists across renames - Ensures concurrent writers (installer + self-heal timer) are properly serialized - Without this fix, corruption was still possible despite Phase 2 2. Fixed validation to honor configured allowed_nodes_file path - validate command now uses loadConfig() to read actual config - Respects allowed_nodes_file setting instead of assuming default path - Prevents false positives/negatives when path is customized 3. Allow empty allowed_nodes lists - Empty lists are valid (admin may clear for security, or rely on IPC validation) - validate no longer fails on empty lists - set-allowed-nodes --replace with zero nodes now supported - Critical for operational flexibility 4. Installer error propagation - update_allowed_nodes failures now exit installer with error - Prevents silent failures that leave stale allowlists - Self-heal will abort instead of masking CLI errors Technical Details: - withLockedFile() now locks `<path>.lock` instead of target file - Lock held for entire duration of read-modify-write-rename - atomicWriteFile() completes while lock is still held - Empty lists represented as `allowed_nodes: []` in YAML Testing: ✅ Lock file created and persists across operations ✅ Empty list can be written with --replace ✅ Validation passes with empty lists ✅ Config path from allowed_nodes_file honored ✅ Concurrent operations properly serialized These fixes ensure Phase 2 actually eliminates corruption by design. Identified by Codex code review Related to Phase 2 commit `3dc073a28`	2025-11-19 09:47:43 +00:00
rcourtman	0565781655	feat(sensor-proxy): Phase 2 - atomic config management with CLI Implements bullet-proof configuration management to completely eliminate allowed_nodes corruption by design. This builds on Phase 1 (file-only mode) by replacing all shell/Python config manipulation with proper Go tooling. New Features: - `pulse-sensor-proxy config validate` - parse and validate config files - `pulse-sensor-proxy config set-allowed-nodes` - atomic node list updates - File locking via flock prevents concurrent write races - Atomic writes (temp file + rename) ensure consistency - systemd ExecStartPre validation prevents startup with bad config Architectural Changes: 1. Installer now calls config CLI instead of embedded Python/shell scripts 2. All config mutations go through single authoritative writer 3. Deduplication and normalization handled in Go (reuses existing logic) 4. Sanitizer kept as noisy failsafe (warns if corruption still occurs) Implementation Details: - New cmd/pulse-sensor-proxy/config_cmd.go with cobra commands - withLockedFile() wrapper ensures exclusive access - atomicWriteFile() uses temp + rename pattern - Installer update_allowed_nodes() simplified to CLI calls - Both systemd service modes include ExecStartPre validation Why This Works: - Single code path for all writes (no shell/Python divergence) - File locking serializes self-heal timer + manual installer runs - Validation gate prevents proxy from starting with corrupt config - CLI uses same YAML parser as the daemon (guaranteed compatibility) Phase 2 Benefits: - Corruption impossible by design (not just detected and fixed) - No more Python dependency for config management - Atomic operations prevent partial writes - Clear error messages on validation failures The defensive sanitizer remains active but now logs loudly if triggered, allowing us to confirm Phase 2 eliminates corruption in production before removing the safety net entirely. This completes the fix for the recurring temperature monitoring outages. Related to Phase 1 commit `53dec6010`	2025-11-19 09:37:49 +00:00
rcourtman	6e77c4dbea	fix: sanitize sensor proxy config during self-heal Related to #714.	2025-11-18 22:51:40 +00:00
rcourtman	509e87ca35	Sanitize duplicate allowed_nodes blocks	2025-11-18 19:33:26 +00:00
rcourtman	eca1f272ca	Move allowed_nodes to managed file	2025-11-16 10:06:58 +00:00
rcourtman	47d5c14aef	Improve temperature proxy control-plane flow	2025-11-15 21:49:51 +00:00
rcourtman	c957ccd9e6	Add CI build workflow and tighten proxy diagnostics	2025-11-14 13:32:29 +00:00
rcourtman	a4eb70af96	docs: document sensor proxy log forwarding	2025-11-14 01:12:25 +00:00
rcourtman	3f159a93dc	docs: escape table pipes in sensor proxy readme	2025-11-14 01:01:55 +00:00
rcourtman	3c41d3960c	docs: add operations runbooks and audit fixes	2025-11-14 01:01:21 +00:00
rcourtman	61f011af1d	Improve temperature proxy diagnostics and tests	2025-11-13 22:31:53 +00:00
rcourtman	e178ae50a5	Add context timeout to local temperature collection The getTemperatureLocal() function was running sensors without a timeout, which could cause HTTP requests to hang if the sensors command stalled. This adds context.Context parameter and uses exec.CommandContext to ensure local temperature collection respects the same 15-second timeout as SSH-based collection. Fixes issue where HTTP mode worked for remote nodes but timed out for self-monitoring on the same host.	2025-11-13 20:15:05 +00:00


85 commits