Commit graph

103 commits

Author SHA1 Message Date
rcourtman
e29fda5e5a Rebuild agent token bindings on API token config reload
When api_tokens.json is modified on disk, the ConfigWatcher reloads
the tokens into memory. However, the Monitor's dockerTokenBindings and
hostTokenBindings maps were not synchronized with the new token set,
causing orphaned bindings when agents reconnect after reinstall.

Add SetAPITokenReloadCallback to ConfigWatcher that triggers Monitor's
new RebuildTokenBindings method after token reload. This method
reconstructs the binding maps from current Docker host and host agent
state, keeping only bindings for tokens that still exist in config.

Related to #773
2025-11-29 14:09:30 +00:00
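A minimal sketch of the rebuild step described in the commit above. The `agentState` type, the `rebuildBindings` name, and the token-to-agent map direction are assumptions made for illustration; this is not the Monitor's actual `RebuildTokenBindings` code.

```go
package main

// agentState is a hypothetical stand-in for the Docker host and host agent
// records the Monitor holds; only the fields needed here are modeled.
type agentState struct {
	AgentID string
	TokenID string
}

// rebuildBindings reconstructs a tokenID -> agentID map from current agent
// state, dropping any binding whose token is no longer present in config.
func rebuildBindings(agents []agentState, liveTokens map[string]struct{}) map[string]string {
	bindings := make(map[string]string, len(agents))
	for _, a := range agents {
		if a.AgentID == "" || a.TokenID == "" {
			continue
		}
		if _, ok := liveTokens[a.TokenID]; !ok {
			continue // token was removed from the config: drop the orphan
		}
		bindings[a.TokenID] = a.AgentID
	}
	return bindings
}
```

Rebuilding from scratch, rather than patching the old map, means stale entries cannot survive a reload.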
rcourtman
c104ceb19e feat: add self-update capability to standalone pulse-host-agent
The standalone pulse-host-agent was missing self-update functionality
that existed in pulse-docker-agent and the unified pulse-agent.

Changes:
- Add agentupdate integration to pulse-host-agent
- Add --no-auto-update flag and PULSE_NO_AUTO_UPDATE env var
- Update Windows service to use errgroup pattern with auto-updater
- Move version from internal/hostagent to main package for ldflags

Related to #737
2025-11-27 20:21:05 +00:00
rcourtman
cf62ebae27 Add Windows service support to unified agent
Port Windows SCM integration from pulse-host-agent to pulse-agent,
enabling the unified agent to run as a Windows service with proper
start/stop handling and event logging.

Related to #766
2025-11-27 17:00:03 +00:00
rcourtman
f7ffb36c41 chore: remove dead code and unused exports
Remove ~900 lines of unused code identified by static analysis:

Go:
- internal/logging: Remove 10 unused functions (InitFromConfig, New,
  FromContext, WithLogger, etc.) that were built but never integrated
- cmd/pulse-sensor-proxy: Remove 7 dead validation functions for a
  removed command execution feature
- internal/metrics: Remove 8 unused notification metric functions and
  10 Prometheus metrics that were never wired up

Frontend:
- Delete ActivationBanner.tsx stub component
- Remove unused exports: stopMetricsSampler, getSamplerStatus,
  formatSpeedCompact, parseMetricKey, getResourceAlerts
2025-11-27 13:17:39 +00:00
rcourtman
e9ac1429c0 refactor: remove unnecessary type conversions
Remove redundant type conversions identified by unconvert linter:
- Remove int() conversions for already-int VMID fields
- Remove int64() conversions for already-int64 arithmetic results
- Remove uint64() conversions for already-uint64 Disk/MaxDisk fields
- Remove int() on syscall.Stdin (already int constant)
2025-11-27 10:33:35 +00:00
rcourtman
67b60adb65 style: fix revive linter warnings
- Mark unused stub parameters with underscore
- Rename 'copy' variable to avoid shadowing builtin
- Remove unnecessary else blocks after return statements
2025-11-27 10:26:26 +00:00
rcourtman
1fb4a2c2c6 chore: remove deprecated build tags and use strings.ReplaceAll
- Remove redundant // +build directives (go:build is sufficient in Go 1.17+)
- Replace strings.Replace(..., -1) with strings.ReplaceAll
2025-11-27 10:16:08 +00:00
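For reference, the `strings.ReplaceAll` change in the commit above is behavior-preserving. The function names and the `/` to `-` substitution below are illustrative only.

```go
package main

import "strings"

// sanitizeOld shows the pre-cleanup form: n = -1 means "replace every match".
func sanitizeOld(s string) string {
	return strings.Replace(s, "/", "-", -1)
}

// sanitizeNew is the equivalent call with the intent spelled out in the name.
func sanitizeNew(s string) string {
	return strings.ReplaceAll(s, "/", "-")
}
```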
rcourtman
e25e1af8cb chore: fix staticcheck SA warnings
- Fix SA4006 unused value issues in ssh.go, validation.go, generator.go
- Replace deprecated ioutil with io/os in config.go
- Replace deprecated tar.TypeRegA with tar.TypeReg
- Remove deprecated rand.Seed calls (auto-seeded in Go 1.20+)
- Fix always-true nil check in main.go
- Fix impossible nil comparison in tempproxy/client.go
- Add nil check for config in monitor.New()
2025-11-27 09:16:53 +00:00
rcourtman
4fd3bdbc04 chore: fix staticcheck U1000 unused code warnings
- Remove unused ipv6Regex from validation.go
- Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use)
- Remove unused apiLimiter rate limiter
- Remove unused stopOnce fields from csrf_store.go and session_store.go
- Remove unused lastBroadcast field from hub.go
- Remove unused lastUsedIndex field from cluster_client.go
2025-11-27 09:12:17 +00:00
rcourtman
c0fff33aed chore: remove legacy proxy handlers and unused functions
Remove legacy V1 handlers replaced by V2 versions:
- sendError (replaced by sendErrorV2)
- handleGetStatus (replaced by handleGetStatusV2)
- handleEnsureClusterKeys (replaced by handleEnsureClusterKeysV2)
- handleRegisterNodes (replaced by handleRegisterNodesV2)
- handleGetTemperature (replaced by handleGetTemperatureV2)

Also remove related unused functions:
- getPublicKey wrapper (only getPublicKeyFrom is used)
- pushSSHKey wrapper (only pushSSHKeyFrom is used)
- nodeValidator.ipAllowed method (standalone ipAllowed is used)
- validateConfigFile (never called)
- runServiceDebug (Windows debug mode, never called)
2025-11-27 08:41:28 +00:00
rcourtman
2fe7bb6141 style: fix gofmt formatting inconsistencies
Run gofmt -w to fix tab/space inconsistencies across 33 files.
2025-11-26 23:44:36 +00:00
rcourtman
e467523a61 feat: serve install scripts from GitHub releases instead of main branch
Scripts like install.sh and install-sensor-proxy.sh are now attached
as release assets and downloaded from releases/latest/download/ URLs.
This ensures users always get scripts compatible with their installed
version, even while development continues on main.

Changes:
- build-release.sh: copy install scripts to release directory
- create-release.yml: upload scripts as release assets
- Updated all documentation and code references to use release URLs
- Scripts reference each other via release URLs for consistency
2025-11-26 08:59:59 +00:00
rcourtman
0f0832d30f fix: propagate unified agent version and improve legacy cleanup
Issues found during scenario testing:

1. Version propagation: The hostagent and dockeragent packages were
   reporting their own Version (0.1.0-dev) instead of the unified
   agent's version. Added AgentVersion config field to pass the
   parent's version down.

2. macOS legacy cleanup: The install.sh script was missing cleanup
   for pulse-docker-agent on macOS.

3. Windows legacy cleanup: The install.ps1 script was missing cleanup
   for legacy PulseHostAgent and PulseDockerAgent services.

These fixes ensure:
- Unified agent reports consistent version across host/docker metrics
- Legacy agents are properly removed on all platforms during upgrade
- Users migrating from legacy agents get a clean transition
2025-11-25 23:39:10 +00:00
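The version propagation fix in the commit above can be pictured roughly as follows. Only the `AgentVersion` field name and the `0.1.0-dev` default come from the commit message; the `Config` shape and the `reportedVersion` helper are assumptions.

```go
package main

// Version is the package-local default, as a sub-package might declare it.
const Version = "0.1.0-dev"

// Config models only the field discussed in the commit message.
type Config struct {
	AgentVersion string // set by the unified agent; empty for standalone use
}

// reportedVersion prefers the parent's version and falls back to the
// package default when none was passed down.
func reportedVersion(cfg Config) string {
	if cfg.AgentVersion != "" {
		return cfg.AgentVersion
	}
	return Version
}
```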
rcourtman
920f271b41 feat: improve legacy agent detection and migration UX
Add seamless migration path from legacy agents to unified agent:

- Add AgentType field to report payloads (unified vs legacy detection)
- Update server to detect legacy agents by type instead of version
- Add UI banner showing upgrade command when legacy agents are detected
- Add deprecation notice to install-host-agent.ps1
- Create install-docker-agent.sh stub that redirects to unified installer

Legacy agents (pulse-host-agent, pulse-docker-agent) now show a "Legacy"
badge in the UI with a one-click copy command to upgrade to the unified
agent.
2025-11-25 23:26:22 +00:00
rcourtman
ee35d9e5a5 feat: add auto-update support for unified agent
Implement self-update capability for the unified pulse-agent binary:

- Add internal/agentupdate package with cross-platform update logic
- Hourly version checks against /api/agent/version endpoint
- SHA256 checksum verification for downloaded binaries
- Atomic binary replacement with backup/rollback on failure
- Support for Linux, macOS, and Windows (10 platform/arch combinations)

Build and release changes:
- Dockerfile builds unified agent for all platforms
- build-release.sh includes unified agent in release artifacts
- validate-release.sh validates unified agent binaries
- Install scripts (install.sh, install.ps1) use correct URL format

Related to #727, #737
2025-11-25 23:15:03 +00:00
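A minimal sketch of the SHA256 verification step listed in the commit above, assuming the expected digest arrives as a hex string. The `verifyChecksum` name and signature are hypothetical and do not reflect the `internal/agentupdate` API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifyChecksum hashes the downloaded bytes and compares the result with
// the expected hex digest, ignoring case and surrounding whitespace.
func verifyChecksum(binary []byte, wantHex string) error {
	sum := sha256.Sum256(binary)
	got := hex.EncodeToString(sum[:])
	want := strings.ToLower(strings.TrimSpace(wantHex))
	if got != want {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}
```

An updater would run this check before replacing the binary, so a corrupt or truncated download never reaches the install path.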
courtmanr@gmail.com
26ebd476da WIP: Save all pending changes including frontend updates and unified agent scaffolding 2025-11-25 11:27:07 +00:00
courtmanr@gmail.com
9fdea3c9cd fix: filter out qdevice from cluster node discovery 2025-11-24 22:54:58 +00:00
courtmanr@gmail.com
53836ea9b0 fix(sensor-proxy): relax pvecm status parsing to support decimal node IDs
Fixes an issue where pvecm status output using decimal node IDs (e.g. '1' instead of '0x1') caused node discovery to fail. Added test case for this format.
2025-11-23 08:21:58 +00:00
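One way to accept both node ID spellings mentioned in the commit above is sketched here. The `parseNodeID` helper is hypothetical and is not the sensor proxy's parser; it only illustrates treating `0x1` and `1` as the same ID.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNodeID accepts a node ID written either as hex with a 0x prefix
// (for example "0x1") or as plain decimal (for example "1").
func parseNodeID(field string) (uint32, error) {
	s := strings.TrimSpace(field)
	base := 10
	if strings.HasPrefix(s, "0x") || strings.HasPrefix(s, "0X") {
		s, base = s[2:], 16
	}
	id, err := strconv.ParseUint(s, base, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid node ID %q: %w", field, err)
	}
	return uint32(id), nil
}
```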
courtmanr@gmail.com
b2c4a583f7 Fix pvecm status parsing for QDevice flags (#738) 2025-11-22 23:44:01 +00:00
rcourtman
cd31bdece1 Add log level control to host agent
Related to #742
2025-11-22 07:48:34 +00:00
rcourtman
19a2cac355 Add log level control for docker agent
Related to #742
2025-11-22 07:43:48 +00:00
rcourtman
335795d354 Ensure sensor proxy wrapper delivers SMART temps locally 2025-11-21 10:07:42 +00:00
courtmanr@gmail.com
17d55b8cf4 feat: implement atomic config management in sensor proxy 2025-11-20 19:01:24 +00:00
rcourtman
823508dc48 Related to #712: auto-restore host agent binaries for download 2025-11-20 15:45:21 +00:00
courtmanr@gmail.com
838993cf40 Implement sensor proxy installation and configuration updates 2025-11-20 13:23:21 +00:00
rcourtman
3f10c97c4e docs: align sensor proxy config with current defaults 2025-11-20 12:40:01 +00:00
courtmanr@gmail.com
3f974c06ca Fix macOS build for sensor-proxy and improve hot-dev script 2025-11-20 12:28:01 +00:00
rcourtman
9d20acd35e fix(sensor-proxy): eliminate all uncoordinated config writers
Remove all code paths that manipulate config files without Phase 2 locking:

1. Installer: Remove ensure_allowed_nodes_file_reference() call (line 1674)
   - Migration now handled exclusively by config migrate-to-file

2. Installer: Make migration failures fatal in update_allowed_nodes()
   - Prevents fallback to unsafe Python manipulation

3. Daemon sanitizer: Remove os.WriteFile() call
   - Now only sanitizes in-memory copy, doesn't write back to disk
   - Logs warning instructing admin to run `config migrate-to-file`

4. Self-heal script: Replace 132 lines of Python with CLI call
   - sanitize_allowed_nodes() now calls `config migrate-to-file`
   - Eliminates uncoordinated Python-based config rewriting

All config mutations now flow exclusively through Phase 2 CLI with
atomic operations and file locking. No code paths remain that can
create duplicate allowed_nodes blocks.

Addresses Codex review feedback on Phase 2 gaps.
2025-11-19 10:55:01 +00:00
rcourtman
0e2559a6d4 fix(sensor-proxy): sanitize duplicate blocks before migration
The migrate-to-file command now calls sanitizeDuplicateAllowedNodesBlocks
before parsing the config, allowing it to handle corrupted configs with
duplicate allowed_nodes blocks.

This ensures migration works even on hosts that were affected by the
original corruption issue.
2025-11-19 10:38:04 +00:00
rcourtman
c1185ebc9d feat(sensor-proxy): complete Phase 2 with CLI-based config migration
Add `config migrate-to-file` command and update installer to eliminate
all shell/Python config manipulation, ensuring atomic operations throughout.

Changes:
- Add `config migrate-to-file` command to atomically migrate inline
  allowed_nodes blocks to file-based configuration
- Update installer's update_allowed_nodes() to call CLI exclusively
- Simplify migrate_inline_allowed_nodes_to_file() to use CLI
- Remove dependency on Python/sed for config manipulation
- Implement dual-file locking (config.yaml + allowed_nodes.yaml) to
  prevent race conditions during migration

All config mutations now flow through the Phase 2 CLI with:
- File locking (flock)
- Atomic writes (temp + rename + fsync)
- Proper YAML parsing/generation

This completes Phase 2 architecture and eliminates the root cause of
config corruption issues.

Related to prior commits: 53dec6010, 3dc073a28, 804a638ea, 131666bc1
2025-11-19 10:35:49 +00:00
rcourtman
a4cd3a275e docs(sensor-proxy): comprehensive config management documentation
Adds complete documentation for the new sensor-proxy config management CLI
implemented in Phase 2. Addresses user-facing aspects of the corruption fix.

**New Documentation:**
- docs/operations/sensor-proxy-config-management.md (469 lines)
  - Complete operations runbook for config management
  - Full CLI reference with examples
  - Migration guide from inline config
  - Architecture explanation
  - Common operational tasks
  - Troubleshooting guide
  - Best practices and automation

**Updated Documentation:**
- cmd/pulse-sensor-proxy/README.md
  - Configuration Management CLI section
  - Allowed Nodes File format
  - Enhanced troubleshooting
  - Config corruption recovery

- docs/TEMPERATURE_MONITORING.md
  - Config validation failure troubleshooting
  - Configuration Management quick reference
  - Cross-links to detailed docs

- docs/TROUBLESHOOTING.md
  - Sensor proxy config validation errors
  - Comprehensive diagnosis steps
  - Automatic and manual recovery

- README.md & docs/README.md
  - Added new runbook to operations index
  - Positioned for discoverability

**Coverage:**
- Both CLI commands fully documented
- Phase 1 & Phase 2 architecture explained
- Migration path from pre-v4.31.1
- Config corruption recovery procedures
- Safe config editing practices
- Automation examples
- Troubleshooting all failure modes

**Documentation Quality:**
- Cross-linked from 5 different documents
- Clear examples for common use cases
- Target audience: system administrators
- Follows project documentation style
- Production-ready

This completes the sensor-proxy config corruption fix by providing users
with comprehensive guidance for the new config management system.

Related to Phase 2 commits 3dc073a28, 804a638ea, 131666bc1
2025-11-19 10:01:33 +00:00
rcourtman
131666bc1f fix(sensor-proxy): lock file permissions and deadlock prevention
Final security hardening based on second Codex review:

**Lock File Permission Fix (Security)**
- Lock file now created with 0600 instead of 0644
- Prevents unprivileged users from opening lock and holding LOCK_EX
- Without this, any local user could DoS the installer/self-heal
- Added f.Chmod(0600) to fix permissions on existing lock files

**Deadlock Prevention (Future-Proofing)**
- Added documentation for future multi-file locking scenarios
- Specifies consistent lock ordering requirement (config.yaml.lock before allowed_nodes.yaml.lock)
- Prevents potential deadlocks if future commands modify multiple files
- Current implementation only locks one file, so no immediate issue

**Testing:**
- Lock file created as `-rw-------` (0600)
- Existing lock files with wrong perms get fixed
- Unprivileged users can no longer DoS the lock

**Codex Validation:**
- Locking is now correct (persistent .lock file, held during entire operation)
- Atomic writes complete while lock is held
- Validation honors actual config paths
- Empty lists supported for operational flexibility
- Error propagation prevents silent failures
- No remaining race conditions or security issues

Phase 2 is now complete and Codex-verified as secure.

Related to Phase 2 fixes commit 804a638ea
2025-11-19 09:51:20 +00:00
rcourtman
804a638ea3 fix(sensor-proxy): critical Phase 2 locking and validation fixes
Fixes critical issues found by Codex code review:

**1. Fixed file locking race condition (CRITICAL)**
- Lock file was being replaced by atomic rename, invalidating the lock
- New approach: lock a separate `.lock` file that persists across renames
- Ensures concurrent writers (installer + self-heal timer) are properly serialized
- Without this fix, corruption was still possible despite Phase 2

**2. Fixed validation to honor configured allowed_nodes_file path**
- validate command now uses loadConfig() to read actual config
- Respects allowed_nodes_file setting instead of assuming default path
- Prevents false positives/negatives when path is customized

**3. Allow empty allowed_nodes lists**
- Empty lists are valid (admin may clear for security, or rely on IPC validation)
- validate no longer fails on empty lists
- set-allowed-nodes --replace with zero nodes now supported
- Critical for operational flexibility

**4. Installer error propagation**
- update_allowed_nodes failures now exit installer with error
- Prevents silent failures that leave stale allowlists
- Self-heal will abort instead of masking CLI errors

**Technical Details:**
- withLockedFile() now locks `<path>.lock` instead of target file
- Lock held for entire duration of read-modify-write-rename
- atomicWriteFile() completes while lock is still held
- Empty lists represented as `allowed_nodes: []` in YAML

**Testing:**
- Lock file created and persists across operations
- Empty list can be written with --replace
- Validation passes with empty lists
- Config path from allowed_nodes_file honored
- Concurrent operations properly serialized

These fixes ensure Phase 2 actually eliminates corruption by design.

Identified by Codex code review
Related to Phase 2 commit 3dc073a28
2025-11-19 09:47:43 +00:00
rcourtman
3dc073a285 feat(sensor-proxy): Phase 2 - atomic config management with CLI
Implements bullet-proof configuration management to completely eliminate
allowed_nodes corruption by design. This builds on Phase 1 (file-only mode)
by replacing all shell/Python config manipulation with proper Go tooling.

**New Features:**
- `pulse-sensor-proxy config validate` - parse and validate config files
- `pulse-sensor-proxy config set-allowed-nodes` - atomic node list updates
- File locking via flock prevents concurrent write races
- Atomic writes (temp file + rename) ensure consistency
- systemd ExecStartPre validation prevents startup with bad config

**Architectural Changes:**
1. Installer now calls config CLI instead of embedded Python/shell scripts
2. All config mutations go through single authoritative writer
3. Deduplication and normalization handled in Go (reuses existing logic)
4. Sanitizer kept as noisy failsafe (warns if corruption still occurs)

**Implementation Details:**
- New cmd/pulse-sensor-proxy/config_cmd.go with cobra commands
- withLockedFile() wrapper ensures exclusive access
- atomicWriteFile() uses temp + rename pattern
- Installer update_allowed_nodes() simplified to CLI calls
- Both systemd service modes include ExecStartPre validation

**Why This Works:**
- Single code path for all writes (no shell/Python divergence)
- File locking serializes self-heal timer + manual installer runs
- Validation gate prevents proxy from starting with corrupt config
- CLI uses same YAML parser as the daemon (guaranteed compatibility)

**Phase 2 Benefits:**
- Corruption impossible by design (not just detected and fixed)
- No more Python dependency for config management
- Atomic operations prevent partial writes
- Clear error messages on validation failures

The defensive sanitizer remains active but now logs loudly if triggered,
allowing us to confirm Phase 2 eliminates corruption in production before
removing the safety net entirely.

This completes the fix for the recurring temperature monitoring outages.

Related to Phase 1 commit 53dec6010
2025-11-19 09:37:49 +00:00
rcourtman
6b5ae18dfe fix: sanitize sensor proxy config during self-heal
Related to #714.
2025-11-18 22:51:40 +00:00
rcourtman
d93a2c1053 Improve host agent binary handling and docker installer purge (Related to #693) 2025-11-18 22:11:44 +00:00
rcourtman
3f46d35a81 feat: make PVE polling interval configurable (related to #467) 2025-11-18 21:30:04 +00:00
rcourtman
936894a5fe Sanitize duplicate allowed_nodes blocks 2025-11-18 19:33:26 +00:00
rcourtman
430c98c5ef Move allowed_nodes to managed file 2025-11-16 10:06:58 +00:00
rcourtman
de5b314842 Improve temperature proxy control-plane flow 2025-11-15 21:49:51 +00:00
rcourtman
20194d9bb7 Add CI build workflow and tighten proxy diagnostics 2025-11-14 13:32:29 +00:00
rcourtman
4693900f8b docs: document sensor proxy log forwarding 2025-11-14 01:12:25 +00:00
rcourtman
1aec43b1ac docs: escape table pipes in sensor proxy readme 2025-11-14 01:01:55 +00:00
rcourtman
9864258207 docs: add operations runbooks and audit fixes 2025-11-14 01:01:21 +00:00
rcourtman
70673c1fdc Improve temperature proxy diagnostics and tests 2025-11-13 22:31:53 +00:00
rcourtman
319bf56a6f Add context timeout to local temperature collection
The getTemperatureLocal() function was running sensors without a timeout,
which could cause HTTP requests to hang if the sensors command stalled.

This adds context.Context parameter and uses exec.CommandContext to ensure
local temperature collection respects the same 15-second timeout as SSH-based
collection.

Fixes issue where HTTP mode worked for remote nodes but timed out for
self-monitoring on the same host.
2025-11-13 20:15:05 +00:00
rcourtman
3c707a7368 Fix HTTP mode reliability: add context timeouts to SSH collection
Critical fix for intermittent HTTP endpoint hangs identified by Codex analysis.

## Root Cause
SSH collection via getTemperatureViaSSH() had no timeout, causing HTTP
handlers to block indefinitely when sensors command hung. This held node-level
mutexes and rate limit slots, creating cascading failures where subsequent
requests queued indefinitely.

## Solution
- Thread request context through to SSH execution
- Add exec.CommandContext with 15s timeout (vs 30s HTTP client timeout)
- Create execCommandWithLimitsContext() to wrap SSH commands
- Ensures handlers always release locks and respond within deadline

## Impact
- HTTP temps endpoint now responds in ~70ms consistently
- Temperature data successfully collected and displayed in Pulse
- Eliminates 'context deadline exceeded' errors
- Prevents node gate deadlocks from slow/stuck SSH sessions

Related to Codex session 019a7e99-00fc-7903-afa3-01100baf47c6
2025-11-13 19:09:50 +00:00
rcourtman
e2bd514899 Fix HTTP mode for pulse-sensor-proxy and improve installer safety
## HTTP Server Fixes
- Add source IP middleware to enforce allowed_source_subnets
- Fix missing source subnet validation for external HTTP requests
- HTTP health endpoint now respects subnet restrictions

## Installer Improvements
- Auto-configure allowed_source_subnets with Pulse server IP
- Add cluster node hostnames to allowed_nodes (not just IPs)
- Fix node validation to accept both hostnames and IPs
- Add Pulse server reachability check before installation
- Add port availability check for HTTP mode
- Add automatic rollback on service startup failure
- Add HTTP endpoint health check after installation
- Fix config backup and deduplication (prevent duplicate keys)
- Fix IPv4 validation with loopback rejection
- Improve registration retry logic with detailed errors
- Add automatic LXC bind mount cleanup on uninstall

## Temperature Collection Fixes
- Add local temperature collection for self-monitoring nodes
- Fix node identifier matching (use hostname not SSH host)
- Fix JSON double-encoding in HTTP client response

Related to #XXX (temperature monitoring fixes)
2025-11-13 18:22:36 +00:00
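A rough sketch of the source subnet check described in the commit above, using `net/netip`. The helper names are hypothetical, and treating a bare IP as a single-host prefix is an assumption about how `allowed_source_subnets` entries might be written.

```go
package main

import "net/netip"

// parseSubnets converts config entries into prefixes. A bare IP is treated
// as a single-host prefix (assumption, see lead-in).
func parseSubnets(entries []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(entries))
	for _, e := range entries {
		p, err := netip.ParsePrefix(e)
		if err != nil {
			addr, addrErr := netip.ParseAddr(e)
			if addrErr != nil {
				return nil, err
			}
			p = netip.PrefixFrom(addr, addr.BitLen())
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// sourceAllowed reports whether remoteAddr ("ip:port", as found in
// http.Request.RemoteAddr) falls inside any allowed prefix.
func sourceAllowed(remoteAddr string, allowed []netip.Prefix) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false
	}
	addr := ap.Addr().Unmap() // normalize IPv4-mapped IPv6 addresses
	for _, p := range allowed {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}
```

An HTTP middleware would call `sourceAllowed(r.RemoteAddr, subnets)` and reject the request before any handler runs when it returns false.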
rcourtman
22f092f941 Add HTTP mode to pulse-sensor-proxy for multi-instance temperature monitoring
This implements HTTP/HTTPS support for pulse-sensor-proxy to enable
temperature monitoring across multiple separate Proxmox instances.

Architecture changes:
- Dual-mode operation: Unix socket (local) + HTTPS (remote)
- Unix socket remains default for security/performance (no breaking change)
- HTTP mode enables temps from external PVE hosts

Backend implementation:
- Add HTTPS server with TLS + Bearer token authentication to sensor-proxy
- Add TemperatureProxyURL and TemperatureProxyToken fields to PVEInstance
- Add HTTP client (internal/tempproxy/http_client.go) for remote proxy calls
- Update temperature collector to prefer HTTP proxy when configured
- Fallback logic: HTTP proxy → Unix socket → direct SSH (if not containerized)

Configuration:
- pulse-sensor-proxy config: http_enabled, http_listen_addr, http_tls_cert/key, http_auth_token
- PVEInstance config: temperature_proxy_url, temperature_proxy_token
- Environment variables: PULSE_SENSOR_PROXY_HTTP_* for all HTTP settings

Security:
- TLS 1.2+ with modern cipher suites
- Constant-time token comparison (timing attack prevention)
- Rate limiting applied to HTTP requests (shared with socket mode)
- Audit logging for all HTTP requests

Next steps:
- Update installer script to support HTTP mode + auto-registration
- Add Pulse API endpoint for proxy registration
- Generate TLS certificates during installation
- Test multi-instance temperature collection

Related to #571 (multi-instance architecture)
2025-11-13 16:13:53 +00:00
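The constant-time token comparison listed under Security in the commit above could look roughly like this. The helper names are hypothetical. Hashing both sides first is a design choice of this sketch (it keeps the compared inputs equal in length) and is not something the commit states.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"strings"
)

// bearerToken extracts the token from an "Authorization: Bearer ..." value.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return "", false
	}
	tok := strings.TrimSpace(header[len(prefix):])
	return tok, tok != ""
}

// tokenMatches compares tokens without an early exit on the first differing
// byte, so response timing does not reveal how much of the token matched.
func tokenMatches(presented, expected string) bool {
	if expected == "" {
		return false // never authenticate against an unset token
	}
	p := sha256.Sum256([]byte(presented))
	e := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(p[:], e[:]) == 1
}
```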
rcourtman
ea80a6153c Increase rate limiting for startup bursts
Increased default rate limits to handle Pulse startup polling:
- Per-peer burst: 5 → 10 requests (handles multi-node clusters with retries)
- Per-peer interval: 1s → 500ms (1 QPS → 2 QPS, 60/min → 120/min)

This prevents the proxy from being disabled during Pulse startup when it
polls all nodes simultaneously. The previous limits were too restrictive
for clusters with 3+ nodes.
2025-11-13 15:42:26 +00:00