Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-06 16:16:26 +00:00

Author	SHA1	Message	Date
rcourtman	c1bf03fe39	fix: use proper Monitor constructor in PMG tests to initialize all maps Fixes panic: assignment to entry in nil map in PMG polling tests. Problem: Tests were manually creating Monitor structs without initializing internal maps like pollStatusMap, causing nil map panics when recordTaskResult() tried to update task status. Root Cause: - TestPollPMGInstancePopulatesState (line 90) - TestPollPMGInstanceRecordsAuthFailures (line 189) Both created Monitor with only partial field initialization, missing: - pollStatusMap - dlqInsightMap - instanceInfoCache - Other internal state maps Solution: Changed both tests to use New() constructor which properly initializes all maps and internal state (monitor.go:1541). This ensures tests match production initialization and will automatically pick up any future map additions. Tests: ✅ TestPollPMGInstancePopulatesState - now passes ✅ TestPollPMGInstanceRecordsAuthFailures - now passes ✅ All monitoring tests pass (0.125s) Follows best practice: use constructors instead of manual struct creation to maintain initialization invariants.	2025-10-20 15:22:23 +00:00
rcourtman	039a07b8b0	test: add X-RateLimit-Limit header regression test (#578 ) test: add X-RateLimit-Limit header regression test	2025-10-20 16:14:40 +01:00
rcourtman	97871bec82	feat: implement updates rollback logic (Phase 1 follow-up) Implement complete rollback functionality for systemd/LXC deployments: Rollback Strategy: - Downloads old binary from GitHub releases - Restores config from timestamped backups - Service detection (pulse/pulse-backend/pulse-hot-dev) - Comprehensive health verification Implementation: Main rollback flow: 1. Create rollback history entry 2. Detect active service name 3. Download old binary version from GitHub 4. Stop Pulse service 5. Create safety backup of current config 6. Restore config from backup directory 7. Install old binary 8. Start service 9. Wait for health check (30s timeout) 10. Update rollback history (success/failure) Helper Functions: - detectServiceName(): Auto-detect active service from candidates - downloadBinary(): Download specific version from GitHub releases - Auto-detects architecture (amd64/arm64) - Validates download success - Sets executable permissions - stopService/startService(): Systemctl service management - restoreConfig(): Atomic config restoration - installBinary(): Safe binary installation with backup - waitForHealth(): Retry health endpoint with timeout Safety Features: - Safety backup before restore (rollback-safety timestamp) - Pre-rollback binary backup (.pre-rollback) - Health check verification post-rollback - Comprehensive error logging - History tracking for audit Limitations: - Binary backup deleted by install.sh (downloads from GitHub) - Network dependency for binary retrieval - Config-only backups from current install.sh Testing: - Compiles cleanly - Ready for unit/integration tests Closes Phase 1 technical debt - rollback capability now functional. Part of Phase 1 Security Hardening follow-up work	2025-10-20 15:13:38 +00:00
rcourtman	469d11fc7e	docs: add comprehensive scheduler health API documentation Add detailed API reference and update rollout playbook: New: docs/api/SCHEDULER_HEALTH.md - Complete endpoint reference for /api/monitoring/scheduler/health - Request/response structure with field descriptions - Enhanced "instances" array documentation - Example responses showing all states (healthy, transient, DLQ) - Useful jq queries for troubleshooting: - Find instances with errors - List DLQ entries - Show open circuit breakers - Sort by failure streaks - Migration guide (legacy → new fields) - Troubleshooting examples with real scenarios Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Enhanced "Accessing Scheduler Health API" section (§6) - Added examples using new instances[] array - Updated queries to use pollStatus, breaker, deadLetter fields - Practical jq commands for operators Key Documentation Features: - Complete JSON schema with examples - All new fields documented with types and descriptions - Real-world troubleshooting scenarios - Copy-paste ready jq queries - Migration path for existing integrations - Backward compatibility notes Operators can now: - Find error messages without log digging - Understand circuit breaker states - Track DLQ entries with full context - Diagnose issues using single API call Part of Phase 2 follow-up - enhanced observability	2025-10-20 15:13:38 +00:00
rcourtman	9b1709a05b	feat: enhance scheduler health API with rich instance metadata Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health New Response Structure: Enhanced "instances" array with per-instance details: - Instance metadata: displayName, type, connection URL - Poll status: last success/error timestamps, error messages, error category - Circuit breaker: state, timestamps, failure counts, retry windows - Dead letter: present flag, reason, attempt history, retry schedule Implementation: Data structures: - instanceInfo: cache of display names, URLs, types - pollStatus: tracks successes/errors with timestamps and categories - dlqInsight: DLQ entry metadata (reason, attempts, schedule) - circuitBreaker: enhanced with stateSince, lastTransition Tracking logic: - buildInstanceInfoCache: populate metadata from config on startup - recordTaskResult: track poll outcomes, error details, categories - sendToDeadLetter: capture DLQ insights (reason, timestamps) - circuitBreaker: record state transitions with timestamps Backward Compatible: - Existing fields (deadLetter, breakers, staleness) unchanged - New "instances" array is additive - Old clients can ignore new fields Testing: - Unit test: TestSchedulerHealth_EnhancedResponse validates all fields - Integration tests: still passing (55s) - All error tracking and breaker history verified Operator Benefits: - Diagnose issues without log digging - See error messages directly in API - Understand breaker states and retry schedules - Track DLQ entries with full context - Single API call for complete instance health view Example: Quickly identify "401 unauthorized" on specific PBS instance, see it's in DLQ after 5 retries, and know when next retry scheduled. Part of Phase 2 follow-up work to improve observability.	2025-10-20 15:13:38 +00:00
rcourtman	0fcfad3dc5	feat: add shared script library system and refactor docker-agent installer Implements a comprehensive script improvement infrastructure to reduce code duplication, improve maintainability, and enable easier testing of installer scripts. ## New Infrastructure ### Shared Library System (scripts/lib/) - common.sh: Core utilities (logging, sudo, dry-run, cleanup management) - systemd.sh: Service management helpers with container-safe systemctl - http.sh: HTTP/download helpers with curl/wget fallback and retry logic - README.md: Complete API documentation for all library functions ### Bundler System - scripts/bundle.sh: Concatenates library modules into single-file installers - scripts/bundle.manifest: Defines bundling configuration for distributables - Enables both modular development and curl\|bash distribution ### Test Infrastructure - scripts/tests/run.sh: Test harness for running all smoke tests - scripts/tests/test-common-lib.sh: Common library validation (5 tests) - scripts/tests/test-docker-agent-v2.sh: Installer smoke tests (4 tests) - scripts/tests/integration/: Container-based integration tests (5 scenarios) - All tests passing ✓ ## Refactored Installer ### install-docker-agent-v2.sh - Reduced from 1098 to 563 lines (48% code reduction) - Uses shared libraries for all common operations - NEW: --dry-run flag support - Maintains 100% backward compatibility with original - Fully tested with smoke and integration tests ### Key Improvements - Sudo escalation: 100+ lines → 1 function call - Download logic: 51 lines → 1 function call - Service creation: 33 lines → 2 function calls - Logging: Standardized across all operations - Error handling: Improved with common library ## Documentation ### Rollout Strategy (docs/installer-v2-rollout.md) - 3-phase rollout plan (Alpha → Beta → GA) - Feature flag mechanism for gradual deployment - Testing checklist and success metrics - Rollback procedures and communication plan ### Developer Guides - docs/script-library-guide.md: Complete library usage guide - docs/CONTRIBUTING-SCRIPTS.md: Contribution workflow - docs/installer-v2-quickref.md: Quick reference for operators ## Metrics - Code reduction: 48% (1098 → 563 lines) - Reusable functions: 0 → 30+ - Test coverage: 0 → 8 test scenarios - Documentation: 0 → 5 comprehensive guides ## Testing All tests passing: - Smoke tests: 2/2 passed (8 test cases) - Integration tests: 5/5 scenarios passed - Bundled output: Syntax validated, dry-run tested ## Next Steps This lays the foundation for migrating other installers (install.sh, install-sensor-proxy.sh) to use the same pattern, reducing overall maintenance burden and improving code quality across the project.	2025-10-20 15:13:38 +00:00
rcourtman	ce5ad64810	docs: defer circuit breaker/DLQ management endpoints (Phase 2 Task 11) Document decision to defer mutation endpoints after soak testing: Assessment Results: - Integration tests (55s, 12 instances): Automatic recovery worked perfectly - Soak tests (2-240min, 80 instances): No manual intervention needed - Circuit breakers: Opened/closed automatically as designed - DLQ routing: Permanent failures handled correctly Current Capabilities (Sufficient): - Read-only scheduler health API provides full visibility - Operator workarounds: service restart, feature flag toggle - Grafana alerting: queue depth, staleness, DLQ, breakers Why Defer: - No operational need demonstrated in testing - Implementation requires auth/RBAC/audit/UI work - Cost not justified until production usage reveals need - Can add later when data shows actual pain points Future Design Notes: - POST /api/monitoring/breakers/{instance}/reset - POST /api/monitoring/dlq/retry (all or specific) - DELETE /api/monitoring/dlq/{instance} - Auth, audit, rate limiting, UI integration required Re-evaluation Criteria: - Operators request controls >3x in 30 days - Troubleshooting steps inadequate - Service restarts too disruptive - Production incidents need surgical controls Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns. Part of Phase 2 - Adaptive Polling completion	2025-10-20 15:13:38 +00:00
rcourtman	cb8be81f1d	docs: add adaptive polling production rollout playbook (Phase 2 Task 10) Add comprehensive operator playbook for production enablement: Prerequisites: - Test suite validation (unit, integration, soak) - Monitoring readiness (Grafana dashboards, alerts) - Configuration management and rollback planning - Stakeholder sign-off Staging Rollout: - Feature flag enablement steps - Verification procedures (scheduler health API) - 24-48h observation window with success criteria - Metric checkpoints at 0h, 12h, 24h Production Rollout: - Gradual strategy (25% nodes every 2 hours) - Low-traffic maintenance window - Per-cluster monitoring during rollout - Success criteria and completion validation Grafana/Alert Configuration: - Dashboard panels: queue depth, staleness, throughput, breakers/DLQ - Alert thresholds: - Queue depth > 1.5× instances for >10min (Warning) - Staleness > 60s for >5min (Critical) - DLQ growth (Warning) - Stuck breakers >10min (Critical) Rollback Procedure: - Clear disable/restart steps - Verification of rollback success - Post-rollback actions and incident reporting Troubleshooting: - Symptom/cause/action table - Scheduler health API access guide - Immediate rollback triggers Operators can now safely enable adaptive polling following this step-by-step playbook. Part of Phase 2 Task 10 (Documentation)	2025-10-20 15:13:38 +00:00
rcourtman	14d06a1654	test: add soak test with runtime instrumentation (Phase 2 Task 9d) Add comprehensive soak testing capabilities: Runtime Instrumentation: - Periodic sampling of heap, stack, goroutines, GC count - Sample every 10s during harness runs - HarnessReport includes full RuntimeSamples history - Detect memory leaks (>10% sustained growth) - Detect goroutine leaks (>20 leaked goroutines) Soak Test: - TestAdaptiveSchedulerSoak with 15min+ duration - Skip unless -soak flag or HARNESS_SOAK_MINUTES set - 80 synthetic instances (60 healthy, 15 transient, 5 permanent) - Configurable duration via env var - Validates: heap growth <10%, goroutines stable, queue depth bounded - Staleness threshold: 45s for long-running tests Wrapper Script: - testing-tools/run_adaptive_soak.sh for easy execution - Accepts duration in minutes: ./run_adaptive_soak.sh 30 - Logs to tmp/adaptive_soak_<timestamp>.log - Sets proper timeout (duration + 5min buffer) Test Results (2-minute validation): - 80 instances, 17 samples - Heap: 2.3MB → 3.1MB (healthy) - Goroutines: 16 → 6 (no leak, actually decreased) - Circuit breakers: correctly blocking transient failures Run with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerSoak -soak -timeout 20m Part of Phase 2 Task 9 (Integration/Soak Testing)	2025-10-20 15:13:38 +00:00
rcourtman	2636ba9137	test: add comprehensive integration test harness for adaptive polling (Phase 2 Task 9c) Add PollExecutor seam and integration test infrastructure: PollExecutor Interface: - Add pluggable executor interface for testability - Implement realExecutor wrapping existing poll functions - Add SetExecutor() for test injection - Zero impact on production behavior Integration Test Harness: - Build-tagged integration tests (go:build integration) - Synthetic workload generator with configurable scenarios - Fake executor simulating latencies, failures, recovery - Runtime metrics collection (queue depth, staleness, goroutines) Comprehensive Assertions: - Queue depth bounds: stays within 1.5× instance count - Staleness: healthy instances <20s, multiple poll cycles - Circuit breakers: transient failures recover, permanent stay blocked - Dead-letter queue: only permanent failures routed - Scheduler health: snapshot consistency validation Test Scenarios: - 10 healthy PVE instances (rapid polling) - 1 transient failure instance (fail → recover) - 1 permanent failure instance (DLQ routing) - 55s test duration with 3s base intervals - Validates full adaptive scheduler lifecycle Runs with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration Part of Phase 2 Task 9 (Integration/Soak Testing)	2025-10-20 15:13:38 +00:00
rcourtman	d5c7a3494b	chore: remove deprecated Pulse+ agent metrics and add audit log rotation docs Removed all legacy Pulse+ agent metrics infrastructure (cloud-relay) which has been fully replaced by the new docker agent and temperature agent implementations. Changes: - Remove cloud-relay directory and all related binaries (relay, relay-linux, etc.) - Remove Pulse+ documentation (AGENT_METRICS_IMPLEMENTATION.md, AGENT_METRICS_SETUP.md) - Clean up pulse-relay references in workflows and release checklist - Add audit log rotation documentation for sensor proxy hash-chained logs - Update .gitignore to remove cloud-relay/ entry The new docker and temp agents remain fully functional and unaffected by this cleanup.	2025-10-20 15:13:38 +00:00
rcourtman	7d422d2909	feat: add professional logging with runtime configuration and performance optimization Implements structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug level guards for hot paths, enriched error messages with actionable context, and stack trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.	2025-10-20 15:13:38 +00:00
rcourtman	fa21e9c69c	chore: remove completed phase summary documents Removed PHASE1_SUMMARY.md and PHASE2_SUMMARY.md as both phases are complete. All relevant documentation has been integrated into the main docs: - Security hardening docs in SECURITY.md - Adaptive polling architecture in docs/monitoring/ADAPTIVE_POLLING.md	2025-10-20 15:13:38 +00:00
rcourtman	b3f37a798c	docs: update Phase 2 summary to reflect completion (9/10 tasks = 90%) Updated PHASE2_SUMMARY.md to include: - ✅ Task 8: Scheduler health API endpoint completion - ✅ Task 9: Unit testing completion (40+ test cases) - Updated git commit history (9 commits total) - Revised known limitations (removed API/testing gaps) - Updated future work section Phase 2 achievements: - 9/10 tasks complete (only integration/soak tests deferred) - 40+ unit tests covering backoff, circuit breakers, staleness - Full scheduler health API with authentication - Comprehensive documentation and rollout plan - Production-ready with feature flag control Remaining work (deferred to future): - Integration tests with mock PVE/PBS clients - Soak tests for extended queue stability - Write endpoints for circuit breaker/DLQ management	2025-10-20 15:13:38 +00:00
rcourtman	25b797f18d	test: add comprehensive staleness tracker unit tests (Phase 2 Task 9b) Added 17 test cases covering: - UpdateSuccess/UpdateError state management - Staleness scoring (fresh, stale, max-stale, never-succeeded) - Score normalization and capping (0.0 to 1.0 range) - SetBounds behavior and defaults - Snapshot merging logic - Snapshot() API for full state export - Nil safety and concurrent access All tests verify correct freshness calculation based on lastSuccess timestamps and configurable maxStale bounds. Phase 2 testing status: - ✅ Backoff exponential growth and jitter (13 tests) - ✅ Circuit breaker state machine (10 tests) - ✅ Staleness tracker scoring (17 tests) - Total: 40+ unit tests covering core scheduling logic	2025-10-20 15:13:38 +00:00
rcourtman	24ae6d8d78	test: add comprehensive unit tests for backoff and circuit breaker (Phase 2 Task 9a) Added 30+ test cases covering: Backoff tests (backoff_test.go): - Exponential growth with multiplier - Jitter distribution and bounds - Max delay capping - Edge cases (negative attempts, zero config values) - Realistic production scenarios Circuit breaker tests (circuit_breaker_test.go): - State transitions: closed → open → half-open → closed - Retry interval backoff with bit-shifting (5s << failureCount) - Half-open window behavior - Concurrent access safety - Default parameter validation All tests pass with proper handling of time-based state transitions and exponential backoff mechanics (bit-shift based retry intervals).	2025-10-20 15:13:38 +00:00
rcourtman	160adeb3b8	feat: add scheduler health API endpoint (Phase 2 Task 8) Task 8 of 10 complete. Exposes read-only scheduler health data including: - Queue depth and distribution by instance type - Dead-letter queue inspection (top 25 tasks with error details) - Circuit breaker states (instance-level) - Staleness scores per instance New API endpoint: GET /api/monitoring/scheduler/health (requires authentication) New snapshot methods: - StalenessTracker.Snapshot() - exports all staleness data - TaskQueue.Snapshot() - queue depth & per-type distribution - TaskQueue.PeekAll() - dead-letter task inspection - circuitBreaker.State() - exports state, failures, retryAt - Monitor.SchedulerHealth() - aggregates all health data Documentation updated with API spec, field descriptions, and usage examples.	2025-10-20 15:13:38 +00:00
rcourtman	5fbdf6099f	docs: add adaptive polling architecture guide (Phase 2 Task 10) Comprehensive documentation for Phase 2 adaptive polling: - Architecture overview with component diagram - Configuration guide (env vars, defaults, feature flag) - Prometheus metrics reference (7 new metrics) - Circuit breaker & backoff behavior explanation - Dead-letter queue operational guidance - Rollout plan (dev/QA → staged → full) - Troubleshooting guide for common issues Task 10 of 10 complete. Phase 2: 8/10 tasks implemented (80%).	2025-10-20 15:13:37 +00:00
rcourtman	b1f445b33d	feat: implement error handling with circuit breakers and backoff (Phase 2 Task 7) Adds comprehensive error resilience: - Circuit breaker with closed/open/half-open states (3 failures = trip) - Exponential backoff with jitter (2s initial, 2x multiplier, 5min max) - Dead-letter queue for tasks exceeding 5 retry attempts - Error classification (transient vs permanent) using internal/errors helpers - Per-instance failure tracking and breaker state management - Integration with staleness tracker for outcome recording Task 7 of 10 complete (70%). Ready for API surfaces and testing.	2025-10-20 15:13:37 +00:00
rcourtman	aa5c08ad4a	feat: implement priority queue-based task execution (Phase 2 Task 6) Replaces immediate polling with queue-based scheduling: - TaskQueue with min-heap (container/heap) for NextRun-ordered execution - Worker goroutines that block on WaitNext() until tasks are due - Tasks only execute when NextRun <= now, respecting adaptive intervals - Automatic rescheduling after execution via scheduler.BuildPlan - Queue depth tracking for backpressure-aware interval adjustments - Upsert semantics for updating scheduled tasks without duplicates Task 6 of 10 complete (60%). Ready for error/backoff policies.	2025-10-20 15:13:37 +00:00
rcourtman	c554380cb5	feat: verify adaptive interval logic implementation (Phase 2 Task 5) Confirms adaptive scheduling logic is fully operational: - EMA smoothing (alpha=0.6) to prevent interval oscillations - Staleness-based interpolation between min/max intervals - Error penalty (0.6x per error) for faster recovery detection - Queue depth stretch (0.1x per task) for backpressure handling - ±5% jitter to prevent thundering herd effects - Per-instance state tracking for smooth transitions Task 5 of 10 complete. Scheduler foundation ready for queue-based execution.	2025-10-20 15:13:37 +00:00
rcourtman	c7d1abf874	feat: implement staleness tracker for adaptive polling (Phase 2 Task 4) Adds freshness metadata tracking for all monitored instances: - StalenessTracker with per-instance last success/error/mutation timestamps - Change hash detection using SHA1 for detecting data mutations - Normalized staleness scoring (0-1 scale) based on age vs maxStale - Integration with PollMetrics for authoritative last-success data - Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError - Connected to scheduler as StalenessSource implementation Task 4 of 10 complete. Ready for adaptive interval logic.	2025-10-20 15:13:37 +00:00
rcourtman	57429900a6	feat: add adaptive polling scheduler infrastructure (Phase 2 Tasks 1-3) Implements adaptive scheduling foundation for Phase 2: - Poll cycle metrics: duration, staleness, queue depth, in-flight counters - Adaptive scheduler with pluggable staleness/interval/enqueue interfaces - Config support: ADAPTIVE_POLLING_ENABLED flag + min/max/base intervals - Feature flag defaults to disabled for safe rollout - Scheduler wiring into Monitor with conditional instantiation Tasks 1-3 of 10 complete. Ready for staleness tracker implementation.	2025-10-20 15:13:37 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00
rcourtman	6619dc803e	refactor: use strconv.Itoa instead of string(rune()) in test Replace string(rune(i)) with strconv.Itoa(i) in hub_concurrency_test.go for generating client IDs. While this is test code and not a production bug, it uses the same incorrect pattern that caused the PR #575 bug. This ensures consistent best practices across the codebase and avoids confusion for developers who might copy this pattern. Related: #575	2025-10-20 15:12:14 +00:00
rcourtman	8d6346a008	test: add X-RateLimit-Limit header regression test Add regression test for PR #575 to ensure rate limit headers are formatted as decimal strings (e.g., "10") instead of Unicode control characters. Also fixes pre-existing fmt.Sprintf argument count mismatch in PVE setup script (internal/api/config_handlers.go:3077). The template had 28 format specifiers (excluding %%s escape sequence) but was only receiving 24 arguments. Added missing pulseURL and tokenName arguments to match template. Related: #575	2025-10-20 15:10:59 +00:00
rcourtman	20d94f4c90	Fix X-RateLimit-Limit header value (#575 ) Fix X-RateLimit-Limit header value	2025-10-20 15:57:28 +01:00
rcourtman	29f4879cd4	test: add comprehensive security tests and documentation Implements all remaining Codex recommendations before launch: 1. Privileged Methods Tests: - TestPrivilegedMethodsCompleteness ensures all host-side RPCs are protected - Will fail if new privileged RPC is added without authorization - Verifies read-only methods are NOT in privilegedMethods 2. ID-Mapped Root Detection Tests: - TestIDMappedRootDetection covers all boundary conditions - Tests UID/GID range detection (both must be in range) - Tests multiple ID ranges, edge cases, disabled mode - 100% coverage of container identification logic 3. Authorization Tests: - TestPrivilegedMethodsBlocked verifies containers can't call privileged RPCs - TestIDMappedRootDisabled ensures feature can be disabled - Tests both container and host credentials 4. Comprehensive Security Documentation (23 KB): - Architecture overview with diagrams - Complete authentication & authorization flow - Rate limiting details (already implemented: 20/min per peer) - SSH security model and forced commands - Container isolation mechanisms - Monitoring & alerting recommendations - Development mode documentation (PULSE_DEV_ALLOW_CONTAINER_SSH) - Troubleshooting guide with common issues - Incident response procedures Rate Limiting Status: - Already implemented in throttle.go (20 req/min, burst 10, max 10 concurrent) - Per-peer rate limiting at line 328 in main.go - Per-node concurrency control at line 825 in main.go - Exceeds Codex's requirements All tests pass. Documentation covers all security aspects. Addresses final Codex recommendations for production readiness.	2025-10-19 16:47:13 +00:00
rcourtman	1519390f08	security: enhance logging for denied privileged method calls Improved security audit trail for attempted container privilege escalation: - Added detailed logging when containers attempt privileged methods - Logs UID, GID, PID, correlation ID, and method name - Marked with "SECURITY:" prefix for easy filtering/alerting - Helps operators detect and investigate compromise attempts Example log output: SECURITY: Container attempted to call privileged method - access denied method=ensure_cluster_keys uid=101000 gid=101000 pid=12345 Addresses Codex recommendation for comprehensive logging of denied privileged RPCs to enable monitoring and alerting on attempted abuse.	2025-10-19 16:40:42 +00:00
rcourtman	1e25fa572a	security: add resilience and error handling to tempproxy client Implements comprehensive client-side improvements for production reliability: 1. Context Support with Deadlines: - Added callWithContext() for context-aware RPC calls - Respects context deadlines and cancellation - Prevents goroutine pileup under network issues 2. Exponential Backoff with Jitter: - Automatic retry with exponential backoff (100ms → 10s) - ±10% jitter to prevent thundering herd - Max 3 retries for transient failures - Smart retry decision based on error classification 3. Error Classification: - ProxyError type with classification (Transport, Auth, SSH, Sensor, Timeout) - Retryable vs non-retryable error identification - Better error messages for debugging - Structured error handling throughout 4. Improved Connection Handling: - DialContext for cancellable connections - Proper deadline propagation - Clean separation of single-attempt vs retry logic - Legacy call() method preserved for backwards compatibility Security Notes: - SSH fallback already blocked in containers (temperature.go:69-77) - Per-client token auth not needed after method-level authz (commit `d55112ac4`) - ID-mapped root blocked from privileged methods Performance: - No retry on non-retryable errors (auth, sensor failures) - Context cancellation short-circuits retry loops - Jitter prevents synchronized retry storms Addresses Codex findings #4 and #5 from security audit.	2025-10-19 16:37:11 +00:00
rcourtman	a974fbf011	docs: remove security hardening document User prefers to track these issues differently	2025-10-19 16:34:11 +00:00
rcourtman	2cd5a784e9	docs: add temperature monitoring security hardening roadmap Comprehensive security improvement plan for post-launch hardening: Completed Fixes: ✅ SSH command injection (commit `124ab7826`) ✅ Unauthorized key distribution (commit `d55112ac4`) Post-Launch Tasks: 📋 #3: Socket ACL multi-tenancy improvements (v4.24.0) - Options: per-client tokens, mutual TLS, or enhanced audit - Addresses container compromise blast radius 📋 #4: Direct SSH fallback policy (v4.24.0) - Options: remove entirely, opt-in with warnings, or read-only key - Resolves tension between security and availability 📋 #5: Client resilience & observability (v4.25.0) - Context deadlines, exponential backoff, error classification - Circuit breaker pattern, structured metrics - Prevents goroutine pileup and improves debuggability Includes: - Detailed problem statements and proposed solutions - Security vs usability trade-offs for each option - Testing plan and documentation improvements - Open questions for architectural decisions - Target timelines and decision points This roadmap ensures we can ship the temperature monitoring feature now while maintaining clear visibility into remaining hardening work.	2025-10-19 16:33:03 +00:00
rcourtman	026b9c5b77	security: add method-level authorization for privileged RPC methods RELEASE BLOCKER FIX - Prevents containers from triggering host-level operations. Added host-only method restrictions: - RPCEnsureClusterKeys (SSH key distribution) - RPCRegisterNodes (node registration) - RPCRequestCleanup (cleanup operations) Implementation: - New privilegedMethods map defines host-only methods - Request handler checks if method is privileged - If privileged AND caller is from ID-mapped UID range (container), reject - Host processes (real root, configured UIDs) can still call privileged methods - Containers can still call get_temperature and get_status Security impact: - Prevents compromised containers from: • Triggering unwanted SSH key distribution to cluster nodes • Learning about cluster topology via forced registration • DOS attacks by repeatedly calling key distribution • Other host-level privileged operations Without this fix, any container with root could call these methods after authentication, undermining the security isolation between container and host. Addresses high-severity finding #2 from security audit.	2025-10-19 16:31:50 +00:00
rcourtman	3a6a4fd362	security: fix SSH command injection vulnerabilities in pulse-sensor-proxy CRITICAL security fixes for pulse-sensor-proxy: 1. Strengthened hostname validation regex: - Now requires hostnames to start with alphanumeric character - Prevents SSH option injection via hostnames starting with '-' - Pattern: ^[a-zA-Z0-9][a-zA-Z0-9._-]{0,63}$ (1-64 chars total) - Added IPv4 and IPv6 validation regexes for future use 2. Added validation to vulnerable V1 RPC handlers: - handleGetTemperature: Now validates node parameter before SSH - handleRegisterNodes: Now validates discovered cluster nodes - Previously these handlers passed unsanitized input directly to SSH 3. Defense in depth: - V2 handlers already had validation (now using improved regex) - Multiple layers of protection against malicious node identifiers - Validation prevents container from passing SSH options as hostnames Without these fixes, a compromised container could potentially inject SSH options by providing malicious node names, though the 'root@' prefix provided some mitigation. Addresses high-severity finding from security audit.	2025-10-19 16:28:38 +00:00
rcourtman	67862e6f11	feat: add user-friendly explanation for socket bind mount Added clear messaging to explain why the socket bind mount is configured, focusing on the security benefits rather than technical implementation. Changes: - Add explanatory header "Secure Container Communication Setup" - Explain the three key benefits: • Container communicates via Unix socket (not SSH) • No SSH keys exposed inside container (enhanced security) • Proxy on host manages all temperature collection - Update technical messages to be more user-friendly: • "Configuring socket bind mount" instead of "Ensuring..." • "Restarting container to activate secure communication" • "Verifying secure communication channel" • "✓ Secure socket communication ready" • "Configuring Pulse to use proxy" This helps users understand WHY the bind mount exists (security) rather than just seeing technical implementation details.	2025-10-19 16:22:03 +00:00
rcourtman	ee6d9d4877	feat: add user confirmation prompt for pulse-sensor-proxy installation Adds explicit user consent before installing pulse-sensor-proxy on the Proxmox host, with support for noninteractive/scripted installations. Changes: - Add --proxy flag with yes/no/auto modes - Add prompt_proxy_installation() function that explains what will be installed and asks for user confirmation - Detect Docker in container and preselect 'yes' as default when found - Support noninteractive mode via --proxy flag for automated installs - Skip proxy installation if user declines or --proxy=no specified - Auto-detect mode (--proxy=auto) installs only if Docker is present Behavior: - Default (no flag): Prompt user with explanation of what will be installed - --proxy=yes: Install without prompting (for turnkey workflows) - --proxy=no: Skip proxy installation entirely - --proxy=auto: Install only if Docker is detected in container - Docker detected: Default prompt answer changes to [Y/n] instead of [y/N] When user declines, clear message explains temperature monitoring will be unavailable and provides command to enable later. This provides transparency about host-level changes while preserving the turnkey workflow for automated/Docker installations.	2025-10-19 16:13:46 +00:00
rcourtman	171723a7d3	fix: automatically restart container when proxy mount is configured Instead of warning the user to restart the container manually, the script now automatically restarts it when the socket mount configuration is updated. This ensures the mount is immediately active and temperature monitoring works right away without user intervention. Uses 'pct restart' if running, 'pct start' if stopped.	2025-10-19 15:56:31 +00:00
rcourtman	d3c2a01140	fix: pass --main flag through to inner LXC installation When installing with --main flag, the outer install.sh now passes --main to the inner installation running inside the LXC. This ensures that pulse-sensor-proxy is built from source inside the container, so the binary can be copied to the Proxmox host using 'pct pull'. Previously, the --main flag was not passed through, causing the inner installation to download the release binary instead of building from source, which resulted in an empty binary being copied to the host.	2025-10-19 15:40:29 +00:00
rcourtman	762df9629b	fix: use locally-built pulse-sensor-proxy when installing with --main flag When --main flag is specified, install.sh now copies the binary that was built inside the LXC to the Proxmox host using 'pct pull' and passes it to install-sensor-proxy.sh with --local-binary flag. This ensures that when users build from source, no binary downloads are attempted - everything is built as expected. Release installs continue to use the download fallback mechanism.	2025-10-19 15:26:16 +00:00
rcourtman	f81d77bb98	fix: fall back to Pulse server when GitHub download fails for pulse-sensor-proxy The install-sensor-proxy.sh script now tries GitHub releases first, then falls back to downloading from the Pulse server if GitHub fails or doesn't have the binary (common when building from main). The LXC installer sets PULSE_SENSOR_PROXY_FALLBACK_URL to point to the Pulse server running inside the newly created LXC, ensuring the proxy binary can be downloaded from /api/install/pulse-sensor-proxy. This fixes the issue where installing with --main would fail to install pulse-sensor-proxy on the host because GitHub releases don't include it yet.	2025-10-19 15:17:59 +00:00
rcourtman	97c895dbb1	fix: build and install pulse-sensor-proxy when building from source When users install with --main, the install script now: - Builds pulse-sensor-proxy from source - Installs it to /opt/pulse/bin/pulse-sensor-proxy - Copies install-docker.sh and install-sensor-proxy.sh to scripts dir This ensures the turnkey Docker installer can download pulse-sensor-proxy from the Pulse server (/api/install/pulse-sensor-proxy) instead of failing. Previously, building from source would skip pulse-sensor-proxy entirely, causing the Docker installer to fail when trying to set up temperature monitoring.	2025-10-19 15:12:31 +00:00
rcourtman	049f79987f	feat: add turnkey Docker installer with automatic proxy setup Adds a one-command Docker deployment flow that: - Detects if running in LXC and installs Docker if needed - Automatically installs pulse-sensor-proxy on the Proxmox host - Configures bind mount for proxy socket into LXC - Generates optimized docker-compose.yml with proxy socket - Enables temperature monitoring via host-side proxy The install-docker.sh script handles the complete setup including: - Docker installation (if needed) - ACL configuration for container UIDs - Bind mount setup - Automatic apparmor=unconfined for socket access Accessible via: curl -sSL http://pulse:7655/api/install/install-docker.sh \| bash	2025-10-19 15:03:24 +00:00
rcourtman	a841a1a6fe	fix: show success message instead of warning when using pulse-sensor-proxy When the setup script detects TEMPERATURE_PROXY_KEY (proxy is available), it now shows a clear success message instead of attempting SSH verification. The verification check doesn't work with proxy-based setups since the container doesn't have SSH keys - all temperature collection happens via the Unix socket to pulse-sensor-proxy, which handles SSH. Now shows: ✓ Temperature monitoring configured via pulse-sensor-proxy Temperature data will appear in the dashboard within 10 seconds Instead of the misleading: ⚠️ Unable to verify SSH connectivity. Temperature data will appear once SSH connectivity is configured.	2025-10-19 14:06:18 +00:00
rcourtman	557eedb247	fix: detect and use proxy SSH key in setup script for Docker deployments When pulse-sensor-proxy is available, the setup script now automatically detects and uses the proxy's SSH public key instead of trying to generate keys inside the container. This fixes temperature monitoring setup for Docker deployments where: - Container has proxy socket mounted at /mnt/pulse-proxy - Proxy handles SSH connections to nodes - Setup script needs to distribute the proxy's key, not container's key The fix queries /api/system/proxy-public-key during setup script generation and overrides SSH_SENSORS_PUBLIC_KEY if the proxy is available. Tested with Docker on native Proxmox host (delly) - temperatures collected successfully via proxy socket.	2025-10-19 13:50:08 +00:00
Sangar	ce21a6b94f	Fix X-RateLimit-Limit header value	2025-10-19 11:43:03 +02:00
rcourtman	21712111e7	fix: enable variable expansion in cluster node SSH key heredoc Changed heredoc delimiter from <<'EOF' to <<EOF to allow bash variable expansion. Previously $SSH_PUBLIC_KEY and $SSH_RESTRICTED_KEY_ENTRY were being passed as literal strings instead of their actual values, so cluster nodes never received the correct SSH keys. This fixes cluster node ProxyJump setup - now both restricted and unrestricted keys are properly added to cluster nodes.	2025-10-19 09:08:00 +00:00
rcourtman	c17059ca8e	fix: add ProxyJump key to all cluster nodes automatically The setup script now adds both the restricted and unrestricted SSH keys to ALL cluster nodes, not just the first one. This makes temperature monitoring truly turnkey - you say 'yes' to configure cluster nodes and it automatically sets up both keys on each node. This ensures: - All nodes can act as ProxyJump hosts if needed - All nodes can provide temperature data via sensors - No manual SSH key configuration required Fixes turnkey cluster temperature monitoring setup.	2025-10-19 09:02:28 +00:00
rcourtman	bfde490ad4	fix: add unrestricted SSH key for ProxyJump on jump host When using ProxyJump for cluster temperature monitoring, the jump host (typically the first cluster node) needs an unrestricted SSH key to allow connection forwarding. Previously only the restricted key with command="sensors -j" was added, which blocked ProxyJump. Now the setup script adds TWO keys: 1. Unrestricted key (for ProxyJump/connection forwarding) 2. Restricted key (for running sensors -j directly) This allows containerized Pulse to: - Connect through the jump host to other cluster nodes - Collect temperature data from all cluster members Fixes cluster temperature monitoring for Docker/LXC deployments.	2025-10-19 08:56:52 +00:00
rcourtman	78c2228b89	fix: add HostName entries for cluster nodes in SSH config Added logic to resolve IP addresses for cluster nodes and include them as HostName entries in the SSH config. Without this, Pulse couldn't connect to cluster nodes like 'minipc' because the container couldn't resolve the hostname. Uses getent to resolve node names to IPs, with fallback to hostname if resolution fails (for environments where DNS works).	2025-10-19 08:48:25 +00:00
rcourtman	dd70bdee08	feat: switch to Ed25519 SSH keys and add openssh-client to container - Changed SSH key generation from RSA 2048 to Ed25519 (more secure, faster, smaller) - Added openssh-client package to Docker image (required for temperature monitoring) - Updated SSH config template to use id_ed25519 - Removed unused crypto/rsa and crypto/x509 imports Ed25519 provides better security with shorter keys and faster operations compared to RSA. The container now has SSH client tools needed to connect to Proxmox nodes for temperature data collection.	2025-10-19 08:43:20 +00:00

... 92 93 94 95 96 ...

4826 commits