Commit graph

1573 commits

Author SHA1 Message Date
rcourtman
545634372e Document log_level configuration for pulse-sensor-proxy
Update hardening documentation to include log_level configuration option.
Users can now find examples of controlling logging verbosity through
YAML config and environment variables.

Related to #629
2025-11-05 19:48:42 +00:00
rcourtman
a5e3469da8 Add comprehensive automation documentation for temperature proxy installation
Users who deploy Pulse via infrastructure-as-code tools (Ansible, Terraform,
Salt, Puppet) need scriptable, well-documented installation procedures.
This change documents them.

Changes:

**Comprehensive Automation Section:**
- Documented all installer script flags and options
  - Required: --ctid (LXC) or --standalone (Docker)
  - Optional: --quiet, --pulse-server, --version, --local-binary, --skip-restart
  - Documented idempotency, exit codes, and non-interactive behavior

**Real-World Examples:**
- Ansible playbook for LXC deployments
- Ansible playbook for Docker deployments (includes docker-compose.yml management)
- Terraform null_resource example with remote-exec
- Manual step-by-step configuration (no script)

**Configuration Documentation:**
- Complete YAML config file format with all options
- Environment variable overrides (PULSE_SENSOR_PROXY_ALLOWED_SUBNETS, etc.)
- Example systemd service overrides
- Rate limiting, metrics, ACL, and subnet configuration

**Quick Reference:**
- Added link at top of doc for automation users to jump directly to automation section
- Clear examples of re-running after changes (adding nodes, upgrading versions)

**Key Features for Automation:**
- --quiet flag for non-interactive execution
- Idempotent design (safe to re-run)
- Verifiable exit codes
- Environment variable configuration
- Local binary support (no internet required)

This makes it straightforward for infrastructure teams to integrate Pulse
temperature monitoring into their existing automation workflows without
relying on interactive scripts or manual steps.
2025-11-05 18:18:04 +00:00
rcourtman
a1fb79ae6a Fix temperature proxy documentation and setup script for Docker vs LXC clarity
This addresses confusion around temperature monitoring setup for Docker
deployments where users expected a turnkey experience similar to LXC.

The core issue: The setup script and documentation suggested that
temperature monitoring was "automatically configured" for all containerized
deployments, but in reality only LXC containers have a fully automatic
setup. Docker requires manual steps.

Changes:

**Setup Script (config_handlers.go):**
- Fixed "unknown environment" path to show separate instructions for LXC vs Docker
- Docker instructions now correctly show --standalone flag (was incorrectly showing --ctid)
- Added docker-compose.yml bind mount instructions inline
- Added restart command for Docker deployments

**Documentation (TEMPERATURE_MONITORING.md):**
- Added prominent "Deployment-Specific Setup" callout at the top
- Clarified that LXC is fully automatic, Docker requires manual steps
- Reorganized "Setup (Automatic)" section to clearly distinguish:
  - LXC: Fully turnkey (no manual steps)
  - Docker: Manual proxy installation required
  - Node configuration: Works for both
- Updated "Host-side responsibilities" to specify it's Docker-only
- Fixed architecture benefits to reflect LXC vs Docker differences

Why this matters:
- LXC setup script auto-detects the container and runs install-sensor-proxy.sh --ctid
- Docker deployments can't be auto-detected and require --standalone flag
- Users running Docker were getting incorrect instructions (--ctid instead of --standalone)
- Documentation suggested everything was automatic, leading to confusion

Now the documentation and setup script accurately reflect that:
- LXC = Turnkey (automatic)
- Docker = Manual steps required (but well-documented)
- Native = Direct SSH (no proxy)

Related to GitHub Discussion #605
2025-11-05 18:18:04 +00:00
rcourtman
26144ae558 Improve temperature proxy setup guidance for Docker deployments
This addresses GitHub Discussion #605 where users were unclear about
configuring the temperature proxy when running Pulse in Docker.

Changes:

**install-sensor-proxy.sh:**
- Add Docker-specific post-install instructions when --standalone flag is used
- Show required docker-compose.yml bind mount configuration
- Provide verification commands for Docker deployments
- Link to full documentation for troubleshooting

**TEMPERATURE_MONITORING.md:**
- Add prominent "Quick Start for Docker Deployments" section at the top
- Move Docker instructions earlier in the document for better visibility
- Provide complete 4-step setup process with verification commands

These changes ensure Docker users immediately see:
1. How to install the proxy on the Proxmox host
2. What bind mount to add to docker-compose.yml
3. How to restart and verify the setup
4. Where to find detailed troubleshooting

The installer now provides actionable next steps instead of just
confirming installation, reducing confusion for containerized deployments.
2025-11-05 18:18:04 +00:00
rcourtman
7a185c4ab3 Improve guest agent timeout handling for high-load environments (refs #592)
This change addresses intermittent "Guest details unavailable" and "Disk stats
unavailable" errors affecting users with large VM deployments (50+ VMs) or
high-load Proxmox environments.

Changes:
- Increased default guest agent timeouts (3-5s → 10-15s) to better handle
  environments under load
- Added automatic retry logic (1 retry by default) for transient timeout failures
- Made all timeouts and retry count configurable via environment variables:
  * GUEST_AGENT_FSINFO_TIMEOUT (default: 15s)
  * GUEST_AGENT_NETWORK_TIMEOUT (default: 10s)
  * GUEST_AGENT_OSINFO_TIMEOUT (default: 10s)
  * GUEST_AGENT_VERSION_TIMEOUT (default: 10s)
  * GUEST_AGENT_RETRIES (default: 1)
- Added comprehensive documentation in VM_DISK_MONITORING.md with configuration
  examples for different deployment scenarios

These improvements allow Pulse to gracefully handle intermittent API timeouts
without immediately displaying errors, while remaining configurable for
different network conditions and environment sizes.

Fixes: https://github.com/rcourtman/Pulse/discussions/592
2025-11-05 09:40:58 +00:00
rcourtman
d52ac6d8b5 Fix CSRF token validation and improve token management
- Add Access-Control-Expose-Headers to allow frontend to read X-CSRF-Token response header
- Implement proactive CSRF token issuance on GET requests when session exists but CSRF cookie is missing
- Ensures frontend always has valid CSRF token before making POST requests
- Fixes 403 Forbidden errors when toggling system settings

This resolves CSRF validation failures that occurred when CSRF tokens expired or were missing while valid sessions existed.
2025-11-05 09:23:44 +00:00
rcourtman
10862db4e4 Enhance container detection for temperature SSH safeguards (refs #601)
2025-11-04 22:30:35 +00:00
rcourtman
6eb1a10d9b Refactor: Code cleanup and localStorage consolidation
This commit includes comprehensive codebase cleanup and refactoring:

## Code Cleanup
- Remove dead TypeScript code (types/monitoring.ts - 194 lines duplicate)
- Remove unused Go functions (GetClusterNodes, MigratePassword, GetClusterHealthInfo)
- Clean up commented-out code blocks across multiple files
- Remove unused TypeScript exports (helpTextClass, private tag color helpers)
- Delete obsolete test files and components

## localStorage Consolidation
- Centralize all storage keys into STORAGE_KEYS constant
- Update 5 files to use centralized keys:
  * utils/apiClient.ts (AUTH, LEGACY_TOKEN)
  * components/Dashboard/Dashboard.tsx (GUEST_METADATA)
  * components/Docker/DockerHosts.tsx (DOCKER_METADATA)
  * App.tsx (PLATFORMS_SEEN)
  * stores/updates.ts (UPDATES)
- Benefits: Single source of truth, prevents typos, better maintainability

## Previous Work Committed
- Docker monitoring improvements and disk metrics
- Security enhancements and setup fixes
- API refactoring and cleanup
- Documentation updates
- Build system improvements

## Testing
- All frontend tests pass (29 tests)
- All Go tests pass (15 packages)
- Production build successful
- Zero breaking changes

Total: 186 files changed, 5825 insertions(+), 11602 deletions(-)
2025-11-04 21:50:46 +00:00
rcourtman
5c4be1921c chore: snapshot current changes
2025-11-02 22:47:55 +00:00
rcourtman
fb22469eb0 Add disk usage threshold support for Docker containers
Extends the Docker monitoring and alerting system to track writable layer
usage as a percentage of the container's root filesystem. This helps
identify containers with bloated copy-on-write layers before they
consume excessive disk space.

- Add disk threshold to DockerThresholdConfig (default: 85% trigger, 80% clear)
- Evaluate disk alerts for running containers when RootFilesystemBytes > 0
- Include disk metadata (writable layer, total filesystem, block I/O stats)
- Update frontend to display and configure disk thresholds
- Add test coverage for disk usage alert hysteresis
- Document disk monitoring in DOCKER_MONITORING.md

Per-container and per-host overrides apply to disk thresholds the same
way they do for CPU and memory.
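
The trigger/clear hysteresis mentioned above (85% trigger, 80% clear) can be illustrated with a small sketch. The type and function names are hypothetical, and the exact boundary comparisons (`>=` to trigger, `<=` to clear) are assumptions rather than confirmed behaviour.

```go
package main

import "fmt"

// Hysteresis holds a trigger and a clear threshold, both in percent.
type Hysteresis struct {
	Trigger float64
	Clear   float64
}

// Next returns the new alert state given the previous state and current usage.
// An inactive alert fires at or above Trigger; an active alert stays active
// until usage drops to Clear or below.
func (h Hysteresis) Next(active bool, usagePct float64) bool {
	if active {
		return usagePct > h.Clear
	}
	return usagePct >= h.Trigger
}

// usagePercent reports writable-layer usage as a percentage of the root
// filesystem. ok is false when the root filesystem size is unknown (zero),
// mirroring the "RootFilesystemBytes > 0" condition in the commit message.
func usagePercent(writableBytes, rootFSBytes uint64) (pct float64, ok bool) {
	if rootFSBytes == 0 {
		return 0, false
	}
	return float64(writableBytes) / float64(rootFSBytes) * 100, true
}

func main() {
	h := Hysteresis{Trigger: 85, Clear: 80}
	active := false
	states := []bool{}
	for _, pct := range []float64{70, 86, 82, 79, 84} {
		active = h.Next(active, pct)
		states = append(states, active)
	}
	fmt.Println(states) // prints [false true true false false]
}
```

The gap between the two thresholds keeps an alert from flapping when usage hovers around a single cut-off.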
2025-10-29 14:52:25 +00:00
rcourtman
32392d1212 Add disk metrics, block I/O, and mount details to Docker monitoring
Extends Docker container monitoring with comprehensive disk and storage information:
- Writable layer size and root filesystem usage displayed in new Disk column
- Block I/O statistics (read/write bytes totals) shown in container drawer
- Mount metadata including type, source, destination, mode, and driver details
- Configurable via --collect-disk flag (enabled by default, can be disabled for large fleets)

Also fixes config watcher to consistently use production auth config path instead of following PULSE_DATA_DIR when in mock mode.
2025-10-29 12:05:36 +00:00
rcourtman
b3285c05c8 Consolidate pending changes
- Add Docker metadata test comment
- Update alerts configuration and thresholds
- Enhance config file watcher
- Update documentation
- Refine settings UI
2025-10-28 23:20:44 +00:00
rcourtman
e07336dd9f refactor: remove legacy DISABLE_AUTH flag and enhance authentication UX
Major authentication system improvements:

- Remove deprecated DISABLE_AUTH environment variable support
- Update all documentation to remove DISABLE_AUTH references
- Add auth recovery instructions to docs (create .auth_recovery file)
- Improve first-run setup and Quick Security wizard flows
- Enhance login page with better error messaging and validation
- Refactor Docker hosts view with new unified table and tree components
- Add useDebouncedValue hook for better search performance
- Improve Settings page with better security configuration UX
- Update mock mode and development scripts for consistency
- Add ScrollableTable persistence and improved responsive design

Backend changes:
- Remove DISABLE_AUTH flag detection and handling
- Improve auth configuration validation and error messages
- Enhance security status endpoint responses
- Update router integration tests

Frontend changes:
- New Docker components: DockerUnifiedTable, DockerTree, DockerSummaryStats
- Better connection status indicator positioning
- Improved authentication state management
- Enhanced CSRF and session handling
- Better loading states and error recovery

This completes the migration away from the insecure DISABLE_AUTH pattern
toward proper authentication with recovery mechanisms.
2025-10-27 19:46:51 +00:00
rcourtman
68ce8e7520 feat: finalize swarm service monitoring (#598)
2025-10-26 09:35:49 +00:00
rcourtman
8e83eaf823 Add container state filtering to Docker agent
2025-10-25 21:40:59 +00:00
rcourtman
5a2d808aa1 Harden setup token flow and enforce encrypted persistence
2025-10-25 16:00:37 +00:00
rcourtman
d643dcf0bc perf: reduce polling allocations and guest metadata load
2025-10-25 13:12:47 +00:00
rcourtman
138d8facd2 Improve host agent onboarding flow
2025-10-25 09:37:29 +00:00
rcourtman
cee24ff7e0 docs: refresh API token scope guidance
2025-10-23 13:44:19 +00:00
rcourtman
5c54685f04 Add API token scopes and standalone host agent
Introduces granular permission scopes for API tokens (docker:report, docker:manage, host-agent:report, monitoring:read/write, settings:read/write) allowing tokens to be restricted to minimum required access. Legacy tokens default to full access until scopes are explicitly configured.
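
As a rough illustration of scope checking, the sketch below uses the scope names listed above. The `Token` type and `Allows` method are hypothetical, and representing a legacy token as one with an empty scope list is an assumption.

```go
package main

import "fmt"

// Token is a hypothetical API token carrying a list of granted scopes.
type Token struct {
	Name   string
	Scopes []string
}

// Allows reports whether the token may perform an action needing the given
// scope. A token with no configured scopes is treated as a legacy token with
// full access (an assumption about how "legacy" is represented).
func (t Token) Allows(required string) bool {
	if len(t.Scopes) == 0 {
		return true
	}
	for _, s := range t.Scopes {
		if s == required {
			return true
		}
	}
	return false
}

func main() {
	agent := Token{Name: "docker-agent", Scopes: []string{"docker:report"}}
	legacy := Token{Name: "old-token"}

	fmt.Println(agent.Allows("docker:report"))   // true
	fmt.Println(agent.Allows("settings:write"))  // false
	fmt.Println(legacy.Allows("settings:write")) // true
}
```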

Adds standalone host agent for monitoring Linux, macOS, and Windows servers outside Proxmox/Docker estates. New Servers workspace in UI displays uptime, OS metadata, and capacity metrics from enrolled agents.

Includes comprehensive token management UI overhaul with scope presets, inline editing, and visual scope indicators.
2025-10-23 11:40:31 +00:00
rcourtman
e1fe8354e9 Ensure Docker agent builds stay static (#597)
2025-10-22 21:48:57 +00:00
rcourtman
bc479643e4 release: prepare v4.25.0
2025-10-22 10:46:18 +00:00
rcourtman
ff4dc49ae4 Update Pulse install flow and related components
2025-10-21 19:58:53 +00:00
rcourtman
e0396c1362 docs: update documentation for diagnostics improvements
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.

**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
  Prometheus metrics with alert suggestions

**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
  explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
  using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
  with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
  defaults

These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards

Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
2025-10-21 12:45:19 +00:00
rcourtman
ddc9a7a068 docs: comprehensive documentation for rate limit fix and configurability
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:

TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count
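
A worked instance of the tuning formula, as a sketch only. It assumes `polling_interval` is expressed in milliseconds, and the 10000 ms polling interval is an illustrative value, not a documented default.

```go
package main

import "fmt"

// perPeerIntervalMs applies interval_ms = polling_interval / node_count using
// integer division. A non-positive node count returns the polling interval
// unchanged to avoid dividing by zero.
func perPeerIntervalMs(pollingIntervalMs, nodeCount int) int {
	if nodeCount <= 0 {
		return pollingIntervalMs
	}
	return pollingIntervalMs / nodeCount
}

func main() {
	const pollingIntervalMs = 10000 // hypothetical polling interval
	for _, nodes := range []int{3, 10, 20, 30} {
		fmt.Printf("%d nodes -> %d ms\n", nodes, perPeerIntervalMs(pollingIntervalMs, nodes))
	}
}
```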

TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example

CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override

pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands

TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments

Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
2025-10-21 11:36:07 +00:00
rcourtman
2f43d67af9 docs: simplify Mermaid diagrams for better readability
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:

- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow

Fixed parsing error in temperature proxy diagram (parentheses in
edge label causing render failure).

Diagrams should clarify architecture, not recreate implementation.
2025-10-21 10:50:40 +00:00
rcourtman
7bfd6997ec docs: remove outdated installer v2 rollout planning doc
The v2 installer rollout is complete - dist/install-docker-agent.sh
now contains the bundled v2 installer with embedded library modules.
This planning document served its purpose and is no longer relevant.
2025-10-21 10:48:35 +00:00
rcourtman
10d52244f8 docs: remove internal 'Phase 2' reference from adaptive polling docs
Replace internal development phase reference with clear description
of what the adaptive polling scheduler does. 'Phase 2' is internal
jargon that provides no value to users.
2025-10-21 10:45:46 +00:00
rcourtman
85ffe10aed docs: add Mermaid diagrams to improve visual documentation
Enhance documentation with six Mermaid diagrams to better explain
complex system implementations:

- Adaptive polling lifecycle flowchart showing enqueue→execute→feedback
  cycle with scheduler, priority queue, and worker interactions
- Circuit breaker state machine diagram illustrating Closed↔Open↔Half-open
  transitions with triggers and recovery paths
- Temperature proxy architecture diagram highlighting trust boundaries,
  security controls, and data flow between host/container/cluster
- Sensor proxy request flow sequence diagram showing auth, rate limiting,
  validation, and SSH execution pipeline
- Alert webhook pipeline flowchart detailing template resolution, URL
  rendering, HTTP dispatch, and retry logic
- Script library workflow diagram illustrating dev→test→bundle→distribute
  lifecycle emphasizing modular design

These visualizations make it easier for operators and contributors to
understand Pulse's sophisticated architectural patterns.
2025-10-21 10:40:33 +00:00
rcourtman
b929fdcc6e feat: improve source build installation experience
- Remove confusing --main flag, use --source for clarity
- Fix timeout issues when building from source in LXC containers
  - Increase timeout from 5min to 20min for source builds
  - Add PULSE_CONTAINER_TIMEOUT env var for custom timeouts
  - Support PULSE_CONTAINER_TIMEOUT=0 to disable timeout
- Fix misleading "Latest version: vX.X.X" message during source builds
- Update documentation to use --source instead of --main
- Simplify auto-update script logic for source builds

Changes:
- install.sh: Check BUILD_FROM_SOURCE early to skip version detection
- install.sh: Adaptive timeout (300s binary, 1200s source builds)
- install.sh: Better timeout error messages with recovery instructions
- README.md: Replace --main with --source in examples
- docs/INSTALL.md: Replace --main with --source in examples
- scripts/pulse-auto-update.sh: Remove --main special case
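
The timeout selection described above lives in `install.sh`, a shell script. Purely to illustrate the decision logic, here is a sketch in Go; the function name is hypothetical, and it assumes `PULSE_CONTAINER_TIMEOUT` is a whole number of seconds.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// containerTimeout picks the timeout in seconds. An explicit override wins,
// with 0 meaning "no timeout"; otherwise source builds get 1200s and binary
// installs get 300s. Invalid overrides fall back to the defaults.
func containerTimeout(override string, fromSource bool) (seconds int, enabled bool) {
	if override != "" {
		if n, err := strconv.Atoi(override); err == nil && n >= 0 {
			return n, n > 0
		}
	}
	if fromSource {
		return 1200, true
	}
	return 300, true
}

func main() {
	secs, enabled := containerTimeout(os.Getenv("PULSE_CONTAINER_TIMEOUT"), true)
	if !enabled {
		fmt.Println("timeout disabled")
		return
	}
	fmt.Printf("timeout: %ds\n", secs)
}
```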
2025-10-21 08:57:29 +00:00
rcourtman
c91b7874ac docs: comprehensive v4.24.0 documentation audit and updates
Complete documentation overhaul for Pulse v4.24.0 release covering all new
features and operational procedures.

Documentation Updates (19 files):

P0 Release-Critical:
- Operations: Rewrote ADAPTIVE_POLLING_ROLLOUT.md as GA operations runbook
- Operations: Updated ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md with DEFERRED status
- Operations: Enhanced audit-log-rotation.md with scheduler health checks
- Security: Updated proxy hardening docs with rate limit defaults
- Docker: Added runtime logging and rollback procedures

P1 Deployment & Integration:
- KUBERNETES.md: Runtime logging config, adaptive polling, post-upgrade verification
- PORT_CONFIGURATION.md: Service naming, change tracking via update history
- REVERSE_PROXY.md: Rate limit headers, error pass-through, v4.24.0 verification
- PROXY_AUTH.md, OIDC.md, WEBHOOKS.md: Runtime logging integration
- TROUBLESHOOTING.md, VM_DISK_MONITORING.md, zfs-monitoring.md: Updated workflows

Features Documented:
- X-RateLimit-* headers for all API responses
- Updates rollback workflow (UI & CLI)
- Scheduler health API with rich metadata
- Runtime logging configuration (no restart required)
- Adaptive polling (GA, enabled by default)
- Enhanced audit logging
- Circuit breakers and dead-letter queue

Supporting Changes:
- Discovery service enhancements
- Config handlers updates
- Sensor proxy installer improvements

Total Changes: 1,626 insertions(+), 622 deletions(-)
Files Modified: 24 (19 docs, 5 code)

All documentation is production-ready for v4.24.0 release.
2025-10-20 17:20:13 +00:00
rcourtman
fd0a4f2b0a docs: update documentation for v4.24.0 features
Updates documentation to reflect features implemented in recent commits:

**Security & API Enhancements:**
- Rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After)
- Audit logging for rollback actions and scheduler health
- Runtime logging configuration tracking

**Scheduler Health API:**
- Document new v4.24.0 endpoint features
- Per-instance circuit breaker status
- Dead-letter queue tracking
- Staleness metrics
- Enhanced response format with backward compatibility

**Version & Health Endpoints:**
- Updated /api/version response fields
- Optional health endpoint fields
- Deployment type and update availability

**Configuration & Installation:**
- HTTP config fetch via PULSE_INIT_CONFIG_URL
- Updated environment variable documentation
- Enhanced FAQ entries

**Monitoring & Operations:**
- Adaptive polling architecture documentation
- Rollback procedure references
- Production deployment guidance

All documentation changes align with implemented features from commits:
- 656ae0d25 (PMG test fix)
- dec85a4ef (PBS/PMG stubs + HTTP config)
- Earlier commits: scheduler health API, rollback, rate limiting
2025-10-20 16:08:10 +00:00
rcourtman
469d11fc7e docs: add comprehensive scheduler health API documentation
Add detailed API reference and update rollout playbook:

**New: docs/api/SCHEDULER_HEALTH.md**
- Complete endpoint reference for /api/monitoring/scheduler/health
- Request/response structure with field descriptions
- Enhanced "instances" array documentation
- Example responses showing all states (healthy, transient, DLQ)
- Useful jq queries for troubleshooting:
  - Find instances with errors
  - List DLQ entries
  - Show open circuit breakers
  - Sort by failure streaks
- Migration guide (legacy → new fields)
- Troubleshooting examples with real scenarios

**Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md**
- Enhanced "Accessing Scheduler Health API" section (§6)
- Added examples using new instances[] array
- Updated queries to use pollStatus, breaker, deadLetter fields
- Practical jq commands for operators

**Key Documentation Features:**
- Complete JSON schema with examples
- All new fields documented with types and descriptions
- Real-world troubleshooting scenarios
- Copy-paste ready jq queries
- Migration path for existing integrations
- Backward compatibility notes

Operators can now:
- Find error messages without log digging
- Understand circuit breaker states
- Track DLQ entries with full context
- Diagnose issues using single API call

Part of Phase 2 follow-up - enhanced observability
2025-10-20 15:13:38 +00:00
rcourtman
0fcfad3dc5 feat: add shared script library system and refactor docker-agent installer
Implements a comprehensive script improvement infrastructure to reduce code
duplication, improve maintainability, and enable easier testing of installer
scripts.

## New Infrastructure

### Shared Library System (scripts/lib/)
- common.sh: Core utilities (logging, sudo, dry-run, cleanup management)
- systemd.sh: Service management helpers with container-safe systemctl
- http.sh: HTTP/download helpers with curl/wget fallback and retry logic
- README.md: Complete API documentation for all library functions

### Bundler System
- scripts/bundle.sh: Concatenates library modules into single-file installers
- scripts/bundle.manifest: Defines bundling configuration for distributables
- Enables both modular development and curl|bash distribution

### Test Infrastructure
- scripts/tests/run.sh: Test harness for running all smoke tests
- scripts/tests/test-common-lib.sh: Common library validation (5 tests)
- scripts/tests/test-docker-agent-v2.sh: Installer smoke tests (4 tests)
- scripts/tests/integration/: Container-based integration tests (5 scenarios)
- All tests passing ✓

## Refactored Installer

### install-docker-agent-v2.sh
- Reduced from 1098 to 563 lines (48% code reduction)
- Uses shared libraries for all common operations
- NEW: --dry-run flag support
- Maintains 100% backward compatibility with original
- Fully tested with smoke and integration tests

### Key Improvements
- Sudo escalation: 100+ lines → 1 function call
- Download logic: 51 lines → 1 function call
- Service creation: 33 lines → 2 function calls
- Logging: Standardized across all operations
- Error handling: Improved with common library

## Documentation

### Rollout Strategy (docs/installer-v2-rollout.md)
- 3-phase rollout plan (Alpha → Beta → GA)
- Feature flag mechanism for gradual deployment
- Testing checklist and success metrics
- Rollback procedures and communication plan

### Developer Guides
- docs/script-library-guide.md: Complete library usage guide
- docs/CONTRIBUTING-SCRIPTS.md: Contribution workflow
- docs/installer-v2-quickref.md: Quick reference for operators

## Metrics

- Code reduction: 48% (1098 → 563 lines)
- Reusable functions: 0 → 30+
- Test coverage: 0 → 8 test scenarios
- Documentation: 0 → 5 comprehensive guides

## Testing

All tests passing:
- Smoke tests: 2/2 passed (8 test cases)
- Integration tests: 5/5 scenarios passed
- Bundled output: Syntax validated, dry-run tested

## Next Steps

This lays the foundation for migrating other installers (install.sh,
install-sensor-proxy.sh) to use the same pattern, reducing overall
maintenance burden and improving code quality across the project.
2025-10-20 15:13:38 +00:00
rcourtman
ce5ad64810 docs: defer circuit breaker/DLQ management endpoints (Phase 2 Task 11)
Document decision to defer mutation endpoints after soak testing:

**Assessment Results:**
- Integration tests (55s, 12 instances): Automatic recovery worked perfectly
- Soak tests (2-240min, 80 instances): No manual intervention needed
- Circuit breakers: Opened/closed automatically as designed
- DLQ routing: Permanent failures handled correctly

**Current Capabilities (Sufficient):**
- Read-only scheduler health API provides full visibility
- Operator workarounds: service restart, feature flag toggle
- Grafana alerting: queue depth, staleness, DLQ, breakers

**Why Defer:**
- No operational need demonstrated in testing
- Implementation requires auth/RBAC/audit/UI work
- Cost not justified until production usage reveals need
- Can add later when data shows actual pain points

**Future Design Notes:**
- POST /api/monitoring/breakers/{instance}/reset
- POST /api/monitoring/dlq/retry (all or specific)
- DELETE /api/monitoring/dlq/{instance}
- Auth, audit, rate limiting, UI integration required

**Re-evaluation Criteria:**
- Operators request controls >3x in 30 days
- Troubleshooting steps inadequate
- Service restarts too disruptive
- Production incidents need surgical controls

Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns.

Part of Phase 2 - Adaptive Polling completion
2025-10-20 15:13:38 +00:00
rcourtman
cb8be81f1d docs: add adaptive polling production rollout playbook (Phase 2 Task 10)
Add comprehensive operator playbook for production enablement:

**Prerequisites:**
- Test suite validation (unit, integration, soak)
- Monitoring readiness (Grafana dashboards, alerts)
- Configuration management and rollback planning
- Stakeholder sign-off

**Staging Rollout:**
- Feature flag enablement steps
- Verification procedures (scheduler health API)
- 24-48h observation window with success criteria
- Metric checkpoints at 0h, 12h, 24h

**Production Rollout:**
- Gradual strategy (25% nodes every 2 hours)
- Low-traffic maintenance window
- Per-cluster monitoring during rollout
- Success criteria and completion validation

**Grafana/Alert Configuration:**
- Dashboard panels: queue depth, staleness, throughput, breakers/DLQ
- Alert thresholds:
  - Queue depth > 1.5× instances for >10min (Warning)
  - Staleness > 60s for >5min (Critical)
  - DLQ growth (Warning)
  - Stuck breakers >10min (Critical)

**Rollback Procedure:**
- Clear disable/restart steps
- Verification of rollback success
- Post-rollback actions and incident reporting

**Troubleshooting:**
- Symptom/cause/action table
- Scheduler health API access guide
- Immediate rollback triggers

Operators can now safely enable adaptive polling following this step-by-step playbook.

Part of Phase 2 Task 10 (Documentation)
2025-10-20 15:13:38 +00:00
rcourtman
d5c7a3494b chore: remove deprecated Pulse+ agent metrics and add audit log rotation docs
Removed all legacy Pulse+ agent metrics infrastructure (cloud-relay) which has been
fully replaced by the new docker agent and temperature agent implementations.

Changes:
- Remove cloud-relay directory and all related binaries (relay, relay-linux, etc.)
- Remove Pulse+ documentation (AGENT_METRICS_IMPLEMENTATION.md, AGENT_METRICS_SETUP.md)
- Clean up pulse-relay references in workflows and release checklist
- Add audit log rotation documentation for sensor proxy hash-chained logs
- Update .gitignore to remove cloud-relay/ entry

The new docker and temp agents remain fully functional and unaffected by this cleanup.
2025-10-20 15:13:38 +00:00
rcourtman
fa21e9c69c chore: remove completed phase summary documents
Removed PHASE1_SUMMARY.md and PHASE2_SUMMARY.md as both phases are complete.
All relevant documentation has been integrated into the main docs:
- Security hardening docs in SECURITY.md
- Adaptive polling architecture in docs/monitoring/ADAPTIVE_POLLING.md
2025-10-20 15:13:38 +00:00
rcourtman
b3f37a798c docs: update Phase 2 summary to reflect completion (9/10 tasks = 90%)
Updated PHASE2_SUMMARY.md to include:
- Task 8: Scheduler health API endpoint completion
- Task 9: Unit testing completion (40+ test cases)
- Updated git commit history (9 commits total)
- Revised known limitations (removed API/testing gaps)
- Updated future work section

Phase 2 achievements:
- 9/10 tasks complete (only integration/soak tests deferred)
- 40+ unit tests covering backoff, circuit breakers, staleness
- Full scheduler health API with authentication
- Comprehensive documentation and rollout plan
- Production-ready with feature flag control

Remaining work (deferred to future):
- Integration tests with mock PVE/PBS clients
- Soak tests for extended queue stability
- Write endpoints for circuit breaker/DLQ management
2025-10-20 15:13:38 +00:00
rcourtman
160adeb3b8 feat: add scheduler health API endpoint (Phase 2 Task 8)
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.
2025-10-20 15:13:38 +00:00
rcourtman
5fbdf6099f docs: add adaptive polling architecture guide (Phase 2 Task 10)
Comprehensive documentation for Phase 2 adaptive polling:
- Architecture overview with component diagram
- Configuration guide (env vars, defaults, feature flag)
- Prometheus metrics reference (7 new metrics)
- Circuit breaker & backoff behavior explanation
- Dead-letter queue operational guidance
- Rollout plan (dev/QA → staged → full)
- Troubleshooting guide for common issues

Task 10 of 10 complete. Phase 2: 8/10 tasks implemented (80%).
2025-10-20 15:13:37 +00:00
rcourtman
aa5c08ad4a feat: implement priority queue-based task execution (Phase 2 Task 6)
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates

Task 6 of 10 complete (60%). Ready for error/backoff policies.
2025-10-20 15:13:37 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
rcourtman
29f4879cd4 test: add comprehensive security tests and documentation
Implements all remaining Codex recommendations before launch:

1. Privileged Methods Tests:
   - TestPrivilegedMethodsCompleteness ensures all host-side RPCs are protected
   - Will fail if new privileged RPC is added without authorization
   - Verifies read-only methods are NOT in privilegedMethods

2. ID-Mapped Root Detection Tests:
   - TestIDMappedRootDetection covers all boundary conditions
   - Tests UID/GID range detection (both must be in range)
   - Tests multiple ID ranges, edge cases, disabled mode
   - 100% coverage of container identification logic

3. Authorization Tests:
   - TestPrivilegedMethodsBlocked verifies containers can't call privileged RPCs
   - TestIDMappedRootDisabled ensures feature can be disabled
   - Tests both container and host credentials

4. Comprehensive Security Documentation (23 KB):
   - Architecture overview with diagrams
   - Complete authentication & authorization flow
   - Rate limiting details (already implemented: 20/min per peer)
   - SSH security model and forced commands
   - Container isolation mechanisms
   - Monitoring & alerting recommendations
   - Development mode documentation (PULSE_DEV_ALLOW_CONTAINER_SSH)
   - Troubleshooting guide with common issues
   - Incident response procedures

Rate Limiting Status:
- Already implemented in throttle.go (20 req/min, burst 10, max 10 concurrent)
- Per-peer rate limiting at line 328 in main.go
- Per-node concurrency control at line 825 in main.go
- Exceeds Codex's requirements

All tests pass. Documentation covers all security aspects.

Addresses final Codex recommendations for production readiness.
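
The ID-mapped root rule described in item 2 (UID and GID must both fall inside a configured range, with multiple ranges and a disabled mode) could be expressed along these lines. This is a hedged sketch only: `IDRange`, `IsIDMappedRoot`, and the example range are assumptions for illustration, not the identifiers or values used in the tested code.

```go
package main

import "fmt"

// IDRange is a hypothetical mapped ID range given as a start and a length.
type IDRange struct {
	Start uint32
	Count uint32
}

// contains reports whether id lies in [Start, Start+Count).
func (r IDRange) contains(id uint32) bool {
	// Widen to uint64 so Start+Count cannot overflow.
	return uint64(id) >= uint64(r.Start) &&
		uint64(id) < uint64(r.Start)+uint64(r.Count)
}

// inAny reports whether id lies in at least one of the ranges.
func inAny(ranges []IDRange, id uint32) bool {
	for _, r := range ranges {
		if r.contains(id) {
			return true
		}
	}
	return false
}

// IsIDMappedRoot reports whether peer credentials look like a container's
// mapped root: the feature must be enabled and BOTH the UID and the GID
// must fall inside a configured range.
func IsIDMappedRoot(enabled bool, uidRanges, gidRanges []IDRange, uid, gid uint32) bool {
	if !enabled {
		return false
	}
	return inAny(uidRanges, uid) && inAny(gidRanges, gid)
}

func main() {
	// Example range only; real values depend on the host's ID mapping.
	ranges := []IDRange{{Start: 100000, Count: 65536}}
	fmt.Println(IsIDMappedRoot(true, ranges, ranges, 100000, 100000))
	fmt.Println(IsIDMappedRoot(true, ranges, ranges, 100000, 0))
	fmt.Println(IsIDMappedRoot(false, ranges, ranges, 100000, 100000))
}
```

Requiring both IDs to match means a host process that shares only a GID with the mapped range is not treated as a container.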
2025-10-19 16:47:13 +00:00
Pulse Automation Bot
0b4e4f9c59 Add configurable backup polling interval 2025-10-18 13:06:41 +00:00
Pulse Automation Bot
d15ad1d0b4 Add Helm chart tooling, CI, and release packaging 2025-10-18 11:50:57 +00:00
Richard Courtman
02701ca22b fix: gracefully handle standalone node cleanup limitation
- Cleanup script now detects forced command restriction on standalone nodes
- Logs helpful message explaining limitation (security by design)
- Does not fail when standalone nodes cannot be cleaned up
- Documents that standalone node cleanup is limited by forced command security
- Automatic cleanup works fully for cluster nodes
- Manual cleanup command provided for standalone nodes if needed
2025-10-18 07:34:18 +00:00
Richard Courtman
b328a09e45 docs: add automatic cleanup documentation for node removal 2025-10-18 07:03:42 +00:00
Richard Courtman
de3bb47930 fix: improve turnkey temperature monitoring for standalone nodes
- Fix script input handling to work with standard curl | bash pattern by prioritizing /dev/tty
- Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors)
- Add comprehensive documentation for turnkey standalone node setup
- Fix printf formatting error in setup script
2025-10-18 06:51:56 +00:00
rcourtman
a5d4d57097 docs: implement Codex recommendations for temperature monitoring
Add comprehensive documentation improvements based on architectural review:

1. Enhanced Known Limitations section:
   - Document single proxy failure mode
   - Explain sensors output parsing brittleness with mitigation steps
   - Clarify cluster discovery dependencies and fallback options
   - Describe SSH fan-out scaling considerations for large clusters

2. Documented SSH key rotation workflow:
   - Promote automated rotation script as recommended approach
   - Include dry-run, execution, and rollback examples
   - Provide manual fallback process
   - Reference existing pulse-proxy-rotate-keys.sh script

3. Added Future Improvements roadmap:
   - Proxmox API integration (when available)
   - Agent-based architecture option
   - SNMP/IPMI support
   - Schema validation
   - Caching and throttling
   - Automated rotation timer
   - Health check endpoint

Instrumentation verified: proxy already has comprehensive Prometheus metrics
(RPC/SSH requests, latency, queue depth, rate limiting) and structured logging.
2025-10-17 12:03:31 +00:00