Added two troubleshooting sections to DOCKER_MONITORING.md:
1. "Docker hosts cycling or appearing to replace each other" - explains
why multiple agents sharing the same token cause the UI to switch
between hosts instead of showing all simultaneously
2. "Agent rejected after host removal" - documents the re-enrollment
process when a host is on the removal blocklist
These entries make common setup issues searchable while linking to
canonical setup instructions rather than duplicating them.
Add comprehensive documentation for new alert system reliability features:
**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios (sketched below)
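For instance, queries along these lines (a sketch only -- the metric names here are assumptions; docs/API.md lists the real 18):

```promql
# Alert if notifications are landing in the dead letter queue:
increase(pulse_notifications_dlq_total[15m]) > 0

# Watch notification queue depth over the past hour:
avg_over_time(pulse_notification_queue_depth[1h])
```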
**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms
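A rough sketch of how those fields fit together in one config block (the field names are from the lists above; the values and surrounding file structure are illustrative assumptions):

```json
{
  "maxAlertAgeDays": 7,
  "maxAcknowledgedAgeDays": 2,
  "autoAcknowledgeAfterHours": 24,
  "flappingEnabled": true,
  "flappingWindowSeconds": 300,
  "flappingThreshold": 5,
  "flappingCooldownMinutes": 30
}
```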
All new features are fully documented with examples and default values.
Issues found during systematic audit after #642:
1. CRITICAL BUG - Rollback downloads were completely broken:
- Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
- Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
- This would cause 404 errors on all rollback attempts
- Fixed: Construct the correct versioned tarball URL (see the sketch below)
- Added: Extract the tarball after download to get the binary
2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
- Changed to use /latest/download/ for future-proof docs
3. API.md example had wrong filename format:
- Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
- Ensures example matches actual release asset naming
The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
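A minimal sketch of the corrected URL construction (the function and parameter names are hypothetical; the asset naming follows the bug description above):

```go
package main

import "fmt"

// Release assets are named pulse-<tag>-<os>-<arch>.tar.gz,
// e.g. pulse-v4.26.1-linux-amd64.tar.gz -- the tag is part of the filename.
func rollbackAssetURL(tag, osName, arch string) string {
	return fmt.Sprintf(
		"https://github.com/rcourtman/Pulse/releases/download/%s/pulse-%s-%s-%s.tar.gz",
		tag, tag, osName, arch)
}

func main() {
	fmt.Println(rollbackAssetURL("v4.26.1", "linux", "amd64"))
}
```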
Issue: HOST_AGENT.md documented downloading pulse-host-agent binaries
from GitHub releases, but those assets didn't exist. Only tarballs were
available, making manual installation unnecessarily complex.
Changes:
- Copy standalone host-agent binaries (all architectures) to release/
directory alongside sensor-proxy binaries
- Include host-agent binaries in checksum generation
- Update HOST_AGENT.md to clarify available architectures
- Retroactively uploaded missing binaries to v4.26.1
This enables air-gapped and manual installations without requiring an
already-running Pulse server to download from.
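The manual flow this enables looks roughly like this (the asset and checksum file names are assumptions -- check the release page):

```sh
# Fetch a standalone host-agent binary and the release checksums.
curl -fsSLO https://github.com/rcourtman/Pulse/releases/latest/download/pulse-host-agent-linux-amd64
curl -fsSLO https://github.com/rcourtman/Pulse/releases/latest/download/checksums.txt

# Verify before running.
sha256sum --check --ignore-missing checksums.txt
chmod +x pulse-host-agent-linux-amd64
```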
Related to #617
This fixes a misconfiguration scenario where Docker containers could
attempt direct SSH connections (producing [preauth] log spam) instead
of using the sensor proxy.
Changes:
- Fix container detection to check PULSE_DOCKER=true in addition to
system.InContainer() heuristics (both temperature.go and config_handlers.go)
- Downgrade the temperature collection log from Error to Warn and add actionable
  guidance about mounting the proxy socket
- Add Info log when dev mode override is active so operators understand
the security posture
- Add troubleshooting section to docs for SSH [preauth] logs from containers
The container detection was inconsistent - monitor.go checked both flags
but temperature.go and config_handlers.go only checked InContainer().
Now all locations consistently check PULSE_DOCKER || InContainer().
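The consolidated check, roughly (a fragment, not a full file -- system.InContainer() is the project's own helper and its import path is omitted here):

```go
// Treat the process as containerized if either signal fires: the explicit
// PULSE_DOCKER=true environment flag, or the runtime heuristics.
func runningInContainer() bool {
	return os.Getenv("PULSE_DOCKER") == "true" || system.InContainer()
}
```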
Related to #636
When authentication is not configured (hasAuth() returns false), the
Settings tab is now automatically hidden from the web interface. This
provides a cleaner monitoring-only view for unauthenticated deployments
where users only need to check the health of their environment.
The Settings icon beside the Alerts tab will only appear when
authentication is properly configured via PULSE_AUTH_USER/PASS,
API tokens, proxy auth, or OIDC.
Changes:
- Modified utilityTabs in App.tsx to conditionally include Settings
based on hasAuth() signal
- Updated CONFIGURATION.md to document this UI behavior
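A sketch of the conditional (SolidJS-style; hasAuth() is the signal named above, but the exact tab shape in App.tsx is an assumption):

```tsx
import { createMemo } from 'solid-js';

// hasAuth(): boolean accessor exposed by the app's auth state.
const utilityTabs = createMemo(() => [
  { id: 'alerts', label: 'Alerts' },
  // Settings only appears when authentication is configured.
  ...(hasAuth() ? [{ id: 'settings', label: 'Settings' }] : []),
]);
```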
Add comprehensive documentation for HTTPS/TLS configuration including:
- File ownership and permission requirements (pulse user)
- Common troubleshooting steps for startup failures
- Complete setup examples for systemd and Docker
- Validation commands for certificate/key verification
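For example, the usual openssl checks (paths are placeholders):

```sh
# Certificate not expired?
openssl x509 -in /etc/pulse/cert.pem -noout -enddate

# Certificate and (RSA) key actually a pair? The two hashes must match.
openssl x509 -in /etc/pulse/cert.pem -noout -modulus | openssl md5
openssl rsa  -in /etc/pulse/key.pem  -noout -modulus | openssl md5

# Files must be readable by the pulse user.
chown pulse:pulse /etc/pulse/cert.pem /etc/pulse/key.pem
chmod 640 /etc/pulse/key.pem
```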
Related to discussion #634
Added comprehensive documentation for the per-metric alert delay feature
that was requested in issue #433. This feature allows configuring
different alert delays for different metrics (e.g., longer delays for
CPU spikes, shorter delays for memory pressure).
Key additions:
- Detailed explanation of delay precedence hierarchy
- JSON configuration examples for common use cases
- Table of recommended delays by metric type with reasoning
- UI access instructions for the Alert Delay row
Also added example tests demonstrating the feature's functionality
and common configuration patterns.
The feature itself was already fully implemented in both backend
(metricTimeThresholds support) and frontend (per-metric delay inputs
in ResourceTable). This commit surfaces the feature through
documentation so users know it exists and how to use it.
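The configuration shape is roughly as follows (metricTimeThresholds is the backend field named above; the keys and the choice of seconds as the unit are illustrative, not confirmed):

```json
{
  "metricTimeThresholds": {
    "cpu": 300,
    "memory": 60
  }
}
```

That is, tolerate five minutes of sustained CPU before alerting, but only one minute of memory pressure.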
Related to #433
Update hardening documentation to include log_level configuration option.
Users can now find examples of controlling logging verbosity through
YAML config and environment variables.
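Something like (the YAML key follows the doc; the environment variable name is an assumption):

```yaml
# In the hardening YAML config:
log_level: warn

# Or via the environment (name assumed): LOG_LEVEL=warn
```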
Related to #629
This addresses the need for users who deploy Pulse via infrastructure-as-code
tools (Ansible, Terraform, Salt, Puppet) to have scriptable, well-documented
installation procedures.
Changes:
**Comprehensive Automation Section:**
- Documented all installer script flags and options
  - Required: --ctid (LXC) or --standalone (Docker)
  - Optional: --quiet, --pulse-server, --version, --local-binary, --skip-restart
- Documented idempotency, exit codes, and non-interactive behavior
**Real-World Examples:**
- Ansible playbook for LXC deployments
- Ansible playbook for Docker deployments (includes docker-compose.yml management)
- Terraform null_resource example with remote-exec
- Manual step-by-step configuration (no script)
**Configuration Documentation:**
- Complete YAML config file format with all options
- Environment variable overrides (PULSE_SENSOR_PROXY_ALLOWED_SUBNETS, etc.)
- Example systemd service overrides
- Rate limiting, metrics, ACL, and subnet configuration
**Quick Reference:**
- Added link at top of doc for automation users to jump directly to automation section
- Clear examples of re-running after changes (adding nodes, upgrading versions)
**Key Features for Automation:**
- --quiet flag for non-interactive execution
- Idempotent design (safe to re-run)
- Verifiable exit codes
- Environment variable configuration
- Local binary support (no internet required)
This makes it straightforward for infrastructure teams to integrate Pulse
temperature monitoring into their existing automation workflows without
relying on interactive scripts or manual steps.
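As a quick sketch, the non-interactive invocations look like this (the flags are from the list above; the values are placeholders):

```sh
# LXC container 101, fully unattended:
./install-sensor-proxy.sh --ctid 101 --quiet --pulse-server https://pulse.example

# Docker host, pinned version, offline binary, defer the restart:
./install-sensor-proxy.sh --standalone --quiet \
  --version v4.26.1 --local-binary ./pulse-sensor-proxy --skip-restart

echo $?   # non-zero exit tells the automation tool the run failed
```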
This addresses confusion around temperature monitoring setup for Docker
deployments where users expected a turnkey experience similar to LXC.
The core issue: The setup script and documentation suggested that
temperature monitoring was "automatically configured" for all containerized
deployments, but in reality only LXC containers have a fully automatic
setup. Docker requires manual steps.
Changes:
**Setup Script (config_handlers.go):**
- Fixed "unknown environment" path to show separate instructions for LXC vs Docker
- Docker instructions now correctly show --standalone flag (was incorrectly showing --ctid)
- Added docker-compose.yml bind mount instructions inline (sketched below)
- Added restart command for Docker deployments
**Documentation (TEMPERATURE_MONITORING.md):**
- Added prominent "Deployment-Specific Setup" callout at the top
- Clarified that LXC is fully automatic, Docker requires manual steps
- Reorganized "Setup (Automatic)" section to clearly distinguish:
  - LXC: Fully turnkey (no manual steps)
  - Docker: Manual proxy installation required
  - Node configuration: Works for both
- Updated "Host-side responsibilities" to specify it's Docker-only
- Fixed architecture benefits to reflect LXC vs Docker differences
Why this matters:
- LXC setup script auto-detects the container and runs install-sensor-proxy.sh --ctid
- Docker deployments can't be auto-detected and require --standalone flag
- Users running Docker were getting incorrect instructions (--ctid instead of --standalone)
- Documentation suggested everything was automatic, leading to confusion
Now the documentation and setup script accurately reflect that:
- LXC = Turnkey (automatic)
- Docker = Manual steps required (but well-documented)
- Native = Direct SSH (no proxy)
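The bind mount in question looks roughly like this (the socket path is an assumption; PULSE_DOCKER is the container flag documented elsewhere in this log):

```yaml
services:
  pulse:
    environment:
      - PULSE_DOCKER=true
    volumes:
      - /run/pulse-sensor-proxy/pulse-sensor-proxy.sock:/run/pulse-sensor-proxy/pulse-sensor-proxy.sock
```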
Related to GitHub Discussion #605
This addresses GitHub Discussion #605 where users were unclear about
configuring the temperature proxy when running Pulse in Docker.
Changes:
**install-sensor-proxy.sh:**
- Add Docker-specific post-install instructions when --standalone flag is used
- Show required docker-compose.yml bind mount configuration
- Provide verification commands for Docker deployments
- Link to full documentation for troubleshooting
**TEMPERATURE_MONITORING.md:**
- Add prominent "Quick Start for Docker Deployments" section at the top
- Move Docker instructions earlier in the document for better visibility
- Provide complete 4-step setup process with verification commands
These changes ensure Docker users immediately see:
1. How to install the proxy on the Proxmox host
2. What bind mount to add to docker-compose.yml
3. How to restart and verify the setup
4. Where to find detailed troubleshooting
The installer now provides actionable next steps instead of just
confirming installation, reducing confusion for containerized deployments.
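Those four steps boil down to something like this on the Proxmox host running the container (socket path assumed):

```sh
systemctl status pulse-sensor-proxy        # 1. proxy installed and running
ls -l /run/pulse-sensor-proxy/             # 2. socket exists for the bind mount
docker compose restart pulse               # 3. restart to pick up the mount
docker compose logs pulse | grep -i temp   # 4. confirm temperature data flows
```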
This change addresses intermittent "Guest details unavailable" and "Disk stats
unavailable" errors affecting users with large VM deployments (50+ VMs) or
high-load Proxmox environments.
Changes:
- Increased default guest agent timeouts (3-5s → 10-15s) to better handle
environments under load
- Added automatic retry logic (1 retry by default) for transient timeout failures
- Made all timeouts and retry count configurable via environment variables:
* GUEST_AGENT_FSINFO_TIMEOUT (default: 15s)
* GUEST_AGENT_NETWORK_TIMEOUT (default: 10s)
* GUEST_AGENT_OSINFO_TIMEOUT (default: 10s)
* GUEST_AGENT_VERSION_TIMEOUT (default: 10s)
* GUEST_AGENT_RETRIES (default: 1)
- Added comprehensive documentation in VM_DISK_MONITORING.md with configuration
examples for different deployment scenarios
These improvements allow Pulse to gracefully handle intermittent API timeouts
without immediately displaying errors, while remaining configurable for
different network conditions and environment sizes.
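For example, loosening the timeouts on a heavily loaded cluster via a systemd override (illustrative values; the variable names and defaults are the ones listed above):

```ini
# /etc/systemd/system/pulse.service.d/guest-agent-timeouts.conf
[Service]
Environment=GUEST_AGENT_FSINFO_TIMEOUT=30s
Environment=GUEST_AGENT_NETWORK_TIMEOUT=20s
Environment=GUEST_AGENT_RETRIES=2
```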
Fixes: https://github.com/rcourtman/Pulse/discussions/592
- Add Access-Control-Expose-Headers to allow frontend to read X-CSRF-Token response header
- Implement proactive CSRF token issuance on GET requests when session exists but CSRF cookie is missing
- Ensures frontend always has valid CSRF token before making POST requests
- Fixes 403 Forbidden errors when toggling system settings
This resolves CSRF validation failures that occurred when CSRF tokens expired or were missing while valid sessions existed.
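A rough sketch of the middleware behavior (a Go fragment; the cookie name and the hasValidSession/newCSRFToken helpers are hypothetical -- the header names are the ones in the bullets above):

```go
func withCSRF(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Let browser JS read the token from responses.
		w.Header().Set("Access-Control-Expose-Headers", "X-CSRF-Token")
		// Proactively issue a token on GET when a session exists but the
		// CSRF cookie is missing (e.g. it expired before the session did).
		if r.Method == http.MethodGet && hasValidSession(r) {
			if _, err := r.Cookie("pulse_csrf"); err != nil {
				tok := newCSRFToken()
				http.SetCookie(w, &http.Cookie{Name: "pulse_csrf", Value: tok, Path: "/"})
				w.Header().Set("X-CSRF-Token", tok)
			}
		}
		next.ServeHTTP(w, r)
	})
}
```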
Extends the Docker monitoring and alerting system to track writable layer
usage as a percentage of the container's root filesystem. This helps
identify containers with bloated copy-on-write layers before they
consume excessive disk space.
- Add disk threshold to DockerThresholdConfig (default: 85% trigger, 80% clear)
- Evaluate disk alerts for running containers when RootFilesystemBytes > 0
- Include disk metadata (writable layer, total filesystem, block I/O stats)
- Update frontend to display and configure disk thresholds
- Add test coverage for disk usage alert hysteresis
- Document disk monitoring in DOCKER_MONITORING.md
Per-container and per-host overrides apply to disk thresholds the same
way they do for CPU and memory.
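The trigger/clear pair is what gives the hysteresis: an alert fires above 85% and clears only once usage drops back below 80%, so containers hovering near the line don't flap. Sketched as JSON (the field names are guesses; the 85/80 defaults are from the commit):

```json
{
  "disk": {
    "trigger": 85,
    "clear": 80
  }
}
```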
Extends Docker container monitoring with comprehensive disk and storage information:
- Writable layer size and root filesystem usage displayed in new Disk column
- Block I/O statistics (read/write bytes totals) shown in container drawer
- Mount metadata including type, source, destination, mode, and driver details
- Configurable via --collect-disk flag (enabled by default, can be disabled for large fleets)
Also fixes the config watcher to consistently use the production auth config path instead of following PULSE_DATA_DIR when in mock mode.
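For large fleets, disk collection can be switched off at the agent (binary name assumed):

```sh
pulse-docker-agent --collect-disk=false
```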
Introduces granular permission scopes for API tokens (docker:report, docker:manage, host-agent:report, monitoring:read/write, settings:read/write), allowing tokens to be restricted to the minimum required access. Legacy tokens default to full access until scopes are explicitly configured.
Adds a standalone host agent for monitoring Linux, macOS, and Windows servers outside Proxmox/Docker estates. A new Servers workspace in the UI displays uptime, OS metadata, and capacity metrics from enrolled agents.
Includes a comprehensive token management UI overhaul with scope presets, inline editing, and visual scope indicators.
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
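The correlation workflow is roughly as follows (the endpoint path is a placeholder; the header name is the documented one):

```sh
# Every response carries X-Request-ID; grab it from any call...
curl -sSI https://pulse.example/api/diagnostics | grep -i x-request-id

# ...then pull the matching log lines.
journalctl -u pulse | grep '<request-id-from-above>'
```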
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:
TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count
TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example
CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override
pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands
TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments
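Applying the tuning formula above to, say, a 20-node cluster polled every 10 seconds (config path assumed; the field names are the documented ones):

```yaml
# /etc/pulse-sensor-proxy/config.yaml
# interval_ms = polling_interval / node_count = 10000ms / 20 = 500ms
rate_limit:
  per_peer_interval_ms: 500   # default: 1000 (1 req/sec)
  per_peer_burst: 10          # default: 5
```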
Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:
- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow
Fixed parsing error in temperature proxy diagram (parentheses in
edge label causing render failure).
Diagrams should clarify architecture, not recreate implementation.
The v2 installer rollout is complete - dist/install-docker-agent.sh
now contains the bundled v2 installer with embedded library modules.
This planning document served its purpose and is no longer relevant.
Replace internal development phase reference with clear description
of what the adaptive polling scheduler does. 'Phase 2' is internal
jargon that provides no value to users.
- Remove confusing --main flag, use --source for clarity
- Fix timeout issues when building from source in LXC containers
- Increase timeout from 5min to 20min for source builds
- Add PULSE_CONTAINER_TIMEOUT env var for custom timeouts
- Support PULSE_CONTAINER_TIMEOUT=0 to disable timeout
- Fix misleading "Latest version: vX.X.X" message during source builds
- Update documentation to use --source instead of --main
- Simplify auto-update script logic for source builds
Changes:
- install.sh: Check BUILD_FROM_SOURCE early to skip version detection
- install.sh: Adaptive timeout (300s binary, 1200s source builds)
- install.sh: Better timeout error messages with recovery instructions
- README.md: Replace --main with --source in examples
- docs/INSTALL.md: Replace --main with --source in examples
- scripts/pulse-auto-update.sh: Remove --main special case
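Put together, a source build on a slow LXC looks like this (invocation shape assumed):

```sh
# Disable the timeout entirely for the source build:
PULSE_CONTAINER_TIMEOUT=0 ./install.sh --source

# ...or cap it at 30 minutes instead of the 1200s default:
PULSE_CONTAINER_TIMEOUT=1800 ./install.sh --source
```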
Add detailed API reference and update rollout playbook:
**New: docs/api/SCHEDULER_HEALTH.md**
- Complete endpoint reference for /api/monitoring/scheduler/health
- Request/response structure with field descriptions
- Enhanced "instances" array documentation
- Example responses showing all states (healthy, transient, DLQ)
- Useful jq queries for troubleshooting (one is sketched below):
  - Find instances with errors
  - List DLQ entries
  - Show open circuit breakers
  - Sort by failure streaks
- Migration guide (legacy → new fields)
- Troubleshooting examples with real scenarios
**Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md**
- Enhanced "Accessing Scheduler Health API" section (§6)
- Added examples using new instances[] array
- Updated queries to use pollStatus, breaker, deadLetter fields
- Practical jq commands for operators
**Key Documentation Features:**
- Complete JSON schema with examples
- All new fields documented with types and descriptions
- Real-world troubleshooting scenarios
- Copy-paste ready jq queries
- Migration path for existing integrations
- Backward compatibility notes
Operators can now:
- Find error messages without log digging
- Understand circuit breaker states
- Track DLQ entries with full context
- Diagnose issues using single API call
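In the spirit of those queries, a sketch (pollStatus, breaker, and deadLetter are the documented fields; the .name key and the "healthy" status value are assumptions):

```sh
curl -s "$PULSE_URL/api/monitoring/scheduler/health" | jq '
  .instances[]
  | select(.pollStatus != "healthy")
  | {instance: .name, pollStatus, breaker, deadLetter}'
```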
Part of Phase 2 follow-up - enhanced observability
Document decision to defer mutation endpoints after soak testing:
**Assessment Results:**
- Integration tests (55s, 12 instances): Automatic recovery worked perfectly
- Soak tests (2-240min, 80 instances): No manual intervention needed
- Circuit breakers: Opened/closed automatically as designed
- DLQ routing: Permanent failures handled correctly
**Current Capabilities (Sufficient):**
- Read-only scheduler health API provides full visibility
- Operator workarounds: service restart, feature flag toggle
- Grafana alerting: queue depth, staleness, DLQ, breakers
**Why Defer:**
- No operational need demonstrated in testing
- Implementation requires auth/RBAC/audit/UI work
- Cost not justified until production usage reveals need
- Can add later when data shows actual pain points
**Future Design Notes:**
- POST /api/monitoring/breakers/{instance}/reset
- POST /api/monitoring/dlq/retry (all or specific)
- DELETE /api/monitoring/dlq/{instance}
- Auth, audit, rate limiting, UI integration required
**Re-evaluation Criteria:**
- Operators request controls >3x in 30 days
- Troubleshooting steps inadequate
- Service restarts too disruptive
- Production incidents need surgical controls
Decision: Monitor production usage for 60 days, then reassess based on actual operator feedback and support ticket patterns.
Part of Phase 2 - Adaptive Polling completion
Removed all legacy Pulse+ agent metrics infrastructure (cloud-relay), which has been
fully replaced by the new docker agent and temperature agent implementations.
Changes:
- Remove cloud-relay directory and all related binaries (relay, relay-linux, etc.)
- Remove Pulse+ documentation (AGENT_METRICS_IMPLEMENTATION.md, AGENT_METRICS_SETUP.md)
- Clean up pulse-relay references in workflows and release checklist
- Add audit log rotation documentation for sensor proxy hash-chained logs
- Update .gitignore to remove cloud-relay/ entry
The new docker and temperature agents remain fully functional and unaffected by this cleanup.
Removed PHASE1_SUMMARY.md and PHASE2_SUMMARY.md as both phases are complete.
All relevant documentation has been integrated into the main docs:
- Security hardening docs in SECURITY.md
- Adaptive polling architecture in docs/monitoring/ADAPTIVE_POLLING.md
Updated PHASE2_SUMMARY.md to include:
- ✅ Task 8: Scheduler health API endpoint completion
- ✅ Task 9: Unit testing completion (40+ test cases)
- Updated git commit history (9 commits total)
- Revised known limitations (removed API/testing gaps)
- Updated future work section
Phase 2 achievements:
- 9/10 tasks complete (only integration/soak tests deferred)
- 40+ unit tests covering backoff, circuit breakers, staleness
- Full scheduler health API with authentication
- Comprehensive documentation and rollout plan
- Production-ready with feature flag control
Remaining work (deferred to future):
- Integration tests with mock PVE/PBS clients
- Soak tests for extended queue stability
- Write endpoints for circuit breaker/DLQ management
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance
New API endpoint:
GET /api/monitoring/scheduler/health (requires authentication)
New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data
Documentation updated with API spec, field descriptions, and usage examples.
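The aggregation presumably composes those snapshots along these lines (a sketch only -- the struct fields and receiver internals are assumptions; the method names are the ones listed above):

```go
func (m *Monitor) SchedulerHealth() SchedulerHealth {
	return SchedulerHealth{
		Queue:      m.queue.Snapshot(),     // depth + per-type distribution
		DeadLetter: m.queue.PeekAll(),      // top failed tasks with error details
		Staleness:  m.staleness.Snapshot(), // per-instance staleness scores
		Breakers:   m.breakerStates(),      // one circuitBreaker.State() per instance
	}
}
```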