Commit graph

97 commits

Author SHA1 Message Date
rcourtman
6a1a88217f Add release dry run workflow and API update integration test 2025-11-12 21:02:52 +00:00
rcourtman
305a5b17bc Handle Snap Docker home restrictions (Related to #693) 2025-11-12 19:20:04 +00:00
rcourtman
95c8fe9c2d Add Snap Docker support to install-docker-agent.sh
Snap-installed Docker does not automatically create a docker group,
causing permission denied errors when the pulse-docker service user
tries to access /var/run/docker.sock.

Changes:
- Auto-detect Snap Docker installations
- Create docker group if missing when Snap Docker is detected
- Restart Snap Docker after group creation to refresh socket ACLs
- Add socket access validation before starting the service
- Handle symlinked Docker sockets in systemd unit ReadWritePaths
- Document troubleshooting steps in DOCKER_MONITORING.md
2025-11-11 23:07:29 +00:00
rcourtman
3477aa3dae Update Kubernetes docs with GitHub Pages Helm repository
- Replace GHCR OCI instructions with GitHub Pages repository
- Add comprehensive upgrade instructions with examples
- Add rollback procedures
- Add detailed uninstall instructions
- Simplify installation (no authentication required)
2025-11-11 19:40:51 +00:00
rcourtman
bb7ca93c18 feat: Add mdadm RAID monitoring support for host agents
Implements comprehensive mdadm RAID array monitoring for Linux hosts
via pulse-host-agent. Arrays are automatically detected and monitored
with real-time status updates, rebuild progress tracking, and automatic
alerting for degraded or failed arrays.

Key changes:

**Backend:**
- Add mdadm package for parsing mdadm --detail output
- Extend host agent report structure with RAID array data
- Integrate mdadm collection into host agent (Linux-only, best-effort)
- Add RAID array processing in monitoring system
- Implement automatic alerting:
  - Critical alerts for degraded arrays or arrays with failed devices
  - Warning alerts for rebuilding/resyncing arrays with progress tracking
  - Auto-clear alerts when arrays return to healthy state

**Frontend:**
- Add TypeScript types for RAID arrays and devices
- Display RAID arrays in host details drawer with:
  - Array status (clean/degraded/recovering) with color-coded indicators
  - Device counts (active/total/failed/spare)
  - Rebuild progress percentage and speed when applicable
  - Green for healthy, amber for rebuilding, red for degraded

**Documentation:**
- Document mdadm monitoring feature in HOST_AGENT.md
- Explain requirements (Linux, mdadm installed, root access)
- Clarify scope (software RAID only, hardware RAID not supported)

**Testing:**
- Add comprehensive tests for mdadm output parsing
- Test parsing of healthy, degraded, and rebuilding arrays
- Verify proper extraction of device states and rebuild progress

All builds pass successfully. RAID monitoring is automatic and best-effort
- if mdadm is not installed or no arrays exist, host agent continues
reporting other metrics normally.

Related to #676
2025-11-09 16:36:33 +00:00
rcourtman
188944019a docs: Add webhook private IP allowlist configuration guide
Document the new webhook security feature that allows homelab users to configure
trusted private IP ranges for webhook targets.

Includes:
- Overview of default security behavior
- Step-by-step configuration instructions
- Security considerations and best practices
- Example CIDR configurations
- Troubleshooting guidance for common error messages

Related to #673
2025-11-09 08:36:15 +00:00
rcourtman
082c6c2201 Fix documentation typo: change 'Servers' to 'Hosts' tab (related to #661) 2025-11-08 17:24:15 +00:00
rcourtman
1a3abf7f3f Fix pulse-host-agent temperature collection on all Linux distros (related to #661)
The temperature collection in pulse-host-agent was broken on all Linux
distributions due to an incorrect platform check.

Root cause:
- collectTemperatures() checked `if a.platform != "linux"` at agent.go:316
- normalisePlatform() returns the raw distro name from gopsutil (debian, ubuntu, pve)
- This caused temperature collection to be skipped on ALL Linux hosts

Fix:
- Changed check to `if runtime.GOOS != "linux"` which correctly identifies Linux
- runtime.GOOS returns "linux" regardless of distribution

Also fixed documentation typo:
- Changed "Servers tab" to "Hosts tab" in HOST_AGENT.md and TEMPERATURE_MONITORING.md
- Reported by user in issue #661 comments

Testing:
- Verified build succeeds
- Confirmed runtime.GOOS returns "linux" on Linux systems

Related to #661
2025-11-08 10:25:01 +00:00
rcourtman
3ad35976b2 Clarify Docker agent cycling troubleshooting for cloned VMs/LXCs (related to #648)
Enhanced the "Docker hosts cycling" troubleshooting entry to explicitly
call out VM/LXC cloning as a cause of identical agent IDs. Added specific
remediation steps for regenerating machine IDs on cloned systems.

This addresses the resolution path discovered in discussion #648 where a
user cloned a Proxmox LXC and encountered cycling behavior even with
separate API tokens because the agent IDs were duplicated.
2025-11-07 22:59:19 +00:00
rcourtman
2b7492ac59 feat: Add temperature collection to pulse-host-agent (related to #661)
Implements temperature monitoring in pulse-host-agent to support Docker-in-VM
deployments where the sensor proxy socket cannot cross VM boundaries.

Changes:
- Create internal/sensors package with local collection and parsing
- Add temperature collection to host agent (Linux only, best-effort)
- Support CPU package/core, NVMe, and GPU temperature sensors
- Update TEMPERATURE_MONITORING.md with Docker-in-VM setup instructions
- Update HOST_AGENT.md to document temperature feature

The host agent now automatically collects temperature data on Linux systems
with lm-sensors installed. This provides an alternative path for temperature
monitoring when running Pulse in a VM, avoiding the unix socket limitation.

Temperature collection is best-effort and fails gracefully if lm-sensors is
not available, ensuring other metrics continue to be reported.

Related to #661
2025-11-07 22:54:40 +00:00
rcourtman
52bc23b850 docs: Fix remaining :rw mount references to :ro
Updates all remaining references to read-write socket mounts in
TEMPERATURE_MONITORING.md to use read-only (:ro) mounts for security.

Changes:
- Manual installation section
- Docker-only responsibilities section
- Ansible playbook example

All socket mounts should be :ro to prevent container tampering.
2025-11-07 17:14:47 +00:00
rcourtman
427cb383d8 docs: Remove deployment checklist (per user request) 2025-11-07 17:11:15 +00:00
rcourtman
f9dc2f6466 docs: Add comprehensive security audit documentation
Adds complete documentation for 2025-11-07 security audit and hardening:

- SECURITY_AUDIT_2025-11-07.md: Full professional audit report
  - 9 security issues identified and fixed (4 critical, 4 medium, 1 low)
  - Detailed findings, remediations, and testing
  - Security posture improved from B+ to A
  - 85%+ reduction in exploitable attack surface

- SECURITY_CHANGELOG.md: Detailed changelog with migration guide
  - Complete implementation details for all fixes
  - Configuration examples
  - Backwards compatibility notes
  - New metrics and features

- DEPLOYMENT_CHECKLIST.md: Step-by-step deployment guide
  - Pre-deployment backup procedures
  - Deployment steps for Docker and LXC
  - Verification procedures
  - Rollback procedures
  - Troubleshooting guide
  - Success criteria

- README.md: Updated with security hardening highlights
  - Links to audit report
  - Key security features added

Audit performed by Claude (Sonnet 4.5) + Codex collaboration.
All implementations by Codex based on Claude specifications.
100% remediation rate (9/9 issues fixed).
17 new tests added, all passing.

Related to security audit 2025-11-07.
2025-11-07 17:10:21 +00:00
rcourtman
48fabdd827 Improve Docker temperature monitoring documentation for clarity (related to #600)
Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be
more user-friendly and address common setup issues:

- Added clear explanation of why the proxy is needed (containers can't access hardware)
- Provided concrete IP example instead of placeholder
- Showed full docker-compose.yml context with proper YAML structure
- Added sudo to commands where needed
- Updated docker-compose commands to v2 syntax with note about v1
- Expanded verification steps with clearer success indicators
- Added reminder to check container name in verification commands

These improvements should help users who encounter blank temperature displays
due to missing proxy installation or bind mount configuration.
2025-11-07 15:09:42 +00:00
rcourtman
910f2dd800 Add troubleshooting entries for Docker agent token issues (related to #648)
Added two troubleshooting sections to DOCKER_MONITORING.md:

1. "Docker hosts cycling or appearing to replace each other" - explains
   why multiple agents sharing the same token cause the UI to switch
   between hosts instead of showing all simultaneously

2. "Agent rejected after host removal" - documents the re-enrollment
   process when a host is on the removal blocklist

These entries make common setup issues searchable while linking to
canonical setup instructions rather than duplicating them.
2025-11-07 10:55:45 +00:00
rcourtman
a1dc451ed4 Document alert reliability features and DLQ API
Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
2025-11-06 17:34:05 +00:00
rcourtman
becda56897 Fix critical rollback download URL bug and doc inconsistencies
Issues found during systematic audit after #642:

1. CRITICAL BUG - Rollback downloads were completely broken:
   - Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
   - Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
   - This would cause 404 errors on all rollback attempts
   - Fixed: Construct correct tarball URL with version
   - Added: Extract tarball after download to get binary

2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
   - Changed to use /latest/download/ for future-proof docs

3. API.md example had wrong filename format:
   - Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
   - Ensures example matches actual release asset naming

The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
2025-11-06 14:25:32 +00:00
rcourtman
fd3a72606f Add standalone host-agent binaries to releases
Issue: HOST_AGENT.md documented downloading pulse-host-agent binaries
from GitHub releases, but those assets didn't exist. Only tarballs were
available, making manual installation unnecessarily complex.

Changes:
- Copy standalone host-agent binaries (all architectures) to release/
  directory alongside sensor-proxy binaries
- Include host-agent binaries in checksum generation
- Update HOST_AGENT.md to clarify available architectures
- Retroactively uploaded missing binaries to v4.26.1

This enables air-gapped and manual installations without requiring an
already-running Pulse server to download from.
2025-11-06 14:20:59 +00:00
rcourtman
6192e166f2 chore: prepare release v4.26.1 2025-11-06 12:13:56 +00:00
rcourtman
dfe960deb4 Fix container SSH detection and improve troubleshooting for issue #617
Related to #617

This fixes a misconfiguration scenario where Docker containers could
attempt direct SSH connections (producing [preauth] log spam) instead
of using the sensor proxy.

Changes:
- Fix container detection to check PULSE_DOCKER=true in addition to
  system.InContainer() heuristics (both temperature.go and config_handlers.go)
- Upgrade temperature collection log from Error to Warn with actionable
  guidance about mounting the proxy socket
- Add Info log when dev mode override is active so operators understand
  the security posture
- Add troubleshooting section to docs for SSH [preauth] logs from containers

The container detection was inconsistent - monitor.go checked both flags
but temperature.go and config_handlers.go only checked InContainer().
Now all locations consistently check PULSE_DOCKER || InContainer().
2025-11-06 09:57:53 +00:00
rcourtman
88ad986877 Revert "Hide Settings tab when authentication is not configured"
This reverts commit d5a1e3d07729bad61743e8645a636e2545e11038.
2025-11-05 23:21:34 +00:00
rcourtman
3d1c910daa Hide Settings tab when authentication is not configured
Related to #636

When authentication is not configured (hasAuth() returns false), the
Settings tab is now automatically hidden from the web interface. This
provides a cleaner monitoring-only view for unauthenticated deployments
where users only need to check the health of their environment.

The Settings icon beside the Alerts tab will only appear when
authentication is properly configured via PULSE_AUTH_USER/PASS,
API tokens, proxy auth, or OIDC.

Changes:
- Modified utilityTabs in App.tsx to conditionally include Settings
  based on hasAuth() signal
- Updated CONFIGURATION.md to document this UI behavior
2025-11-05 23:10:20 +00:00
rcourtman
8ca31003a0 docs: document TLS certificate file permissions for HTTPS setup
Add comprehensive documentation for HTTPS/TLS configuration including:
- File ownership and permission requirements (pulse user)
- Common troubleshooting steps for startup failures
- Complete setup examples for systemd and Docker
- Validation commands for certificate/key verification

Related to discussion #634
2025-11-05 23:08:02 +00:00
rcourtman
efa1ec1cd9 docs: document per-metric alert delay configuration (addresses #433)
Added comprehensive documentation for the per-metric alert delay feature
that was requested in issue #433. This feature allows configuring
different alert delays for different metrics (e.g., longer delays for
CPU spikes, shorter delays for memory pressure).

Key additions:
- Detailed explanation of delay precedence hierarchy
- JSON configuration examples for common use cases
- Table of recommended delays by metric type with reasoning
- UI access instructions for the Alert Delay row

Also added example tests demonstrating the feature's functionality
and common configuration patterns.

The feature itself was already fully implemented in both backend
(metricTimeThresholds support) and frontend (per-metric delay inputs
in ResourceTable). This commit surfaces the feature through
documentation so users know it exists and how to use it.

Related to #433
2025-11-05 20:04:44 +00:00
rcourtman
545634372e Document log_level configuration for pulse-sensor-proxy
Update hardening documentation to include log_level configuration option.
Users can now find examples of controlling logging verbosity through
YAML config and environment variables.

Related to #629
2025-11-05 19:48:42 +00:00
rcourtman
a5e3469da8 Add comprehensive automation documentation for temperature proxy installation
This addresses the need for users who deploy Pulse via infrastructure-as-code
tools (Ansible, Terraform, Salt, Puppet) to have scriptable, well-documented
installation procedures.

Changes:

**Comprehensive Automation Section:**
- Documented all installer script flags and options
  - Required: --ctid (LXC) or --standalone (Docker)
  - Optional: --quiet, --pulse-server, --version, --local-binary, --skip-restart
  - Documented idempotency, exit codes, and non-interactive behavior

**Real-World Examples:**
- Ansible playbook for LXC deployments
- Ansible playbook for Docker deployments (includes docker-compose.yml management)
- Terraform null_resource example with remote-exec
- Manual step-by-step configuration (no script)

**Configuration Documentation:**
- Complete YAML config file format with all options
- Environment variable overrides (PULSE_SENSOR_PROXY_ALLOWED_SUBNETS, etc.)
- Example systemd service overrides
- Rate limiting, metrics, ACL, and subnet configuration

**Quick Reference:**
- Added link at top of doc for automation users to jump directly to automation section
- Clear examples of re-running after changes (adding nodes, upgrading versions)

**Key Features for Automation:**
- --quiet flag for non-interactive execution
- Idempotent design (safe to re-run)
- Verifiable exit codes
- Environment variable configuration
- Local binary support (no internet required)

This makes it straightforward for infrastructure teams to integrate Pulse
temperature monitoring into their existing automation workflows without
relying on interactive scripts or manual steps.
2025-11-05 18:18:04 +00:00
rcourtman
a1fb79ae6a Fix temperature proxy documentation and setup script for Docker vs LXC clarity
This addresses confusion around temperature monitoring setup for Docker
deployments where users expected a turnkey experience similar to LXC.

The core issue: The setup script and documentation suggested that
temperature monitoring was "automatically configured" for all containerized
deployments, but in reality only LXC containers have a fully automatic
setup. Docker requires manual steps.

Changes:

**Setup Script (config_handlers.go):**
- Fixed "unknown environment" path to show separate instructions for LXC vs Docker
- Docker instructions now correctly show --standalone flag (was incorrectly showing --ctid)
- Added docker-compose.yml bind mount instructions inline
- Added restart command for Docker deployments

**Documentation (TEMPERATURE_MONITORING.md):**
- Added prominent "Deployment-Specific Setup" callout at the top
- Clarified that LXC is fully automatic, Docker requires manual steps
- Reorganized "Setup (Automatic)" section to clearly distinguish:
  - LXC: Fully turnkey (no manual steps)
  - Docker: Manual proxy installation required
  - Node configuration: Works for both
- Updated "Host-side responsibilities" to specify it's Docker-only
- Fixed architecture benefits to reflect LXC vs Docker differences

Why this matters:
- LXC setup script auto-detects the container and runs install-sensor-proxy.sh --ctid
- Docker deployments can't be auto-detected and require --standalone flag
- Users running Docker were getting incorrect instructions (--ctid instead of --standalone)
- Documentation suggested everything was automatic, leading to confusion

Now the documentation and setup script accurately reflect that:
- LXC = Turnkey (automatic)
- Docker = Manual steps required (but well-documented)
- Native = Direct SSH (no proxy)

Related to GitHub Discussion #605
2025-11-05 18:18:04 +00:00
rcourtman
26144ae558 Improve temperature proxy setup guidance for Docker deployments
This addresses GitHub Discussion #605 where users were unclear about
configuring the temperature proxy when running Pulse in Docker.

Changes:

**install-sensor-proxy.sh:**
- Add Docker-specific post-install instructions when --standalone flag is used
- Show required docker-compose.yml bind mount configuration
- Provide verification commands for Docker deployments
- Link to full documentation for troubleshooting

**TEMPERATURE_MONITORING.md:**
- Add prominent "Quick Start for Docker Deployments" section at the top
- Move Docker instructions earlier in the document for better visibility
- Provide complete 4-step setup process with verification commands

These changes ensure Docker users immediately see:
1. How to install the proxy on the Proxmox host
2. What bind mount to add to docker-compose.yml
3. How to restart and verify the setup
4. Where to find detailed troubleshooting

The installer now provides actionable next steps instead of just
confirming installation, reducing confusion for containerized deployments.
2025-11-05 18:18:04 +00:00
rcourtman
7a185c4ab3 Improve guest agent timeout handling for high-load environments (refs #592)
This change addresses intermittent "Guest details unavailable" and "Disk stats
unavailable" errors affecting users with large VM deployments (50+ VMs) or
high-load Proxmox environments.

Changes:
- Increased default guest agent timeouts (3-5s → 10-15s) to better handle
  environments under load
- Added automatic retry logic (1 retry by default) for transient timeout failures
- Made all timeouts and retry count configurable via environment variables:
  * GUEST_AGENT_FSINFO_TIMEOUT (default: 15s)
  * GUEST_AGENT_NETWORK_TIMEOUT (default: 10s)
  * GUEST_AGENT_OSINFO_TIMEOUT (default: 10s)
  * GUEST_AGENT_VERSION_TIMEOUT (default: 10s)
  * GUEST_AGENT_RETRIES (default: 1)
- Added comprehensive documentation in VM_DISK_MONITORING.md with configuration
  examples for different deployment scenarios

These improvements allow Pulse to gracefully handle intermittent API timeouts
without immediately displaying errors, while remaining configurable for
different network conditions and environment sizes.

Fixes: https://github.com/rcourtman/Pulse/discussions/592
2025-11-05 09:40:58 +00:00
rcourtman
d52ac6d8b5 Fix CSRF token validation and improve token management
- Add Access-Control-Expose-Headers to allow frontend to read X-CSRF-Token response header
- Implement proactive CSRF token issuance on GET requests when session exists but CSRF cookie is missing
- Ensures frontend always has valid CSRF token before making POST requests
- Fixes 403 Forbidden errors when toggling system settings

This resolves CSRF validation failures that occurred when CSRF tokens expired or were missing while valid sessions existed.
2025-11-05 09:23:44 +00:00
rcourtman
10862db4e4 Enhance container detection for temperature SSH safeguards (refs #601) 2025-11-04 22:30:35 +00:00
rcourtman
6eb1a10d9b Refactor: Code cleanup and localStorage consolidation
This commit includes comprehensive codebase cleanup and refactoring:

## Code Cleanup
- Remove dead TypeScript code (types/monitoring.ts - 194 lines duplicate)
- Remove unused Go functions (GetClusterNodes, MigratePassword, GetClusterHealthInfo)
- Clean up commented-out code blocks across multiple files
- Remove unused TypeScript exports (helpTextClass, private tag color helpers)
- Delete obsolete test files and components

## localStorage Consolidation
- Centralize all storage keys into STORAGE_KEYS constant
- Update 5 files to use centralized keys:
  * utils/apiClient.ts (AUTH, LEGACY_TOKEN)
  * components/Dashboard/Dashboard.tsx (GUEST_METADATA)
  * components/Docker/DockerHosts.tsx (DOCKER_METADATA)
  * App.tsx (PLATFORMS_SEEN)
  * stores/updates.ts (UPDATES)
- Benefits: Single source of truth, prevents typos, better maintainability

## Previous Work Committed
- Docker monitoring improvements and disk metrics
- Security enhancements and setup fixes
- API refactoring and cleanup
- Documentation updates
- Build system improvements

## Testing
- All frontend tests pass (29 tests)
- All Go tests pass (15 packages)
- Production build successful
- Zero breaking changes

Total: 186 files changed, 5825 insertions(+), 11602 deletions(-)
2025-11-04 21:50:46 +00:00
rcourtman
5c4be1921c chore: snapshot current changes 2025-11-02 22:47:55 +00:00
rcourtman
fb22469eb0 Add disk usage threshold support for Docker containers
Extends the Docker monitoring and alerting system to track writable layer
usage as a percentage of the container's root filesystem. This helps
identify containers with bloated copy-on-write layers before they
consume excessive disk space.

- Add disk threshold to DockerThresholdConfig (default: 85% trigger, 80% clear)
- Evaluate disk alerts for running containers when RootFilesystemBytes > 0
- Include disk metadata (writable layer, total filesystem, block I/O stats)
- Update frontend to display and configure disk thresholds
- Add test coverage for disk usage alert hysteresis
- Document disk monitoring in DOCKER_MONITORING.md

Per-container and per-host overrides apply to disk thresholds the same
way they do for CPU and memory.
2025-10-29 14:52:25 +00:00
rcourtman
32392d1212 Add disk metrics, block I/O, and mount details to Docker monitoring
Extends Docker container monitoring with comprehensive disk and storage information:
- Writable layer size and root filesystem usage displayed in new Disk column
- Block I/O statistics (read/write bytes totals) shown in container drawer
- Mount metadata including type, source, destination, mode, and driver details
- Configurable via --collect-disk flag (enabled by default, can be disabled for large fleets)

Also fixes config watcher to consistently use production auth config path instead of following PULSE_DATA_DIR when in mock mode.
2025-10-29 12:05:36 +00:00
rcourtman
b3285c05c8 Consolidate pending changes
- Add Docker metadata test comment
- Update alerts configuration and thresholds
- Enhance config file watcher
- Update documentation
- Refine settings UI
2025-10-28 23:20:44 +00:00
rcourtman
e07336dd9f refactor: remove legacy DISABLE_AUTH flag and enhance authentication UX
Major authentication system improvements:

- Remove deprecated DISABLE_AUTH environment variable support
- Update all documentation to remove DISABLE_AUTH references
- Add auth recovery instructions to docs (create .auth_recovery file)
- Improve first-run setup and Quick Security wizard flows
- Enhance login page with better error messaging and validation
- Refactor Docker hosts view with new unified table and tree components
- Add useDebouncedValue hook for better search performance
- Improve Settings page with better security configuration UX
- Update mock mode and development scripts for consistency
- Add ScrollableTable persistence and improved responsive design

Backend changes:
- Remove DISABLE_AUTH flag detection and handling
- Improve auth configuration validation and error messages
- Enhance security status endpoint responses
- Update router integration tests

Frontend changes:
- New Docker components: DockerUnifiedTable, DockerTree, DockerSummaryStats
- Better connection status indicator positioning
- Improved authentication state management
- Enhanced CSRF and session handling
- Better loading states and error recovery

This completes the migration away from the insecure DISABLE_AUTH pattern
toward proper authentication with recovery mechanisms.
2025-10-27 19:46:51 +00:00
rcourtman
68ce8e7520 feat: finalize swarm service monitoring (#598) 2025-10-26 09:35:49 +00:00
rcourtman
8e83eaf823 Add container state filtering to Docker agent 2025-10-25 21:40:59 +00:00
rcourtman
5a2d808aa1 Harden setup token flow and enforce encrypted persistence 2025-10-25 16:00:37 +00:00
rcourtman
d643dcf0bc perf: reduce polling allocations and guest metadata load 2025-10-25 13:12:47 +00:00
rcourtman
138d8facd2 Improve host agent onboarding flow 2025-10-25 09:37:29 +00:00
rcourtman
cee24ff7e0 docs: refresh API token scope guidance 2025-10-23 13:44:19 +00:00
rcourtman
5c54685f04 Add API token scopes and standalone host agent
Introduces granular permission scopes for API tokens (docker:report, docker:manage, host-agent:report, monitoring:read/write, settings:read/write) allowing tokens to be restricted to minimum required access. Legacy tokens default to full access until scopes are explicitly configured.

Adds standalone host agent for monitoring Linux, macOS, and Windows servers outside Proxmox/Docker estates. New Servers workspace in UI displays uptime, OS metadata, and capacity metrics from enrolled agents.

Includes comprehensive token management UI overhaul with scope presets, inline editing, and visual scope indicators.
2025-10-23 11:40:31 +00:00
rcourtman
e1fe8354e9 Ensure Docker agent builds stay static (#597) 2025-10-22 21:48:57 +00:00
rcourtman
bc479643e4 release: prepare v4.25.0 2025-10-22 10:46:18 +00:00
rcourtman
ff4dc49ae4 Update Pulse install flow and related components 2025-10-21 19:58:53 +00:00
rcourtman
e0396c1362 docs: update documentation for diagnostics improvements
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.

**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
  Prometheus metrics with alert suggestions

**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
  explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
  using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
  with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
  defaults

These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards

Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
2025-10-21 12:45:19 +00:00
rcourtman
ddc9a7a068 docs: comprehensive documentation for rate limit fix and configurability
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:

TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count

TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example

CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override

pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands

TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments

Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
2025-10-21 11:36:07 +00:00
rcourtman
2f43d67af9 docs: simplify Mermaid diagrams for better readability
The previous diagrams were too complex and overwhelming. Simplified
all diagrams to show core concepts clearly:

- Adaptive polling: reduced to basic scheduler→queue→workers flow
- Temperature proxy: simplified to 3-box trust boundary view
- Sensor proxy sequence: simplified to essential request flow
- Webhook pipeline: reduced to template→send→retry flow
- Script library: simplified to code→test→bundle→dist flow

Fixed parsing error in temperature proxy diagram (parentheses in
edge label causing render failure).

Diagrams should clarify architecture, not recreate implementation.
2025-10-21 10:50:40 +00:00