Commit graph

125 commits

Author SHA1 Message Date
rcourtman
5dbaf7c596 Add OIDC CA bundle support 2025-11-22 09:44:03 +00:00
rcourtman
cd31bdece1 Add log level control to host agent
Related to #742
2025-11-22 07:48:34 +00:00
rcourtman
19a2cac355 Add log level control for docker agent
Related to #742
2025-11-22 07:43:48 +00:00
rcourtman
5360f05bdd Document TrueNAS exec requirement for host agent 2025-11-21 23:06:22 +00:00
rcourtman
19408fdb2b Related to #738: make pulse proxy mount migration-safe 2025-11-21 21:29:14 +00:00
rcourtman
30a069dfa6 Add TrueNAS SCALE persistence for host agent (Related to #718) 2025-11-21 10:07:14 +00:00
rcourtman
18b0323340 Add agent-id support to host agent installers (Related to #721) 2025-11-20 18:14:18 +00:00
rcourtman
3f10c97c4e docs: align sensor proxy config with current defaults 2025-11-20 12:40:01 +00:00
rcourtman
4067b0717a Improve Windows installer arch detection (related to #723) 2025-11-20 09:37:45 +00:00
rcourtman
a4cd3a275e docs(sensor-proxy): comprehensive config management documentation
Adds complete documentation for the new sensor-proxy config management CLI
implemented in Phase 2. Addresses user-facing aspects of the corruption fix.

**New Documentation:**
- docs/operations/sensor-proxy-config-management.md (469 lines)
  - Complete operations runbook for config management
  - Full CLI reference with examples
  - Migration guide from inline config
  - Architecture explanation
  - Common operational tasks
  - Troubleshooting guide
  - Best practices and automation

**Updated Documentation:**
- cmd/pulse-sensor-proxy/README.md
  - Configuration Management CLI section
  - Allowed Nodes File format
  - Enhanced troubleshooting
  - Config corruption recovery

- docs/TEMPERATURE_MONITORING.md
  - Config validation failure troubleshooting
  - Configuration Management quick reference
  - Cross-links to detailed docs

- docs/TROUBLESHOOTING.md
  - Sensor proxy config validation errors
  - Comprehensive diagnosis steps
  - Automatic and manual recovery

- README.md & docs/README.md
  - Added new runbook to operations index
  - Positioned for discoverability

**Coverage:**
- Both CLI commands fully documented
- Phase 1 & Phase 2 architecture explained
- Migration path from pre-v4.31.1
- Config corruption recovery procedures
- Safe config editing practices
- Automation examples
- Troubleshooting all failure modes

**Documentation Quality:**
- Cross-linked from 5 different documents
- Clear examples for common use cases
- Target audience: system administrators
- Follows project documentation style
- Production-ready

This completes the sensor-proxy config corruption fix by providing users
with comprehensive guidance for the new config management system.

Related to Phase 2 commits 3dc073a28, 804a638ea, 131666bc1
2025-11-19 10:01:33 +00:00
rcourtman
d93a2c1053 Improve host agent binary handling and docker installer purge (Related to #693) 2025-11-18 22:11:44 +00:00
rcourtman
3f46d35a81 feat: make PVE polling interval configurable (related to #467) 2025-11-18 21:30:04 +00:00
rcourtman
32c2a267e9 Document proxy control-plane refresh 2025-11-18 14:31:08 +00:00
rcourtman
a479040651 Improve temperature proxy workflow 2025-11-17 14:25:46 +00:00
rcourtman
de5b314842 Improve temperature proxy control-plane flow 2025-11-15 21:49:51 +00:00
rcourtman
1ed69d35b6 docs: highlight runbooks in index and script verification checklist 2025-11-14 10:39:10 +00:00
rcourtman
dca562b7b4 docs: reference log forwarding runbook in sensor proxy guides 2025-11-14 10:37:09 +00:00
rcourtman
4693900f8b docs: document sensor proxy log forwarding 2025-11-14 01:12:25 +00:00
rcourtman
34bf5f524b docs: add auto-update troubleshooting 2025-11-14 01:07:32 +00:00
rcourtman
7a9c420912 docs: clarify auto-update flow and surface proxy guide 2025-11-14 01:06:11 +00:00
rcourtman
a163988bf5 docs: add auto-update runbook 2025-11-14 01:05:06 +00:00
rcourtman
9864258207 docs: add operations runbooks and audit fixes 2025-11-14 01:01:21 +00:00
rcourtman
b77330b1be Clarify sensor proxy HTTPS workflow in docs 2025-11-14 00:48:41 +00:00
rcourtman
a278b52b5e Explain HTTPS-first temperature architecture 2025-11-14 00:45:20 +00:00
rcourtman
bfb8fd4a64 Document v4.31.0 release highlights 2025-11-14 00:43:16 +00:00
rcourtman
f2b888386c Remove installer v2 rollout doc 2025-11-13 22:36:01 +00:00
rcourtman
6d591920d9 Remove CONTRIBUTING-SCRIPTS doc 2025-11-13 22:34:21 +00:00
rcourtman
70673c1fdc Improve temperature proxy diagnostics and tests 2025-11-13 22:31:53 +00:00
rcourtman
3b079eeddb Add release dry run workflow and API update integration test 2025-11-12 21:02:52 +00:00
rcourtman
92e155ed3f Handle Snap Docker home restrictions (Related to #693) 2025-11-12 19:20:04 +00:00
rcourtman
92a5d74ba9 Add Snap Docker support to install-docker-agent.sh
Snap-installed Docker does not automatically create a docker group,
causing permission denied errors when the pulse-docker service user
tries to access /var/run/docker.sock.

Changes:
- Auto-detect Snap Docker installations
- Create docker group if missing when Snap Docker is detected
- Restart Snap Docker after group creation to refresh socket ACLs
- Add socket access validation before starting the service
- Handle symlinked Docker sockets in systemd unit ReadWritePaths
- Document troubleshooting steps in DOCKER_MONITORING.md
2025-11-11 23:07:29 +00:00
rcourtman
3a260a3130 Update Kubernetes docs with GitHub Pages Helm repository
- Replace GHCR OCI instructions with GitHub Pages repository
- Add comprehensive upgrade instructions with examples
- Add rollback procedures
- Add detailed uninstall instructions
- Simplify installation (no authentication required)
2025-11-11 19:40:51 +00:00
rcourtman
bb7ca93c18 feat: Add mdadm RAID monitoring support for host agents
Implements comprehensive mdadm RAID array monitoring for Linux hosts
via pulse-host-agent. Arrays are automatically detected and monitored
with real-time status updates, rebuild progress tracking, and automatic
alerting for degraded or failed arrays.

Key changes:

**Backend:**
- Add mdadm package for parsing mdadm --detail output
- Extend host agent report structure with RAID array data
- Integrate mdadm collection into host agent (Linux-only, best-effort)
- Add RAID array processing in monitoring system
- Implement automatic alerting:
  - Critical alerts for degraded arrays or arrays with failed devices
  - Warning alerts for rebuilding/resyncing arrays with progress tracking
  - Auto-clear alerts when arrays return to healthy state

**Frontend:**
- Add TypeScript types for RAID arrays and devices
- Display RAID arrays in host details drawer with:
  - Array status (clean/degraded/recovering) with color-coded indicators
  - Device counts (active/total/failed/spare)
  - Rebuild progress percentage and speed when applicable
  - Green for healthy, amber for rebuilding, red for degraded

**Documentation:**
- Document mdadm monitoring feature in HOST_AGENT.md
- Explain requirements (Linux, mdadm installed, root access)
- Clarify scope (software RAID only, hardware RAID not supported)

**Testing:**
- Add comprehensive tests for mdadm output parsing
- Test parsing of healthy, degraded, and rebuilding arrays
- Verify proper extraction of device states and rebuild progress

All builds pass successfully. RAID monitoring is automatic and best-effort
- if mdadm is not installed or no arrays exist, host agent continues
reporting other metrics normally.

Related to #676
2025-11-09 16:36:33 +00:00
rcourtman
188944019a docs: Add webhook private IP allowlist configuration guide
Document the new webhook security feature that allows homelab users to configure
trusted private IP ranges for webhook targets.

Includes:
- Overview of default security behavior
- Step-by-step configuration instructions
- Security considerations and best practices
- Example CIDR configurations
- Troubleshooting guidance for common error messages

Related to #673
2025-11-09 08:36:15 +00:00
rcourtman
082c6c2201 Fix documentation typo: change 'Servers' to 'Hosts' tab (related to #661) 2025-11-08 17:24:15 +00:00
rcourtman
1a3abf7f3f Fix pulse-host-agent temperature collection on all Linux distros (related to #661)
The temperature collection in pulse-host-agent was broken on all Linux
distributions due to an incorrect platform check.

Root cause:
- collectTemperatures() checked `if a.platform != "linux"` at agent.go:316
- normalisePlatform() returns the raw distro name from gopsutil (debian, ubuntu, pve)
- This caused temperature collection to be skipped on ALL Linux hosts

Fix:
- Changed check to `if runtime.GOOS != "linux"` which correctly identifies Linux
- runtime.GOOS returns "linux" regardless of distribution

Also fixed documentation typo:
- Changed "Servers tab" to "Hosts tab" in HOST_AGENT.md and TEMPERATURE_MONITORING.md
- Reported by user in issue #661 comments

Testing:
- Verified build succeeds
- Confirmed runtime.GOOS returns "linux" on Linux systems

Related to #661
2025-11-08 10:25:01 +00:00
rcourtman
3ad35976b2 Clarify Docker agent cycling troubleshooting for cloned VMs/LXCs (related to #648)
Enhanced the "Docker hosts cycling" troubleshooting entry to explicitly
call out VM/LXC cloning as a cause of identical agent IDs. Added specific
remediation steps for regenerating machine IDs on cloned systems.

This addresses the resolution path discovered in discussion #648 where a
user cloned a Proxmox LXC and encountered cycling behavior even with
separate API tokens because the agent IDs were duplicated.
2025-11-07 22:59:19 +00:00
rcourtman
2b7492ac59 feat: Add temperature collection to pulse-host-agent (related to #661)
Implements temperature monitoring in pulse-host-agent to support Docker-in-VM
deployments where the sensor proxy socket cannot cross VM boundaries.

Changes:
- Create internal/sensors package with local collection and parsing
- Add temperature collection to host agent (Linux only, best-effort)
- Support CPU package/core, NVMe, and GPU temperature sensors
- Update TEMPERATURE_MONITORING.md with Docker-in-VM setup instructions
- Update HOST_AGENT.md to document temperature feature

The host agent now automatically collects temperature data on Linux systems
with lm-sensors installed. This provides an alternative path for temperature
monitoring when running Pulse in a VM, avoiding the unix socket limitation.

Temperature collection is best-effort and fails gracefully if lm-sensors is
not available, ensuring other metrics continue to be reported.

Related to #661
2025-11-07 22:54:40 +00:00
rcourtman
52bc23b850 docs: Fix remaining :rw mount references to :ro
Updates all remaining references to read-write socket mounts in
TEMPERATURE_MONITORING.md to use read-only (:ro) mounts for security.

Changes:
- Manual installation section
- Docker-only responsibilities section
- Ansible playbook example

All socket mounts should be :ro to prevent container tampering.
2025-11-07 17:14:47 +00:00
rcourtman
427cb383d8 docs: Remove deployment checklist (per user request) 2025-11-07 17:11:15 +00:00
rcourtman
f9dc2f6466 docs: Add comprehensive security audit documentation
Adds complete documentation for 2025-11-07 security audit and hardening:

- SECURITY_AUDIT_2025-11-07.md: Full professional audit report
  - 9 security issues identified and fixed (4 critical, 4 medium, 1 low)
  - Detailed findings, remediations, and testing
  - Security posture improved from B+ to A
  - 85%+ reduction in exploitable attack surface

- SECURITY_CHANGELOG.md: Detailed changelog with migration guide
  - Complete implementation details for all fixes
  - Configuration examples
  - Backwards compatibility notes
  - New metrics and features

- DEPLOYMENT_CHECKLIST.md: Step-by-step deployment guide
  - Pre-deployment backup procedures
  - Deployment steps for Docker and LXC
  - Verification procedures
  - Rollback procedures
  - Troubleshooting guide
  - Success criteria

- README.md: Updated with security hardening highlights
  - Links to audit report
  - Key security features added

Audit performed by Claude (Sonnet 4.5) + Codex collaboration.
All implementations by Codex based on Claude specifications.
100% remediation rate (9/9 issues fixed).
17 new tests added, all passing.

Related to security audit 2025-11-07.
2025-11-07 17:10:21 +00:00
rcourtman
48fabdd827 Improve Docker temperature monitoring documentation for clarity (related to #600)
Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be
more user-friendly and address common setup issues:

- Added clear explanation of why the proxy is needed (containers can't access hardware)
- Provided concrete IP example instead of placeholder
- Showed full docker-compose.yml context with proper YAML structure
- Added sudo to commands where needed
- Updated docker-compose commands to v2 syntax with note about v1
- Expanded verification steps with clearer success indicators
- Added reminder to check container name in verification commands

These improvements should help users who encounter blank temperature displays
due to missing proxy installation or bind mount configuration.
2025-11-07 15:09:42 +00:00
rcourtman
910f2dd800 Add troubleshooting entries for Docker agent token issues (related to #648)
Added two troubleshooting sections to DOCKER_MONITORING.md:

1. "Docker hosts cycling or appearing to replace each other" - explains
   why multiple agents sharing the same token cause the UI to switch
   between hosts instead of showing all simultaneously

2. "Agent rejected after host removal" - documents the re-enrollment
   process when a host is on the removal blocklist

These entries make common setup issues searchable while linking to
canonical setup instructions rather than duplicating them.
2025-11-07 10:55:45 +00:00
rcourtman
a1dc451ed4 Document alert reliability features and DLQ API
Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
2025-11-06 17:34:05 +00:00
rcourtman
becda56897 Fix critical rollback download URL bug and doc inconsistencies
Issues found during systematic audit after #642:

1. CRITICAL BUG - Rollback downloads were completely broken:
   - Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
   - Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
   - This would cause 404 errors on all rollback attempts
   - Fixed: Construct correct tarball URL with version
   - Added: Extract tarball after download to get binary

2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
   - Changed to use /latest/download/ for future-proof docs

3. API.md example had wrong filename format:
   - Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
   - Ensures example matches actual release asset naming

The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
2025-11-06 14:25:32 +00:00
rcourtman
fd3a72606f Add standalone host-agent binaries to releases
Issue: HOST_AGENT.md documented downloading pulse-host-agent binaries
from GitHub releases, but those assets didn't exist. Only tarballs were
available, making manual installation unnecessarily complex.

Changes:
- Copy standalone host-agent binaries (all architectures) to release/
  directory alongside sensor-proxy binaries
- Include host-agent binaries in checksum generation
- Update HOST_AGENT.md to clarify available architectures
- Retroactively uploaded missing binaries to v4.26.1

This enables air-gapped and manual installations without requiring an
already-running Pulse server to download from.
2025-11-06 14:20:59 +00:00
rcourtman
6192e166f2 chore: prepare release v4.26.1 2025-11-06 12:13:56 +00:00
rcourtman
dfe960deb4 Fix container SSH detection and improve troubleshooting for issue #617
Related to #617

This fixes a misconfiguration scenario where Docker containers could
attempt direct SSH connections (producing [preauth] log spam) instead
of using the sensor proxy.

Changes:
- Fix container detection to check PULSE_DOCKER=true in addition to
  system.InContainer() heuristics (both temperature.go and config_handlers.go)
- Upgrade temperature collection log from Error to Warn with actionable
  guidance about mounting the proxy socket
- Add Info log when dev mode override is active so operators understand
  the security posture
- Add troubleshooting section to docs for SSH [preauth] logs from containers

The container detection was inconsistent - monitor.go checked both flags
but temperature.go and config_handlers.go only checked InContainer().
Now all locations consistently check PULSE_DOCKER || InContainer().
2025-11-06 09:57:53 +00:00
rcourtman
88ad986877 Revert "Hide Settings tab when authentication is not configured"
This reverts commit d5a1e3d07729bad61743e8645a636e2545e11038.
2025-11-05 23:21:34 +00:00
rcourtman
3d1c910daa Hide Settings tab when authentication is not configured
Related to #636

When authentication is not configured (hasAuth() returns false), the
Settings tab is now automatically hidden from the web interface. This
provides a cleaner monitoring-only view for unauthenticated deployments
where users only need to check the health of their environment.

The Settings icon beside the Alerts tab will only appear when
authentication is properly configured via PULSE_AUTH_USER/PASS,
API tokens, proxy auth, or OIDC.

Changes:
- Modified utilityTabs in App.tsx to conditionally include Settings
  based on hasAuth() signal
- Updated CONFIGURATION.md to document this UI behavior
2025-11-05 23:10:20 +00:00