# Pulse Troubleshooting Guide
## Common Issues and Solutions
### Correlate API Calls with Logs
Every API response includes an `X-Request-ID` header. When escalating issues, capture that value and use it to search the backend logs or log file. The same identifier is emitted as `request_id` in structured logs.
```bash
# Capture a request ID
curl -si https://pulse.example.com/api/state | grep -i x-request-id
# Search the rotating log file
grep 'request_id=abc123' /var/log/pulse/pulse.log
# Docker / kubectl example
docker logs pulse | grep 'request_id=abc123'
```
Include the `X-Request-ID` in support tickets or incident notes so responders can jump straight to the relevant log lines.
### Authentication Problems
#### Forgot Password / Lost Access
**Solution: Use the built-in recovery endpoint**
Pulse ships with a guarded recovery API that lets you regain access without wiping configuration.
1. **From the Pulse host (localhost only)**
Generate a short-lived recovery token or temporarily disable auth:
```bash
# Create a 30 minute recovery token (returns JSON with the token value)
curl -s -X POST http://localhost:7655/api/security/recovery \
  -H 'Content-Type: application/json' \
  -d '{"action":"generate_token","duration":30}'

# OR force local-only recovery access (writes .auth_recovery in the data dir)
curl -s -X POST http://localhost:7655/api/security/recovery \
  -H 'Content-Type: application/json' \
  -d '{"action":"disable_auth"}'
```
2. **If you generated a token**, use it from a trusted workstation:
```bash
curl -s -X POST https://pulse.example.com/api/security/recovery \
  -H 'Content-Type: application/json' \
  -H 'X-Recovery-Token: YOUR_TOKEN' \
  -d '{"action":"disable_auth"}'
```
The token is single-use and expires automatically.
3. **Log in and reset credentials** using Settings → Security, then re-enable auth:
```bash
curl -s -X POST http://localhost:7655/api/security/recovery \
  -H 'Content-Type: application/json' \
  -d '{"action":"enable_auth"}'
```
Alternatively, delete `/etc/pulse/.auth_recovery` (or `/data/.auth_recovery` for Docker) and restart Pulse.
Only fall back to nuking `/etc/pulse` if the recovery endpoint is unreachable.
**Prevention:**
- Use a password manager
- Store exported configuration backups securely
- Generate API tokens for automation instead of sharing passwords
#### Cannot login after setting up security
**Symptoms**: "Invalid username or password" error despite correct credentials
**Common causes and solutions:**
1. **Truncated bcrypt hash** (most common)
- Check hash is exactly 60 characters: `echo -n "$PULSE_AUTH_PASS" | wc -c`
- Look for error in logs: `Bcrypt hash appears truncated!`
- Solution: Use full 60-character hash or Quick Security Setup
2. **Docker Compose $ character issue**
- Docker Compose interprets `$` as variable expansion
- **Wrong**: `PULSE_AUTH_PASS='$2a$12$hash...'`
- **Right**: `PULSE_AUTH_PASS='$$2a$$12$$hash...'` (escape with $$)
- Alternative: Use a .env file where no escaping is needed
3. **Environment variable not loaded**
- Check if variable is set: `docker exec pulse env | grep PULSE_AUTH`
- Verify quotes around hash: Must use single quotes
- Restart container after changes
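The `$$` escaping can be sketched in a compose file (the image tag and hash below are placeholders for illustration, not working values):

```yaml
services:
  pulse:
    image: rcourtman/pulse:latest   # placeholder tag; use your deployed image
    environment:
      PULSE_AUTH_USER: admin
      # Every literal $ in the bcrypt hash is doubled so Compose does not expand it
      PULSE_AUTH_PASS: '$$2a$$12$$examplehashexamplehashexamplehashexampl'
```

With a `.env` file instead, the hash is written with single `$` characters, since Compose passes those values through without variable expansion.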
#### Password change fails
**Error**: `exec: "sudo": executable file not found`
**Solution**: Update to v4.3.8+, which removes the sudo requirement. For older versions:
```bash
# Manually update .env file
docker exec pulse sh -c "echo \"PULSE_AUTH_PASS='new-hash'\" >> /data/.env"
docker restart pulse
```
#### Can't access Pulse - stuck at login
**Symptoms**: Can't access Pulse after upgrade, no credentials work
**Solution**:
- If upgrading from pre-v4.5.0, you need to complete security setup first
- Clear browser cache and cookies
- Access http://your-ip:7655 to see setup wizard
- Complete setup, then restart container
### Docker-Specific Issues
#### No .env file in /data
**This is expected behavior** when using environment variables. The .env file is only created by:
- Quick Security Setup wizard
- Password change through UI
- Manual creation
If you provide auth via `-e` flags or docker-compose environment section, no .env is created.
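If you do want a file-based setup, a manually created `/data/.env` might look like the sketch below (the hash is a placeholder; variable names follow this guide):

```shell
# Sketch of a manually created /data/.env
# Single quotes keep the $ characters in the bcrypt hash literal
PULSE_AUTH_USER='admin'
PULSE_AUTH_PASS='$2a$12$examplehashexamplehashexamplehashexampl'
```

Restart the container after creating the file so Pulse picks it up.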
#### Container won't start
Check logs: `docker logs pulse`
Common issues:
- Port already in use: Change port mapping
- Volume permissions: Ensure volume is writable
- Invalid environment variables: Check syntax
### Port change didn't take effect
1. **Confirm which service name is active**
```bash
systemctl status pulse 2>/dev/null \
|| systemctl status pulse-backend 2>/dev/null \
|| systemctl status pulse-hot-dev
```
- Docker: the container port mapping controls the public port (`-p host:7655`).
- Kubernetes (Helm chart): service is `svc/pulse`; update `service.port` in your values file and run `helm upgrade`.
2. **Verify configuration/environment overrides**
```bash
sudo systemctl show pulse --property=Environment
```
Helm users can run `kubectl get svc pulse -n <namespace> -o yaml` to confirm the current port.
3. **Check for port conflicts**
```bash
sudo lsof -i :8080   # replace 8080 with the port you configured
```
4. **Post-change validation**
- Restart the service (`systemctl restart`, `docker restart`, or `helm upgrade`).
- v4.24.0 logs these restarts/upgrades in **Settings → System → Updates** and `/api/updates/history`; capture the `event_id` for your change notes.
### Installation Issues
#### Binary not found (v4.3.7)
**Error**: `/opt/pulse/pulse: No such file or directory`
**Cause**: v4.3.7 install script bug
**Solution**: Update to v4.3.8 or manually fix:
```bash
sudo mkdir -p /opt/pulse/bin
sudo mv /opt/pulse/pulse /opt/pulse/bin/pulse
sudo systemctl daemon-reload
sudo systemctl restart pulse
```
#### Service name confusion
Pulse uses different service names depending on installation method:
- **Default systemd install**: `pulse`
- **Legacy installs (pre-v4.7)**: `pulse-backend`
- **Hot dev environment**: `pulse-hot-dev`
- **Docker**: N/A (container name)
To check which you have:
```bash
systemctl status pulse 2>/dev/null \
|| systemctl status pulse-backend 2>/dev/null \
|| systemctl status pulse-hot-dev
```
### Notification Issues
#### Emails not sending
1. Check email configuration in Settings → Alerts
2. Verify SMTP settings and credentials
3. Check logs for errors: `docker logs pulse | grep -i email`
4. Test with a simple webhook first
#### Webhook not working
- Verify URL is accessible from Pulse server
- Check for SSL certificate issues
- Try a test service like webhook.site
- Check logs for response codes (temporarily set `LOG_LEVEL=debug` via **Settings → System → Logging** or export `LOG_LEVEL=debug` and restart; review `webhook.delivery` entries, then revert to `info`)
### Temperature Monitoring Issues
#### Temperature data flickers after adding nodes
**Symptoms:** Dashboard temperatures alternate between values and `--`, or new nodes never show readings. Proxy logs contain `limiter.rejection` messages.
**Diagnosis:**
1. Confirm you are running a build with commit 46b8b8d or later (defaults are 1 rps, burst 5). Older binaries throttle multi-node clusters aggressively.
2. Check limiter metrics:
```bash
curl -s http://127.0.0.1:9127/metrics \
  | grep -E 'pulse_proxy_limiter_(rejects|penalties)_total'
```
Any recent increment indicates rate-limit saturation.
3. Inspect scheduler health for temperature pollers (`breaker.state` should be `closed` and `deadLetter.present` must be `false`).
**Fix:** Increase the proxy burst/interval in `/etc/pulse-sensor-proxy/config.yaml`:
```yaml
rate_limit:
  per_peer_interval_ms: 500   # medium cluster (≈10 nodes)
  per_peer_burst: 10
```
Restart `pulse-sensor-proxy`, verify limiter counters stop increasing, and confirm the dashboard stabilises. Document the change in your operations log.
### VM Disk Monitoring Issues
#### VMs show "-" for disk usage
**This is normal and expected** - VMs require QEMU Guest Agent to report disk usage.
**Quick fix:**
1. Install guest agent in VM: `apt install qemu-guest-agent` (Linux) or virtio-win tools (Windows)
2. Enable in Proxmox: VM → Options → QEMU Guest Agent → Enable
3. Restart the VM
4. Wait 10 seconds for Pulse to poll again
**Detailed troubleshooting:**
See [VM Disk Monitoring Guide](VM_DISK_MONITORING.md) for full setup instructions.
#### How to diagnose VM disk issues
**Step 1: Check if guest agent is running**
On Proxmox host:
```bash
# Check if agent is enabled in VM config
qm config <VMID> | grep agent
# Test if agent responds
qm agent <VMID> ping
# Get filesystem info (what Pulse uses)
qm agent <VMID> get-fsinfo
```
Inside the VM:
```bash
# Linux
systemctl status qemu-guest-agent
# Windows (PowerShell)
Get-Service QEMU-GA
```
**Step 2: Run diagnostic script**
```bash
# On Proxmox host
curl -sSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/test-vm-disk.sh | bash
```
Or if Pulse is installed:
```bash
/opt/pulse/scripts/test-vm-disk.sh
```
### Ceph Cluster Data Missing
**Symptoms**: Ceph pools or health section missing in Storage view even though the cluster uses Ceph.
**Checklist:**
1. Confirm the Proxmox node exposes Ceph-backed storage (`Datacenter → Storage`). Types must be `rbd`, `cephfs`, or `ceph`.
2. Ensure Pulse has permission to call `/cluster/ceph/status` (Pulse's Proxmox account needs `Sys.Audit` as part of `PVEAuditor`, provided by the setup script).
3. Check the backend logs for `Ceph status unavailable` (Pulse preserves the previous Ceph state when this happens). Intermittent errors are usually network timeouts; steady errors point to permissions.
4. Run from the Pulse host:
```bash
curl -sk https://pve-node:8006/api2/json/cluster/ceph/status \
  -H "Authorization: PVEAPIToken=pulse-monitor@pam!token=<value>"
```
If this fails, verify firewall / token scope.
**Tip**: Pulse polls Ceph after storage refresh. If you recently added Ceph storage, wait one poll cycle or restart the backend to force detection.
### Backup View Filters Not Working
**Symptoms**: Backup chart does not highlight the selected time range or the grid ignores the picker.
**Checklist:**
1. Make sure you are running Pulse v4.29.0 or newer (the interactive picker was introduced alongside the new timeline). Check **Settings → System → About**.
2. Verify your browser is not forcing Legacy mode: if the top-right toggle shows “Lightweight UI”, switch back to the default view.
3. When filters appear stuck:
- Click **Reset Filters** in the toolbar.
- Clear any search chips under the chart.
- Pick a preset (24h / 7d / 30d) to re-seed the view, then move back to Custom.
4. If the grid still shows stale data, open DevTools console and ensure no errors mentioning `chartsSelection` appear. Any error here usually means a stale service worker; hard refresh (Ctrl+Shift+R) clears it.
**Tip**: Selecting bars in the chart cross-highlights matching rows. If that does not happen, confirm you do not have browser extensions that block pointer events on canvas elements.
### Docker Agent Shows Hosts Offline
**Symptoms**: `/docker` tab marks hosts as offline or missing container metrics.
**Checklist:**
1. Run the agent manually with verbose logs:
```bash
sudo /usr/local/bin/pulse-docker-agent --interval 15s --debug
```
Look for HTTP 401 (token mismatch) or socket errors.
2. Confirm the host sees Docker:
```bash
sudo docker info | head -n 20
```
3. Make sure the agent ID is stable. If running inside transient containers, set `--agent-id` explicitly so Pulse does not treat each restart as a new host.
4. Verify Pulse shows a recent heartbeat (`lastSeen`) in `/api/state` → `dockerHosts`. Hosts are marked offline after 4× the configured interval with no update.
5. For reverse proxies/TLS issues, append `--insecure` temporarily to confirm whether certificate validation is the culprit.
**Restart loops**: The Docker workspace Issues column lists the last exit codes. Investigate recurring non-zero codes in `docker logs <container>` and adjust restart policy if needed.
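The heartbeat rule above can be sketched as a quick calculation (the interval value is assumed from the `--interval 15s` example):

```shell
# Hosts are marked offline after 4x the agent interval with no heartbeat
INTERVAL_SECONDS=15
OFFLINE_AFTER=$((INTERVAL_SECONDS * 4))
echo "offline after ${OFFLINE_AFTER}s without a heartbeat"
```

If hosts flap between online and offline, either shorten the agent interval or investigate network latency between the agent and Pulse.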
**Step 3: Check Pulse logs** (continuing the VM disk diagnosis above)
```bash
# Docker
docker logs pulse | grep -i "guest agent\|fsinfo"
# Systemd
journalctl -u pulse -f | grep -i "guest agent\|fsinfo"
```
Look for specific error reasons:
- `agent-not-running` - Agent service not started in VM
- `agent-disabled` - Not enabled in VM config
- `agent-timeout` - Agent not responding (may need restart)
- `permission-denied` - Check permissions (see below)
- `no-filesystems` - Agent returned no usable filesystem data
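To see which of these reasons dominates, a small helper can tally them from a saved log file (a sketch; `count_agent_errors` is a hypothetical name, and the reason strings are the ones listed above):

```shell
# Tally guest-agent failure reasons in a Pulse log file
count_agent_errors() {
  local log="$1"
  local reason count
  for reason in agent-not-running agent-disabled agent-timeout permission-denied no-filesystems; do
    # grep -c prints 0 and exits non-zero when there are no matches
    count=$(grep -c "$reason" "$log" 2>/dev/null || true)
    echo "$reason: ${count:-0}"
  done
}

# Example: count_agent_errors /var/log/pulse/pulse.log
```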
#### Permission denied errors
If Pulse logs show permission denied when querying guest agent:
**Check permissions:**
```bash
# On Proxmox host
pveum user permissions pulse-monitor@pam
```
**Required permissions:**
- **Proxmox 9:** `VM.GuestAgent.Audit` privilege (Pulse setup adds this via the `PulseMonitor` role)
- **Proxmox 8:** `VM.Monitor` privilege (Pulse setup adds this via the `PulseMonitor` role)
- **All versions:** `Sys.Audit` is recommended for Ceph metrics and applied when available
**Fix permissions:**
Re-run the Pulse setup script on the Proxmox node:
```bash
curl -sSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/setup-pve.sh | bash
```
Or manually:
```bash
# Shared read-only access
pveum aclmod / -user pulse-monitor@pam -role PVEAuditor

# Extra privileges for guest metrics and Ceph
EXTRA_PRIVS=()

# Sys.Audit (Ceph, cluster status)
if pveum role list 2>/dev/null | grep -q "Sys.Audit"; then
  EXTRA_PRIVS+=(Sys.Audit)
else
  if pveum role add PulseTmpSysAudit -privs Sys.Audit 2>/dev/null; then
    EXTRA_PRIVS+=(Sys.Audit)
    pveum role delete PulseTmpSysAudit 2>/dev/null
  fi
fi

# VM guest agent / monitor privileges
VM_PRIV=""
if pveum role list 2>/dev/null | grep -q "VM.Monitor"; then
  VM_PRIV="VM.Monitor"
elif pveum role list 2>/dev/null | grep -q "VM.GuestAgent.Audit"; then
  VM_PRIV="VM.GuestAgent.Audit"
else
  if pveum role add PulseTmpVMMonitor -privs VM.Monitor 2>/dev/null; then
    VM_PRIV="VM.Monitor"
    pveum role delete PulseTmpVMMonitor 2>/dev/null
  elif pveum role add PulseTmpGuestAudit -privs VM.GuestAgent.Audit 2>/dev/null; then
    VM_PRIV="VM.GuestAgent.Audit"
    pveum role delete PulseTmpGuestAudit 2>/dev/null
  fi
fi

if [ -n "$VM_PRIV" ]; then
  EXTRA_PRIVS+=("$VM_PRIV")
fi

if [ ${#EXTRA_PRIVS[@]} -gt 0 ]; then
  PRIV_STRING="${EXTRA_PRIVS[*]}"
  pveum role delete PulseMonitor 2>/dev/null
  pveum role add PulseMonitor -privs "$PRIV_STRING"
  pveum aclmod / -user pulse-monitor@pam -role PulseMonitor
fi
```
**Important:** Both API tokens and passwords work fine for guest agent access. If you see permission errors, it's a permission configuration issue, not an authentication method limitation.
#### Guest agent installed but no disk data
If agent responds to ping but returns no filesystem info:
1. **Check agent version** - Update to latest:
```bash
# Linux
apt update && apt install --only-upgrade qemu-guest-agent
systemctl restart qemu-guest-agent
```
2. **Check filesystem permissions** - Agent needs read access to filesystem data
3. **Windows VMs** - Ensure VirtIO drivers are up to date from latest virtio-win ISO
4. **Special filesystems only** - If the VM only has special filesystems (tmpfs, ISO mounts), this is normal for live systems
#### Specific VM types
**Cloud images:**
- Most have guest agent pre-installed but disabled
- Enable with: `systemctl enable --now qemu-guest-agent`
**Windows VMs:**
- Must install VirtIO guest tools
- Ensure "QEMU Guest Agent" service is running
- May need "QEMU Guest Agent VSS Provider" for full functionality
**Container-based VMs (Docker/Kubernetes hosts):**
- Will show high disk usage due to container layers
- This is accurate - containers consume real disk space
- Consider monitoring container disk separately
### Performance Issues
#### High CPU usage
- Polling interval is fixed at 10 seconds (matches Proxmox update cycle)
- Check number of monitored nodes
- Disable unused features (snapshots, backups monitoring)
#### High memory usage
- Normal for monitoring many nodes
- Check metrics retention settings
- Restart container to clear any memory leaks
### Network Issues
#### Cannot connect to Proxmox nodes
1. Verify Proxmox API is accessible:
```bash
curl -k https://proxmox-ip:8006
```
2. Check credentials have proper permissions (PVEAuditor minimum)
3. Verify network connectivity between Pulse and Proxmox
4. Check for firewall rules blocking port 8006
#### PBS connection issues
- Ensure API token has Datastore.Audit permission
- Check PBS is accessible on port 8007
- Verify token format: `user@realm!tokenid=secret`
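A quick sanity check of the token reference shape (a sketch; the regex only validates the `user@realm!tokenid=secret` layout, not the credential itself):

```shell
# Validate the user@realm!tokenid=secret shape before pasting it into Pulse
TOKEN='backup@pbs!pulse=12345678-aaaa-bbbb-cccc-123456789abc'
if printf '%s' "$TOKEN" | grep -Eq '^[^@]+@[^!]+![^=]+=.+$'; then
  echo "token format OK"
else
  echo "token format invalid"
fi
```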
### Update Issues
#### Updates not showing
- Check update channel in Settings → System
- Verify internet connectivity
- Check GitHub API rate limits
- Manual update: Pull latest Docker image or run install script
#### Update fails to apply
**Docker**: Pull new image and recreate container
**Native**: Run install script again or check logs
### Data Recovery
#### Lost authentication
See [Forgot Password / Lost Access](#forgot-password--lost-access) section above.
**Last resort**: If the recovery endpoint is unreachable, start fresh by deleting your Pulse data and restarting.
#### Corrupt configuration
Restore from backup or delete config files to start fresh:
```bash
# Docker
docker exec pulse rm /data/*.json /data/*.enc
docker restart pulse
# Native
sudo rm /etc/pulse/*.json /etc/pulse/*.enc
sudo systemctl restart pulse
```
## Getting Help
### Collect diagnostic information
```bash
# Version
curl http://localhost:7655/api/version
# Logs (last 100 lines)
docker logs --tail 100 pulse # Docker
journalctl -u pulse -n 100 # Native
# Environment
docker exec pulse env | grep -E "PULSE|API" # Docker
systemctl show pulse --property=Environment # Native
```
### Report issues
When reporting issues, include:
1. Pulse version
2. Deployment type (Docker/LXC/Manual)
3. Error messages from logs
4. Steps to reproduce
5. Expected vs actual behavior
Report at: https://github.com/rcourtman/Pulse/issues