Pulse/docs/zfs-monitoring.md
rcourtman c91b7874ac docs: comprehensive v4.24.0 documentation audit and updates
2025-10-20 17:20:13 +00:00


# ZFS Pool Monitoring
Pulse v4.15.0+ includes automatic ZFS pool health monitoring for Proxmox VE nodes.
## Features
- **Automatic Detection**: Detects ZFS storage and monitors associated pools
- **Health Status**: Monitors pool state (ONLINE, DEGRADED, FAULTED, UNAVAIL)
- **Error Tracking**: Tracks read, write, and checksum errors
- **Device Monitoring**: Monitors individual devices within pools
- **Alert Generation**: Creates alerts for degraded pools and device errors
- **Frontend Display**: Shows ZFS issues inline with storage information
## Requirements
### Proxmox Permissions
The Pulse user needs `Sys.Audit` permission on `/nodes/{node}/disks` to access ZFS information:
```bash
# Grant permission for ZFS monitoring (already included in standard Pulse role)
pveum acl modify /nodes -user pulse-monitor@pam -role PVEAuditor
```
### API Endpoints Used
- `/nodes/{node}/disks/zfs` - Lists ZFS pools
- `/nodes/{node}/disks/zfs/{pool}` - Gets detailed pool status
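
To see what these endpoints return for a given node, you can query them directly with `pvesh` on the Proxmox host. The sketch below is only a convenience wrapper: `zfs_api_path` is a hypothetical helper name (not part of Pulse), and `pve1` and `rpool` are placeholder values.

```bash
# Hypothetical helper (not part of Pulse): builds the API path for a node,
# optionally narrowed to a single pool.
zfs_api_path() {
  local node="$1" pool="${2:-}"
  if [ -n "$pool" ]; then
    printf '/nodes/%s/disks/zfs/%s\n' "$node" "$pool"
  else
    printf '/nodes/%s/disks/zfs\n' "$node"
  fi
}

# Example usage on a Proxmox VE host (placeholder node and pool names):
#   pvesh get "$(zfs_api_path pve1)" --output-format json
#   pvesh get "$(zfs_api_path pve1 rpool)" --output-format json
```

If these calls fail for the Pulse user but work as root, the cause is most likely the permission described above.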
## Configuration
ZFS monitoring is **enabled by default** in Pulse v4.15.0+.
### Disabling ZFS Monitoring
To disable ZFS monitoring, for example to reduce API load, set the following variable and restart Pulse:
```bash
# Add to /opt/pulse/.env or environment
PULSE_DISABLE_ZFS_MONITORING=true
```
## Alert Types
### Pool State Alerts
- **Warning**: Pool is DEGRADED
- **Critical**: Pool is FAULTED or UNAVAIL
### Error Alerts
- **Warning**: Any read/write/checksum errors detected
- Alerts include error counts and affected devices
### Device Alerts
- **Warning**: Device has errors but is ONLINE
- **Critical**: Device is FAULTED or UNAVAIL
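
The pool state and error rules above can be summarised as a small lookup. This is an illustrative sketch of the mapping as documented here, not Pulse's actual implementation. The function name `zfs_state_severity` and the `ok` result are assumptions made for the example, and states the lists above do not mention are treated like ONLINE.

```bash
# Illustrative only: maps a ZFS state plus a total error count to the alert
# level described above. Not taken from the Pulse source.
zfs_state_severity() {
  local state="$1" errors="${2:-0}"
  case "$state" in
    FAULTED|UNAVAIL) echo "critical" ;;
    DEGRADED)        echo "warning" ;;
    *)
      # Assumption: any other state only alerts when errors were counted.
      if [ "$errors" -gt 0 ]; then echo "warning"; else echo "ok"; fi
      ;;
  esac
}

# Example calls:
#   zfs_state_severity DEGRADED     # prints: warning
#   zfs_state_severity FAULTED      # prints: critical
#   zfs_state_severity ONLINE 12    # prints: warning
#   zfs_state_severity ONLINE 0     # prints: ok
```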
## Frontend Display
ZFS issues appear in the Storage tab:
- Yellow warning bar for degraded pools
- Red error counts for devices with issues
- Detailed device status for troubleshooting
## Performance Impact
- Adds 2 API calls per node with ZFS storage
- Typically adds <1 second to polling cycle
- Only queries nodes that have ZFS storage
## Troubleshooting
### No ZFS Data Appearing
1. Check permissions: `pveum user permissions pulse-monitor@pam`
2. Verify ZFS pools exist: `zpool list`
3. Check logs: `grep -i zfs /opt/pulse/pulse.log`. If you need more context, raise the log level to `debug` via **Settings → System → Logging**, then switch back to `info` when done.
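
For step 2, it can help to confirm not only that pools exist but also what health ZFS itself reports, so you know what Pulse should be showing. A minimal sketch, assuming it runs on the Proxmox node: `unhealthy_pools` is a hypothetical helper name, and it expects the tab-separated output of `zpool list -H -o name,health` on stdin.

```bash
# Hypothetical helper: reads "name<TAB>health" lines on stdin and prints
# every pool whose health is not ONLINE. Lines with fewer than two fields
# are ignored.
unhealthy_pools() {
  awk -F '\t' 'NF >= 2 && $2 != "ONLINE" { print $1 ": " $2 }'
}

# On the node (-H drops the header and separates fields with tabs):
#   zpool list -H -o name,health | unhealthy_pools
```

Empty output means every pool reports ONLINE, in which case Pulse is not expected to show a pool state alert.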
### Permission Denied Errors
Grant the required permission:
```bash
pveum acl modify /nodes -user pulse-monitor@pam -role PVEAuditor
```
### High API Load
Disable ZFS monitoring if not needed:
```bash
echo "PULSE_DISABLE_ZFS_MONITORING=true" >> /opt/pulse/.env
systemctl restart pulse
```
## Example Alert
```
Alert: ZFS pool 'rpool' is DEGRADED
Node: pve1
Pool: rpool
State: DEGRADED
Errors: 12 read, 0 write, 3 checksum
Device sdb2: DEGRADED with 12 read errors
```
An alert like this helps administrators identify a failing drive before it fails completely.