# Sensor Proxy Configuration Management
This guide covers safe configuration management for pulse-sensor-proxy, including the CLI tools introduced in v4.31.1 to prevent config corruption.
## Overview
Starting with v4.31.1, pulse-sensor-proxy uses a two-file configuration system:
1. **Main config:** `/etc/pulse-sensor-proxy/config.yaml` - Contains all settings except allowed nodes
2. **Allowed nodes:** `/etc/pulse-sensor-proxy/allowed_nodes.yaml` - Separate file for the authorized node list
This separation prevents corruption from concurrent updates by the installer, control-plane sync, and self-heal timer.
## Architecture
### Why Two Files?
Earlier versions stored `allowed_nodes:` inline in `config.yaml`, causing corruption when:
- The installer updated node lists
- The self-heal timer ran (every 5 minutes)
- Control-plane sync modified the list
- Version detection hit an edge case
Multiple code paths (shell, Python, Go) would race to update the same YAML file, creating duplicate `allowed_nodes:` keys that broke YAML parsing.
### New System (v4.31.1+)
**Phase 1 (Migration):**
- Force file-based mode exclusively
- Installer migrates inline blocks to `allowed_nodes.yaml`
- Self-heal timer includes corruption detection and repair
**Phase 2 (Atomic Operations):**
- Go CLI replaces all shell/Python config manipulation
- File locking prevents concurrent writes
- Atomic writes (temp file + rename) ensure consistency
- systemd validation prevents startup with corrupt config
## Configuration CLI Reference
### Validate Configuration
Check config files for errors before restarting the service:
```bash
# Validate both config.yaml and allowed_nodes.yaml
pulse-sensor-proxy config validate
# Validate specific config file
pulse-sensor-proxy config validate --config /path/to/config.yaml
# Validate specific allowed_nodes file
pulse-sensor-proxy config validate --allowed-nodes /path/to/allowed_nodes.yaml
```
**Exit codes:**
- 0 = valid
- Non-zero = validation failed (check stderr for details)
**Common validation errors:**
- "duplicate allowed_nodes blocks" - Run migration (see below)
- "failed to parse YAML" - Syntax error in config file
- "read_timeout must be positive" - Invalid timeout value
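The exit code and error messages above lend themselves to a small wrapper. The following is a minimal sketch, not part of the shipped tooling: `hint_for_error` and `validate_with_hint` are hypothetical helper names, and the hints simply restate the fixes listed above. The validator command is passed in as arguments, so the wrapper makes no assumptions about extra CLI flags.

```bash
#!/usr/bin/env bash
set -euo pipefail

# hint_for_error MESSAGE
# Maps the validation messages listed above to a suggested next step.
hint_for_error() {
    case "$1" in
        *"duplicate allowed_nodes blocks"*) echo "Run the migration to move inline allowed_nodes into allowed_nodes.yaml" ;;
        *"failed to parse YAML"*)           echo "Fix the YAML syntax error in the reported file" ;;
        *"read_timeout must be positive"*)  echo "Set read_timeout to a positive value" ;;
        *)                                  echo "No known hint; read the full error output" ;;
    esac
}

# validate_with_hint COMMAND [ARGS...]
# Runs the given validator, passes its exit code through, and prints a hint on failure.
validate_with_hint() {
    local output status=0
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s\n' "$output" >&2
        printf 'Hint: %s\n' "$(hint_for_error "$output")" >&2
    fi
    return "$status"
}

# Usage (not run here):
#   validate_with_hint pulse-sensor-proxy config validate \
#       && sudo systemctl restart pulse-sensor-proxy
```

Because the wrapper returns the validator's own exit code, it can be dropped into existing `&&` chains without changing their behaviour.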
### Manage Allowed Nodes
The CLI provides two modes:
**Merge mode (default):** Adds nodes to existing list
```bash
# Add single node
pulse-sensor-proxy config set-allowed-nodes --merge 192.168.0.10
# Add multiple nodes
pulse-sensor-proxy config set-allowed-nodes \
--merge 192.168.0.1 \
--merge 192.168.0.2 \
--merge node1.local
```
**Replace mode:** Overwrites entire list
```bash
# Replace with new list
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 \
--merge 192.168.0.2
# Clear the list (empty is valid for IPC-only clusters)
pulse-sensor-proxy config set-allowed-nodes --replace
```
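The difference between the two modes can be modelled as plain set operations. The sketch below is only an illustration of the semantics, assuming a simplified one-entry-per-line list file rather than the real YAML file; `normalize`, `merge_nodes`, and `replace_nodes` are hypothetical names, not CLI subcommands.

```bash
#!/usr/bin/env bash
set -euo pipefail

# normalize: drop blank lines, then sort and deduplicate the list read from stdin
normalize() { sed '/^[[:space:]]*$/d' | LC_ALL=C sort -u; }

# merge_nodes LIST_FILE NODE...
# Merge semantics: existing entries plus the new ones.
merge_nodes() {
    local list=$1; shift
    {
        if [ -f "$list" ]; then cat "$list"; fi
        printf '%s\n' "$@"
    } | normalize
}

# replace_nodes NODE...
# Replace semantics: only the nodes given; no nodes means an empty list.
replace_nodes() { printf '%s\n' "$@" | normalize; }
```

Merging a node that is already present leaves the list unchanged, while replacing with no nodes yields an empty list, matching the "clear the list" example above.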
**Custom paths:**
```bash
# Use non-default path
pulse-sensor-proxy config set-allowed-nodes \
--allowed-nodes /custom/path.yaml \
--merge 192.168.0.10
```
### How It Works
1. **File locking:** Uses `flock(LOCK_EX)` on separate `.lock` file
2. **Atomic writes:** Writes to temp file, syncs, then renames
3. **Deduplication:** Automatically removes duplicate entries
4. **Normalization:** Trims whitespace, sorts entries
5. **Empty lists allowed:** Useful for security lockdown or IPC-based discovery
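The five steps above can be sketched in shell. This is an illustration of the lock, normalize, temp file, rename sequence, not the Go implementation: `write_nodes_atomically` is a hypothetical function, the YAML layout it writes (an `allowed_nodes:` key followed by `- entry` items) is an assumption, and it relies on `flock(1)` from util-linux being installed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# write_nodes_atomically FILE [NODE...]
# Illustrative only: the real logic lives in the Go CLI.
write_nodes_atomically() {
    local target=$1; shift
    local lock="${target}.lock" tmp

    (
        # 1. Exclusive lock on a separate .lock file (fd 9).
        #    The lock is released when this subshell exits.
        flock -x 9

        # 2. Temp file in the same directory, so the rename stays on one filesystem.
        tmp=$(mktemp "${target}.tmp.XXXXXX")
        trap 'rm -f "$tmp"' EXIT   # clean up if anything below fails

        {
            echo "allowed_nodes:"   # assumed file layout
            # 3 + 4. Trim whitespace, drop empty entries, sort, deduplicate.
            printf '%s\n' "$@" \
                | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
                | LC_ALL=C sort -u \
                | sed 's/^/  - /'
        } > "$tmp"

        sync "$tmp" 2>/dev/null || sync   # flush data before the rename
        mv -f "$tmp" "$target"            # atomic replace within one filesystem
    ) 9>"$lock"
}

# Usage (not run here):
#   write_nodes_atomically /tmp/allowed_nodes.yaml 192.168.0.2 192.168.0.1 192.168.0.2
```

Calling the function with no nodes writes a file containing only the key, which corresponds to the empty list described in step 5.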
## Common Tasks
### Adding Nodes After Cluster Expansion
When you add a new node to your Proxmox cluster:
```bash
# Add the new node to allowed list
pulse-sensor-proxy config set-allowed-nodes --merge new-node.local
# Validate config
pulse-sensor-proxy config validate
# Restart proxy to apply
sudo systemctl restart pulse-sensor-proxy
# Verify in Pulse UI
# Check Settings → Diagnostics → Temperature Proxy
```
### Removing Decommissioned Nodes
When removing a node from your cluster:
```bash
# Get current list
cat /etc/pulse-sensor-proxy/allowed_nodes.yaml
# Replace with updated list (without old node)
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 \
--merge 192.168.0.2
# (omit the decommissioned node)
# Validate and restart
pulse-sensor-proxy config validate
sudo systemctl restart pulse-sensor-proxy
```
**Note:** The proxy cleanup system automatically removes SSH keys from deleted nodes. See temperature monitoring docs for details.
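Retyping the remaining nodes by hand is error-prone on larger clusters. The sketch below derives the replace command from the current file instead. It assumes that `allowed_nodes.yaml` holds one `- entry` list item per line; `list_nodes` and `replace_command_without` are hypothetical helpers. The command is printed rather than executed so it can be reviewed first.

```bash
#!/usr/bin/env bash
set -euo pipefail

# list_nodes FILE
# Prints the entries of a simple block-style YAML list, one per line.
# Assumes each entry sits on its own "- value" line with no trailing comment.
list_nodes() {
    sed -n 's/^[[:space:]]*-[[:space:]]*//p' "$1" \
        | sed 's/[[:space:]]*$//' \
        | tr -d "\"'"
}

# replace_command_without FILE NODE
# Prints (does not run) the replace command that keeps every entry except NODE.
replace_command_without() {
    local file=$1 drop=$2 node
    local -a cmd=(pulse-sensor-proxy config set-allowed-nodes --replace)
    while IFS= read -r node; do
        [ "$node" = "$drop" ] || cmd+=(--merge "$node")
    done < <(list_nodes "$file")
    printf '%s\n' "${cmd[*]}"
}

# Usage (not run here):
#   replace_command_without /etc/pulse-sensor-proxy/allowed_nodes.yaml old-node.local
```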
### Migrating from Inline Config
If you're running an older version with inline `allowed_nodes:` in config.yaml:
```bash
# Upgrade to latest version (auto-migrates)
curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \
sudo bash -s -- --standalone --pulse-server http://your-pulse:7655
# Verify migration
pulse-sensor-proxy config validate
# Check that allowed_nodes only appears in allowed_nodes.yaml
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml
# Should show a single match, e.g.: /etc/pulse-sensor-proxy/allowed_nodes.yaml:3:allowed_nodes:
# Should NOT show any match in config.yaml
```
### Changing Other Config Settings
For settings in `config.yaml` (not allowed_nodes):
```bash
# Stop the service first
sudo systemctl stop pulse-sensor-proxy
# Edit config.yaml manually
sudo nano /etc/pulse-sensor-proxy/config.yaml
# Validate before starting
pulse-sensor-proxy config validate
# Start service
sudo systemctl start pulse-sensor-proxy
# Check for errors
sudo systemctl status pulse-sensor-proxy
journalctl -u pulse-sensor-proxy -n 50
```
**Safe to edit in config.yaml:**
- `allowed_source_subnets`
- `allowed_peers` (UID/GID permissions)
- `rate_limit` settings
- `metrics_address`
- `http_*` settings (HTTPS mode)
- `pulse_control_plane` block
**Never edit manually:**
- `allowed_nodes:` (use the CLI instead; since v4.31.1 the list lives in `allowed_nodes.yaml`, not `config.yaml`)
- Lock files (`.lock`)
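For manual edits to `config.yaml`, the edit and validate steps can be wrapped so that a failed validation restores the previous file. This is a sketch under stated assumptions: `edit_with_rollback` is a hypothetical helper, the rollback step is an addition that this guide does not otherwise prescribe, and the validator is passed in as a command. Stopping and starting the service stays outside the function.

```bash
#!/usr/bin/env bash
set -euo pipefail

# edit_with_rollback CONFIG VALIDATOR [ARGS...]
# Backs up CONFIG, opens it in $EDITOR (nano by default), runs the validator,
# and restores the backup if validation fails.
edit_with_rollback() {
    local config=$1; shift
    local backup="${config}.backup"

    cp -p "$config" "$backup"        # keep mode and timestamps on the backup
    "${EDITOR:-nano}" "$config"

    if "$@"; then
        echo "Validation passed; backup kept at $backup"
    else
        echo "Validation failed; restoring $backup" >&2
        cp -p "$backup" "$config"
        return 1
    fi
}

# Usage (not run here), with the service stopped first:
#   sudo systemctl stop pulse-sensor-proxy
#   edit_with_rollback /etc/pulse-sensor-proxy/config.yaml pulse-sensor-proxy config validate
#   sudo systemctl start pulse-sensor-proxy
```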
## Troubleshooting
### Config Validation Fails
**Symptom:** `pulse-sensor-proxy config validate` returns error
**Diagnosis:**
```bash
# Run validation with full output
pulse-sensor-proxy config validate 2>&1
# Check for duplicate blocks
grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml
# Check YAML syntax
python3 -c "import yaml; yaml.safe_load(open('/etc/pulse-sensor-proxy/config.yaml'))"
```
**Common fixes:**
- Duplicate blocks: Run migration (upgrade to v4.31.1+)
- YAML syntax errors: Fix indentation, remove tabs, check colons
- Missing required fields: Add `read_timeout`, `write_timeout`
### Service Won't Start After Config Change
**Diagnosis:**
```bash
# Check systemd logs
journalctl -u pulse-sensor-proxy -n 100
# Look for validation errors
journalctl -u pulse-sensor-proxy | grep -i "validation\|corrupt\|duplicate"
# Try starting in foreground for better errors
sudo -u pulse-sensor-proxy /opt/pulse/sensor-proxy/bin/pulse-sensor-proxy # legacy installs: /usr/local/bin/pulse-sensor-proxy
```
**Fix:**
```bash
# Validate config first
pulse-sensor-proxy config validate
# If validation passes but service fails, check permissions
ls -la /etc/pulse-sensor-proxy/
ls -la /var/lib/pulse-sensor-proxy/
# Ensure proxy user owns files
sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/
sudo chown -R pulse-sensor-proxy:pulse-sensor-proxy /var/lib/pulse-sensor-proxy/
```
### Lock File Errors
**Symptom:** `failed to acquire file lock` or `failed to open lock file`
**Cause:** The lock file has the wrong permissions, or another process is still holding the lock
**Fix:**
```bash
# Check lock file permissions (should be 0600)
ls -la /etc/pulse-sensor-proxy/*.lock
# Fix permissions
sudo chmod 0600 /etc/pulse-sensor-proxy/*.lock
sudo chown pulse-sensor-proxy:pulse-sensor-proxy /etc/pulse-sensor-proxy/*.lock
# If stale lock, identify holder
sudo lsof /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock
# Kill stale process if needed (use with caution)
sudo kill <PID>
```
**Prevention:** Locks are automatically released when process exits. Don't manually delete lock files.
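To check whether a lock is currently held without touching its holder, a non-blocking `flock` attempt is enough. The sketch below assumes `flock(1)` from util-linux; `lock_is_held` is a hypothetical helper. It also shows why deleting lock files is unnecessary: the lock disappears as soon as the holding file descriptor is closed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# lock_is_held LOCKFILE
# Returns 0 if another open file description holds an flock on LOCKFILE.
# Caveat: a lock file that cannot be opened (e.g. wrong permissions) is also
# reported as held, so check permissions first.
lock_is_held() {
    local lock=$1
    [ -e "$lock" ] || return 1       # no lock file, so nothing can hold it
    if flock -n -x "$lock" true; then
        return 1                     # acquired immediately: nobody held it
    fi
    return 0
}

# Usage (not run here):
#   if lock_is_held /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock; then
#       sudo lsof /etc/pulse-sensor-proxy/allowed_nodes.yaml.lock
#   fi
```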
### Allowed Nodes List is Empty
**Symptom:** allowed_nodes.yaml exists but has no entries
**Is this a problem?** Not necessarily:
- Empty list is valid for clusters using IPC discovery (pvecm status)
- Control-plane mode populates the list automatically
- Standalone nodes require manual node entries
**To populate manually:**
```bash
# Add your cluster nodes
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge 192.168.0.1 \
--merge 192.168.0.2 \
--merge 192.168.0.3
# Verify
cat /etc/pulse-sensor-proxy/allowed_nodes.yaml
```
## Best Practices
### General Guidelines
1. **Always validate before restarting:**
   ```bash
   pulse-sensor-proxy config validate && sudo systemctl restart pulse-sensor-proxy
   ```
2. **Use the CLI for allowed_nodes changes:**
   - Don't edit `allowed_nodes.yaml` manually
   - Use `config set-allowed-nodes` instead
3. **Stop service before editing config.yaml:**
   - Prevents race conditions with running process
   - systemd validation will catch errors on startup
4. **Back up config before major changes:**
   ```bash
   sudo cp /etc/pulse-sensor-proxy/config.yaml /etc/pulse-sensor-proxy/config.yaml.backup
   sudo cp /etc/pulse-sensor-proxy/allowed_nodes.yaml /etc/pulse-sensor-proxy/allowed_nodes.yaml.backup
   ```
5. **Monitor after changes:**
   ```bash
   journalctl -u pulse-sensor-proxy -f
   # Check Pulse UI: Settings → Diagnostics → Temperature Proxy
   ```
### Automation Scripts
When scripting config changes:
```bash
#!/bin/bash
set -euo pipefail

# Replace the allowed-nodes list, then validate and restart the proxy
update_allowed_nodes() {
    # Build the command as an array so node names are passed through verbatim
    # (no eval, no word splitting)
    local -a cmd=(pulse-sensor-proxy config set-allowed-nodes --replace)
    local node
    for node in "$@"; do
        cmd+=(--merge "$node")
    done

    # Apply the update
    if "${cmd[@]}"; then
        echo "Allowed nodes updated successfully"
    else
        echo "Failed to update allowed nodes" >&2
        return 1
    fi

    # Validate
    if ! pulse-sensor-proxy config validate; then
        echo "Config validation failed after update" >&2
        return 1
    fi

    # Restart service
    if sudo systemctl restart pulse-sensor-proxy; then
        echo "Service restarted successfully"
    else
        echo "Service restart failed" >&2
        return 1
    fi

    # Wait for service to be active
    sleep 2
    if systemctl is-active --quiet pulse-sensor-proxy; then
        echo "Service is running"
    else
        echo "Service failed to start" >&2
        journalctl -u pulse-sensor-proxy -n 20
        return 1
    fi
}

# Example usage
update_allowed_nodes "192.168.0.1" "192.168.0.2" "node3.local"
```
### Monitoring Config Health
Add to your monitoring system:
```bash
# Check for config corruption (should return 0)
pulse-sensor-proxy config validate
echo $?
# Check for inline allowed_nodes blocks in config.yaml (should print 0)
grep "allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml | wc -l
# Check lock file permissions (should be 0600)
stat -c "%a" /etc/pulse-sensor-proxy/*.lock
# Check service is running
systemctl is-active pulse-sensor-proxy
```
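For a monitoring agent that expects a single exit status, two of the file-level checks above can be wrapped as functions. This is a sketch: `count_inline_blocks` and `lock_modes_ok` are hypothetical helpers, and `stat -c` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
set -euo pipefail

# count_inline_blocks CONFIG
# Prints the number of top-level allowed_nodes: keys (0 is expected after migration).
count_inline_blocks() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    grep -c '^allowed_nodes:' "$1" || true   # grep exits 1 when the count is 0
}

# lock_modes_ok DIR
# Succeeds only if every *.lock file in DIR has mode 600.
lock_modes_ok() {
    local f
    for f in "$1"/*.lock; do
        [ -e "$f" ] || continue              # the glob matched nothing
        if [ "$(stat -c '%a' "$f")" != "600" ]; then
            echo "unexpected mode on $f" >&2
            return 1
        fi
    done
}

# Usage (not run here):
#   [ "$(count_inline_blocks /etc/pulse-sensor-proxy/config.yaml)" -eq 0 ] \
#       && lock_modes_ok /etc/pulse-sensor-proxy \
#       && pulse-sensor-proxy config validate \
#       && systemctl is-active --quiet pulse-sensor-proxy
```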
## Migration Path
### Upgrading from Pre-v4.31.1
**Automatic migration** (recommended):
```bash
# Simply reinstall - migration runs automatically
curl -fsSL https://raw.githubusercontent.com/rcourtman/Pulse/main/scripts/install-sensor-proxy.sh | \
sudo bash -s -- --standalone --pulse-server http://your-pulse:7655
# Verify
pulse-sensor-proxy config validate
sudo systemctl status pulse-sensor-proxy
```
**Manual migration** (if needed):
```bash
# 1. Stop service
sudo systemctl stop pulse-sensor-proxy
# 2. Extract allowed_nodes from config.yaml
grep -A 100 "^allowed_nodes:" /etc/pulse-sensor-proxy/config.yaml > /tmp/nodes.txt
# 3. Parse and add to allowed_nodes.yaml
# (Example for simple list - adjust for your format)
pulse-sensor-proxy config set-allowed-nodes --replace \
--merge node1.local \
--merge node2.local
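# 4. Remove allowed_nodes from config.yaml
# Edit manually, or filter the block out with awk (a backup copy is kept).
# The filter drops every inline allowed_nodes block but keeps the key that follows it.
sudo cp -p /etc/pulse-sensor-proxy/config.yaml /etc/pulse-sensor-proxy/config.yaml.pre-migration
sudo awk '/^allowed_nodes:/ {skip=1; next} /^[a-z_]/ {skip=0} !skip' \
/etc/pulse-sensor-proxy/config.yaml.pre-migration | \
sudo tee /etc/pulse-sensor-proxy/config.yaml > /dev/null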
# 5. Add reference to allowed_nodes.yaml
echo "allowed_nodes_file: /etc/pulse-sensor-proxy/allowed_nodes.yaml" | \
sudo tee -a /etc/pulse-sensor-proxy/config.yaml
# 6. Validate
pulse-sensor-proxy config validate
# 7. Start service
sudo systemctl start pulse-sensor-proxy
```
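Step 3 of the manual migration leaves the parsing to the reader. A minimal sketch of that step follows, assuming the inline block uses one `- entry` item per line (flow-style lists such as `[a, b]` are not handled); `extract_inline_nodes` is a hypothetical helper. It also copes with duplicate `allowed_nodes:` blocks by collecting entries from all of them.

```bash
#!/usr/bin/env bash
set -euo pipefail

# extract_inline_nodes CONFIG
# Prints the unique entries found in all inline allowed_nodes: blocks.
extract_inline_nodes() {
    awk '
        /^allowed_nodes:/ { in_block = 1; next }   # start of an inline block
        /^[^ \t#-]/       { in_block = 0 }         # any other top-level key ends it
        in_block && /^[ \t]*-[ \t]*/ {
            sub(/^[ \t]*-[ \t]*/, "")              # strip the list marker
            sub(/[ \t]+#.*$/, "")                  # strip trailing comments
            sub(/[ \t]+$/, "")                     # strip trailing whitespace
            if ($0 != "") print
        }
    ' "$1" | tr -d "\"'" | LC_ALL=C sort -u
}

# Usage (not run here):
#   mapfile -t nodes < <(extract_inline_nodes /etc/pulse-sensor-proxy/config.yaml)
#   args=(); for n in "${nodes[@]}"; do args+=(--merge "$n"); done
#   pulse-sensor-proxy config set-allowed-nodes --replace "${args[@]}"
```

Review the extracted list before passing it to the CLI, since a corrupted file may contain entries that should no longer be allowed.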
## Related Documentation
- [Temperature Monitoring](../TEMPERATURE_MONITORING.md) - Setup and troubleshooting
- [Sensor Proxy README](/opt/pulse/cmd/pulse-sensor-proxy/README.md) - Complete CLI reference
- [Audit Log Rotation](audit-log-rotation.md) - Managing append-only logs
- [Temperature Monitoring Security](../TEMPERATURE_MONITORING_SECURITY.md) - Security architecture
## Support
If config management issues persist after following this guide:
1. Collect diagnostics:
   ```bash
   pulse-sensor-proxy config validate > /tmp/validate.log 2>&1
   sudo systemctl status pulse-sensor-proxy > /tmp/status.log
   journalctl -u pulse-sensor-proxy -n 200 > /tmp/journal.log
   grep -n "allowed_nodes:" /etc/pulse-sensor-proxy/*.yaml > /tmp/grep.log
   ```
2. File an issue at https://github.com/rcourtman/Pulse/issues
3. Include:
   - Pulse version
   - Sensor proxy version (`pulse-sensor-proxy --version`)
   - Output from diagnostic commands above
   - Steps that led to the issue
- Steps that led to the issue