Pulse/internal/monitoring
rcourtman 7dd7a0b0f9 Fix node/host dropout issue caused by cluster health failures
Implemented comprehensive state preservation to prevent temporary dropouts:

1. Node Grace Period (60s):
   - Track last-online timestamp for each Proxmox node
   - Preserve online status during grace period to prevent flapping
   - Applied to all node status checks throughout codebase

2. Efficient Polling Preservation:
   - Detect when cluster/resources returns empty arrays
   - Preserve the previous VMs/containers if resources were present before
   - Handles cluster health check failures gracefully

3. Traditional Polling Preservation:
   - Updated preservation logic for per-node VM/container polling
   - Triggers whenever zero resources are returned, regardless of whether the node responded
   - Fixed issue where nodes responding with empty data bypassed preservation

Root cause: Intermittent Proxmox cluster health failures ("no healthy nodes
available") caused both efficient and traditional polling to return empty
arrays, immediately clearing all VMs/containers from state.
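
The preservation rule that guards against this can be sketched as below. It is an illustration under stated assumptions, not the logic in monitor.go or monitor_polling.go: the `Guest` type and the `preserveIfEmpty` function are hypothetical names, and the real code may apply further conditions that the commit message does not describe.

```go
package main

import "fmt"

// Guest is a hypothetical stand-in for a VM or container record.
type Guest struct {
	ID   int
	Name string
}

// preserveIfEmpty returns the guests to store after a poll. If the poll
// came back empty but the previous state held guests, the previous state
// is kept and the second return value is true.
func preserveIfEmpty(prev, polled []Guest) ([]Guest, bool) {
	if len(polled) == 0 && len(prev) > 0 {
		return prev, true
	}
	return polled, false
}

func main() {
	prev := []Guest{{ID: 100, Name: "web"}, {ID: 101, Name: "db"}}

	// Poll returned nothing (for example during a cluster health failure).
	kept, preserved := preserveIfEmpty(prev, nil)
	fmt.Println(len(kept), preserved) // 2 true

	// Poll returned data, so it replaces the previous state.
	kept, preserved = preserveIfEmpty(prev, []Guest{{ID: 100, Name: "web"}})
	fmt.Println(len(kept), preserved) // 1 false
}
```

The check keys only on the result being empty, matching the note in item 3 that a node responding with empty data should not bypass preservation. This sketch has no expiry, so a node that genuinely loses all of its guests would keep showing stale entries; whether and how the real code bounds that is not stated in the commit message.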

Changes:
- internal/monitoring/monitor.go: Added node grace period, efficient polling preservation
- internal/monitoring/monitor_polling.go: Fixed traditional polling preservation logic

Fixes frequent UI flickering where vmCount/containerCount would briefly drop to zero.
2025-11-05 17:01:20 +00:00
backoff.go feat: implement error handling with circuit breakers and backoff (Phase 2 Task 7) 2025-10-20 15:13:37 +00:00
backoff_test.go test: add comprehensive unit tests for backoff and circuit breaker (Phase 2 Task 9a) 2025-10-20 15:13:38 +00:00
backup_guard_test.go Fix settings security tab navigation 2025-10-11 23:29:47 +00:00
ceph.go Fix settings security tab navigation 2025-10-11 23:29:47 +00:00
circuit_breaker.go feat: enhance scheduler health API with rich instance metadata 2025-10-20 15:13:38 +00:00
circuit_breaker_test.go test: add comprehensive unit tests for backoff and circuit breaker (Phase 2 Task 9a) 2025-10-20 15:13:38 +00:00
container_disk_usage.go feat: add professional logging with runtime configuration and performance optimization 2025-10-20 15:13:38 +00:00
diagnostic_snapshots.go Refine Proxmox node memory fallback (#582) 2025-10-22 15:36:26 +00:00
docker_commands.go feat: add docker agent command handling 2025-10-15 19:27:19 +00:00
docker_commands_test.go chore: snapshot current changes 2025-11-02 22:47:55 +00:00
fake_executor_integration.go test: add comprehensive integration test harness for adaptive polling (Phase 2 Task 9c) 2025-10-20 15:13:38 +00:00
fs_filters.go Ignore read-only guest filesystems in disk aggregation 2025-10-14 16:13:53 +00:00
fs_filters_test.go Ignore read-only guest filesystems in disk aggregation 2025-10-14 16:13:53 +00:00
harness_integration.go Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
helpers_test.go Expand monitoring and discovery test coverage 2025-10-16 08:17:08 +00:00
integration_integration_test.go test: add soak test with runtime instrumentation (Phase 2 Task 9d) 2025-10-20 15:13:38 +00:00
main_test.go Harden setup token flow and enforce encrypted persistence 2025-10-25 16:00:37 +00:00
metrics.go perf: reduce polling allocations and guest metadata load 2025-10-25 13:12:47 +00:00
metrics_history.go Fix settings security tab navigation 2025-10-11 23:29:47 +00:00
metrics_history_concurrency_test.go Fix settings security tab navigation 2025-10-11 23:29:47 +00:00
monitor.go Fix node/host dropout issue caused by cluster health failures 2025-11-05 17:01:20 +00:00
monitor_docker_test.go Refactor: Code cleanup and localStorage consolidation 2025-11-04 21:50:46 +00:00
monitor_health_test.go feat: enhance scheduler health API with rich instance metadata 2025-10-20 15:13:38 +00:00
monitor_host_agents_test.go perf: reduce polling allocations and guest metadata load 2025-10-25 13:12:47 +00:00
monitor_memory_test.go Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
monitor_pmg_test.go fix: use proper Monitor constructor in PMG tests to initialize all maps 2025-10-20 15:22:23 +00:00
monitor_polling.go Fix node/host dropout issue caused by cluster health failures 2025-11-05 17:01:20 +00:00
monitor_snapshots_test.go Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
monitor_storage_test.go Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
monitor_temperature_toggle_test.go Fix CSRF token validation and improve token management 2025-11-05 09:23:44 +00:00
poller.go feat: add professional logging with runtime configuration and performance optimization 2025-10-20 15:13:38 +00:00
ratetracker.go Fix settings security tab navigation 2025-10-11 23:29:47 +00:00
ratetracker_concurrency_test.go Fix settings security tab navigation 2025-10-11 23:29:47 +00:00
reload.go Propagate config updates to settings nodes (#588) 2025-10-22 13:45:13 +00:00
scheduler.go feat: enhance scheduler health API with rich instance metadata 2025-10-20 15:13:38 +00:00
staleness_tracker.go release: prepare v4.25.0 2025-10-22 10:46:18 +00:00
staleness_tracker_test.go test: add comprehensive staleness tracker unit tests (Phase 2 Task 9b) 2025-10-20 15:13:38 +00:00
task_queue.go perf: reduce polling allocations and guest metadata load 2025-10-25 13:12:47 +00:00
temperature.go Enhance container detection for temperature SSH safeguards (refs #601) 2025-11-04 22:30:35 +00:00
temperature_service.go Fix CSRF token validation and improve token management 2025-11-05 09:23:44 +00:00
temperature_test.go Stop legacy temperature SSH retries when auth fails (#595) 2025-10-22 19:35:51 +00:00