This commit addresses four P1 (important) issues and one P2 optimization in infrastructure components:
**P1-1: Missing Panic Recovery in Discovery Service** (service.go:172-195, 499-542)
- **Problem**: No panic recovery in Start(), ForceRefresh(), SetSubnet() goroutines
- **Impact**: Silent service death if scan panics, broken discovery with no monitoring
- **Fix**:
- Wrapped initial scan goroutine with defer/recover (lines 172-182)
- Wrapped scanLoop goroutine with defer/recover (lines 185-195)
- Wrapped ForceRefresh scan with defer/recover (lines 499-509)
- Wrapped SetSubnet scan with defer/recover (lines 532-542)
- All recovery handlers log the panic with a stack trace for debugging
**P1-2: Missing Panic Recovery in Config Watcher Callback** (watcher.go:546-556)
- **Problem**: User-provided onMockReload callback could panic and crash watcher
- **Impact**: Panicking callback kills watcher goroutine, no config updates
- **Fix**: Wrapped callback invocation with defer/recover and stack trace logging
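The same pattern applied to a user-supplied callback looks roughly like this (hypothetical helper name; watcher.go inlines the equivalent around onMockReload):

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// safeInvoke runs a user-supplied callback and recovers any panic so
// the watcher loop that calls it keeps running.
func safeInvoke(cb func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("config reload callback panicked: %v\n%s", r, debug.Stack())
		}
	}()
	cb()
}

func main() {
	for i := 0; i < 3; i++ { // stand-in for the watcher's event loop
		safeInvoke(func() { panic("buggy user callback") })
	}
	fmt.Println("watcher survived all callback panics")
}
```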
**P1-3: Session Store Stop() Using Send Instead of Close** (session_store.go:16-84)
- **Problem**: Stop() used channel send which blocks if nobody reads
- **Impact**: Stop() hangs if backgroundWorker already exited
- **Fix**:
- Added sync.Once field stopOnce (line 22)
- Changed Stop() to use close() within stopOnce.Do() (lines 80-84)
- Prevents double-close panic and ensures all readers are signaled
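The close-within-Once pattern can be sketched as follows; field names are illustrative, not session_store.go's:

```go
package main

import (
	"fmt"
	"sync"
)

// Store mirrors the fix: Stop closes the channel (close never blocks
// and wakes every reader, unlike a send) inside a sync.Once (which
// makes repeated Stop calls safe against double-close panics).
type Store struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

func NewStore() *Store { return &Store{stopCh: make(chan struct{})} }

func (s *Store) Stop() {
	s.stopOnce.Do(func() { close(s.stopCh) })
}

func main() {
	s := NewStore()
	s.Stop()
	s.Stop()   // safe: no double-close panic, no blocking send
	<-s.stopCh // receivers are woken even if the worker already exited
	fmt.Println("stopped cleanly")
}
```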
**P2-1: Backup Cleanup Inefficient O(n²) Sort** (persistence.go:1424-1427)
- **Problem**: Bubble sort used to sort backups by modification time
- **Impact**: Inefficient for large backup counts (>100 files)
- **Fix**:
- Replaced bubble sort with sort.Slice(), which runs in O(n log n)
- Added "sort" import (line 9)
- Maintains same oldest-first ordering for deletion logic
All fixes add defensive programming without changing external behavior. Panic recovery ensures services continue operating even when bugs occur, while the sort optimization reduces cleanup time for backup-heavy environments.
This commit addresses 3 critical P0 race conditions and resource leaks in core infrastructure:
**P0-1: Discovery Service Goroutine Leak** (service.go:468, 488)
- **Problem**: ForceRefresh() and SetSubnet() spawned unbounded goroutines without checking if scan already in progress
- **Impact**: Rapid API calls create a goroutine explosion and resource exhaustion
- **Fix**:
- ForceRefresh: Check isScanning before spawning goroutine (lines 470-476)
- SetSubnet: Check isScanning, defer scan if already running (lines 491-504)
- Both now log when skipping to aid debugging
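The isScanning guard can be sketched with an atomic flag so exactly one caller wins the race to start a scan; field and method names here are assumptions:

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// scanner sketches the guard: at most one scan goroutine runs at a
// time, so rapid ForceRefresh/SetSubnet calls cannot pile up
// goroutines.
type scanner struct {
	isScanning atomic.Bool // requires Go 1.19+
}

// tryStartScan reports whether it actually spawned a scan.
func (s *scanner) tryStartScan(scan func()) bool {
	// CompareAndSwap succeeds for exactly one concurrent caller.
	if !s.isScanning.CompareAndSwap(false, true) {
		log.Println("scan already in progress; skipping")
		return false
	}
	go func() {
		defer s.isScanning.Store(false)
		scan()
	}()
	return true
}

func main() {
	s := &scanner{}
	started := make(chan struct{})
	release := make(chan struct{})
	first := s.tryStartScan(func() { close(started); <-release })
	<-started
	second := s.tryStartScan(func() {}) // skipped: scan in flight
	close(release)
	fmt.Println(first, second) // prints "true false"
}
```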
**P0-2: Config Persistence Unlock/Relock Race** (persistence.go:1177-1206)
- **Problem**: LoadNodesConfig() unlocked RLock, called SaveNodesConfig (acquires Lock), then relocked
- **Impact**: Another goroutine could modify config between unlock/relock, causing migrated data loss
- **Fix**:
- Copy instance slices while holding RLock to ensure consistency (lines 1189-1194)
- Release lock, save copies, then return without relocking (lines 1196-1205)
- Prevents TOCTOU vulnerability where migrations could be overwritten
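The copy-then-save pattern can be sketched as below: snapshot the data while holding the read lock, release it, and persist the snapshot. Because the lock is never re-acquired, no writer can slip in between an unlock and a relock and have its changes clobbered. Names are illustrative, not persistence.go's:

```go
package main

import (
	"fmt"
	"sync"
)

// nodesConfig stands in for the guarded configuration state.
type nodesConfig struct {
	mu        sync.RWMutex
	instances []string
}

// saveMigrated copies the slice under RLock, then saves lock-free.
func (c *nodesConfig) saveMigrated(save func([]string) error) error {
	c.mu.RLock()
	snapshot := make([]string, len(c.instances))
	copy(snapshot, c.instances) // consistent view taken under the lock
	c.mu.RUnlock()
	return save(snapshot) // operates on the private copy only
}

func main() {
	c := &nodesConfig{instances: []string{"pve1", "pve2"}}
	_ = c.saveMigrated(func(nodes []string) error {
		fmt.Println("saving", nodes)
		return nil
	})
}
```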
**P0-3: Config Watcher Channel Close Race** (watcher.go:19-178)
- **Problem**: Stop() used select-check-close pattern vulnerable to concurrent calls
- **Impact**: Multiple Stop() calls panic on double-close
- **Fix**:
- Added sync.Once field stopOnce to ConfigWatcher struct (line 26)
- Changed Stop() to use stopOnce.Do() ensuring single execution (lines 175-178)
- Removed racy select-based guard
All fixes maintain backwards compatibility and add defensive logging for operational visibility.
Significantly enhanced the network discovery feature to eliminate false positives,
provide real-time progress updates, and improve error reporting.
Key improvements:
- Require positive Proxmox identification (version data, auth headers, or certificates)
instead of reporting any service on ports 8006/8007
- Add real-time progress tracking with phase/target counts and completion percentage
- Implement structured error reporting with IP, phase, type, and timestamp details
- Fix TLS timeout handling to prevent hangs on unresponsive hosts
- Expose progress and structured errors via WebSocket for UI consumption
- Reduce log verbosity by moving discovery logs to debug level
- Fix duplicate IP counting to ensure progress reaches 100%
Breaking changes: None (backward compatible with legacy API methods)
Improves configuration handling and system settings APIs to support
v4.24.0 features including runtime logging controls, adaptive polling
configuration, and enhanced config export/persistence.
Changes:
- Add config override system for discovery service
- Enhance system settings API with runtime logging controls
- Improve config persistence and export functionality
- Update security setup handling
- Refine monitoring and discovery service integration
These changes provide the backend support for the configuration
features documented in the v4.24.0 release.
Discovery Fixes:
- Always update cache even when scan finds no servers (prevents stale data)
- Remove automatic re-add of deleted nodes to discovery (was causing confusion)
- Optimize Docker subnet scanning from 762 IPs to 254 IPs (3x faster)
- Add getHostSubnetFromGateway() to detect host network from container
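The subnet derivation can be sketched as below: given the gateway address, mask it down to a /24, which yields the 254-host scan range. Sketch only; the real getHostSubnetFromGateway() also has to discover the gateway address itself (e.g. from the container's routing table), which is omitted here.

```go
package main

import (
	"fmt"
	"net"
)

// subnetFromGateway derives a /24 scan range from the host gateway,
// e.g. "192.168.1.1" -> "192.168.1.0/24" (254 usable hosts).
func subnetFromGateway(gateway string) (string, error) {
	ip := net.ParseIP(gateway)
	if ip == nil || ip.To4() == nil {
		return "", fmt.Errorf("not an IPv4 address: %q", gateway)
	}
	network := ip.Mask(net.CIDRMask(24, 32)) // zero the host octet
	return network.String() + "/24", nil
}

func main() {
	cidr, err := subnetFromGateway("192.168.1.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(cidr) // prints "192.168.1.0/24"
}
```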
Frontend Type Fixes:
- Fix ThresholdsTable editScope type errors
- Fix SnapshotAlertConfig index signature
- Remove unused variable in Settings.tsx
These changes make discovery faster, more reliable, and fix the issue where
deleted nodes would persist in the discovery cache or immediately reappear.