Commit graph

9 commits

Author SHA1 Message Date
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
rcourtman
2e0da42a81 chore: reliability and maintenance improvements
Host agent:
- Add SHA256 checksum verification for downloaded binaries
- Verify checksum file matches expected bundle filename

WebSocket:
- Add write failure tracking with graceful disconnection
- Increase write deadline to 30s for large state payloads
- Better handling for slow clients (Raspberry Pi, slow networks)

Monitoring:
- Remove unused temperature proxy imports
- Add monitor polling improvements
- Expand test coverage

Other:
- Update package.json dependencies
- Fix generate-release-notes.sh path handling
- Minor reporting engine cleanup
2026-01-22 00:45:04 +00:00
rcourtman
a885fb5472 Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
rcourtman
b95c01066e Capture dynamic LXC IP metrics (#596) 2025-10-23 07:50:45 +00:00
rcourtman
be85459db2 Add LXC config metadata for guest drawers (#596) 2025-10-23 07:30:32 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
73fb9d986f feat: add PBS/PMG stubs to test harness and implement HTTP config fetch
Resolves two remaining TODOs from codebase audit.

## 1. PBS/PMG Test Harness Stubs

**Location:** internal/monitoring/harness_integration.go:149-151

**Changes:**
- Added PBS client stub registration: `monitor.pbsClients[inst.Name] = &pbs.Client{}`
- Added PMG client stub registration: `monitor.pmgClients[inst.Name] = &pmg.Client{}`
- Added imports for pkg/pbs and pkg/pmg

**Purpose:**
Enables integration test scenarios to include PBS and PMG instance types
alongside existing PVE support. Stubs allow scheduler to register and
execute tasks for these instance types during integration testing.

**Testing:**
 TestAdaptiveSchedulerIntegration passes (55.5s)
 Integration test harness now supports all three instance types

## 2. HTTP Config URL Fetch

**Location:** cmd/pulse/config.go:226-261

**Problem:**
`PULSE_INIT_CONFIG_URL` was recognized but not implemented, returning
"URL import not yet implemented" error.

**Implementation:**
- URL validation (http/https schemes only)
- HTTP client with 15 second timeout
- Status code validation (2xx required)
- Empty response detection
- Base64 decoding with fallback to raw data
- Matches existing env-var behavior for `PULSE_INIT_CONFIG_DATA`

**Security:**
- Both HTTP and HTTPS supported (HTTPS recommended for production)
- URL scheme validation prevents file:// or other protocols
- Timeout prevents hanging on unresponsive servers

**Usage:**
```bash
export PULSE_INIT_CONFIG_URL="https://config-server/encrypted-config"
export PULSE_INIT_CONFIG_PASSPHRASE="secret"
pulse config auto-import
```

**Testing:**
 Code compiles cleanly
 Follows same pattern as existing PULSE_INIT_CONFIG_DATA handling

## Impact

- Completes integration test infrastructure for all instance types
- Enables automated config distribution via HTTP(S) for container deployments
- Removes last TODOs from codebase (no TODO/FIXME remaining in Go files)
2025-10-20 16:05:45 +00:00
rcourtman
14d06a1654 test: add soak test with runtime instrumentation (Phase 2 Task 9d)
Add comprehensive soak testing capabilities:

**Runtime Instrumentation:**
- Periodic sampling of heap, stack, goroutines, GC count
- Sample every 10s during harness runs
- HarnessReport includes full RuntimeSamples history
- Detect memory leaks (>10% sustained growth)
- Detect goroutine leaks (>20 leaked goroutines)

**Soak Test:**
- TestAdaptiveSchedulerSoak with 15min+ duration
- Skip unless -soak flag or HARNESS_SOAK_MINUTES set
- 80 synthetic instances (60 healthy, 15 transient, 5 permanent)
- Configurable duration via env var
- Validates: heap growth <10%, goroutines stable, queue depth bounded
- Staleness threshold: 45s for long-running tests

**Wrapper Script:**
- testing-tools/run_adaptive_soak.sh for easy execution
- Accepts duration in minutes: ./run_adaptive_soak.sh 30
- Logs to tmp/adaptive_soak_<timestamp>.log
- Sets proper timeout (duration + 5min buffer)

**Test Results (2-minute validation):**
- 80 instances, 17 samples
- Heap: 2.3MB → 3.1MB (healthy)
- Goroutines: 16 → 6 (no leak, actually decreased)
- Circuit breakers: correctly blocking transient failures

Run with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerSoak -soak -timeout 20m

Part of Phase 2 Task 9 (Integration/Soak Testing)
2025-10-20 15:13:38 +00:00
rcourtman
2636ba9137 test: add comprehensive integration test harness for adaptive polling (Phase 2 Task 9c)
Add PollExecutor seam and integration test infrastructure:

**PollExecutor Interface:**
- Add pluggable executor interface for testability
- Implement realExecutor wrapping existing poll functions
- Add SetExecutor() for test injection
- Zero impact on production behavior

**Integration Test Harness:**
- Build-tagged integration tests (go:build integration)
- Synthetic workload generator with configurable scenarios
- Fake executor simulating latencies, failures, recovery
- Runtime metrics collection (queue depth, staleness, goroutines)

**Comprehensive Assertions:**
- Queue depth bounds: stays within 1.5× instance count
- Staleness: healthy instances <20s, multiple poll cycles
- Circuit breakers: transient failures recover, permanent stay blocked
- Dead-letter queue: only permanent failures routed
- Scheduler health: snapshot consistency validation

**Test Scenarios:**
- 10 healthy PVE instances (rapid polling)
- 1 transient failure instance (fail → recover)
- 1 permanent failure instance (DLQ routing)
- 55s test duration with 3s base intervals
- Validates full adaptive scheduler lifecycle

Runs with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration

Part of Phase 2 Task 9 (Integration/Soak Testing)
2025-10-20 15:13:38 +00:00