Commit graph

14 commits

Author SHA1 Message Date
rcourtman
4120d75359 Surface shared cluster-only storage in alerts (#1341) 2026-03-30 19:25:54 +01:00
rcourtman
d4242d9a13 Fix ZFS pool attachment in storage frontend (discussion #1351) 2026-03-27 14:59:52 +00:00
rcourtman
ae6b663e95 Attach ZFS pools for dataset-backed storages 2026-03-26 22:29:32 +00:00
rcourtman
92e6075ee4 Fix ZFS pool matching for local-zfs storages 2026-03-26 09:09:17 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
rcourtman
ebc29b4fdb feat: show pending apt updates for Proxmox nodes (#1083)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance

Requires Sys.Audit permission on Proxmox API token to read apt updates.
2026-01-21 10:53:36 +00:00
rcourtman
af55362009 Fix inflated RAM usage reporting for LXC containers
Related to #553

## Problem

LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%,
96% when actual was 61%) because the code used the raw `mem` value from Proxmox's
`/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which
includes reclaimable cache and buffers, making memory appear nearly full even when
plenty is available.

## Root Cause

- **Nodes**: Had sophisticated cache-aware memory calculation with RRD fallbacks
- **VMs (qemu)**: Had detailed memory calculation using guest agent meminfo
- **LXCs**: Naively used `res.Mem` directly without any cache-aware correction

The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers
(from cgroup memory accounting), which should be excluded for accurate "used" memory.

## Solution

Implement cache-aware memory calculation for LXC containers by:

1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from
   `/nodes/{node}/lxc/{vmid}/rrddata`
2. Using RRD `memavailable` to calculate actual used memory (total - available)
3. Falling back to RRD `memused` if `memavailable` is not available
4. Only using cluster resources `mem` value as last resort

This matches the approach already used for nodes and VMs, providing consistent
cache-aware memory reporting across all resource types.

## Changes

- Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox
- Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations
- Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available
- Added guest memory snapshot recording for LXC containers
- Updated test stubs to implement the new interface method

## Testing

- Code compiles successfully
- Follows the same proven pattern used for nodes and VMs
- Includes diagnostic snapshot recording for troubleshooting
2025-11-06 00:16:18 +00:00
rcourtman
a885fb5472 Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
rcourtman
b95c01066e Capture dynamic LXC IP metrics (#596) 2025-10-23 07:50:45 +00:00
rcourtman
be85459db2 Add LXC config metadata for guest drawers (#596) 2025-10-23 07:30:32 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
57429900a6 feat: add adaptive polling scheduler infrastructure (Phase 2 Tasks 1-3)
Implements adaptive scheduling foundation for Phase 2:
- Poll cycle metrics: duration, staleness, queue depth, in-flight counters
- Adaptive scheduler with pluggable staleness/interval/enqueue interfaces
- Config support: ADAPTIVE_POLLING_ENABLED flag + min/max/base intervals
- Feature flag defaults to disabled for safe rollout
- Scheduler wiring into Monitor with conditional instantiation

Tasks 1-3 of 10 complete. Ready for staleness tracker implementation.
2025-10-20 15:13:37 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00