Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 03:20:11 +00:00

Author	SHA1	Message	Date
rcourtman	387a79bc2d	Defer metrics store startup maintenance	2026-04-01 12:44:17 +01:00
rcourtman	177ae5f6da	Tighten integer and allocation bounds for CodeQL	2026-03-31 09:50:11 +01:00
rcourtman	9155480bbd	Use explicit integer bounds in Proxmox parsing	2026-03-31 09:43:04 +01:00
rcourtman	57cf7b305a	Validate SAML redirects and request URLs	2026-03-31 09:34:31 +01:00
rcourtman	33efdc3fb5	Normalize outbound client and update URLs	2026-03-31 09:31:56 +01:00
rcourtman	4f89a13975	Harden storage path handling for CodeQL	2026-03-31 08:55:45 +01:00
rcourtman	e93c8b40ae	Fix CodeQL integer and audit findings	2026-03-28 13:33:48 +00:00
rcourtman	e306c0a461	Tolerate partial guest network address payloads (#1319 ) Some checks are pending Build and Test / Secret Scan (push) Waiting to run Details Build and Test / Frontend & Backend (push) Waiting to run Details Core E2E Tests / Playwright Core E2E (push) Waiting to run Details	2026-03-27 17:09:09 +00:00
rcourtman	81b0a567ce	Harden guest network interface parsing (#1319 )	2026-03-27 17:05:34 +00:00
rcourtman	2ed4253573	Accept object-style single guest fsinfo results (#1319 )	2026-03-27 16:33:41 +00:00
rcourtman	d11e3d8f2d	Use Ceph monmap and mgrmap counts in cluster summaries (#1319 )	2026-03-27 16:23:57 +00:00
rcourtman	3d27c8f006	Accept object-style guest fsinfo disk metadata (#1319 )	2026-03-27 15:24:40 +00:00
rcourtman	fcfa0c2903	Skip malformed guest fsinfo entries (#1319 )	2026-03-27 15:23:13 +00:00
rcourtman	d4242d9a13	Fix ZFS pool attachment in storage frontend (discussion #1351 )	2026-03-27 14:59:52 +00:00
rcourtman	b5629fb1df	Normalize Windows volume GUID fsinfo mountpoints (#1319 )	2026-03-27 14:04:58 +00:00
rcourtman	b05d2b0489	Handle Windows fsinfo name fallback for guest disks (#1319 )	2026-03-27 11:39:22 +00:00
rcourtman	1f332bee52	Support privileged fsinfo totals for guest disks (#1319 )	2026-03-27 11:18:53 +00:00
rcourtman	f45f7401c0	Make metrics Flush wait for queued writes	2026-03-25 14:14:00 +00:00
rcourtman	2acf2e9ef9	Reduce metrics store transaction churn (#1124 )	2026-03-25 12:06:28 +00:00
rcourtman	1885bd02c0	Fix Proxmox tag color parsing (#1348 )	2026-03-25 10:40:31 +00:00
rcourtman	3a02dd171b	fix(proxmox): add GetClusterOptions to ClusterClient for tag colour fetch	2026-03-15 19:51:20 +00:00
rcourtman	caff845c1a	fix(ui): use Proxmox tag colours from datacenter config Pulse was generating tag colours from a hash of the tag name instead of using the colours configured in Proxmox. Now polls /cluster/options once per PVE instance and merges the tag-style colour map into state, which the frontend uses as the first-priority colour source for tag badges. Falls back to the existing special-tag and hash-based colours when Proxmox hasn't set a custom colour for a tag.	2026-03-15 19:49:46 +00:00
rcourtman	499ab812e3	Fix post-release regressions and lock v5 to single-tenant runtime	2026-03-05 23:46:35 +00:00
rcourtman	ac83074fc2	fix(hostmetrics): skip network mounts before usage probe (#1313 )	2026-03-03 20:27:41 +00:00
rcourtman	adad2bbd1c	fix(server): fail fast on frontend bind errors	2026-03-03 15:43:22 +00:00
rcourtman	b38488f2da	fix(proxmox): stabilize pulse monitor token lifecycle	2026-03-03 10:57:19 +00:00
rcourtman	0ae2806f18	fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270 ) Proxmox status.Mem includes page cache as "used" memory, inflating reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked host agent) were frequently unavailable, causing most VMs to fall through to the inflated status-mem source. Adds a new last-resort fallback that reads /proc/meminfo via the QEMU guest agent file-read endpoint to get accurate MemAvailable. Results are cached (60s positive, 5min negative backoff for unsupported VMs). Also fixes: RRD memavailable fallback missing from traditional polling path, cache key collisions in multi-PVE setups, FreeMem underflow guard inconsistency, and integer overflow in kB-to-bytes conversion.	2026-02-20 13:31:52 +00:00
rcourtman	8c7d507ea4	fix(alerts): make --disk-exclude suppress Proxmox SSD wear/health alerts (#1142 ) The --disk-exclude agent flag only filtered local metric collection but had no effect on server-side Proxmox disk health and SSD wearout alerts, which poll the Proxmox API directly. Users excluding disks (e.g. --disk-exclude sda) still received alerts for those disks. Agent now sends its DiskExclude patterns in each report. The server stores them on the Host model and consults them during Proxmox disk polling — excluded disks get a synthetic healthy status passed to CheckDiskHealth so any existing alerts clear immediately. Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs, linsysfs) to the virtual FS filter and /var/run/ to special mount prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.	2026-02-20 13:31:52 +00:00
rcourtman	74d46807d8	fix(test): update /var/lib/docker fsfilter expectation to match new skip=false behavior (cherry picked from commit 6671a0d5a654ef8d0b023bd610e5c2fd5fedbf8b)	2026-02-18 12:59:23 +00:00
rcourtman	a54d71117b	fix(proxmox): prevent guest agent errors from marking endpoints unhealthy Backport of v6 commits a87c9950 and 347d7db1. Part 1 (a87c9950): Wrap the four guest agent c.get() errors with fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly scopes them to the VM rather than the cluster endpoint. Part 2 (347d7db1): Replace the 20+ pattern blocklist in executeWithFailover with an allowlist via isEndpointConnectivityError(). Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP response from Proxmox — including 500 — proves the node is reachable and returns the error without affecting endpoint health.	2026-02-18 12:59:20 +00:00
rcourtman	a69a2061cb	fix(ui,disk): remove double 'ago' suffix and allow /var/lib/docker block devices (#1266 , #1143 ) #1266: ageFormatted already includes 'ago' from formatTimeDiff(); remove the duplicate literal suffix from the backup age tooltip in GuestRow.tsx. #1143: Remove /var/lib/docker from specialMountPrefixes so real block devices mounted there are visible in disk usage. Container overlay layers (fstype=overlay) are already filtered by virtualFSTypes and are unaffected. (cherry picked from commit 5acef3405d4288f627788675123e266d661c2fe3)	2026-02-18 12:56:58 +00:00
rcourtman	efa916ee2a	fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox RRD's memavailable metric (which excludes reclaimable page cache) when the qemu-guest-agent doesn't provide MemInfo.Available. Previously the fallback was detailedStatus.Mem (total - MemFree), inflating usage to 80%+ on VMs with normal Linux page cache. Mirrors the existing LXC rrd-memavailable path. FreeBSD ZFS ARC (#1264, #1051): The host agent now reads kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts the ARC size from reported memory usage. ZFS ARC is reclaimable under memory pressure (like Linux SReclaimable) but gopsutil counts it as wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms. Fixes #1270 Fixes #1264 Fixes #1051 (cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)	2026-02-18 12:56:53 +00:00
rcourtman	9d8f8b45b5	fix(docker,metrics): preserve container metadata on update and reduce DB writes Docker container URL preserved on update (#1054): container updates recreate the container with a new runtime ID. The agent now includes {oldContainerId, newContainerId} in the completion ACK payload; the server uses this to copy persisted metadata (custom URLs, descriptions, tags) to the new ID so nothing is lost. Migration is a copy, not a move, so rollback scenarios still find metadata under the original ID. Reduce metrics.db write amplification (#1124): add a UNIQUE index on (resource_type, resource_id, metric_type, timestamp, tier) so rollup reprocessing after a failed checkpoint uses INSERT OR IGNORE instead of creating duplicate rows. Existing duplicates are deduplicated once on startup if the index creation would otherwise fail. Also sets wal_autocheckpoint(500) to checkpoint the WAL more frequently, preventing unbounded WAL growth. Fixes #1054 Fixes #1124	2026-02-18 12:56:46 +00:00
rcourtman	c2d2e7de0e	fix: run retention before auto-vacuum migration to reduce VACUUM cost VACUUM creates a full copy of the database. Running retention first deletes stale data (5GB → ~60MB live), so the VACUUM copies far less data — faster startup and much less temporary disk space needed.	2026-02-11 13:37:29 +00:00
rcourtman	06a9f9694f	fix: migrate existing metrics databases to incremental auto-vacuum The auto_vacuum(INCREMENTAL) pragma from the previous commit only takes effect on new databases. SQLite requires a full VACUUM to restructure existing files when switching from NONE to INCREMENTAL. Without this, users upgrading from bloated 5GB+ databases would never reclaim space. Adds a one-time migration on startup that detects the current auto_vacuum mode and runs VACUUM to convert if needed. Subsequent startups skip the migration since the mode is already INCREMENTAL.	2026-02-11 13:35:03 +00:00
rcourtman	284bdd7ade	fix: prevent metrics.db bloat with automatic vacuum and WAL checkpointing The metrics database could grow to 5GB+ for modest setups because: 1. Retention deletes rows hourly but SQLite never reclaims the space 2. WAL file grows unbounded without explicit checkpointing 3. No cleanup runs on startup, so restarts accumulate stale data Fixes: - Enable auto_vacuum=INCREMENTAL so deleted pages can be reclaimed - Run incremental_vacuum after each retention cleanup - Force WAL checkpoint(TRUNCATE) after deletes to prevent WAL bloat - Run retention on startup to clean stale data immediately Expected DB size for a 50-resource setup drops from 5GB+ to ~60-70MB. Ref: GitHub Discussion #1231	2026-02-10 23:13:32 +00:00
rcourtman	815c990e85	fix(proxmox): avoid 403 on apt update checks	2026-02-09 20:28:09 +00:00
rcourtman	5c18748742	Add SMART disk lifecycle monitoring with historical charts Expand the smartctl collector to capture detailed SMART attributes (SATA and NVMe), propagate them through the full data pipeline, persist them as time-series metrics, and display them in an interactive disk detail drawer with historical sparkline charts. Backend: add SMARTAttributes struct, writeSMARTMetrics for persistent storage, "disk" resource type in metrics API with live fallback. Frontend: enhanced DiskList with Power-On column and SMART warnings, new DiskDetail drawer matching NodeDrawer styling patterns, generic HistoryChart metric support with proper tooltip formatting.	2026-02-04 13:35:40 +00:00
rcourtman	6427f28a08	Fix stale metrics store reference in reporting engine after monitor reload The reporting engine held a direct pointer to the metrics store, which becomes invalid after a monitor reload (settings change, node config save, etc.) closes and recreates the store. Use a dynamic getter closure that always resolves to the current monitor's active store. Also adds diagnostic logging when report queries return zero metrics, and integration tests covering the full metrics-to-report pipeline including reload scenarios. Fixes #1186	2026-02-04 12:34:40 +00:00
rcourtman	5c1487e406	feat: add resource picker and multi-resource report generation Replace manual resource ID entry with a searchable, filterable resource picker that uses live WebSocket state. Support selecting multiple resources (up to 50) for combined fleet reports. Multi-resource PDFs include a cover page, fleet summary table with aggregate health status, and condensed per-resource detail pages with overlaid CPU/memory charts. Multi-resource CSVs include a summary section followed by interleaved time-series data with resource columns. New POST /api/admin/reports/generate-multi endpoint handles multi-resource requests while the existing single-resource GET endpoint remains unchanged. Also fixes resource ID validation regex to allow colons used in VM/container IDs (e.g., "instance:node:vmid").	2026-02-04 10:24:23 +00:00
rcourtman	2ebe65bbc5	security: add scope checks to AI Patrol and agent profile endpoints - AI Patrol mutation endpoints (acknowledge, dismiss, suppress, snooze, resolve, findings/note, suppressions/) now require ai:execute scope to prevent low-privilege tokens from blinding patrol by hiding/suppressing findings - Agent profile admin endpoints (/api/admin/profiles/) now require settings:write scope to prevent low-privilege tokens from modifying fleet-wide agent behavior	2026-02-03 19:29:56 +00:00
rcourtman	4fdc0cae64	feat(reporting): enrich metric reports with detailed resource info	2026-02-03 18:51:27 +00:00
rcourtman	11aa3e05af	fix(reporting): correct SSD life documentation and logic - Fix misleading comment in DiskInfo struct that said "percentage of life used" when it's actually "percentage of life REMAINING" - Document that 100 = healthy, 0 = end of life, -1 = unknown - This matches the Proxmox API behavior where wearout "100 is best"	2026-02-03 18:24:09 +00:00
rcourtman	9f412e69b3	fix(reporting): correct SSD life interpretation (100% = healthy, not worn) The WearLevel field represents SSD life REMAINING, not wear used: - 100% = fully healthy (new drive) - 0% = end of life Fixed logic to: - Show critical warning when life <= 10% (not >= 90%) - Show warning when life <= 30% (not >= 70%) - Display values in green when healthy (>30% life remaining) - Rename column from "Wear" to "Life" for clarity	2026-02-03 18:20:54 +00:00
rcourtman	442d29e9b9	feat(reporting): enhance PDF reports with Executive Summary and actionable insights - Add professional cover page with branding and report period - Add Executive Summary page with health status banner (HEALTHY/WARNING/CRITICAL) - Add Quick Stats section with color-coded metrics and trend indicators - Add Key Observations with automated analysis of CPU, memory, disk, and disk wear - Add Recommended Actions section with prioritized, actionable items - Add Resource Details page with hardware info, storage pools, physical disks - Add color-coded tables for alerts, storage, and disk health - Add performance charts with area fills and proper scaling - Improve overall visual design with consistent color scheme - Fix SAML session invalidation to use correct SessionStore method	2026-02-03 18:17:31 +00:00
rcourtman	c6aeb9429b	fix: initialize reporting engine in standard binary Pro license holders running the standard Docker image/binary were getting "Reporting engine not initialized" errors because the reporting engine was only wired up in the enterprise build. Now the core server initializes the reporting engine automatically when the metrics store is ready, ensuring PDF/CSV report generation works for all Pro license holders regardless of which binary they use. The enterprise hooks are still honored if set, allowing the enterprise build to override with its own implementation if needed.	2026-02-03 16:53:20 +00:00
rcourtman	35eedcb5ac	Fix: metrics store tier fallback for mock mode sparklines When querying short time ranges (1h, 6h), the metrics store only looked in TierRaw and TierMinute which were empty in mock mode. The seeded data was stored in TierHourly and TierDaily. Updated tierFallbacks to include coarser tiers as fallbacks: - TierRaw now falls back to TierMinute, then TierHourly - TierMinute now falls back to TierRaw, then TierHourly This ensures sparkline data is available in mock/demo mode where historical data is seeded into coarser tiers.	2026-02-03 12:03:06 +00:00
rcourtman	ed5ab5eebf	Fix: flaky metrics fallback test — use WriteBatchSync for deterministic writes	2026-02-02 23:32:28 +00:00
rcourtman	eb2d07e48f	Chore: enhance core api and metrics testability Refactor Router to allow HTTP client injection for install script proxying. Add tests for unified agent install mechanism and additional metrics store coverage.	2026-02-02 22:01:36 +00:00
rcourtman	3b347b6548	fix: harden SQLite against I/O contention causing persistent lock errors - Move all SQLite pragmas from db.Exec() to DSN parameters so every connection the pool creates gets busy_timeout and other settings. Previously only the first connection had these applied. - Set MaxOpenConns(1) on audit, RBAC, and notification databases (metrics already had this). Fixes potential for multiple connections where new ones lack busy_timeout. - Increase busy_timeout from 5s to 30s across all databases to tolerate disk I/O pressure during backup windows. - Fix nested query deadlocks in GetRoles(), GetUserAssignments(), and CancelByAlertIDs() that would deadlock with MaxOpenConns(1). - Fix circuit breaker retryInterval not resetting on recovery, which caused the next trip to start at 5-minute backoff instead of 5s. Related to #1156	2026-02-02 17:29:14 +00:00

1 2 3 4 5

217 commits