mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 11:30:15 +00:00

rcourtman 778a2577b6 feat: Pulse v6 release

2026-03-18 16:06:30 +00:00

Storage Architecture Proposal

This document defines the intended storage model for Pulse beyond the current "show storage resources and raw S.M.A.R.T. fields" behavior.

The goal is to make storage genuinely useful for operators, not merely visible.

Problem

Today Pulse can surface storage-adjacent data from several sources:

Proxmox storage pools
Proxmox physical disks
Ceph
host-agent disk inventories
host-agent S.M.A.R.T. data
TrueNAS pools/datasets/disks

That is useful, but it is not yet a coherent storage product.

The current gaps are:

disk data is source-shaped instead of operator-shaped
S.M.A.R.T. attributes are visible, but risk is not modeled
topology is weak: disk -> pool/array/host/workload impact is incomplete
agent-only hosts need first-class storage treatment, not second-class fallback behavior
storage alerting is mostly threshold-oriented rather than consequence-oriented

Product Principle

Operators do not want "S.M.A.R.T. monitoring."

They want answers to:

Which disks are at risk?
Which pools/arrays are at risk because of those disks?
Is redundancy still intact?
Is this getting worse?
What needs action now?

Pulse should therefore treat S.M.A.R.T. as one input signal inside a broader storage health model.

Primary User Jobs

Homelab / power users

Identify failing disks before data loss
See parity/cache/array issues clearly
Map a bad disk to a specific device/serial/path
Understand whether replacement is urgent or watch-only

SMB / business operators

See storage risk by host, cluster, site, and business impact
Know whether backup targets and primary storage remain healthy
Detect degraded redundancy, not just degraded disks
Track long-term degradation trends and maintenance windows

Canonical Storage Model

Pulse should model storage in four layers.

1. Physical Disk

This is the actual block device.

Canonical resource type:

physical_disk

Identity signals, strongest first:

serial
WWN / EUI
controller-specific stable disk ID
source-scoped fallback (host, device path)

Core fields:

serial, WWN, device path
model, vendor, firmware
transport / type (sata, sas, nvme, usb, etc.)
size
health / risk / confidence
temperature
wear indicators
media / pending / reallocated / CRC / unsafe-shutdown style counters
telemetry freshness
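
The identity ordering above can be sketched as a small key-selection function. This is a minimal illustration, not Pulse's implementation: the names `DiskIdentitySignals` and `canonical_disk_key`, and the normalization rules (trimming and case-folding), are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DiskIdentitySignals:
    """Identity signals one source may report for a disk (hypothetical shape)."""
    host: str
    device_path: str
    serial: Optional[str] = None
    wwn: Optional[str] = None                 # WWN or EUI
    controller_disk_id: Optional[str] = None  # controller-specific stable ID


def canonical_disk_key(sig: DiskIdentitySignals) -> Tuple[str, str]:
    """Return (kind, value) using the strongest signal that is present."""
    if sig.serial:
        # Assumed normalization: trim whitespace, upper-case.
        return ("serial", sig.serial.strip().upper())
    if sig.wwn:
        # Assumed normalization: trim whitespace, lower-case.
        return ("wwn", sig.wwn.strip().lower())
    if sig.controller_disk_id:
        return ("controller", sig.controller_disk_id)
    # Source-scoped fallback: only meaningful within a single host.
    return ("host_path", f"{sig.host}:{sig.device_path}")
```

Returning the signal kind alongside the value lets a later merge step treat a `host_path` match as weaker evidence than a `serial` match.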

2. Storage Membership

This is the topology layer.

A disk is often only meaningful in context:

member of mdraid array
member of ZFS vdev/pool
Unraid parity/data/cache assignment
Ceph OSD backing device
PBS datastore backing disk set

Pulse should model storage membership as first-class relationships, not implicit text fields.

Examples:

disk -> host
disk -> array
disk -> pool
disk -> OSD
pool -> workloads
datastore -> backup jobs / recovery points
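
One way to make these relationships first-class is to store them as typed edges rather than as strings on the disk record. The sketch below is illustrative only; the `Edge` and `Topology` names, the relation labels, and the `kind:id` node naming are assumptions, not an existing Pulse schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Edge:
    source: str      # e.g. "disk:WD-123"
    relation: str    # e.g. "attached_to", "member_of", "backs" (hypothetical labels)
    target: str      # e.g. "pool:tank"
    role: str = ""   # optional membership role, e.g. "parity", "cache", "data"


class Topology:
    """In-memory edge store indexed by source node."""

    def __init__(self) -> None:
        self._out: Dict[str, List[Edge]] = defaultdict(list)

    def add(self, edge: Edge) -> None:
        self._out[edge.source].append(edge)

    def edges_from(self, source: str, relation: Optional[str] = None) -> List[Edge]:
        # .get avoids creating empty entries for unknown nodes.
        edges = self._out.get(source, [])
        return [e for e in edges if relation is None or e.relation == relation]


topo = Topology()
topo.add(Edge("disk:WD-123", "attached_to", "host:nas1"))
topo.add(Edge("disk:WD-123", "member_of", "array:unraid-main", role="parity"))
topo.add(Edge("pool:tank", "backs", "vm:101"))
```

Because the role lives on the edge, the same disk can be a parity member of one array and still be queried by host without parsing free-text fields.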

3. Logical Storage Object

These are the operator-facing objects:

pool
datastore
filesystem
dataset
share
Ceph cluster / pool
backup repository

Canonical resource types already mostly exist:

storage
datastore
ceph

These resources should carry:

capacity
health
redundancy state
rebuild/resilver/scrub state
impacted children

4. Consumer Impact

This is the "why should I care" layer.

Storage objects should be traceable to:

VMs
LXCs
app containers / pods
backup jobs
recovery points

This allows Pulse to answer:

a degraded mirror affects these VMs
this backup datastore is filling and will affect these protection jobs
this failed disk left this array with no redundancy
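
Answering these questions amounts to walking the topology from a storage object down to its consumers. A minimal sketch, assuming the relationships are available as a plain adjacency mapping and that node IDs carry a type prefix such as `vm:` or `backup_job:` (both are assumptions for illustration):

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical prefixes marking consumer-layer nodes.
CONSUMER_PREFIXES: Tuple[str, ...] = ("vm:", "lxc:", "pod:", "backup_job:", "recovery_point:")


def impacted_consumers(edges: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk from `start`, collecting every reachable consumer node."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt.startswith(CONSUMER_PREFIXES):
                impacted.append(nxt)
            queue.append(nxt)
    return sorted(impacted)


edges = {
    "disk:A": ["pool:tank"],
    "pool:tank": ["vm:101", "lxc:200"],
    "disk:B": ["datastore:pbs"],
    "datastore:pbs": ["backup_job:nightly"],
}
```

With this example graph, `impacted_consumers(edges, "disk:A")` returns the LXC and VM that sit on `pool:tank`, while `disk:B` resolves to the nightly backup job.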

S.M.A.R.T. Model

Raw telemetry

Pulse should ingest raw S.M.A.R.T. data when available, including vendor-specific subsets.

Raw attributes remain important in the detail view, but they should not be the primary UX.

Derived model

Pulse should derive a normalized disk health model from raw telemetry:

health_state
- healthy
- watch
- degraded
- critical
- unknown
risk_score
- 0-100
confidence
- low / medium / high
reason_codes
- pending_sectors_nonzero
- reallocated_sectors_rising
- nvme_spare_low
- temperature_sustained_high
- smart_failed
- telemetry_missing
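
To illustrate how raw telemetry could map onto this model, here is a minimal rule-based sketch. The weights, score bands, temperature limit, and confidence rule are all hypothetical placeholders chosen for the example; the proposal does not specify them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SmartSnapshot:
    """Subset of raw signals; None means the source did not report it."""
    smart_passed: Optional[bool] = None
    pending_sectors: Optional[int] = None
    reallocated_delta: Optional[int] = None        # change since previous sample
    nvme_available_spare_pct: Optional[int] = None
    nvme_spare_threshold_pct: Optional[int] = None
    sustained_temp_c: Optional[float] = None


@dataclass
class DiskHealth:
    health_state: str
    risk_score: int
    confidence: str
    reason_codes: List[str] = field(default_factory=list)


# Hypothetical weights per reason code.
WEIGHTS = {
    "smart_failed": 100,
    "nvme_spare_low": 60,
    "pending_sectors_nonzero": 50,
    "reallocated_sectors_rising": 40,
    "temperature_sustained_high": 20,
}


def derive_health(s: SmartSnapshot, temp_limit_c: float = 60.0) -> DiskHealth:
    signals = [s.smart_passed, s.pending_sectors, s.reallocated_delta,
               s.nvme_available_spare_pct, s.sustained_temp_c]
    present = sum(v is not None for v in signals)
    if present == 0:
        return DiskHealth("unknown", 0, "low", ["telemetry_missing"])

    reasons = []
    if s.smart_passed is False:
        reasons.append("smart_failed")
    if s.pending_sectors:
        reasons.append("pending_sectors_nonzero")
    if s.reallocated_delta is not None and s.reallocated_delta > 0:
        reasons.append("reallocated_sectors_rising")
    if (s.nvme_available_spare_pct is not None
            and s.nvme_spare_threshold_pct is not None
            and s.nvme_available_spare_pct <= s.nvme_spare_threshold_pct):
        reasons.append("nvme_spare_low")
    if s.sustained_temp_c is not None and s.sustained_temp_c >= temp_limit_c:
        reasons.append("temperature_sustained_high")

    score = min(100, sum(WEIGHTS[r] for r in reasons))
    if score >= 80:
        state = "critical"
    elif score >= 40:
        state = "degraded"
    elif score > 0:
        state = "watch"
    else:
        state = "healthy"
    # Assumed rule: more reported signals means higher confidence.
    confidence = "high" if present >= 3 else "medium" if present == 2 else "low"
    return DiskHealth(state, score, confidence, reasons)
```

Keeping confidence separate from the score means a disk that reports only a passing overall status is shown as healthy with low confidence rather than as definitively fine.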

Trend model

Current values are not enough.

Pulse should preserve time series for:

temperature
reallocated sectors
pending sectors
media errors
NVMe percentage used
available spare
unsafe shutdowns

Trend direction matters:

stable
improving
slowly worsening
sharply worsening
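
A simple way to turn a stored time series into one of these directions is a least-squares slope compared against per-metric thresholds. The sketch below assumes samples are `(unix_seconds, value)` pairs; the threshold values and the `higher_is_worse` flag are hypothetical and would need tuning per metric.

```python
from typing import List, Tuple

SECONDS_PER_DAY = 86400.0


def slope_per_day(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of value over time, in units per day."""
    n = len(samples)
    if n < 2:
        return 0.0
    xs = [t / SECONDS_PER_DAY for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # all samples share one timestamp
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom


def classify_trend(samples: List[Tuple[float, float]],
                   slow: float = 0.1, sharp: float = 1.0,
                   higher_is_worse: bool = True) -> str:
    """Map a slope onto the four trend directions (thresholds are illustrative)."""
    s = slope_per_day(samples)
    if not higher_is_worse:
        s = -s  # e.g. NVMe available spare: falling values are the bad direction
    if s >= sharp:
        return "sharply worsening"
    if s >= slow:
        return "slowly worsening"
    if s <= -slow:
        return "improving"
    return "stable"
```

For example, a pending-sector count that rises by two per day classifies as sharply worsening with these defaults, while available spare would be evaluated with `higher_is_worse=False`.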

Source Strategy

Proxmox

Use Proxmox for:

storage pools
physical disks when available
Ceph
host/node topology

Use agent linkage to enrich Proxmox disks with:

better temperature coverage
richer S.M.A.R.T. attributes
better device identity

Unified host agent

The host agent must be a first-class storage source, not only an enrichment source.

For agent-backed hosts, Pulse should directly create:

physical_disk resources from agent S.M.A.R.T.
logical storage resources when the agent can report them
storage topology when the platform supports it

This matters for:

Unraid
generic Linux servers
bare-metal NAS boxes
non-Proxmox storage hosts

Unraid

Unraid deserves explicit treatment rather than being handled as generic Linux indefinitely.

Pulse should ultimately understand:

array state
parity devices
cache pools
disk disabled / missing / emulated state
rebuild progress
filesystem status
share impact

The initial fallback can still be generic host-agent disk ingestion, but the end state should be Unraid-aware topology.

ZFS / TrueNAS

Pulse should normalize:

pool health
vdev health
read/write/checksum errors
scrub status and age
resilver status and age
per-disk membership

Generic Linux

Even without a rich platform API, Pulse should still provide value:

agent physical disks
mdraid state if available
mount/device correlation
filesystem usage
telemetry coverage warnings

Alerts

Storage alerts should be layered.

Disk alerts

Examples:

S.M.A.R.T. failed
pending sectors non-zero
reallocated sectors rising
NVMe spare below threshold
sustained high temperature

Redundancy alerts

Examples:

pool degraded but still redundant
array has lost redundancy
parity invalid / parity missing
OSD count below safe threshold
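
The distinction between "degraded but still redundant" and "has lost redundancy" can be expressed from two numbers: how many member failures the layout tolerates, and how many members have failed. A minimal sketch, with state names that are assumptions for illustration:

```python
def redundancy_state(fault_tolerance: int, failed_members: int) -> str:
    """Classify a pool/array from its tolerated and actual member failures.

    fault_tolerance: failures the layout survives (e.g. 1 for a two-way
    mirror or single parity, 2 for dual parity).
    """
    if failed_members <= 0:
        return "intact"
    remaining = fault_tolerance - failed_members
    if remaining > 0:
        return "degraded_redundant"   # degraded, can still absorb a failure
    if remaining == 0:
        return "redundancy_lost"      # still serving data, next failure is fatal
    return "failed"                   # more failures than the layout tolerates
```

Under this rule a dual-parity group with one failed disk is `degraded_redundant`, while a two-way mirror with one failed disk is already `redundancy_lost`, which is the case that should alert more urgently.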

Capacity alerts

Examples:

pool nearing full
backup datastore nearing full
cache pool under pressure

Telemetry coverage alerts

Examples:

disk telemetry missing for previously known disk
controller blocks S.M.A.R.T. visibility
host stopped reporting disk inventory

This category is important because silent storage blind spots are dangerous.
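
A coverage check can be as simple as comparing each known disk's last telemetry timestamp against a staleness window. This is a sketch under assumptions: the `last_seen` mapping, the alert code `telemetry_never_reported`, and the 15-minute default are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def coverage_alerts(last_seen: Dict[str, Optional[float]],
                    now: float,
                    stale_after_s: float = 900.0) -> List[Tuple[str, str]]:
    """Return (disk_id, alert_code) for known disks without fresh telemetry.

    last_seen maps each previously known disk to the unix time of its most
    recent telemetry sample, or None if it has never reported any.
    """
    alerts = []
    for disk_id, seen_at in sorted(last_seen.items()):
        if seen_at is None:
            alerts.append((disk_id, "telemetry_never_reported"))
        elif now - seen_at > stale_after_s:
            alerts.append((disk_id, "telemetry_missing"))
    return alerts
```

The key design point is that the inventory of known disks is retained independently of incoming telemetry, so a disk that goes quiet produces an alert instead of silently disappearing from the view.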

UX Proposal

The storage surface should be organized around three questions.

1. What is at risk?

The top-level storage page should prioritize:

disks needing attention
degraded pools/arrays
rebuilds/resilvers in progress
backup repositories at risk

2. Where is the risk?

Every disk or pool should show context:

host
platform
array / pool / vdev / parity role
impacted workloads / backups

3. What should I do?

Each finding should have a recommended action:

replace now
schedule maintenance
monitor trend
investigate controller / cable / cooling
improve telemetry coverage
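
As an illustration of how findings could map to these actions, the sketch below picks the most urgent action implied by a disk's reason codes. The specific code-to-action mapping and the priority order are assumptions for the example, not decisions made in this proposal.

```python
from typing import List, Optional

# Most urgent first (assumed ordering).
ACTION_PRIORITY = [
    "replace now",
    "schedule maintenance",
    "investigate controller / cable / cooling",
    "improve telemetry coverage",
    "monitor trend",
]

# Hypothetical mapping from reason codes to recommended actions.
REASON_TO_ACTION = {
    "smart_failed": "replace now",
    "pending_sectors_nonzero": "schedule maintenance",
    "reallocated_sectors_rising": "schedule maintenance",
    "nvme_spare_low": "schedule maintenance",
    "temperature_sustained_high": "investigate controller / cable / cooling",
    "telemetry_missing": "improve telemetry coverage",
}


def recommend_action(reason_codes: List[str]) -> Optional[str]:
    """Return the most urgent action for the given reason codes, or None if no finding."""
    if not reason_codes:
        return None
    # Unknown codes fall back to the least urgent action.
    actions = {REASON_TO_ACTION.get(code, "monitor trend") for code in reason_codes}
    return min(actions, key=ACTION_PRIORITY.index)
```

A disk with both a sustained-temperature finding and a failed overall status resolves to "replace now", so the operator sees one action per finding rather than a list of competing hints.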

Recommended Page Structure

Fleet summary

disks at risk
degraded storage objects
active rebuild/resilver operations
storage capacity hotspots

Disk view

Grouped and filterable by:

host
pool / array
risk state
platform
disk type

Columns:

device / serial
host
role
health
risk
temperature
wear
trend
last seen

Topology view

For a selected disk:

parent host
array / pool / vdev membership
redundancy state
affected storage objects
affected workloads / backups

Detail drawer

Include:

normalized summary
risk reasons
trend charts
raw S.M.A.R.T. attributes
source provenance
telemetry freshness

Data Model Requirements

The canonical unified resource model should support:

physical_disk from every valid source
disk identity merge across sources
parent/child relationships between host, disk, pool, workload
source provenance per disk field when signals disagree
storage topology edges, not just flat metadata blobs
freshness per source and per sub-signal
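
The provenance and freshness requirements can be sketched as a per-field merge that remembers which source supplied each value and which values it displaced. The `Observation` shape and the "freshest value wins" policy are assumptions for illustration; a real merge might instead rank sources per field.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Observation:
    source: str         # e.g. "proxmox", "agent", "truenas"
    name: str           # field name, e.g. "temperature"
    value: Any
    observed_at: float  # unix seconds


def merge_fields(observations: List[Observation]) -> Dict[str, Dict[str, Any]]:
    """Merge observations for one disk, keeping provenance per field.

    Assumed policy: the most recent observation wins; earlier values that
    disagree with their successor are kept under "conflicts".
    """
    merged: Dict[str, Dict[str, Any]] = {}
    for obs in sorted(observations, key=lambda o: o.observed_at):
        entry = merged.setdefault(
            obs.name,
            {"value": None, "source": None, "observed_at": None, "conflicts": []},
        )
        if entry["source"] is not None and entry["value"] != obs.value:
            entry["conflicts"].append((entry["source"], entry["value"]))
        entry.update(value=obs.value, source=obs.source, observed_at=obs.observed_at)
    return merged


merged = merge_fields([
    Observation("proxmox", "temperature", 41, 100.0),
    Observation("agent", "temperature", 43, 160.0),
    Observation("agent", "firmware", "1.2", 160.0),
])
```

Here `merged["temperature"]` carries the agent's value along with the displaced Proxmox reading, and each field's `observed_at` gives freshness per sub-signal.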

Rollout Plan

Phase 1: Canonical disk coverage

ensure every agent-backed host can emit physical_disk
unify disk identity across agent / Proxmox / TrueNAS sources
show agent-only disks in storage
attach disk metrics targets consistently

Phase 2: Disk health model

add derived S.M.A.R.T. health / risk / confidence
add reason codes
add telemetry freshness semantics
improve disk alerts

Phase 3: Topology

model disk -> pool/array/vdev membership
model redundancy state
propagate impact to workloads / backups

Phase 4: Platform specialization

Unraid-aware storage model
deeper ZFS / TrueNAS topology
mdraid normalization
controller-specific enrichments where feasible

Phase 5: Operator UX

risk-first storage landing page
action-oriented recommendations
maintenance-friendly detail workflows

Near-Term Priority

If I were sequencing this immediately, I would prioritize:

agent-only physical disk coverage
canonical disk identity merge by serial / WWN
disk metrics and S.M.A.R.T. trend persistence for agent-backed disks
derived disk risk model
topology edges for arrays/pools/parity

That gives Pulse a strong storage foundation before investing in more UI complexity.

Definition of "Useful"

Pulse storage is useful when an operator can answer, in under a minute:

what is unhealthy
what is merely noisy
what is losing redundancy
what will impact workloads or backups
what needs action now

If the user still has to mentally decode raw S.M.A.R.T. tables to get there, the storage model is not finished.
