Storage Architecture Proposal
This document defines the intended storage model for Pulse beyond the current "show storage resources and raw S.M.A.R.T. fields" behavior.
The goal is to make storage genuinely useful for operators, not merely visible.
Problem
Today Pulse can surface storage-adjacent data from several sources:
- Proxmox storage pools
- Proxmox physical disks
- Ceph
- host-agent disk inventories
- host-agent S.M.A.R.T. data
- TrueNAS pools/datasets/disks
That is useful, but it is not yet a coherent storage product.
The current gaps are:
- disk data is source-shaped instead of operator-shaped
- S.M.A.R.T. attributes are visible, but risk is not modeled
- topology is weak: disk -> pool/array/host/workload impact is incomplete
- agent-only hosts need first-class storage treatment, not second-class fallback behavior
- storage alerting is mostly threshold-oriented rather than consequence-oriented
Product Principle
Operators do not want "S.M.A.R.T. monitoring."
They want answers to:
- Which disks are at risk?
- Which pools/arrays are at risk because of those disks?
- Is redundancy still intact?
- Is this getting worse?
- What needs action now?
Pulse should therefore treat S.M.A.R.T. as one input signal inside a broader storage health model.
Primary User Jobs
Homelab / power users
- Identify failing disks before data loss
- See parity/cache/array issues clearly
- Map a bad disk to a specific device/serial/path
- Understand whether replacement is urgent or watch-only
SMB / business operators
- See storage risk by host, cluster, site, and business impact
- Know whether backup targets and primary storage remain healthy
- Detect degraded redundancy, not just degraded disks
- Track long-term degradation trends and maintenance windows
Canonical Storage Model
Pulse should model storage in four layers.
1. Physical Disk
This is the actual block device.
Canonical resource type: `physical_disk`
Identity signals, strongest first:
- serial
- WWN / EUI
- controller-specific stable disk ID
- source-scoped fallback (host, device path)
Core fields:
- serial, WWN, device path
- model, vendor, firmware
- transport / type (`sata`, `sas`, `nvme`, `usb`, etc.)
- size
- health / risk / confidence
- temperature
- wear indicators
- media / pending / reallocated / CRC / unsafe-shutdown style counters
- telemetry freshness
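To make the canonical shape concrete, here is a minimal Go sketch of a `physical_disk` record. Field names and types are illustrative assumptions, not a committed schema.

```go
package storage

import "time"

// PhysicalDisk is a sketch of the canonical physical_disk resource.
// Field names are illustrative, not a committed schema.
type PhysicalDisk struct {
	// Identity, strongest signal first.
	Serial     string
	WWN        string
	DevicePath string // host-scoped fallback, e.g. /dev/nvme0n1

	// Descriptive metadata.
	Model     string
	Vendor    string
	Firmware  string
	Transport string // "sata", "sas", "nvme", "usb", ...
	SizeBytes uint64

	// Derived health (see the S.M.A.R.T. Model section).
	HealthState string // healthy | watch | degraded | critical | unknown
	RiskScore   int    // 0-100
	Confidence  string // low | medium | high

	// Selected counters kept on the resource for quick display.
	TemperatureC       *float64
	WearPercentUsed    *float64
	ReallocatedSectors *int64
	PendingSectors     *int64
	MediaErrors        *int64
	CRCErrors          *int64
	UnsafeShutdowns    *int64

	// Telemetry freshness.
	LastSeen time.Time
}
```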
2. Storage Membership
This is the topology layer.
A disk is often only meaningful in context:
- member of mdraid array
- member of ZFS vdev/pool
- Unraid parity/data/cache assignment
- Ceph OSD backing device
- PBS datastore backing disk set
Pulse should model storage membership as first-class relationships, not implicit text fields.
Examples:
- disk -> host
- disk -> array
- disk -> pool
- disk -> OSD
- pool -> workloads
- datastore -> backup jobs / recovery points
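One way to keep these relationships first-class is to store them as explicit edges between canonical resource IDs. A hedged Go sketch, with an edge-kind set that is illustrative rather than exhaustive:

```go
package storage

// EdgeKind names a storage topology relationship.
// The set below is illustrative, not exhaustive.
type EdgeKind string

const (
	EdgeDiskOfHost        EdgeKind = "disk_of_host"
	EdgeMemberOfArray     EdgeKind = "member_of_array"
	EdgeMemberOfPool      EdgeKind = "member_of_pool"
	EdgeBacksOSD          EdgeKind = "backs_osd"
	EdgePoolHostsWorkload EdgeKind = "pool_hosts_workload"
	EdgeDatastoreForJob   EdgeKind = "datastore_for_backup_job"
)

// Edge links two canonical resources by ID, e.g. a physical_disk to a pool.
type Edge struct {
	From string
	To   string
	Kind EdgeKind
}
```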
3. Logical Storage Object
These are the operator-facing objects:
- pool
- datastore
- filesystem
- dataset
- share
- Ceph cluster / pool
- backup repository
Canonical resource types already mostly exist: `storage`, `datastore`, `ceph`.
These resources should carry:
- capacity
- health
- redundancy state
- rebuild/resilver/scrub state
- impacted children
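A small sketch of what such a logical storage resource could carry, reusing the same hypothetical Go package as the disk sketch above; the field names are assumptions:

```go
package storage

// LogicalStorage is a sketch of the operator-facing object (pool, datastore,
// dataset, share, ...). Field names are assumptions, not a final schema.
type LogicalStorage struct {
	ID   string
	Kind string // "pool", "datastore", "filesystem", "dataset", "share", ...

	// Capacity.
	TotalBytes uint64
	UsedBytes  uint64

	// Health and redundancy.
	Health          string // e.g. "online", "degraded", "faulted", "unknown"
	RedundancyState string // e.g. "redundant", "degraded_still_redundant", "no_redundancy"

	// In-flight maintenance: rebuild, resilver, or scrub.
	ActiveOperation string  // empty when idle
	OperationPct    float64 // progress of the active operation, 0-100

	// Children currently affected by the state above (disk, vdev, dataset IDs).
	ImpactedChildren []string
}
```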
4. Consumer Impact
This is the "why should I care" layer.
Storage objects should be traceable to:
- VMs
- LXCs
- app containers / pods
- backup jobs
- recovery points
This allows Pulse to answer:
- a degraded mirror affects these VMs
- this backup datastore is filling and will affect these protection jobs
- this failed disk left this array with no redundancy
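With explicit topology edges, answering those questions becomes a graph walk. A rough sketch, building on the hypothetical `Edge` type from the membership section and assuming workload resource IDs carry a `vm:` or `ct:` prefix:

```go
package storage

import "strings"

// ImpactedWorkloads walks topology edges outward from a disk and collects
// the workloads it ultimately backs. A breadth-first sketch; not optimized.
func ImpactedWorkloads(diskID string, edges []Edge) []string {
	adj := map[string][]string{}
	for _, e := range edges {
		adj[e.From] = append(adj[e.From], e.To)
	}

	seen := map[string]bool{diskID: true}
	queue := []string{diskID}
	var workloads []string

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range adj[cur] {
			if seen[next] {
				continue
			}
			seen[next] = true
			queue = append(queue, next)
			// Assumption: workload resource IDs use a "vm:" or "ct:" prefix.
			if strings.HasPrefix(next, "vm:") || strings.HasPrefix(next, "ct:") {
				workloads = append(workloads, next)
			}
		}
	}
	return workloads
}
```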
S.M.A.R.T. Model
Raw telemetry
Pulse should ingest raw S.M.A.R.T. data when available, including vendor-specific subsets.
Raw attributes remain important in the detail view, but they should not be the primary UX.
Derived model
Pulse should derive a normalized disk health model from raw telemetry:
- `health_state`: healthy / watch / degraded / critical / unknown
- `risk_score`: 0-100
- `confidence`: low / medium / high
- `reason_codes`: for example `pending_sectors_nonzero`, `reallocated_sectors_rising`, `nvme_spare_low`, `temperature_sustained_high`, `smart_failed`, `telemetry_missing`
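A minimal sketch of how the derivation could work, with placeholder thresholds and risk weights that would need tuning; the reason codes mirror the list above:

```go
package storage

// RawSMART holds the subset of telemetry the derivation below consumes.
type RawSMART struct {
	SmartFailed        bool
	PendingSectors     int64
	ReallocatedSectors int64
	NVMeSparePct       *int64   // nil when not an NVMe device
	TemperatureC       *float64 // nil when unreported
	TelemetryFresh     bool
}

// DeriveHealth maps raw counters to the normalized health model.
// Thresholds and risk weights are placeholders, not tuned values;
// "rising" checks would come from the trend model, not a single snapshot.
func DeriveHealth(s RawSMART) (state string, risk int, reasons []string) {
	if !s.TelemetryFresh {
		return "unknown", 0, []string{"telemetry_missing"}
	}
	if s.SmartFailed {
		reasons = append(reasons, "smart_failed")
		risk += 80
	}
	if s.PendingSectors > 0 {
		reasons = append(reasons, "pending_sectors_nonzero")
		risk += 40
	}
	if s.ReallocatedSectors > 0 {
		reasons = append(reasons, "reallocated_sectors_rising")
		risk += 25
	}
	if s.NVMeSparePct != nil && *s.NVMeSparePct < 20 {
		reasons = append(reasons, "nvme_spare_low")
		risk += 30
	}
	if s.TemperatureC != nil && *s.TemperatureC > 60 {
		reasons = append(reasons, "temperature_sustained_high")
		risk += 10
	}
	if risk > 100 {
		risk = 100
	}
	switch {
	case risk >= 70:
		state = "critical"
	case risk >= 40:
		state = "degraded"
	case risk > 0:
		state = "watch"
	default:
		state = "healthy"
	}
	return state, risk, reasons
}
```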
Trend model
Current values are not enough.
Pulse should preserve time series for:
- temperature
- reallocated sectors
- pending sectors
- media errors
- NVMe percentage used
- available spare
- unsafe shutdowns
Trend direction matters:
- stable
- improving
- slowly worsening
- sharply worsening
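A simple illustration of trend classification over a recent sample window; the thresholds and per-sample delta are placeholders, and a real implementation would fit against timestamps:

```go
package storage

// ClassifyTrend buckets a counter's recent movement into the states above.
// Samples are in time order; thresholds are placeholders for illustration.
func ClassifyTrend(samples []float64) string {
	if len(samples) < 2 {
		return "stable"
	}
	// Average per-sample change across the window. A real implementation
	// would fit against timestamps (e.g. least squares) instead.
	delta := (samples[len(samples)-1] - samples[0]) / float64(len(samples)-1)
	switch {
	case delta < -0.01:
		return "improving"
	case delta < 0.1:
		return "stable"
	case delta < 1.0:
		return "slowly_worsening"
	default:
		return "sharply_worsening"
	}
}
```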
Source Strategy
Proxmox
Use Proxmox for:
- storage pools
- physical disks when available
- Ceph
- host/node topology
Use agent linkage to enrich Proxmox disks with:
- better temperature coverage
- richer S.M.A.R.T. attributes
- better device identity
Unified host agent
The host agent must be a first-class storage source, not only an enrichment source.
For agent-backed hosts, Pulse should directly create:
- `physical_disk` resources from agent S.M.A.R.T. data
- logical storage resources when the agent can report them
- storage topology when the platform supports it
This matters for:
- Unraid
- generic Linux servers
- bare-metal NAS boxes
- non-Proxmox storage hosts
Unraid
Unraid deserves explicit treatment, not generic-Linux treatment forever.
Pulse should ultimately understand:
- array state
- parity devices
- cache pools
- disk disabled / missing / emulated state
- rebuild progress
- filesystem status
- share impact
Initial fallback can still be generic host-agent disk ingestion, but the end state should be Unraid-aware topology.
ZFS / TrueNAS
Pulse should normalize:
- pool health
- vdev health
- read/write/checksum errors
- scrub status and age
- resilver status and age
- per-disk membership
Generic Linux
Even without a rich platform API, Pulse should still provide value:
- agent physical disks
- mdraid state if available
- mount/device correlation
- filesystem usage
- telemetry coverage warnings
Alerts
Storage alerts should be layered.
Disk alerts
Examples:
- S.M.A.R.T. failed
- pending sectors non-zero
- reallocated sectors rising
- NVMe spare below threshold
- sustained high temperature
Redundancy alerts
Examples:
- pool degraded but still redundant
- array has lost redundancy
- parity invalid / parity missing
- OSD count below safe threshold
Capacity alerts
Examples:
- pool nearing full
- backup datastore nearing full
- cache pool under pressure
Telemetry coverage alerts
Examples:
- disk telemetry missing for previously known disk
- controller blocks S.M.A.R.T. visibility
- host stopped reporting disk inventory
This category is important because silent storage blind spots are dangerous.
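Detecting a blind spot can be as simple as comparing last-seen timestamps against a freshness budget. A sketch using the hypothetical `PhysicalDisk` type from earlier; the threshold would be configurable in practice:

```go
package storage

import "time"

// StaleDisks flags disks whose telemetry has gone silent for longer than
// maxAge. The freshness budget is an assumption and would be configurable.
func StaleDisks(disks []PhysicalDisk, now time.Time, maxAge time.Duration) []PhysicalDisk {
	var stale []PhysicalDisk
	for _, d := range disks {
		if now.Sub(d.LastSeen) > maxAge {
			stale = append(stale, d)
		}
	}
	return stale
}
```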
UX Proposal
The storage surface should be organized around three questions.
1. What is at risk?
Top-level storage page should prioritize:
- disks needing attention
- degraded pools/arrays
- rebuilds/resilvers in progress
- backup repositories at risk
2. Where is the risk?
Every disk or pool should show context:
- host
- platform
- array / pool / vdev / parity role
- impacted workloads / backups
3. What should I do?
Each finding should have a recommended action:
- replace now
- schedule maintenance
- monitor trend
- investigate controller / cable / cooling
- improve telemetry coverage
Recommended Page Structure
Fleet summary
- disks at risk
- degraded storage objects
- active rebuild/resilver operations
- storage capacity hotspots
Disk view
Grouped and filterable by:
- host
- pool / array
- risk state
- platform
- disk type
Columns:
- device / serial
- host
- role
- health
- risk
- temperature
- wear
- trend
- last seen
Topology view
For a selected disk:
- parent host
- array / pool / vdev membership
- redundancy state
- affected storage objects
- affected workloads / backups
Detail drawer
Include:
- normalized summary
- risk reasons
- trend charts
- raw S.M.A.R.T. attributes
- source provenance
- telemetry freshness
Data Model Requirements
The canonical unified resource model should support:
- `physical_disk` from every valid source
- disk identity merge across sources
- parent/child relationships between host, disk, pool, workload
- source provenance per disk field when signals disagree
- storage topology edges, not just flat metadata blobs
- freshness per source and per sub-signal
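Identity merge is the keystone requirement. A sketch of merge key selection, ordered the same way as the identity signals in the Physical Disk section; the controller-specific ID tier is omitted for brevity:

```go
package storage

// MergeKey picks the strongest available identity signal so that the same
// physical device reported by the agent, Proxmox, and TrueNAS collapses
// into one canonical resource. Controller-specific IDs are omitted here.
func MergeKey(d PhysicalDisk, host string) string {
	switch {
	case d.Serial != "":
		return "serial:" + d.Serial
	case d.WWN != "":
		return "wwn:" + d.WWN
	default:
		// Source-scoped fallback: unique only within one host.
		return "path:" + host + ":" + d.DevicePath
	}
}
```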
Rollout Plan
Phase 1: Canonical disk coverage
- ensure every agent-backed host can emit `physical_disk`
- unify disk identity across agent / Proxmox / TrueNAS sources
- show agent-only disks in storage
- attach disk metrics targets consistently
Phase 2: Disk health model
- add derived S.M.A.R.T. health / risk / confidence
- add reason codes
- add telemetry freshness semantics
- improve disk alerts
Phase 3: Topology
- model disk -> pool/array/vdev membership
- model redundancy state
- propagate impact to workloads / backups
Phase 4: Platform specialization
- Unraid-aware storage model
- deeper ZFS / TrueNAS topology
- mdraid normalization
- controller-specific enrichments where feasible
Phase 5: Operator UX
- risk-first storage landing page
- action-oriented recommendations
- maintenance-friendly detail workflows
Near-Term Priority
If I were sequencing this immediately, I would prioritize:
- agent-only physical disk coverage
- canonical disk identity merge by serial / WWN
- disk metrics and S.M.A.R.T. trend persistence for agent-backed disks
- derived disk risk model
- topology edges for arrays/pools/parity
That gives Pulse a strong storage foundation before investing in more UI complexity.
Definition of "Useful"
Pulse storage is useful when an operator can answer, in under a minute:
- what is unhealthy
- what is merely noisy
- what is losing redundancy
- what will impact workloads or backups
- what needs action now
If the user still has to mentally decode raw S.M.A.R.T. tables to get there, the storage model is not finished.