Mirror of https://github.com/rcourtman/Pulse.git, synced 2026-04-29 12:00:13 +00:00.
# Storage Architecture Proposal

This document defines the intended storage model for Pulse beyond the current "show storage resources and raw S.M.A.R.T. fields" behavior.

The goal is to make storage genuinely useful for operators, not merely visible.
## Problem

Today Pulse can surface storage-adjacent data from several sources:

- Proxmox storage pools
- Proxmox physical disks
- Ceph
- host-agent disk inventories
- host-agent S.M.A.R.T. data
- TrueNAS pools/datasets/disks

That is useful, but it is not yet a coherent storage product.

The current gaps are:

- disk data is source-shaped instead of operator-shaped
- S.M.A.R.T. attributes are visible, but risk is not modeled
- topology is weak: disk -> pool/array/host/workload impact is incomplete
- agent-only hosts need first-class storage treatment, not second-class fallback behavior
- storage alerting is mostly threshold-oriented rather than consequence-oriented
## Product Principle

Operators do not want "S.M.A.R.T. monitoring."

They want answers to:

- Which disks are at risk?
- Which pools/arrays are at risk because of those disks?
- Is redundancy still intact?
- Is this getting worse?
- What needs action now?

Pulse should therefore treat S.M.A.R.T. as one input signal inside a broader storage health model.
## Primary User Jobs

### Homelab / power users

- Identify failing disks before data loss
- See parity/cache/array issues clearly
- Map a bad disk to a specific device/serial/path
- Understand whether replacement is urgent or watch-only

### SMB / business operators

- See storage risk by host, cluster, site, and business impact
- Know whether backup targets and primary storage remain healthy
- Detect degraded redundancy, not just degraded disks
- Track long-term degradation trends and maintenance windows
## Canonical Storage Model

Pulse should model storage in four layers.

### 1. Physical Disk

This is the actual block device.

Canonical resource type:

- `physical_disk`

Identity signals, strongest first:

- serial
- WWN / EUI
- controller-specific stable disk ID
- source-scoped fallback `(host, device path)`

Core fields:

- serial, WWN, device path
- model, vendor, firmware
- transport / type (`sata`, `sas`, `nvme`, `usb`, etc.)
- size
- health / risk / confidence
- temperature
- wear indicators
- media / pending / reallocated / CRC / unsafe-shutdown style counters
- telemetry freshness
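As a concrete sketch, the fields above could map onto a canonical Go struct roughly like this (illustrative only; the type and field names are assumptions, not the actual Pulse model):

```go
package main

import (
	"fmt"
	"time"
)

// HealthState is the derived, operator-facing state of a disk.
type HealthState string

const (
	Healthy  HealthState = "healthy"
	Watch    HealthState = "watch"
	Degraded HealthState = "degraded"
	Critical HealthState = "critical"
	Unknown  HealthState = "unknown"
)

// PhysicalDisk sketches the canonical physical_disk resource.
// Identity fields are ordered strongest-first; (Host, DevicePath)
// is the source-scoped fallback when serial and WWN are unavailable.
type PhysicalDisk struct {
	Serial     string
	WWN        string
	Host       string
	DevicePath string

	Model     string
	Vendor    string
	Firmware  string
	Transport string // "sata", "sas", "nvme", "usb", ...
	SizeBytes uint64

	Health     HealthState
	RiskScore  int    // 0-100
	Confidence string // "low" / "medium" / "high"

	TemperatureC       int
	WearPercent        int
	PendingSectors     int64
	ReallocatedSectors int64
	MediaErrors        int64
	CRCErrors          int64
	UnsafeShutdowns    int64

	LastSeen time.Time // telemetry freshness
}

func main() {
	d := PhysicalDisk{
		Serial:    "SN-EXAMPLE-001", // hypothetical serial
		Transport: "nvme",
		Health:    Watch,
		RiskScore: 35,
	}
	fmt.Printf("%s %s risk=%d\n", d.Serial, d.Health, d.RiskScore)
}
```

The point of the single struct is that every source (agent, Proxmox, TrueNAS) normalizes into the same shape before any UX or alerting logic runs.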
### 2. Storage Membership

This is the topology layer.

A disk is often only meaningful in context:

- member of an mdraid array
- member of a ZFS vdev/pool
- Unraid parity/data/cache assignment
- Ceph OSD backing device
- PBS datastore backing disk set

Pulse should model storage membership as first-class relationships, not implicit text fields.

Examples:

- disk -> host
- disk -> array
- disk -> pool
- disk -> OSD
- pool -> workloads
- datastore -> backup jobs / recovery points
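One way to make these relationships first-class is an explicit edge type keyed by canonical resource IDs instead of free-text fields (a sketch; the kind names and ID scheme are illustrative):

```go
package main

import "fmt"

// EdgeKind names one storage topology relationship.
type EdgeKind string

const (
	DiskToHost     EdgeKind = "disk->host"
	DiskToArray    EdgeKind = "disk->array"
	DiskToPool     EdgeKind = "disk->pool"
	DiskToOSD      EdgeKind = "disk->osd"
	PoolToWorkload EdgeKind = "pool->workload"
	DatastoreToJob EdgeKind = "datastore->backup_job"
)

// Edge is a first-class membership relationship between two canonical
// resources, queryable in both directions.
type Edge struct {
	Kind   EdgeKind
	Parent string // e.g. "pool:tank"
	Child  string // e.g. "disk:SN-EXAMPLE-001"
}

// childrenOf returns the children linked to a parent by a given edge kind.
func childrenOf(edges []Edge, parent string, kind EdgeKind) []string {
	var out []string
	for _, e := range edges {
		if e.Parent == parent && e.Kind == kind {
			out = append(out, e.Child)
		}
	}
	return out
}

func main() {
	edges := []Edge{
		{DiskToPool, "pool:tank", "disk:SN-A"},
		{DiskToPool, "pool:tank", "disk:SN-B"},
		{PoolToWorkload, "pool:tank", "vm:101"},
	}
	fmt.Println(childrenOf(edges, "pool:tank", DiskToPool))
}
```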
### 3. Logical Storage Object

These are the operator-facing objects:

- pool
- datastore
- filesystem
- dataset
- share
- Ceph cluster / pool
- backup repository

Canonical resource types already mostly exist:

- `storage`
- `datastore`
- `ceph`

These resources should carry:

- capacity
- health
- redundancy state
- rebuild/resilver/scrub state
- impacted children
### 4. Consumer Impact

This is the "why should I care" layer.

Storage objects should be traceable to:

- VMs
- LXCs
- app containers / pods
- backup jobs
- recovery points

This allows Pulse to answer:

- a degraded mirror affects these VMs
- this backup datastore is filling and will affect these protection jobs
- this failed disk left this array with no redundancy
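Answering those questions is a short graph walk: from a failing component, follow the dependency edges outward and collect everything that relies on it. A minimal breadth-first sketch (the edge shape and resource IDs are illustrative):

```go
package main

import "fmt"

// edge is a component -> dependent relationship: the child relies on the parent.
type edge struct{ parent, child string }

// impacted walks outward from a starting resource and collects every
// resource reachable through dependency edges, i.e. everything that
// would be affected if the starting component fails.
func impacted(edges []edge, start string) []string {
	children := map[string][]string{}
	for _, e := range edges {
		children[e.parent] = append(children[e.parent], e.child)
	}
	seen := map[string]bool{start: true}
	var out []string
	queue := []string{start}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, c := range children[cur] {
			if !seen[c] {
				seen[c] = true
				out = append(out, c)
				queue = append(queue, c)
			}
		}
	}
	return out
}

func main() {
	edges := []edge{
		{"disk:SN-A", "pool:tank"},   // failing disk backs this pool
		{"pool:tank", "vm:101"},      // the pool hosts this VM
		{"pool:tank", "job:nightly"}, // and backs this backup job
	}
	fmt.Println(impacted(edges, "disk:SN-A"))
}
```

With edges stored once, the same walk answers "which VMs does this degraded mirror affect" and "which protection jobs depend on this filling datastore."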
## S.M.A.R.T. Model

### Raw telemetry

Pulse should ingest raw S.M.A.R.T. data when available, including vendor-specific subsets.

Raw attributes remain important in the detail view, but they should not be the primary UX.

### Derived model

Pulse should derive a normalized disk health model from raw telemetry:

- `health_state`
  - healthy
  - watch
  - degraded
  - critical
  - unknown
- `risk_score`
  - 0-100
- `confidence`
  - low / medium / high
- `reason_codes`
  - `pending_sectors_nonzero`
  - `reallocated_sectors_rising`
  - `nvme_spare_low`
  - `temperature_sustained_high`
  - `smart_failed`
  - `telemetry_missing`
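The derivation can start as simple, explainable rules that map raw counters to a state, a score, and reason codes. A sketch, with thresholds and weights as illustrative assumptions rather than tuned values:

```go
package main

import "fmt"

// smartSnapshot holds the raw counters the rules below consume.
type smartSnapshot struct {
	SmartFailed      bool
	PendingSectors   int64
	ReallocatedDelta int64 // rise since the previous sample
	NVMeSpare        int   // percent available spare; 0 if not reported
	TemperatureC     int
	HasTelemetry     bool
}

// deriveHealth maps raw telemetry to (health_state, risk_score, reason_codes).
// Each reason code explains exactly why the score moved.
func deriveHealth(s smartSnapshot) (state string, risk int, reasons []string) {
	if !s.HasTelemetry {
		return "unknown", 0, []string{"telemetry_missing"}
	}
	if s.SmartFailed {
		return "critical", 95, []string{"smart_failed"}
	}
	if s.PendingSectors > 0 {
		risk += 40
		reasons = append(reasons, "pending_sectors_nonzero")
	}
	if s.ReallocatedDelta > 0 {
		risk += 30
		reasons = append(reasons, "reallocated_sectors_rising")
	}
	if s.NVMeSpare > 0 && s.NVMeSpare < 10 {
		risk += 30
		reasons = append(reasons, "nvme_spare_low")
	}
	if s.TemperatureC >= 60 {
		risk += 15
		reasons = append(reasons, "temperature_sustained_high")
	}
	switch {
	case risk >= 70:
		state = "degraded"
	case risk > 0:
		state = "watch"
	default:
		state = "healthy"
	}
	return state, risk, reasons
}

func main() {
	st, risk, why := deriveHealth(smartSnapshot{HasTelemetry: true, PendingSectors: 8})
	fmt.Println(st, risk, why) // watch 40 [pending_sectors_nonzero]
}
```

Rule-based derivation keeps the model auditable: the UI can show the operator exactly which counter triggered which reason code.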
### Trend model

Current values are not enough.

Pulse should preserve time series for:

- temperature
- reallocated sectors
- pending sectors
- media errors
- NVMe percentage used
- available spare
- unsafe shutdowns

Trend direction matters:

- stable
- improving
- slowly worsening
- sharply worsening
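Classifying direction can be as simple as comparing a counter's most recent step against its average step over the retained window. A sketch (the acceleration factor of 2x is an illustrative assumption):

```go
package main

import "fmt"

// classifyTrend labels a counter time series (oldest first) by how its
// most recent step compares with the average step across the series.
func classifyTrend(series []int64) string {
	if len(series) < 2 {
		return "stable"
	}
	first, last := series[0], series[len(series)-1]
	recent := last - series[len(series)-2] // latest step
	total := last - first
	switch {
	case total < 0:
		return "improving" // e.g. temperature falling after a cooling fix
	case total == 0:
		return "stable"
	case recent*int64(len(series)-1) > 2*total:
		// The latest step is more than twice the average step: accelerating.
		return "sharply worsening"
	default:
		return "slowly worsening"
	}
}

func main() {
	fmt.Println(classifyTrend([]int64{5, 5, 5, 5}))  // stable
	fmt.Println(classifyTrend([]int64{5, 6, 7, 8}))  // slowly worsening
	fmt.Println(classifyTrend([]int64{5, 5, 6, 20})) // sharply worsening
}
```

A real implementation would likely use a regression slope over the window, but even this delta-based rule separates "replace eventually" from "replace now."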
## Source Strategy

### Proxmox

Use Proxmox for:

- storage pools
- physical disks when available
- Ceph
- host/node topology

Use agent linkage to enrich Proxmox disks with:

- better temperature coverage
- richer S.M.A.R.T. attributes
- better device identity
### Unified host agent

The host agent must be a first-class storage source, not only an enrichment source.

For agent-backed hosts, Pulse should directly create:

- `physical_disk` resources from agent S.M.A.R.T. data
- logical storage resources when the agent can report them
- storage topology when the platform supports it

This matters for:

- Unraid
- generic Linux servers
- bare-metal NAS boxes
- non-Proxmox storage hosts
### Unraid

Unraid deserves explicit treatment, not generic-Linux treatment forever.

Pulse should ultimately understand:

- array state
- parity devices
- cache pools
- disk disabled / missing / emulated state
- rebuild progress
- filesystem status
- share impact

Initial fallback can still be generic host-agent disk ingestion, but the end state should be Unraid-aware topology.
### ZFS / TrueNAS

Pulse should normalize:

- pool health
- vdev health
- read/write/checksum errors
- scrub status and age
- resilver status and age
- per-disk membership
### Generic Linux

Even without a rich platform API, Pulse should still provide value:

- agent physical disks
- mdraid state if available
- mount/device correlation
- filesystem usage
- telemetry coverage warnings
## Alerts

Storage alerts should be layered.

### Disk alerts

Examples:

- S.M.A.R.T. failed
- pending sectors non-zero
- reallocated sectors rising
- NVMe spare below threshold
- sustained high temperature

### Redundancy alerts

Examples:

- pool degraded but still redundant
- array has lost redundancy
- parity invalid / parity missing
- OSD count below safe threshold

### Capacity alerts

Examples:

- pool nearing full
- backup datastore nearing full
- cache pool under pressure

### Telemetry coverage alerts

Examples:

- disk telemetry missing for a previously known disk
- controller blocks S.M.A.R.T. visibility
- host stopped reporting disk inventory

This category is important because silent storage blind spots are dangerous.
## UX Proposal

The storage surface should be organized around three questions.

### 1. What is at risk?

The top-level storage page should prioritize:

- disks needing attention
- degraded pools/arrays
- rebuilds/resilvers in progress
- backup repositories at risk

### 2. Where is the risk?

Every disk or pool should show context:

- host
- platform
- array / pool / vdev / parity role
- impacted workloads / backups

### 3. What should I do?

Each finding should have a recommended action:

- replace now
- schedule maintenance
- monitor trend
- investigate controller / cable / cooling
- improve telemetry coverage
## Recommended Page Structure

### Fleet summary

- disks at risk
- degraded storage objects
- active rebuild/resilver operations
- storage capacity hotspots

### Disk view

Grouped and filterable by:

- host
- pool / array
- risk state
- platform
- disk type

Columns:

- device / serial
- host
- role
- health
- risk
- temperature
- wear
- trend
- last seen

### Topology view

For a selected disk:

- parent host
- array / pool / vdev membership
- redundancy state
- affected storage objects
- affected workloads / backups

### Detail drawer

Include:

- normalized summary
- risk reasons
- trend charts
- raw S.M.A.R.T. attributes
- source provenance
- telemetry freshness
## Data Model Requirements

The canonical unified resource model should support:

- `physical_disk` from every valid source
- disk identity merge across sources
- parent/child relationships between host, disk, pool, and workload
- source provenance per disk field when signals disagree
- storage topology edges, not just flat metadata blobs
- freshness per source and per sub-signal
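Identity merge can follow the strongest-signal-first ordering from the canonical model: serial, then WWN, then the source-scoped (host, device path) fallback. A sketch of the merge key (the function and field names are assumptions):

```go
package main

import "fmt"

// diskObservation is one source's view of a disk (agent, Proxmox, TrueNAS).
type diskObservation struct {
	Serial     string
	WWN        string
	Host       string
	DevicePath string
}

// identityKey picks the strongest available identity signal so that the
// same physical disk reported by multiple sources merges into one resource.
func identityKey(d diskObservation) string {
	switch {
	case d.Serial != "":
		return "serial:" + d.Serial
	case d.WWN != "":
		return "wwn:" + d.WWN
	default:
		// Weakest, source-scoped fallback: only stable within one host.
		return "path:" + d.Host + ":" + d.DevicePath
	}
}

func main() {
	fromAgent := diskObservation{Serial: "SN-A", Host: "pve1", DevicePath: "/dev/sda"}
	fromProxmox := diskObservation{Serial: "SN-A", Host: "pve1", DevicePath: "/dev/disk/by-id/ata-SN-A"}
	// Same serial -> same canonical disk, despite different device paths.
	fmt.Println(identityKey(fromAgent) == identityKey(fromProxmox))
}
```

Per-field provenance then layers on top: once two observations share a key, each field of the merged resource records which source supplied it.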
## Rollout Plan

### Phase 1: Canonical disk coverage

- ensure every agent-backed host can emit `physical_disk`
- unify disk identity across agent / Proxmox / TrueNAS sources
- show agent-only disks in storage
- attach disk metrics targets consistently

### Phase 2: Disk health model

- add derived S.M.A.R.T. health / risk / confidence
- add reason codes
- add telemetry freshness semantics
- improve disk alerts

### Phase 3: Topology

- model disk -> pool/array/vdev membership
- model redundancy state
- propagate impact to workloads / backups

### Phase 4: Platform specialization

- Unraid-aware storage model
- deeper ZFS / TrueNAS topology
- mdraid normalization
- controller-specific enrichments where feasible

### Phase 5: Operator UX

- risk-first storage landing page
- action-oriented recommendations
- maintenance-friendly detail workflows
## Near-Term Priority

If I were sequencing this immediately, I would prioritize:

1. agent-only physical disk coverage
2. canonical disk identity merge by serial / WWN
3. disk metrics and S.M.A.R.T. trend persistence for agent-backed disks
4. derived disk risk model
5. topology edges for arrays/pools/parity

That gives Pulse a strong storage foundation before investing in more UI complexity.
## Definition of "Useful"

Pulse storage is useful when an operator can answer, in under a minute:

- what is unhealthy
- what is merely noisy
- what is losing redundancy
- what will impact workloads or backups
- what needs action now

If the user still has to mentally decode raw S.M.A.R.T. tables to get there, the storage model is not finished.