# Storage Architecture Proposal

This document defines the intended storage model for Pulse beyond the current "show storage resources and raw S.M.A.R.T. fields" behavior.

The goal is to make storage genuinely useful for operators, not merely visible.

## Problem

Today Pulse can surface storage-adjacent data from several sources:

- Proxmox storage pools
- Proxmox physical disks
- Ceph
- host-agent disk inventories
- host-agent S.M.A.R.T. data
- TrueNAS pools/datasets/disks

That is useful, but it is not yet a coherent storage product.

The current gaps are:

- disk data is source-shaped instead of operator-shaped
- S.M.A.R.T. attributes are visible, but risk is not modeled
- topology is weak: disk -> pool/array/host/workload impact is incomplete
- agent-only hosts need first-class storage treatment, not second-class fallback behavior
- storage alerting is mostly threshold-oriented rather than consequence-oriented

## Product Principle

Operators do not want "S.M.A.R.T. monitoring."

They want answers to:

- Which disks are at risk?
- Which pools/arrays are at risk because of those disks?
- Is redundancy still intact?
- Is this getting worse?
- What needs action now?

Pulse should therefore treat S.M.A.R.T. as one input signal inside a broader storage health model.

## Primary User Jobs

### Homelab / power users

- Identify failing disks before data loss
- See parity/cache/array issues clearly
- Map a bad disk to a specific device/serial/path
- Understand whether replacement is urgent or watch-only

### SMB / business operators

- See storage risk by host, cluster, site, and business impact
- Know whether backup targets and primary storage remain healthy
- Detect degraded redundancy, not just degraded disks
- Track long-term degradation trends and maintenance windows

## Canonical Storage Model

Pulse should model storage in four layers.

### 1. Physical Disk

This is the actual block device.

Canonical resource type:

- `physical_disk`

Identity signals, strongest first:

- serial
- WWN / EUI
- controller-specific stable disk ID
- source-scoped fallback `(host, device path)`

Core fields:

- serial, WWN, device path
- model, vendor, firmware
- transport / type (`sata`, `sas`, `nvme`, `usb`, etc.)
- size
- health / risk / confidence
- temperature
- wear indicators
- media / pending / reallocated / CRC / unsafe-shutdown style counters
- telemetry freshness

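To make the shape concrete, here is a minimal Go sketch of what a canonical `physical_disk` record could carry. The type and field names are illustrative assumptions, not the existing Pulse schema.

```go
// Sketch only: illustrative names, not the current Pulse data model.
package storage

import "time"

type DiskTransport string

const (
	TransportSATA DiskTransport = "sata"
	TransportSAS  DiskTransport = "sas"
	TransportNVMe DiskTransport = "nvme"
	TransportUSB  DiskTransport = "usb"
)

type PhysicalDisk struct {
	// Identity signals, strongest first.
	Serial       string
	WWN          string
	ControllerID string // controller-specific stable disk ID, if any
	Host         string
	DevicePath   string

	// Inventory.
	Model     string
	Vendor    string
	Firmware  string
	Transport DiskTransport
	SizeBytes uint64

	// Derived health (see the S.M.A.R.T. model below).
	HealthState string // healthy | watch | degraded | critical | unknown
	RiskScore   int    // 0-100
	Confidence  string // low | medium | high
	ReasonCodes []string

	// Counters worth keeping close to the resource; pointers keep
	// "not reported" distinct from zero.
	TemperatureC       *int
	WearPercentUsed    *int
	ReallocatedSectors *int64
	PendingSectors     *int64
	MediaErrors        *int64
	CRCErrors          *int64
	UnsafeShutdowns    *int64

	// Telemetry freshness.
	LastSeen time.Time
}
```

Pointer fields matter because different sources cover different counters, and an absent value should not look like a healthy zero.
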
### 2. Storage Membership

This is the topology layer.

A disk is often only meaningful in context:

- member of mdraid array
- member of ZFS vdev/pool
- Unraid parity/data/cache assignment
- Ceph OSD backing device
- PBS datastore backing disk set

Pulse should model storage membership as first-class relationships, not implicit text fields.

Examples:

- disk -> host
- disk -> array
- disk -> pool
- disk -> OSD
- pool -> workloads
- datastore -> backup jobs / recovery points

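One possible way to make membership first-class is an explicit edge type rather than free-text fields; the edge kinds, roles, and ID scheme below are assumptions for illustration.

```go
// Sketch only: edge kinds and ID formats are illustrative assumptions.
package storage

type EdgeKind string

const (
	EdgeDiskInHost     EdgeKind = "disk_in_host"
	EdgeDiskInArray    EdgeKind = "disk_in_array"
	EdgeDiskInPool     EdgeKind = "disk_in_pool"
	EdgeDiskBacksOSD   EdgeKind = "disk_backs_osd"
	EdgePoolServes     EdgeKind = "pool_serves_workload"
	EdgeDatastoreBacks EdgeKind = "datastore_backs_job"
)

// Edge records one membership or dependency relationship, e.g. a disk
// that is the parity member of an Unraid array, or a pool that backs a VM.
type Edge struct {
	Kind EdgeKind
	From string // canonical resource ID, e.g. "disk/WD-ABC123"
	To   string // canonical resource ID, e.g. "pool/tank"
	Role string // "data", "parity", "cache", "spare", "" when not applicable
}
```
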
### 3. Logical Storage Object

These are the operator-facing objects:

- pool
- datastore
- filesystem
- dataset
- share
- Ceph cluster / pool
- backup repository

Canonical resource types already mostly exist:

- `storage`
- `datastore`
- `ceph`

These resources should carry:

- capacity
- health
- redundancy state
- rebuild/resilver/scrub state
- impacted children

### 4. Consumer Impact

This is the "why should I care" layer.

Storage objects should be traceable to:

- VMs
- LXCs
- app containers / pods
- backup jobs
- recovery points

This allows Pulse to answer:

- a degraded mirror affects these VMs
- this backup datastore is filling and will affect these protection jobs
- this failed disk left this array with no redundancy

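One inexpensive way to produce those answers is a reachability walk over the membership edges, from the failing disk out to whatever depends on it. The sketch below assumes a plain adjacency map keyed by canonical resource ID; it is an illustration, not an existing Pulse API.

```go
// Sketch: list workloads, datastores, and backup jobs impacted by a
// risky disk by walking dependency edges outward from it.
package storage

// deps maps a resource ID to the resources that depend on it, e.g.
// "disk/WD-ABC123" -> ["pool/tank"], "pool/tank" -> ["vm/101", "job/nightly"].
func ImpactedBy(start string, deps map[string][]string) []string {
	seen := map[string]bool{start: true}
	queue := []string{start}
	var impacted []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range deps[cur] {
			if seen[next] {
				continue
			}
			seen[next] = true
			impacted = append(impacted, next)
			queue = append(queue, next)
		}
	}
	return impacted
}
```
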
## S.M.A.R.T. Model

### Raw telemetry

Pulse should ingest raw S.M.A.R.T. data when available, including vendor-specific subsets.

Raw attributes remain important in the detail view, but they should not be the primary UX.

### Derived model

Pulse should derive a normalized disk health model from raw telemetry:

- `health_state`
  - healthy
  - watch
  - degraded
  - critical
  - unknown
- `risk_score`
  - 0-100
- `confidence`
  - low / medium / high
- `reason_codes`
  - `pending_sectors_nonzero`
  - `reallocated_sectors_rising`
  - `nvme_spare_low`
  - `temperature_sustained_high`
  - `smart_failed`
  - `telemetry_missing`

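A minimal sketch of how the derived model could fall out of the raw counters; the weights, thresholds, and struct names are placeholder assumptions, not a finished scoring policy.

```go
// Sketch only: thresholds and weights here are illustrative assumptions.
package storage

type DerivedHealth struct {
	State       string // healthy | watch | degraded | critical | unknown
	RiskScore   int    // 0-100
	Confidence  string // low | medium | high
	ReasonCodes []string
}

type RawCounters struct {
	Present            bool  // false when no telemetry was received at all
	SMARTFailed        bool  // drive self-reports failure
	PendingSectors     int64
	ReallocatedDelta   int64 // change over the trend window
	NVMeSpareRemaining int   // percent; -1 when not an NVMe device
}

func Derive(c RawCounters) DerivedHealth {
	if !c.Present {
		return DerivedHealth{State: "unknown", Confidence: "low",
			ReasonCodes: []string{"telemetry_missing"}}
	}
	d := DerivedHealth{State: "healthy", Confidence: "medium"}
	add := func(code string, weight int) {
		d.ReasonCodes = append(d.ReasonCodes, code)
		d.RiskScore += weight
	}
	if c.SMARTFailed {
		add("smart_failed", 80)
	}
	if c.PendingSectors > 0 {
		add("pending_sectors_nonzero", 30)
	}
	if c.ReallocatedDelta > 0 {
		add("reallocated_sectors_rising", 25)
	}
	if c.NVMeSpareRemaining >= 0 && c.NVMeSpareRemaining < 10 {
		add("nvme_spare_low", 40)
	}
	if d.RiskScore > 100 {
		d.RiskScore = 100
	}
	switch {
	case d.RiskScore >= 70:
		d.State = "critical"
	case d.RiskScore >= 40:
		d.State = "degraded"
	case d.RiskScore > 0:
		d.State = "watch"
	}
	return d
}
```
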
### Trend model

Current values are not enough.

Pulse should preserve time series for:

- temperature
- reallocated sectors
- pending sectors
- media errors
- NVMe percentage used
- available spare
- unsafe shutdowns

Trend direction matters:

- stable
- improving
- slowly worsening
- sharply worsening

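Trend direction can be derived from the stored series with something as simple as a least-squares slope per day; the thresholds and the `sharpPerDay` parameter below are assumptions.

```go
// Sketch: classify one counter's trend from its retained samples.
package storage

type Sample struct {
	UnixSeconds int64
	Value       float64
}

// ClassifyTrend fits a least-squares line (value per day) and buckets
// the slope into the four trend directions used above.
func ClassifyTrend(samples []Sample, sharpPerDay float64) string {
	if len(samples) < 2 {
		return "stable"
	}
	var n, sumX, sumY, sumXY, sumXX float64
	for _, s := range samples {
		x := float64(s.UnixSeconds) / 86400 // convert to days
		n++
		sumX += x
		sumY += s.Value
		sumXY += x * s.Value
		sumXX += x * x
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return "stable"
	}
	slope := (n*sumXY - sumX*sumY) / denom // units per day
	switch {
	case slope <= -0.01:
		return "improving"
	case slope < 0.01:
		return "stable"
	case slope < sharpPerDay:
		return "slowly worsening"
	default:
		return "sharply worsening"
	}
}
```
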
## Source Strategy

### Proxmox

Use Proxmox for:

- storage pools
- physical disks when available
- Ceph
- host/node topology

Use agent linkage to enrich Proxmox disks with:

- better temperature coverage
- richer S.M.A.R.T. attributes
- better device identity

### Unified host agent

The host agent must be a first-class storage source, not only an enrichment source.

For agent-backed hosts, Pulse should directly create:

- `physical_disk` resources from agent S.M.A.R.T.
- logical storage resources when the agent can report them
- storage topology when the platform supports it

This matters for:

- Unraid
- generic Linux servers
- bare-metal NAS boxes
- non-Proxmox storage hosts

### Unraid

Unraid deserves explicit treatment rather than being handled as generic Linux indefinitely.

Pulse should ultimately understand:

- array state
- parity devices
- cache pools
- disk disabled / missing / emulated state
- rebuild progress
- filesystem status
- share impact

Initial fallback can still be generic host-agent disk ingestion, but the end state should be Unraid-aware topology.

### ZFS / TrueNAS

Pulse should normalize:

- pool health
- vdev health
- read/write/checksum errors
- scrub status and age
- resilver status and age
- per-disk membership

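A sketch of the normalized shape this could take for one pool; field names are assumptions rather than the `zpool status` or TrueNAS API output.

```go
// Sketch only: a normalized ZFS pool snapshot as Pulse might carry it.
package storage

import "time"

type VdevState struct {
	Name           string   // e.g. "raidz2-0" or a member device
	State          string   // ONLINE | DEGRADED | FAULTED | ...
	ReadErrors     int64
	WriteErrors    int64
	ChecksumErrors int64
	MemberSerials  []string // links vdev members back to physical_disk identities
}

type ZFSPool struct {
	Name            string
	Health          string // ONLINE | DEGRADED | FAULTED | ...
	Vdevs           []VdevState
	LastScrub       time.Time
	ScrubRunning    bool
	ResilverRunning bool
	ResilverPercent float64
}
```
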
### Generic Linux

Even without a rich platform API, Pulse should still provide value:

- agent physical disks
- mdraid state if available
- mount/device correlation
- filesystem usage
- telemetry coverage warnings

## Alerts

Storage alerts should be layered.

### Disk alerts

Examples:

- S.M.A.R.T. failed
- pending sectors non-zero
- reallocated sectors rising
- NVMe spare below threshold
- sustained high temperature

### Redundancy alerts

Examples:

- pool degraded but still redundant
- array has lost redundancy
- parity invalid / parity missing
- OSD count below safe threshold

### Capacity alerts

Examples:

- pool nearing full
- backup datastore nearing full
- cache pool under pressure

### Telemetry coverage alerts

Examples:

- disk telemetry missing for previously known disk
- controller blocks S.M.A.R.T. visibility
- host stopped reporting disk inventory

This category is important because silent storage blind spots are dangerous.

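A sketch of the simplest useful coverage check: flag disks that Pulse has seen before but that have stopped reporting, instead of silently dropping them. The grace period and type names are assumptions.

```go
// Sketch: detect previously known disks whose telemetry has gone stale.
package storage

import "time"

type KnownDisk struct {
	Serial   string
	Host     string
	LastSeen time.Time
}

// MissingTelemetry returns disks whose last report is older than the
// grace window; these should raise a coverage alert, not disappear.
func MissingTelemetry(known []KnownDisk, now time.Time, grace time.Duration) []KnownDisk {
	var missing []KnownDisk
	for _, d := range known {
		if now.Sub(d.LastSeen) > grace {
			missing = append(missing, d)
		}
	}
	return missing
}
```
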
## UX Proposal

The storage surface should be organized around three questions.

### 1. What is at risk?

The top-level storage page should prioritize:

- disks needing attention
- degraded pools/arrays
- rebuilds/resilvers in progress
- backup repositories at risk

### 2. Where is the risk?

Every disk or pool should show context:

- host
- platform
- array / pool / vdev / parity role
- impacted workloads / backups

### 3. What should I do?

Each finding should have a recommended action:

- replace now
- schedule maintenance
- monitor trend
- investigate controller / cable / cooling
- improve telemetry coverage

## Recommended Page Structure

### Fleet summary

- disks at risk
- degraded storage objects
- active rebuild/resilver operations
- storage capacity hotspots

### Disk view

Grouped and filterable by:

- host
- pool / array
- risk state
- platform
- disk type

Columns:

- device / serial
- host
- role
- health
- risk
- temperature
- wear
- trend
- last seen

### Topology view

For a selected disk:

- parent host
- array / pool / vdev membership
- redundancy state
- affected storage objects
- affected workloads / backups

### Detail drawer

Include:

- normalized summary
- risk reasons
- trend charts
- raw S.M.A.R.T. attributes
- source provenance
- telemetry freshness

## Data Model Requirements

The canonical unified resource model should support:

- `physical_disk` from every valid source
- disk identity merge across sources
- parent/child relationships between host, disk, pool, workload
- source provenance per disk field when signals disagree
- storage topology edges, not just flat metadata blobs
- freshness per source and per sub-signal

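The identity merge can reduce to a deterministic key built from the strongest available signal, in the order given earlier (serial, then WWN, then controller ID, then host plus device path). The sketch below is an assumption about how that could look, not existing code.

```go
// Sketch: pick one canonical identity key for a disk observation so the
// same physical disk reported by several sources collapses to one resource.
package storage

import "fmt"

type DiskObservation struct {
	Serial       string
	WWN          string
	ControllerID string
	Host         string
	DevicePath   string
}

func MergeKey(o DiskObservation) string {
	switch {
	case o.Serial != "":
		return "serial:" + o.Serial
	case o.WWN != "":
		return "wwn:" + o.WWN
	case o.ControllerID != "":
		return "ctrl:" + o.ControllerID
	default:
		// Source-scoped fallback: only stable within one host.
		return fmt.Sprintf("path:%s:%s", o.Host, o.DevicePath)
	}
}
```
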
## Rollout Plan

### Phase 1: Canonical disk coverage

- ensure every agent-backed host can emit `physical_disk`
- unify disk identity across agent / Proxmox / TrueNAS sources
- show agent-only disks in storage
- attach disk metrics targets consistently

### Phase 2: Disk health model

- add derived S.M.A.R.T. health / risk / confidence
- add reason codes
- add telemetry freshness semantics
- improve disk alerts

### Phase 3: Topology

- model disk -> pool/array/vdev membership
- model redundancy state
- propagate impact to workloads / backups

### Phase 4: Platform specialization

- Unraid-aware storage model
- deeper ZFS / TrueNAS topology
- mdraid normalization
- controller-specific enrichments where feasible

### Phase 5: Operator UX

- risk-first storage landing page
- action-oriented recommendations
- maintenance-friendly detail workflows

## Near-Term Priority

If I were sequencing this immediately, I would prioritize:

1. agent-only physical disk coverage
2. canonical disk identity merge by serial / WWN
3. disk metrics and S.M.A.R.T. trend persistence for agent-backed disks
4. derived disk risk model
5. topology edges for arrays/pools/parity

That gives Pulse a strong storage foundation before investing in more UI complexity.

## Definition of "Useful"

Pulse storage is useful when an operator can answer, in under a minute:

- what is unhealthy
- what is merely noisy
- what is losing redundancy
- what will impact workloads or backups
- what needs action now

If the user still has to mentally decode raw S.M.A.R.T. tables to get there, the storage model is not finished.