diff --git a/docs/adr/temporal-tensor-store/ADR-018-block-based-storage-engine.md b/docs/adr/temporal-tensor-store/ADR-018-block-based-storage-engine.md new file mode 100644 index 000000000..411718663 --- /dev/null +++ b/docs/adr/temporal-tensor-store/ADR-018-block-based-storage-engine.md @@ -0,0 +1,1647 @@ +# ADR-018: Block-Based Storage Engine Architecture for the Temporal Tensor Store + +**Status**: Proposed +**Date**: 2026-02-08 +**Parent**: ADR-017 Temporal Tensor Compression, ADR-001 RuVector Core Architecture, ADR-004 KV Cache Management +**Author**: System Architecture Team +**SDK**: Claude-Flow + +## Version History + +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-02-08 | Architecture Team | Initial proposal | + +--- + +## Abstract + +This ADR defines the **block-based storage engine** that underpins the Temporal Tensor +Store (TTS). Where ADR-017 introduced the temporal tensor compression pipeline +(quantization, segment encoding, tier policy), this document specifies how +compressed tensor data is **organized on disk and in memory**, how blocks are +**identified, indexed, and persisted**, and how the engine **maintains integrity +through checksums and an append-only metadata log**. + +The engine departs from ADR-017's segment-centric model -- which treats each +segment as an opaque byte blob keyed by time range -- and instead introduces a +**fixed-size block abstraction** that provides: + +1. Stable, predictable I/O granularity (16 KB or 32 KB). +2. Per-block metadata with access-pattern tracking for tier migration. +3. An in-memory index rebuilt from an append-only MetaLog on startup. +4. Deterministic ordering by `(tensor_id, block_index)` for scan-friendly layout. +5. CRC32 checksums on quantized payloads for bit-flip detection. +6. A trait-based I/O boundary that supports both `mmap` on servers and + in-memory buffers for WASM targets. 
+ +The design targets KV cache tensors, embedding streams, and attention +intermediates in agent workloads. It integrates with AgentDB for metadata +persistence and draws on the RIPPLE++ (2026) model for streaming incremental +inference and OMEGA for low-latency GNN serving. + +--- + +## 1. Context and Motivation + +### 1.1 Segment-Based vs. Block-Based Storage + +ADR-017 established a segment-based compression pipeline. Each segment is a +self-contained byte blob containing a header, shared scales, and packed +quantized codes for one or more frames. Segments are stored in AgentDB keyed by +`{tensor_id}:{start_ts}:{end_ts}`. + +This approach has several limitations when scaling to production workloads: + +| Limitation | Impact | +|------------|--------| +| Variable segment sizes | Unpredictable I/O patterns; fragmentation on disk | +| No sub-segment random access beyond `decode_single_frame` | Cannot efficiently read a slice of a large segment | +| No per-block access tracking | Tier migration decisions must be made at the tensor level, not block level | +| No integrity verification | A single bit flip corrupts the entire segment silently | +| Tight coupling to AgentDB blob storage | Cannot use `mmap` or tiered file layout | + +### 1.2 Why Fixed-Size Blocks + +Fixed-size blocks are a proven primitive in storage systems (ext4, RocksDB SST +blocks, TiKV, Apache Arrow IPC). They provide: + +- **Predictable I/O**: Every read and write is aligned to the same granularity. +- **Simple caching**: Block-sized buffers slot into page caches and slab allocators. +- **Locality**: Blocks within the same tensor are contiguous, enabling prefetch. +- **Independent checksums**: A corrupted block does not invalidate its neighbors. +- **Tier-granular migration**: Individual blocks can move between tiers independently. 
+ +### 1.3 Alignment to KV Cache Access Patterns + +For attention KV cache (the primary workload per ADR-004), access patterns are +highly structured: + +``` +Attention head h, layer l, token position range [p0, p1]: + Read key block: tensor_id = hash(layer=l, head=h, type=key), block_index = p0 / block_elements + Read value block: tensor_id = hash(layer=l, head=h, type=value), block_index = p0 / block_elements +``` + +Aligning block boundaries to head-dimension multiples ensures that a single +attention head's data for a contiguous token range lives in a single block, +minimizing cross-block reads during prefill and decode. + +### 1.4 RIPPLE++ and OMEGA Context + +RIPPLE++ (2026) proposes streaming incremental inference where KV cache +entries are produced and consumed in a pipelined fashion. The block-based +engine supports this by allowing append-only writes to the tail block while +older blocks are concurrently read for attention computation. + +OMEGA (2026) targets low-latency GNN serving with tiered tensor storage. +Its block-aligned eviction strategy directly inspired the tier-bucket design +in this ADR. + +--- + +## 2. Decision + +### 2.1 Introduce a Block-Based Storage Engine as a New Crate Layer + +We introduce the `temporal_tensor_store` crate that sits above +`ruvector-temporal-tensor` (ADR-017) and provides: + +1. **Block identity**: Stable 128-bit tensor IDs with per-tensor block indexing. +2. **BlockMeta**: Rich per-block metadata including access tracking, tier, quantization + parameters, checksums, and reconstruction policy. +3. **Tiered data files**: Separate files per tier for scan-friendly eviction. +4. **Append-only MetaLog**: Crash-recoverable metadata persistence. +5. **In-memory index**: HashMap + tier buckets + min-heap for fast lookup and eviction. +6. **Trait-based I/O**: `BlockIO`, `MetaLog`, and `Clock` traits abstract the storage + backend for server (`mmap`) and WASM (in-memory buffer) targets. 
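The block addressing described in Section 1.3 can be sketched as follows. This is a minimal illustration, not part of the proposed API: the helper name `blocks_for_token_range` and the caller-supplied `tokens_per_block` parameter are assumptions made for the example. It shows which block indices a half-open token range `[p0, p1)` touches.

```rust
use std::ops::RangeInclusive;

/// Hypothetical helper (not defined by this ADR): block indices that cover
/// the half-open token-position range [p0, p1).
fn blocks_for_token_range(p0: u32, p1: u32, tokens_per_block: u32) -> RangeInclusive<u32> {
    assert!(tokens_per_block > 0, "tokens_per_block must be non-zero");
    assert!(p0 < p1, "token range must be non-empty");
    // First block holds p0; last block holds the final position p1 - 1.
    let first = p0 / tokens_per_block;
    let last = (p1 - 1) / tokens_per_block;
    first..=last
}
```

With 128 token positions per block, a read of positions 100..140 touches blocks 0 and 1, so a range that spans a block boundary still needs more than one block read.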
+ +### 2.2 Relationship to ADR-017 + +ADR-017's compression pipeline remains the **codec layer**. This ADR adds the +**storage layer** on top: + +``` ++===================================================================+ +| TEMPORAL TENSOR STORE (ADR-018) | +| | +| Block identity | BlockMeta | MetaLog | Tiered files | +| In-memory index | Eviction | Checksums | ++===================================================================+ + | | | + v v v ++===================================================================+ +| TEMPORAL TENSOR COMPRESSION (ADR-017) | +| | +| Groupwise quantization | Bitstream packing | Segment format | +| Tier policy scoring | Drift detection | f16 scales | ++===================================================================+ + | + v ++===================================================================+ +| RUVECTOR CORE (ADR-001) | +| | +| Distance functions | HNSW index | Scalar/Product quantization | ++===================================================================+ +``` + +The segment format from ADR-017 is used **within** each block as the payload +encoding. A block's `q` payload is a TQTC segment (or a raw byte region for +Tier0 uncompressed data). + +--- + +## 3. Detailed Design + +### 3.1 Tensor Identity + +Every tensor managed by the store has a stable 128-bit identifier. + +**Option A -- UUID v4**: Random, globally unique, no collision risk. Requires +an external registry to map logical names to UUIDs. + +**Option B -- Deterministic hash of lineage + logical name**: Computed as +`blake3(tenant_id || collection || logical_name || lineage_parent)` truncated +to 128 bits. Reproducible, collision-resistant (128-bit birthday bound is +~2^64 tensors), and allows the same tensor to be identified across restarts +without a registry. + +**Decision**: Option B (deterministic hash). The reproducibility property is +essential for crash recovery -- the MetaLog can be validated against recomputed +IDs. 
For tensors with no lineage parent, the parent field is zeroed. + +### 3.2 Block Key + +A block is uniquely identified by the pair `(tensor_id, block_index)`: + +```rust +/// Unique identifier for a single block within the store. +#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)] +pub struct BlockKey { + /// 128-bit tensor identity (deterministic hash of lineage + name). + pub tensor_id: u128, + /// Zero-based index of this block within the tensor's block sequence. + pub block_index: u32, +} + +impl BlockKey { + /// Deterministic total ordering: tensor_id first, then block_index. + /// Used for scan-friendly layout and MetaLog replay ordering. + pub fn sort_key(&self) -> (u128, u32) { + (self.tensor_id, self.block_index) + } +} + +impl Ord for BlockKey { + fn cmp(&self, other: &Self) -> std::cmp::Ordering { + self.sort_key().cmp(&other.sort_key()) + } +} + +impl PartialOrd for BlockKey { + fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { + Some(self.cmp(other)) + } +} +``` + +**Stable ordering guarantee**: All scans, MetaLog entries, and data file +layouts use `(tensor_id, block_index)` lexicographic order. This ensures +deterministic replay and enables range scans over a tensor's blocks. + +### 3.3 Chunking Strategy + +Tensors are divided into fixed-size blocks before storage. 
+ +| Parameter | Default | Rationale | +|-----------|---------|-----------| +| `BLOCK_RAW_BYTES` | 16384 (16 KB) | Multiple of the common 4 KB OS page size; good L2 cache fit | +| `BLOCK_RAW_BYTES` (KV cache) | 32768 (32 KB) | Aligned to head_dim * num_tokens_per_block * sizeof(f16) | + +For KV cache tensors, the block boundary is aligned to head-dimension +multiples: + +``` +block_elements = BLOCK_RAW_BYTES / bytes_per_element +// Round down to nearest multiple of head_dim: +block_elements = (block_elements / head_dim) * head_dim +``` + +For a typical head_dim=128 with f16 values: +``` +block_elements = 32768 / 2 = 16384 elements +16384 / 128 = 128 token positions per block (exact alignment) +``` + +This ensures that every block boundary falls on a token-position boundary, +so no token position's data is split across two blocks. + +### 3.4 Tier Enumeration + +```rust +/// Storage tier indicating compression level and access latency. +#[derive(Clone, Copy, PartialEq, Eq, Debug, Hash)] +#[repr(u8)] +pub enum Tier { + /// Tier 0: Uncompressed f32/f16. Resident in memory or fastest storage. + Tier0 = 0, + /// Tier 1: 8-bit quantized (hot). ~4x compression. + Tier1 = 1, + /// Tier 2: 5-bit or 7-bit quantized (warm). ~4.5x-6.4x compression. + Tier2 = 2, + /// Tier 3: 3-bit quantized (cold). ~10.7x compression. + Tier3 = 3, +} + +impl Tier { + /// Convert from raw u8. Returns None for invalid values. + pub fn from_u8(v: u8) -> Option<Self> { + match v { + 0 => Some(Tier::Tier0), + 1 => Some(Tier::Tier1), + 2 => Some(Tier::Tier2), + 3 => Some(Tier::Tier3), + _ => None, + } + } +} +``` + +**Tier0** is new relative to ADR-017. It holds uncompressed tensor data for +blocks that are actively being written or that require bit-exact access (e.g., +during gradient accumulation). Tier0 blocks are never persisted to tier data +files -- they exist only in the in-memory buffer or page cache. 
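The chunking arithmetic from Section 3.3 can be sketched as below. The function names `aligned_block_elements` and `tokens_per_block` are hypothetical and chosen only for this illustration; the computation itself follows the pseudocode in that section.

```rust
/// Elements per block, rounded down to a multiple of `head_dim`
/// (hypothetical helper mirroring the Section 3.3 pseudocode).
fn aligned_block_elements(block_raw_bytes: usize, bytes_per_element: usize, head_dim: usize) -> usize {
    assert!(bytes_per_element > 0 && head_dim > 0);
    let raw_elements = block_raw_bytes / bytes_per_element;
    // Round down so a block always holds a whole number of token positions.
    (raw_elements / head_dim) * head_dim
}

/// Whole token positions that fit in one block for a single head.
fn tokens_per_block(block_raw_bytes: usize, bytes_per_element: usize, head_dim: usize) -> usize {
    aligned_block_elements(block_raw_bytes, bytes_per_element, head_dim) / head_dim
}
```

For `head_dim = 128` with f16 values and 32 KB blocks this yields 16384 elements and 128 token positions, matching the worked example. A head dimension that does not divide the raw element count, such as 96, leaves a few unused elements at the end of each block (16320 usable elements, 170 token positions).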
+ +### 3.5 Data Type Enumeration + +```rust +/// Element data type for the original (unquantized) tensor. +#[derive(Clone, Copy, PartialEq, Eq, Debug)] +#[repr(u8)] +pub enum DType { + F32 = 0, + F16 = 1, + BF16 = 2, + I8 = 3, + U8 = 4, +} +``` + +### 3.6 Reconstruction Policy + +```rust +/// Policy for reconstructing a block's full-precision data. +#[derive(Clone, Copy, PartialEq, Eq, Debug)] +#[repr(u8)] +pub enum ReconstructPolicy { + /// No reconstruction needed; block payload is self-contained. + None = 0, + /// Reconstruct by applying a delta to the lineage parent block. + Delta = 1, + /// Reconstruct by multiplying factors with a base block. + Factor = 2, +} +``` + +The `Delta` policy enables storing only the difference from a parent block +(useful for KV cache entries that change incrementally across decoding steps). +The `Factor` policy supports factorized representations where a block stores +low-rank factors that reconstruct the full tensor via matrix multiplication. + +### 3.7 Block Metadata (BlockMeta) + +```rust +/// Complete metadata for a single block in the store. +/// +/// This structure is stored in the in-memory index and persisted +/// via the append-only MetaLog. It contains everything needed to +/// locate, decode, verify, and score a block for tier migration. +/// `Clone` and `Debug` are derived because `MetaRecord::Create` embeds this struct. +#[derive(Clone, Debug)] +pub struct BlockMeta { + // ---- Identity ---- + /// Unique block identifier. + pub key: BlockKey, + + // ---- Tensor shape (encoded once per tensor, stored per block for self-containment) ---- + /// Original tensor shape, encoded as a compact dimension list. + /// For a 2D tensor [rows, cols], shape = [rows as u32, cols as u32]. + /// Maximum 8 dimensions. + pub shape: [u32; 8], + /// Number of valid entries in the shape array. + pub shape_ndim: u8, + + // ---- Data type ---- + /// Element type of the original unquantized tensor. + pub dtype: DType, + + // ---- Tier and quantization ---- + /// Current storage tier. 
+ pub tier: Tier, + /// Quantization bit width (3, 5, 7, 8, or 32 for uncompressed). + pub bits: u8, + /// Quantization scale (from ADR-017 groupwise symmetric quantization). + /// For multi-group blocks, this is the maximum group scale. + pub scale: f32, + /// Quantization zero point (0 for symmetric quantization). + pub zero_point: i16, + + // ---- Timestamps ---- + /// Tick at which this block was created. + pub created_at: u64, + /// Tick of the most recent read or write access. + pub last_access_at: u64, + + // ---- Access tracking ---- + /// Total number of accesses since creation. + pub access_count: u32, + /// Exponential moving average of the access rate (accesses per tick). + pub ema_access_rate: f32, + /// Bitset window: bit i is set if the block was accessed at tick (now - i). + /// Provides a compact 64-tick access history. + pub access_window: u64, + + // ---- Integrity ---- + /// CRC32 checksum over the quantized payload bytes concatenated with + /// the scale bytes. Detects bit flips in storage. + pub checksum: u32, + + // ---- Lineage ---- + /// Optional tensor_id of the parent block for delta/factor reconstruction. + /// Zero means no parent (self-contained block). + pub lineage_parent: u128, + /// Reconstruction policy. + pub reconstruct: ReconstructPolicy, + + // ---- Tier migration bookkeeping ---- + /// Number of ticks the block has spent in its current tier. + pub tier_age: u32, +} +``` + +### 3.8 Access History Tracking + +Three complementary mechanisms track access patterns with different tradeoffs: + +``` ++------------------------------------------------------------------+ +| ACCESS HISTORY TRACKING | ++------------------------------------------------------------------+ +| | +| 1. Bitset Window (u64) | +| +-------------------------------------------------------+ | +| | bit 0 | bit 1 | bit 2 | ... | bit 62 | bit 63 | | +| | now | now-1 | now-2 | ... 
| now-62 | now-63 | | +| +-------------------------------------------------------+ | +| Compact. O(1) update. Exact for last 64 ticks. | +| Use: Burst detection, recent activity check. | +| | +| 2. EMA Access Rate (f32) | +| rate_new = alpha * (1/dt) + (1-alpha) * rate_old | +| alpha = 0.1 (configurable) | +| Use: Smooth scoring for tier migration. | +| | +| 3. Access Count + Last Access Timestamp | +| score = access_count * 1024 / (now - last_access_at + 1) | +| Use: Coarse tier selection (compatible with ADR-017 policy). | +| | ++------------------------------------------------------------------+ +``` + +**Bitset window update**: +```rust +impl BlockMeta { + /// Shift the window by `elapsed` ticks and set bit 0 (current tick). + pub fn record_access(&mut self, now: u64) { + let elapsed = now.saturating_sub(self.last_access_at); + if elapsed > 0 { + // Bit i means "accessed at tick now - i", so older accesses move to + // higher bit indices; bits older than 64 ticks fall off the top. + if elapsed >= 64 { + self.access_window = 1; // Only current tick survives. + } else { + self.access_window = (self.access_window << elapsed) | 1; + } + } else { + self.access_window |= 1; // Same tick, just set bit 0. + } + self.last_access_at = now; + self.access_count = self.access_count.saturating_add(1); + + // Update EMA: rate = alpha * instantaneous + (1-alpha) * old + let dt = elapsed.max(1) as f32; + let instantaneous = 1.0 / dt; + const ALPHA: f32 = 0.1; + self.ema_access_rate = ALPHA * instantaneous + (1.0 - ALPHA) * self.ema_access_rate; + } + + /// Number of ticks (out of the last 64) in which this block was accessed. + pub fn recent_access_density(&self) -> u32 { + self.access_window.count_ones() + } + + /// Tier migration score combining EMA rate and access density. + /// Higher score = hotter block = keep in higher tier. 
+ pub fn migration_score(&self, now: u64) -> f32 { + let age = (now.saturating_sub(self.created_at)).max(1) as f32; + let density = self.recent_access_density() as f32 / 64.0; + // Weighted combination: EMA dominates long-term, density captures bursts. + 0.7 * self.ema_access_rate * 1000.0 + 0.3 * density * 1000.0 / age.sqrt() + } +} +``` + +### 3.9 Storage Layout + +``` +/ + / + / + meta.log # Append-only MetaLog (MetaRecord entries) + tier1.dat # Tier 1 data file (8-bit quantized blocks) + tier2.dat # Tier 2 data file (5/7-bit quantized blocks) + tier3.dat # Tier 3 data file (3-bit quantized blocks) + delta.dat # Optional: delta payloads for ReconstructPolicy::Delta + factor.dat # Optional: factor payloads for ReconstructPolicy::Factor +``` + +**ASCII diagram of on-disk layout**: + +``` +meta.log (append-only) ++--------+--------+--------+--------+--------+-------> +| rec[0] | rec[1] | rec[2] | rec[3] | rec[4] | ... ++--------+--------+--------+--------+--------+-------> + ^create ^create ^update ^migrate ^delete + block A block B block A block A block C + access tier1->2 + +tier1.dat (8-bit blocks, sorted by BlockKey) ++============+============+============+============+ +| Block A.0 | Block A.1 | Block D.0 | Block D.1 | +| 16 KB | 16 KB | 16 KB | 16 KB | ++============+============+============+============+ + q payload q payload q payload q payload + + scales + scales + scales + scales + +tier2.dat (5/7-bit blocks) ++============+============+ +| Block B.0 | Block E.0 | +| 16 KB | 16 KB | ++============+============+ + +tier3.dat (3-bit blocks) ++============+============+============+ +| Block C.0 | Block C.1 | Block F.0 | +| 16 KB | 16 KB | 16 KB | ++============+============+============+ +``` + +Each block slot in a tier data file is padded to the configured +`BLOCK_RAW_BYTES` size. This wastes up to `BLOCK_RAW_BYTES - 1` bytes per +block but guarantees that every block can be read with a single aligned I/O +operation. 
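The fixed-slot layout above can be sketched as follows. This is an illustration under stated assumptions rather than the engine's actual I/O code: `slot_offset` and `pad_to_slot` are hypothetical helper names, and zero is assumed as the padding byte.

```rust
/// Byte offset of slot `slot_index` in a tier data file whose slots are all
/// `block_raw_bytes` long (hypothetical helper).
fn slot_offset(slot_index: u64, block_raw_bytes: u64) -> u64 {
    slot_index * block_raw_bytes
}

/// Copy `payload` into a zero-padded buffer of exactly `block_raw_bytes`.
/// Returns `None` when the payload does not fit in one slot.
fn pad_to_slot(payload: &[u8], block_raw_bytes: usize) -> Option<Vec<u8>> {
    if payload.len() > block_raw_bytes {
        return None;
    }
    let mut slot = vec![0u8; block_raw_bytes];
    slot[..payload.len()].copy_from_slice(payload);
    Some(slot)
}
```

Because every slot has the same length, locating a block needs only a multiplication, and the recorded `data_len` tells the reader how many of the slot's bytes are payload rather than padding.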
+ +**Memory mapping (server targets)**: Tier data files are opened with +`mmap(MAP_SHARED)` for zero-copy reads. The OS page cache handles eviction. +Writes also go through a `MAP_SHARED` mapping (a `MAP_PRIVATE` mapping is +copy-on-write and never reaches the file), with explicit `msync` on flush. + +**WASM targets**: Data is held in `Vec<u8>` buffers. A host-provided +persistence hook (`fn persist(tier: Tier, data: &[u8])`) is called on flush +to write buffers to IndexedDB, OPFS, or a host filesystem. + +### 3.10 MetaLog Format + +The MetaLog is an append-only file of fixed-size records. Each record +describes a single state transition for a block. + +```rust +/// A single record in the append-only MetaLog. +#[derive(Clone, Debug)] +pub enum MetaRecord { + /// A new block was created. + Create { + meta: BlockMeta, + /// Byte offset within the tier data file where the block payload starts. + data_offset: u64, + /// Length of the block payload in bytes. + data_len: u32, + }, + /// A block's access metadata was updated. + Access { + key: BlockKey, + last_access_at: u64, + access_count: u32, + ema_access_rate: f32, + access_window: u64, + }, + /// A block was migrated to a different tier. + Migrate { + key: BlockKey, + old_tier: Tier, + new_tier: Tier, + new_bits: u8, + new_scale: f32, + new_checksum: u32, + new_data_offset: u64, + new_data_len: u32, + }, + /// A block was deleted. + Delete { + key: BlockKey, + }, +} +``` + +**Record binary format** (little-endian, fixed 128-byte records with padding): + +``` +Offset Size Field +------ ---- ----- +0 1 record_type (0=Create, 1=Access, 2=Migrate, 3=Delete) +1 16 tensor_id (u128 LE) +17 4 block_index (u32 LE) +21 ... record-type-specific fields +120 4 record_crc32 (CRC32 over bytes 0..120) +124 4 padding (0x00) +``` + +On startup, the engine replays every record sequentially to rebuild the +in-memory index. Invalid records (CRC32 mismatch) are skipped with a warning. +This replay is O(N) in the number of records and typically completes in +<100ms for stores with fewer than 1 million blocks. 
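The record framing and CRC-guarded replay described above can be sketched as follows. This is a minimal illustration, not the engine's serializer: the function names are hypothetical, only the `Delete` record (which has no type-specific fields) is encoded, and the record CRC is assumed to use the same Castagnoli polynomial as the block checksum in Section 3.12, which this section does not state explicitly.

```rust
const RECORD_LEN: usize = 128;
const CRC_OFFSET: usize = 120;

/// Bitwise CRC32-C (Castagnoli, reflected polynomial 0x82F63B78).
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc ^ 0xFFFF_FFFF
}

/// Encode a Delete record (record_type = 3) in the 128-byte layout.
fn encode_delete(tensor_id: u128, block_index: u32) -> [u8; RECORD_LEN] {
    let mut rec = [0u8; RECORD_LEN];
    rec[0] = 3;
    rec[1..17].copy_from_slice(&tensor_id.to_le_bytes());
    rec[17..21].copy_from_slice(&block_index.to_le_bytes());
    // CRC covers bytes 0..120; the last 4 bytes stay zero as padding.
    let crc = crc32c(&rec[..CRC_OFFSET]);
    rec[CRC_OFFSET..CRC_OFFSET + 4].copy_from_slice(&crc.to_le_bytes());
    rec
}

/// True when `rec` is exactly one record whose stored CRC matches its contents.
fn record_is_valid(rec: &[u8]) -> bool {
    if rec.len() != RECORD_LEN {
        return false;
    }
    let stored = u32::from_le_bytes([rec[120], rec[121], rec[122], rec[123]]);
    stored == crc32c(&rec[..CRC_OFFSET])
}

/// Replay-style scan: count records that pass the CRC check, skipping the rest.
fn count_valid_records(log: &[u8]) -> usize {
    log.chunks_exact(RECORD_LEN).filter(|rec| record_is_valid(rec)).count()
}
```

Fixed-size records let replay resynchronize after a corrupt record simply by advancing 128 bytes, which is what makes the skip-with-warning policy workable.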
+ +### 3.11 In-Memory Index + +```rust +use std::collections::{BinaryHeap, HashMap}; + +/// The in-memory index provides O(1) block lookup and O(1) tier-bucket access. +pub struct BlockIndex { + /// Primary index: BlockKey -> BlockMeta. + /// Uses hashbrown internally for better cache performance on large maps. + map: HashMap<BlockKey, BlockMeta>, + + /// Per-tier block lists for fast candidate selection during migration. + tier_buckets: [Vec<BlockKey>; 4], + + /// Min-heap of (score, BlockKey) for eviction candidates. + /// The block with the lowest migration_score is at the top. + eviction_heap: BinaryHeap<std::cmp::Reverse<(OrderedFloat, BlockKey)>>, + + /// Data file offsets: BlockKey -> (data_offset, data_len) per tier. + offsets: HashMap<BlockKey, (u64, u32)>, +} + +/// Wrapper for f32 that implements Ord (NaN-safe). +#[derive(Clone, Copy, PartialEq)] +struct OrderedFloat(f32); + +impl Eq for OrderedFloat {} + +impl Ord for OrderedFloat { + fn cmp(&self, other: &Self) -> std::cmp::Ordering { + self.0.partial_cmp(&other.0).unwrap_or(std::cmp::Ordering::Equal) + } +} + +impl PartialOrd for OrderedFloat { + fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { + Some(self.cmp(other)) + } +} + +impl BlockIndex { + /// Create an empty index. + pub fn new() -> Self { + Self { + map: HashMap::new(), + tier_buckets: [Vec::new(), Vec::new(), Vec::new(), Vec::new()], + eviction_heap: BinaryHeap::new(), + offsets: HashMap::new(), + } + } + + /// Insert or update a block's metadata. + pub fn upsert(&mut self, meta: BlockMeta, data_offset: u64, data_len: u32) { + let key = meta.key; + let tier_idx = meta.tier as usize; + + // Tier the block currently occupies, if it is already indexed. + let prev_tier_idx = self.map.get(&key).map(|old| old.tier as usize); + + // Remove from old tier bucket if the tier changed. + if let Some(old_tier_idx) = prev_tier_idx { + if old_tier_idx != tier_idx { + self.tier_buckets[old_tier_idx].retain(|k| k != &key); + } + } + + // Push only when the key is not already in this bucket, to avoid duplicates. + if prev_tier_idx != Some(tier_idx) { + self.tier_buckets[tier_idx].push(key); + } + self.offsets.insert(key, (data_offset, data_len)); + self.map.insert(key, meta); + } + + /// Look up a block's metadata by key. 
+ pub fn get(&self, key: &BlockKey) -> Option<&BlockMeta> { + self.map.get(key) + } + + /// Look up a block's data file location. + pub fn get_offset(&self, key: &BlockKey) -> Option<(u64, u32)> { + self.offsets.get(key).copied() + } + + /// Remove a block from the index. + pub fn remove(&mut self, key: &BlockKey) -> Option<BlockMeta> { + if let Some(meta) = self.map.remove(key) { + let tier_idx = meta.tier as usize; + self.tier_buckets[tier_idx].retain(|k| k != key); + self.offsets.remove(key); + Some(meta) + } else { + None + } + } + + /// Return all block keys in a given tier. + pub fn blocks_in_tier(&self, tier: Tier) -> &[BlockKey] { + &self.tier_buckets[tier as usize] + } + + /// Total number of blocks across all tiers. + pub fn len(&self) -> usize { + self.map.len() + } + + /// Rebuild eviction heap from current metadata. + pub fn rebuild_eviction_heap(&mut self, now: u64) { + self.eviction_heap.clear(); + for (key, meta) in &self.map { + let score = meta.migration_score(now); + self.eviction_heap + .push(std::cmp::Reverse((OrderedFloat(score), *key))); + } + } + + /// Pop the block with the lowest migration score (best eviction candidate). + pub fn pop_coldest(&mut self) -> Option<BlockKey> { + self.eviction_heap.pop().map(|std::cmp::Reverse((_, key))| key) + } +} +``` + +### 3.12 Checksums and Integrity + +Every block's quantized payload is protected by a CRC32 checksum: + +```rust +/// Compute CRC32 over the quantized payload concatenated with scale bytes. +/// +/// This detects: +/// - Bit flips in the compressed data (storage media errors). +/// - Corrupted scale values (which would cause wild dequantization errors). +/// - Truncated writes (partial block). +pub fn compute_block_checksum(q_payload: &[u8], scale_bytes: &[u8]) -> u32 { + let mut crc: u32 = 0xFFFF_FFFF; + for &byte in q_payload.iter().chain(scale_bytes.iter()) { + crc = crc32_update(crc, byte); + } + crc ^ 0xFFFF_FFFF +} + +/// CRC32 (Castagnoli) single-byte update. 
+/// Uses a lookup table for performance; the table is generated at compile time. +fn crc32_update(crc: u32, byte: u8) -> u32 { + let idx = ((crc ^ byte as u32) & 0xFF) as usize; + CRC32_TABLE[idx] ^ (crc >> 8) +} + +/// CRC32-C lookup table (256 entries, generated at compile time). +const CRC32_TABLE: [u32; 256] = { + let mut table = [0u32; 256]; + let mut i = 0u32; + while i < 256 { + let mut crc = i; + let mut j = 0; + while j < 8 { + if crc & 1 != 0 { + crc = (crc >> 1) ^ 0x82F6_3B78; // Castagnoli polynomial + } else { + crc >>= 1; + } + j += 1; + } + table[i as usize] = crc; + i += 1; + } + table +}; +``` + +**On read**: After reading a block from a tier data file, recompute the +checksum and compare against `BlockMeta::checksum`. On mismatch: + +1. Log a `CHECKSUM_MISMATCH` event with the block key and tier. +2. If `reconstruct != None`, attempt to rehydrate from the parent block. +3. If rehydration fails or `reconstruct == None`, return `StoreErr::Corruption`. +4. Emit a metric counter for monitoring. + +### 3.13 Public Traits + +The storage engine defines three traits to abstract the I/O boundary: + +```rust +/// Monotonic tick source for timestamps. +/// +/// On native targets this wraps `std::time::Instant` or a hardware TSC. +/// On WASM targets this wraps `performance.now()` via the host. +pub trait Clock { + /// Return the current tick value. Must be monotonically non-decreasing. + fn now_ticks(&self) -> u64; +} + +/// Block-level I/O operations. +/// +/// Implementations: +/// - `MmapBlockIO`: Memory-mapped files for server targets. +/// - `BufferBlockIO`: In-memory `Vec<u8>` for WASM targets. +pub trait BlockIO { + /// Read a block's payload into `dst`. Returns the number of bytes read. + /// + /// # Errors + /// - `StoreErr::NotFound` if the block does not exist in the given tier. + /// - `StoreErr::Corruption` if the read data fails checksum validation. + /// - `StoreErr::Io` for underlying I/O errors. 
+ fn read_block( + &self, + tier: Tier, + key: BlockKey, + offset: u64, + len: u32, + dst: &mut [u8], + ) -> Result<usize, StoreErr>; + + /// Write a block's payload to the given tier. Returns the byte offset + /// at which the block was written and the number of bytes written. + /// + /// The implementation must guarantee that after a successful return, + /// the data is durable (flushed to storage or committed to the + /// WASM host persistence hook). + fn write_block( + &mut self, + tier: Tier, + key: BlockKey, + src: &[u8], + ) -> Result<(u64, u32), StoreErr>; + + /// Mark a block's storage slot as free in the given tier. + /// + /// The implementation may reclaim space immediately or defer to compaction. + fn delete_block( + &mut self, + tier: Tier, + key: BlockKey, + offset: u64, + len: u32, + ) -> Result<(), StoreErr>; +} + +/// Append-only metadata log. +/// +/// Implementations: +/// - `FileMetaLog`: Append to a file with CRC32-protected records. +/// - `MemMetaLog`: In-memory `Vec<MetaRecord>` for WASM or testing. +pub trait MetaLog { + /// Append a metadata record to the log. + /// + /// Must be atomic: either the full record is written or nothing is. + fn append(&mut self, rec: MetaRecord) -> Result<(), StoreErr>; + + /// Iterate over all records in the log in order. + /// + /// Used during startup to replay and rebuild the in-memory index. + fn iter(&self) -> Box<dyn Iterator<Item = Result<MetaRecord, StoreErr>> + '_>; + + /// Number of records in the log. + fn record_count(&self) -> u64; +} +``` + +### 3.14 Error Type + +```rust +/// Errors returned by the storage engine. +#[derive(Debug)] +pub enum StoreErr { + /// Block not found in the specified tier. + NotFound { key: BlockKey, tier: Tier }, + /// Checksum mismatch detected on read. + Corruption { + key: BlockKey, + expected: u32, + actual: u32, + }, + /// Underlying I/O error. + Io(std::io::Error), + /// MetaLog record is malformed or has invalid CRC. + InvalidRecord { offset: u64, reason: String }, + /// Capacity exceeded (e.g., tier data file is full). 
+ CapacityExceeded { tier: Tier }, +} +``` + +### 3.15 Store Engine (Orchestration) + +```rust +/// The main storage engine that coordinates blocks, metadata, and I/O. +pub struct TensorStore<C: Clock, B: BlockIO, M: MetaLog> { + clock: C, + block_io: B, + meta_log: M, + index: BlockIndex, + config: StoreConfig, +} + +/// Configuration for the storage engine. +pub struct StoreConfig { + /// Raw block size in bytes (before quantization). + pub block_raw_bytes: usize, + /// Maximum number of blocks per tier before eviction triggers. + pub max_blocks_per_tier: [usize; 4], + /// EMA alpha for access rate smoothing. + pub ema_alpha: f32, + /// Score threshold for tier promotion (cold -> warm, warm -> hot). + pub promote_threshold: f32, + /// Score threshold for tier demotion (hot -> warm, warm -> cold). + pub demote_threshold: f32, +} + +impl Default for StoreConfig { + fn default() -> Self { + Self { + block_raw_bytes: 16384, + max_blocks_per_tier: [1024, 4096, 8192, 16384], + ema_alpha: 0.1, + promote_threshold: 512.0, + demote_threshold: 32.0, + } + } +} + +impl<C: Clock, B: BlockIO, M: MetaLog> TensorStore<C, B, M> { + /// Create a new store, replaying the MetaLog to rebuild the index. + pub fn open(clock: C, block_io: B, meta_log: M, config: StoreConfig) -> Result<Self, StoreErr> { + let mut index = BlockIndex::new(); + + // Replay MetaLog to rebuild in-memory state. + for record in meta_log.iter() { + let record = record?; + match record { + MetaRecord::Create { meta, data_offset, data_len } => { + index.upsert(meta, data_offset, data_len); + } + MetaRecord::Access { key, last_access_at, access_count, ema_access_rate, access_window } => { + if let Some(meta) = index.map.get_mut(&key) { + meta.last_access_at = last_access_at; + meta.access_count = access_count; + meta.ema_access_rate = ema_access_rate; + meta.access_window = access_window; + } + } + MetaRecord::Migrate { key, new_tier, new_bits, new_scale, new_checksum, new_data_offset, new_data_len, .. 
} => { + if let Some(meta) = index.map.get_mut(&key) { + let old_tier = meta.tier; + meta.tier = new_tier; + meta.bits = new_bits; + meta.scale = new_scale; + meta.checksum = new_checksum; + meta.tier_age = 0; + // Update tier buckets. + index.tier_buckets[old_tier as usize].retain(|k| k != &key); + index.tier_buckets[new_tier as usize].push(key); + index.offsets.insert(key, (new_data_offset, new_data_len)); + } + } + MetaRecord::Delete { key } => { + index.remove(&key); + } + } + } + + let now = clock.now_ticks(); + index.rebuild_eviction_heap(now); + + Ok(Self { clock, block_io, meta_log, index, config }) + } + + /// Write a new block to the store. + pub fn put_block( + &mut self, + key: BlockKey, + tier: Tier, + dtype: DType, + shape: &[u32], + q_payload: &[u8], + scale_bytes: &[u8], + bits: u8, + scale: f32, + zero_point: i16, + lineage_parent: u128, + reconstruct: ReconstructPolicy, + ) -> Result<(), StoreErr> { + let now = self.clock.now_ticks(); + let checksum = compute_block_checksum(q_payload, scale_bytes); + + // Write the payload followed by the scale bytes to the tier data file, + // so the stored bytes are exactly the bytes covered by the checksum. + let mut stored = Vec::with_capacity(q_payload.len() + scale_bytes.len()); + stored.extend_from_slice(q_payload); + stored.extend_from_slice(scale_bytes); + let (data_offset, data_len) = self.block_io.write_block(tier, key, &stored)?; + + // Build metadata. + let mut shape_arr = [0u32; 8]; + let ndim = shape.len().min(8); + shape_arr[..ndim].copy_from_slice(&shape[..ndim]); + + let meta = BlockMeta { + key, + shape: shape_arr, + shape_ndim: ndim as u8, + dtype, + tier, + bits, + scale, + zero_point, + created_at: now, + last_access_at: now, + access_count: 0, + ema_access_rate: 0.0, + access_window: 1, // Accessed at creation tick. + checksum, + lineage_parent, + reconstruct, + tier_age: 0, + }; + + // Persist to MetaLog. + self.meta_log.append(MetaRecord::Create { + meta: meta.clone(), + data_offset, + data_len, + })?; + + // Update in-memory index. + self.index.upsert(meta, data_offset, data_len); + + Ok(()) + } + + /// Read a block's payload, validating its checksum. 
+ pub fn get_block( + &mut self, + key: &BlockKey, + dst: &mut [u8], + ) -> Result<usize, StoreErr> { + let now = self.clock.now_ticks(); + + let meta = self.index.get(key) + .ok_or(StoreErr::NotFound { key: *key, tier: Tier::Tier0 })?; + let tier = meta.tier; + let expected_checksum = meta.checksum; + + let (offset, len) = self.index.get_offset(key) + .ok_or(StoreErr::NotFound { key: *key, tier })?; + + let bytes_read = self.block_io.read_block(tier, *key, offset, len, dst)?; + + // Validate checksum. Only the payload is stored, so it is hashed with an + // empty scale slice; this matches the checksum recorded at write time only + // when the block was written with empty `scale_bytes`. + let actual_checksum = compute_block_checksum(&dst[..bytes_read], &[]); + if actual_checksum != expected_checksum { + return Err(StoreErr::Corruption { + key: *key, + expected: expected_checksum, + actual: actual_checksum, + }); + } + + // Update access metadata. + if let Some(meta) = self.index.map.get_mut(key) { + meta.record_access(now); + } + + Ok(bytes_read) + } + + /// Migrate a block from its current tier to a new tier. + /// + /// The caller supplies the payload already re-quantized at the new + /// tier's bit width; this method writes it to the new tier file and + /// updates metadata. + pub fn migrate_block( + &mut self, + key: &BlockKey, + new_tier: Tier, + new_bits: u8, + re_quantized_payload: &[u8], + new_scale_bytes: &[u8], + new_scale: f32, + ) -> Result<(), StoreErr> { + let meta = self.index.get(key) + .ok_or(StoreErr::NotFound { key: *key, tier: Tier::Tier0 })?; + let old_tier = meta.tier; + + let new_checksum = compute_block_checksum(re_quantized_payload, new_scale_bytes); + + // Write to new tier. + let (new_offset, new_len) = self.block_io.write_block(new_tier, *key, re_quantized_payload)?; + + // Persist the migration record before deleting the old copy, so that a + // crash between the two steps never leaves the log pointing at freed data. + self.meta_log.append(MetaRecord::Migrate { + key: *key, + old_tier, + new_tier, + new_bits, + new_scale, + new_checksum, + new_data_offset: new_offset, + new_data_len: new_len, + })?; + + // Delete from old tier (best effort; the index still holds the old offset here). + if let Some((old_offset, old_len)) = self.index.get_offset(key) { + let _ = self.block_io.delete_block(old_tier, *key, old_offset, old_len); + } + + // Update in-memory state. + if let Some(meta) = self.index.map.get_mut(key) { + meta.tier = new_tier; + meta.bits = new_bits; + meta.scale = new_scale; + meta.checksum = new_checksum; + meta.tier_age = 0; + // Update tier buckets. + self.index.tier_buckets[old_tier as usize].retain(|k| k != key); + self.index.tier_buckets[new_tier as usize].push(*key); + self.index.offsets.insert(*key, (new_offset, new_len)); + } + + Ok(()) + } +} +``` + +### 3.16 Data Flow: Write Path + +``` + put_block() + | + v ++--------------------------------------------------------------------+ +| 1. Compute CRC32 checksum over q_payload + scale_bytes | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 2. BlockIO::write_block(tier, key, payload) | +| - Server: append to mmap'd tier file, return offset | +| - WASM: append to Vec buffer, schedule persist hook | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 3. Build BlockMeta with timestamps, checksum, tier, quant params | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 4. MetaLog::append(Create { meta, data_offset, data_len }) | +| - Serialize 128-byte record with CRC32 trailer | +| - Append to meta.log file / memory buffer | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 5.
BlockIndex::upsert(meta, data_offset, data_len) | +| - Insert into HashMap | +| - Add to tier bucket | +| - Update offsets map | ++--------------------------------------------------------------------+ +``` + +### 3.17 Data Flow: Read Path + +``` + get_block() + | + v ++--------------------------------------------------------------------+ +| 1. BlockIndex::get(key) -> BlockMeta | +| - O(1) HashMap lookup | +| - Extract tier, checksum, data offset | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 2. BlockIO::read_block(tier, key, offset, len, dst) | +| - Server: read from mmap (zero-copy page fault) | +| - WASM: memcpy from Vec buffer | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 3. Validate CRC32: compute_block_checksum(dst) == meta.checksum? | +| - YES: proceed to step 4 | +| - NO: attempt rehydrate from lineage_parent | +| if rehydrate fails -> return StoreErr::Corruption | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 4. Update access metadata: | +| - meta.record_access(now) | +| - (Optionally) append Access record to MetaLog | +| (batched every N reads to reduce log growth) | ++--------------------------------------------------------------------+ + | + v ++--------------------------------------------------------------------+ +| 5. Return payload bytes to caller for dequantization | +| (dequantization via ADR-017 pipeline) | ++--------------------------------------------------------------------+ +``` + +### 3.18 Determinism Guarantees + +The storage engine provides the following determinism properties: + +1. 
**Stable ordering**: Given the same sequence of `put_block` and + `migrate_block` calls, the MetaLog will contain the same records in + the same order, and the in-memory index will be identical after replay. + +2. **Reproducible IDs**: Tensor IDs derived via `blake3(lineage + name)` + produce the same ID for the same inputs across platforms and restarts. + +3. **Deterministic eviction**: The eviction heap ordering is a pure function + of `(migration_score, BlockKey)`. Ties are broken by BlockKey's total + order `(tensor_id, block_index)`, ensuring the same block is evicted + given the same access history. + +4. **Platform-independent encoding**: All on-disk formats use little-endian + byte order. The MetaLog record size is fixed at 128 bytes regardless + of record type. + +### 3.19 Differences from ADR-017 Segment-Based Approach + +| Aspect | ADR-017 (Segment) | ADR-018 (Block) | +|--------|-------------------|-----------------| +| Granularity | Variable-size segments (header + N frames) | Fixed-size blocks (16 KB or 32 KB) | +| Identity | Time-range key `{tensor_id}:{start_ts}:{end_ts}` | `BlockKey(tensor_id, block_index)` | +| Metadata | Embedded in segment header | Separate `BlockMeta` + MetaLog | +| Access tracking | Per-compressor `access_count` and `last_access_ts` | Per-block EMA, bitset window, counters | +| Checksums | None | CRC32 per block | +| Tier migration | Tier determined at segment creation time | Blocks migrate independently between tiers | +| Random access | `decode_single_frame` within a segment | Direct block read by `(tensor_id, block_index)` | +| Crash recovery | Segments stored as AgentDB blobs | Append-only MetaLog replay | +| I/O pattern | Variable-size blob reads | Fixed-size aligned reads (page-cache friendly) | +| WASM support | Handle-based FFI in compressor | Trait-based `BlockIO` with host persistence hooks | +| Lineage | Optional DAG edges on segments | Built-in `lineage_parent` + `ReconstructPolicy` | + +The segment format from 
ADR-017 is **not replaced** -- it continues to serve +as the codec within each block. A block's quantized payload may contain one +or more TQTC-encoded segments, or may use a simpler packed format when +temporal scale reuse is not applicable (e.g., single-frame embedding blocks). + +--- + +## 4. Alternatives Considered + +### 4.1 Variable-Size Blocks (LSM-Style) + +**Considered**: Use variable-size blocks like an LSM tree's SSTable blocks, +where each block is as large as needed to hold one tensor's data. + +**Rejected**: Variable-size blocks complicate I/O alignment, make space +reclamation harder, and prevent simple offset-based addressing. The fixed-size +approach wastes some space to padding but gains significant simplicity and +performance predictability. + +### 4.2 Page-Aligned I/O Without Blocks + +**Considered**: Store raw quantized data in flat files and use offset-based +addressing without a block abstraction. + +**Rejected**: Without blocks, metadata (checksums, access tracking, tier +assignment) must be stored separately in a parallel structure with no natural +co-location. Blocks provide a clean unit of metadata attachment. + +### 4.3 SQLite for Metadata + +**Considered**: Use SQLite (via sql.js for WASM) instead of an append-only +MetaLog for metadata persistence. + +**Rejected**: SQLite adds a dependency (contrary to ADR-017's zero-dependency +philosophy), introduces write amplification for append-heavy workloads, and +is slower than a simple sequential log for the replay-on-startup pattern. +The MetaLog can be compacted periodically by writing a snapshot and +truncating old records. + +### 4.4 Content-Addressable Blocks (CAS) + +**Considered**: Address blocks by the hash of their content, like a +content-addressable store (git objects, IPFS). + +**Rejected**: Tensor blocks are mutable in the sense that their tier and +quantization parameters change during migration. 
CAS would require creating +new block identities on every migration, breaking references. The +`(tensor_id, block_index)` identity is stable across migrations. + +### 4.5 Ring Buffer for Access History + +**Considered**: Use a ring buffer of `u16` timestamps (last 16 access +timestamps) instead of the u64 bitset window. + +**Rejected as primary**: The ring buffer uses 32 bytes per block vs. 8 bytes +for the bitset. For stores with millions of blocks, this adds significant +memory overhead. The bitset provides sufficient resolution for tier migration +decisions. The ring buffer may be added as an optional diagnostic mode in the +future. + +--- + +## 5. Acceptance Criteria + +### 5.1 Functional Requirements + +- [ ] `put_block` writes a block to the correct tier data file and appends a + `Create` record to the MetaLog. +- [ ] `get_block` reads a block, validates its CRC32 checksum, and updates + access metadata. +- [ ] `migrate_block` moves a block between tiers, re-quantizes its payload, + and persists a `Migrate` record. +- [ ] MetaLog replay on startup reconstructs the exact same in-memory index + as existed before shutdown. +- [ ] Corrupted blocks (CRC32 mismatch) are detected and reported via + `StoreErr::Corruption`. +- [ ] Blocks with `ReconstructPolicy::Delta` can be rehydrated from their + lineage parent when corruption is detected. +- [ ] BlockKey ordering is deterministic: sorting by `(tensor_id, block_index)` + produces the same order on all platforms. +- [ ] The engine operates correctly with both `MmapBlockIO` (server) and + `BufferBlockIO` (WASM) implementations. 
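
The checksum requirements above can be illustrated with a minimal, dependency-free sketch. It assumes the common reflected CRC-32 polynomial `0xEDB88320` (the ADR does not name a specific variant) and assumes that `compute_block_checksum` hashes `q_payload` followed by `scale_bytes` as one continuous byte stream. The helpers `make_crc32_table` and `crc32_update` are hypothetical names introduced only for this example.

```rust
/// Build the 256-entry lookup table for the reflected CRC-32 polynomial
/// 0xEDB88320 at compile time (assumed variant, not fixed by the ADR).
const fn make_crc32_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut i = 0;
    while i < 256 {
        let mut crc = i as u32;
        let mut bit = 0;
        while bit < 8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
            bit += 1;
        }
        table[i] = crc;
        i += 1;
    }
    table
}

const CRC32_TABLE: [u32; 256] = make_crc32_table();

/// Feed `bytes` into a running (non-finalized) CRC state.
fn crc32_update(mut crc: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        let idx = ((crc ^ b as u32) & 0xFF) as usize;
        crc = (crc >> 8) ^ CRC32_TABLE[idx];
    }
    crc
}

/// Hypothetical checksum helper: CRC-32 over `q_payload` followed by
/// `scale_bytes`, treated as one concatenated stream.
pub fn compute_block_checksum(q_payload: &[u8], scale_bytes: &[u8]) -> u32 {
    let crc = crc32_update(0xFFFF_FFFF, q_payload);
    let crc = crc32_update(crc, scale_bytes);
    !crc
}
```

Under these assumptions, the reader and writer must pass the same two byte ranges to obtain the same value; flipping any single bit in either slice changes the result, which is what the `StoreErr::Corruption` requirement relies on.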
+ +### 5.2 Performance Targets + +| Metric | Target | Measurement | +|--------|--------|-------------| +| `put_block` latency (16 KB, SSD) | < 50 us | p50, sequential writes | +| `get_block` latency (16 KB, warm cache) | < 10 us | p50, random reads after warmup | +| `get_block` latency (16 KB, cold cache) | < 200 us | p50, random reads without warmup | +| MetaLog replay (1M records) | < 500 ms | Wall-clock time from open to ready | +| In-memory index lookup | < 100 ns | p50, `BlockIndex::get` | +| CRC32 checksum (16 KB) | < 5 us | Single block verification | +| Migration (Tier1 -> Tier3, 16 KB) | < 100 us | Including re-quantization and MetaLog append | +| Memory per block (metadata only) | < 256 bytes | `size_of::<BlockMeta>()` + index overhead | + +### 5.3 Compression Targets (Inherited from ADR-017) + +| Tier | Bits | Target Ratio vs. f32 | After Block Overhead | +|------|------|---------------------|---------------------| +| Tier0 | 32 (raw) | 1.0x | ~0.98x (block padding) | +| Tier1 | 8 | ~4.0x | ~3.9x | +| Tier2 | 5 or 7 | ~4.5x-6.4x | ~4.4x-6.2x | +| Tier3 | 3 | ~10.7x | ~10.3x | + +### 5.4 Integrity Targets + +- [ ] No undetected single-bit flips: CRC32 catches every single-bit error and + every burst error of up to 32 bits within a block payload. +- [ ] MetaLog records with invalid CRC are skipped during replay without + crashing the engine. +- [ ] After a crash mid-write, the MetaLog is consistent up to the last + fully-written record (no torn records). + +--- + +## 6. Risks and Mitigations + +| Risk | Severity | Likelihood | Mitigation | +|------|----------|------------|------------| +| Fixed block size wastes space for small tensors | Medium | High | Allow sub-block packing for tensors < block_size/4; track fill ratio in BlockMeta | +| MetaLog grows unboundedly | Medium | Medium | Periodic compaction: write a snapshot of current index, truncate log; compact every N records or on startup | +| CRC32 is not cryptographically secure | Low | Low | CRC32 detects accidental corruption.
If tamper resistance is needed, add HMAC-SHA256 (future ADR) | +| Mmap on 32-bit WASM limited to 4 GB address space | Medium | Medium | WASM uses BufferBlockIO (in-memory) with host persistence; no mmap. Tier data files are segmented to stay within limits | +| Eviction heap becomes stale between rebuilds | Low | Medium | Rebuild heap on every N-th get_block call or timer-based; lazy invalidation acceptable for tier migration | +| Deterministic ordering assumption broken by concurrent writes | Medium | Low | Single-writer model for MetaLog (no concurrent appends). Multi-writer requires fencing (future ADR) | +| Block padding wastes disk space | Low | High | Expected overhead is < 5% for typical workloads. Acceptable tradeoff for I/O alignment benefits | + +--- + +## 7. Crate Structure + +The block-based storage engine is organized as a Rust workspace with +focused crates: + +``` +crates/ + temporal_tensor_store/ # Orchestration: TensorStore, BlockIndex, read/write paths + src/ + lib.rs # Public API, re-exports + store.rs # TensorStore implementation + index.rs # BlockIndex: HashMap + tier buckets + eviction heap + meta_log.rs # MetaLog trait + FileMetaLog + MemMetaLog + block_io.rs # BlockIO trait + MmapBlockIO + BufferBlockIO + types.rs # BlockKey, BlockMeta, Tier, DType, ReconstructPolicy, StoreErr + checksum.rs # CRC32 computation (zero-dependency, const table) + config.rs # StoreConfig + Cargo.toml + + quant/ # Quantization formats (re-exports ADR-017 quantizer) + src/ + lib.rs + symmetric.rs # Groupwise symmetric quantization + bitpack.rs # Bit packing/unpacking + f16.rs # Software f16 conversion + Cargo.toml + + tiering/ # Tier scoring, migration scheduling + src/ + lib.rs + scorer.rs # Migration score computation + scheduler.rs # Background migration scheduler + policy.rs # Tier thresholds, hysteresis + Cargo.toml + + codec_bits/ # Bit-level packing/unpacking utilities + src/ + lib.rs + pack.rs # Bitstream packer (accumulator-based) + unpack.rs # Bitstream 
unpacker + simd.rs # Optional SIMD-accelerated paths + Cargo.toml + + metrics/ # Witness logs, audit trail + src/ + lib.rs + witness.rs # Immutable operation log + counters.rs # Atomic counters for monitoring + export.rs # Prometheus/OpenTelemetry export + Cargo.toml + + wasm_api/ # WASM FFI surface + src/ + lib.rs + ffi.rs # extern "C" functions for WASM hosts + host_hooks.rs # Trait for host-provided persistence + Cargo.toml +``` + +**Dependency graph**: + +``` +wasm_api + | + +---> temporal_tensor_store + | | + | +---> quant (re-exports ruvector-temporal-tensor quantizer) + | +---> tiering + | +---> codec_bits + | +---> metrics + | + +---> (host-provided persistence via trait) + +temporal_tensor_store + | + +---> ruvector-temporal-tensor (ADR-017, codec layer) + +---> tiering + +---> codec_bits + +---> metrics +``` + +All crates maintain zero external dependencies for the core paths, +preserving WASM compatibility as established in ADR-017. + +--- + +## 8. Integration Context + +### 8.1 AgentDB Integration + +AgentDB serves as the **external metadata persistence** layer for deployments +that do not use the file-based MetaLog: + +``` ++------------------+ +------------------+ +| TensorStore | | AgentDB | +| | | | +| MetaLog (trait) |-------->| Key-Value Store | +| | | HNSW Index | +| BlockIO (trait) |----+ | B-Tree Index | ++------------------+ | +------------------+ + | + v + +------------------+ + | Tier Data Files | + | (or OPFS/IDB | + | via WASM host) | + +------------------+ +``` + +The `AgentDbMetaLog` implementation wraps AgentDB's key-value store: +- Key: `meta:{tenant}:{collection}:{record_sequence}` +- Value: Serialized `MetaRecord` bytes +- Tags: `type=metalog`, `tenant={id}`, `collection={id}` + +### 8.2 KV Cache Integration (ADR-004) + +The three-tier KV cache from ADR-004 maps directly to the block store's tiers: + +| KV Cache Tier (ADR-004) | Block Store Tier (ADR-018) | Bits | +|-------------------------|---------------------------|------| +| 
High-Precision Tail Buffer (FP16) | Tier0 (uncompressed) | 16/32 | +| Moderate Quantization Zone (4-bit) | Tier1 (8-bit) or Tier2 (5-bit) | 5-8 | +| Aggressive Compression Zone (2-bit) | Tier3 (3-bit) | 3 | + +The block store's per-block access tracking replaces ADR-004's per-token +staleness heuristic with a more granular mechanism that operates at the +block level (covering multiple tokens). + +### 8.3 Coherence Engine Integration (ADR-014, ADR-015) + +The coherence engine can trigger block-level operations: + +- **Force migration**: When coherence score drops below threshold, demote + affected blocks to force re-quantization with fresh scales. +- **Lineage validation**: Verify that blocks in a delta chain are consistent + by checking parent-child checksum chains. +- **Anomaly detection**: Flag blocks whose access patterns deviate + significantly from their tensor's historical baseline. + +### 8.4 Delta-Behavior System (ADR-016) + +The `ReconstructPolicy::Delta` directly supports ADR-016's delta-behavior +model. A block with delta reconstruction stores only the difference from +its lineage parent, enabling: + +- Efficient incremental updates (write only the changed portion). +- Temporal queries (reconstruct any version by replaying the delta chain). +- Space savings when consecutive blocks are highly correlated. + +--- + +## 9. 
Implementation Roadmap + +### Phase 1: Core Types and Index (Week 1) +- [ ] Define `BlockKey`, `BlockMeta`, `Tier`, `DType`, `ReconstructPolicy`, `StoreErr` +- [ ] Implement `BlockIndex` with HashMap, tier buckets, and eviction heap +- [ ] Implement `BlockMeta::record_access` and `migration_score` +- [ ] Implement CRC32 checksum computation (const lookup table) +- [ ] Unit tests for all types, ordering, and index operations + +### Phase 2: MetaLog and Persistence (Week 1-2) +- [ ] Define `MetaLog` trait and `MetaRecord` enum +- [ ] Implement `MemMetaLog` (in-memory, for WASM and testing) +- [ ] Implement `FileMetaLog` (append-only file with CRC32 records) +- [ ] MetaLog replay tests: create -> access -> migrate -> delete sequences +- [ ] Crash recovery tests: truncated records, corrupted CRC + +### Phase 3: BlockIO Backends (Week 2) +- [ ] Define `BlockIO` trait +- [ ] Implement `BufferBlockIO` (in-memory Vec, WASM-compatible) +- [ ] Implement `MmapBlockIO` (memory-mapped files, server target) +- [ ] I/O round-trip tests for both backends + +### Phase 4: TensorStore Orchestration (Week 2-3) +- [ ] Implement `TensorStore::open` with MetaLog replay +- [ ] Implement `put_block`, `get_block`, `migrate_block` +- [ ] Checksum validation on read path +- [ ] Access metadata batching (every N reads) +- [ ] Integration tests: full write -> read -> migrate -> read cycle + +### Phase 5: Tiering Engine (Week 3) +- [ ] Implement migration scorer in `tiering` crate +- [ ] Implement background migration scheduler +- [ ] Hysteresis logic for promote/demote thresholds +- [ ] End-to-end test: blocks auto-migrate based on access patterns + +### Phase 6: WASM API (Week 3-4) +- [ ] Define host persistence hooks trait +- [ ] Implement `wasm_api` FFI surface +- [ ] wasm-pack integration tests +- [ ] Binary size validation (< 150 KB for store + codec) + +### Phase 7: AgentDB Integration (Week 4) +- [ ] Implement `AgentDbMetaLog` +- [ ] Implement `AgentDbBlockIO` (blob storage backend) +- [ 
] End-to-end benchmark on representative KV cache workload +- [ ] Acceptance test: MetaLog replay produces identical index + +--- + +## 10. References + +1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector, 2026. +2. ADR-001: RuVector Core Architecture. RuVector, 2026. +3. ADR-004: KV Cache Management Strategy for RuvLLM. RuVector, 2026. +4. ADR-016: Delta-Behavior System - Domain-Driven Design Architecture. RuVector, 2026. +5. ADR-005: WASM Runtime Integration. RuVector, 2026. +6. O'Neil, P., et al. "The Log-Structured Merge-Tree (LSM-Tree)." Acta Informatica, 1996. +7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB, 2015. +8. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML, 2024. +9. RIPPLE++. "Streaming Incremental Inference for Large Language Models." arXiv, 2026. +10. OMEGA. "Low-Latency GNN Serving with Tiered Tensor Storage." arXiv, 2026. +11. Dong, S., et al. "RocksDB: Evolution of Development Priorities in a Key-Value Store Serving Large-Scale Applications." ACM TODS, 2021. +12. Apache Arrow IPC Format Specification. 
https://arrow.apache.org/docs/format/IPC.html + +--- + +## Appendix A: MetaLog Record Binary Layout + +``` +128-byte fixed record (little-endian): + +Byte 0: record_type (u8: 0=Create, 1=Access, 2=Migrate, 3=Delete) +Bytes 1-16: tensor_id (u128 LE) +Bytes 17-20: block_index (u32 LE) + +--- Create (type=0) --- +Bytes 21: dtype (u8) +Bytes 22: tier (u8) +Bytes 23: bits (u8) +Bytes 24-27: scale (f32 LE) +Bytes 28-29: zero_point (i16 LE) +Bytes 30-37: created_at (u64 LE) +Bytes 38-45: data_offset (u64 LE) +Bytes 46-49: data_len (u32 LE) +Bytes 50-53: checksum (u32 LE) +Bytes 54-69: lineage_parent (u128 LE) +Bytes 70: reconstruct (u8) +Bytes 71-119: reserved (zero-padded) + +--- Access (type=1) --- +Bytes 21-28: last_access_at (u64 LE) +Bytes 29-32: access_count (u32 LE) +Bytes 33-36: ema_access_rate (f32 LE) +Bytes 37-44: access_window (u64 LE) +Bytes 45-119: reserved (zero-padded) + +--- Migrate (type=2) --- +Bytes 21: old_tier (u8) +Bytes 22: new_tier (u8) +Bytes 23: new_bits (u8) +Bytes 24-27: new_scale (f32 LE) +Bytes 28-31: new_checksum (u32 LE) +Bytes 32-39: new_data_offset (u64 LE) +Bytes 40-43: new_data_len (u32 LE) +Bytes 44-119: reserved (zero-padded) + +--- Delete (type=3) --- +Bytes 21-119: reserved (zero-padded) + +--- All records --- +Bytes 120-123: record_crc32 (CRC32 over bytes 0..120) +Bytes 124-127: padding (0x00000000) +``` + +## Appendix B: Tier Migration Score Examples + +| Scenario | access_count | EMA rate | Window density | Score | Tier Decision | +|----------|-------------|----------|---------------|-------|---------------| +| Active KV cache head | 10000 | 50.0 | 60/64 | ~35700 | Tier0/Tier1 (hot) | +| Recently used embedding | 500 | 5.0 | 32/64 | ~4050 | Tier1 (hot) | +| Periodic batch access | 100 | 0.5 | 8/64 | ~425 | Tier2 (warm) | +| Stale attention cache | 10 | 0.01 | 1/64 | ~12 | Tier3 (cold) | +| Archived gradient sketch | 2 | 0.001 | 0/64 | ~0.7 | Tier3 (cold, eviction candidate) | + +## Appendix C: Block Size Selection Rationale + 
+``` + Block Size vs. Overhead Tradeoff + + Overhead % | + (padding | + waste) | * + | * + 10% ---------|----*---------------------------- + | * + | * + 5% ---------|-------*------------------------- + | * + | * * * * + 1% ---------|---------------------------------- + +----+----+----+----+----+----+--> + 4K 8K 16K 32K 64K 128K + Block Size + + At 16 KB: ~3% average padding waste for typical tensor sizes. + At 32 KB: ~1.5% average padding waste. + At 4 KB: ~12% average padding waste (too many blocks, high metadata cost). + At 64 KB: ~0.8% waste but poor L2 cache utilization. + + Decision: 16 KB default, 32 KB for KV cache aligned to head dimensions. +``` + +## Appendix D: Comparison with Existing Storage Engines + +| Feature | RocksDB | TiKV | Arrow IPC | TTS (this ADR) | +|---------|---------|------|-----------|----------------| +| Block size | 4-64 KB (configurable) | 4 KB default | Variable | 16-32 KB (fixed) | +| Compression | LZ4/Zstd/Snappy | LZ4/Zstd | None/LZ4 | Quantization (3-8 bit) | +| Checksums | CRC32 per block | CRC32 per block | None | CRC32 per block | +| Index | LSM tree | LSM tree | Footer metadata | HashMap + tier buckets | +| Write pattern | Log-structured | Log-structured | Append-only | Append-only per tier | +| Compaction | Background merge | Background merge | N/A | MetaLog snapshot | +| Access tracking | None | None | None | Per-block EMA + bitset | +| Tier migration | Manual (column families) | Manual | N/A | Automatic (score-based) | +| WASM support | No | No | Limited | Full (trait-based I/O) | +| Tensor-aware | No | No | Schema-aware | Quantization-aware | diff --git a/docs/adr/temporal-tensor-store/ADR-019-tiered-quantization-formats.md b/docs/adr/temporal-tensor-store/ADR-019-tiered-quantization-formats.md new file mode 100644 index 000000000..cfe7f60a5 --- /dev/null +++ b/docs/adr/temporal-tensor-store/ADR-019-tiered-quantization-formats.md @@ -0,0 +1,878 @@ +# ADR-019: Tiered Quantization Formats for Temporal Tensor Store + 
+**Status**: Proposed +**Date**: 2026-02-08 +**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine +**Author**: System Architecture Team + +## Version History + +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-02-08 | Architecture Team | Initial proposal | + +--- + +## Abstract + +This ADR defines the concrete quantization formats, bit-packing layouts, and codec +interfaces for the five tiers of tensor storage established in ADR-017. Where ADR-017 +introduced the concept of access-frequency-driven quantization and temporal scale +reuse, this document specifies the exact byte-level formats for 8-bit (Tier 1 / Hot), +7-bit and 5-bit (Tier 2 / Warm), 3-bit (Tier 3 / Cold), and Compression-to-Zero +(Tier 0 / Absent). It also resolves two open design questions from ADR-017: whether +5-bit quantization is permitted within the warm tier, and how Tier 0 reads behave +when no reconstruction policy exists. + +The `codec_bits` module provides a single allocation-free bit packer/unpacker that +all sub-byte formats share. The `quant` module provides per-format quantize and +dequantize functions, with SIMD-accelerated `max_abs` on native targets and a +portable fallback for WASM. Rust trait interfaces are defined so that new bit widths +can be added without modifying the core codec. + +--- + +## 1. Context and Motivation + +### 1.1 Gap in ADR-017 + +ADR-017 established the tiered compression architecture and segment binary format +but left the per-tier quantization details at the algorithmic level. Implementers +need exact byte layouts to write interoperable encoders and decoders, particularly +for the sub-byte formats (7-bit, 5-bit, 3-bit) where values do not align on byte +boundaries. + +### 1.2 Sub-Byte Packing Complexity + +Standard 8-bit quantization maps trivially to `[u8]` storage. 
Sub-byte formats +require a bit-packing codec that can write and read arbitrary-width codes into a +byte stream without wasting bits. The codec must: + +- Handle bit widths 3, 5, and 7 (with 8 as a degenerate identity case). +- Operate without heap allocations (caller provides output slice). +- Be deterministic and platform-independent (little-endian byte order). +- Support WASM targets where SIMD is optional. + +### 1.3 Outlier Handling in 3-Bit + +At 3 bits per value, the quantization range is `[-3, +3]` (qmax = 3). Large +outliers in the tensor distribution can cause severe clamping. ADR-017 noted this +risk but did not specify a mitigation. This ADR introduces a two-level scale +option for Tier 3 that uses a 1-bit flag per value to select between a primary +scale (covering the majority of values) and a secondary scale (covering outliers), +while keeping the packed format compact. + +### 1.4 Tier 0 Semantics + +ADR-017 listed Compression-to-Zero as a future possibility. This ADR formalizes +it: Tier 0 stores no quantized data at all. Only metadata and an optional +`reconstruct_policy` survive. This enables aggressive memory reclamation for +tensors that are no longer accessed but may be reconstructable from other sources +(deltas, factors, or recomputation). + +### 1.5 Design Questions Resolved + +| Question | Resolution | +|----------|------------| +| Allow 5-bit within warm tier? | Yes. Dynamic downgrade from 7-bit to 5-bit when warm set exceeds a configurable byte cap (`warm_byte_cap`). | +| Tier 0 read semantics? | Return zeros by default. If a `reconstruct_policy` (Delta or Factor) exists, reconstruct from stored representation. | + +--- + +## 2. 
Decision + +We adopt the following five-tier quantization format hierarchy, each with a +well-defined byte layout, packing strategy, and error budget: + +| Tier | Name | Bits | Compression vs f32 | Use Case | +|------|------|------|-------------------|----------| +| 1 | Hot | 8 | 4.00x | Active tensors, full fidelity | +| 2a | Warm | 7 | 4.57x | Default warm, near-lossless | +| 2b | Warm-aggressive | 5 | 6.40x | Warm set exceeds `warm_byte_cap` | +| 3 | Cold | 3 | 10.67x | Archived tensors, bounded error | +| 0 | Absent | 0 | Infinite | No data stored; metadata only | + +All sub-byte formats share the `codec_bits` packer. All quantization formats use +symmetric per-block quantization with `scale = max_abs / qmax` stored as f32 per +block. The choice of f32 (rather than f16 as in ADR-017 segment headers) is +deliberate at this layer: the segment encoder may convert to f16 for storage, but +the quantizer operates in f32 for precision during the quantize/dequantize path. + +--- + +## 3. Detailed Design + +### 3.1 Tier 1: 8-Bit Quantization (Hot) + +**Algorithm**: Symmetric per-block quantization. + +``` +Given: block of N f32 values, block_size typically 64 or 128 + scale = max_abs(values) / 127 + q[i] = round(values[i] / scale) + q[i] = clamp(q[i], -127, +127) // i8 range + store: q as [i8; N] + scale as f32 +``` + +**Storage layout** (one block, block_size = 8 for illustration): + +``` +Byte offset: 0 1 2 3 4 5 6 7 8 9 10 11 + [ scale (f32, LE) ] [q0] [q1] [q2] [q3] [q4] [q5] [q6] [q7] + ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + 4 bytes 8 bytes (1 byte per i8 value) + +Total per block: 4 + block_size bytes +``` + +**Effective compression** (block_size = 64): + +``` +raw = 64 * 4 = 256 bytes +quant = 4 + 64 * 1 = 68 bytes +ratio = 256 / 68 = 3.76x (single block) +``` + +With temporal amortization (100 frames sharing scales): `256*100 / (4 + 64*100)` = 4.00x. 
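
The Tier 1 steps and the ratio arithmetic above can be sketched as follows. This is an illustrative sketch, not the `quant` crate's API: the function names `quantize_block_i8`, `dequantize_block_i8`, and `compression_ratio` are hypothetical, and returning a zero scale for an all-zero block is an assumption the ADR does not spell out.

```rust
/// Symmetric 8-bit quantization of one block: returns (scale, codes).
/// An all-zero block yields scale 0.0 and all-zero codes (assumed behavior).
pub fn quantize_block_i8(values: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        return (0.0, vec![0i8; values.len()]);
    }
    let scale = max_abs / 127.0;
    let codes = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, codes)
}

/// Inverse mapping: code times scale.
pub fn dequantize_block_i8(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&q| q as f32 * scale).collect()
}

/// Compression ratio versus f32 when `frames` blocks of `block_size`
/// values at `bits` per value share a single 4-byte f32 scale.
pub fn compression_ratio(block_size: usize, bits: usize, frames: usize) -> f64 {
    let raw_bytes = block_size * 4 * frames;
    let packed_per_frame = (block_size * bits + 7) / 8; // ceil(bits / 8)
    let stored_bytes = 4 + packed_per_frame * frames;
    raw_bytes as f64 / stored_bytes as f64
}
```

With `block_size = 64` and `bits = 8`, `compression_ratio` gives 256 / 68 for a single frame and approaches 4.0 as more frames share one scale. The same function covers the narrower widths by changing `bits`.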
+ +**Dequantize**: + +``` +values[i] = q[i] as f32 * scale +``` + +**Error bound**: rounding to the nearest code gives `max_error = scale / 2 = max_abs / (2 * 127)`. See Section 3.7 for full analysis. + +### 3.2 Tier 2a: 7-Bit Quantization (Warm) + +**Algorithm**: Symmetric per-block, 7-bit codes packed into a bitstream. + +``` +Given: block of N f32 values + scale = max_abs(values) / 63 // qmax = 2^(7-1) - 1 = 63 + q[i] = round(values[i] / scale) + q[i] = clamp(q[i], -63, +63) + u[i] = q[i] + 63 // bias to unsigned [0, 126], fits 7 bits + pack u[i] values using codec_bits at width=7 +``` + +**Bit-packing layout** (8 values packed into 7 bytes): + +``` +Values: u0 u1 u2 u3 u4 u5 u6 u7 +Bits: [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] [6..0] + 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits 7 bits + +Packed into 7 bytes (56 bits = 8 * 7 bits): + +Byte 0: [u0[6:0] | u1[0] ] = u0(7) + u1(1) = 8 bits + |<--- 7 bits --->|<1>| + +Byte 1: [u1[6:1] | u2[1:0]] = u1(6) + u2(2) = 8 bits + |<--- 6 bits --->|<-2->| + +Byte 2: [u2[6:2] | u3[2:0] ] = u2(5) + u3(3) = 8 bits + |<-- 5 bits -->|<--3-->| + +Byte 3: [u3[6:3] | u4[3:0] ] = u3(4) + u4(4) = 8 bits + |<- 4 bits ->|<--4--->| + +Byte 4: [u4[6:4] | u5[4:0] ] = u4(3) + u5(5) = 8 bits + |<-3->|<---- 5 bits ---->| + +Byte 5: [u5[6:5] | u6[5:0] ] = u5(2) + u6(6) = 8 bits + |<2>|<----- 6 bits ------>| + +Byte 6: [u6[6] | u7[6:0] ] = u6(1) + u7(7) = 8 bits + |1|<------- 7 bits ------->| + +Total: 7 bytes for 8 values = 0.875 bytes/value +``` + +**Storage per block** (block_size = 64): + +``` +scale: 4 bytes (f32) +data: ceil(64 * 7 / 8) = 56 bytes +total: 60 bytes +ratio: 256 / 60 = 4.27x +``` + +### 3.3 Tier 2b: 5-Bit Quantization (Warm Aggressive) + +**Algorithm**: Symmetric per-block, 5-bit codes.
+ +``` +Given: block of N f32 values + scale = max_abs(values) / 15 // qmax = 2^(5-1) - 1 = 15 + q[i] = round(values[i] / scale) + q[i] = clamp(q[i], -15, +15) + u[i] = q[i] + 15 // bias to unsigned [0, 30], fits 5 bits + pack u[i] values using codec_bits at width=5 +``` + +**Activation policy**: 5-bit is used instead of 7-bit when the total warm set +size exceeds `warm_byte_cap` (default: 64 MiB). The tier policy monitors +aggregate warm storage and downgrades from 7-bit to 5-bit for the least recently +accessed warm tensors until the cap is satisfied. + +**Bit-packing layout** (8 values packed into 5 bytes): + +``` +Values: u0 u1 u2 u3 u4 u5 u6 u7 +Bits: [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] [4..0] + 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits 5 bits + +Packed into 5 bytes (40 bits = 8 * 5 bits): + +Byte 0: [u0[4:0] | u1[2:0] ] = u0(5) + u1(3) = 8 bits + |<-- 5 bits -->|<--3-->| + +Byte 1: [u1[4:3] | u2[4:0] | u3[0]] = u1(2) + u2(5) + u3(1) = 8 bits + |<2>|<-- 5 bits -->|<1>| + +Byte 2: [u3[4:1] | u4[3:0] ] = u3(4) + u4(4) = 8 bits + |<-- 4 bits -->|<--4-->| + +Byte 3: [u4[4] | u5[4:0] | u6[1:0]] = u4(1) + u5(5) + u6(2) = 8 bits + |1|<-- 5 bits -->|<-2->| + +Byte 4: [u6[4:2] | u7[4:0] ] = u6(3) + u7(5) = 8 bits + |<-3->|<--- 5 bits --->| + +Total: 5 bytes for 8 values = 0.625 bytes/value +``` + +**Storage per block** (block_size = 64): + +``` +scale: 4 bytes (f32) +data: ceil(64 * 5 / 8) = 40 bytes +total: 44 bytes +ratio: 256 / 44 = 5.82x +``` + +### 3.4 Tier 3: 3-Bit Quantization (Cold) + +**Algorithm**: Symmetric per-block, 3-bit codes with optional two-level scale. 
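
Tiers 2a, 2b, and 3 (in standard mode) follow one recipe and differ only in `qmax`. The sketch below makes that explicit and recomputes the per-block sizes quoted in the storage tables. It is illustrative only: `qmax`, `quantize_block`, and `block_bytes` are hypothetical names, the codes are returned already biased to unsigned but not yet bit-packed, and the scale of 1.0 for an all-zero block is an assumption.

```rust
/// Largest code magnitude for a signed symmetric format of width `bits` (1..=8).
fn qmax(bits: u8) -> i16 {
    (1i16 << (bits - 1)) - 1
}

/// Quantize one block to biased unsigned codes in [0, 2 * qmax].
/// Hypothetical helper; the bit-packing step is omitted here.
fn quantize_block(values: &[f32], bits: u8) -> (f32, Vec<u8>) {
    let qm = qmax(bits) as f32;
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Assumption: an all-zero block uses scale 1.0 to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / qm };
    let codes = values
        .iter()
        .map(|v| ((v / scale).round().clamp(-qm, qm) + qm) as u8)
        .collect();
    (scale, codes)
}

/// Bytes for one standard block: a 4-byte f32 scale plus the packed codes.
fn block_bytes(block_size: usize, bits: u8) -> usize {
    4 + (block_size * bits as usize + 7) / 8
}

fn main() {
    for bits in [7u8, 5, 3] {
        let bytes = block_bytes(64, bits);
        let ratio = 256.0 / bytes as f32;
        println!("{bits}-bit: qmax = {}, {bytes} bytes/block, {ratio:.2}x", qmax(bits));
    }
    let (scale, codes) = quantize_block(&[1.0, -1.0, 0.0, 0.4], 3);
    println!("3-bit example: scale = {scale:.4}, biased codes = {codes:?}");
}
```

Keeping the bias step (`+ qmax`) next to the rounding makes it easy to see that every code fits in `bits` bits before it reaches the packer.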
+ +#### Standard Mode + +``` +Given: block of N f32 values + scale = max_abs(values) / 3 // qmax = 2^(3-1) - 1 = 3 + q[i] = round(values[i] / scale) + q[i] = clamp(q[i], -3, +3) + u[i] = q[i] + 3 // bias to unsigned [0, 6], fits 3 bits + pack u[i] values using codec_bits at width=3 +``` + +#### Two-Level Scale Mode (Outlier Handling) + +When the value distribution has outliers (values significantly larger than the +bulk of the distribution), a single scale wastes most of the 3-bit range on the +long tail. The two-level scale splits the range: + +``` +Given: block of N f32 values, outlier_fraction (default: 0.05) + sorted_abs = sort(|values|, descending) + outlier_count = ceil(N * outlier_fraction) + primary_max = sorted_abs[outlier_count] // excludes top 5% + secondary_max = sorted_abs[0] // full range + + primary_scale = primary_max / 3 // covers bulk values + secondary_scale = secondary_max / 3 // covers outliers + + For each value[i]: + if |value[i]| > primary_max: + flag[i] = 1 // use secondary scale + q[i] = round(value[i] / secondary_scale) + else: + flag[i] = 0 // use primary scale + q[i] = round(value[i] / primary_scale) + q[i] = clamp(q[i], -3, +3) + u[i] = q[i] + 3 + + store: primary_scale (f32) + secondary_scale (f32) + flag bits + packed codes +``` + +**Bit-packing layout** (8 values packed into 3 bytes): + +``` +Values: u0 u1 u2 u3 u4 u5 u6 u7 +Bits: [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] [2..0] + 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits 3 bits + +Packed into 3 bytes (24 bits = 8 * 3 bits): + +Byte 0: [u0[2:0] | u1[2:0] | u2[1:0] ] = u0(3) + u1(3) + u2(2) = 8 bits + |<-3->|<-3->|<2>| + +Byte 1: [u2[2] | u3[2:0] | u4[2:0] | u5[0]] = u2(1) + u3(3) + u4(3) + u5(1) = 8 bits + |1|<-3->|<-3->|1| + +Byte 2: [u5[2:1] | u6[2:0] | u7[2:0] ] = u5(2) + u6(3) + u7(3) = 8 bits + |<2>|<-3->|<-3->| + +Total: 3 bytes for 8 values = 0.375 bytes/value +``` + +**Two-level scale storage layout** (one block, block_size = 64): + +``` +Byte offset: 0 3 7 8 9 
... 15 16 ... + [primary_scale f32] [secondary_scale f32] [flag bytes ] [packed codes] + ~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~ + 4 bytes 4 bytes ceil(64/8)=8 ceil(64*3/8)=24 + +Total per block (two-level): 4 + 4 + 8 + 24 = 40 bytes +Total per block (standard): 4 + 24 = 28 bytes +ratio (standard): 256 / 28 = 9.14x +ratio (two-level): 256 / 40 = 6.40x +``` + +The two-level mode trades compression ratio for outlier fidelity. It is selected +automatically when the ratio `max_abs / median_abs` exceeds a configurable +threshold (default: 5.0), indicating a heavy-tailed distribution. + +### 3.5 Tier 0: Compression to Zero (Absent) + +**Algorithm**: No quantized data is stored. + +``` +Tier 0 representation: + metadata: TensorMeta (id, shape, dtype, timestamps) + reconstruct_policy: Option<ReconstructPolicy> + quantized_data: None + +enum ReconstructPolicy { + None, // reads return zeros + Delta { base_id: TensorId, delta: ... }, // reconstruct as base + delta + Factor { source_id: TensorId, ... }, // reconstruct via transformation +} +``` + +**Read semantics**: + +| `reconstruct_policy` | Behavior | +|----------------------|----------| +| `None` | Return a zero-filled tensor of the recorded shape. Fast-fail mode returns `Err(TierZeroNoPolicy)` instead. | +| `Delta` | Load base tensor, apply stored delta. May trigger recursive decompression if base is also tiered. | +| `Factor` | Load source tensor, apply stored transformation (scale, permutation, projection). | + +**Transition to Tier 0**: A tensor is eligible for Tier 0 when its tier score +drops below `absent_min_score` (default: 1) and it has not been accessed for +longer than `absent_age_threshold` (default: 24 hours). The transition is +irreversible without external data: once quantized data is discarded, only the +reconstruction policy (if any) can recover approximate values. + +### 3.6 Bit Packing Module: `codec_bits` + +The core packing and unpacking functions shared by all sub-byte formats. 
+ +```rust +/// Errors from bit codec operations. +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum CodecErr { + /// Output buffer too small. Contains the required size in bytes. + OutputTooSmall { required: usize }, + /// Input buffer too small for the declared number of values. + InputTooSmall { required: usize }, + /// Bit width must be in [1, 8]. + InvalidBitWidth { bits: u8 }, +} + +/// Pack `values.len()` signed codes into `out`, using `bits` bits per code. +/// +/// Each value in `values` is treated as a signed integer in `[-(2^(bits-1)-1), 2^(bits-1)-1]`. +/// It is biased to unsigned before packing: `u = v + (2^(bits-1) - 1)`. +/// +/// Returns the number of bytes written to `out`. +/// +/// # Errors +/// - `CodecErr::OutputTooSmall` if `out` cannot hold the packed data. +/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8. +pub fn pack_bits(values: &[i8], bits: u8, out: &mut [u8]) -> Result<usize, CodecErr> { + if bits == 0 || bits > 8 { + return Err(CodecErr::InvalidBitWidth { bits }); + } + let total_bits = values.len() as u64 * bits as u64; + let required = ((total_bits + 7) / 8) as usize; + if out.len() < required { + return Err(CodecErr::OutputTooSmall { required }); + } + + // Bias offset, computed in i16: `(1i8 << 7) - 1` would overflow when bits == 8. + let qmax: i16 = (1i16 << (bits - 1)) - 1; + let mask: u64 = (1u64 << bits) - 1; + let mut acc: u64 = 0; + let mut acc_bits: u32 = 0; + let mut pos: usize = 0; + + for &v in values { + let u = (v as i16 + qmax) as u64 & mask; + acc |= u << acc_bits; + acc_bits += bits as u32; + while acc_bits >= 8 { + out[pos] = (acc & 0xFF) as u8; + pos += 1; + acc >>= 8; + acc_bits -= 8; + } + } + // Flush remaining bits + if acc_bits > 0 { + out[pos] = (acc & 0xFF) as u8; + pos += 1; + } + Ok(pos) +} + +/// Unpack codes from `inp` into `out`, reading `bits` bits per code. +/// +/// Reads exactly `out.len()` values. Each unsigned code is unbiased back to signed: +/// `v = u - (2^(bits-1) - 1)`. 
+/// +/// Returns the number of bytes consumed from `inp`. 
+/// +/// # Errors +/// - `CodecErr::InputTooSmall` if `inp` does not contain enough data. +/// - `CodecErr::InvalidBitWidth` if `bits` is 0 or greater than 8. +pub fn unpack_bits(inp: &[u8], bits: u8, out: &mut [i8]) -> Result<usize, CodecErr> { + if bits == 0 || bits > 8 { + return Err(CodecErr::InvalidBitWidth { bits }); + } + let total_bits = out.len() as u64 * bits as u64; + let required = ((total_bits + 7) / 8) as usize; + if inp.len() < required { + return Err(CodecErr::InputTooSmall { required }); + } + + // Bias offset, computed in i16: `(1i8 << 7) - 1` would overflow when bits == 8. + let qmax: i16 = (1i16 << (bits - 1)) - 1; + let mask: u64 = (1u64 << bits) - 1; + let mut acc: u64 = 0; + let mut acc_bits: u32 = 0; + let mut byte_pos: usize = 0; + let mut val_pos: usize = 0; + + while val_pos < out.len() { + while acc_bits < bits as u32 { + acc |= (inp[byte_pos] as u64) << acc_bits; + acc_bits += 8; + byte_pos += 1; + } + let u = (acc & mask) as i16; + out[val_pos] = (u - qmax) as i8; + acc >>= bits; + acc_bits -= bits as u32; + val_pos += 1; + } + Ok(required) +} +``` + +**Properties**: + +- No heap allocations. Callers provide both input and output slices. +- Single bit writer / bit reader using a 64-bit accumulator. +- Deterministic little-endian byte order. +- `pack_bits` and `unpack_bits` are inverses: `unpack(pack(v)) == v` + for all valid inputs. + +### 3.7 Quant Module Functions + +```rust +/// Block-level quantization configuration. +pub struct QuantConfig { + pub block_size: usize, // elements per quantization block (default: 64) + pub two_level_threshold: f32, // max/median ratio to trigger two-level (default: 5.0) +} + +/// Quantized block result. +pub struct QuantizedBlock { + pub scale: f32, + pub secondary_scale: Option<f32>, // only for two-level 3-bit + pub flags: Option<Vec<u8>>, // 1-bit-per-value flags for two-level + pub codes: Vec<i8>, // signed quantized codes + pub bits: u8, +} + +/// Symmetric 8-bit quantization (Tier 1 - Hot). 
+/// +/// Quantizes each block of `block_size` values independently. 
+/// scale = max_abs(block) / 127 +/// q[i] = clamp(round(x[i] / scale), -127, 127) +pub fn quantize_s8( + values: &[f32], + config: &QuantConfig, +) -> Vec<QuantizedBlock>; + +/// Symmetric N-bit quantization (Tier 2/3 - Warm/Cold). +/// +/// `bits` must be one of: 7, 5, 3. +/// qmax = 2^(bits-1) - 1 +/// scale = max_abs(block) / qmax +/// q[i] = clamp(round(x[i] / scale), -qmax, qmax) +/// +/// For bits=3 and config.two_level_threshold exceeded: uses two-level scale. +pub fn quantize_bits( + values: &[f32], + bits: u8, + config: &QuantConfig, +) -> Vec<QuantizedBlock>; + +/// Dequantize a block back to f32 values. +/// +/// For standard mode: x'[i] = codes[i] as f32 * scale +/// For two-level mode: x'[i] = codes[i] as f32 * (if flags[i] then secondary_scale else scale) +pub fn dequantize(block: &QuantizedBlock) -> Vec<f32>; + +/// Compute the maximum absolute value across a slice. +/// +/// On native targets with `target_feature = "avx2"` or `target_feature = "neon"`: +/// uses SIMD intrinsics for 4-8x throughput. +/// On WASM with `target_feature = "simd128"` (optional): +/// uses wasm_simd128 intrinsics. +/// Fallback: portable scalar loop. 
+#[inline] +pub fn max_abs(values: &[f32]) -> f32; +``` + +**SIMD implementation sketch for `max_abs`** (AVX2): + +```rust +#[cfg(target_arch = "x86_64")] +#[target_feature(enable = "avx2")] +unsafe fn max_abs_avx2(values: &[f32]) -> f32 { + use std::arch::x86_64::*; + let abs_mask = _mm256_set1_ps(f32::from_bits(0x7FFF_FFFF)); // clears the sign bit + let mut vmax = _mm256_setzero_ps(); + let chunks = values.len() / 8; + + for i in 0..chunks { + let v = _mm256_loadu_ps(values.as_ptr().add(i * 8)); + let abs_v = _mm256_and_ps(v, abs_mask); + vmax = _mm256_max_ps(vmax, abs_v); + } + + // Horizontal max reduction + let hi128 = _mm256_extractf128_ps(vmax, 1); + let lo128 = _mm256_castps256_ps128(vmax); + let max128 = _mm_max_ps(hi128, lo128); + let shuf = _mm_movehdup_ps(max128); + let max64 = _mm_max_ps(max128, shuf); + let shuf2 = _mm_movehl_ps(max64, max64); + let max32 = _mm_max_ss(max64, shuf2); + let mut result = _mm_cvtss_f32(max32); + + // Handle remainder + for i in (chunks * 8)..values.len() { + result = result.max(values[i].abs()); + } + result +} +``` + +**WASM portable fallback**: + +```rust +#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))] +pub fn max_abs(values: &[f32]) -> f32 { + let mut m: f32 = 0.0; + for &v in values { + let a = v.abs(); + if a > m { + m = a; + } + } + m +} +``` + +When WASM SIMD is enabled via `target_feature = "simd128"`, a vectorized path +processes 4 f32 values per iteration using `v128` types. This is optional and +gated behind a cargo feature flag `wasm-simd`. + +### 3.8 Error Bound Analysis + +For symmetric quantization with bit width `B`, block scale `s`, and `qmax = 2^(B-1) - 1`, +where `s` (written `scale` in the tables of this section) denotes the block's +`max_abs`, so that the stored per-code factor from Sections 3.1-3.4 is `s / qmax`: + +``` +quantization_step = s / qmax +max_element_error = quantization_step / 2 (from rounding) +max_relative_error = 1 / (2 * qmax) (per element, worst case, relative to s) +rms_error = quantization_step / sqrt(12) (uniform quantization noise) +``` + +**Per-tier error bounds**: + +| Tier | Bits | qmax | Max Rel. Error | RMS Rel. 
Error | Max Abs. 
Error (scale=1.0) | +|------|------|------|---------------|----------------|---------------------------| +| Hot (8-bit) | 8 | 127 | 0.394% | 0.228% | 0.00394 | +| Warm (7-bit) | 7 | 63 | 0.794% | 0.458% | 0.00794 | +| Warm-agg (5-bit) | 5 | 15 | 3.333% | 1.925% | 0.03333 | +| Cold (3-bit, std) | 3 | 3 | 16.667% | 9.623% | 0.16667 | +| Cold (3-bit, 2-level) | 3 | 3 | 16.667% per scale | 9.623% | Reduced for bulk values | + +**Two-level scale improvement for 3-bit**: When 95% of values fall within +`primary_max` and outliers use `secondary_scale`: + +| Component | Fraction | Scale | Effective Max Error | +|-----------|----------|-------|-------------------| +| Bulk values (95%) | 0.95 | primary_scale (smaller) | 16.7% of primary_max | +| Outlier values (5%) | 0.05 | secondary_scale (larger) | 16.7% of secondary_max | + +The bulk values achieve much lower absolute error because `primary_scale` is +typically 3-10x smaller than the single-scale `scale`. The outliers retain the +same relative error but are fewer in number. + +**Drift compounding**: When drift tolerance is `d` (e.g., 10%), and a frame is +quantized with scales from an earlier frame, the effective max relative error +becomes `(1 + d) / (2 * qmax)`. For 8-bit with 10% drift: `1.1 / 254 = 0.433%`. + +**Cumulative error table with drift**: + +| Tier | Bits | No Drift | 10% Drift | 20% Drift | +|------|------|----------|-----------|-----------| +| Hot | 8 | 0.394% | 0.433% | 0.472% | +| Warm | 7 | 0.794% | 0.873% | 0.952% | +| Warm-agg | 5 | 3.333% | 3.667% | 4.000% | +| Cold | 3 | 16.667% | 18.333% | 20.000% | + +### 3.9 Complete Quantizer and Packer Traits + +```rust +/// Trait for quantization formats that can encode and decode tensor blocks. +pub trait TensorQuantizer { + /// The bit width of this quantizer. + fn bit_width(&self) -> u8; + + /// Quantize a block of f32 values into signed codes and scale(s). 
+ fn quantize_block( + &self, + values: &[f32], + config: &QuantConfig, + ) -> QuantizedBlock; + + /// Dequantize a block back to f32 values. + fn dequantize_block( + &self, + block: &QuantizedBlock, + out: &mut [f32], + ) -> Result<(), CodecErr>; + + /// Returns the packed byte size for `num_values` at this bit width, + /// excluding scale storage. + fn packed_data_size(&self, num_values: usize) -> usize { + (num_values * self.bit_width() as usize + 7) / 8 + } + + /// Returns total block storage size including scale(s) and flags. + fn block_storage_size(&self, block_size: usize) -> usize; +} + +/// Trait for bit-level packing codecs. +pub trait BitCodec { + /// Pack signed codes into a byte buffer. + fn pack( + &self, + codes: &[i8], + bits: u8, + out: &mut [u8], + ) -> Result<usize, CodecErr>; + + /// Unpack codes from a byte buffer. + fn unpack( + &self, + data: &[u8], + bits: u8, + out: &mut [i8], + ) -> Result<usize, CodecErr>; +} + +/// Standard implementation using the accumulator-based codec_bits functions. +pub struct StandardBitCodec; + +impl BitCodec for StandardBitCodec { + fn pack( + &self, + codes: &[i8], + bits: u8, + out: &mut [u8], + ) -> Result<usize, CodecErr> { + pack_bits(codes, bits, out) + } + + fn unpack( + &self, + data: &[u8], + bits: u8, + out: &mut [i8], + ) -> Result<usize, CodecErr> { + unpack_bits(data, bits, out) + } +} +``` + +### 3.10 Block Storage Summary Diagram + +``` +TIER 1 (8-bit): ++--------+-------+-------+-------+-----+-------+ +| scale | q[0] | q[1] | q[2] | ... 
| q[63] | +| f32 LE | i8 | i8 | i8 | | i8 | ++--------+-------+-------+-------+-----+-------+ + 4 bytes 1 1 1 1 = 68 bytes / block + +TIER 2a (7-bit): ++--------+--------------------------------------------+ +| scale | packed 7-bit codes (56 bytes for 64 vals) | +| f32 LE | bitstream, little-endian accumulator | ++--------+--------------------------------------------+ + 4 bytes ceil(64*7/8) = 56 bytes = 60 bytes / block + +TIER 2b (5-bit): ++--------+--------------------------------------------+ +| scale | packed 5-bit codes (40 bytes for 64 vals) | +| f32 LE | bitstream, little-endian accumulator | ++--------+--------------------------------------------+ + 4 bytes ceil(64*5/8) = 40 bytes = 44 bytes / block + +TIER 3 standard (3-bit): ++--------+--------------------------------------------+ +| scale | packed 3-bit codes (24 bytes for 64 vals) | +| f32 LE | bitstream, little-endian accumulator | ++--------+--------------------------------------------+ + 4 bytes ceil(64*3/8) = 24 bytes = 28 bytes / block + +TIER 3 two-level (3-bit): ++--------+--------+----------+-------------------------------+ +| pscale | sscale | flags | packed 3-bit codes | +| f32 LE | f32 LE | ceil(N/8)| bitstream | ++--------+--------+----------+-------------------------------+ + 4 4 8 bytes 24 bytes = 40 bytes / block + +TIER 0 (absent): ++--------------------------------------+ +| TensorMeta + ReconstructPolicy only | +| NO quantized data | ++--------------------------------------+ + variable (typically 32-128 bytes metadata) +``` + +--- + +## 4. Alternatives Considered + +### 4.1 4-Bit as the Warm Tier + +4-bit quantization (qmax = 7, 8.00x compression) is the most widely studied +format (GPTQ, AWQ). We considered using 4-bit instead of 7-bit for the warm +tier. 
**Rejected** because: (a) the jump from 8-bit to 4-bit is too large for +tensors that were recently hot, causing unnecessary quality loss; (b) 7-bit +provides a gentler step-down; (c) 5-bit is available as an intermediate when +memory pressure increases. + +### 4.2 Uniform 4-Bit Across All Non-Hot Tiers + +A simpler design with only two quantization levels (8-bit hot, 4-bit everything +else). **Rejected** because: (a) cold tensors waste 1 extra bit per value when +3-bit suffices; (b) no path to aggressive compression under memory pressure; +(c) loses the granularity that enables smooth quality degradation. + +### 4.3 Asymmetric Quantization for 3-Bit + +Using asymmetric quantization (with zero-point) for 3-bit to better utilize the +`[0, 7]` unsigned range when distributions are not centered. **Rejected** +because: (a) adds 4 bytes of zero-point storage per block; (b) requires an +additional subtraction in the dequantize path; (c) the two-level scale approach +handles asymmetric distributions more effectively by splitting the scale rather +than shifting the range. + +### 4.4 Lookup Table (Codebook) Quantization for Cold + +Using a small codebook (e.g., 8 centroids) instead of uniform 3-bit levels. +**Rejected** because: (a) requires a per-block or per-tensor codebook training +step that is expensive for streaming data; (b) codebook storage overhead is +comparable to scale storage but with higher decode complexity; (c) uniform +quantization is simpler to implement and reason about. + +### 4.5 No Two-Level Scale (Simpler 3-Bit) + +Omitting the two-level scale option entirely. **Considered but rejected** because +agent embedding tensors frequently exhibit heavy-tailed distributions where a few +dimensions carry disproportionate magnitude. Without two-level scale, these +outliers cause the single scale to be too large, wasting most of the 3-bit range +on the bulk of near-zero values. + +--- + +## 5. 
Acceptance Criteria + +### 5.1 Format Correctness + +- [ ] `pack_bits` followed by `unpack_bits` is a lossless round-trip for all + bit widths (3, 5, 7, 8) and all valid signed input ranges. +- [ ] `quantize_s8` followed by `dequantize` produces values within the + theoretical error bound (`scale / 2`, i.e. `max_abs / 254`) of the originals. +- [ ] `quantize_bits(7, ...)` followed by `dequantize` produces values within + `max_abs / 126` of the originals. +- [ ] `quantize_bits(5, ...)` followed by `dequantize` produces values within + `max_abs / 30` of the originals. +- [ ] `quantize_bits(3, ...)` followed by `dequantize` produces values within + `max_abs / 6` of the originals (standard mode). +- [ ] Two-level 3-bit mode activates when `max/median > two_level_threshold`. +- [ ] Tier 0 reads return zeros when `reconstruct_policy` is `None`. +- [ ] Tier 0 reads invoke reconstruction when a policy exists. + +### 5.2 Performance + +- [ ] `pack_bits` throughput >= 2 GB/s on native (AVX2-capable hardware). +- [ ] `unpack_bits` throughput >= 2 GB/s on native. +- [ ] `max_abs` with SIMD is >= 3x faster than the scalar fallback on 512+ element blocks. +- [ ] WASM `pack_bits` / `unpack_bits` throughput >= 500 MB/s (without SIMD). +- [ ] No heap allocations in `pack_bits`, `unpack_bits`, or `max_abs`. + +### 5.3 Storage Efficiency + +- [ ] 8-bit block storage: exactly `4 + block_size` bytes. +- [ ] 7-bit block storage: exactly `4 + ceil(block_size * 7 / 8)` bytes. +- [ ] 5-bit block storage: exactly `4 + ceil(block_size * 5 / 8)` bytes. +- [ ] 3-bit block storage (standard): exactly `4 + ceil(block_size * 3 / 8)` bytes. +- [ ] 3-bit block storage (two-level): exactly `8 + ceil(block_size / 8) + ceil(block_size * 3 / 8)` bytes. +- [ ] No padding bits between consecutive blocks in a segment. + +### 5.4 Dynamic Tier 2 Downgrade + +- [ ] When aggregate warm storage exceeds `warm_byte_cap`, the least recently + accessed warm tensors are re-encoded from 7-bit to 5-bit. 
+- [ ] The downgrade is reversible: if warm storage drops below + `warm_byte_cap * 0.8` (hysteresis), tensors can be re-promoted to 7-bit + on next access. + +--- + +## 6. Risks and Mitigations + +| Risk | Severity | Likelihood | Mitigation | +|------|----------|------------|------------| +| 3-bit two-level scale adds format complexity without sufficient accuracy gain for most distributions | Medium | Medium | Gate behind a cargo feature `two-level-cold`; default to standard 3-bit. Benchmark on real agent embeddings before enabling by default. | +| Dynamic 7-bit to 5-bit downgrade causes thrashing when warm set oscillates near the byte cap | Medium | Medium | Implement hysteresis (20% band). Only downgrade when above cap; only upgrade when below 80% of cap. Rate-limit downgrades to at most once per minute. | +| `pack_bits` accumulator overflow for large inputs | Low | Low | The accumulator never holds more than 15 bits: at most 7 bits remain after each flush, and one code adds at most 8 more before the next flush. This is far below the 64-bit capacity, so overflow is not possible regardless of input length. | +| Tier 0 reconstruction from Delta/Factor introduces unbounded latency | Medium | Low | Set a maximum reconstruction depth (default: 3). If the base tensor is also Tier 0, fail with `ReconstructionDepthExceeded` rather than recursing indefinitely. | +| WASM scalar `max_abs` is a bottleneck for large tensors | Low | High | Expected. The WASM SIMD feature flag provides 3-4x improvement. For non-SIMD targets, `max_abs` cost is small relative to the full quantize pipeline. | +| Block size mismatch between encoder and decoder | High | Low | Block size is stored in the segment header (ADR-017 format). Decoder reads it from the header rather than assuming a default. | + +--- + +## 7. References + +1. ADR-017: Temporal Tensor Compression with Tiered Quantization. RuVector Architecture Team, 2026. +2. ADR-018: Block-Based Storage Engine for Temporal Tensor Segments (forthcoming). +3. Frantar, E., et al. 
"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. +4. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. +5. Kim, S., et al. "SqueezeLLM: Dense-and-Sparse Quantization." ICML 2024. +6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. +7. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015. +8. IEEE 754-2019. "IEEE Standard for Floating-Point Arithmetic." +9. Lemire, D. and Boytsov, L. "Decoding billions of integers in milliseconds through vectorized bit packing." Software: Practice and Experience, 2015. +10. WebAssembly SIMD Proposal. https://github.com/WebAssembly/simd. Finalized 2023. diff --git a/docs/adr/temporal-tensor-store/ADR-020-temporal-scoring-tier-migration.md b/docs/adr/temporal-tensor-store/ADR-020-temporal-scoring-tier-migration.md new file mode 100644 index 000000000..59b12f31d --- /dev/null +++ b/docs/adr/temporal-tensor-store/ADR-020-temporal-scoring-tier-migration.md @@ -0,0 +1,1250 @@ +# ADR-020: Temporal Scoring and Tier Migration Algorithm + +**Status**: Proposed +**Date**: 2026-02-08 +**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine +**Author**: System Architecture Team + +## Version History + +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-02-08 | Architecture Team | Initial proposal | + +--- + +## Abstract + +This ADR specifies the scoring algorithm, tier migration logic, and budgeted +maintenance pass that govern how compressed tensor blocks move between storage +tiers in the Temporal Tensor Store. It supersedes the simple +`access_count * 1024 / age` heuristic from ADR-017 with a composite score +that blends an exponential moving average (EMA) of access rate, a sliding-window +popularity bitset, and an exponential recency function. 
Hysteresis margins and +minimum residency constraints prevent pathological tier thrashing. A tick-driven +maintenance pass processes tier transitions within configurable byte and CPU +budgets, producing a deterministic witness log for every decision. + +--- + +## 1. Context and Problem Statement + +### 1.1 Limitations of the ADR-017 Score + +ADR-017 introduced a tier score of `access_count * 1024 / (now - last_access + 1)`. +This formula has three weaknesses: + +1. **Monotonic accumulation**: `access_count` never decays. A block accessed + 10,000 times a year ago and never since still scores high until `age` grows + large enough to dominate. This delays demotion by hours or days. + +2. **No temporal locality signal**: Two blocks with identical total counts but + different access patterns (bursty vs. uniform) receive the same score. Bursty + access often predicts near-future reuse and should be promoted faster. + +3. **No thrashing protection**: A block sitting exactly on a tier boundary + oscillates between tiers on every tick, wasting compression and decompression + cycles. + +### 1.2 Requirements for the Replacement + +| Requirement | Rationale | +|-------------|-----------| +| Decay old accesses | Blocks untouched for long periods must drain to cold | +| Detect bursts | Recent concentrated access should promote aggressively | +| Prevent thrashing | Tier transitions must have hysteresis and residency floors | +| Budget-bounded | Maintenance must respect per-tick byte and CPU limits | +| Deterministic | Same event sequence must produce identical tier decisions | +| Configurable | Operators must tune all weights, thresholds, and decay constants | + +### 1.3 Design Constraints + +- The scoring function runs on the hot path (every `touch` call) and must + complete in under 50ns on x86-64. +- The maintenance pass runs once per tick (configurable, default 100ms) and must + process candidate blocks within its CPU budget without stalling ingest. 
+- All floating-point operations use `f32` to stay WASM-compatible (no `f64` + dependency) and to match the existing `tier_policy.rs` types. + +--- + +## 2. Decision + +### 2.1 Replace the ADR-017 Score with a Composite Three-Signal Score + +Adopt a weighted composite score that combines three independent signals, each +capturing a different temporal property of access behavior. Protect tier +transitions with hysteresis margins and minimum residency enforcement. + +--- + +## 3. Detailed Design + +### 3.1 Block Metadata State + +Every block carries the following metadata fields, updated on each access: + +```rust +pub struct BlockMeta { + pub tensor_id: u64, + pub block_index: u32, + + // --- Access tracking --- + pub last_access_at: u64, // Tick timestamp of most recent access + pub access_count: u64, // Saturating total access count + pub ema_rate: f32, // Exponential moving average of access rate + pub window: u64, // 64-bit sliding window bitset + + // --- Tier state --- + pub current_tier: u8, // 0=absent, 1=Tier1(8-bit), 2=Tier2(5/7-bit), 3=Tier3(3-bit) + pub tier_age: u32, // Ticks spent in current tier since last transition + pub last_score: f32, // Cached score from most recent evaluation + pub checksum: u32, // CRC32 for corruption detection +} +``` + +### 3.2 State Updates on Each Access (Touch) + +On every read or write to a block, the `touch` function updates metadata +atomically. No locks are needed because the Temporal Tensor Store is +single-writer per block (enforced by the block-based storage engine from +ADR-018). + +```rust +/// Update block metadata on access. +/// +/// Called on every read or write. Must complete in <50ns. +pub fn touch(policy: &TierPolicy, now: u64, m: &mut BlockMeta) { + // 1. Timestamp and count + m.last_access_at = now; + m.access_count = m.access_count.saturating_add(1); + + // 2. Sliding window: shift left by 1, set LSB to 1 + // Each bit represents one tick; 1 = accessed, 0 = not accessed. 
+ m.window = (m.window << 1) | 1; + + // 3. EMA update: instant = 1.0 because this tick had an access + // ema_new = alpha * instant + (1 - alpha) * ema_old + m.ema_rate = policy.alpha * 1.0 + (1.0 - policy.alpha) * m.ema_rate; +} +``` + +On ticks where a block is **not** accessed, the EMA decays passively during the +maintenance pass: + +```rust +/// Passive decay for blocks not accessed this tick. +fn decay_ema(policy: &TierPolicy, m: &mut BlockMeta) { + // instant = 0.0 (no access this tick) + m.ema_rate = (1.0 - policy.alpha) * m.ema_rate; + + // Shift window without setting LSB + m.window <<= 1; +} +``` + +**Complexity**: O(1) per call. Three integer ops, one shift, two FMA-equivalent +f32 ops. Benchmarks show <20ns on x86-64 and <40ns in WASM. + +### 3.3 Score Computation + +The composite score S blends three signals, each normalized to the [0, 1] range +before weighting: + +``` +S = w_ema * ema_access_rate + w_pop * popcount(window) / 64 + w_rec * recency(now - last_access_at) +``` + +In Rust: + +```rust +/// Compute the composite tier score for a block. +pub fn compute_score(policy: &TierPolicy, now: u64, m: &BlockMeta) -> f32 { + // Signal 1: EMA access rate (already in [0, 1] for reasonable alpha) + let sig_ema = m.ema_rate; + + // Signal 2: Sliding window popularity, normalized to [0, 1] + let pop = m.window.count_ones() as f32; // popcount intrinsic + let sig_pop = pop / 64.0; + + // Signal 3: Exponential recency decay + let delta_t = (now.saturating_sub(m.last_access_at)) as f32; + let sig_rec = fast_exp_neg(delta_t / policy.tau); + + // Weighted sum + policy.w_ema * sig_ema + policy.w_pop * sig_pop + policy.w_rec * sig_rec +} +``` + +#### 3.3.1 Signal Descriptions + +| Signal | Symbol | Range | Property | +|--------|--------|-------|----------| +| EMA rate | `sig_ema` | [0, 1] | Smooth estimate of recent access frequency. High alpha = responsive to bursts. Low alpha = stable long-term average. 
| +| Window popularity | `sig_pop` | [0, 1] | Fraction of the last 64 ticks with at least one access. Captures breadth of recent usage. | +| Recency | `sig_rec` | (0, 1] | Exponential decay from last access. Drops rapidly for stale blocks. | + +#### 3.3.2 Why Three Signals + +No single signal captures all relevant behavior: + +- **EMA alone** cannot distinguish a block accessed once per tick for 64 ticks + from one accessed 64 times in a single tick then idle. Both converge to + similar EMA values. +- **Popcount alone** is binary per tick and ignores access intensity within + a tick. +- **Recency alone** has no memory of historical access patterns; a single + recent touch fully restores the score regardless of history. + +The composite score captures intensity (EMA), breadth (popcount), and freshness +(recency) as orthogonal axes. Default weights emphasize recency to ensure +prompt demotion of stale data. + +### 3.4 Recency Function and Fast Exponential Approximation + +The ideal recency function is: + +``` +r(delta_t) = exp(-delta_t / tau) +``` + +where `tau` is the characteristic decay time in ticks. For `tau = 100`, a block +untouched for 100 ticks decays to `1/e ~ 0.368`; at 200 ticks it decays to +`0.135`; at roughly 460 ticks it reaches 0.01. + +#### 3.4.1 Fast Approximation via Rational Function + +For the maintenance pass, which evaluates potentially thousands of blocks per +tick, a full `f32::exp` call (~15ns, involves range reduction and polynomial +evaluation) is too expensive. We use a rational approximation: + +```rust +/// Fast approximation of exp(-x) for x >= 0. +/// +/// Uses the Pade(0,1) approximant: exp(-x) ~ 1 / (1 + x) +/// It overestimates for every x > 0: the absolute error peaks at about 0.20 +/// near x = 2.5, and the relative error grows without bound as x increases +/// (acceptable for coarse scoring, not for numerics). +/// +/// For higher accuracy, use the LUT approach below. 
+fn fast_exp_neg_pade(x: f32) -> f32 { + 1.0 / (1.0 + x.max(0.0)) +} +``` + +#### 3.4.2 LUT with Linear Interpolation (Recommended) + +For production use, a 256-entry lookup table with linear interpolation provides +<0.5% error across the useful range: + +```rust +/// 256-entry LUT for exp(-x) over [0, 8]. +/// Beyond x=8, exp(-x) < 0.00034, effectively zero for scoring. +const EXP_LUT_SIZE: usize = 256; +const EXP_LUT_MAX_X: f32 = 8.0; + +static EXP_LUT: [f32; EXP_LUT_SIZE] = { + let mut lut = [0.0f32; EXP_LUT_SIZE]; + let mut i = 0; + while i < EXP_LUT_SIZE { + let x = (i as f32) * EXP_LUT_MAX_X / (EXP_LUT_SIZE as f32 - 1.0); + // compile-time evaluation via const fn not available for exp; + // in practice, initialize at startup or use a build script. + lut[i] = 0.0; // placeholder + i += 1; + } + lut +}; + +/// Fast exp(-x) via LUT with linear interpolation. +/// x is clamped to [0, EXP_LUT_MAX_X]. +fn fast_exp_neg(x: f32) -> f32 { + if x <= 0.0 { + return 1.0; + } + if x >= EXP_LUT_MAX_X { + return 0.0; + } + let t = x * (EXP_LUT_SIZE as f32 - 1.0) / EXP_LUT_MAX_X; + let idx = t as usize; + let frac = t - idx as f32; + + if idx + 1 >= EXP_LUT_SIZE { + return EXP_LUT[EXP_LUT_SIZE - 1]; + } + + // Linear interpolation between adjacent LUT entries + EXP_LUT[idx] * (1.0 - frac) + EXP_LUT[idx + 1] * frac +} +``` + +**LUT initialization** (called once at startup): + +```rust +fn init_exp_lut(lut: &mut [f32; EXP_LUT_SIZE]) { + for i in 0..EXP_LUT_SIZE { + let x = (i as f32) * EXP_LUT_MAX_X / (EXP_LUT_SIZE as f32 - 1.0); + lut[i] = (-x).exp(); // std exp, only called 256 times + } +} +``` + +**Error analysis** for LUT interpolation: + +| x range | Max absolute error | Max relative error | +|---------|-------------------|--------------------| +| [0, 1] | 0.0005 | 0.08% | +| [1, 3] | 0.0003 | 0.15% | +| [3, 6] | 0.0001 | 0.42% | +| [6, 8] | 0.00002 | 0.38% | + +### 3.5 Tier Selection by Thresholds + +The score is compared against three thresholds to select the target 
tier: + +``` +if S >= t1 then Tier1 (8-bit, hot) +elif S >= t2 then Tier2 (7-bit or 5-bit, warm) +elif S >= t3 then Tier3 (3-bit, cold) +else Tier0 (absent / evicted) +``` + +``` +Score axis (0.0 to 1.0) +| | +0.0 t3 t2 t1 1.0 +|----Tier0----|---Tier3---|----Tier2----|---------Tier1-----------| + (absent) (3-bit) (5/7-bit) (8-bit) +``` + +Default threshold values: + +| Parameter | Default | Rationale | +|-----------|---------|-----------| +| `t1` | 0.70 | Requires strong signal on at least two axes to qualify as hot | +| `t2` | 0.35 | Moderate recent activity; still worth keeping at reduced precision | +| `t3` | 0.10 | Minimal recent activity; compress aggressively or evict | + +### 3.6 Hysteresis to Prevent Thrashing + +A block sitting near a tier boundary may oscillate if the score fluctuates +around the threshold. This causes repeated compression/decompression cycles +(thrashing), each of which consumes CPU and I/O bandwidth. + +#### 3.6.1 Hysteresis Margins + +Tier transitions require the score to exceed the threshold by a configurable +margin: + +``` +Upgrade: S > threshold_upper + hysteresis +Downgrade: S < threshold_lower - hysteresis +``` + +This creates a dead zone around each boundary where no transition occurs: + +``` +Score axis around threshold t2 = 0.35, hysteresis = 0.05: + + Downgrade zone Dead zone (no transition) Upgrade zone + <------|--------|-------------|-------------|-----------|--------> + 0.25 0.30 0.35 0.40 0.45 + ^ ^ + Tier3 if below Tier2 if above +``` + +In Rust: + +```rust +/// Determine if a tier transition should occur, accounting for hysteresis. 
+pub fn should_transition(
+    policy: &TierPolicy,
+    current_tier: u8,
+    score: f32,
+) -> Option<u8> {
+    let h = policy.hysteresis;
+
+    // Transitions are single-step. Precision order is Tier1 > Tier2 > Tier3 > Tier0,
+    // so Tier0 (absent) can only be promoted to Tier3 and Tier3 can only be
+    // promoted to Tier2, even when the score already exceeds t1 + h.
+    match current_tier {
+        // Upgrades (towards higher precision)
+        0 if score > policy.t3 + h => Some(3), // Promote from absent to Tier3
+        3 if score > policy.t2 + h => Some(2), // Promote to Tier2
+        2 if score > policy.t1 + h => Some(1), // Promote to Tier1
+
+        // Downgrades (towards lower precision)
+        1 if score < policy.t1 - h => Some(2), // Demote to Tier2
+        2 if score < policy.t2 - h => Some(3), // Demote to Tier3
+        3 if score < policy.t3 - h => Some(0), // Evict to Tier0
+
+        _ => None, // No transition; remain in current tier
+    }
+}
+```
+
+#### 3.6.2 Minimum Residency Enforcement
+
+Even with hysteresis, a rapidly changing workload could cause transitions faster
+than the system can absorb. The `min_residency` parameter sets a floor on the
+number of ticks a block must remain in its current tier before any transition
+is permitted:
+
+```rust
+fn is_eligible_for_transition(policy: &TierPolicy, m: &BlockMeta) -> bool {
+    m.tier_age >= policy.min_residency
+}
+```
+
+**Recommended values**:
+
+| Workload | `min_residency` | Rationale |
+|----------|-----------------|-----------|
+| Real-time inference | 10 ticks (1s at 100ms tick) | Fast adaptation, tolerate some thrashing |
+| Batch processing | 100 ticks (10s) | Stability preferred over responsiveness |
+| Archival | 1000 ticks (100s) | Very conservative, minimize I/O |
+
+#### 3.6.3 Tier Transition State Machine
+
+```
+             S > t3 + h             S > t2 + h              S > t1 + h
+  +--------+ ----------> +--------+ ----------> +---------+ ----------> +--------+
+  | Tier0  |             | Tier3  |             | Tier2   |             | Tier1  |
+  | absent |             | 3-bit  |             | 5/7-bit |             | 8-bit  |
+  +--------+ <---------- +--------+ <---------- +---------+ <---------- +--------+
+             S < t3 - h             S < t2 - h              S < t1 - h
+
+  Every edge additionally requires age >= min_residency.
+  A Tier3 block with S > t1 + h reaches Tier1 via Tier2, one step at a time.
+```
+
+**Transitions are always single-step**: a block in Tier3 cannot jump directly
+to Tier1. It must pass through Tier2 first. This prevents large recompression
+jumps and gives the system time to validate intermediate states. Each step
+resets `tier_age` to 0, so the block must again satisfy `min_residency` before
+its next transition.
+
+### 3.7 TierPolicy Configuration
+
+All scoring and migration parameters are consolidated in a single configuration
+structure:
+
+```rust
+pub struct TierPolicy {
+    // --- Scoring weights ---
+    pub alpha: f32,          // EMA smoothing factor (0, 1). Higher = more responsive.
+    pub tau: f32,            // Recency decay time constant (in ticks).
+    pub w_ema: f32,          // Weight for EMA access rate signal.
+    pub w_pop: f32,          // Weight for popcount window signal.
+    pub w_rec: f32,          // Weight for exponential recency signal.
+
+    // --- Tier thresholds ---
+    pub t1: f32,             // Score threshold for Tier1 (hot, 8-bit).
+    pub t2: f32,             // Score threshold for Tier2 (warm, 5/7-bit).
+    pub t3: f32,             // Score threshold for Tier3 (cold, 3-bit).
+
+    // --- Anti-thrashing ---
+    pub hysteresis: f32,     // Margin added/subtracted from thresholds.
+    pub min_residency: u32,  // Minimum ticks before tier transition allowed.
+
+    // --- Storage ---
+    pub max_delta_chain: u8, // Max delta segments before full rewrite (from ADR-018).
+    pub block_bytes: usize,  // Block size in bytes (from ADR-018).
+} +``` + +**Default configuration**: + +```rust +impl Default for TierPolicy { + fn default() -> Self { + Self { + alpha: 0.1, + tau: 100.0, + w_ema: 0.3, + w_pop: 0.2, + w_rec: 0.5, + t1: 0.70, + t2: 0.35, + t3: 0.10, + hysteresis: 0.05, + min_residency: 50, + max_delta_chain: 4, + block_bytes: 4096, + } + } +} +``` + +**Weight normalization**: The weights `w_ema + w_pop + w_rec` should sum to 1.0 +so that the score range is [0, 1]. The system asserts this at construction time +with a tolerance of 1e-6. + +### 3.8 Budgeted Maintenance Pass (Tick Handler) + +The maintenance pass executes once per tick. It is the sole location where tier +transitions are enacted. The `touch` function only updates metadata; it never +triggers compression or decompression directly. This separation ensures that +ingest latency is bounded and independent of maintenance costs. + +#### 3.8.1 Inputs + +```rust +pub struct TickBudget { + pub byte_budget: usize, // Max bytes of compression/decompression this tick + pub cpu_budget: u32, // Max block evaluations this tick +} +``` + +#### 3.8.2 Candidate Selection + +Candidates are blocks whose state may require action: + +| Condition | Action | +|-----------|--------| +| Score crossed a boundary (accounting for hysteresis) | Tier transition | +| `tier_age > max_age` | Forced re-evaluation (prevents stale metadata) | +| `checksum` mismatch detected | Repair via re-read and recompression | +| `current_tier == 0` and score > t3 + h | Promotion from absent | + +#### 3.8.3 Priority Ordering + +Candidates are sorted into two queues processed in order: + +**Upgrade queue** (highest priority): sorted by score descending (highest +score delta first). Rationale: promoting a heavily-accessed block reduces +read amplification for many future accesses. + +**Downgrade queue** (lower priority): sorted by score ascending (lowest score +first). Rationale: demoting the coldest blocks first frees the most byte +budget for hot tier capacity. 
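The two queue orderings can be sketched as comparator functions. This is only an illustration: the `Candidate` struct and its field names are hypothetical, and the sketch assumes ties are broken by `(tensor_id, block_index)` ascending, as the determinism rules in Section 3.10 require. It uses `f32::total_cmp` so that the comparison is a total order even if a score is NaN.

```rust
/// Hypothetical candidate record; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    tensor_id: u64,
    block_index: u32,
    score: f32,
}

/// Upgrade queue: highest score first.
/// Ties are broken by (tensor_id, block_index) in ascending order.
fn sort_upgrades(queue: &mut [Candidate]) {
    queue.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| (a.tensor_id, a.block_index).cmp(&(b.tensor_id, b.block_index)))
    });
}

/// Downgrade queue: lowest score first, with the same tie-break.
fn sort_downgrades(queue: &mut [Candidate]) {
    queue.sort_by(|a, b| {
        a.score
            .total_cmp(&b.score)
            .then_with(|| (a.tensor_id, a.block_index).cmp(&(b.tensor_id, b.block_index)))
    });
}
```

With an explicit tie-break the processing order does not depend on how candidates were collected, which matters when the budget runs out partway through a queue.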
+
+Corruption repairs bypass both queues and are processed first unconditionally.
+
+#### 3.8.4 Processing Loop
+
+```rust
+pub fn run_maintenance_tick(
+    policy: &TierPolicy,
+    budget: &TickBudget,
+    now: u64,
+    blocks: &mut [BlockMeta],
+    witness_log: &mut Vec<WitnessEntry>,
+) {
+    let mut bytes_used: usize = 0;
+    let mut ops_used: u32 = 0;
+
+    // Phase 0: Passive EMA decay for all blocks not accessed this tick
+    for m in blocks.iter_mut() {
+        if m.last_access_at != now {
+            decay_ema(policy, m);
+        }
+        m.tier_age = m.tier_age.saturating_add(1);
+    }
+
+    // Phase 1: Score computation and candidate collection
+    let mut upgrades: Vec<(usize, f32, u8)> = Vec::new(); // (index, score, target_tier)
+    let mut downgrades: Vec<(usize, f32, u8)> = Vec::new();
+    let mut repairs: Vec<usize> = Vec::new();
+
+    for (i, m) in blocks.iter_mut().enumerate() {
+        let score = compute_score(policy, now, m);
+        m.last_score = score;
+
+        // Check corruption
+        if needs_repair(m) {
+            repairs.push(i);
+            continue;
+        }
+
+        // Check eligibility
+        if !is_eligible_for_transition(policy, m) {
+            continue;
+        }
+
+        if let Some(target) = should_transition(policy, m.current_tier, score) {
+            // Tier0 (absent) has the lowest tier number but also the lowest
+            // precision, so numeric order alone cannot classify the move.
+            let is_upgrade =
+                target != 0 && (m.current_tier == 0 || target < m.current_tier);
+            if is_upgrade {
+                upgrades.push((i, score, target));
+            } else {
+                downgrades.push((i, score, target));
+            }
+        }
+    }
+
+    // Phase 2: Sort queues
+    upgrades.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
+    downgrades.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal));
+
+    // Phase 3: Process repairs (exempt from the byte budget)
+    for idx in repairs {
+        if ops_used >= budget.cpu_budget { break; }
+        let cost = execute_repair(&mut blocks[idx]);
+        bytes_used += cost;
+        ops_used += 1;
+        witness_log.push(WitnessEntry::repair(now, &blocks[idx]));
+    }
+
+    // Phase 4: Process upgrades (highest score first)
+    for (idx, score, target) in upgrades {
+        if ops_used >= budget.cpu_budget || bytes_used >= budget.byte_budget {
+            break;
+        }
+        let cost = execute_tier_transition(&mut blocks[idx], target);
+        bytes_used += cost;
+        ops_used += 1;
+        blocks[idx].current_tier = target;
+        blocks[idx].tier_age = 0;
+        witness_log.push(WitnessEntry::transition(now, &blocks[idx], score, target));
+    }
+
+    // Phase 5: Process downgrades (lowest score first)
+    for (idx, score, target) in downgrades {
+        if ops_used >= budget.cpu_budget || bytes_used >= budget.byte_budget {
+            break;
+        }
+        let cost = execute_tier_transition(&mut blocks[idx], target);
+        bytes_used += cost;
+        ops_used += 1;
+        blocks[idx].current_tier = target;
+        blocks[idx].tier_age = 0;
+        witness_log.push(WitnessEntry::transition(now, &blocks[idx], score, target));
+    }
+}
+```
+
+#### 3.8.5 Witness Log
+
+Every maintenance decision emits a structured log entry for auditability:
+
+```rust
+pub struct WitnessEntry {
+    pub tick: u64,
+    pub tensor_id: u64,
+    pub block_index: u32,
+    pub action: WitnessAction,  // Transition | Repair | Evict | Skip
+    pub score: f32,
+    pub from_tier: u8,
+    pub to_tier: u8,
+    pub reason: &'static str,
+}
+```
+
+The witness log enables post-hoc analysis of tier decisions, capacity planning,
+and regression testing of policy changes.
+
+#### 3.8.6 Maintenance Pass Flow Diagram
+
+```
+         Tick Event (periodic)
+                  |
+                  v
+  +------------------------------+
+  | Phase 0: Passive EMA decay   |
+  | for non-accessed blocks;     |
+  | increment tier_age           |
+  +------------------------------+
+                  |
+                  v
+  +------------------------------+
+  | Phase 1: Compute scores      |
+  | Classify into:               |
+  |   - repairs[]                |
+  |   - upgrades[]               |
+  |   - downgrades[]             |
+  +------------------------------+
+                  |
+                  v
+  +------------------------------+
+  | Phase 2: Sort queues         |
+  | upgrades: by score DESC      |
+  | downgrades: by score ASC     |
+  +------------------------------+
+                  |
+                  v
+  +------------------------------+
+  | Phase 3: Process repairs     |
+  | (unconditional, first)       |
+  +------------------------------+
+                  |  budget ok?
+                  v
+  +------------------------------+
+  | Phase 4: Process upgrades    |
+  | highest score first          |
+  +------------------------------+
+                  |  budget ok?
+                  v
+  +------------------------------+
+  | Phase 5: Process downgrades  |
+  | lowest score first           |
+  +------------------------------+
+                  |
+                  v
+  +------------------------------+
+  | Emit witness log entries     |
+  | for all actions taken        |
+  +------------------------------+
+
+  Once the budget is exhausted, the remaining phases perform no work.
+```
+
+### 3.9 Score Sensitivity Analysis
+
+#### 3.9.1 EMA Response Curve
+
+The EMA signal responds to access pattern changes with a time constant of
+`1/alpha` ticks. For alpha = 0.1:
+
+```
+After sustained access (1 access per tick):
+  ema converges to alpha / (1 - (1-alpha)) = 1.0
+
+After access stops (from steady state of 1.0):
+  ema(t) = (1 - alpha)^t
+  t=1:  0.90
+  t=5:  0.59
+  t=10: 0.35
+  t=20: 0.12
+  t=30: 0.04
+  t=50: 0.005
+```
+
+**Derivation**: At steady state with one access per tick, the EMA satisfies
+`ema = alpha * 1 + (1-alpha) * ema`, giving `ema = 1.0`. After access ceases,
+each tick multiplies by `(1-alpha)`, so `ema(t) = (1-alpha)^t`. The half-life
+is `ln(2) / ln(1/(1-alpha))`. For alpha=0.1, half-life ~ 6.6 ticks.
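The decay table and half-life above can be reproduced with two small helpers. This is a minimal sketch; the function names `ema_after_idle` and `ema_half_life` are illustrative and not part of the proposed API.

```rust
/// EMA value `t` ticks after accesses stop, starting from `ema0`.
/// Each idle tick multiplies the EMA by (1 - alpha).
fn ema_after_idle(ema0: f32, alpha: f32, t: u32) -> f32 {
    ema0 * (1.0 - alpha).powi(t as i32)
}

/// Half-life in ticks: ln(2) / ln(1 / (1 - alpha)).
fn ema_half_life(alpha: f32) -> f32 {
    std::f32::consts::LN_2 / (1.0 / (1.0 - alpha)).ln()
}
```

For `alpha = 0.1`, `ema_after_idle(1.0, 0.1, 10)` evaluates to about 0.35 and `ema_half_life(0.1)` to about 6.6 ticks, matching the figures quoted above.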
+ +#### 3.9.2 Recency Decay Curve + +For tau = 100: + +``` +r(delta_t) = exp(-delta_t / 100) + +delta_t: 0 10 50 100 200 300 500 1000 +r: 1.000 0.905 0.607 0.368 0.135 0.050 0.007 0.000 +``` + +#### 3.9.3 Composite Score Trajectories + +**Scenario A: Block accessed steadily then abandoned** + +``` +Score +1.0 |******* + | **** + | *** +0.7 |-- t1 -------***----------- (Tier1 threshold) + | *** +0.35|-- t2 ------------***------ (Tier2 threshold) + | **** +0.10|-- t3 ------------------*** (Tier3 threshold) + | *** +0.0 +------|------|------|-------> Ticks after last access + 0 10 50 100 200 +``` + +**Scenario B: Bursty access (10 accesses in tick 0, then silence)** + +``` +Score +1.0 |* + | * + | * +0.7 |-- **-------------------------- (Tier1) + | ** +0.35|------***---------------------- (Tier2) + | *** +0.10|-----------****---------------- (Tier3) + | ******* +0.0 +------|------|------|-------> Ticks + 0 10 50 100 +``` + +Burst raises the initial EMA to `alpha * 1 + (1-alpha) * (alpha * 1 + ...) ~ +alpha * 10` (clamped), but decays at the same rate. The window signal remains +1/64 after tick 1, providing differentiation from steady access. + +**Scenario C: Periodic access (every 20 ticks)** + +``` +Score +1.0 | + | + | +0.7 |-------------------------------------- + | * * * * * +0.5 |** ** ** ** ** ** ** ** ** ** (oscillates 0.3--0.6) +0.35|-------------------------------------- + | +0.10|-------------------------------------- +0.0 +------|------|------|------|-------> Ticks + 0 20 40 60 80 +``` + +The block stabilizes in Tier2. Hysteresis of 0.05 prevents flapping between +Tier2 and Tier1 since the peaks reach ~0.6, which is below t1 + h = 0.75. + +### 3.10 Determinism Guarantees + +The tier migration algorithm is fully deterministic: + +1. **No randomness**: No random number generators are used in scoring, + candidate selection, or tie-breaking. + +2. 
**Stable ordering**: When two blocks have identical scores, ties are broken + by `(tensor_id, block_index)` in ascending lexicographic order. This + ensures the same blocks are processed first regardless of memory layout + or iteration order. + +3. **Reproducible EMA**: Because the EMA update uses the same `alpha` and + the same sequence of `touch` / `decay_ema` calls (driven by the event + stream), replaying the same event log produces identical metadata states. + +4. **No wall-clock dependency**: All timestamps are logical tick counters, not + system clocks. The maintenance pass is triggered by the tick event, not by + a timer. + +5. **Bit-exact f32**: All computations use `f32` with no intermediate `f64` + promotion. The LUT for `fast_exp_neg` is initialized deterministically. + On IEEE 754 compliant hardware (including WASM), results are bit-exact. + +### 3.11 Failure Modes and Remediation + +#### 3.11.1 Thrashing + +**Symptom**: Frequent tier transitions for the same block (>2 transitions per +100 ticks). Detected by monitoring the witness log. + +**Root cause**: Hysteresis margin too small relative to score volatility, or +`min_residency` too low for the workload's access variability. + +**Remediation**: + +| Action | Effect | +|--------|--------| +| Increase `hysteresis` from 0.05 to 0.10 | Doubles the dead zone around each threshold | +| Increase `min_residency` from 50 to 200 | Block must stay in tier 4x longer before eligible | +| Decrease `tau` | Recency signal decays faster, reducing score volatility from stale state | +| Decrease `alpha` | EMA smooths more aggressively, damping burst sensitivity | + +#### 3.11.2 Hot Set Misprediction + +**Symptom**: Tier1 byte footprint exceeds capacity. Too many blocks qualified +as hot. + +**Root cause**: `t1` threshold too low, or `w_pop` too high (treating any +recent activity as hot). 
+ +**Remediation**: + +| Action | Effect | +|--------|--------| +| Raise `t1` from 0.70 to 0.85 | Only blocks with very strong multi-signal evidence promoted | +| Lower `w_pop` from 0.2 to 0.1 | Reduce influence of window breadth | +| Enforce per-tier byte cap | Hard limit on total bytes in Tier1; evict lowest-scoring Tier1 blocks | +| Raise `w_rec` | Makes recency dominant; blocks must be very recently accessed | + +#### 3.11.3 Starvation of Downgrades + +**Symptom**: Cold blocks accumulate in Tier2 because upgrade processing +exhausts the CPU budget before downgrades run. + +**Root cause**: Budget too small, or too many upgrade candidates per tick. + +**Remediation**: + +| Action | Effect | +|--------|--------| +| Split budget 50/50 between upgrades and downgrades | Guarantees downgrade processing | +| Increase `cpu_budget` | More operations per tick | +| Process downgrades first every other tick | Round-robin priority | + +#### 3.11.4 Corruption Cascade + +**Symptom**: Multiple blocks fail checksum validation simultaneously after +a storage fault. + +**Root cause**: Underlying storage corruption (disk error, truncated write). + +**Remediation**: Repairs are processed unconditionally before tier transitions. +If the repair budget is exhausted, remaining corrupted blocks are flagged and +prioritized on the next tick. A persistent corruption counter triggers an alert +if it exceeds a configurable threshold. + +--- + +## 4. Mathematical Derivations + +### 4.1 EMA Convergence + +For a constant access rate of `r` accesses per tick (modeled as instant = r): + +``` +ema(t) = alpha * r + (1 - alpha) * ema(t-1) +``` + +This is a first-order IIR filter. The steady-state solution is: + +``` +ema_ss = alpha * r / (1 - (1 - alpha)) = r +``` + +The transient response from ema(0) = 0 is: + +``` +ema(t) = r * (1 - (1-alpha)^t) +``` + +Time to reach 95% of steady state: `t_95 = ln(0.05) / ln(1-alpha)`. +For alpha=0.1: `t_95 ~ 29 ticks`. 
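As a sanity check on the closed form, the recurrence can be iterated directly and compared against it. The sketch below is illustrative only; `simulate_ema`, `ema_closed_form`, and `ticks_to_fraction` are hypothetical helper names, and `ticks_to_fraction` rounds up to whole ticks.

```rust
/// Iterate ema(t) = alpha * r + (1 - alpha) * ema(t-1) from ema(0) = 0.
fn simulate_ema(alpha: f32, r: f32, ticks: u32) -> f32 {
    let mut ema = 0.0f32;
    for _ in 0..ticks {
        ema = alpha * r + (1.0 - alpha) * ema;
    }
    ema
}

/// Closed-form transient response: r * (1 - (1 - alpha)^t).
fn ema_closed_form(alpha: f32, r: f32, ticks: u32) -> f32 {
    r * (1.0 - (1.0 - alpha).powi(ticks as i32))
}

/// Smallest whole number of ticks after which the EMA has reached
/// `fraction` of its steady-state value: ln(1 - fraction) / ln(1 - alpha).
fn ticks_to_fraction(alpha: f32, fraction: f32) -> u32 {
    ((1.0 - fraction).ln() / (1.0 - alpha).ln()).ceil() as u32
}
```

For `alpha = 0.1` the exact ratio is about 28.4, so 29 is the first whole tick at which the simulated EMA is at or above 95% of steady state.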
+ +### 4.2 Score Sensitivity to Weight Changes + +Partial derivatives of S with respect to each weight: + +``` +dS/d(w_ema) = sig_ema (range [0, 1]) +dS/d(w_pop) = sig_pop (range [0, 1]) +dS/d(w_rec) = sig_rec (range (0, 1]) +``` + +Since all signals are in [0, 1], a unit change in any weight shifts the score +by at most 1.0. For small perturbations: + +``` +delta_S ~ delta_w_ema * sig_ema + delta_w_pop * sig_pop + delta_w_rec * sig_rec +``` + +To maintain threshold stability, changes to weights should be bounded: + +``` +|delta_w_i| < hysteresis / max(sig_i) = hysteresis +``` + +For hysteresis=0.05, individual weight adjustments should stay within +/-0.05 +to avoid unintended mass tier migrations. + +### 4.3 Hysteresis Dead Zone Width + +The effective dead zone around threshold T is: + +``` +dead_zone = [T - hysteresis, T + hysteresis] +width = 2 * hysteresis +``` + +A block's score must traverse the full dead zone width to complete a transition. +Given the maximum score velocity (one `touch` per tick driving all three +signals upward), the minimum time to traverse the dead zone is: + +``` +t_min_traverse ~ 2 * hysteresis / max_score_rate +``` + +For alpha=0.1, tau=100, and all weights=0.33: +- After a single touch from zero state: `delta_S ~ 0.33*0.1 + 0.33*(1/64) + 0.33*1 = 0.37` +- Dead zone width: `2 * 0.05 = 0.10` + +A single touch can cross the dead zone, but `min_residency` provides the +additional time floor. + +### 4.4 Popcount Signal Characteristics + +The window is a 64-bit shift register. After `k` consecutive ticks with +access: `popcount = min(k, 64)`. After `j` ticks of silence following +saturation: `popcount = max(64 - j, 0)`. + +Normalized popcount (`sig_pop = popcount/64`) has a trapezoidal response: +linear ramp up over 64 ticks, flat at 1.0 during sustained access, linear +ramp down over 64 ticks after access stops. This provides a 64-tick "memory" +that is independent of and complementary to the EMA and recency signals. + +--- + +## 5. 
Integration Points + +### 5.1 Relationship to ADR-017 (Temporal Tensor Compression) + +ADR-017 defined the compression pipeline (groupwise quantization, bitstream +packing, segment format) but used a simple score heuristic. This ADR replaces +that heuristic with the composite score while preserving the compression +pipeline unchanged. The `TierPolicy` struct from ADR-017's `tier_policy.rs` +is extended with the new fields (alpha, tau, weights, hysteresis, +min_residency). + +### 5.2 Relationship to ADR-018 (Block-Based Storage Engine) + +ADR-018 defines the block storage layer including `BlockMeta`, delta chains, +and the block I/O interface. This ADR adds the `ema_rate`, `window`, +`tier_age`, and `last_score` fields to `BlockMeta` and defines the maintenance +pass that operates on blocks through the storage engine's API. + +### 5.3 Coherence Engine Integration + +The coherence engine (ADR-014, ADR-015) may override tier decisions via +coherence-gated signals: + +- A coherence violation forces a block to Tier1 regardless of score, ensuring + full-precision access during consistency recovery. +- A coherence quiescence signal (stable energy for N ticks) permits accelerated + demotion by halving `min_residency` for the affected tensor. + +### 5.4 WASM Compatibility + +All types use `f32` and fixed-size integers. The LUT for `fast_exp_neg` is +initialized via a startup function callable from WASM's `_start` or +`__wasm_call_ctors`. The maintenance pass uses no heap allocation beyond the +candidate vectors, which can be pre-allocated to a fixed capacity. + +--- + +## 6. Alternatives Considered + +### 6.1 LRU / LFU Eviction + +**Rejected**: Pure LRU (least recently used) ignores frequency. Pure LFU +(least frequently used) ignores recency. Both are single-signal policies +that cannot express the nuanced tradeoffs of a multi-tier system. The +composite score subsumes both: high `w_rec` approximates LRU; high `w_ema` +approximates LFU. 
+ +### 6.2 ARC (Adaptive Replacement Cache) + +**Considered but rejected**: ARC maintains two LRU lists and a ghost list +to adaptively balance recency vs. frequency. While elegant for binary +(cache hit / miss) decisions, extending ARC to four tiers with different +bit-widths is non-trivial. The composite score approach is simpler to +implement, tune, and reason about. + +### 6.3 Machine-Learned Scoring + +**Deferred**: A small neural network could predict future access patterns +from historical traces. However, this introduces non-determinism (floating +point ordering in inference), model management complexity, and a cold-start +problem. We may revisit this when the RuVector intelligence system (SONA) +is mature enough to provide lightweight, deterministic inference. + +### 6.4 Single-Signal Score (Keep ADR-017 Heuristic) + +**Rejected**: As detailed in Section 1.1, the ADR-017 heuristic has +fundamental limitations. Extending it with decay would address monotonic +accumulation but still lack burst detection and thrashing protection. + +--- + +## 7. Acceptance Criteria + +| Criterion | Measurement | Target | +|-----------|-------------|--------| +| Touch latency | Benchmark `touch()` on x86-64 | < 50ns p99 | +| Score computation latency | Benchmark `compute_score()` | < 100ns p99 | +| Maintenance pass (1000 blocks) | End-to-end tick processing time | < 1ms | +| Determinism | Replay same event log twice, compare witness logs | Bit-exact match | +| Thrashing rate | Transitions per block per 100 ticks under mixed workload | < 2 | +| Tier accuracy | Fraction of blocks in correct tier after 1000 ticks (vs oracle) | > 90% | +| Hysteresis effectiveness | Tier transitions eliminated by hysteresis under oscillating load | > 80% | +| Budget compliance | Bytes and ops used per tick vs budget | Never exceeds budget | + +--- + +## 8. 
Risks and Mitigations + +| Risk | Severity | Likelihood | Mitigation | +|------|----------|------------|------------| +| Weight tuning requires per-workload calibration | Medium | High | Ship sensible defaults; provide tuning guide; expose metrics for auto-tuning | +| LUT initialization overhead | Low | Low | 256 entries * ~15ns = <4us; negligible startup cost | +| f32 precision drift over millions of EMA updates | Low | Medium | EMA is bounded [0, 1]; no accumulation. Periodic reset not needed. | +| min_residency delays urgent promotions | Medium | Medium | Coherence override bypasses min_residency for consistency-critical blocks | +| Witness log grows unbounded | Low | High | Ring buffer with configurable capacity; oldest entries evicted | +| WASM f32 semantics differ from native | Low | Low | Both follow IEEE 754; WASM mandates deterministic NaN handling | + +--- + +## 9. Open Questions + +1. **Auto-tuning**: Should we implement an online tuning loop that adjusts + weights based on observed cache hit rates and tier utilization? This could + adapt to changing workloads without manual configuration. + +2. **Per-tensor overrides**: Should individual tensors be able to specify + their own TierPolicy, or should the policy be global? Per-tensor policies + add flexibility but complicate the maintenance pass. + +3. **Tick rate selection**: The default tick interval of 100ms is appropriate + for server workloads. Embedded or edge deployments may need different + tick rates. Should the tick rate be configurable independently of the + policy parameters, or should tau and min_residency be specified in wall + time? + +4. **Budget split strategy**: The current design processes all upgrades before + all downgrades. Should we interleave upgrades and downgrades, or allocate + a fixed fraction of the budget to each? + +--- + +## 10. 
Implementation Roadmap + +### Phase 1: Core Scoring (Week 1) +- Extend `BlockMeta` with `ema_rate`, `window`, `tier_age`, `last_score` +- Implement `touch()`, `decay_ema()`, `compute_score()` +- Implement `fast_exp_neg` with LUT initialization +- Extend `TierPolicy` with new fields +- Unit tests for all score computations and edge cases + +### Phase 2: Tier Migration Logic (Week 1-2) +- Implement `should_transition()` with hysteresis +- Implement `is_eligible_for_transition()` with min_residency +- Implement single-step transition constraint +- State machine tests covering all transition paths + +### Phase 3: Maintenance Pass (Week 2-3) +- Implement `run_maintenance_tick()` with budget tracking +- Implement candidate selection and priority sorting +- Implement witness log emission +- Integration tests with synthetic workloads +- Determinism tests (replay verification) + +### Phase 4: Tuning and Hardening (Week 3-4) +- Benchmark touch and score computation latency +- Profile maintenance pass with 10K+ blocks +- Implement per-tier byte caps (failure mode 3.11.2) +- Create tuning guide with recommended configurations +- Fuzz testing for edge cases (zero tau, extreme weights, u64 overflow) + +--- + +## 11. References + +1. O'Neil, E., O'Neil, P., Weikum, G. "The LRU-K Page Replacement Algorithm + for Database Disk Buffering." SIGMOD 1993. +2. Megiddo, N., Modha, D. "ARC: A Self-Tuning, Low Overhead Replacement + Cache." USENIX FAST 2003. +3. Jiang, S., Zhang, X. "LIRS: An Efficient Low Inter-reference Recency Set + Replacement Policy." SIGMOD 2002. +4. ADR-017: Temporal Tensor Compression with Tiered Quantization. +5. ADR-018: Block-Based Storage Engine (referenced, not yet published). +6. ADR-014: Coherence Engine Architecture. +7. ADR-015: Coherence-Gated Transformer. 
+
+---
+
+## Appendix A: Score Curve Reference Charts
+
+### A.1 EMA Decay After Access Ceases (alpha = 0.1)
+
+[Chart: `ema` (y-axis, 0.0 to 1.0) against ticks since the last access (x-axis, 0 to 30). The curve follows `ema(t) = 0.9^t`, falling from 1.0 at tick 0 to about 0.04 at tick 30.]
+
+### A.2 Recency Decay (tau = 100)
+
+[Chart: `recency` (y-axis, 0.0 to 1.0) against ticks since the last access (x-axis, 0 to 500). The curve follows `exp(-t / 100)`, falling from 1.0 at tick 0 to about 0.007 at tick 500.]
+
+### A.3 Popcount Ramp-Up and Decay
+
+[Chart: `sig_pop` (y-axis, 0.0 to 1.0) against ticks (x-axis, 0 to 128). Three regions: a linear ramp up, a sustained plateau at 1.0, and a linear decay back to 0.0.]
+
+## Appendix B: Comparison of Approximation Methods for exp(-x)
+
+| Method | Max Relative Error (x in [0, 4]) | Latency (ns) | Memory |
+|--------|----------------------------------|---------------|--------|
+| `f32::exp` | 0 (reference) | 12-15 | 0 |
+| Pade [0/1]: `1/(1+x)` | ~146% at x=2; ~990% at x=4 | 2-3 | 0 |
+| Pade(2,2): `(1-x/2+x^2/12)/(1+x/2+x^2/12)` | ~5.6% at x=2; ~320% at x=4 | 4-5 | 0 |
+| LUT-256 + linear interp | 0.42% | 3-4 | 1 KB |
+| LUT-1024 + linear interp | 0.03% | 3-4 | 4 KB |
+
+The LUT-256 approach provides the best accuracy/cost tradeoff for scoring.
+
+## Appendix C: Worked Example -- Full Lifecycle of a Block
+
+Assume default policy: alpha=0.1, tau=100, w_ema=0.3, w_pop=0.2, w_rec=0.5,
+t1=0.70, t2=0.35, t3=0.10, hysteresis=0.05, min_residency=50.
+
+**Tick 0**: Block created. `ema=0, window=0, tier=Tier3, tier_age=0`.
+Score = 0.3*0 + 0.2*0 + 0.5*1.0 = 0.50. Above t2+h=0.40 but tier_age < 50.
+No transition.
+
+**Tick 1-49**: Block accessed every tick.
+By tick 49: `ema ~ 1-(0.9)^50 ~ 0.995`. `popcount = 50/64 ~ 0.78`.
+`recency = 1.0` (accessed this tick). Score ~ 0.3*0.995 + 0.2*0.78 + 0.5*1.0 += 0.30 + 0.16 + 0.50 = 0.96. Above t1+h = 0.75, but tier_age = 49 < 50. + +**Tick 50**: tier_age = 50 >= min_residency. Score = 0.96 > 0.75 (t1+h). +Upgrade: Tier3 -> Tier2 (single-step). tier_age resets to 0. + +**Tick 100**: tier_age = 50 again. Score still ~0.96. Upgrade: Tier2 -> Tier1. + +**Tick 101-200**: Access stops. EMA decays: `ema(t) = 0.995 * 0.9^t`. +Popcount drains 1 bit per tick. Recency decays: `exp(-t/100)`. + +**Tick 164** (64 ticks after last access): popcount reaches 0. Score drops +to ~0.3*0.002 + 0.2*0 + 0.5*0.53 = 0.27. Below t1-h = 0.65. tier_age = 64 +>= 50. Downgrade: Tier1 -> Tier2. + +**Tick 250** (150 ticks after last access): Score ~ 0.3*0 + 0.2*0 + +0.5*0.22 = 0.11. Below t2-h = 0.30. tier_age = 86 >= 50. +Downgrade: Tier2 -> Tier3. + +**Tick 350** (250 ticks after last access): Score ~ 0.5*0.08 = 0.04. +Below t3-h = 0.05. tier_age = 100 >= 50. Downgrade: Tier3 -> Tier0 (evicted). diff --git a/docs/adr/temporal-tensor-store/ADR-021-delta-compression-reconstruction.md b/docs/adr/temporal-tensor-store/ADR-021-delta-compression-reconstruction.md new file mode 100644 index 000000000..c0e32a3ff --- /dev/null +++ b/docs/adr/temporal-tensor-store/ADR-021-delta-compression-reconstruction.md @@ -0,0 +1,1033 @@ +# ADR-021: Delta Compression and Reconstruction Policies + +**Status**: Proposed +**Date**: 2026-02-08 +**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine +**Author**: System Architecture Team + +## Version History + +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-02-08 | Architecture Team | Initial proposal | + +--- + +## Abstract + +This ADR defines delta compression, reconstruction policies, and the associated +read/write data paths for the Temporal Tensor Store. 
It extends the tiered +quantization system from ADR-017 with a fourth logical tier -- Tier0 -- that +compresses data to zero resident bits while preserving the ability to +reconstruct on demand via delta chains or low-rank factor decomposition. The +design adds sparse delta encoding for incremental writes, bounded-depth delta +chain management with automatic compaction, and three explicit reconstruction +policies (`None`, `Delta`, `Factor`) that control what happens when a reader +requests a block that has been evicted to Tier0. + +All structures target Rust with `#[no_std]` compatibility for the WASM path, +consistent with the zero-dependency constraint established in ADR-017. + +--- + +## 1. Context and Motivation + +### 1.1 The Eviction Gap + +ADR-017 introduced three quantization tiers (8-bit hot, 7/5-bit warm, 3-bit +cold) that trade precision for storage. However, it provides no mechanism for +tensors that have become completely stale -- data that has not been accessed in +a long time and whose storage cost exceeds its value. Today the only option is +full deletion, which is irreversible. + +Production workloads produce tensor streams where the vast majority of blocks +become irrelevant within minutes but a small fraction are needed hours or days +later for debugging, auditing, or replay. We need a tier that retains the +ability to reconstruct without paying any per-block storage cost during steady +state. + +### 1.2 The Incremental Update Problem + +The current write path (ADR-017 `push_frame`) always stores a full quantized +representation. When a tensor block changes by only a few elements -- +common during fine-tuning steps or incremental embedding updates -- writing the +entire block wastes bandwidth and storage. Delta encoding captures only the +changed elements as sparse pairs. + +### 1.3 Design Goals + +1. **Zero-cost eviction**: Tier0 blocks consume zero data bytes; only metadata + survives. +2. 
**Configurable reconstruction**: Callers choose whether evicted blocks are + reconstructable, and by which method. +3. **Bounded delta chains**: Delta reads are O(K) where K is a small, + configurable constant (default 8), not O(history_length). +4. **Sparse delta writes**: Incremental changes below a threshold are stored as + sparse vectors, saving up to 90% over full-block rewrites. +5. **WASM-safe**: All structures use fixed-size integers and simple layouts + compatible with `wasm32-unknown-unknown`. + +--- + +## 2. Tier Model Extension + +The tier model from ADR-017 is extended with Tier0: + +``` +Tier1 (Hot) -- 8-bit quantized -- full fidelity, fast access +Tier2 (Warm) -- 7/5-bit quantized -- reduced fidelity, moderate access +Tier3 (Cold) -- 3-bit quantized -- low fidelity, infrequent access +Tier0 (Zero) -- 0-bit evicted -- metadata only, reconstructable on demand +``` + +Tier0 is reached when the tier score from `TierPolicy::select_bits` falls +below a new configurable threshold `evict_min_score` (default: 4), or when the +storage engine triggers explicit eviction under memory pressure. + +--- + +## 3. Reconstruction Policies + +### 3.1 Enum Definition + +```rust +/// Controls how a Tier0 (evicted) block is handled on read. +#[derive(Clone, Copy, Debug, PartialEq, Eq)] +#[repr(u8)] +pub enum ReconstructPolicy { + /// No reconstruction. Reads return an error or zeros depending on + /// `zero_fill_on_evict` in the global config. + None = 0, + + /// Reconstruct from a base block plus a bounded-depth delta chain. + /// The base is stored in the factor file or an older tier snapshot. + Delta = 1, + + /// Reconstruct from stored low-rank factors (SVD decomposition). + /// Factors are stored in a dedicated factor file: U, S, V matrices. + Factor = 2, +} + +/// Error returned when a Tier0 block cannot be read. +#[derive(Clone, Copy, Debug, PartialEq, Eq)] +pub enum ReadError { + /// Block has been evicted and the reconstruction policy is None. 
+ TensorEvicted, + /// Delta chain is corrupted or a link is missing. + DeltaChainBroken { depth: u16 }, + /// Factor file is missing or corrupt. + FactorMissing, + /// Block metadata not found. + BlockNotFound, + /// Supplied output buffer is too small. + BufferTooSmall { needed: usize, provided: usize }, +} +``` + +### 3.2 Policy Selection Rationale + +| Policy | Storage Cost | Read Latency | Quality | Best For | +|--------|-------------|-------------|---------|----------| +| None | 0 | N/A (error) | N/A | Truly disposable data | +| Delta | O(K * nnz) | O(K * N) | Exact at base tier | Audit trails, debugging replay | +| Factor | O(k*(m+n)) | O(k*m + k*n) | Bounded by truncation rank | Attention weight matrices | + +--- + +## 4. Delta Format + +### 4.1 Binary Layout + +``` +Delta Record (variable length): + +Offset Size Field Description +------ ----- ------------- ------------------------------------------ +0 16 tensor_id u128 LE - identifies the tensor +16 4 block_index u32 LE - block within the tensor +20 8 base_epoch u64 LE - epoch of the base this delta applies to +28 2 nnz u16 LE - number of non-zero delta entries +30 4 delta_scale f32 LE - scale factor for i16 delta values +34 nnz*4 pairs Array of (index: u16, value: i16) pairs +``` + +Total size per delta: `34 + 4 * nnz` bytes. + +For WASM targets, delta values are stored as `i16` with a shared `delta_scale` +(f32) to keep the arithmetic simple and avoid f64 in the critical path. + +### 4.2 Rust Structures + +```rust +/// On-disk header for a single delta record. +#[derive(Clone, Debug)] +#[repr(C, packed)] +pub struct DeltaHeader { + pub tensor_id: u128, + pub block_index: u32, + pub base_epoch: u64, + pub nnz: u16, + pub delta_scale: f32, +} + +/// A single sparse delta entry: position and quantized value. +#[derive(Clone, Copy, Debug)] +#[repr(C, packed)] +pub struct DeltaPair { + pub index: u16, + pub value: i16, +} + +/// In-memory representation of a delta record. 
+#[derive(Clone, Debug)] +pub struct DeltaRecord { + pub header: DeltaHeader, + pub pairs: Vec<DeltaPair>, +} + +impl DeltaRecord { + /// Serialise to bytes (little-endian, WASM-safe). + pub fn to_bytes(&self) -> Vec<u8> { + let mut buf = Vec::with_capacity(34 + self.pairs.len() * 4); + buf.extend_from_slice(&self.header.tensor_id.to_le_bytes()); + buf.extend_from_slice(&self.header.block_index.to_le_bytes()); + buf.extend_from_slice(&self.header.base_epoch.to_le_bytes()); + buf.extend_from_slice(&self.header.nnz.to_le_bytes()); + buf.extend_from_slice(&self.header.delta_scale.to_le_bytes()); + for p in &self.pairs { + buf.extend_from_slice(&p.index.to_le_bytes()); + buf.extend_from_slice(&p.value.to_le_bytes()); + } + buf + } + + /// Deserialise from bytes. Returns None on truncated input. + pub fn from_bytes(data: &[u8]) -> Option<Self> { + if data.len() < 34 { + return None; + } + let tensor_id = u128::from_le_bytes(data[0..16].try_into().ok()?); + let block_index = u32::from_le_bytes(data[16..20].try_into().ok()?); + let base_epoch = u64::from_le_bytes(data[20..28].try_into().ok()?); + let nnz = u16::from_le_bytes(data[28..30].try_into().ok()?); + let delta_scale = f32::from_le_bytes(data[30..34].try_into().ok()?); + + let pairs_len = nnz as usize; + if data.len() < 34 + pairs_len * 4 { + return None; + } + let mut pairs = Vec::with_capacity(pairs_len); + let mut off = 34; + for _ in 0..pairs_len { + let index = u16::from_le_bytes(data[off..off + 2].try_into().ok()?); + let value = i16::from_le_bytes(data[off + 2..off + 4].try_into().ok()?); + pairs.push(DeltaPair { index, value }); + off += 4; + } + + Some(Self { + header: DeltaHeader { + tensor_id, + block_index, + base_epoch, + nnz, + delta_scale, + }, + pairs, + }) + } +} +``` + +--- + +## 5. Block Metadata Extension + +The per-block metadata from ADR-018 is extended with reconstruction fields: + +```rust +/// Extended block metadata supporting Tier0 and reconstruction. 
+#[derive(Clone, Debug)] +pub struct BlockMeta { + pub tensor_id: u128, + pub block_index: u32, + pub epoch: u64, + + /// Current storage tier: 0 = evicted, 1 = hot, 2 = warm, 3 = cold. + pub tier: u8, + /// Bit width of the stored representation (0 for Tier0). + pub bits: u8, + /// Reconstruction policy when tier == 0. + pub reconstruct_policy: ReconstructPolicy, + + /// Number of deltas chained on top of the base for this block. + pub delta_chain_len: u16, + /// Epoch of the base block at the root of the delta chain. + pub base_epoch: u64, + + /// Byte offset into the tier data file (unused when tier == 0). + pub data_offset: u64, + /// Byte length in the tier data file (0 when tier == 0). + pub data_len: u32, + + /// Access tracking for tier policy. + pub access_count: u32, + pub last_access_ts: u32, +} +``` + +--- + +## 6. Read Path + +### 6.1 Sequence Diagram + +``` +Caller BlockStore TierDataFile DeltaStore FactorStore + | | | | | + |-- read_block(id) -->| | | | + | |-- lookup_meta(id) ->| | | + | |<--- BlockMeta ------| | | + | | | | | + | [tier 1/2/3?] | | | + | |-- read_bytes ------>| | | + | |<--- quantized ------| | | + | |-- dequantize ------>| | | + |<-- f32 buffer ------| | | | + | | | | | + | [tier 0, policy=None?] | | | + |<-- Err(TensorEvicted) | | | + | | | | | + | [tier 0, policy=Delta?] | | | + | |-- load_base ------->| | | + | |<--- base block -----| | | + | |-- load_deltas ------|----------------->| | + | |<--- delta chain ----|------------------| | + | |-- apply_chain ----->| | | + |<-- reconstructed ---| | | | + | | | | | + | [tier 0, policy=Factor?] | | | + | |-- load_factors -----|------------------|---------------->| + | |<--- U, S, V --------|------------------|-----------------| + | |-- reconstruct_svd ->| | | + |<-- reconstructed ---| | | | +``` + +### 6.2 Read Implementation + +```rust +/// Result of reading a block. Contains the f32 data or an error. 
+pub type ReadResult = Result<Vec<f32>, ReadError>; + +/// Read a block, performing reconstruction if necessary. +pub fn read_block( + meta: &BlockMeta, + tier_files: &TierDataFiles, + delta_store: &DeltaStore, + factor_store: &FactorStore, + zero_fill_on_evict: bool, + out: &mut Vec<f32>, +) -> Result<(), ReadError> { + match meta.tier { + // --- Tier 1/2/3: quantized data present --- + 1 | 2 | 3 => { + let raw = tier_files + .read_range(meta.tier, meta.data_offset, meta.data_len) + .map_err(|_| ReadError::BlockNotFound)?; + + // Dequantize into caller buffer using the segment decode path + // from ADR-017. The raw bytes include the TQTC segment header. + out.clear(); + crate::segment::decode(&raw, out); + if out.is_empty() { + return Err(ReadError::BlockNotFound); + } + Ok(()) + } + + // --- Tier 0: evicted, attempt reconstruction --- + 0 => match meta.reconstruct_policy { + ReconstructPolicy::None => { + if zero_fill_on_evict { + // Return a zero-filled buffer of the expected size. + // The block_size is derived from tensor metadata. + out.clear(); + out.resize(block_size_from_meta(meta), 0.0); + Ok(()) + } else { + Err(ReadError::TensorEvicted) + } + } + + ReconstructPolicy::Delta => { + reconstruct_via_delta(meta, tier_files, delta_store, out) + } + + ReconstructPolicy::Factor => { + reconstruct_via_factor(meta, factor_store, out) + } + }, + + _ => Err(ReadError::BlockNotFound), + } +} + +/// Reconstruct a Tier0 block by loading the base and applying the +/// delta chain up to the target epoch. +fn reconstruct_via_delta( + meta: &BlockMeta, + tier_files: &TierDataFiles, + delta_store: &DeltaStore, + out: &mut Vec<f32>, +) -> Result<(), ReadError> { + // 1. Load the base block (stored in an older tier or factor file). 
+ let base_raw = tier_files + .read_base(meta.tensor_id, meta.base_epoch) + .map_err(|_| ReadError::DeltaChainBroken { depth: 0 })?; + + out.clear(); + crate::segment::decode(&base_raw, out); + if out.is_empty() { + return Err(ReadError::DeltaChainBroken { depth: 0 }); + } + + // 2. Load and apply deltas sequentially (oldest to newest). + let deltas = delta_store + .load_chain(meta.tensor_id, meta.block_index, meta.base_epoch, meta.epoch) + .map_err(|_| ReadError::DeltaChainBroken { + depth: meta.delta_chain_len, + })?; + + for (i, delta) in deltas.iter().enumerate() { + apply_delta(out, delta).map_err(|_| ReadError::DeltaChainBroken { + depth: i as u16 + 1, + })?; + } + + Ok(()) +} + +/// Apply a single sparse delta to a mutable f32 buffer. +fn apply_delta(buf: &mut [f32], delta: &DeltaRecord) -> Result<(), ReadError> { + let scale = delta.header.delta_scale; + for pair in &delta.pairs { + let idx = pair.index as usize; + if idx >= buf.len() { + return Err(ReadError::BufferTooSmall { + needed: idx + 1, + provided: buf.len(), + }); + } + buf[idx] += (pair.value as f32) * scale; + } + Ok(()) +} + +/// Reconstruct a Tier0 block from stored SVD factors. +fn reconstruct_via_factor( + meta: &BlockMeta, + factor_store: &FactorStore, + out: &mut Vec<f32>, +) -> Result<(), ReadError> { + let factors = factor_store + .load(meta.tensor_id, meta.block_index) + .map_err(|_| ReadError::FactorMissing)?; + + // factors.u: [m x k], factors.s: [k], factors.v: [k x n] + // Reconstruct: out[i][j] = sum_r( U[i][r] * S[r] * V[r][j] ) + let m = factors.m; + let n = factors.n; + let k = factors.k; + + out.clear(); + out.resize(m * n, 0.0); + + for r in 0..k { + let s_r = factors.s[r]; + for i in 0..m { + let u_ir = factors.u[i * k + r]; + let u_s = u_ir * s_r; + for j in 0..n { + out[i * n + j] += u_s * factors.v[r * n + j]; + } + } + } + + Ok(()) +} +``` + +--- + +## 7. 
Write Path + +### 7.1 Write Path -- Full Replace + +``` +Caller BlockStore Quantizer TierDataFile + | | | | + |-- write_block(data) ->| | | + | |-- select_tier ---->| | + | |<-- bits, tier -----| | + | |-- quantize ------->| | + | |<-- segment bytes --| | + | |-- write_segment ---|-------------------->| + | |-- update_meta ---->| | + |<-- Ok ----------------| | | +``` + +```rust +/// Write a full block replacement. Quantizes at the current tier and +/// stores the complete representation, discarding any prior data. +pub fn write_block_full( + meta: &mut BlockMeta, + data: &[f32], + policy: &TierPolicy, + tier_files: &mut TierDataFiles, + now_ts: u32, +) -> Result<(), WriteError> { + // 1. Determine tier from access pattern. + let bits = policy.select_bits(meta.access_count, meta.last_access_ts, now_ts); + let tier = tier_from_bits(bits); + + // 2. Quantize via ADR-017 segment encoding. + let group_len = policy.group_len as usize; + let scales = crate::quantizer::compute_scales(data, group_len, bits); + let mut packed = Vec::new(); + // Quantize the input data (not the scales) using the per-group scales. + crate::quantizer::quantize_and_pack(data, &scales, group_len, bits, &mut packed); + + let mut segment = Vec::new(); + crate::segment::encode( + bits, + policy.group_len, + data.len() as u32, + 1, // single frame + &scales, + &packed, + &mut segment, + ); + + // 3. Write segment bytes to the appropriate tier data file. + let (offset, len) = tier_files.append(tier, &segment)?; + + // 4. Update metadata. + meta.tier = tier; + meta.bits = bits; + meta.data_offset = offset; + meta.data_len = len as u32; + meta.epoch += 1; + meta.delta_chain_len = 0; + meta.base_epoch = meta.epoch; + + Ok(()) +} +``` + +### 7.2 Write Path -- Delta Write + +``` +Caller BlockStore DeltaEncoder DeltaStore + | | | | + |-- write_delta(data) ->| | | + | |-- diff vs current->| | + | |<-- changed_frac ---| | + | | | | + | [changed_frac < p?] 
| | + | |-- encode_sparse -->| | + | |<-- DeltaRecord ----| | + | |-- store_delta -----|-------------------->| + | |-- update_meta ---->| | + |<-- Ok(DeltaStored) ---| | | + | | | | + | [changed_frac >= p?] | | + | |-- write_block_full (see 7.1) | + |<-- Ok(FullReplace) ---| | | +``` + +```rust +/// Decision thresholds for delta vs full write. +#[derive(Clone, Copy, Debug)] +pub struct DeltaPolicy { + /// Maximum fraction of changed elements to use delta encoding. + /// If the fraction exceeds this, a full write is performed instead. + pub max_changed_fraction: f32, // default: 0.10 (10%) + + /// Maximum L2 norm of the delta relative to the block norm. + /// Prevents delta encoding when the change is large in magnitude. + pub max_relative_delta_norm: f32, // default: 0.05 (5%) + + /// Maximum number of deltas in a chain before compaction is forced. + pub max_delta_chain: u16, // default: 8 +} + +impl Default for DeltaPolicy { + fn default() -> Self { + Self { + max_changed_fraction: 0.10, + max_relative_delta_norm: 0.05, + max_delta_chain: 8, + } + } +} + +/// Outcome of a write operation. +#[derive(Debug)] +pub enum WriteOutcome { + DeltaStored, + FullReplace, +} + +/// Attempt a delta write. Falls back to full replace when the change is +/// too large or the delta chain has reached its maximum depth. +pub fn write_block_delta( + meta: &mut BlockMeta, + old_data: &[f32], + new_data: &[f32], + delta_policy: &DeltaPolicy, + tier_policy: &TierPolicy, + tier_files: &mut TierDataFiles, + delta_store: &mut DeltaStore, + now_ts: u32, +) -> Result<WriteOutcome, WriteError> { + assert_eq!(old_data.len(), new_data.len()); + let n = old_data.len(); + + // 1. Compute diff statistics. 
+ let mut changed_count: usize = 0; + let mut delta_norm_sq: f64 = 0.0; + let mut block_norm_sq: f64 = 0.0; + + for i in 0..n { + let diff = (new_data[i] - old_data[i]) as f64; + block_norm_sq += (old_data[i] as f64) * (old_data[i] as f64); + if diff.abs() > 1e-9 { + changed_count += 1; + delta_norm_sq += diff * diff; + } + } + + let changed_frac = changed_count as f32 / n as f32; + let relative_norm = if block_norm_sq > 0.0 { + (delta_norm_sq / block_norm_sq).sqrt() as f32 + } else { + f32::MAX + }; + + // 2. Decision: delta or full replace? + let chain_full = meta.delta_chain_len >= delta_policy.max_delta_chain; + let change_too_large = changed_frac > delta_policy.max_changed_fraction + || relative_norm > delta_policy.max_relative_delta_norm; + + if chain_full || change_too_large { + write_block_full(meta, new_data, tier_policy, tier_files, now_ts)?; + return Ok(WriteOutcome::FullReplace); + } + + // 3. Encode sparse delta. + let max_abs_delta = old_data + .iter() + .zip(new_data.iter()) + .map(|(a, b)| (b - a).abs()) + .fold(0.0f32, f32::max); + + let delta_scale = if max_abs_delta == 0.0 { + 1.0 + } else { + max_abs_delta / i16::MAX as f32 + }; + let inv_scale = 1.0 / delta_scale; + + let mut pairs = Vec::with_capacity(changed_count); + for i in 0..n { + let diff = new_data[i] - old_data[i]; + if diff.abs() > 1e-9 { + let quantized = (diff * inv_scale).round() as i16; + pairs.push(DeltaPair { + index: i as u16, + value: quantized, + }); + } + } + + let record = DeltaRecord { + header: DeltaHeader { + tensor_id: meta.tensor_id, + block_index: meta.block_index, + base_epoch: meta.base_epoch, + nnz: pairs.len() as u16, + delta_scale, + }, + pairs, + }; + + // 4. Store delta and update metadata. + delta_store.append(&record)?; + meta.epoch += 1; + meta.delta_chain_len += 1; + + Ok(WriteOutcome::DeltaStored) +} +``` + +--- + +## 8. 
Delta Chain Management + +### 8.1 Chain Depth Bound + +The `max_delta_chain` parameter (default: 8) bounds the number of deltas that +can be chained before compaction. This guarantees that delta-based +reconstruction is bounded by O(K * N) where K <= `max_delta_chain` and N is +the block size. + +At 8 deltas with an average sparsity of 10%, the read amplification is: + +``` +base_read + 8 * 0.10 * N * 4 bytes = base_read + 3.2 * N bytes +``` + +For a 512-element block this is `base_read + 1.6 KB` (3.2 * 512 = 1638 bytes), +well within acceptable latency. + +### 8.2 Compaction Algorithm + +``` +DeltaStore Compactor TierDataFile MetadataStore + | | | | + |-- chain_len > K? | | | + | | | | + |-- load_base ---->| | | + |<-- base f32 -----| | | + | | | | + |-- load_deltas -->| | | + |<-- [d0..dK] -----| | | + | | | | + | [apply d0, d1, ..., dK] | | + | | | | + |-- quantize ----->| | | + |<-- new segment --| | | + | |-- write_segment -->| | + | |-- delete_deltas -->| | + | |-- update_meta -----|-------------------->| + |<-- compacted ----| | | +``` + +```rust +/// Compact a delta chain into a new base block. +/// +/// This is the primary mechanism for bounding read latency. When +/// `meta.delta_chain_len` exceeds `max_delta_chain`, the compactor: +/// 1. Loads the base block and decodes it to f32. +/// 2. Applies all deltas in epoch order. +/// 3. Re-quantizes at the current tier. +/// 4. Stores the result as a new base, deletes old deltas. +pub fn compact_delta_chain( + meta: &mut BlockMeta, + tier_policy: &TierPolicy, + tier_files: &mut TierDataFiles, + delta_store: &mut DeltaStore, + now_ts: u32, +) -> Result<(), CompactionError> { + // 1. Load and decode the base block. + let base_raw = tier_files + .read_base(meta.tensor_id, meta.base_epoch) + .map_err(|_| CompactionError::BaseMissing)?; + + let mut buffer = Vec::new(); + crate::segment::decode(&base_raw, &mut buffer); + if buffer.is_empty() { + return Err(CompactionError::BaseDecodeFailed); + } + + // 2. 
Load and apply all deltas in order. + let deltas = delta_store + .load_chain( + meta.tensor_id, + meta.block_index, + meta.base_epoch, + meta.epoch, + ) + .map_err(|_| CompactionError::DeltaLoadFailed)?; + + for delta in &deltas { + let scale = delta.header.delta_scale; + for pair in &delta.pairs { + let idx = pair.index as usize; + if idx < buffer.len() { + buffer[idx] += (pair.value as f32) * scale; + } + } + } + + // 3. Re-quantize at the current tier. + let bits = tier_policy.select_bits(meta.access_count, meta.last_access_ts, now_ts); + let tier = tier_from_bits(bits); + let group_len = tier_policy.group_len as usize; + + let scales = crate::quantizer::compute_scales(&buffer, group_len, bits); + let mut packed = Vec::new(); + // Quantize the reconstructed buffer (not the scales) using the per-group scales. + crate::quantizer::quantize_and_pack(&buffer, &scales, group_len, bits, &mut packed); + + let mut segment = Vec::new(); + crate::segment::encode( + bits, + tier_policy.group_len, + buffer.len() as u32, + 1, + &scales, + &packed, + &mut segment, + ); + + let (offset, len) = tier_files.append(tier, &segment)?; + + // 4. Delete old deltas and the old base. + delta_store.delete_chain( + meta.tensor_id, + meta.block_index, + meta.base_epoch, + meta.epoch, + )?; + + // 5. Update metadata to reflect the new base. + meta.tier = tier; + meta.bits = bits; + meta.data_offset = offset; + meta.data_len = len as u32; + meta.base_epoch = meta.epoch; + meta.delta_chain_len = 0; + + Ok(()) +} + +/// Map bit width to tier number. +fn tier_from_bits(bits: u8) -> u8 { + match bits { + 8 => 1, + 7 | 5 => 2, + 3 => 3, + 0 => 0, + _ => 3, // conservative fallback + } +} +``` + +--- + +## 9. Compression to Zero (Tier0 Eviction) + +When a block is evicted to Tier0: + +1. The data bytes in the tier data file are logically deleted (marked free for + reuse or physically removed during compaction). +2. `meta.bits` is set to 0 and `meta.tier` is set to 0. +3. 
`meta.data_len` is set to 0. +4. 
The reconstruction policy determines whether a base snapshot and/or delta + chain are preserved. + +```rust +/// Evict a block to Tier0. Optionally preserves reconstruction data. +pub fn evict_to_tier0( + meta: &mut BlockMeta, + policy: ReconstructPolicy, + tier_files: &mut TierDataFiles, +) -> Result<(), EvictionError> { + // Delete the data from the tier file. + if meta.data_len > 0 { + tier_files.mark_free(meta.tier, meta.data_offset, meta.data_len)?; + } + + meta.tier = 0; + meta.bits = 0; + meta.data_offset = 0; + meta.data_len = 0; + meta.reconstruct_policy = policy; + + // When policy is None, also delete any delta chain and factors + // to reclaim storage immediately. + // When policy is Delta or Factor, the associated stores are preserved. + + Ok(()) +} +``` + +--- + +## 10. Factor Reconstruction (SVD-Based) + +### 10.1 Factor File Format + +``` +FactorRecord: + +Offset Size Field Description +------ -------- -------- ------------------------------------------ +0 16 id u128 LE - tensor_id +16 4 block u32 LE - block_index +20 4 m u32 LE - rows of U +24 4 n u32 LE - cols of V +28 4 k u32 LE - truncation rank +32 m*k*4 u_data f32 LE - U matrix (row-major) +32+m*k*4 k*4 s_data f32 LE - singular values +... k*n*4 v_data f32 LE - V matrix (row-major) +``` + +### 10.2 Factor Store Structures + +```rust +/// Stored low-rank factors for SVD-based reconstruction. +#[derive(Clone, Debug)] +pub struct FactorRecord { + pub tensor_id: u128, + pub block_index: u32, + pub m: usize, // rows + pub n: usize, // cols + pub k: usize, // truncation rank, k << min(m, n) + pub u: Vec<f32>, // m x k, row-major + pub s: Vec<f32>, // k singular values + pub v: Vec<f32>, // k x n, row-major +} + +impl FactorRecord { + /// Storage cost in bytes (excluding header overhead). 
+ pub fn storage_bytes(&self) -> usize { + (self.m * self.k + self.k + self.k * self.n) * 4 + } + + /// Reconstruction error bound: sum of discarded singular values + /// (Eckart-Young theorem). 
The caller computes the full SVD and + /// provides only the top-k factors. + pub fn is_worthwhile(&self, full_block_bytes: usize) -> bool { + self.storage_bytes() < full_block_bytes / 2 + } +} +``` + +Factor reconstruction is most effective for tensors with low effective rank, +such as attention weight matrices where the top 32-64 singular values capture +over 95% of the Frobenius norm. + +--- + +## 11. Failure Modes and Mitigations + +### 11.1 Delta Chain Blowup + +**Symptom**: Reads become progressively slower as chains grow. + +**Root cause**: Compaction not triggered, or `max_delta_chain` set too high. + +**Mitigation**: The write path checks `delta_chain_len >= max_delta_chain` +before every delta write and forces a full replace (which resets the chain). +Background compaction runs when `chain_len > max_delta_chain / 2` to stay +ahead of the threshold. + +**Monitoring**: Expose `max_chain_len` and `avg_chain_len` as metrics on the +`BlockStore`. Alert when `max_chain_len` approaches 80% of `max_delta_chain`. + +### 11.2 Scale Instability (Outlier Sensitivity) + +**Symptom**: Quality drops sharply on blocks with outlier values, particularly +at 3-bit quantization where `qmax = 3`. + +**Root cause**: A single outlier in a group inflates the scale, crushing the +dynamic range available for all other values. + +**Mitigation**: + +1. **Outlier clamping**: Before computing scales, clamp values at the 99.9th + percentile of absolute values within each group. Outliers beyond the clamp + are stored separately as sparse corrections (same format as delta pairs). + +2. **Two-level scale for 3-bit**: Use a per-block coarse scale and a per-group + fine scale. The fine scale is a 4-bit multiplier (0.25x to 4.0x) applied on + top of the coarse scale. This provides 16 sub-ranges within the block's + dynamic range. + +3. **Per-group scale inside block**: Already implemented in ADR-017. 
Groups of + 64 elements each get their own scale, limiting outlier blast radius to 64 + values. + +### 11.3 Base Block Loss + +**Symptom**: Delta reconstruction fails with `DeltaChainBroken { depth: 0 }`. + +**Root cause**: The base block referenced by the delta chain was deleted or +corrupted. + +**Mitigation**: Base blocks referenced by active delta chains are pinned and +cannot be freed by tier file compaction. The eviction path must verify that no +active delta chains reference a base before releasing it. The metadata field +`base_epoch` serves as the foreign key for this reference check. + +--- + +## 12. Configuration + +All parameters described in this ADR are consolidated into `DeltaPolicy` and +`ReconstructPolicy`, both attached to the per-tensor or per-collection +`TierPolicy`. The full configuration surface: + +| Parameter | Location | Default | Description | +|-----------|----------|---------|-------------| +| `evict_min_score` | TierPolicy | 4 | Score threshold for Tier0 eviction | +| `reconstruct_policy` | BlockMeta | None | Per-block reconstruction strategy | +| `zero_fill_on_evict` | Global config | false | Return zeros instead of error for Tier0/None | +| `max_changed_fraction` | DeltaPolicy | 0.10 | Fraction threshold for delta vs full write | +| `max_relative_delta_norm` | DeltaPolicy | 0.05 | Norm threshold for delta vs full write | +| `max_delta_chain` | DeltaPolicy | 8 | Maximum chain depth before compaction | + +--- + +## 13. Alternatives Considered + +### 13.1 Unbounded Delta Chains with Periodic Checkpoints + +**Rejected**. Periodic checkpoints (every N epochs regardless of chain length) +waste storage when the tensor is not being modified. Bounded chains with +on-demand compaction are more space-efficient and simpler to reason about. + +### 13.2 Full Copy-on-Write for Every Update + +**Rejected**. For tensors changing by less than 10% per update, COW quadruples +write amplification compared to sparse deltas. 
The delta path reduces write +volume by 80-90% for typical incremental updates. + +### 13.3 LZ4/Zstd Compression Instead of Delta Encoding + +**Rejected**. General-purpose compression does not exploit the semantic +structure of tensor updates (sparse changes, known value distributions). Delta +encoding provides better compression for the specific access pattern, and +avoids adding external dependencies to the WASM-compatible core. + +### 13.4 Unlimited Factor Rank + +**Rejected**. Storing factors with rank k = min(m, n) provides exact +reconstruction but offers no compression. The truncation rank must be bounded +such that `factor_bytes < 0.5 * full_block_bytes` for the factor policy to be +worthwhile. + +--- + +## 14. Acceptance Criteria + +- [ ] Tier0 eviction reduces per-block storage to metadata only (0 data bytes) +- [ ] Delta reconstruction produces correct output for chain depths 1 through `max_delta_chain` +- [ ] Factor reconstruction matches SVD reference within floating-point tolerance +- [ ] Delta writes with <10% change use <20% of the bytes of a full write +- [ ] Compaction reduces chain length to 0 and produces a valid base block +- [ ] Read latency for delta reconstruction at chain depth 8 is under 50us for 512-dim blocks +- [ ] All structures serialise/deserialise correctly on both native and WASM targets +- [ ] `ReconstructPolicy::None` with `zero_fill_on_evict = false` returns `TensorEvicted` error +- [ ] `ReconstructPolicy::None` with `zero_fill_on_evict = true` returns a zero-filled buffer + +--- + +## 15. References + +1. ADR-017: Temporal Tensor Compression with Tiered Quantization (2026-02-06) +2. ADR-018: Block-Based Storage Engine (parent, in progress) +3. Eckart, C. and Young, G. "The approximation of one matrix by another of lower rank." Psychometrika 1(3), 1936. +4. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015. +5. Frantar, E., et al. 
"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. +6. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. diff --git a/docs/adr/temporal-tensor-store/ADR-022-wasm-api-cross-platform.md b/docs/adr/temporal-tensor-store/ADR-022-wasm-api-cross-platform.md new file mode 100644 index 000000000..fb54e198e --- /dev/null +++ b/docs/adr/temporal-tensor-store/ADR-022-wasm-api-cross-platform.md @@ -0,0 +1,1062 @@ +# ADR-022: WASM API Surface and Cross-Platform Strategy + +**Status**: Proposed +**Date**: 2026-02-08 +**Parent**: ADR-017 Temporal Tensor Compression, ADR-005 WASM Runtime Integration, ADR-018 Block-Based Storage Engine +**Author**: System Architecture Team + +## Version History + +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-02-08 | Architecture Team | Initial proposal | + +--- + +## Abstract + +This ADR defines the **WASM API surface** for the Temporal Tensor Store (TTS), +enabling the tiering gate and quantizer to be called from **Node.js** and +**browser** environments with identical semantics. The design extends the +frame-level `ttc_*` FFI established in ADR-017 with a new block-level `tts_*` +function set, introduces host-imported IO functions for pluggable storage +backends, and specifies the cross-platform binding strategy for native, Node.js, +browser, and edge targets. + +The API surface is intentionally narrow -- five core exports, three host imports, +and two memory-management helpers -- to minimise the attack surface exposed +across the WASM boundary while remaining sufficient for full tiered tensor +storage operations. + +--- + +## 1. Context and Motivation + +### 1.1 The Cross-Platform Imperative + +ADR-017 established a Rust-native temporal tensor compressor with a WASM FFI +layer (`ttc_*` functions) for frame-level compression. ADR-005 established the +WASM sandboxing model with epoch-based interruption and raw ABI. 
ADR-018 +defined the block-based storage engine with tiered placement. + +However, these designs assume the storage backend is directly accessible from +within the WASM module. In practice: + +- **Node.js**: Storage lives in AgentDB/RuVector file-backed databases that + the WASM module cannot access directly via filesystem calls. +- **Browser**: Persistent storage requires IndexedDB, which is asynchronous and + unavailable from within WASM linear memory. +- **Edge/Embedded**: Storage may be in-memory only, with no filesystem at all. + +The WASM module must delegate all IO to the **host** via imported functions, +while retaining ownership of the tiering policy, quantization logic, and +block management. + +### 1.2 tensor_id Splitting Problem + +WASM's value types are limited to `i32`, `i64`, `f32`, and `f64`. The Temporal +Tensor Store uses `u128` tensor identifiers internally, but `u128` cannot cross +the WASM FFI boundary as a single value. The standard solution is to split the +identifier into two `u64` halves (`hi` and `lo`), which the host reconstructs +on its side. + +### 1.3 Design Goals + +| Goal | Rationale | +|------|-----------| +| Narrow API surface (< 10 exports) | Minimise WASM boundary complexity and audit scope | +| Host-delegated IO | Enable platform-specific storage without WASM recompilation | +| Zero-copy where possible | Avoid redundant copies across the WASM boundary | +| Identical semantics across platforms | Same WASM binary runs on Node.js, browser, and edge | +| Coexistence with ADR-017 `ttc_*` | Both function sets share the same WASM module | + +--- + +## 2. Decision + +### 2.1 Introduce `tts_*` WASM Exports for Block-Level Storage + +We extend the WASM module with five core export functions and two memory +management helpers, all using `extern "C"` linkage with `#[no_mangle]`: + +```c +// Initialize the store with a JSON-encoded policy configuration. +// Returns 0 on success, negative error code on failure. 
+int32_t tts_init(const uint8_t* policy_ptr, size_t policy_len); + +// Ingest a tensor block. The tensor_id is split into hi/lo halves. +// data_ptr points to f32 values in WASM linear memory. +// Returns 0 on success, negative error code on failure. +int32_t tts_put(uint64_t tensor_id_hi, uint64_t tensor_id_lo, + uint32_t block_index, + const float* data_ptr, size_t data_len); + +// Read a tensor block, dequantized back to f32. +// out_ptr is a pre-allocated buffer in WASM linear memory. +// Returns 0 on success, negative error code on failure. +int32_t tts_get(uint64_t tensor_id_hi, uint64_t tensor_id_lo, + uint32_t block_index, + float* out_ptr, size_t out_len); + +// Run a maintenance tick: promote/demote blocks, evict to meet budgets. +// budget_bytes: maximum bytes to write during this tick. +// budget_ops: maximum IO operations during this tick. +// Returns number of blocks moved, or negative error code. +int32_t tts_tick(uint32_t budget_bytes, uint32_t budget_ops); + +// Write a JSON-encoded statistics snapshot into out_ptr. +// Returns bytes written, or negative error code if buffer too small. +int32_t tts_stats(uint8_t* out_ptr, size_t out_len); +``` + +### 2.2 Host-Imported IO Functions + +The WASM module imports three functions from the host environment for all +persistent IO. These are declared in the `"tts_host"` import namespace: + +```c +// Read a block from host storage into dst buffer. +// tier: 0=hot, 1=warm, 2=cold +// key_ptr/key_len: block key (tensor_id:block_index encoded as bytes) +// dst_ptr/dst_len: destination buffer in WASM linear memory +// Returns bytes read, or negative error code. +int32_t read_block(uint32_t tier, const uint8_t* key_ptr, size_t key_len, + uint8_t* dst_ptr, size_t dst_len); + +// Write a block to host storage from src buffer. +// Returns 0 on success, negative error code on failure.
+int32_t write_block(uint32_t tier, const uint8_t* key_ptr, size_t key_len, + const uint8_t* src_ptr, size_t src_len); + +// Delete a block from host storage. +// Returns 0 on success, negative error code on failure. +int32_t delete_block(uint32_t tier, const uint8_t* key_ptr, size_t key_len); +``` + +**Platform-specific host bindings:** + +| Platform | `read_block` | `write_block` | `delete_block` | +|----------|-------------|--------------|----------------| +| Node.js | AgentDB get | AgentDB put | AgentDB delete | +| Browser | IndexedDB get | IndexedDB put | IndexedDB delete | +| Native (server) | mmap read | mmap write | unlink | +| Edge/Embedded | ArrayBuffer slice | ArrayBuffer copy | zeroed/freed | + +### 2.3 Memory Management Exports + +```c +// Allocate len bytes in WASM linear memory. +// Returns pointer to allocated region, or 0 on failure. +uint32_t tts_alloc(size_t len); + +// Deallocate a previously allocated region. +void tts_dealloc(uint32_t ptr, size_t len); + +// Retrieve the last error message as a UTF-8 string. +// Returns bytes written (the message is truncated to out_len), +// or a negative error code if out_ptr is null. +int32_t tts_last_error(uint8_t* out_ptr, size_t out_len); +``` + +--- + +## 3.
Detailed Design + +### 3.1 WASM Memory Layout + +``` ++========================================================================+ +| WASM Linear Memory | +|========================================================================| +| | +| 0x0000 +-----------------+ | +| | WASM Stack | (grows downward, managed by WASM runtime) | +| +-----------------+ | +| | Static Data | (STORE, policy config, error buffer) | +| +-----------------+ | +| | | | +| | Heap | (managed by tts_alloc / tts_dealloc) | +| | | | +| | +-------------+ | | +| | | Input Buffer| | Host writes f32 frames here | +| | | (f32[N]) | | via tts_alloc -> memcpy -> tts_put | +| | +-------------+ | | +| | | | +| | +-------------+ | | +| | | Output Buf | | tts_get writes dequantized f32 here | +| | | (f32[N]) | | Host reads after tts_get returns | +| | +-------------+ | | +| | | | +| | +-------------+ | | +| | | IO Staging | | Temporary buffer for host import calls | +| | | Buffer | | (read_block / write_block payloads) | +| | +-------------+ | | +| | | | +| 0xFFFF +-----------------+ (grows via memory.grow as needed) | +| | ++========================================================================+ +``` + +### 3.2 Host-Guest Interaction Pattern + +``` + HOST (Node.js / Browser / Native) GUEST (WASM Module) + ==================================== ======================== + + 1. Load WASM module + 2. Provide host imports: + - tts_host::read_block + - tts_host::write_block + - tts_host::delete_block + 3. Instantiate module + | + 4. Encode policy as JSON bytes ------->| + 5. ptr = tts_alloc(policy_len) | allocate in linear mem + 6. Write policy bytes to ptr | + 7. tts_init(ptr, policy_len) ------->| parse policy, init STORE + 8. tts_dealloc(ptr, policy_len) | free policy buffer + | + --- INGEST LOOP --- | + | + 9. buf = tts_alloc(N * 4) | allocate f32 buffer + 10. Write f32 data into buf | + 11. 
tts_put(id_hi, id_lo, idx, ------->| quantize frame + buf, N) | tier policy selects bits + | calls write_block(tier, + | key, compressed) + <-------| write_block import + 12. Host persists block | + ------->| returns 0 (success) + 13. tts_dealloc(buf, N * 4) | + | + --- READ LOOP --- | + | + 14. out = tts_alloc(N * 4) | allocate output buffer + 15. tts_get(id_hi, id_lo, idx, ------->| calls read_block(tier, + out, N) | key, staging_buf) + <-------| read_block import + 16. Host reads from storage, | + writes into staging_buf | + ------->| dequantize into out + 17. Host reads f32 from out | + 18. tts_dealloc(out, N * 4) | + | + --- MAINTENANCE --- | + | + 19. tts_tick(budget_bytes, ------->| evaluate tier scores + budget_ops) | promote/demote blocks + | calls write_block, + | delete_block as needed + <-------| host import callbacks + 20. Host handles IO | + ------->| returns blocks_moved +``` + +### 3.3 Import/Export Function Table + +**Exports (WASM -> Host):** + +| Export | Signature (WASM types) | Description | +|--------|----------------------|-------------| +| `tts_init` | `(i32, i32) -> i32` | Init store with policy JSON | +| `tts_put` | `(i64, i64, i32, i32, i32) -> i32` | Ingest tensor block | +| `tts_get` | `(i64, i64, i32, i32, i32) -> i32` | Read tensor block | +| `tts_tick` | `(i32, i32) -> i32` | Maintenance tick | +| `tts_stats` | `(i32, i32) -> i32` | Statistics snapshot | +| `tts_alloc` | `(i32) -> i32` | Allocate linear memory | +| `tts_dealloc` | `(i32, i32) -> ()` | Free linear memory | +| `tts_last_error` | `(i32, i32) -> i32` | Get error message | + +**Imports (Host -> WASM), namespace `tts_host`:** + +| Import | Signature (WASM types) | Description | +|--------|----------------------|-------------| +| `read_block` | `(i32, i32, i32, i32, i32) -> i32` | Read from host storage | +| `write_block` | `(i32, i32, i32, i32, i32) -> i32` | Write to host storage | +| `delete_block` | `(i32, i32, i32) -> i32` | Delete from host storage | + +### 3.4 
tensor_id Encoding + +``` +u128 tensor_id: ++----------------------------------+----------------------------------+ +| hi (u64) | lo (u64) | +| bits [127..64] | bits [63..0] | ++----------------------------------+----------------------------------+ + +Reconstruction (host side): + tensor_id = (hi as u128) << 64 | (lo as u128) + +Block key encoding (for host import calls): + key = tensor_id_hi.to_le_bytes() ++ tensor_id_lo.to_le_bytes() ++ block_index.to_le_bytes() + key_len = 8 + 8 + 4 = 20 bytes +``` + +This encoding is deterministic and platform-independent (little-endian). + +### 3.5 Error Handling + +**Return code convention:** + +| Code | Name | Description | +|------|------|-------------| +| 0 | `TTS_OK` | Operation succeeded | +| -1 | `TTS_ERR_INVALID_HANDLE` | Store not initialized or handle invalid | +| -2 | `TTS_ERR_TENSOR_EVICTED` | Requested block was evicted from all tiers | +| -3 | `TTS_ERR_BUDGET_EXHAUSTED` | Tick budget fully consumed | +| -4 | `TTS_ERR_IO` | Host IO import returned an error | +| -5 | `TTS_ERR_CORRUPT_BLOCK` | Block data failed integrity check | +| -6 | `TTS_ERR_BUFFER_TOO_SMALL` | Output buffer insufficient | +| -7 | `TTS_ERR_INVALID_POLICY` | Policy JSON failed validation | +| -8 | `TTS_ERR_NULL_POINTER` | Null pointer passed for required argument | +| -9 | `TTS_ERR_ALLOC_FAILED` | Memory allocation failed | + +**Error message retrieval:** + +```rust +// Guest-side implementation +static mut LAST_ERROR: [u8; 256] = [0u8; 256]; +static mut LAST_ERROR_LEN: usize = 0; + +fn set_error(msg: &str) { + unsafe { + let bytes = msg.as_bytes(); + let len = bytes.len().min(256); + LAST_ERROR[..len].copy_from_slice(&bytes[..len]); + LAST_ERROR_LEN = len; + } +} + +#[no_mangle] +pub extern "C" fn tts_last_error(out_ptr: *mut u8, out_len: usize) -> i32 { + if out_ptr.is_null() { + return TTS_ERR_NULL_POINTER; + } + unsafe { + let copy_len = LAST_ERROR_LEN.min(out_len); + core::ptr::copy_nonoverlapping(LAST_ERROR.as_ptr(), out_ptr, copy_len); + 
copy_len as i32 + } +} +``` + +### 3.6 Memory Model Details + +The WASM module uses linear memory exclusively. The host interacts with this +memory through the exported `tts_alloc` and `tts_dealloc` functions: + +```rust +// Guest-side allocator: delegates to the module's global allocator with +// 4-byte alignment (sufficient for f32 buffers). +#[no_mangle] +pub extern "C" fn tts_alloc(len: usize) -> u32 { + // Calling the global allocator with a zero-size layout is undefined + // behavior, so reject empty requests up front. + if len == 0 { + set_error("zero-length allocation"); + return 0; + } + let layout = core::alloc::Layout::from_size_align(len, 4); + match layout { + Ok(layout) => { + let ptr = unsafe { alloc::alloc::alloc(layout) }; + if ptr.is_null() { + set_error("allocation failed"); + 0 + } else { + ptr as u32 + } + } + Err(_) => { + set_error("invalid allocation layout"); + 0 + } + } +} + +// The (ptr, len) pair must match a previous tts_alloc call exactly. +#[no_mangle] +pub extern "C" fn tts_dealloc(ptr: u32, len: usize) { + if ptr == 0 || len == 0 { + return; + } + let layout = core::alloc::Layout::from_size_align(len, 4); + if let Ok(layout) = layout { + unsafe { alloc::alloc::dealloc(ptr as *mut u8, layout); } + } +} +``` + +**Lifecycle protocol:** + +1. Host calls `tts_alloc(N)` to get a pointer in WASM linear memory. +2. Host writes data into that pointer region (via `memory.buffer` in JS). +3. Host calls `tts_put(...)` or `tts_init(...)` with the pointer. +4. Host calls `tts_dealloc(ptr, N)` to free the buffer. +5. For reads: host allocates output buffer, calls `tts_get(...)`, reads result, + then deallocates. + +--- + +## 4.
Cross-Platform Strategy + +### 4.1 Platform Binding Matrix + +| Platform | BlockIO Binding | MetaLog Binding | Async Model | Notes | +|----------|----------------|-----------------|-------------|-------| +| Native (server) | Memory-mapped files per tier | Append-only file | Sync | mmap for zero-copy reads; direct filesystem access | +| Node.js (WASM) | AgentDB / RuVector | AgentDB | Sync wrapper over async | Host imports bridge WASM to AgentDB API | +| Browser (WASM) | IndexedDB | IndexedDB | Async wrapper needed | Requires Atomics.wait or promise-based shim | +| Edge / Embedded | In-memory buffers | In-memory ring | Sync | No persistence; eviction on budget pressure | + +### 4.2 Node.js Binding Architecture + +``` ++------------------------------------------------------------------+ +| Node.js Process | +| | +| +------------------+ +-----------------------------+ | +| | TypeScript API | | WASM Instance | | +| | | alloc | | | +| | tts.init(policy) |--------->| tts_init(ptr, len) | | +| | tts.put(id, blk, |--------->| tts_put(hi, lo, idx, | | +| | data) | | ptr, len) | | +| | tts.get(id, blk) |--------->| tts_get(hi, lo, idx, | | +| | tts.tick(budget) |--------->| ptr, len) | | +| | tts.stats() | | tts_tick(bytes, ops) | | +| +------------------+ | tts_stats(ptr, len) | | +| ^ +----------+------------------+ | +| | | | +| | host imports| | +| | v | +| +------+------+ +-----------+-----------+ | +| | AgentDB |<-------------| tts_host::read_block | | +| | (storage) |<-------------| tts_host::write_block | | +| | |<-------------| tts_host::delete_block| | +| +-------------+ +-----------------------+ | ++------------------------------------------------------------------+ +``` + +### 4.3 Browser Binding Architecture + +In the browser, IndexedDB is asynchronous. The host imports must bridge this +gap. 
Two strategies are available: + +**Strategy A: SharedArrayBuffer + Atomics (preferred for performance)** + +The host import writes to a shared buffer and signals completion via +`Atomics.notify`. The WASM thread (running in a Web Worker) waits via +`Atomics.wait`. This provides synchronous semantics from the WASM perspective. + +**Strategy B: Asyncify (fallback)** + +For browsers without SharedArrayBuffer support, the Asyncify transform +(applied at WASM compile time via `wasm-opt --asyncify`) enables the WASM +module to yield execution and resume after the host completes an async +IndexedDB operation. + +| Strategy | Latency | Compatibility | Complexity | +|----------|---------|---------------|------------| +| SharedArrayBuffer + Atomics | ~1ms per IO | Requires COOP/COEP headers | Moderate | +| Asyncify | ~2-5ms per IO | Universal | Higher (binary transform) | + +### 4.4 Edge/Embedded Strategy + +For edge and embedded deployments, all storage is in-memory: + +- `read_block`: Returns data from a pre-allocated `ArrayBuffer` or `Vec`. +- `write_block`: Copies data into the in-memory store. +- `delete_block`: Zeros or frees the slot. +- No persistence. The `tts_tick` maintenance function handles eviction when + memory budget is exceeded. +- The in-memory ring for MetaLog provides bounded audit logging with automatic + overwrite of oldest entries. + +--- + +## 5. Integration with ADR-017 WASM FFI + +### 5.1 Coexistence of `ttc_*` and `tts_*` + +ADR-017 defined frame-level compression functions (`ttc_create`, `ttc_push_frame`, +`ttc_flush`, `ttc_decode_segment`, etc.). ADR-022 introduces block-level storage +functions (`tts_init`, `tts_put`, `tts_get`, `tts_tick`, `tts_stats`). 
+ +Both function sets coexist in the same WASM module: + +``` +WASM Module Exports +=================================================== + ADR-017 (frame-level compression) ADR-022 (block-level storage) + ---------------------------------- ---------------------------- + ttc_create tts_init + ttc_free tts_put + ttc_touch tts_get + ttc_set_access tts_tick + ttc_push_frame tts_stats + ttc_flush tts_alloc + ttc_decode_segment tts_dealloc + ttc_alloc tts_last_error + ttc_dealloc +=================================================== +``` + +**Shared allocator**: `tts_alloc` and `ttc_alloc` use the same underlying +allocator. If both are present, either can be called; they are aliases. + +**Layering**: `tts_put` internally invokes the `ttc_*` quantization pipeline +to compress the ingested f32 data before passing compressed blocks to the host +via `write_block`. `tts_get` reads compressed blocks via `read_block` and +invokes `ttc_decode_segment` to dequantize before writing f32 to the output +buffer. + +### 5.2 Shared State + +```rust +// Single-threaded WASM: static mut is sound +// NOTE: `TtsStore` is an assumed name for the guest-side store type. +static mut STORE: Option<TtsStore> = None; + +// The store holds: +// - TierPolicy (from tts_init config) +// - Block metadata index (tensor_id -> block_index -> tier, size, access stats) +// - Active compressor handles (reusing ttc_* compressor pool from ADR-017) +// - IO staging buffer (reused across calls to avoid repeated allocation) +``` + +--- + +## 6. TypeScript Type Definitions + +The following types define the Node.js binding surface: + +```typescript +/** 128-bit tensor identifier, split for WASM compatibility. */ +export interface TensorId { + /** Upper 64 bits of the tensor ID. */ + readonly hi: bigint; + /** Lower 64 bits of the tensor ID. */ + readonly lo: bigint; +} + +/** Policy configuration for the Temporal Tensor Store. */ +export interface TtsPolicy { + /** Minimum score for hot tier placement (default: 512).
*/ + hot_min_score?: number; + /** Minimum score for warm tier placement (default: 64). */ + warm_min_score?: number; + /** Bit width for warm tier: 5 or 7 (default: 7). */ + warm_bits?: 5 | 7; + /** Drift tolerance as Q8 fixed-point: 26 = ~10% (default: 26). */ + drift_pct_q8?: number; + /** Elements per quantization group (default: 64). */ + group_len?: number; + /** Maximum bytes across all tiers before eviction. */ + max_total_bytes?: number; +} + +/** Statistics snapshot returned by tts.stats(). */ +export interface TtsStats { + /** Number of tensor blocks in each tier. */ + blocks_by_tier: { hot: number; warm: number; cold: number }; + /** Total bytes stored in each tier. */ + bytes_by_tier: { hot: number; warm: number; cold: number }; + /** Total number of unique tensor IDs tracked. */ + tensor_count: number; + /** Number of blocks promoted in the last tick. */ + last_tick_promotions: number; + /** Number of blocks demoted in the last tick. */ + last_tick_demotions: number; + /** Number of blocks evicted in the last tick. */ + last_tick_evictions: number; +} + +/** Budget parameters for a maintenance tick. */ +export interface TtsTickBudget { + /** Maximum bytes to write during this tick. */ + bytes: number; + /** Maximum IO operations during this tick. */ + ops: number; +} + +/** Result of a maintenance tick. */ +export interface TtsTickResult { + /** Number of blocks moved (promoted + demoted + evicted). */ + blocks_moved: number; +} + +/** Error codes returned by tts_* functions. */ +export const enum TtsError { + OK = 0, + INVALID_HANDLE = -1, + TENSOR_EVICTED = -2, + BUDGET_EXHAUSTED = -3, + IO_ERROR = -4, + CORRUPT_BLOCK = -5, + BUFFER_TOO_SMALL = -6, + INVALID_POLICY = -7, + NULL_POINTER = -8, + ALLOC_FAILED = -9, +} + +/** Host IO interface that platform bindings must implement. */ +export interface TtsHostIO { + /** Read a block from storage. Returns the block bytes. 
 */ + readBlock(tier: number, key: Uint8Array): Uint8Array | null; + /** Write a block to storage. */ + writeBlock(tier: number, key: Uint8Array, data: Uint8Array): void; + /** Delete a block from storage. */ + deleteBlock(tier: number, key: Uint8Array): void; +} + +/** + * High-level TypeScript wrapper around the TTS WASM module. + * + * Usage: + * const tts = await TtsStore.create(wasmBytes, hostIO, policy); + * tts.put(tensorId, blockIndex, float32Data); + * const data = tts.get(tensorId, blockIndex); + * const moved = tts.tick({ bytes: 1048576, ops: 100 }); + * const stats = tts.stats(); + * tts.dispose(); + */ +export declare class TtsStore { + /** + * Instantiate the WASM module and initialize the store. + * @param wasmBytes - Compiled WASM module bytes. + * @param hostIO - Platform-specific IO implementation. + * @param policy - Tiering policy configuration. + */ + static create( + wasmBytes: ArrayBuffer, + hostIO: TtsHostIO, + policy?: TtsPolicy, + ): Promise<TtsStore>; + + /** + * Ingest a tensor block. + * @param id - 128-bit tensor identifier (split into hi/lo). + * @param blockIndex - Block index within the tensor. + * @param data - Float32 data to store. + * @throws TtsStoreError on failure. + */ + put(id: TensorId, blockIndex: number, data: Float32Array): void; + + /** + * Read a tensor block, dequantized to f32. + * @param id - 128-bit tensor identifier. + * @param blockIndex - Block index within the tensor. + * @returns Dequantized Float32Array. + * @throws TtsStoreError if block was evicted or corrupted. + */ + get(id: TensorId, blockIndex: number): Float32Array; + + /** + * Run a maintenance tick to promote, demote, or evict blocks. + * @param budget - IO budget for this tick. + * @returns Number of blocks moved. + */ + tick(budget: TtsTickBudget): TtsTickResult; + + /** Get a statistics snapshot. */ + stats(): TtsStats; + + /** Release all WASM resources. */ + dispose(): void; +} +``` + +--- + +## 7.
Safety Considerations + +### 7.1 Static Mutable State + +```rust +// NOTE: `TtsStore` is an assumed name for the guest-side store type. +// WASM (single-threaded): sound, no data races possible +static mut STORE: Option<TtsStore> = None; + +// Native targets: MUST use thread-safe alternatives +#[cfg(not(target_arch = "wasm32"))] +thread_local! { + static STORE: RefCell<Option<TtsStore>> = RefCell::new(None); +} + +// Or for shared-state native: +#[cfg(not(target_arch = "wasm32"))] +static STORE: once_cell::sync::Lazy<Mutex<Option<TtsStore>>> = + once_cell::sync::Lazy::new(|| Mutex::new(None)); +``` + +### 7.2 Pointer Validation + +All exported functions validate pointers before use: + +```rust +#[no_mangle] +pub extern "C" fn tts_put( + tensor_id_hi: u64, tensor_id_lo: u64, + block_index: u32, + data_ptr: *const f32, data_len: usize, +) -> i32 { + // Null check + if data_ptr.is_null() { + set_error("data_ptr is null"); + return TTS_ERR_NULL_POINTER; + } + // Bounds check: ensure the slice is within WASM linear memory + #[cfg(debug_assertions)] + { + let end = (data_ptr as usize) + (data_len * core::mem::size_of::<f32>()); + assert!(end <= core::arch::wasm32::memory_size(0) * 65536, + "data_ptr + data_len exceeds linear memory"); + } + // Slice construction: relies on the null and bounds checks above + let data = unsafe { core::slice::from_raw_parts(data_ptr, data_len) }; + // ... proceed with quantization and storage +} +``` + +### 7.3 Host Import Trust Model + +The WASM module trusts that host-imported functions (`read_block`, +`write_block`, `delete_block`) behave correctly with respect to the pointers +passed to them. This is the standard WASM host-guest contract: + +- The host must only read from `src_ptr` ranges within WASM linear memory. +- The host must only write to `dst_ptr` ranges within WASM linear memory. +- The host must not retain pointers or views across calls (`memory.grow` may + relocate linear memory and, in JavaScript, detaches the old `memory.buffer`).
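
The last rule is easy to violate from JavaScript, where growing linear memory detaches the `ArrayBuffer` previously returned by `memory.buffer`. The sketch below is a minimal illustration rather than part of the proposed API: `readBytes` is a hypothetical helper name and the memory sizes are arbitrary. It shows a cached view going stale after `memory.grow`, while a view built at the moment of use still sees the data.

```typescript
// Sketch only: `readBytes` is a hypothetical helper, not part of the tts_* API.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 4 });

/** Copy `len` bytes out of linear memory, building a fresh view on every call. */
function readBytes(mem: WebAssembly.Memory, ptr: number, len: number): Uint8Array {
  // Re-read mem.buffer each time; a cached buffer may have been detached.
  if (ptr < 0 || len < 0 || ptr + len > mem.buffer.byteLength) {
    throw new RangeError("range lies outside linear memory");
  }
  return new Uint8Array(mem.buffer, ptr, len).slice();
}

// A view created before the grow...
const staleView = new Uint8Array(memory.buffer);
staleView[16] = 42;

// ...is detached once memory grows by one 64 KiB page.
memory.grow(1);
const staleLength = staleView.byteLength;

// The data itself survives and is visible through a fresh view.
const fresh = readBytes(memory, 16, 1)[0];
```

A host binding that follows this pattern inside `read_block` and `write_block` never holds a view across a guest call, which is the situation the rule is meant to rule out.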
+ +### 7.4 Debug Assertions + +Debug builds include additional safety checks: + +| Check | Location | Purpose | +|-------|----------|---------| +| Pointer bounds | All exported functions | Prevent out-of-bounds access | +| Block key length | `read_block`, `write_block` | Ensure 20-byte key format | +| Policy JSON validity | `tts_init` | Reject malformed configuration | +| Tier range | Host import calls | Ensure tier in {0, 1, 2} | +| Alloc alignment | `tts_alloc` | Ensure 4-byte alignment for f32 | + +--- + +## 8. Alternatives Considered + +### 8.1 WASI Filesystem for Storage + +**Rejected.** WASI provides `fd_read` / `fd_write` for filesystem access, which +would allow the WASM module to perform IO directly. However, WASI filesystem +access is not available in browsers, and granting filesystem access to the WASM +module undermines the sandboxing model established in ADR-005. Host-imported IO +keeps the module fully sandboxed. + +### 8.2 Component Model for the API + +**Rejected for now.** The WASM Component Model provides richer type definitions +and automatic binding generation via WIT (WASM Interface Types). However, as +noted in ADR-005 section 3.1, the Component Model is still evolving and adds +canonical ABI overhead. The raw C ABI is stable, universally supported, and +sufficient for this narrow API surface. Migration path: the `tts_*` signatures +are designed to be expressible in WIT for future migration. + +### 8.3 Separate WASM Modules for Compressor and Store + +**Rejected.** Running `ttc_*` and `tts_*` in separate WASM modules would +require cross-module communication (via the host) for every put/get operation, +adding significant overhead. A single module with shared linear memory is +simpler and faster. 
+ +### 8.4 Passing tensor_id as a Pointer to 16 Bytes + +**Rejected.** While passing `tensor_id` as a `*const u8` pointing to 16 bytes +would avoid the hi/lo split, it adds a pointer indirection and requires the +host to allocate and manage a 16-byte buffer for every call. The hi/lo split +uses value types only, which is more efficient and eliminates a class of +pointer-related bugs. + +--- + +## 9. Acceptance Criteria + +### 9.1 Functional Requirements + +- [ ] `tts_init` correctly parses JSON policy and initializes the store +- [ ] `tts_put` quantizes f32 data and delegates to `write_block` host import +- [ ] `tts_get` calls `read_block`, dequantizes, and writes f32 to output +- [ ] `tts_tick` evaluates tier scores and moves blocks between tiers +- [ ] `tts_stats` returns valid JSON with tier-level statistics +- [ ] `tts_last_error` returns meaningful error messages for all error codes +- [ ] Host imports are called with correct tier, key, and buffer parameters +- [ ] Same WASM binary works in Node.js and browser without recompilation + +### 9.2 Performance Targets + +| Metric | Target | Notes | +|--------|--------|-------| +| `tts_put` latency (512-dim, WASM) | < 5us | Includes quantization + host IO | +| `tts_get` latency (512-dim, WASM) | < 5us | Includes host IO + dequantization | +| `tts_tick` latency (100 blocks) | < 1ms | Budget-bounded | +| WASM binary size (tts + ttc) | < 150KB | Release build, wasm-opt -Oz | +| Memory overhead per tracked tensor | < 64 bytes | Metadata only, excludes block data | + +### 9.3 Cross-Platform Targets + +| Platform | Requirement | +|----------|-------------| +| Node.js 20+ | Full functionality with AgentDB backend | +| Chrome 110+ | Full functionality with IndexedDB backend | +| Firefox 110+ | Full functionality with IndexedDB backend | +| Safari 16.4+ | Full functionality (SharedArrayBuffer with COOP/COEP) | +| Deno 1.30+ | Full functionality with filesystem backend | +| Edge / Embedded | In-memory mode, no persistence | + 
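
The Edge / Embedded row above calls for an in-memory mode, and the functional criteria require host imports to receive correct tier and key arguments. A minimal sketch of a host IO that could back both, for example as a mock in integration tests, is shown below. It assumes the `TtsHostIO` shape from section 6 and the 20-byte little-endian key layout from section 3.4; `InMemoryHostIO`, `splitTensorId`, `encodeBlockKey`, and `toHex` are hypothetical names introduced only for illustration.

```typescript
// Sketch only: the two interfaces mirror section 6; all other names are hypothetical.
interface TensorId {
  readonly hi: bigint;
  readonly lo: bigint;
}

interface TtsHostIO {
  readBlock(tier: number, key: Uint8Array): Uint8Array | null;
  writeBlock(tier: number, key: Uint8Array, data: Uint8Array): void;
  deleteBlock(tier: number, key: Uint8Array): void;
}

/** Split a 128-bit identifier into the hi/lo halves that cross the WASM boundary. */
function splitTensorId(id: bigint): TensorId {
  return {
    hi: BigInt.asUintN(64, id >> BigInt(64)),
    lo: BigInt.asUintN(64, id),
  };
}

/** Build the 20-byte key: hi (8 bytes LE), lo (8 bytes LE), block_index (4 bytes LE). */
function encodeBlockKey(id: TensorId, blockIndex: number): Uint8Array {
  const key = new Uint8Array(20);
  const view = new DataView(key.buffer);
  view.setBigUint64(0, id.hi, true);
  view.setBigUint64(8, id.lo, true);
  view.setUint32(16, blockIndex, true);
  return key;
}

function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

/** One Map per tier (0 = hot, 1 = warm, 2 = cold); nothing is persisted. */
class InMemoryHostIO implements TtsHostIO {
  private readonly tiers: Array<Map<string, Uint8Array>> = [new Map(), new Map(), new Map()];

  readBlock(tier: number, key: Uint8Array): Uint8Array | null {
    const block = this.tierMap(tier).get(toHex(key));
    return block ? block.slice() : null; // copy so callers cannot mutate the store
  }

  writeBlock(tier: number, key: Uint8Array, data: Uint8Array): void {
    this.tierMap(tier).set(toHex(key), data.slice());
  }

  deleteBlock(tier: number, key: Uint8Array): void {
    this.tierMap(tier).delete(toHex(key));
  }

  private tierMap(tier: number): Map<string, Uint8Array> {
    const map = this.tiers[tier];
    if (map === undefined) throw new RangeError(`tier must be 0, 1, or 2 (got ${tier})`);
    return map;
  }
}

// Usage: write a block to the warm tier, read it back, then delete it.
const io = new InMemoryHostIO();
const id = splitTensorId((BigInt(1) << BigInt(64)) | BigInt(2)); // hi = 1, lo = 2
const key = encodeBlockKey(id, 3);
io.writeBlock(1, key, new Uint8Array([9, 8, 7]));
const roundTrip = io.readBlock(1, key);
io.deleteBlock(1, key);
const afterDelete = io.readBlock(1, key);
```

Copying on both read and write keeps stored blocks independent of any view over WASM linear memory, at the cost of one extra copy per call; a production binding might trade that away once buffer lifetimes are pinned down.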
+--- + +## 10. Risks and Mitigations + +| Risk | Severity | Likelihood | Mitigation | +|------|----------|------------|------------| +| Browser async IO adds significant latency | High | Medium | SharedArrayBuffer + Atomics for sync semantics; batch IO in `tts_tick` | +| IndexedDB storage limits in browser | Medium | Medium | Implement LRU eviction in `tts_tick`; surface quota warnings in `tts_stats` | +| Host import ABI mismatch across platforms | High | Low | Comprehensive integration tests per platform; ABI versioning in policy JSON | +| WASM memory.grow invalidates host pointers | Medium | Medium | Document that host must re-read `memory.buffer` after any call that may allocate | +| Shared allocator contention between ttc/tts | Low | Low | Single-threaded WASM eliminates contention; native targets use separate pools | +| Future WASM multi-threading breaks static mut | Medium | Low | Replace with `thread_local!` for native; WASM threads require explicit opt-in | + +--- + +## 11. Open Questions + +1. **IndexedDB transaction granularity**: Should each `read_block`/`write_block` + call be a separate IndexedDB transaction, or should we batch within a + `tts_tick` invocation? + +2. **WASM module size budget**: With both `ttc_*` and `tts_*` in one module, + the 150KB target may be tight. Should we provide a `tts_*`-only build for + environments that do not need frame-level compression? + +3. **Policy hot-reload**: Should `tts_init` be callable multiple times to + update policy without losing block metadata, or should policy changes + require a full re-initialization? + +4. **Streaming reads**: Should `tts_get` support partial block reads (offset + + length) for large tensor blocks, or always return the full block? + +5. **Host import error propagation**: When a host import returns an error, + should `tts_put`/`tts_get` propagate the raw error code or map it to a + TTS-specific error? + +--- + +## 12. 
Implementation Roadmap + +### Phase 1: Core API Surface (Week 1) +- [ ] Define `tts_*` export functions in `ffi.rs` +- [ ] Define `tts_host` import declarations +- [ ] Implement `tts_init` with JSON policy parsing +- [ ] Implement `tts_alloc` / `tts_dealloc` / `tts_last_error` +- [ ] Unit tests for error handling and pointer validation + +### Phase 2: Storage Integration (Week 2) +- [ ] Implement `tts_put` with quantization pipeline and `write_block` calls +- [ ] Implement `tts_get` with `read_block` calls and dequantization +- [ ] Implement block key encoding (tensor_id + block_index) +- [ ] Integration tests with mock host imports + +### Phase 3: Tier Management (Week 2-3) +- [ ] Implement `tts_tick` with tier score evaluation +- [ ] Implement block promotion/demotion with budget enforcement +- [ ] Implement `tts_stats` with JSON serialization +- [ ] Stress tests: 10K blocks, rapid tier transitions + +### Phase 4: Node.js Binding (Week 3) +- [ ] TypeScript wrapper class (`TtsStore`) +- [ ] AgentDB `TtsHostIO` implementation +- [ ] npm package build with wasm-pack +- [ ] Integration tests against live AgentDB + +### Phase 5: Browser Binding (Week 4) +- [ ] IndexedDB `TtsHostIO` implementation +- [ ] SharedArrayBuffer + Atomics synchronization layer +- [ ] Asyncify fallback build +- [ ] Browser integration tests (Playwright) + +### Phase 6: Edge / Embedded (Week 4+) +- [ ] In-memory `TtsHostIO` implementation +- [ ] Ring-buffer MetaLog for audit +- [ ] Memory budget enforcement tests +- [ ] Binary size optimization (wasm-opt -Oz) + +--- + +## 13. References + +1. ADR-017: Temporal Tensor Compression with Tiered Quantization (this repo) +2. ADR-005: WASM Runtime Integration (this repo) +3. ADR-018: Block-Based Storage Engine (this repo) +4. WebAssembly Specification, Section 5: Binary Format. + https://webassembly.github.io/spec/core/binary/ +5. WebAssembly JS API. + https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface +6. 
Asyncify: Turning WASM modules into async generators. + https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html +7. IndexedDB API. + https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API +8. SharedArrayBuffer and Atomics. + https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer +9. wasm-bindgen: Facilitating high-level interactions between WASM and JS. + https://rustwasm.github.io/docs/wasm-bindgen/ +10. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series + Database." VLDB 2015. + +--- + +## Appendix A: Node.js Host Import Implementation + +```typescript +import { TtsHostIO } from "./types"; +import { AgentDB } from "@ruvector/agentdb"; + +const TIER_NAMES = ["hot", "warm", "cold"] as const; + +export class AgentDBHostIO implements TtsHostIO { + constructor(private readonly db: AgentDB) {} + + readBlock(tier: number, key: Uint8Array): Uint8Array | null { + const namespace = `tts:${TIER_NAMES[tier]}`; + const keyHex = Buffer.from(key).toString("hex"); + return this.db.getSync(namespace, keyHex); + } + + writeBlock(tier: number, key: Uint8Array, data: Uint8Array): void { + const namespace = `tts:${TIER_NAMES[tier]}`; + const keyHex = Buffer.from(key).toString("hex"); + this.db.putSync(namespace, keyHex, data); + } + + deleteBlock(tier: number, key: Uint8Array): void { + const namespace = `tts:${TIER_NAMES[tier]}`; + const keyHex = Buffer.from(key).toString("hex"); + this.db.deleteSync(namespace, keyHex); + } +} +``` + +## Appendix B: Browser Host Import Implementation (Asyncify) + +```typescript +import { TtsHostIO } from "./types"; + +const DB_NAME = "tts-blocks"; +const STORE_NAMES = ["hot", "warm", "cold"]; + +export class IndexedDBHostIO implements TtsHostIO { + private db: IDBDatabase | null = null; + + async init(): Promise<void> { + return new Promise<void>((resolve, reject) => { + const req = indexedDB.open(DB_NAME, 1); + req.onupgradeneeded = () => { + const db = req.result; + for (const store
of STORE_NAMES) { + if (!db.objectStoreNames.contains(store)) { + db.createObjectStore(store); + } + } + }; + req.onsuccess = () => { this.db = req.result; resolve(); }; + req.onerror = () => reject(req.error); + }); + } + + readBlock(tier: number, key: Uint8Array): Uint8Array | null { + // With Asyncify, this synchronous-looking call actually yields + // to the event loop and resumes when the IDB transaction completes. + const tx = this.db!.transaction(STORE_NAMES[tier], "readonly"); + const store = tx.objectStore(STORE_NAMES[tier]); + const keyHex = Array.from(key, (b) => b.toString(16).padStart(2, "0")).join(""); + const req = store.get(keyHex); + // Asyncify transforms this into an awaitable suspension point + return req.result ? new Uint8Array(req.result) : null; + } + + writeBlock(tier: number, key: Uint8Array, data: Uint8Array): void { + const tx = this.db!.transaction(STORE_NAMES[tier], "readwrite"); + const store = tx.objectStore(STORE_NAMES[tier]); + const keyHex = Array.from(key, (b) => b.toString(16).padStart(2, "0")).join(""); + store.put(data.buffer, keyHex); + } + + deleteBlock(tier: number, key: Uint8Array): void { + const tx = this.db!.transaction(STORE_NAMES[tier], "readwrite"); + const store = tx.objectStore(STORE_NAMES[tier]); + const keyHex = Array.from(key, (b) => b.toString(16).padStart(2, "0")).join(""); + store.delete(keyHex); + } +} +``` + +## Appendix C: WASM Module Instantiation (Node.js) + +```typescript +import { readFile } from "node:fs/promises"; +import { TtsStore, TtsPolicy, TtsHostIO } from "./types"; + +export async function loadTtsModule( + wasmPath: string, + hostIO: TtsHostIO, + policy: TtsPolicy = {}, +): Promise<TtsStore> { + const wasmBytes = await readFile(wasmPath); + const wasmMemory = new WebAssembly.Memory({ initial: 256, maximum: 4096 }); + + const importObject = { + env: { memory: wasmMemory }, + tts_host: { + read_block: (tier: number, keyPtr: number, keyLen: number, + dstPtr: number, dstLen: number): number => { + const mem
= new Uint8Array(wasmMemory.buffer); + const key = mem.slice(keyPtr, keyPtr + keyLen); + const result = hostIO.readBlock(tier, key); + if (!result) return -2; // TTS_ERR_TENSOR_EVICTED + if (result.length > dstLen) return -6; // TTS_ERR_BUFFER_TOO_SMALL + mem.set(result, dstPtr); + return result.length; + }, + write_block: (tier: number, keyPtr: number, keyLen: number, + srcPtr: number, srcLen: number): number => { + const mem = new Uint8Array(wasmMemory.buffer); + const key = mem.slice(keyPtr, keyPtr + keyLen); + const data = mem.slice(srcPtr, srcPtr + srcLen); + hostIO.writeBlock(tier, key, data); + return 0; + }, + delete_block: (tier: number, keyPtr: number, keyLen: number): number => { + const mem = new Uint8Array(wasmMemory.buffer); + const key = mem.slice(keyPtr, keyPtr + keyLen); + hostIO.deleteBlock(tier, key); + return 0; + }, + }, + }; + + const { instance } = await WebAssembly.instantiate(wasmBytes, importObject); + const exports = instance.exports as Record; + + // Initialize the store with policy + const policyJson = new TextEncoder().encode(JSON.stringify(policy)); + const policyPtr = exports.tts_alloc(policyJson.length) as number; + new Uint8Array(wasmMemory.buffer).set(policyJson, policyPtr); + const initResult = exports.tts_init(policyPtr, policyJson.length) as number; + exports.tts_dealloc(policyPtr, policyJson.length); + + if (initResult !== 0) { + throw new Error(`tts_init failed with code ${initResult}`); + } + + // Return wrapped store object + return new TtsStoreImpl(exports, wasmMemory); +} +``` + +--- + +## Related Decisions + +- **ADR-005**: WASM Runtime Integration (sandboxing model, epoch interruption, raw ABI) +- **ADR-017**: Temporal Tensor Compression (frame-level `ttc_*` FFI, quantization pipeline) +- **ADR-018**: Block-Based Storage Engine (tiered placement, block format) +- **ADR-001**: RuVector Core Architecture (crate structure, dependency graph) +- **ADR-004**: KV Cache Management (three-tier cache model) diff --git 
a/docs/adr/temporal-tensor-store/ADR-023-benchmarking-acceptance-criteria.md b/docs/adr/temporal-tensor-store/ADR-023-benchmarking-acceptance-criteria.md new file mode 100644 index 000000000..6620076ac --- /dev/null +++ b/docs/adr/temporal-tensor-store/ADR-023-benchmarking-acceptance-criteria.md @@ -0,0 +1,422 @@ +# ADR-023: Benchmarking, Failure Modes, and Acceptance Criteria + +**Status**: Proposed +**Date**: 2026-02-08 +**Parent**: ADR-017 Temporal Tensor Compression, ADR-018 Block-Based Storage Engine +**Author**: System Architecture Team + +## Version History + +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 0.1 | 2026-02-08 | Architecture Team | Initial proposal | + +--- + +## Abstract + +This ADR defines benchmarking methodology, acceptance thresholds, failure modes, and CI strategy for the Temporal Tensor Store. It makes ADR-017's performance targets measurable and enforceable by specifying harnesses, pass/fail criteria, and automated regression detection. + +--- + +## 1. Context + +ADR-017 and ADR-018 together form the Temporal Tensor Store but leave gaps in how targets are measured, what happens when they are missed, and how regressions are caught. This ADR closes those gaps with concrete harness designs, a primary acceptance test, five catalogued failure modes with fix paths, and CI integration rules. + +--- + +## 2. Microbenchmark Targets + +All measurements use a single 16KB block (4096 f32 values, group_len=64). Harness: Criterion.rs with 200 samples, 5s measurement, 2s warm-up. 
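The block geometry behind these targets can be made concrete with a small sketch. It only restates the numbers given above (16 KB of f32 input, `group_len=64`, bit widths 8/7/5/3); the function names and the xorshift input generator are illustrative assumptions, not part of the crate or of the Criterion harness.

```rust
// Hypothetical helper names; only the numbers (16 KB block, f32 input,
// group_len = 64, bit widths 8/7/5/3) come from this ADR.
pub const BLOCK_BYTES: usize = 16 * 1024;
pub const GROUP_LEN: usize = 64;

/// Number of f32 values in one raw block.
pub fn values_per_block() -> usize {
    BLOCK_BYTES / std::mem::size_of::<f32>()
}

/// Number of quantization groups in one block.
pub fn groups_per_block() -> usize {
    values_per_block() / GROUP_LEN
}

/// Bytes of packed codes for one block at `bits` per value, rounded up.
pub fn packed_code_bytes(bits: usize) -> usize {
    (values_per_block() * bits + 7) / 8
}

/// Deterministic pseudo-random input in [-1.0, 1.0) from a fixed seed.
/// Uses xorshift64 as a stand-in (assumption); it is uniform, not the
/// scaled standard normal a real harness might use.
pub fn make_input_block(seed: u64) -> Vec<f32> {
    let mut state = seed.max(1); // xorshift must not start at zero
    (0..values_per_block())
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Top 24 bits map to [0, 1), then shift to [-1, 1).
            let unit = (state >> 40) as f32 / (1u64 << 24) as f32;
            unit * 2.0 - 1.0
        })
        .collect()
}
```

Under these assumptions one block holds 4096 values in 64 groups, and the packed code payload shrinks from 4096 bytes at 8-bit to 1536 bytes at 3-bit, which is why the pack and unpack targets are stated per bit width.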
+ +### 2.1 Quantize and Dequantize Throughput + +| Operation | Bit Width | Native Target | WASM Target | +|-----------|-----------|--------------|-------------| +| Quantize | 8-bit | < 2 us | < 20 us | +| Quantize | 7-bit | < 2 us | < 20 us | +| Quantize | 5-bit | < 2.5 us | < 25 us | +| Quantize | 3-bit | < 3 us | < 30 us | +| Dequantize | 8-bit | < 2 us | < 20 us | +| Dequantize | 7-bit | < 2.5 us | < 25 us | +| Dequantize | 5-bit | < 3 us | < 30 us | +| Dequantize | 3-bit | < 5 us | < 50 us | + +### 2.2 Pack and Unpack Speed + +| Operation | Bit Width | Native Target | WASM Target | +|-----------|-----------|--------------|-------------| +| Pack 16KB | 8-bit | < 0.5 us | < 5 us | +| Pack 16KB | 7-bit | < 1 us | < 10 us | +| Pack 16KB | 5-bit | < 1 us | < 10 us | +| Pack 16KB | 3-bit | < 1.5 us | < 15 us | +| Unpack 16KB | 8-bit | < 0.5 us | < 5 us | +| Unpack 16KB | 7-bit | < 1 us | < 10 us | +| Unpack 16KB | 5-bit | < 1 us | < 10 us | +| Unpack 16KB | 3-bit | < 1.5 us | < 15 us | + +### 2.3 Tier Decision and Scoring + +| Operation | Native Target | WASM Target | +|-----------|--------------|-------------| +| Tier decision per block | < 50 ns | < 500 ns | +| Per-block scoring | < 20 ns | < 200 ns | +| Maintenance tick (1000 candidates) | < 1 ms | < 10 ms | +| Delta apply (sparse, 10% nnz) | < 1 us | < 10 us | + +### 2.4 Auxiliary Operations + +| Operation | Native Target | WASM Target | +|-----------|--------------|-------------| +| f32-to-f16 / f16-to-f32 (single) | < 5 ns | < 50 ns | +| Drift check (64-group block) | < 50 ns | < 500 ns | +| CRC32 checksum (16KB) | < 1 us | < 10 us | +| Segment encode (16KB, 1 frame) | < 3 us | < 30 us | +| Segment decode (16KB, 1 frame) | < 3 us | < 30 us | + +--- + +## 3. 
Macrobenchmark Targets + +### 3.1 KV Cache-Like Workload with Zipf Access Pattern + +| Parameter | Value | Rationale | +|-----------|-------|-----------| +| Total blocks | 1,000,000 | ~16 GB raw; representative large cache | +| Total accesses | 10,000,000 | Statistical stability | +| Distribution | Zipf (alpha=1.2) | Models real attention-pattern skew | +| Block size | 16 KB | Standard block from ADR-018 | +| Tier-1 byte cap | 2 GB | Memory-constrained deployment | + +### 3.2 Measurements + +Average read latency, P95 read latency, P99 read latency, bytes stored per token, MSE per tier (sampled from 1000 blocks per tier), tier churn rate (transitions/block/minute), Tier-1 occupancy (snapshotted every simulated second), and eviction count. + +### 3.3 Macrobenchmark Acceptance Thresholds + +| Metric | Target | Hard Fail | +|--------|--------|-----------| +| Avg read latency (native) | < 3 us | > 10 us | +| P95 read latency (native) | < 10 us | > 50 us | +| P99 read latency (native) | < 25 us | > 100 us | +| Avg read latency (WASM) | < 30 us | > 100 us | +| P95 read latency (WASM) | < 100 us | > 500 us | +| Bytes stored per token | < 2.5 bytes | > 4 bytes | +| Tier churn per block per min | < 0.1 avg | > 0.5 | +| Tier-1 byte usage | Under cap always | Any violation | + +--- + +## 4. Acceptance Thresholds (Critical) + +These gate merges to main. Any violation blocks the PR. + +### 4.1 Latency + +| Metric | Target | +|--------|--------| +| Tier-1 dequant latency (16KB block, native) | < 2 us | +| Tier-3 dequant latency (16KB block, native) | < 5 us | +| WASM dequant latency (16KB block, Node.js) | < 50 us | + +**Derivation**: A 16KB block requires 4096 multiplies. On AVX2 at 3.5 GHz (8 f32/cycle), the theoretical floor is ~146 ns. The 2 us target provides 14x headroom for unpacking, memory access, and loop overhead while staying well under the 10 us inference-impact threshold. The WASM 50 us target reflects measured 8-12x V8 overhead plus a 2x safety margin. 
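The arithmetic in this derivation can be checked with a few lines of Rust. This is a sketch of the calculation only: the function names are hypothetical, and applying the 8-12x overhead and the 2x margin to the 2 us native target is one reading of how the 50 us figure was reached.

```rust
/// Lower bound in nanoseconds for one multiply per element, assuming
/// `lanes` f32 multiplies retire per cycle at `clock_ghz`.
pub fn dequant_floor_ns(elements: u32, lanes: u32, clock_ghz: f64) -> f64 {
    let cycles = f64::from(elements) / f64::from(lanes);
    cycles / clock_ghz // a GHz clock executes `clock_ghz` cycles per ns
}

/// Headroom factor between a latency target and the theoretical floor.
pub fn headroom(target_ns: f64, floor_ns: f64) -> f64 {
    target_ns / floor_ns
}

/// WASM budget in microseconds: native target x overhead x safety margin.
/// (Assumed interpretation of the 50 us target.)
pub fn wasm_budget_us(native_target_us: f64, overhead: f64, safety: f64) -> f64 {
    native_target_us * overhead * safety
}
```

With the values from the text, `dequant_floor_ns(4096, 8, 3.5)` is 512 cycles at 3.5 cycles per nanosecond, roughly 146 ns, and `headroom(2000.0, floor)` is about 13.7, which the ADR rounds to 14x. The WASM budget spans 32 us to 48 us across the 8-12x overhead range, both under the 50 us target.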
+ +### 4.2 Stability + +| Metric | Target | +|--------|--------| +| Tier churn per block per min | < 0.1 avg | +| Tier-1 byte budget | Under configured cap | +| Segment boundary rate | < 1 per 100 frames (stable tensor) | + +**Derivation**: At 0.1 transitions/block/min with 1M blocks, total transitions are ~1,667/sec. At ~5-10 us each, this consumes <2% CPU. At 1.0/block/min it becomes 8-17%, which is unacceptable. + +### 4.3 Quality Thresholds + +| Tier | Bits | Max MSE (normalized) | Max Relative Error | +|------|------|---------------------|-------------------| +| Hot (8-bit) | 8 | < 0.0001 | < 0.8% | +| Warm (7-bit) | 7 | < 0.0004 | < 1.6% | +| Warm (5-bit) | 5 | < 0.004 | < 6.5% | +| Cold (3-bit) | 3 | < 0.03 | < 30% | + +MSE normalized by squared L2-norm of original block. Relative error is max element-wise error divided by block max absolute value. + +--- + +## 5. Primary Acceptance Test + +### 5.1 Configuration + +``` +blocks: 1,000,000 accesses: 10,000,000 distribution: Zipf(1.2) +tier1_byte_cap: 2GB block_size: 16KB group_len: 64 +hot_min_score: 512 warm_min_score: 64 hysteresis: 32 +min_residency: 60 drift_pct_q8: 26 max_delta_chain: 8 +``` + +### 5.2 Pass Criteria + +The simulation PASSES if and only if all three hold simultaneously: +1. **Budget**: Tier-1 holds under configured byte cap at every epoch snapshot. +2. **Stability**: Average tier flips per block per minute < 0.1. +3. **Latency**: P95 read latency stays within tier target on host. 
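
A minimal sketch of how the three criteria might be evaluated together is shown here. `RunSummary`, `percentile`, and `passes` are hypothetical names chosen for illustration; the nearest-rank percentile is an assumption, since this ADR does not fix a percentile method.

```rust
/// Hypothetical summary of one simulation run; field names are illustrative.
pub struct RunSummary {
    /// Tier-1 bytes observed at each epoch snapshot.
    pub tier1_bytes_per_epoch: Vec<u64>,
    /// Average tier flips per block per simulated minute.
    pub avg_flips_per_block_min: f64,
    /// Read latencies in nanoseconds.
    pub read_latencies_ns: Vec<u64>,
}

/// Nearest-rank percentile for q in (0, 1]; None for empty input.
pub fn percentile(samples: &[u64], q: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let rank = (q * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// True only if budget, stability, and latency all hold at once.
pub fn passes(run: &RunSummary, tier1_cap: u64, p95_target_ns: u64) -> bool {
    // 1. Budget: no snapshot may exceed the configured cap.
    let budget_ok = run.tier1_bytes_per_epoch.iter().all(|&b| b <= tier1_cap);
    // 2. Stability: average churn strictly below 0.1 flips/block/min.
    let stability_ok = run.avg_flips_per_block_min < 0.1;
    // 3. Latency: P95 below the tier target; an empty run cannot pass.
    let latency_ok = percentile(&run.read_latencies_ns, 0.95)
        .map_or(false, |p95| p95 < p95_target_ns);
    budget_ok && stability_ok && latency_ok
}
```

Treating an empty latency sample as a failure is a deliberate choice in this sketch: a run that recorded no reads should not be able to pass the gate by default.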
+ +### 5.3 Zipf Simulation Pseudocode + +``` +function run_zipf_simulation(config): + store = BlockStore::new(config.tier1_byte_cap) + blocks = Array[config.num_blocks] + for i in 0..config.num_blocks: + blocks[i] = generate_random_f32_block(config.block_size) + store.ingest(block_id=i, data=blocks[i], initial_tier=COLD) + + zipf = ZipfDistribution::new(config.num_blocks, config.alpha) + rng = StableRng::seed(42) + + latencies = Vec::new() + tier_flips = Array[config.num_blocks].fill(0) + prev_tier = Array[config.num_blocks].fill(COLD) + epoch_snapshots = Vec::new() + sim_clock = 0 + + for access in 0..config.num_accesses: + block_id = zipf.sample(rng) + sim_clock += 1 + + t_start = precise_now() + tier = store.current_tier(block_id) + data = store.read_block(block_id, sim_clock) + t_end = precise_now() + latencies.push(t_end - t_start) + + if tier != prev_tier[block_id]: + tier_flips[block_id] += 1 + prev_tier[block_id] = tier + + if access % config.maintenance_interval == 0: + store.run_maintenance_tick(sim_clock) + if access % config.snapshot_interval == 0: + epoch_snapshots.push(EpochSnapshot { + sim_clock, tier1_bytes: store.tier1_bytes(), + tier2_bytes: store.tier2_bytes(), + tier3_bytes: store.tier3_bytes(), + }) + + sim_minutes = sim_clock / config.ticks_per_minute + results = SimulationResults { + avg_latency: mean(latencies), + p95_latency: percentile(latencies, 0.95), + p99_latency: percentile(latencies, 0.99), + avg_churn: mean(tier_flips) / sim_minutes, + budget_violated: any(s.tier1_bytes > config.tier1_byte_cap for s in epoch_snapshots), + } + + // Quality sampling: 1000 blocks per tier + for tier in [HOT, WARM, COLD]: + for id in store.sample_block_ids(tier, 1000): + reconstructed = store.read_block(id, sim_clock) + results.quality[tier].push(mse(blocks[id], reconstructed)) + return results + +function assert_pass(results, config): + assert !results.budget_violated // Criterion 1 + assert results.avg_churn < 0.1 // Criterion 2 + assert 
results.p95_latency < config.p95 // Criterion 3 + for tier, samples in results.quality: + for mse in samples: + assert mse < config.mse_threshold[tier] // Criterion 4 +``` + +### 5.4 Reproducibility + +Fixed RNG seed (42), Zipf-Mandelbrot inverse CDF, monotonic clock (`Instant::now()`), CPU frequency scaling disabled or handled by Criterion warm-up. + +--- + +## 6. Failure Modes and Fix Paths + +### 6.1 Thrashing + +- **Symptom**: Tier flips > 0.1/block/min; excessive segment boundaries +- **Root cause**: Hysteresis too small; tau too large causing score oscillation +- **Fix**: Increase hysteresis (32 to 64+), increase min_residency (60 to 120+ ticks), reduce tau + +### 6.2 Delta Chain Blowup + +- **Symptom**: P95 read latency > 10x tier target; growing read amplification +- **Root cause**: Delta chains not compacted; unbounded chain growth +- **Fix**: Compact when chain exceeds max_delta_chain (default 8); schedule in maintenance tick; hard cap forces sync compaction on read at 2x max + +### 6.3 Scale Instability + +- **Symptom**: MSE exceeds threshold on bimodal/heavy-tailed tensors +- **Root cause**: Single per-group scale insufficient for outlier distributions +- **Fix**: Enable two-level scale for 3-bit; reduce group_len to 32 for affected blocks; clamp outliers at 3-sigma with sparse correction side-channel + +### 6.4 Hot Set Misprediction + +- **Symptom**: Tier-1 byte usage exceeds configured cap +- **Root cause**: Scoring promotes too many blocks; hot_min_score too low +- **Fix**: Raise t1_threshold, lower w_pop, enforce per-tier byte cap with LRU eviction, add feedback loop (auto-raise threshold when eviction rate exceeds N/sec) + +### 6.5 Checksum Corruption + +- **Symptom**: CRC32 mismatch on read +- **Root cause**: Bit flip in storage; partial write; pack/unpack bug +- **Fix**: Rehydrate from delta chain if available; attempt factor decomposition recovery; else mark Unrecoverable and emit alert metric; enable background scrubbing on idle blocks + +--- + 
+## 7. Benchmark Harness Design + +### 7.1 Microbenchmarks (Criterion.rs) + +``` +crates/ruvector-temporal-tensor/benches/ + quantize.rs -- per bit width + dequantize.rs -- per bit width + bitpack.rs -- pack/unpack per bit width + tier_policy.rs -- scoring and tier decision + f16_conversion.rs -- f32<->f16 + segment.rs -- encode/decode round-trip + maintenance.rs -- maintenance tick with N candidates +``` + +Input data: fixed seed (42), standard normal scaled to [-1.0, 1.0]. Median is the primary statistic. Regression detected when new CI lower bound exceeds baseline upper bound by >5%. + +### 7.2 Zipf Simulation (Custom Rust) + +Located at `crates/ruvector-temporal-tensor/tests/zipf_simulation.rs`. Supports `--quick` (100K blocks, 1M accesses, ~30s) for PR checks and `--full` (1M blocks, 10M accesses, ~5-10min) for nightly. Outputs JSON for CI and human-readable summary to stdout. Configurable via env vars (`ZIPF_BLOCKS`, `ZIPF_ACCESSES`, `ZIPF_ALPHA`). + +### 7.3 WASM Benchmarks + +Built with `wasm-pack build --release --target nodejs`. Node.js runner calls each FFI function in a 10,000-iteration loop, measured with `process.hrtime.bigint()`. Reports median, P95, P99 and computes WASM/native overhead ratio. + +--- + +## 8. 
CI Integration Guidelines + +### 8.1 Pipeline Stages + +| Stage | Trigger | Timeout | Scope | +|-------|---------|---------|-------| +| PR check | Every PR | 10 min | Criterion quick, Zipf quick, quality | +| Nightly | 02:00 UTC | 30 min | Full Criterion, Zipf full, WASM, quality sweep | +| Release gate | Tag push | 45 min | All benchmarks, cross-platform, WASM + native | + +### 8.2 Regression Detection + +```yaml +benchmark-check: + steps: + - run: cargo bench --bench '*' -- --output-format bencher | tee output.txt + - run: python scripts/bench_compare.py --baseline .bench_baseline.json + --current output.txt --threshold 0.10 --fail-on-regression + - run: cargo test --release --test zipf_simulation -- --quick +``` + +Baselines committed as `.bench_baseline.json` on main. Updated only on architecture-team-reviewed PRs that modify quantization or storage code. Comparison: `(new_median - baseline) / baseline`; fail at 10% for latency, 20% for throughput. + +### 8.3 Alerting + +| Condition | Action | +|-----------|--------| +| PR regression > 10% | Block merge; PR comment | +| Nightly regression > 15% | GitHub issue: `perf-regression` | +| Zipf simulation failure | GitHub issue: `acceptance-failure` | +| WASM overhead > 15x native | GitHub issue: `wasm-performance` | +| Quality violation | Block merge/release | + +--- + +## 9. 
SOTA Integration Benchmarks + +### 9.1 Reference Systems + +| System | Year | Key Result | +|--------|------|-----------| +| **RIPPLE++** | 2026 | Tens of thousands of updates/sec, sub-ms latency for incremental graph computation | +| **OMEGA** | 2025 | Sub-ms GNN inference via selective recompute | +| **STAG** | 2025 | Additivity-based incremental propagation; linear scaling with delta size | + +### 9.2 Comparison + +| Metric | Temporal Tensor Store | RIPPLE++ | OMEGA | STAG | +|--------|----------------------|----------|-------|------| +| Single read | < 2-5 us | N/A (graph) | ~100 us | ~50 us | +| Batch update (1000) | < 1 ms | ~10 ms | ~5 ms | ~2 ms | +| Memory/element | 0.375-1.0 B | 8 B | 4-8 B | 4 B | + +The store targets block-level compression rather than graph-level computation but shares the sub-millisecond incremental update goal. The maintenance tick budget (<1ms for 1000 candidates) is competitive. + +--- + +## 10. Test Scenarios + +### 10.1 Scenario Matrix + +| ID | Purpose | Blocks | Accesses | Distribution | +|----|---------|--------|----------|-------------| +| S1 | Baseline: uniform access | 10K | 1M | Uniform | +| S2 | Primary acceptance (Zipf) | 1M | 10M | Zipf(1.2) | +| S3 | High skew stress | 1M | 10M | Zipf(2.0) | +| S4 | Temporal shift (rotating hot set) | 100K | 5M | Rotating Zipf | +| S5 | Burst access pattern | 100K | 2M | Burst + uniform | +| S6 | Severe memory constraint (100MB cap) | 1M | 10M | Zipf(1.2) | +| S7 | Outlier/bimodal tensors | 10K | 500K | Zipf(1.2) | +| S8 | Stable tensors (near-zero drift) | 10K | 500K | Zipf(1.2) | + +### 10.2 Per-Scenario Pass Criteria + +| ID | Pass Condition | +|----|---------------| +| S1 | All blocks converge to same tier within 2x access count | +| S2 | Full acceptance test (Section 5.2) | +| S3 | Tier-1 < 5% of blocks; no budget violation | +| S4 | Churn < 0.2/block/min despite rotation | +| S5 | P95 spike during burst < 2x steady-state P95 | +| S6 | Zero OOM; cap held; avg latency < 5x 
unconstrained | +| S7 | MSE for bimodal blocks < 2x threshold | +| S8 | Segment count per block < 1.1 | + +--- + +## 11. Risks and Mitigations + +| Risk | Severity | Mitigation | +|------|----------|------------| +| CI noise causes false regressions | Medium | 2% Criterion noise threshold; require 3 consecutive failures; pin CI hardware | +| Zipf simulation too slow for PR | Medium | Quick mode (~30s); full mode nightly only | +| WASM results platform-dependent | Low | Pin Node.js version; accept 20% variance | +| Baseline drift over time | Medium | Rebaseline quarterly or on hardware change | + +--- + +## 12. Implementation Roadmap + +**Phase 1 (Week 1)**: Criterion benchmarks for all Section 2 operations; initial baselines; `bench_compare.py` script; PR pipeline integration. + +**Phase 2 (Week 1-2)**: Zipf simulation with quick/full modes and JSON output; nightly pipeline integration. + +**Phase 3 (Week 2)**: WASM Node.js benchmark runner; WASM-specific baselines; nightly pipeline. + +**Phase 4 (Week 2-3)**: Failure mode detectors (thrashing counter, delta chain monitor, quality sampler, corruption injection test); wire into simulation harness. + +**Phase 5 (Week 3)**: CI hardening (pinned hardware, nightly scheduling, alerting, release-gate workflow). + +--- + +## 13. References + +1. Frantar et al. "GPTQ: Accurate Post-Training Quantization." ICLR 2023. +2. Lin et al. "AWQ: Activation-aware Weight Quantization." MLSys 2024. +3. Criterion.rs documentation. https://bheisler.github.io/criterion.rs/ +4. Gray. "The Benchmark Handbook." Morgan Kaufmann, 1993. +5. Pelkonen et al. "Gorilla: In-Memory Time Series Database." VLDB 2015. +6. Li et al. "RIPPLE++: Incremental Graph Computation." SIGMOD 2026. +7. Chen et al. "OMEGA: Selective Recompute for Low-Latency GNN Serving." OSDI 2025. +8. Wang et al. "STAG: Additivity-Based Incremental Graph Propagation." VLDB 2025. +9. ADR-017: Temporal Tensor Compression. RuVector Architecture Team, 2026. +10. 
ADR-018: Block-Based Storage Engine. RuVector Architecture Team, 2026. diff --git a/docs/architecture/temporal-tensor-store-ddd.md b/docs/architecture/temporal-tensor-store-ddd.md new file mode 100644 index 000000000..31648a681 --- /dev/null +++ b/docs/architecture/temporal-tensor-store-ddd.md @@ -0,0 +1,1792 @@ +# Temporal Tensor Store: Domain-Driven Design Architecture + +**Version**: 0.1 +**Date**: 2026-02-08 +**Status**: Draft +**Parent ADRs**: ADR-017, ADR-018, ADR-019, ADR-020, ADR-021, ADR-022, ADR-023 + +--- + +## Strategic Design + +### Domain Vision + +The Temporal Tensor Store unifies caching, compression, and eviction into a single primitive. Each tensor chunk has an access history. Access history drives tier choice. Tier choice drives quantization bits and whether data stays materialized, stays compressed, or becomes reconstructable only via factors or deltas. + +> **This is not a cache.** The system answers: "At what fidelity should this block exist right now?" not "Is this block present?" + +The fundamental insight is that tensors in agent workloads exhibit temporal locality: most frames reuse the same value distribution, and access frequency decays predictably. By treating quantization tier as a continuous lifecycle state rather than a static configuration, the store compresses data in proportion to its staleness while guaranteeing bounded reconstruction error at every tier. + +### Core Domain + +**Tensor Lifecycle Management** -- The heart of the system. Manages the full lifecycle of tensor blocks from creation through tiered compression to eviction (compression to zero). Every block transitions through a state machine: Created -> Hot -> Warm -> Cold -> Evicted. The transition function is driven by a composite access score and bounded by configurable hysteresis to prevent oscillation. + +### Supporting Domains + +1. **Quantization Domain** -- Bit-packing, scale computation, encode/decode. 
Owns the mathematical transforms that convert between f32 values and packed bitstream representations at arbitrary bit widths (3, 5, 7, 8). Manages groupwise symmetric quantization with f16 scales. + +2. **Scoring & Migration Domain** -- Access tracking, score computation, tier decisions. Owns the temporal access profile for each block and the policy that maps scores to tiers. Responsible for maintenance scheduling and budgeted tick processing. + +3. **Storage Domain** -- Block IO, metadata persistence, checksums. Owns the physical layout of tier data files, the metadata log for crash recovery, and the in-memory index structures for fast lookup. + +### Generic Domains + +1. **Clock/Time** -- Tick-based time progression. Provides a monotonic tick counter that all scoring and maintenance operations reference. Decoupled from wall-clock time for deterministic replay. + +2. **Metrics/Witness** -- Audit logging, decision witnesses. Records every tiering decision with sufficient context to reconstruct the reasoning (score at time of decision, thresholds applied, resulting tier). Enables post-hoc analysis without affecting hot-path performance. + +3. **Configuration** -- Policy management. Versioned, immutable policy bundles that define thresholds, group sizes, drift tolerances, and tier-to-bit mappings. Policy changes create new bundles; active bundles cannot be modified. 
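
Two of these generic domains lend themselves to a short illustration: a logical clock that only moves forward, and a policy bundle that is replaced rather than mutated. The sketch below is an assumption about how they could look. `TickClock`, `PolicyBundle`, and their methods are hypothetical names, and the three policy fields are borrowed from the acceptance-test configuration in ADR-023 purely as examples.

```rust
/// Monotonic logical clock: ticks only move forward and are independent
/// of wall-clock time, so a replay with the same inputs sees the same ticks.
#[derive(Debug, Default)]
pub struct TickClock {
    now: u64,
}

impl TickClock {
    pub fn now(&self) -> u64 {
        self.now
    }

    /// Advance by one tick and return the new value.
    pub fn advance(&mut self) -> u64 {
        self.now += 1;
        self.now
    }
}

/// Immutable policy bundle: private fields, read-only accessors, no setters.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PolicyBundle {
    version: u32,
    hot_min_score: u32,
    warm_min_score: u32,
    hysteresis: u32,
}

impl PolicyBundle {
    pub fn new(hot_min_score: u32, warm_min_score: u32, hysteresis: u32) -> Self {
        Self { version: 1, hot_min_score, warm_min_score, hysteresis }
    }

    pub fn version(&self) -> u32 { self.version }
    pub fn hot_min_score(&self) -> u32 { self.hot_min_score }
    pub fn warm_min_score(&self) -> u32 { self.warm_min_score }
    pub fn hysteresis(&self) -> u32 { self.hysteresis }

    /// A policy change yields a new bundle with the next version number;
    /// the bundle it was derived from is left untouched.
    pub fn with_hysteresis(&self, hysteresis: u32) -> Self {
        Self { version: self.version + 1, hysteresis, ..self.clone() }
    }
}
```

For example, `PolicyBundle::new(512, 64, 32).with_hysteresis(64)` returns a version 2 bundle while the original stays at version 1 with hysteresis 32, so a witness entry that recorded version 1 can still be interpreted later.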
+ +--- + +## Ubiquitous Language + +| Term | Definition | +|------|------------| +| **Block** | Fixed-size chunk of a tensor (16KB/32KB), the atomic unit of storage and tiering | +| **Tier** | Quantization level: Hot (8-bit), Warm (7/5-bit), Cold (3-bit), Absent (0-bit/evicted) | +| **Touch** | Record an access event on a block, incrementing its access count and updating its timestamp | +| **Score** | Composite metric combining EMA, popcount, and recency: `access_count * 1024 / (age + 1)` | +| **Drift** | When a tensor's value distribution changes beyond the scale tolerance, forcing a new segment | +| **Eviction** | Compression to zero bits; only metadata survives. Data is reconstructable via deltas or factors | +| **Reconstruction** | Rebuilding evicted data from delta chains or low-rank factor sets | +| **Compaction** | Collapsing a delta chain into a new base block to bound chain length | +| **Witness** | Audit log entry recording a tiering or eviction decision with full context | +| **Tick** | Time quantum for maintenance budget processing; one tick = one unit of the logical clock | +| **Segment** | Multi-frame compressed blob sharing quantization scales; the on-disk unit for temporal data | +| **Group** | Contiguous slice of tensor elements sharing one quantization scale (default: 64 elements) | +| **Scale** | f16 value representing `max(|v_i|) / qmax` for a group; shared across all frames in a segment | +| **qmax** | Maximum quantized integer for a bit width: `2^(bits-1) - 1` (127, 63, 15, 3 for 8/7/5/3-bit) | +| **Frame** | One tensor snapshot at a point in time; the input unit for temporal compression | + +--- + +## Bounded Contexts + +### Bounded Context Map + +``` ++============================================================================+ +| TEMPORAL TENSOR STORE | ++============================================================================+ +| | +| +--------------------+ +---------------------+ | +| | BC1: BLOCK | | BC2: QUANTIZATION | | +| | 
MANAGEMENT |<----->| CONTEXT | | +| | |Shared | (codec_bits, quant) | | +| | - TensorBlock |Kernel | | | +| | - BlockMeta | | - QuantizationCodec | | +| | - State machine | | - BitPacking | | +| | - Lifecycle | | - f16 conversion | | +| +--------+-----------+ +----------+----------+ | +| | | | +| | Shared | Shared | +| | Kernel | Kernel | +| v | | +| +--------------------+ | | +| | BC3: TEMPORAL |<----------------+ | +| | SCORING CONTEXT | | +| | | | +| | - AccessProfile | | +| | - TierPolicy | | +| | - Maintenance | | +| +--------+------------+ | +| | | +| | Customer/Supplier | +| v | +| +--------------------+ +---------------------+ | +| | BC4: STORAGE | | BC5: DELTA & | | +| | ENGINE CONTEXT |<----->| RECONSTRUCTION | | +| | |Cust/ | CONTEXT | | +| | - TieredStore |Suppl | | | +| | - BlockIO | | - DeltaChain | | +| | - MetaLog | | - FactorSet | | +| | - Index | | - Reconstruction | | +| +--------------------+ +---------------------+ | +| | ++============================================================================+ + +Integration Patterns: + <-----> Shared Kernel (shared types, co-owned) + ------> Customer/Supplier (downstream consumes upstream API) + ======> Published Language (stable, versioned contract) +``` + +### Event Flow Diagram + +``` + External Write Timer Tick + | | + v v + +----------+ +-------------+ + | BC1: | BlockAccessed | BC3: | + | Block |---------------->| Temporal | + | Mgmt | | Scoring | + +----+-----+ +------+------+ + | | + | BlockCreated | TierUpgradeRequested + | BlockTierChanged | TierDowngradeRequested + v v + +----------+ +-------------+ + | BC2: | quantize() | BC3: | + | Quant |<----------------| choose_tier | + | Context | +------+------+ + +----+-----+ | + | | MaintenanceCompleted + | packed bytes v + v +-------------+ + +----------+ | BC4: | + | BC4: | BlockWritten | Storage | + | Storage |<----------------| Engine | + | Engine | +------+------+ + +----+-----+ | + | | BlockEvicted + | BlockDeleted v + v +-------------+ + 
+----------+ | BC5: | + | BC5: | DeltaAppended | Delta & | + | Delta & |<----------------| Recon | + | Recon | +-------------+ + +----------+ +``` + +--- + +## Bounded Context 1: Block Management Context + +### Purpose + +Responsible for tensor block lifecycle: creation, chunking, metadata management, identity. This is the aggregate that owns the block state machine and enforces the invariant that blocks transition through tiers in a well-defined order. + +### Ubiquitous Language + +| Term | Definition | +|------|------------| +| **TensorBlock** | Aggregate root owning a block's identity, metadata, and state | +| **BlockKey** | Composite identity: (tensor_id: u128, block_index: u32) | +| **BlockMeta** | All metadata for a block: tier, checksums, timestamps, access stats | +| **TensorIdentity** | The parent tensor: id, shape, dtype, lineage parent | +| **BlockData** | The raw quantized bytes for a block at its current tier | + +### Aggregates + +#### TensorBlock (Aggregate Root) + +```rust +/// The primary aggregate root for the Block Management context. +/// Owns the full lifecycle of a tensor block from creation through eviction. +/// +/// Invariants: +/// - block_bytes must match configured block size +/// - checksum must be valid (CRC32 of quantized data) +/// - state transitions follow: Created -> Hot -> Warm -> Cold -> Evicted +/// - tier can only degrade by one step per maintenance tick (hysteresis) +/// - block_key is immutable after creation +pub struct TensorBlock { + /// Composite identity: (tensor_id, block_index) + key: BlockKey, + /// All metadata fields + meta: BlockMeta, + /// Current quantized data (None if evicted) + data: Option<BlockData>, + /// Reference to parent tensor identity + tensor_identity: TensorIdentity, + /// Domain events pending publication + pending_events: Vec<BlockDomainEvent>, +} + +impl TensorBlock { + /// Create a new block from raw f32 data. + /// Initial tier is determined by the current access profile.
+ pub fn create( + key: BlockKey, + identity: TensorIdentity, + raw_data: &[f32], + initial_tier: Tier, + now_tick: u64, + ) -> Result<Self, BlockError> { + let data = BlockData::from_raw(raw_data, initial_tier)?; + let checksum = Checksum::compute(&data.bytes); + + let meta = BlockMeta { + tier: initial_tier, + checksum, + created_at: now_tick, + last_accessed_at: now_tick, + last_tier_change_at: now_tick, + access_count: 0, + byte_size: data.bytes.len() as u32, + reconstruct_policy: ReconstructPolicy::None, + }; + + let mut block = Self { + key, + meta, + data: Some(data), + tensor_identity: identity, + pending_events: Vec::new(), + }; + + block.pending_events.push(BlockDomainEvent::BlockCreated { + key, + tier: initial_tier, + tick: now_tick, + }); + + Ok(block) + } + + /// Record an access. Updates count and timestamp. + pub fn touch(&mut self, now_tick: u64) { + self.meta.access_count = self.meta.access_count.wrapping_add(1); + self.meta.last_accessed_at = now_tick; + self.pending_events.push(BlockDomainEvent::BlockAccessed { + key: self.key, + tick: now_tick, + }); + } + + /// Transition to a new tier. Enforces hysteresis invariant. + pub fn change_tier( + &mut self, + new_tier: Tier, + new_data: Option<BlockData>, + now_tick: u64, + ) -> Result<(), BlockError> { + if new_tier == self.meta.tier { + return Ok(()); + } + + let old_tier = self.meta.tier; + self.meta.tier = new_tier; + self.meta.last_tier_change_at = now_tick; + self.data = new_data; + + if new_tier == Tier::Absent { + self.meta.reconstruct_policy = ReconstructPolicy::DeltaChain; + } + + self.pending_events.push(BlockDomainEvent::BlockTierChanged { + key: self.key, + old_tier, + new_tier, + tick: now_tick, + }); + + Ok(()) + } + + /// Evict the block: data is dropped, metadata retained.
+ pub fn evict(&mut self, now_tick: u64) -> Result<(), BlockError> { + if self.meta.tier == Tier::Absent { + return Err(BlockError::AlreadyEvicted); + } + + self.pending_events.push(BlockDomainEvent::BlockEvicted { + key: self.key, + previous_tier: self.meta.tier, + tick: now_tick, + }); + + self.meta.tier = Tier::Absent; + self.meta.last_tier_change_at = now_tick; + self.meta.reconstruct_policy = ReconstructPolicy::DeltaChain; + self.data = None; + + Ok(()) + } + + /// Verify data integrity via checksum. + pub fn verify_checksum(&self) -> bool { + match &self.data { + Some(data) => Checksum::compute(&data.bytes) == self.meta.checksum, + None => true, // Evicted blocks have no data to verify + } + } + + /// Drain pending domain events for publication. + pub fn take_events(&mut self) -> Vec { + std::mem::take(&mut self.pending_events) + } +} +``` + +### Entities + +```rust +/// Identity of the parent tensor that this block belongs to. +pub struct TensorIdentity { + /// Unique tensor identifier (128-bit UUID) + pub id: u128, + /// Shape of the full tensor (e.g., [1024, 768]) + pub shape: Shape, + /// Data type of the original tensor + pub dtype: DType, + /// Optional lineage parent (for delta chains) + pub lineage_parent: Option, +} + +/// Raw quantized bytes for a block at a specific tier. +pub struct BlockData { + /// Packed quantized bytes + pub bytes: Vec, + /// Tier at which this data was quantized + pub quantized_at_tier: Tier, +} + +impl BlockData { + pub fn from_raw(data: &[f32], tier: Tier) -> Result { + // Delegate to QuantizationCodec for encoding + let bytes = Vec::new(); // placeholder: actual encoding via BC2 + Ok(Self { + bytes, + quantized_at_tier: tier, + }) + } +} +``` + +### Value Objects + +```rust +/// Composite block identity. Immutable after creation. +#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)] +pub struct BlockKey { + pub tensor_id: u128, + pub block_index: u32, +} + +/// Quantization tier determining bit width and compression level. 
+#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)] +pub enum Tier { + /// 8-bit quantization, ~4.0x compression + Hot = 0, + /// 7-bit or 5-bit quantization, ~4.57x or ~6.4x compression + Warm = 1, + /// 3-bit quantization, ~10.67x compression + Cold = 2, + /// Evicted: 0 bits, metadata only, reconstructable via deltas/factors + Absent = 3, +} + +/// Element data type of the original tensor. +#[derive(Clone, Copy, PartialEq, Eq, Debug)] +pub enum DType { + F32, + F16, + BF16, + I8, +} + +/// Policy for reconstructing evicted block data. +#[derive(Clone, Copy, PartialEq, Eq, Debug)] +pub enum ReconstructPolicy { + /// No reconstruction available (data loss accepted) + None, + /// Reconstruct from delta chain (base block + deltas) + DeltaChain, + /// Reconstruct from low-rank factors (U * S * V^T) + LowRankFactors, + /// Reconstruct from both deltas and factors (best-effort) + Hybrid, +} + +/// Tensor shape descriptor. +#[derive(Clone, Debug, PartialEq, Eq)] +pub struct Shape(pub Vec); + +/// CRC32 checksum for data integrity. 
+#[derive(Clone, Copy, PartialEq, Eq, Debug)] +pub struct Checksum(pub u32); + +impl Checksum { + pub fn compute(data: &[u8]) -> Self { + let mut crc: u32 = 0xFFFF_FFFF; + for &byte in data { + crc ^= byte as u32; + for _ in 0..8 { + if crc & 1 != 0 { + crc = (crc >> 1) ^ 0xEDB8_8320; + } else { + crc >>= 1; + } + } + } + Self(!crc) + } +} +``` + +### Domain Events + +| Event | Trigger | Payload | Consumers | +|-------|---------|---------|-----------| +| `BlockCreated` | New block materialized | key, tier, tick | Storage Engine, Scoring | +| `BlockAccessed` | Touch on a block | key, tick | Temporal Scoring | +| `BlockTierChanged` | Tier transition | key, old_tier, new_tier, tick | Storage, Metrics | +| `BlockEvicted` | Block compressed to zero | key, previous_tier, tick | Delta & Reconstruction | +| `BlockCorrupted` | Checksum mismatch | key, expected, actual | Alerting, Recovery | +| `BlockCompacted` | Delta chain collapsed | key, new_base_tier, tick | Storage Engine | + +```rust +#[derive(Clone, Debug)] +pub enum BlockDomainEvent { + BlockCreated { key: BlockKey, tier: Tier, tick: u64 }, + BlockAccessed { key: BlockKey, tick: u64 }, + BlockTierChanged { key: BlockKey, old_tier: Tier, new_tier: Tier, tick: u64 }, + BlockEvicted { key: BlockKey, previous_tier: Tier, tick: u64 }, + BlockCorrupted { key: BlockKey, expected: Checksum, actual: Checksum }, + BlockCompacted { key: BlockKey, new_base_tier: Tier, tick: u64 }, +} +``` + +--- + +## Bounded Context 2: Quantization Context + +### Purpose + +Responsible for all encoding/decoding operations across bit widths. Owns the groupwise symmetric quantization algorithm, f16 scale management, and bitstream packing. This context is a **shared kernel** with Block Management: both contexts reference the same quantization types, but the Quantization Context owns the encode/decode logic. 
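
The groupwise symmetric scheme named above can be illustrated with a minimal sketch. It is not the crate's implementation: the function names `quantize_group`, `dequantize_group`, and `quantize_groupwise` are hypothetical, scales are kept as plain `f32` rather than f16 bit patterns, codes are left unpacked as `i8`, and `bits` is assumed to be one of {3, 5, 7, 8}.

```rust
/// qmax for a signed symmetric code of `bits` width: 2^(bits-1) - 1.
fn qmax(bits: u8) -> i32 {
    (1i32 << (bits - 1)) - 1
}

/// Quantize one group. Returns the group scale and the signed codes.
/// The scale is max(abs(v_i)) / qmax, so every code lies in [-qmax, qmax].
fn quantize_group(values: &[f32], bits: u8) -> (f32, Vec<i8>) {
    let qmax = qmax(bits) as f32;
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // An all-zero group gets scale 0 and all-zero codes (avoids dividing by 0).
    if max_abs == 0.0 {
        return (0.0, vec![0; values.len()]);
    }
    let scale = max_abs / qmax;
    let codes = values
        .iter()
        .map(|v| (v / scale).round().clamp(-qmax, qmax) as i8)
        .collect();
    (scale, codes)
}

/// Inverse of `quantize_group`: code * scale.
fn dequantize_group(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

/// Quantize a flat tensor in groups of `group_len` elements (must be > 0).
/// The last group may be shorter than `group_len`.
fn quantize_groupwise(values: &[f32], group_len: usize, bits: u8) -> Vec<(f32, Vec<i8>)> {
    values
        .chunks(group_len)
        .map(|group| quantize_group(group, bits))
        .collect()
}
```

Because each group carries its own scale, an outlier only degrades the precision of its own group; the round-trip error per element is bounded by roughly half the group scale.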
+ +### Ubiquitous Language + +| Term | Definition | +|------|------------| +| **QuantizationCodec** | Aggregate root encapsulating format selection and parameters | +| **QuantParams** | Value object: bits, scale, zero_point (always 0 for symmetric), group_len | +| **PackedBlock** | Value object: encoded bytes with format metadata | +| **GroupScale** | f16 scale for a group: `max(abs(v_i)) / qmax` | + +### Aggregates + +#### QuantizationCodec (Aggregate Root) + +```rust +/// Encapsulates groupwise symmetric quantization for all supported bit widths. +/// +/// Invariants: +/// - bits must be one of {3, 5, 7, 8} +/// - group_len must be >= 1 +/// - scales are stored as f16 (u16 bit pattern) to minimize metadata overhead +/// - qmax = 2^(bits-1) - 1 +pub struct QuantizationCodec { + /// Bit width for quantization + bits: u8, + /// Elements per quantization group + group_len: usize, + /// Cached qmax value + qmax: i32, +} + +impl QuantizationCodec { + pub fn new(bits: u8, group_len: usize) -> Self { + let qmax = qmax_from_bits(bits); + Self { bits, group_len, qmax } + } + + /// Quantize f32 values to packed bytes with f16 group scales. + /// + /// Returns (scales_f16, packed_bytes). + pub fn quantize(&self, values: &[f32]) -> (Vec, Vec) { + let scales = compute_scales(values, self.group_len, self.bits); + let scales_f32 = scales_to_f32(&scales); + let mut packed = Vec::new(); + quantize_and_pack_f32(values, &scales_f32, self.group_len, self.bits, &mut packed); + (scales, packed) + } + + /// Dequantize packed bytes back to f32 values. + pub fn dequantize( + &self, + packed: &[u8], + scales_f16: &[u16], + tensor_len: usize, + frame_count: usize, + ) -> Vec { + let scales_f32 = scales_to_f32(scales_f16); + let mut out = Vec::new(); + dequantize_f32( + packed, &scales_f32, self.group_len, + self.bits, tensor_len, frame_count, &mut out, + ); + out + } + + /// Check if a frame fits within existing scales (drift tolerance). 
+ pub fn frame_fits_scales( + &self, + frame: &[f32], + scales_f32: &[f32], + drift_factor: f32, + ) -> bool { + frame_fits_scales_f32(frame, scales_f32, self.group_len, self.bits, drift_factor) + } +} + +/// Compute qmax for a given bit width: 2^(bits-1) - 1. +/// Returns 0 for invalid bit widths (0 or >8). +#[inline] +pub fn qmax_from_bits(bits: u8) -> i32 { + if bits == 0 || bits > 8 { return 0; } + (1i32 << (bits - 1)) - 1 +} +``` + +### Value Objects + +```rust +/// Quantization parameters for a single encoding operation. +#[derive(Clone, Debug, PartialEq)] +pub struct QuantParams { + /// Bit width (3, 5, 7, or 8) + pub bits: u8, + /// f16-encoded group scales (one per group) + pub scales_f16: Vec, + /// Cached f32 conversion of scales (for hot-path use) + pub scales_f32: Vec, + /// Elements per group + pub group_len: usize, +} + +/// Packed quantized block with format metadata. +#[derive(Clone, Debug)] +pub struct PackedBlock { + /// Packed bitstream bytes + pub bytes: Vec, + /// Quantization parameters used + pub params: QuantParams, + /// Number of frames encoded + pub frame_count: u32, + /// Number of f32 elements per frame + pub tensor_len: u32, +} + +/// Two-level scale for hierarchical quantization (future extension). +#[derive(Clone, Debug, PartialEq)] +pub struct TwoLevelScale { + pub primary_scale: f32, + pub secondary_scale: f32, + pub flags: u8, +} +``` + +### Domain Services + +```rust +/// Service orchestrating encode/decode for all quantization formats. 
+pub struct QuantizationService { + /// Codec instances keyed by bit width + codecs: [QuantizationCodec; 4], // indices 0-3 for bits 3,5,7,8 +} + +impl QuantizationService { + pub fn new(group_len: usize) -> Self { + Self { + codecs: [ + QuantizationCodec::new(3, group_len), + QuantizationCodec::new(5, group_len), + QuantizationCodec::new(7, group_len), + QuantizationCodec::new(8, group_len), + ], + } + } + + pub fn codec_for_tier(&self, tier: Tier) -> &QuantizationCodec { + match tier { + Tier::Hot => &self.codecs[3], // 8-bit + Tier::Warm => &self.codecs[2], // 7-bit (configurable to 5-bit) + Tier::Cold => &self.codecs[0], // 3-bit + Tier::Absent => &self.codecs[0], // N/A but provide fallback + } + } +} + +/// Service for packing and unpacking arbitrary-width bit codes. +pub struct BitPackingService; + +impl BitPackingService { + /// Pack unsigned codes of `bits` width into a byte stream. + /// Uses a 64-bit accumulator with no alignment padding. + pub fn pack(codes: &[u32], bits: u32, out: &mut Vec) { + let mut acc: u64 = 0; + let mut acc_bits: u32 = 0; + for &code in codes { + acc |= (code as u64) << acc_bits; + acc_bits += bits; + while acc_bits >= 8 { + out.push((acc & 0xFF) as u8); + acc >>= 8; + acc_bits -= 8; + } + } + if acc_bits > 0 { + out.push((acc & 0xFF) as u8); + } + } + + /// Unpack `count` unsigned codes of `bits` width from a byte stream. 
+ pub fn unpack(data: &[u8], bits: u32, count: usize, out: &mut Vec) { + let mask = (1u64 << bits) - 1; + let mut acc: u64 = 0; + let mut acc_bits: u32 = 0; + let mut byte_idx = 0usize; + let mut decoded = 0usize; + while decoded < count { + while acc_bits < bits && byte_idx < data.len() { + acc |= (data[byte_idx] as u64) << acc_bits; + acc_bits += 8; + byte_idx += 1; + } + if acc_bits < bits { break; } + out.push((acc & mask) as u32); + acc >>= bits; + acc_bits -= bits; + decoded += 1; + } + } +} +``` + +--- + +## Bounded Context 3: Temporal Scoring Context + +### Purpose + +Responsible for access tracking, score computation, tier selection, and hysteresis. Owns the per-block access profile and the policy that determines when blocks migrate between tiers. This context is a **shared kernel** with Block Management: the scoring context produces tier recommendations that Block Management consumes. + +### Ubiquitous Language + +| Term | Definition | +|------|------------| +| **AccessProfile** | Aggregate root tracking per-block access history | +| **Score** | Composite metric: `access_count * 1024 / (age + 1)` | +| **AccessWindow** | u64 bitset representing access pattern over recent ticks | +| **EMARate** | Exponential moving average decay rate for smoothed scoring | +| **TierPolicy** | Configurable thresholds mapping scores to tiers | + +### Aggregates + +#### AccessProfile (Aggregate Root) + +```rust +/// Tracks per-block access history and computes tiering decisions. 
+///
+/// Invariants:
+/// - access_count increases by one per touch (wrapping at u32::MAX)
+/// - last_access_at <= current tick
+/// - tier_age tracks ticks since last tier change (hysteresis input)
+/// - ema_rate is in (0.0, 1.0]
+pub struct AccessProfile {
+    /// Block this profile tracks
+    key: BlockKey,
+    /// Exponential moving average decay rate
+    ema_rate: f32,
+    /// Sliding window bitset: bit i = access in tick (now - i)
+    window: u64,
+    /// Total access count (wrapping)
+    access_count: u32,
+    /// Tick of last access
+    last_access_at: u64,
+    /// Ticks since last tier change (for hysteresis)
+    tier_age: u64,
+    /// Current tier as determined by last scoring
+    current_tier: Tier,
+    /// Pending domain events
+    pending_events: Vec<ScoringDomainEvent>,
+}
+
+impl AccessProfile {
+    pub fn new(key: BlockKey, initial_tier: Tier, now_tick: u64) -> Self {
+        Self {
+            key,
+            ema_rate: 0.9,
+            window: 0,
+            access_count: 0,
+            last_access_at: now_tick,
+            tier_age: 0,
+            current_tier: initial_tier,
+            pending_events: Vec::new(),
+        }
+    }
+
+    /// Record an access event. Shifts the window and sets the current bit.
+    pub fn touch(&mut self, now_tick: u64) {
+        let elapsed = now_tick.saturating_sub(self.last_access_at);
+        // Shift by the elapsed ticks; a gap of 64 ticks or more clears the window.
+        // try_from avoids silently truncating a large u64 gap to a small shift.
+        self.window = u32::try_from(elapsed)
+            .ok()
+            .and_then(|shift| self.window.checked_shl(shift))
+            .unwrap_or(0);
+        self.window |= 1;
+        self.access_count = self.access_count.wrapping_add(1);
+        self.last_access_at = now_tick;
+
+        self.pending_events.push(ScoringDomainEvent::AccessRecorded {
+            key: self.key,
+            tick: now_tick,
+        });
+    }
+
+    /// Compute the composite access score.
+    pub fn compute_score(&self, now_tick: u64) -> f32 {
+        let age = now_tick.saturating_sub(self.last_access_at) + 1;
+        let popcount = self.window.count_ones() as f32;
+        let recency = self.access_count as f32 * 1024.0 / age as f32;
+        let ema_weight = self.ema_rate;
+
+        // Composite: weighted combination of popcount and recency
+        ema_weight * recency + (1.0 - ema_weight) * popcount * 64.0
+    }
+
+    /// Determine the recommended tier based on current score.
+ pub fn choose_tier(&mut self, now_tick: u64, policy: &TierPolicy) -> Tier { + let score = self.compute_score(now_tick); + let score_u32 = score as u32; + + let recommended = if score_u32 >= policy.hot_min_score { + Tier::Hot + } else if score_u32 >= policy.warm_min_score { + Tier::Warm + } else { + Tier::Cold + }; + + if recommended != self.current_tier { + let old = self.current_tier; + self.current_tier = recommended; + self.tier_age = 0; + + let event = if recommended > old { + ScoringDomainEvent::TierDowngradeRequested { + key: self.key, + from: old, + to: recommended, + score, + tick: now_tick, + } + } else { + ScoringDomainEvent::TierUpgradeRequested { + key: self.key, + from: old, + to: recommended, + score, + tick: now_tick, + } + }; + self.pending_events.push(event); + } else { + self.tier_age += 1; + } + + recommended + } + + pub fn take_events(&mut self) -> Vec { + std::mem::take(&mut self.pending_events) + } +} +``` + +#### TierPolicy (Value Object, from implementation) + +```rust +/// Configurable scoring weights and thresholds for tier selection. +/// Directly corresponds to the TierPolicy struct in tier_policy.rs. +/// +/// Score = access_count * 1024 / (now_ts - last_access_ts + 1) +/// +/// | Tier | Condition | Bits | +/// |------|---------------------------|------| +/// | Hot | score >= hot_min_score | 8 | +/// | Warm | score >= warm_min_score | warm_bits (7 or 5) | +/// | Cold | otherwise | 3 | +#[derive(Clone, Copy, Debug)] +pub struct TierPolicy { + pub hot_min_score: u32, + pub warm_min_score: u32, + pub warm_bits: u8, + /// Drift tolerance as Q8 fixed-point. 26 means ~10.2% (26/256). + pub drift_pct_q8: u32, + pub group_len: u32, +} + +impl Default for TierPolicy { + fn default() -> Self { + Self { + hot_min_score: 512, + warm_min_score: 64, + warm_bits: 7, + drift_pct_q8: 26, + group_len: 64, + } + } +} + +impl TierPolicy { + /// Select bit width based on access pattern. 
+    pub fn select_bits(&self, access_count: u32, last_access_ts: u32, now_ts: u32) -> u8 {
+        // wrapping_add yields 0 when the timestamps are exactly u32::MAX apart;
+        // clamp to 1 so the division below cannot panic.
+        let age = now_ts.wrapping_sub(last_access_ts).wrapping_add(1).max(1);
+        let score = access_count.saturating_mul(1024) / age;
+        if score >= self.hot_min_score {
+            8
+        } else if score >= self.warm_min_score {
+            self.warm_bits
+        } else {
+            3
+        }
+    }
+
+    /// Drift factor: 1.0 + drift_pct_q8/256
+    pub fn drift_factor(&self) -> f32 {
+        1.0 + (self.drift_pct_q8 as f32) / 256.0
+    }
+}
+```
+
+### Domain Services
+
+```rust
+/// Budgeted tick processing: processes a limited number of blocks per tick
+/// to avoid latency spikes during maintenance windows.
+pub struct MaintenanceScheduler {
+    /// Maximum blocks to process per tick
+    budget_per_tick: usize,
+    /// Round-robin cursor into the block list
+    cursor: usize,
+    /// Tick counter
+    current_tick: u64,
+}
+
+impl MaintenanceScheduler {
+    pub fn new(budget_per_tick: usize) -> Self {
+        Self { budget_per_tick, cursor: 0, current_tick: 0 }
+    }
+
+    /// Process one maintenance tick. Returns the set of tier-change recommendations.
+ pub fn tick( + &mut self, + profiles: &mut [AccessProfile], + policy: &TierPolicy, + ) -> Vec { + self.current_tick += 1; + let mut events = Vec::new(); + let n = profiles.len().min(self.budget_per_tick); + + for _ in 0..n { + if self.cursor >= profiles.len() { + self.cursor = 0; + } + let profile = &mut profiles[self.cursor]; + profile.choose_tier(self.current_tick, policy); + events.extend(profile.take_events()); + self.cursor += 1; + } + + events.push(ScoringDomainEvent::MaintenanceCompleted { + tick: self.current_tick, + blocks_processed: n as u32, + }); + + events + } +} +``` + +### Domain Events + +| Event | Trigger | Consumers | +|-------|---------|-----------| +| `AccessRecorded` | Block touched | Score recomputation | +| `ScoreComputed` | Periodic scoring pass | Tier decision | +| `TierUpgradeRequested` | Score crossed upward threshold | Block Management | +| `TierDowngradeRequested` | Score dropped below threshold | Block Management | +| `MaintenanceCompleted` | Tick budget exhausted | Metrics | + +```rust +#[derive(Clone, Debug)] +pub enum ScoringDomainEvent { + AccessRecorded { key: BlockKey, tick: u64 }, + ScoreComputed { key: BlockKey, score: f32, tick: u64 }, + TierUpgradeRequested { key: BlockKey, from: Tier, to: Tier, score: f32, tick: u64 }, + TierDowngradeRequested { key: BlockKey, from: Tier, to: Tier, score: f32, tick: u64 }, + MaintenanceCompleted { tick: u64, blocks_processed: u32 }, +} +``` + +--- + +## Bounded Context 4: Storage Engine Context + +### Purpose + +Responsible for persistent block IO, metadata logging, and index management. Owns the physical layout of tier data, the append-only metadata log for crash recovery, and the in-memory index structures (HashMap + per-tier candidate lists + min-heap for eviction). 
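
To make the crash-recovery path concrete, the following sketch replays an append-only log into an in-memory index and derives the per-tier candidate lists from it. It is an illustration under simplifying assumptions, not the engine's code: `Key`, `Record`, `Meta`, `replay`, and `tier_lists` are hypothetical stand-ins for `BlockKey`, `MetaRecord`, `BlockMeta`, and the index rebuild, and the `Absent` tier and the eviction heap are left out.

```rust
use std::collections::HashMap;

/// Simplified block key: (tensor_id, block_index).
type Key = (u128, u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Tier {
    Hot = 0,
    Warm = 1,
    Cold = 2,
}

/// Simplified log record, mirroring the Write/Delete/TierChange shape.
#[derive(Clone, Debug)]
enum Record {
    Write { key: Key, tier: Tier, byte_size: u32 },
    Delete { key: Key },
    TierChange { key: Key, new_tier: Tier },
}

#[derive(Clone, Debug, PartialEq)]
struct Meta {
    tier: Tier,
    byte_size: u32,
}

/// Replay the log in append order; later records override earlier ones.
fn replay(log: &[Record]) -> HashMap<Key, Meta> {
    let mut index = HashMap::new();
    for rec in log {
        match rec {
            Record::Write { key, tier, byte_size } => {
                index.insert(*key, Meta { tier: *tier, byte_size: *byte_size });
            }
            Record::Delete { key } => {
                index.remove(key);
            }
            Record::TierChange { key, new_tier } => {
                // A tier change for an unknown key is ignored in this sketch.
                if let Some(meta) = index.get_mut(key) {
                    meta.tier = *new_tier;
                }
            }
        }
    }
    index
}

/// Derive per-tier candidate lists from the rebuilt index, sorted by key
/// so the result is deterministic regardless of HashMap iteration order.
fn tier_lists(index: &HashMap<Key, Meta>) -> [Vec<Key>; 3] {
    let mut lists: [Vec<Key>; 3] = [Vec::new(), Vec::new(), Vec::new()];
    for (key, meta) in index {
        lists[meta.tier as usize].push(*key);
    }
    for list in &mut lists {
        list.sort();
    }
    lists
}
```

Deriving the tier lists from the index after replay, rather than maintaining them during replay, keeps the two structures consistent by construction.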
+ +### Ubiquitous Language + +| Term | Definition | +|------|------------| +| **TieredStore** | Aggregate root managing all storage operations | +| **BlockIO** | Trait for reading/writing block data to tier-specific storage | +| **MetaLog** | Append-only log of metadata records for crash recovery | +| **StoreLayout** | Directory paths per tenant/collection | + +### Aggregates + +#### TieredStore (Aggregate Root) + +```rust +/// Manages tier data files, metadata log, and in-memory index. +/// +/// Invariants: +/// - Every block in the index has a valid metadata record in the log +/// - Per-tier candidate lists are consistent with the index +/// - Eviction candidates are ordered by score (min-heap) +/// - Checksums are verified on read (configurable) +pub struct TieredStore { + /// Primary index: BlockKey -> BlockMeta + index: HashMap, + /// Per-tier candidate lists for migration scanning + tier_lists: [Vec; 4], // Hot, Warm, Cold, Absent + /// Min-heap for eviction candidates (sorted by score ascending) + eviction_heap: BinaryHeap>, + /// Block IO backend (trait object for testability) + io: Box, + /// Metadata log for crash recovery + meta_log: Box, + /// Clock source + clock: Box, + /// Pending domain events + pending_events: Vec, +} + +impl TieredStore { + /// Write a block to its tier. Updates index and meta log atomically. + pub fn write_block( + &mut self, + key: BlockKey, + tier: Tier, + data: &[u8], + meta: BlockMeta, + ) -> Result<(), StoreErr> { + self.io.write_block(tier, key, data)?; + self.meta_log.append(MetaRecord::Write { key, tier, meta: meta.clone() })?; + self.index.insert(key, meta); + self.tier_lists[tier as usize].push(key); + + self.pending_events.push(StorageDomainEvent::BlockWritten { + key, tier, byte_count: data.len() as u32, + }); + + Ok(()) + } + + /// Read a block from its tier. Optionally verifies checksum. 
+    pub fn read_block(
+        &self,
+        key: BlockKey,
+        verify_checksum: bool,
+    ) -> Result<Vec<u8>, StoreErr> {
+        let meta = self.index.get(&key)
+            .ok_or(StoreErr::NotFound(key))?;
+
+        let mut buf = vec![0u8; meta.byte_size as usize];
+        let n = self.io.read_block(meta.tier, key, &mut buf)?;
+        buf.truncate(n);
+
+        if verify_checksum {
+            let actual = Checksum::compute(&buf);
+            if actual != meta.checksum {
+                return Err(StoreErr::ChecksumMismatch { key, expected: meta.checksum, actual });
+            }
+        }
+
+        Ok(buf)
+    }
+
+    /// Delete a block from storage. Metadata is retained in the log.
+    pub fn delete_block(&mut self, key: BlockKey) -> Result<(), StoreErr> {
+        let meta = self.index.get(&key)
+            .ok_or(StoreErr::NotFound(key))?;
+        let tier = meta.tier;
+
+        self.io.delete_block(tier, key)?;
+        self.meta_log.append(MetaRecord::Delete { key, tier })?;
+        self.index.remove(&key);
+        // Keep the per-tier candidate list consistent with the index.
+        self.tier_lists[tier as usize].retain(|k| *k != key);
+
+        self.pending_events.push(StorageDomainEvent::BlockDeleted { key, tier });
+
+        Ok(())
+    }
+
+    /// Rebuild index from metadata log (crash recovery).
+    pub fn rebuild_index(&mut self) -> Result<u64, StoreErr> {
+        self.index.clear();
+        for list in &mut self.tier_lists {
+            list.clear();
+        }
+
+        let mut count = 0u64;
+        // Replay meta log to reconstruct index
+        // (implementation depends on MetaLog backend)
+        self.pending_events.push(StorageDomainEvent::IndexRebuilt { entries: count });
+
+        Ok(count)
+    }
+}
+```
+
+### Repository Interfaces (Traits)
+
+```rust
+/// Block-level IO operations. Implemented by filesystem, memory, or AgentDB backends.
+pub trait BlockIO {
+    /// Reads into `dst` and returns the number of bytes read.
+    fn read_block(&self, tier: Tier, key: BlockKey, dst: &mut [u8]) -> Result<usize, StoreErr>;
+    fn write_block(&mut self, tier: Tier, key: BlockKey, src: &[u8]) -> Result<(), StoreErr>;
+    fn delete_block(&mut self, tier: Tier, key: BlockKey) -> Result<(), StoreErr>;
+}
+
+/// Append-only metadata log for crash recovery and audit.
+pub trait MetaLog { + fn append(&mut self, rec: MetaRecord) -> Result<(), StoreErr>; + fn get(&self, key: BlockKey) -> Option; + fn iter(&self) -> Box + '_>; +} + +/// Clock abstraction for deterministic testing and replay. +pub trait Clock { + fn now_ticks(&self) -> u64; +} +``` + +### Value Objects + +```rust +/// Physical storage layout per tenant/collection. +#[derive(Clone, Debug)] +pub struct StoreLayout { + pub hot_dir: String, + pub warm_dir: String, + pub cold_dir: String, + pub meta_log_path: String, +} + +/// Metadata record for the append-only log. +#[derive(Clone, Debug)] +pub enum MetaRecord { + Write { key: BlockKey, tier: Tier, meta: BlockMeta }, + Delete { key: BlockKey, tier: Tier }, + TierChange { key: BlockKey, old_tier: Tier, new_tier: Tier }, +} + +/// Block metadata (all non-data fields). +#[derive(Clone, Debug)] +pub struct BlockMeta { + pub tier: Tier, + pub checksum: Checksum, + pub created_at: u64, + pub last_accessed_at: u64, + pub last_tier_change_at: u64, + pub access_count: u32, + pub byte_size: u32, + pub reconstruct_policy: ReconstructPolicy, +} +``` + +### Domain Events + +| Event | Trigger | Consumers | +|-------|---------|-----------| +| `BlockWritten` | Block stored to tier | Metrics | +| `BlockRead` | Block retrieved from tier | Metrics, Scoring (touch) | +| `BlockDeleted` | Block removed from storage | Index cleanup | +| `MetaLogAppended` | New record in meta log | Crash recovery | +| `IndexRebuilt` | Index reconstructed from log | Startup, Recovery | +| `ChecksumFailed` | CRC mismatch on read | Alerting, Block Management | + +```rust +#[derive(Clone, Debug)] +pub enum StorageDomainEvent { + BlockWritten { key: BlockKey, tier: Tier, byte_count: u32 }, + BlockRead { key: BlockKey, tier: Tier }, + BlockDeleted { key: BlockKey, tier: Tier }, + MetaLogAppended { record_type: &'static str }, + IndexRebuilt { entries: u64 }, + ChecksumFailed { key: BlockKey, expected: Checksum, actual: Checksum }, +} +``` + +--- + +## Bounded 
Context 5: Delta & Reconstruction Context + +### Purpose + +Responsible for delta writes, delta chain management, factor storage, and reconstruction. When a block is evicted (Tier::Absent), it becomes reconstructable via a delta chain (base block + ordered deltas) or low-rank factor sets (U, S, V matrices). This context owns the chain length invariant and the compaction operation that collapses long chains. + +### Ubiquitous Language + +| Term | Definition | +|------|------------| +| **DeltaChain** | Aggregate root: base block reference + ordered list of deltas | +| **DeltaRecord** | Sparse vector: pairs of (index, quantized value) with delta_scale | +| **FactorSet** | Low-rank matrices (U, S, V) for reconstruction via U * S * V^T | +| **Compaction** | Collapsing a delta chain into a new base block | +| **SparseEntry** | Single (index: u16, value: i16) pair in a delta | + +### Aggregates + +#### DeltaChain (Aggregate Root) + +```rust +/// A chain of deltas anchored to a base block. +/// +/// Invariants: +/// - chain_length <= max_delta_chain (configurable, default 8) +/// - deltas are ordered by epoch (ascending) +/// - base block reference must be valid (either materialized or itself a chain) +/// - compaction produces a new base block and resets the chain +pub struct DeltaChain { + /// Block this chain belongs to + key: BlockKey, + /// Reference to the base block (tier and epoch) + base_ref: BaseBlockRef, + /// Ordered list of deltas from base + deltas: Vec, + /// Maximum allowed chain length before forced compaction + max_chain_length: usize, + /// Pending domain events + pending_events: Vec, +} + +impl DeltaChain { + pub fn new(key: BlockKey, base_ref: BaseBlockRef, max_chain_length: usize) -> Self { + Self { + key, + base_ref, + deltas: Vec::new(), + max_chain_length, + pending_events: Vec::new(), + } + } + + /// Append a new delta to the chain. + /// Returns Err if chain is at max length (must compact first). 
+    pub fn append_delta(&mut self, delta: DeltaRecord) -> Result<(), DeltaError> {
+        if self.deltas.len() >= self.max_chain_length {
+            return Err(DeltaError::ChainFull {
+                key: self.key,
+                length: self.deltas.len(),
+                max: self.max_chain_length,
+            });
+        }
+
+        self.pending_events.push(DeltaDomainEvent::DeltaAppended {
+            key: self.key,
+            epoch: delta.header.base_epoch,
+            nnz: delta.entries.len() as u32,
+        });
+
+        self.deltas.push(delta);
+        Ok(())
+    }
+
+    /// Apply the full chain to reconstruct the current block data.
+    /// Starts from the base block and applies each delta in order.
+    /// Entries whose index falls outside `base_data` are skipped.
+    pub fn apply_chain(&self, base_data: &mut [f32]) -> Result<(), DeltaError> {
+        for delta in &self.deltas {
+            for entry in &delta.entries {
+                let idx = entry.index as usize;
+                if idx < base_data.len() {
+                    let delta_val = (entry.value as f32) * delta.header.delta_scale;
+                    base_data[idx] += delta_val;
+                }
+            }
+        }
+
+        Ok(())
+    }
+
+    /// Compact the chain: collapse all deltas into the base block.
+    /// Returns the new base data for storage.
+    pub fn compact(&mut self, base_data: &mut [f32]) -> Result<Vec<f32>, DeltaError> {
+        self.apply_chain(base_data)?;
+        let compacted = base_data.to_vec();
+
+        self.pending_events.push(DeltaDomainEvent::ChainCompacted {
+            key: self.key,
+            collapsed_deltas: self.deltas.len() as u32,
+        });
+
+        self.deltas.clear();
+        Ok(compacted)
+    }
+
+    /// Current chain length.
+    pub fn chain_length(&self) -> usize {
+        self.deltas.len()
+    }
+
+    /// Whether compaction is needed.
+    pub fn needs_compaction(&self) -> bool {
+        self.deltas.len() >= self.max_chain_length
+    }
+
+    pub fn take_events(&mut self) -> Vec<DeltaDomainEvent> {
+        std::mem::take(&mut self.pending_events)
+    }
+}
+```
+
+### Entities
+
+```rust
+/// A single delta record: sparse vector of changes from the previous state.
+#[derive(Clone, Debug)] +pub struct DeltaRecord { + /// Header with provenance metadata + pub header: DeltaHeader, + /// Sparse entries: (index, quantized delta value) + pub entries: Vec, +} + +/// Low-rank factor set for reconstruction via U * diag(S) * V^T. +/// Used when the block was evicted but its structure can be approximated +/// by a low-rank decomposition. +#[derive(Clone, Debug)] +pub struct FactorSet { + /// Left singular vectors (rows x rank) + pub u_matrix: Vec, + /// Singular values (rank) + pub s_values: Vec, + /// Right singular vectors (rank x cols) + pub v_matrix: Vec, + /// Rank of the approximation + pub rank: u32, + /// Original tensor dimensions + pub rows: u32, + pub cols: u32, +} + +impl FactorSet { + /// Reconstruct the full tensor from factors. + pub fn reconstruct(&self) -> Vec { + let mut result = vec![0.0f32; (self.rows * self.cols) as usize]; + for r in 0..self.rank as usize { + let s = self.s_values[r]; + for i in 0..self.rows as usize { + let u_val = self.u_matrix[i * self.rank as usize + r] * s; + for j in 0..self.cols as usize { + let v_val = self.v_matrix[r * self.cols as usize + j]; + result[i * self.cols as usize + j] += u_val * v_val; + } + } + } + result + } +} +``` + +### Value Objects + +```rust +/// Header for a delta record with provenance metadata. +#[derive(Clone, Debug)] +pub struct DeltaHeader { + pub tensor_id: u128, + pub block_index: u32, + pub base_epoch: u64, + /// Number of non-zero entries + pub nnz: u32, + /// Scale factor for quantized delta values + pub delta_scale: f32, +} + +/// Single sparse entry in a delta: (index, quantized value). +#[derive(Clone, Copy, Debug)] +pub struct SparseEntry { + /// Index into the block data (0-based) + pub index: u16, + /// Quantized delta value (signed) + pub value: i16, +} + +/// Reference to a base block for delta chain anchoring. 
+#[derive(Clone, Debug)] +pub struct BaseBlockRef { + pub key: BlockKey, + pub tier: Tier, + pub epoch: u64, +} +``` + +### Domain Events + +| Event | Trigger | Consumers | +|-------|---------|-----------| +| `DeltaAppended` | New delta added to chain | Storage Engine | +| `ChainCompacted` | Delta chain collapsed | Block Management, Storage | +| `FactorStored` | Low-rank factors computed and saved | Storage Engine | +| `ReconstructionAttempted` | Block rebuild from chain/factors | Metrics | +| `ReconstructionFailed` | Rebuild failed (missing base/factors) | Alerting | + +```rust +#[derive(Clone, Debug)] +pub enum DeltaDomainEvent { + DeltaAppended { key: BlockKey, epoch: u64, nnz: u32 }, + ChainCompacted { key: BlockKey, collapsed_deltas: u32 }, + FactorStored { key: BlockKey, rank: u32 }, + ReconstructionAttempted { key: BlockKey, method: ReconstructPolicy }, + ReconstructionFailed { key: BlockKey, reason: String }, +} +``` + +--- + +## Context Map (Integration Patterns) + +``` +Block Management <--[Shared Kernel]--> Quantization + - Shared types: BlockKey, Tier, DType, Checksum + - Co-owned by both teams; changes require bilateral agreement + - Boundary: QuantizationCodec is owned by BC2, TensorBlock by BC1 + +Block Management <--[Shared Kernel]--> Temporal Scoring + - Shared types: BlockKey, Tier, BlockMeta + - Scoring produces tier recommendations; Block Mgmt enforces transitions + - Boundary: AccessProfile is owned by BC3, TensorBlock by BC1 + +Block Management <--[Customer/Supplier]--> Storage Engine + - BC1 (customer) calls BC4 (supplier) for persistence + - BC4 provides stable BlockIO and MetaLog traits + - BC1 depends on BC4's write guarantees; BC4 is independent + +Block Management <--[Customer/Supplier]--> Delta & Reconstruction + - BC1 (customer) requests reconstruction from BC5 (supplier) + - BC5 provides apply_chain() and reconstruct() operations + - BC5 depends on BC4 (Storage) for reading base blocks + +Temporal Scoring <--[Conformist]--> Storage 
Engine + - BC3 reads metadata from BC4's index; conforms to BC4's data model + - BC3 does not write to storage; read-only conformist + +Storage Engine <--[Published Language]--> WASM API (host bindings) + - The FFI layer (ffi.rs) provides a stable C ABI + - Host code calls ttc_create, ttc_push_frame, ttc_flush, ttc_decode_segment + - Handle-based resource management (Vec>) +``` + +### Context Map Diagram + +``` ++--------------------+ Shared Kernel +--------------------+ +| |<========================>| | +| BC1: Block | BlockKey, Tier, DType | BC2: Quantization | +| Management | Checksum | Context | +| | | | ++--------+-----------+ Shared Kernel +--------------------+ + | |<========================>| + | | BlockKey, Tier, | + | | BlockMeta | + | +------+ | + | | | + | Customer | BC3: Temporal | + | /Supplier | Scoring Context | + v +--------+----------+ ++--------------------+ | +| BC4: Storage |<--------------+ Conformist (reads metadata) +| Engine Context | +| | Published Language +| BlockIO, MetaLog |=========================> WASM API (ffi.rs) ++--------+-----------+ + | + | Customer/Supplier + v ++--------------------+ +| BC5: Delta & | +| Reconstruction | +| Context | ++--------------------+ +``` + +--- + +## Rust Module Mapping + +### Crate-to-Bounded-Context Mapping + +``` +Crate Bounded Context(s) +-----------------------------------+------------------------------------------ +temporal_tensor_store BC1 (Block Management) + orchestration + src/lib.rs Public API, re-exports + src/compressor.rs BC1: TemporalTensorCompressor aggregate + +quant (ruvector-temporal-tensor) BC2 (Quantization) + src/quantizer.rs Groupwise symmetric quantization + src/bitpack.rs Bitstream packer/unpacker + src/f16.rs Software f16 conversion + +tiering (ruvector-temporal-tensor) BC3 (Temporal Scoring) + src/tier_policy.rs TierPolicy, score computation + +codec_bits (shared) BC2 (Quantization, shared kernel) + src/bitpack.rs pack(), unpack(), qmax_from_bits() + +metrics 
(ruvector-metrics) Cross-cutting (witnesses, audit) + +wasm_api BC4 (Storage, WASM layer) + src/ffi.rs Handle store, extern "C" exports +``` + +### Module Structure + +``` +crates/ruvector-temporal-tensor/ ++-- Cargo.toml ++-- src/ + +-- lib.rs # Public API (BC1 orchestration) + +-- compressor.rs # BC1: TemporalTensorCompressor aggregate root + +-- tier_policy.rs # BC3: TierPolicy, score computation + +-- quantizer.rs # BC2: Groupwise symmetric quantization + +-- bitpack.rs # BC2: Bitstream packer/unpacker (shared kernel) + +-- f16.rs # BC2: Software f16 conversion (shared kernel) + +-- segment.rs # BC4: Segment encode/decode, binary format + +-- ffi.rs # BC4: WASM FFI, handle-based store + +crates/ruvector-temporal-tensor-wasm/ ++-- Cargo.toml # wasm32-unknown-unknown target ++-- src/ + +-- lib.rs # Re-exports FFI functions for WASM +``` + +### Dependency Graph + +``` +ruvector-temporal-tensor (zero external deps) ++-- bitpack.rs (no deps) ++-- f16.rs (no deps) ++-- quantizer.rs (depends on: bitpack, f16) ++-- tier_policy.rs (no deps) ++-- segment.rs (depends on: quantizer) ++-- compressor.rs (depends on: quantizer, segment, tier_policy) ++-- ffi.rs (depends on: compressor, segment, tier_policy) + +ruvector-temporal-tensor-wasm ++-- ruvector-temporal-tensor (the only dependency) +``` + +--- + +## Anti-Corruption Layers + +### WASM FFI Anti-Corruption Layer + +The `ffi.rs` module provides an ACL between the host environment and the domain model. The host interacts exclusively through opaque handles (u32 indices into `Vec>`), raw pointers, and C-compatible scalars. 
The ACL translates these into domain operations: + +```rust +// Host calls this C ABI function: +extern "C" fn ttc_push_frame( + handle: u32, // opaque handle + now_ts: u32, // scalar timestamp + in_ptr: *const f32, // raw pointer to frame data + len: u32, // frame length + out_ptr: *mut u8, // output buffer + out_cap: u32, // output capacity + out_written: *mut u32, // bytes written +); + +// ACL translates to domain operation: +// compressor.push_frame(&frame_slice, now_ts, &mut segment_vec) +``` + +### AgentDB Integration Adapter + +When integrating with AgentDB for persistent segment storage, an adapter implements the `BlockIO` trait, translating between the Temporal Tensor Store's domain model and AgentDB's key-value API: + +```rust +/// Adapter implementing BlockIO over AgentDB's KV store. +pub struct AgentDbBlockIO { + db: AgentDbClient, + tenant: String, +} + +impl BlockIO for AgentDbBlockIO { + fn read_block(&self, tier: Tier, key: BlockKey, dst: &mut [u8]) -> Result<usize, StoreErr> { + let db_key = format!("{}:{}:{}", self.tenant, key.tensor_id, key.block_index); + let data = self.db.get(&db_key)?; + let n = data.len().min(dst.len()); + dst[..n].copy_from_slice(&data[..n]); + Ok(n) + } + + fn write_block(&mut self, tier: Tier, key: BlockKey, src: &[u8]) -> Result<(), StoreErr> { + let db_key = format!("{}:{}:{}", self.tenant, key.tensor_id, key.block_index); + self.db.put(&db_key, src, &[("tier", &tier.as_str())])?; + Ok(()) + } + + fn delete_block(&mut self, tier: Tier, key: BlockKey) -> Result<(), StoreErr> { + let db_key = format!("{}:{}:{}", self.tenant, key.tensor_id, key.block_index); + self.db.delete(&db_key)?; + Ok(()) + } +} +``` + +### Coherence Engine Integration + +The Coherence Engine (ADR-014, ADR-015) integrates via an event-driven boundary. 
When the coherence engine detects structural disagreement for a tensor, it emits a `DriftDetected` event that the Temporal Tensor Store consumes to force segment boundaries: + +```rust +use std::collections::HashMap; + +/// Event handler bridging Coherence Engine events to Temporal Tensor Store. +pub struct CoherenceBridge { + compressors: HashMap<u128, TemporalTensorCompressor>, +} + +impl CoherenceBridge { + /// Called when coherence engine detects tensor drift. + pub fn on_coherence_drift(&mut self, tensor_id: u128) -> Vec<Vec<u8>> { + let mut flushed_segments = Vec::new(); + if let Some(comp) = self.compressors.get_mut(&tensor_id) { + let mut seg = Vec::new(); + comp.flush(&mut seg); + if !seg.is_empty() { + flushed_segments.push(seg); + } + } + flushed_segments + } +} +``` + +--- + +## Relationship to ADR-016 Delta-Behavior DDD + +The Temporal Tensor Store DDD and the Delta-Behavior DDD (ADR-016) are complementary systems that share a conceptual boundary around the notion of "delta" but operate at different abstraction levels. + +### Shared Concepts + +| Concept | ADR-016 (Delta-Behavior) | This DDD (Temporal Tensor Store) | +|---------|--------------------------|----------------------------------| +| **Delta** | Immutable record of differential change between two vector states | Sparse vector of (index, quantized_value) pairs within a block | +| **Ordering** | Causal ordering via Lamport timestamps | Epoch ordering within a chain | +| **Compaction** | Checkpoint creation to bound replay | Chain collapse into new base block | +| **Temporal window** | DeltaWindow for batching within time/count | Temporal Segment for amortizing scales across frames | + +### Key Differences + +1. **Granularity**: ADR-016 operates on full vector states (embeddings, graph nodes). The Temporal Tensor Store operates on fixed-size blocks (16KB/32KB chunks of tensors). + +2. **Compression model**: ADR-016 delta vectors are sparse diffs between states. 
The Temporal Tensor Store uses quantization-based compression where "delta" is a secondary mechanism for evicted blocks only. + +3. **Distribution model**: ADR-016 is designed for distributed propagation across nodes. The Temporal Tensor Store is designed for local storage tiering within a single node. + +4. **ADR-016 term mapping**: What ADR-016 calls a "DeltaCheckpoint" maps to what this DDD calls a "base block" in a delta chain. ADR-016's "DeltaGraph" (DAG of dependencies) maps to the chain ordering invariant in BC5. + +### Integration Surface + +The two systems integrate at the **Delta & Reconstruction Context (BC5)**. When a block is evicted from the Temporal Tensor Store, the delta chain mechanism shares the same conceptual foundation as ADR-016's delta capture: + +``` +ADR-016 Delta-Behavior System + | + | DeltaVector (sparse change) + v +BC5: Delta & Reconstruction Context + | + | DeltaRecord (sparse entries + quantized scale) + v +BC4: Storage Engine +``` + +ADR-016's `DeltaChecksum` (tamper-evident chaining) can be adopted by BC5 for verifying delta chain integrity. ADR-016's `DeltaWindow` concept informs the Temporal Tensor Store's segment boundary logic (both batch changes within a temporal window to amortize metadata). 
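The chained-checksum idea mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ADR-016 `DeltaChecksum` definition: `ChainedDelta`, `append_delta`, and `verify_chain` are hypothetical names, the `(u16, i16)` entry layout is a guess at "(index, quantized_value)", strictly increasing epochs are assumed, and FNV-1a stands in for the hash. FNV-1a only catches accidental corruption; genuine tamper evidence would need a cryptographic hash.

```rust
/// Hypothetical stand-in for a BC5 delta record (field names are illustrative).
#[derive(Clone, Debug)]
pub struct ChainedDelta {
    pub epoch: u64,
    /// Sparse (index, quantized_value) pairs; the integer widths are assumed.
    pub entries: Vec<(u16, i16)>,
    /// Link checksum covering the previous link, the epoch, and the entries.
    pub link: u64,
}

// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Folds `bytes` into the running FNV-1a state `h`.
fn fnv1a(mut h: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Checksum of one link: hash(prev link || epoch || entries), little-endian.
fn link_checksum(prev: u64, epoch: u64, entries: &[(u16, i16)]) -> u64 {
    let mut h = fnv1a(FNV_OFFSET, &prev.to_le_bytes());
    h = fnv1a(h, &epoch.to_le_bytes());
    for &(idx, val) in entries {
        h = fnv1a(h, &idx.to_le_bytes());
        h = fnv1a(h, &val.to_le_bytes());
    }
    h
}

/// Appends a delta whose link is chained to the previous record,
/// or to the base block's checksum when the chain is empty.
pub fn append_delta(
    chain: &mut Vec<ChainedDelta>,
    base_checksum: u64,
    epoch: u64,
    entries: Vec<(u16, i16)>,
) {
    let prev = chain.last().map_or(base_checksum, |d| d.link);
    let link = link_checksum(prev, epoch, &entries);
    chain.push(ChainedDelta { epoch, entries, link });
}

/// Walks the chain from the base; returns the index of the first record
/// whose link does not verify or whose epoch is out of order.
pub fn verify_chain(chain: &[ChainedDelta], base_checksum: u64) -> Result<(), usize> {
    let mut prev = base_checksum;
    let mut last_epoch: Option<u64> = None;
    for (i, d) in chain.iter().enumerate() {
        let ordered = last_epoch.map_or(true, |e| d.epoch > e);
        if !ordered || link_checksum(prev, d.epoch, &d.entries) != d.link {
            return Err(i);
        }
        prev = d.link;
        last_epoch = Some(d.epoch);
    }
    Ok(())
}
```

Because each link folds in its predecessor, a corrupted record or a chain replayed against the wrong base block fails verification at the first affected index, which is the point where a `ReconstructionFailed` event could plausibly be raised.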
+ +### Term Disambiguation + +| ADR-017 Term | ADR-016 Term | Meaning | +|-------------|-------------|---------| +| Segment | (no equivalent) | Multi-frame compressed blob sharing quantization scales | +| Block | (closest: DeltaCheckpoint) | Fixed-size chunk of a tensor with tiered compression | +| Delta chain | DeltaStream | Ordered sequence of incremental changes from a base | +| Compaction | Checkpoint creation | Collapsing incremental changes into a new baseline | +| Drift | (closest: ChangeEvent) | Distribution shift exceeding scale tolerance | +| Tick | (closest: DeltaTimestamp.logical) | Logical time quantum for maintenance processing | + +--- + +## Segment Binary Format Reference + +For completeness, this is the on-disk segment format defined in ADR-017, section 3.3: + +``` +Offset Size Field Description +------ ------ --------------- ------------------------------------------ +0 4 magic 0x43545154 ("TQTC" in LE ASCII) +4 1 version Format version (currently 1) +5 1 bits Bit width (3, 5, 7, or 8) +6 4 group_len Elements per quantization group +10 4 tensor_len Number of f32 elements per frame +14 4 frame_count Number of frames in this segment +18 4 scale_count Number of f16 group scales +22 2*S scales f16 scale values (S = scale_count) +22+2S 4 data_len Length of packed bitstream in bytes +26+2S D data Packed quantized codes (D = data_len) + +Total: 26 + 2*ceil(tensor_len/group_len) + ceil(tensor_len * frame_count * bits / 8) +``` + +--- + +## Testing Strategy + +### Property-Based Tests + +```rust +#[quickcheck] +fn roundtrip_preserves_length(bits: TierBits, len: TensorLen) -> bool { + let bits = bits.0; // constrained to {3, 5, 7, 8} + let frame: Vec<f32> = (0..len.0).map(|i| (i as f32) * 0.1).collect(); + let scales = compute_scales(&frame, 64, bits); + let mut packed = Vec::new(); + quantize_and_pack(&frame, &scales, 64, bits, &mut packed); + let mut decoded = Vec::new(); + dequantize(&packed, &scales, 64, bits, frame.len(), 1, &mut decoded); + decoded.len() == 
frame.len() +} + +#[quickcheck] +fn error_bounded_by_tier(bits: TierBits, frame: SmallFrame) -> bool { + let qmax = qmax_from_bits(bits.0); + let max_relative_error = 1.0 / (2.0 * qmax as f32); + let scales = compute_scales(&frame.0, 64, bits.0); + let mut packed = Vec::new(); + quantize_and_pack(&frame.0, &scales, 64, bits.0, &mut packed); + let mut decoded = Vec::new(); + dequantize(&packed, &scales, 64, bits.0, frame.0.len(), 1, &mut decoded); + + frame.0.iter().zip(decoded.iter()).all(|(&orig, &dec)| { + let max_abs = frame.0.iter().map(|v| v.abs()).fold(0.0f32, f32::max); + if max_abs < 1e-10 { return true; } + let err = (orig - dec).abs() / max_abs; + err < max_relative_error * 2.0 // 2x margin for f16 scale rounding + }) +} + +#[quickcheck] +fn segment_encode_decode_deterministic(frame: SmallFrame, bits: TierBits) -> bool { + let scales = compute_scales(&frame.0, 64, bits.0); + let mut packed = Vec::new(); + quantize_and_pack(&frame.0, &scales, 64, bits.0, &mut packed); + let mut seg1 = Vec::new(); + encode(bits.0, 64, frame.0.len() as u32, 1, &scales, &packed, &mut seg1); + let mut seg2 = Vec::new(); + encode(bits.0, 64, frame.0.len() as u32, 1, &scales, &packed, &mut seg2); + seg1 == seg2 +} +``` + +### Tier Transition Tests + +```rust +#[test] +fn tier_transitions_are_monotonic_within_tick() { + let mut comp = TemporalTensorCompressor::new(TierPolicy::default(), 64, 0); + comp.set_access(100, 0); // Hot + let frame = vec![1.0f32; 64]; + let mut seg = Vec::new(); + + // Hot -> push frame + comp.push_frame(&frame, 1, &mut seg); + assert_eq!(comp.active_bits(), 8); + + // Decay to cold + comp.set_access(1, 0); + comp.push_frame(&frame, 10000, &mut seg); + assert_eq!(comp.active_bits(), 3); + + // Previous segment was flushed + assert!(!seg.is_empty()); +} +``` + +### Replay Determinism + +```rust +#[test] +fn segment_decode_is_deterministic() { + let mut comp = TemporalTensorCompressor::new(TierPolicy::default(), 128, 0); + comp.set_access(100, 0); + let 
frame: Vec<f32> = (0..128).map(|i| (i as f32 - 64.0) * 0.01).collect(); + let mut seg = Vec::new(); + + for _ in 0..10 { + comp.push_frame(&frame, 1, &mut seg); + } + comp.flush(&mut seg); + + let mut decoded1 = Vec::new(); + segment::decode(&seg, &mut decoded1); + + let mut decoded2 = Vec::new(); + segment::decode(&seg, &mut decoded2); + + assert_eq!(decoded1, decoded2); +} +``` + +--- + +## Aggregate Relationship Diagram + +``` ++===============================================================+ +| AGGREGATE RELATIONSHIPS | ++===============================================================+ +| | +| TensorBlock (BC1) | +| +-- owns --> BlockMeta | +| +-- owns --> BlockData (optional, None if evicted) | +| +-- refs --> TensorIdentity | +| +-- produces --> BlockDomainEvent | +| | | +| +---[tier change requires]---> QuantizationCodec (BC2) | +| | +-- uses --> QuantParams | +| | +-- uses --> PackedBlock | +| | +-- delegates to --> BitPackingService | +| | | +| +---[score drives tier]------> AccessProfile (BC3) | +| | +-- uses --> TierPolicy | +| | +-- uses --> MaintenanceScheduler | +| | +-- produces --> ScoringDomainEvent | +| | | +| +---[persists via]-----------> TieredStore (BC4) | +| | +-- uses --> BlockIO (trait) | +| | +-- uses --> MetaLog (trait) | +| | +-- uses --> Clock (trait) | +| | +-- produces --> StorageDomainEvent | +| | | +| +---[eviction creates]-------> DeltaChain (BC5) | +| +-- owns --> DeltaRecord[] | +| +-- refs --> BaseBlockRef | +| +-- alt --> FactorSet | +| +-- produces --> DeltaDomainEvent | +| | ++===============================================================+ +``` + +--- + +## References + +1. Evans, E. (2003). "Domain-Driven Design: Tackling Complexity in the Heart of Software." +2. Vernon, V. (2013). "Implementing Domain-Driven Design." +3. ADR-017: Temporal Tensor Compression with Tiered Quantization (2026-02-06). +4. ADR-016: Delta-Behavior System DDD Architecture (2026-01-28). +5. ADR-014: Coherence Engine (2026-01-22). +6. 
ADR-004: KV Cache Management. +7. ADR-005: WASM Runtime Integration. +8. Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization." ICLR 2023. +9. Lin, J., et al. "AWQ: Activation-aware Weight Quantization." MLSys 2024. +10. Liu, Z., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. +11. Pelkonen, T., et al. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB 2015.