mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-05-25 06:36:37 +00:00
* feat(rvf): add RuVector Format universal substrate specification Research and design for RVF — a streaming, progressive, adaptive, quantum-secure binary format for vector intelligence. Covers append-only segment model, two-level tail manifests, temperature tiering, progressive HNSW indexing, epoch-based overlay system, SIMD-optimized query paths, WASM microkernel for Cognitum tiles, domain profiles (RVDNA, RVText, RVGraph, RVVision), and post-quantum cryptography. https://claude.ai/code/session_01DDqjGE51JpsRE3DgUjFyjW * feat(rvf): add deletion, filtered search, concurrency, and operations specs Fill four specification gaps in the RVF format design: - spec/07: Vector deletion lifecycle, JOURNAL_SEG wire format, deletion bitmaps - spec/08: Filtered search with META_SEG, METAIDX_SEG, filter expression language - spec/09: Writer locking, reader-writer coordination, versioning, space reclamation - spec/10: Batch operations API, error codes, network streaming protocol Also fixes the segment header field conflict between spec/01 and wire/binary-layout.md (checksum_algo/compression now u8, adds uncompressed_len at 0x38). https://claude.ai/code/session_01DDqjGE51JpsRE3DgUjFyjW * feat(rvf): add RuVector Format SDK, 40 examples, MCP server, and documentation Complete RVF implementation including: - 12 Rust crates (rvf-types, rvf-wire, rvf-manifest, rvf-index, rvf-quant, rvf-crypto, rvf-runtime, rvf-import, rvf-wasm, rvf-node, rvf-server, plus integration tests) - 40 runnable examples covering core storage, agentic AI, production patterns, vertical domains, exotic capabilities, runtime targets, network/security, POSIX/systems, and network operations - TypeScript SDK (npm/packages/rvf) with RvfDatabase class - MCP server (npm/packages/rvf-mcp-server) with stdio and SSE transports - Node.js N-API bindings (npm/packages/rvf-node) - WASM package (npm/packages/rvf-wasm) - ADR-029 (canonical format), ADR-030 (computational container), ADR-031 (example repository) - DNA-style lineage provenance, computational containers (KERNEL_SEG, EBPF_SEG), witness chains, TEE attestation, domain profiles - Superseded ADR annotations for ADR-001, ADR-005, ADR-006, ADR-018-021 Co-Authored-By: claude-flow <ruv@ruv.net> * feat(rvf): add CLI, WASM store, generate_all, and 46 output .rvf files - Add rvf-cli crate (665 lines, 9 subcommands: create/ingest/query/delete/status/inspect/compact/derive/serve) - Add WASM control plane store (alloc_setup, segment, store modules) for ~46 KB binary - Add generate_all.rs example producing 46 persistent .rvf files in output/ - Add Node.js N-API bindings for lineage, kernel/eBPF, and inspection - Add npm TypeScript backend/database/types for RVF integration - Update READMEs with CLI sections, MCP server docs, and crate map (13 crates) - All 40 examples verified passing Co-Authored-By: claude-flow <ruv@ruv.net> * feat(rvf): add Claude Code appliance, improve Quick Start, fix API docs - Add claude_code_appliance.rs: self-booting RVF with SSH + Claude Code install (curl -fsSL https://claude.ai/install.sh | bash), 3 SSH users, eBPF filter, 20-package manifest, witness chain, lineage snapshot - Improve Quick Start: Install section (crate/CLI/npm/WASM/MCP), WASM browser example, generate_all reference, expanded Rust crate deps - Fix embed_kernel/embed_ebpf API docs to match actual signatures (u8 params with `as u8` cast, 6-param kernel, Option<&[u8]> btf) - Update generate_all.rs: add claude_code_appliance generator (47 files) - Regenerate all 47 output .rvf files Co-Authored-By: claude-flow <ruv@ruv.net> * feat(rvf): add RVCOW branching, real kernel/eBPF/launcher, 795 tests Vector-native copy-on-write branching (ADR-031) with four new segment types (COW_MAP 0x20, REFCOUNT 0x21, MEMBERSHIP 0x22, DELTA 0x23), real Linux microkernel builder, QEMU microVM launcher, real eBPF programs, and 128-byte KernelBinding for tamper-evident kernel-manifest linkage. New crates: - rvf-kernel: Docker-based kernel build, real cpio/newc initramfs builder, SHA3-256 verification, prebuilt kernel support (37 tests) - rvf-launch: QEMU microVM launcher with QMP shutdown, KVM/TCG detection, virtio-blk/net port forwarding, kernel extraction (8 tests) - rvf-ebpf: 3 real BPF C programs (xdp_distance, socket_filter, tc_query_route) with clang compilation support (17 tests) RVCOW runtime: - CowEngine with read/write paths, write coalescing, snapshot-freeze - CowMap (flat-array), MembershipFilter (bitmap), CowCompactor - 3x read performance via pread optimization (1.3us/vector) - Branch creation: 2.6ms for 10K vectors, child = 162 bytes Security: 20-finding audit, 7 fixes applied including division-by-zero guards, integer overflow checks, and KernelBinding::from_bytes_validated(). CLI: 8 new commands (launch, embed-kernel, embed-ebpf, filter, freeze, verify-witness, verify-attestation, rebuild-refcounts), serve wired to real rvf-server. Co-Authored-By: claude-flow <ruv@ruv.net> * feat(rvf): update README, add crate/npm READMEs, publish to crates.io and npm - Rewrite README with cognitive container terminology, grouped features, 4 comparison tables (vs Docker, Vector DBs, Git LFS, SQLite), updated benchmarks, architecture diagram, and 45 examples - Add READMEs for rvf-kernel, rvf-launch, rvf-ebpf, rvf-import crates - Add READMEs for @ruvector/rvf, rvf-node, rvf-wasm, rvf-mcp-server npm packages - Fix Cargo.toml metadata (homepage, readme, categories, keywords) and add version specs to all path dependencies for crates.io publishing - Fix clippy warnings in rvf-kernel/initramfs.rs and rvf-launch/lib.rs - Published to crates.io: rvf-types, rvf-wire, rvf-manifest, rvf-quant, rvf-index, rvf-crypto (remaining crates pending rate limit) - Published to npm: @ruvector/rvf, @ruvector/rvf-node, @ruvector/rvf-wasm, @ruvector/rvf-mcp-server Co-Authored-By: claude-flow <ruv@ruv.net> * chore: add rvf-kernel, rvf-ebpf, rvf-launch, rvf-server, rvf-import, rvf-cli to workspace Include all 15 RVF crates plus integration tests and benchmarks in the root workspace members list so cargo publish can resolve them by name. Co-Authored-By: claude-flow <ruv@ruv.net> * feat(rvf): add published packages, cognitive container branding, grouped capabilities - Add Published Packages section with 13 crates.io + 4 npm tables - Add Platform Support table (Linux, macOS, Windows, WASM, no_std) - Expand capability table from 9 to 15 rows in 4 groups - Rewrite all "How" descriptions in plain language - Update .rvf diagram to show all 20 segment types - Rename ADRs: computational container -> cognitive container - Add emojis to all section headers Co-Authored-By: claude-flow <ruv@ruv.net> * feat: update root README with RVF cognitive containers, expanded capabilities - Update intro: "gets smarter + ships as cognitive container" - Add self-booting microservice row to Pinecone comparison table - Expand capabilities from 34 to 42 features with dedicated RVF section - Update "Think of it as" to include Docker comparison and RVF explanation - Add RVF collapsed group to Ecosystem (13 crates, 4 npm, install commands) - Add RVF to Platform & Edge section with install commands - Add RVF npm packages (4) and Rust crates (13) to package reference - Add RVF rows to feature comparison table (6 new rows) - Add ADR-030/031 to ADR list - Add RVF to Installation table, Project Structure - Update attention mechanisms count from 39 to 40+ - Update npm count to 49+, Rust crates to 83 - Update footer with crates.io and RVF links Co-Authored-By: claude-flow <ruv@ruv.net> * feat: expand comparison table with emojis, cost, audit, branching, single-file Co-Authored-By: claude-flow <ruv@ruv.net> * docs: rewrite comparison table in plain language Co-Authored-By: claude-flow <ruv@ruv.net> * chore: clean up empty code change sections in the changes log --------- Co-authored-by: Claude <noreply@anthropic.com>
439 lines
16 KiB
Markdown
439 lines
16 KiB
Markdown
# RVF Wire Format Reference
|
|
|
|
## 1. File Structure
|
|
|
|
An RVF file is a byte stream with no fixed header at offset 0. All structure
|
|
is discovered from the tail.
|
|
|
|
```
|
|
Byte 0 EOF
|
|
| |
|
|
v v
|
|
+--------+--------+--------+ +--------+---------+--------+---------+
|
|
| Seg 0 | Seg 1 | Seg 2 | ... | Seg N | Seg N+1 | Seg N+2| Mfst K |
|
|
| VEC | VEC | INDEX | | VEC | HOT | INDEX | MANIF |
|
|
+--------+--------+--------+ +--------+---------+--------+---------+
|
|
^ ^
|
|
| |
|
|
Level 1 Mfst |
|
|
Level 0
|
|
(last 4KB)
|
|
```
|
|
|
|
### Alignment Rule
|
|
|
|
Every segment starts at a **64-byte aligned** boundary. If a segment's
|
|
payload + footer does not end on a 64-byte boundary, zero-padding is inserted
|
|
before the next segment header.
|
|
|
|
### Byte Order
|
|
|
|
All multi-byte integers are **little-endian**. All floating-point values
|
|
are IEEE 754 little-endian. This matches x86, ARM (in default mode), and
|
|
WASM native byte order.
|
|
|
|
## 2. Primitive Types
|
|
|
|
```
|
|
Type Size Encoding
|
|
---- ---- --------
|
|
u8 1 Unsigned 8-bit integer
|
|
u16 2 Unsigned 16-bit little-endian
|
|
u32 4 Unsigned 32-bit little-endian
|
|
u64 8 Unsigned 64-bit little-endian
|
|
i32 4 Signed 32-bit little-endian (two's complement)
|
|
i64 8 Signed 64-bit little-endian (two's complement)
|
|
f16 2 IEEE 754 half-precision little-endian
|
|
f32 4 IEEE 754 single-precision little-endian
|
|
f64 8 IEEE 754 double-precision little-endian
|
|
varint 1-10 LEB128 unsigned variable-length integer
|
|
svarint 1-10 ZigZag + LEB128 signed variable-length integer
|
|
hash128 16 First 128 bits of hash output
|
|
hash256 32 First 256 bits of hash output
|
|
```
|
|
|
|
### Varint Encoding (LEB128)
|
|
|
|
```
|
|
Value 0-127: 1 byte [0xxxxxxx]
|
|
Value 128-16383: 2 bytes [1xxxxxxx 0xxxxxxx]
|
|
Value 16384-2097151: 3 bytes [1xxxxxxx 1xxxxxxx 0xxxxxxx]
|
|
...up to 10 bytes for u64
|
|
```
|
|
|
|
### Delta Encoding
|
|
|
|
Sequences of sorted integers use delta encoding:
|
|
```
|
|
Original: [100, 105, 108, 120, 200]
|
|
Deltas: [100, 5, 3, 12, 80]
|
|
Encoded: [varint(100), varint(5), varint(3), varint(12), varint(80)]
|
|
```
|
|
|
|
With restart points every N entries, the first value in each restart group
|
|
is absolute (not delta-encoded).
|
|
|
|
## 3. Segment Header (64 bytes)
|
|
|
|
```
|
|
Offset Type Field Notes
|
|
------ ---- ----- -----
|
|
0x00 u32 magic Always 0x52564653 ("RVFS")
|
|
0x04 u8 version Format version (1)
|
|
0x05 u8 seg_type Segment type enum
|
|
0x06 u16 flags See flags bitfield
|
|
0x08 u64 segment_id Monotonic ordinal
|
|
0x10 u64 payload_length Bytes after header, before footer
|
|
0x18 u64 timestamp_ns UNIX nanoseconds
|
|
0x20 u8 checksum_algo 0=CRC32C, 1=XXH3-128, 2=SHAKE-256
|
|
0x21 u8 compression 0=none, 1=LZ4, 2=ZSTD, 3=custom
|
|
0x22 u16 reserved_0 Must be 0x0000
|
|
0x24 u32 reserved_1 Must be 0x00000000
|
|
0x28 hash128 content_hash Payload hash (first 128 bits)
|
|
0x38 u32 uncompressed_len Original payload size (0 if no compression)
|
|
0x3C u32 alignment_pad Zero padding to 64B boundary
|
|
```
|
|
|
|
### Segment Type Enum
|
|
|
|
```
|
|
0x00 INVALID Not a valid segment
|
|
0x01 VEC_SEG Vector payloads
|
|
0x02 INDEX_SEG HNSW adjacency
|
|
0x03 OVERLAY_SEG Graph overlay deltas
|
|
0x04 JOURNAL_SEG Metadata mutations
|
|
0x05 MANIFEST_SEG Segment directory
|
|
0x06 QUANT_SEG Quantization dictionaries
|
|
0x07 META_SEG Key-value metadata
|
|
0x08 HOT_SEG Temperature-promoted data
|
|
0x09 SKETCH_SEG Access counter sketches
|
|
0x0A WITNESS_SEG Capability manifests
|
|
0x0B PROFILE_SEG Domain profile declarations
|
|
0x0C CRYPTO_SEG Key material / certificate anchors
|
|
0x0D-0xEF reserved
|
|
0xF0-0xFF extension Implementation-specific
|
|
```
|
|
|
|
### Flags Bitfield
|
|
|
|
```
|
|
Bit Mask Name Meaning
|
|
--- ---- ---- -------
|
|
0 0x0001 COMPRESSED Payload compressed per compression field
|
|
1 0x0002 ENCRYPTED Payload encrypted (key in CRYPTO_SEG)
|
|
2 0x0004 SIGNED Signature footer follows payload
|
|
3 0x0008 SEALED Immutable (compaction output)
|
|
4 0x0010 PARTIAL Partial/streaming write
|
|
5 0x0020 TOMBSTONE Logically deletes prior segment
|
|
6 0x0040 HOT Contains hot-tier data
|
|
7 0x0080 OVERLAY Contains overlay/delta data
|
|
8 0x0100 SNAPSHOT Full snapshot (not delta)
|
|
9 0x0200 CHECKPOINT Safe rollback point
|
|
10-15 reserved Must be zero
|
|
```
|
|
|
|
## 4. Signature Footer
|
|
|
|
Present only if `SIGNED` flag is set. Follows immediately after the payload.
|
|
|
|
```
|
|
Offset Type Field Notes
|
|
------ ---- ----- -----
|
|
0x00 u16 sig_algo 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
|
|
0x02 u16 sig_length Signature byte length
|
|
0x04 u8[] signature Signature bytes
|
|
var u32 footer_length Total footer size (for backward scan)
|
|
```
|
|
|
|
### Signature Algorithm Sizes
|
|
|
|
| Algorithm | sig_length | Post-Quantum | Performance |
|
|
|-----------|-----------|-------------|-------------|
|
|
| Ed25519 | 64 B | No | ~76,000 sign/s |
|
|
| ML-DSA-65 | 3,309 B | Yes (NIST Level 3) | ~4,500 sign/s |
|
|
| SLH-DSA-128s | 7,856 B | Yes (NIST Level 1) | ~350 sign/s |
|
|
|
|
## 5. VEC_SEG Payload Layout
|
|
|
|
Vector segments store blocks of vectors in columnar layout for compression.
|
|
|
|
```
|
|
+------------------------------------------+
|
|
| VEC_SEG Payload |
|
|
+------------------------------------------+
|
|
| Block Directory |
|
|
| block_count: u32 |
|
|
| For each block: |
|
|
| block_offset: u32 (from payload start)|
|
|
| vector_count: u32 |
|
|
| dim: u16 |
|
|
| dtype: u8 |
|
|
| tier: u8 |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
| Block 0 |
|
|
| +-- Columnar Vectors --+ |
|
|
| | dim_0[0..count] | <- all vals |
|
|
| | dim_1[0..count] | for dim 0 |
|
|
| | ... | then dim 1 |
|
|
| | dim_D[0..count] | etc. |
|
|
| +----------------------+ |
|
|
| +-- ID Map --+ |
|
|
| | encoding: u8 (0=raw, 1=delta-varint) |
|
|
| | restart_interval: u16 |
|
|
| | id_count: u32 |
|
|
| | [restart_offsets: u32[]] (if delta) |
|
|
| | [ids: encoded] |
|
|
| +-----------+ |
|
|
| +-- Block CRC --+ |
|
|
| | crc32c: u32 | |
|
|
| +----------------+ |
|
|
| [64B padding] |
|
|
+------------------------------------------+
|
|
| Block 1 |
|
|
| ... |
|
|
+------------------------------------------+
|
|
```
|
|
|
|
### Data Type Enum
|
|
|
|
```
|
|
0x00 f32 32-bit float
|
|
0x01 f16 16-bit float
|
|
0x02 bf16 bfloat16
|
|
0x03 i8 signed 8-bit integer (scalar quantized)
|
|
0x04 u8 unsigned 8-bit integer
|
|
0x05 i4 4-bit integer (packed, 2 per byte)
|
|
0x06 binary 1-bit (packed, 8 per byte)
|
|
0x07 pq Product-quantized codes
|
|
0x08 custom Custom encoding (see QUANT_SEG)
|
|
```
|
|
|
|
### Columnar vs Interleaved
|
|
|
|
**VEC_SEG** (columnar): `dim_0[all], dim_1[all], ..., dim_D[all]`
|
|
- Better compression (similar values adjacent)
|
|
- Better for batch operations
|
|
- Worse for single-vector random access
|
|
|
|
**HOT_SEG** (interleaved): `vec_0[all_dims], vec_1[all_dims], ...`
|
|
- Better for single-vector access (one cache line per vector)
|
|
- Better for top-K refinement (sequential scan)
|
|
- No compression benefit
|
|
|
|
## 6. INDEX_SEG Payload Layout
|
|
|
|
```
|
|
+------------------------------------------+
|
|
| INDEX_SEG Payload |
|
|
+------------------------------------------+
|
|
| Index Header |
|
|
| index_type: u8 (0=HNSW, 1=IVF, 2=flat)|
|
|
| layer_level: u8 (A=0, B=1, C=2) |
|
|
| M: u16 (HNSW max neighbors per layer) |
|
|
| ef_construction: u32 |
|
|
| node_count: u64 |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
| Restart Point Index |
|
|
| restart_interval: u32 |
|
|
| restart_count: u32 |
|
|
| [restart_offset: u32] * count |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
| Adjacency Data |
|
|
| For each node (sorted by node_id): |
|
|
| layer_count: varint |
|
|
| For each layer: |
|
|
| neighbor_count: varint |
|
|
| [delta_neighbor_id: varint] * cnt |
|
|
| [64B padding per restart group] |
|
|
+------------------------------------------+
|
|
| Prefetch Hints (optional) |
|
|
| hint_count: u32 |
|
|
| For each hint: |
|
|
| node_range_start: u64 |
|
|
| node_range_end: u64 |
|
|
| page_offset: u64 |
|
|
| page_count: u32 |
|
|
| prefetch_ahead: u32 |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
```
|
|
|
|
## 7. HOT_SEG Payload Layout
|
|
|
|
The hot segment stores the most-accessed vectors in interleaved (row-major)
|
|
layout with their neighbor lists co-located for cache locality.
|
|
|
|
```
|
|
+------------------------------------------+
|
|
| HOT_SEG Payload |
|
|
+------------------------------------------+
|
|
| Hot Header |
|
|
| vector_count: u32 |
|
|
| dim: u16 |
|
|
| dtype: u8 (f16 or i8) |
|
|
| neighbor_M: u16 |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
| Interleaved Hot Data |
|
|
| For each hot vector: |
|
|
| vector_id: u64 |
|
|
| vector: [dtype * dim] |
|
|
| neighbor_count: u16 |
|
|
| [neighbor_id: u64] * neighbor_count |
|
|
| [64B aligned per entry] |
|
|
+------------------------------------------+
|
|
```
|
|
|
|
Each hot entry is self-contained: vector + neighbors in one contiguous block.
|
|
A sequential scan of the HOT_SEG for top-K refinement reads vectors and
|
|
neighbors without any pointer chasing.
|
|
|
|
### Hot Entry Size Example
|
|
|
|
For 384-dim fp16 vectors with M=16 neighbors:
|
|
```
|
|
8 (id) + 768 (vector) + 2 (count) + 128 (neighbors) = 906 bytes
|
|
Padded to 64B: 960 bytes per entry
|
|
```
|
|
|
|
1000 hot vectors = 960 KB (fits in L2 cache on most CPUs).
|
|
|
|
## 8. MANIFEST_SEG Payload Layout
|
|
|
|
```
|
|
+------------------------------------------+
|
|
| MANIFEST_SEG Payload |
|
|
+------------------------------------------+
|
|
| TLV Records (Level 1 manifest) |
|
|
| For each record: |
|
|
| tag: u16 |
|
|
| length: u32 |
|
|
| pad: u16 (to 8B alignment) |
|
|
| value: [u8; length] |
|
|
| [8B aligned] |
|
|
+------------------------------------------+
|
|
| Level 0 Root Manifest (last 4096 bytes) |
|
|
| (See 02-manifest-system.md for layout) |
|
|
+------------------------------------------+
|
|
```
|
|
|
|
## 9. SKETCH_SEG Payload Layout
|
|
|
|
```
|
|
+------------------------------------------+
|
|
| SKETCH_SEG Payload |
|
|
+------------------------------------------+
|
|
| Sketch Header |
|
|
| block_count: u32 |
|
|
| width: u32 (counters per row) |
|
|
| depth: u32 (hash functions) |
|
|
| counter_bits: u8 (8 or 16) |
|
|
| decay_shift: u8 (aging right-shift) |
|
|
| total_accesses: u64 |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
| Sketch Data |
|
|
| For each block: |
|
|
| block_id: u32 |
|
|
| counters: [u8; width * depth] |
|
|
| [64B aligned per block] |
|
|
+------------------------------------------+
|
|
```
|
|
|
|
## 10. QUANT_SEG Payload Layout
|
|
|
|
```
|
|
+------------------------------------------+
|
|
| QUANT_SEG Payload |
|
|
+------------------------------------------+
|
|
| Quant Header |
|
|
| quant_type: u8 |
|
|
| 0 = scalar (min-max per dim) |
|
|
| 1 = product quantization |
|
|
| 2 = binary threshold |
|
|
| 3 = residual PQ |
|
|
| tier: u8 |
|
|
| dim: u16 |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
| Type-specific data: |
|
|
| |
|
|
| Scalar (type 0): |
|
|
| min: [f32; dim] |
|
|
| max: [f32; dim] |
|
|
| |
|
|
| PQ (type 1): |
|
|
| M: u16 (subspaces) |
|
|
| K: u16 (centroids per sub) |
|
|
| sub_dim: u16 (dims per sub) |
|
|
| codebook: [f32; M * K * sub_dim] |
|
|
| |
|
|
| Binary (type 2): |
|
|
| threshold: [f32; dim] |
|
|
| |
|
|
| Residual PQ (type 3): |
|
|
| coarse_centroids: [f32; K_coarse * dim]|
|
|
| residual_codebook: [f32; M * K * sub] |
|
|
| |
|
|
| [64B aligned] |
|
|
+------------------------------------------+
|
|
```
|
|
|
|
## 11. Checksum Algorithms
|
|
|
|
| ID | Algorithm | Output | Speed (HW accel) | Use Case |
|
|
|----|-----------|--------|-------------------|----------|
|
|
| 0 | CRC32C | 4 B (stored in 16B field, zero-padded) | ~3 GB/s (SSE4.2) | Per-block integrity |
|
|
| 1 | XXH3-128 | 16 B | ~50 GB/s (AVX2) | Segment content hash |
|
|
| 2 | SHAKE-256 | 16 or 32 B | ~1 GB/s | Cryptographic verification |
|
|
|
|
Default recommendation:
|
|
- Block-level CRC: CRC32C (fastest, hardware accelerated)
|
|
- Segment content hash: XXH3-128 (fast, good distribution)
|
|
- Crypto witness hashes: SHAKE-256 (post-quantum safe)
|
|
|
|
## 12. Compression
|
|
|
|
| ID | Algorithm | Ratio | Decompress Speed | Use Case |
|
|
|----|-----------|-------|-----------------|----------|
|
|
| 0 | None | 1.0x | N/A | Hot tier |
|
|
| 1 | LZ4 | 1.5-3x | ~4 GB/s | Warm tier, low latency |
|
|
| 2 | ZSTD | 3-6x | ~1.5 GB/s | Cold tier, high ratio |
|
|
| 3 | Custom | Varies | Varies | Domain-specific |
|
|
|
|
Compression is applied per-segment payload. Individual blocks within a
|
|
segment share the same compression.
|
|
|
|
## 13. Tail Scan Algorithm
|
|
|
|
```python
|
|
def find_latest_manifest(file):
|
|
file_size = file.seek(0, SEEK_END)
|
|
|
|
# Try fast path: last 4096 bytes
|
|
file.seek(file_size - 4096)
|
|
root = file.read(4096)
|
|
if root[0:4] == b'RVM0' and verify_crc(root):
|
|
return parse_root_manifest(root)
|
|
|
|
# Slow path: scan backward for MANIFEST_SEG header
|
|
scan_pos = file_size - 64 # Start at last 64B boundary
|
|
while scan_pos >= 0:
|
|
file.seek(scan_pos)
|
|
header = file.read(64)
|
|
if (header[0:4] == b'RVFS' and
|
|
header[5] == 0x05 and # MANIFEST_SEG
|
|
verify_segment_header(header)):
|
|
return parse_manifest_segment(file, scan_pos)
|
|
scan_pos -= 64 # Previous 64B boundary
|
|
|
|
raise CorruptFileError("No valid MANIFEST_SEG found")
|
|
```
|
|
|
|
Worst case: full backward scan at 64B granularity. For a 4 GB file, this is
|
|
67M checks — but each check is a 4-byte comparison, so it completes in ~100ms
|
|
on a modern CPU with mmap. In practice, the fast path succeeds on the first try
|
|
for non-corrupt files.
|