RVF Segment Model
1. Append-Only Segment Architecture
An RVF file is a linear sequence of segments. Each segment is a self-contained, independently verifiable unit. New data is always appended — never inserted into or overwritten within existing segments.
+------------+------------+------------+     +------------+
| Segment 0  | Segment 1  | Segment 2  | ... | Segment N  | <-- EOF
+------------+------------+------------+     +------------+
                                                   ^
                                             Latest MANIFEST_SEG
                                             (source of truth)
Why Append-Only
| Property | Benefit |
|---|---|
| Write amplification | Zero — each byte written once until compaction |
| Crash safety | Partial segment at tail is detectable and discardable |
| Concurrent reads | Readers see a consistent snapshot at any manifest boundary |
| Streaming ingest | Writer never blocks on reorganization |
| mmap friendliness | Pages only grow — no invalidation of mapped regions |
2. Segment Header
Every segment begins with a fixed 64-byte header. Each header starts at a 64-byte aligned file offset, which matches the width of an AVX-512 register.
| Offset | Size (bytes) | Field | Description |
|---|---|---|---|
| 0x00 | 4 | magic | 0x52564653 ("RVFS" in ASCII) |
| 0x04 | 1 | version | Segment format version (currently 1) |
| 0x05 | 1 | seg_type | Segment type enum (see below) |
| 0x06 | 2 | flags | Bitfield: compressed, encrypted, signed, sealed, etc. |
| 0x08 | 8 | segment_id | Monotonically increasing segment ordinal |
| 0x10 | 8 | payload_length | Byte length of payload (after header, before footer) |
| 0x18 | 8 | timestamp_ns | Nanosecond UNIX timestamp of segment creation |
| 0x20 | 1 | checksum_algo | Hash algorithm enum: 0=CRC32C, 1=XXH3-128, 2=SHAKE-256 |
| 0x21 | 1 | compression | Compression enum: 0=none, 1=LZ4, 2=ZSTD, 3=custom |
| 0x22 | 2 | reserved_0 | Must be zero |
| 0x24 | 4 | reserved_1 | Must be zero |
| 0x28 | 16 | content_hash | First 128 bits of payload hash (algorithm per checksum_algo) |
| 0x38 | 4 | uncompressed_len | Original payload size (0 if no compression) |
| 0x3C | 4 | alignment_pad | Padding to reach 64-byte boundary |
Total header: 64 bytes (one cache line, one AVX-512 register width).
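The layout above can be sketched as a Rust decoder and encoder. This is a minimal illustration, not the reference implementation: the names `SegmentHeader`, `HeaderError`, `parse`, and `encode` are hypothetical, and the code assumes little-endian encoding for multi-byte fields, which this section does not specify.

```rust
/// Fixed header size from the layout table.
pub const HEADER_LEN: usize = 64;
/// Magic value from the layout table.
pub const SEGMENT_MAGIC: u32 = 0x5256_4653;

/// In-memory view of the header (reserved and padding fields omitted).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SegmentHeader {
    pub version: u8,
    pub seg_type: u8,
    pub flags: u16,
    pub segment_id: u64,
    pub payload_length: u64,
    pub timestamp_ns: u64,
    pub checksum_algo: u8,
    pub compression: u8,
    pub content_hash: [u8; 16],
    pub uncompressed_len: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    BadMagic(u32),
    NonZeroReserved,
}

// ASSUMPTION: multi-byte fields are little-endian; the spec text does not say.
fn le_u32(b: &[u8], at: usize) -> u32 {
    let mut w = [0u8; 4];
    w.copy_from_slice(&b[at..at + 4]);
    u32::from_le_bytes(w)
}

fn le_u64(b: &[u8], at: usize) -> u64 {
    let mut w = [0u8; 8];
    w.copy_from_slice(&b[at..at + 8]);
    u64::from_le_bytes(w)
}

impl SegmentHeader {
    /// Decodes a header, rejecting a wrong magic or non-zero reserved bytes.
    pub fn parse(b: &[u8; HEADER_LEN]) -> Result<Self, HeaderError> {
        let magic = le_u32(b, 0x00);
        if magic != SEGMENT_MAGIC {
            return Err(HeaderError::BadMagic(magic));
        }
        // reserved_0 (0x22..0x24) and reserved_1 (0x24..0x28) must be zero.
        if b[0x22..0x28].iter().any(|&x| x != 0) {
            return Err(HeaderError::NonZeroReserved);
        }
        let mut content_hash = [0u8; 16];
        content_hash.copy_from_slice(&b[0x28..0x38]);
        Ok(SegmentHeader {
            version: b[0x04],
            seg_type: b[0x05],
            flags: u16::from_le_bytes([b[0x06], b[0x07]]),
            segment_id: le_u64(b, 0x08),
            payload_length: le_u64(b, 0x10),
            timestamp_ns: le_u64(b, 0x18),
            checksum_algo: b[0x20],
            compression: b[0x21],
            content_hash,
            uncompressed_len: le_u32(b, 0x38),
        })
    }

    /// Encodes the header; reserved fields and alignment_pad stay zero.
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut b = [0u8; HEADER_LEN];
        b[0x00..0x04].copy_from_slice(&SEGMENT_MAGIC.to_le_bytes());
        b[0x04] = self.version;
        b[0x05] = self.seg_type;
        b[0x06..0x08].copy_from_slice(&self.flags.to_le_bytes());
        b[0x08..0x10].copy_from_slice(&self.segment_id.to_le_bytes());
        b[0x10..0x18].copy_from_slice(&self.payload_length.to_le_bytes());
        b[0x18..0x20].copy_from_slice(&self.timestamp_ns.to_le_bytes());
        b[0x20] = self.checksum_algo;
        b[0x21] = self.compression;
        b[0x28..0x38].copy_from_slice(&self.content_hash);
        b[0x38..0x3C].copy_from_slice(&self.uncompressed_len.to_le_bytes());
        b
    }
}
```

Taking a `&[u8; 64]` rather than a slice lets the compiler enforce the fixed header size, so `parse` needs no length check.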
Magic Validation
Readers scanning backward from EOF look for 0x52564653 at 64-byte aligned
boundaries. This enables fast tail-scan even on corrupted files.
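A backward tail-scan of this kind might look like the sketch below. The function name `find_last_magic` is hypothetical, the file is modeled as an in-memory byte slice, and the magic is assumed to be stored little-endian. A hit is only a candidate: payload bytes can contain the magic by chance, so a real reader would still validate the full header and content hash.

```rust
const SEGMENT_MAGIC: u32 = 0x5256_4653;
const ALIGN: usize = 64;

/// Returns the highest 64-byte aligned offset whose first four bytes
/// equal the segment magic, scanning backward from the end of `file`.
pub fn find_last_magic(file: &[u8]) -> Option<usize> {
    if file.len() < 4 {
        return None;
    }
    // ASSUMPTION: the magic is stored little-endian on disk.
    let magic = SEGMENT_MAGIC.to_le_bytes();
    // Highest aligned offset that still leaves room for the 4 magic bytes.
    let mut off = (file.len() - 4) / ALIGN * ALIGN;
    loop {
        if file[off..off + 4] == magic[..] {
            return Some(off);
        }
        if off == 0 {
            return None;
        }
        off -= ALIGN;
    }
}
```

Because only aligned offsets are probed, the scan touches one position per 64 bytes of file.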
Flags Bitfield
| Bit | Name | Meaning |
|---|---|---|
| 0 | COMPRESSED | Payload is compressed per compression field |
| 1 | ENCRYPTED | Payload is encrypted (key info in manifest) |
| 2 | SIGNED | A signature footer follows the payload |
| 3 | SEALED | Segment is immutable (compaction output) |
| 4 | PARTIAL | Segment is a partial write (streaming ingest) |
| 5 | TOMBSTONE | Segment logically deletes a prior segment |
| 6 | HOT | Segment contains temperature-promoted data |
| 7 | OVERLAY | Segment contains overlay/delta data |
| 8 | SNAPSHOT | Segment contains full snapshot (not delta) |
| 9 | CHECKPOINT | Segment is a safe rollback point |
| 10-15 | reserved | |
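The bit assignments translate directly into masks over the 16-bit `flags` field. The sketch below uses hypothetical names (`seg_flags`, `has_flag`, `reserved_bits_clear`). Treating set reserved bits as an error is an assumption; this section only marks them as reserved.

```rust
/// Bit masks for the 16-bit `flags` header field.
pub mod seg_flags {
    pub const COMPRESSED: u16 = 1 << 0;
    pub const ENCRYPTED: u16 = 1 << 1;
    pub const SIGNED: u16 = 1 << 2;
    pub const SEALED: u16 = 1 << 3;
    pub const PARTIAL: u16 = 1 << 4;
    pub const TOMBSTONE: u16 = 1 << 5;
    pub const HOT: u16 = 1 << 6;
    pub const OVERLAY: u16 = 1 << 7;
    pub const SNAPSHOT: u16 = 1 << 8;
    pub const CHECKPOINT: u16 = 1 << 9;
    /// Bits 10-15 are reserved.
    pub const RESERVED: u16 = 0xFC00;
}

/// True when every bit in `mask` is set in `bits`.
pub fn has_flag(bits: u16, mask: u16) -> bool {
    (bits & mask) == mask
}

/// ASSUMPTION: a strict reader rejects headers with reserved bits set.
pub fn reserved_bits_clear(bits: u16) -> bool {
    (bits & seg_flags::RESERVED) == 0
}

/// A signature footer follows the payload only when SIGNED is set.
pub fn expects_footer(bits: u16) -> bool {
    has_flag(bits, seg_flags::SIGNED)
}
```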
3. Segment Types
| Value | Name | Purpose |
|---|---|---|
| 0x01 | VEC_SEG | Raw vector payloads (the actual embeddings) |
| 0x02 | INDEX_SEG | HNSW adjacency lists, entry points, routing tables |
| 0x03 | OVERLAY_SEG | Graph overlay deltas, partition updates, min-cut witnesses |
| 0x04 | JOURNAL_SEG | Metadata mutations (label changes, deletions, moves) |
| 0x05 | MANIFEST_SEG | Segment directory, hotset pointers, epoch state |
| 0x06 | QUANT_SEG | Quantization dictionaries and codebooks |
| 0x07 | META_SEG | Arbitrary key-value metadata (tags, provenance, lineage) |
| 0x08 | HOT_SEG | Temperature-promoted hot data (vectors + neighbors) |
| 0x09 | SKETCH_SEG | Access counter sketches for temperature decisions |
| 0x0A | WITNESS_SEG | Capability manifests, proof of computation, audit trails |
| 0x0B | PROFILE_SEG | Domain profile declarations (RVDNA, RVText, etc.) |
| 0x0C | CRYPTO_SEG | Key material, signature chains, certificate anchors |
| 0x0D | METAIDX_SEG | Metadata inverted indexes for filtered search |
Reserved Range
Types 0x00 and 0xF0-0xFF are reserved. 0x00 indicates an uninitialized
or zeroed region (not a valid segment). 0xF0-0xFF are reserved for
implementation-specific extensions.
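One way to model the type byte is an enum with a fallible conversion that separates the reserved ranges from unknown values. The names `SegType` and `SegTypeError` are hypothetical, and the sketch covers only the thirteen values listed in this section.

```rust
use std::convert::TryFrom;

/// Segment types listed in this section.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SegType {
    Vec = 0x01,
    Index = 0x02,
    Overlay = 0x03,
    Journal = 0x04,
    Manifest = 0x05,
    Quant = 0x06,
    Meta = 0x07,
    Hot = 0x08,
    Sketch = 0x09,
    Witness = 0x0A,
    Profile = 0x0B,
    Crypto = 0x0C,
    MetaIdx = 0x0D,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SegTypeError {
    /// 0x00: uninitialized or zeroed region, not a valid segment.
    Uninitialized,
    /// 0xF0-0xFF: implementation-specific extension.
    Extension(u8),
    /// Any other value not listed in this section.
    Unknown(u8),
}

impl TryFrom<u8> for SegType {
    type Error = SegTypeError;

    fn try_from(v: u8) -> Result<Self, Self::Error> {
        Ok(match v {
            0x01 => SegType::Vec,
            0x02 => SegType::Index,
            0x03 => SegType::Overlay,
            0x04 => SegType::Journal,
            0x05 => SegType::Manifest,
            0x06 => SegType::Quant,
            0x07 => SegType::Meta,
            0x08 => SegType::Hot,
            0x09 => SegType::Sketch,
            0x0A => SegType::Witness,
            0x0B => SegType::Profile,
            0x0C => SegType::Crypto,
            0x0D => SegType::MetaIdx,
            0x00 => return Err(SegTypeError::Uninitialized),
            0xF0..=0xFF => return Err(SegTypeError::Extension(v)),
            _ => return Err(SegTypeError::Unknown(v)),
        })
    }
}
```

Keeping `Extension` distinct from `Unknown` lets a reader skip vendor segments while still flagging values that no known range explains.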
4. Segment Footer
If the SIGNED flag is set, the payload is followed by a signature footer:
| Offset | Size (bytes) | Field | Description |
|---|---|---|---|
| 0x00 | 2 | sig_algo | Signature algorithm: 0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s |
| 0x02 | 2 | sig_length | Byte length of signature |
| 0x04 | var | signature | The signature bytes |
| var | 4 | footer_length | Total footer size (for backward scanning) |
Unsigned segments have no footer — the next segment header follows immediately after the payload (at the next 64-byte aligned boundary).
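The footer layout and the alignment rule combine into a simple offset calculation. The sketch below uses hypothetical helper names and makes two assumptions that this section does not state: fields are little-endian, and `footer_length` counts all four footer fields including itself.

```rust
pub const HEADER_LEN: u64 = 64;
pub const ALIGN: u64 = 64;

/// Rounds `offset` up to the next 64-byte boundary.
pub fn align_up(offset: u64) -> u64 {
    (offset + ALIGN - 1) / ALIGN * ALIGN
}

/// Serializes a signature footer: sig_algo, sig_length, signature, footer_length.
/// Returns None if the signature does not fit the 16-bit sig_length field.
pub fn encode_footer(sig_algo: u16, signature: &[u8]) -> Option<Vec<u8>> {
    if signature.len() > u16::MAX as usize {
        return None;
    }
    let sig_length = signature.len() as u16;
    // ASSUMPTION: footer_length covers the whole footer, itself included.
    let total = 2 + 2 + signature.len() + 4;
    let mut out = Vec::with_capacity(total);
    out.extend_from_slice(&sig_algo.to_le_bytes());
    out.extend_from_slice(&sig_length.to_le_bytes());
    out.extend_from_slice(signature);
    out.extend_from_slice(&(total as u32).to_le_bytes());
    Some(out)
}

/// Offset of the next segment header. Pass `footer_len = 0` for unsigned segments.
pub fn next_header_offset(header_offset: u64, payload_length: u64, footer_len: u64) -> u64 {
    align_up(header_offset + HEADER_LEN + payload_length + footer_len)
}
```

For example, a segment at offset 0 with a 100-byte payload and a 64-byte Ed25519 signature occupies 64 + 100 + 72 = 236 bytes, so the next header lands at offset 256.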
5. Segment Lifecycle
Write Path
1. Allocate segment ID (monotonic counter)
2. Compute payload hash
3. Write header + payload + optional footer
4. fsync (or fdatasync for non-manifest segments)
5. Write MANIFEST_SEG referencing the new segment
6. fsync the manifest
The two-fsync protocol ensures that:
- If a crash occurs before step 5, the orphan segment is harmless (no manifest points to it)
- If a crash occurs during step 5 or 6, the partial manifest is detectable (bad hash)
- After step 6 completes, the segment is durably committed
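The two-fsync sequence can be sketched with the Rust standard library, where `File::sync_data` corresponds to fdatasync and `File::sync_all` to fsync. The function names and the `build_manifest` callback are hypothetical. Segment bytes are taken as already encoded (header, payload, optional footer), and padding each segment to a 64-byte boundary with zeros is an assumption.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

const ALIGN: u64 = 64;

/// Appends `segment` at the next 64-byte boundary and returns its offset.
fn append_aligned(file: &mut File, segment: &[u8]) -> io::Result<u64> {
    let end = file.seek(SeekFrom::End(0))?;
    let start = (end + ALIGN - 1) / ALIGN * ALIGN;
    // ASSUMPTION: the gap up to the aligned offset is zero-filled.
    file.write_all(&vec![0u8; (start - end) as usize])?;
    file.write_all(segment)?;
    Ok(start)
}

/// Commits one data segment followed by a manifest that references it.
/// `build_manifest` receives the data segment's offset and returns the
/// encoded MANIFEST_SEG bytes (manifest encoding is out of scope here).
pub fn commit_segment(
    path: &Path,
    data_segment: &[u8],
    build_manifest: impl FnOnce(u64) -> Vec<u8>,
) -> io::Result<(u64, u64)> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // Steps 1-3: the caller has allocated the ID, hashed the payload,
    // and encoded header + payload + optional footer into `data_segment`.
    let seg_off = append_aligned(&mut file, data_segment)?;
    // Step 4: fdatasync is enough for a non-manifest segment.
    file.sync_data()?;

    // Step 5: write the manifest that makes the segment reachable.
    let manifest = build_manifest(seg_off);
    let manifest_off = append_aligned(&mut file, &manifest)?;
    // Step 6: full fsync; the commit is durable once this returns.
    file.sync_all()?;

    Ok((seg_off, manifest_off))
}
```

Opening the file in append mode means the writer never seeks into existing segments, which matches the append-only rule.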
Read Path
1. Seek to EOF
2. Scan backward for latest MANIFEST_SEG (look for magic at aligned boundaries)
3. Parse manifest -> get segment directory
4. Map segments on demand (progressive loading)
Compaction
Compaction merges multiple segments into fewer, larger, sealed segments:
Before: [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3]
After:  [VEC_SEG_1] [VEC_SEG_2] [VEC_SEG_3] [MANIFEST_3] [VEC_SEG_sealed] [MANIFEST_4]
                                                         ^^^^^^^^^^^^^^^^
                                                         New sealed segment
                                                         merging 1+2+3
Old segments are marked with TOMBSTONE entries in the new manifest. Their space is reclaimed when the file is eventually rewritten or, in multi-file mode, when the old segments live in a separate file.
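The manifest bookkeeping for a compaction step can be sketched as follows. The `DirEntry` structure and both function names are hypothetical; the manifest's actual directory encoding is not defined in this section.

```rust
/// Hypothetical in-memory directory entry for one segment.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DirEntry {
    pub segment_id: u64,
    pub offset: u64,
    /// Total on-disk size of the segment in bytes.
    pub len: u64,
    pub tombstoned: bool,
}

/// Builds the directory for the post-compaction manifest: every merged
/// input segment is tombstoned and the sealed output is appended.
/// The old bytes stay in the file; only the directory changes.
pub fn compact_directory(old: &[DirEntry], merged: &[u64], sealed: DirEntry) -> Vec<DirEntry> {
    let mut next: Vec<DirEntry> = old
        .iter()
        .cloned()
        .map(|mut e| {
            if merged.contains(&e.segment_id) {
                e.tombstoned = true;
            }
            e
        })
        .collect();
    next.push(sealed);
    next
}

/// Bytes that a later file rewrite could reclaim.
pub fn reclaimable_bytes(dir: &[DirEntry]) -> u64 {
    dir.iter().filter(|e| e.tombstoned).map(|e| e.len).sum()
}
```

Tracking reclaimable bytes gives a writer a simple trigger for deciding when a full rewrite is worth its cost.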
Multi-File Mode
For very large datasets, RVF can span multiple files:
data.rvf Main file with manifests and hot data
data.rvf.cold.0 Cold segment shard 0
data.rvf.cold.1 Cold segment shard 1
data.rvf.idx.0 Index segment shard 0
The manifest in the main file contains shard references with file paths and byte ranges. This enables cold data to live on slower storage while hot data stays on fast storage.
6. Segment Addressing
Segments are addressed by their segment_id (monotonically increasing 64-bit
integer). The manifest maps segment IDs to file offsets (and optionally shard
file paths in multi-file mode).
Within a segment, data is addressed by block offset — a 32-bit offset from the start of the segment payload. This limits individual segments to 4 GB, which is intentional: it keeps segments manageable for compaction and progressive loading.
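Two-level addressing (segment ID, then block offset) can be sketched with an ordered map. `Directory` and `SegmentLocation` are hypothetical names, and the sketch assumes the manifest records the offset of each segment header, so the payload starts 64 bytes later.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

pub const HEADER_LEN: u64 = 64;

/// Where a segment lives: `shard` is None for the main file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SegmentLocation {
    pub shard: Option<PathBuf>,
    /// ASSUMPTION: offset of the segment header within its file.
    pub offset: u64,
}

/// Hypothetical in-memory form of the manifest's segment directory.
#[derive(Debug, Default)]
pub struct Directory {
    by_id: BTreeMap<u64, SegmentLocation>,
}

impl Directory {
    pub fn insert(&mut self, segment_id: u64, loc: SegmentLocation) {
        self.by_id.insert(segment_id, loc);
    }

    /// Resolves (segment_id, block_offset) to a file and an absolute byte
    /// position. The u32 block offset is what caps payloads at 4 GB.
    pub fn resolve(&self, segment_id: u64, block_offset: u32) -> Option<(Option<&PathBuf>, u64)> {
        let loc = self.by_id.get(&segment_id)?;
        let pos = loc
            .offset
            .checked_add(HEADER_LEN)?
            .checked_add(u64::from(block_offset))?;
        Some((loc.shard.as_ref(), pos))
    }
}
```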
Block Structure Within VEC_SEG
+-------------------+
| Block Header (16B)|
| block_id: u32 |
| count: u32 |
| dim: u16 |
| dtype: u8 |
| pad: [u8; 5] |
+-------------------+
| Vectors |
| (count * dim * |
| sizeof(dtype)) |
| [64B aligned] |
+-------------------+
| ID Map |
| (varint delta |
| encoded IDs) |
+-------------------+
| Block Footer |
| crc32c: u32 |
+-------------------+
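The ID map's "varint delta" encoding could work as in the sketch below. The exact varint format is not defined here, so the code assumes unsigned LEB128 (7 bits per byte, low bits first), IDs sorted in ascending order, and a first delta taken against zero. All function names are hypothetical.

```rust
/// Appends `v` as an unsigned LEB128 varint.
fn put_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push(((v as u8) & 0x7F) | 0x80); // low 7 bits, continuation bit set
        v >>= 7;
    }
    out.push(v as u8);
}

/// Reads one varint at `*pos`, advancing it. None on truncated or overlong input.
fn get_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(byte & 0x7F) << shift;
        if (byte & 0x80) == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Encodes ascending IDs as deltas from the previous ID.
/// Returns None if the input is not sorted in ascending order.
pub fn encode_ids(sorted_ids: &[u64]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &id in sorted_ids {
        put_varint(&mut out, id.checked_sub(prev)?);
        prev = id;
    }
    Some(out)
}

/// Decodes `count` IDs (the block header's `count` field).
pub fn decode_ids(buf: &[u8], count: usize) -> Option<Vec<u64>> {
    let mut ids = Vec::with_capacity(count);
    let mut pos = 0usize;
    let mut prev = 0u64;
    for _ in 0..count {
        prev = prev.checked_add(get_varint(buf, &mut pos)?)?;
        ids.push(prev);
    }
    Some(ids)
}
```

Densely allocated IDs produce small deltas, so most entries take one or two bytes instead of eight.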
Vectors within a block are stored in columnar order: all dimension 0 values, then all dimension 1 values, and so on. Grouping values of the same dimension together improves the compression ratio. HOT_SEG, by contrast, stores vectors interleaved (row-major) for cache-friendly sequential scans during top-K refinement.
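The two layouts differ only in index order, as the transposition sketch below shows. It assumes `f32` elements for illustration (the block header's `dtype` values are not listed in this section), and the function names are hypothetical.

```rust
/// Converts row-major vectors (v0d0, v0d1, ..., v1d0, ...) into columnar
/// order (v0d0, v1d0, ..., v0d1, v1d1, ...), as used inside a VEC_SEG block.
pub fn to_columnar(rows: &[f32], count: usize, dim: usize) -> Vec<f32> {
    assert_eq!(rows.len(), count * dim);
    let mut cols = vec![0.0f32; rows.len()];
    for v in 0..count {
        for d in 0..dim {
            cols[d * count + v] = rows[v * dim + d];
        }
    }
    cols
}

/// Inverse transform: columnar back to interleaved (row-major),
/// the layout described for HOT_SEG.
pub fn to_row_major(cols: &[f32], count: usize, dim: usize) -> Vec<f32> {
    assert_eq!(cols.len(), count * dim);
    let mut rows = vec![0.0f32; cols.len()];
    for v in 0..count {
        for d in 0..dim {
            rows[v * dim + d] = cols[d * count + v];
        }
    }
    rows
}
```

With two 3-dimensional vectors `[1, 2, 3]` and `[4, 5, 6]`, the columnar form is `[1, 4, 2, 5, 3, 6]`: each dimension's values sit next to each other.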
7. Invariants
- Segment IDs are strictly monotonically increasing within a file
- A valid RVF file contains at least one MANIFEST_SEG
- The last MANIFEST_SEG is always the source of truth
- Segment headers are always 64-byte aligned
- No segment payload exceeds 4 GB
- Content hashes are computed over the raw (uncompressed, unencrypted) payload
- Sealed segments are never modified — only tombstoned
- A reader that cannot find a valid MANIFEST_SEG must reject the file
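Several of these invariants can be checked mechanically once segment headers have been located. The sketch below uses hypothetical names and takes a list of already parsed segments in file order. It reads "4 GB" as 2^32 bytes, which is an assumption, and it does not cover the hash or sealing invariants.

```rust
const ALIGN: u64 = 64;
const MANIFEST_SEG: u8 = 0x05;
/// ASSUMPTION: "4 GB" is read as 2^32 bytes.
const MAX_PAYLOAD: u64 = 1 << 32;

/// Minimal per-segment facts needed for the checks.
#[derive(Debug, Clone, Copy)]
pub struct SegmentInfo {
    pub offset: u64,
    pub segment_id: u64,
    pub seg_type: u8,
    pub payload_length: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    Misaligned(u64),
    NonMonotonicId(u64),
    PayloadTooLarge(u64),
    NoManifest,
}

/// Validates segments listed in file order. On success, returns the index
/// of the last MANIFEST_SEG, which is the source of truth.
pub fn check_invariants(segs: &[SegmentInfo]) -> Result<usize, Violation> {
    let mut prev_id: Option<u64> = None;
    let mut last_manifest: Option<usize> = None;
    for (i, s) in segs.iter().enumerate() {
        if s.offset % ALIGN != 0 {
            return Err(Violation::Misaligned(s.offset));
        }
        if let Some(p) = prev_id {
            if s.segment_id <= p {
                return Err(Violation::NonMonotonicId(s.segment_id));
            }
        }
        if s.payload_length > MAX_PAYLOAD {
            return Err(Violation::PayloadTooLarge(s.segment_id));
        }
        if s.seg_type == MANIFEST_SEG {
            last_manifest = Some(i);
        }
        prev_id = Some(s.segment_id);
    }
    // A file without any MANIFEST_SEG must be rejected.
    last_manifest.ok_or(Violation::NoManifest)
}
```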