ruvector/docs/research/rvf/wire/binary-layout.md
rUv f8870b3c71 feat(rvf): RuVector Format — Universal Cognitive Container SDK (#166)
* feat(rvf): add RuVector Format universal substrate specification

Research and design for RVF — a streaming, progressive, adaptive, quantum-secure
binary format for vector intelligence. Covers append-only segment model, two-level
tail manifests, temperature tiering, progressive HNSW indexing, epoch-based overlay
system, SIMD-optimized query paths, WASM microkernel for Cognitum tiles, domain
profiles (RVDNA, RVText, RVGraph, RVVision), and post-quantum cryptography.

https://claude.ai/code/session_01DDqjGE51JpsRE3DgUjFyjW

* feat(rvf): add deletion, filtered search, concurrency, and operations specs

Fill four specification gaps in the RVF format design:
- spec/07: Vector deletion lifecycle, JOURNAL_SEG wire format, deletion bitmaps
- spec/08: Filtered search with META_SEG, METAIDX_SEG, filter expression language
- spec/09: Writer locking, reader-writer coordination, versioning, space reclamation
- spec/10: Batch operations API, error codes, network streaming protocol

Also fixes the segment header field conflict between spec/01 and wire/binary-layout.md
(checksum_algo/compression now u8, adds uncompressed_len at 0x38).

https://claude.ai/code/session_01DDqjGE51JpsRE3DgUjFyjW

* feat(rvf): add RuVector Format SDK, 40 examples, MCP server, and documentation

Complete RVF implementation including:
- 12 Rust crates (rvf-types, rvf-wire, rvf-manifest, rvf-index, rvf-quant,
  rvf-crypto, rvf-runtime, rvf-import, rvf-wasm, rvf-node, rvf-server,
  plus integration tests)
- 40 runnable examples covering core storage, agentic AI, production
  patterns, vertical domains, exotic capabilities, runtime targets,
  network/security, POSIX/systems, and network operations
- TypeScript SDK (npm/packages/rvf) with RvfDatabase class
- MCP server (npm/packages/rvf-mcp-server) with stdio and SSE transports
- Node.js N-API bindings (npm/packages/rvf-node)
- WASM package (npm/packages/rvf-wasm)
- ADR-029 (canonical format), ADR-030 (computational container),
  ADR-031 (example repository)
- DNA-style lineage provenance, computational containers (KERNEL_SEG,
  EBPF_SEG), witness chains, TEE attestation, domain profiles
- Superseded ADR annotations for ADR-001, ADR-005, ADR-006, ADR-018-021

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add CLI, WASM store, generate_all, and 46 output .rvf files

- Add rvf-cli crate (665 lines, 9 subcommands: create/ingest/query/delete/status/inspect/compact/derive/serve)
- Add WASM control plane store (alloc_setup, segment, store modules) for ~46 KB binary
- Add generate_all.rs example producing 46 persistent .rvf files in output/
- Add Node.js N-API bindings for lineage, kernel/eBPF, and inspection
- Add npm TypeScript backend/database/types for RVF integration
- Update READMEs with CLI sections, MCP server docs, and crate map (13 crates)
- All 40 examples verified passing

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add Claude Code appliance, improve Quick Start, fix API docs

- Add claude_code_appliance.rs: self-booting RVF with SSH + Claude Code
  install (curl -fsSL https://claude.ai/install.sh | bash), 3 SSH users,
  eBPF filter, 20-package manifest, witness chain, lineage snapshot
- Improve Quick Start: Install section (crate/CLI/npm/WASM/MCP), WASM
  browser example, generate_all reference, expanded Rust crate deps
- Fix embed_kernel/embed_ebpf API docs to match actual signatures
  (u8 params with `as u8` cast, 6-param kernel, Option<&[u8]> btf)
- Update generate_all.rs: add claude_code_appliance generator (47 files)
- Regenerate all 47 output .rvf files

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add RVCOW branching, real kernel/eBPF/launcher, 795 tests

Vector-native copy-on-write branching (ADR-031) with four new segment
types (COW_MAP 0x20, REFCOUNT 0x21, MEMBERSHIP 0x22, DELTA 0x23),
real Linux microkernel builder, QEMU microVM launcher, real eBPF
programs, and 128-byte KernelBinding for tamper-evident kernel-manifest
linkage.

New crates:
- rvf-kernel: Docker-based kernel build, real cpio/newc initramfs builder,
  SHA3-256 verification, prebuilt kernel support (37 tests)
- rvf-launch: QEMU microVM launcher with QMP shutdown, KVM/TCG detection,
  virtio-blk/net port forwarding, kernel extraction (8 tests)
- rvf-ebpf: 3 real BPF C programs (xdp_distance, socket_filter,
  tc_query_route) with clang compilation support (17 tests)

RVCOW runtime:
- CowEngine with read/write paths, write coalescing, snapshot-freeze
- CowMap (flat-array), MembershipFilter (bitmap), CowCompactor
- 3x read performance via pread optimization (1.3us/vector)
- Branch creation: 2.6ms for 10K vectors, child = 162 bytes

Security: 20-finding audit, 7 fixes applied including division-by-zero
guards, integer overflow checks, and KernelBinding::from_bytes_validated().

CLI: 8 new commands (launch, embed-kernel, embed-ebpf, filter, freeze,
verify-witness, verify-attestation, rebuild-refcounts), serve wired to
real rvf-server.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): update README, add crate/npm READMEs, publish to crates.io and npm

- Rewrite README with cognitive container terminology, grouped features,
  4 comparison tables (vs Docker, Vector DBs, Git LFS, SQLite), updated
  benchmarks, architecture diagram, and 45 examples
- Add READMEs for rvf-kernel, rvf-launch, rvf-ebpf, rvf-import crates
- Add READMEs for @ruvector/rvf, rvf-node, rvf-wasm, rvf-mcp-server npm packages
- Fix Cargo.toml metadata (homepage, readme, categories, keywords) and
  add version specs to all path dependencies for crates.io publishing
- Fix clippy warnings in rvf-kernel/initramfs.rs and rvf-launch/lib.rs
- Published to crates.io: rvf-types, rvf-wire, rvf-manifest, rvf-quant,
  rvf-index, rvf-crypto (remaining crates pending rate limit)
- Published to npm: @ruvector/rvf, @ruvector/rvf-node, @ruvector/rvf-wasm,
  @ruvector/rvf-mcp-server

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: add rvf-kernel, rvf-ebpf, rvf-launch, rvf-server, rvf-import, rvf-cli to workspace

Include all 15 RVF crates plus integration tests and benchmarks in the
root workspace members list so cargo publish can resolve them by name.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add published packages, cognitive container branding, grouped capabilities

- Add Published Packages section with 13 crates.io + 4 npm tables
- Add Platform Support table (Linux, macOS, Windows, WASM, no_std)
- Expand capability table from 9 to 15 rows in 4 groups
- Rewrite all "How" descriptions in plain language
- Update .rvf diagram to show all 20 segment types
- Rename ADRs: computational container -> cognitive container
- Add emojis to all section headers

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: update root README with RVF cognitive containers, expanded capabilities

- Update intro: "gets smarter + ships as cognitive container"
- Add self-booting microservice row to Pinecone comparison table
- Expand capabilities from 34 to 42 features with dedicated RVF section
- Update "Think of it as" to include Docker comparison and RVF explanation
- Add RVF collapsed group to Ecosystem (13 crates, 4 npm, install commands)
- Add RVF to Platform & Edge section with install commands
- Add RVF npm packages (4) and Rust crates (13) to package reference
- Add RVF rows to feature comparison table (6 new rows)
- Add ADR-030/031 to ADR list
- Add RVF to Installation table, Project Structure
- Update attention mechanisms count from 39 to 40+
- Update npm count to 49+, Rust crates to 83
- Update footer with crates.io and RVF links

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: expand comparison table with emojis, cost, audit, branching, single-file

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs: rewrite comparison table in plain language

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: clean up empty code change sections in the changes log

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-14 13:14:49 -05:00

16 KiB

RVF Wire Format Reference

1. File Structure

An RVF file is a byte stream with no fixed header at offset 0. All structure is discovered from the tail.

Byte 0                                                               EOF
|                                                                      |
v                                                                      v
+--------+--------+--------+     +--------+---------+--------+---------+
| Seg 0  | Seg 1  | Seg 2  | ... | Seg N  | Seg N+1 | Seg N+2| Mfst K |
| VEC    | VEC    | INDEX  |     | VEC    | HOT     | INDEX  | MANIF  |
+--------+--------+--------+     +--------+---------+--------+---------+
                                                               ^       ^
                                                               |       |
                                                        Level 1 Mfst   |
                                                                Level 0
                                                              (last 4KB)

Alignment Rule

Every segment starts at a 64-byte aligned boundary. If a segment's payload + footer does not end on a 64-byte boundary, zero-padding is inserted before the next segment header.

Byte Order

All multi-byte integers are little-endian. All floating-point values are IEEE 754 little-endian. This matches x86, ARM (in default mode), and WASM native byte order.

2. Primitive Types

Type        Size    Encoding
----        ----    --------
u8          1       Unsigned 8-bit integer
u16         2       Unsigned 16-bit little-endian
u32         4       Unsigned 32-bit little-endian
u64         8       Unsigned 64-bit little-endian
i32         4       Signed 32-bit little-endian (two's complement)
i64         8       Signed 64-bit little-endian (two's complement)
f16         2       IEEE 754 half-precision little-endian
f32         4       IEEE 754 single-precision little-endian
f64         8       IEEE 754 double-precision little-endian
varint      1-10    LEB128 unsigned variable-length integer
svarint     1-10    ZigZag + LEB128 signed variable-length integer
hash128     16      First 128 bits of hash output
hash256     32      First 256 bits of hash output

Varint Encoding (LEB128)

Value 0-127:        1 byte   [0xxxxxxx]
Value 128-16383:    2 bytes  [1xxxxxxx 0xxxxxxx]
Value 16384-2097151: 3 bytes [1xxxxxxx 1xxxxxxx 0xxxxxxx]
...up to 10 bytes for u64

Delta Encoding

Sequences of sorted integers use delta encoding:

Original:  [100, 105, 108, 120, 200]
Deltas:    [100,   5,   3,  12,  80]
Encoded:   [varint(100), varint(5), varint(3), varint(12), varint(80)]

With restart points every N entries, the first value in each restart group is absolute (not delta-encoded).

3. Segment Header (64 bytes)

Offset  Type    Field              Notes
------  ----    -----              -----
0x00    u32     magic              Always 0x52564653 ("RVFS")
0x04    u8      version            Format version (1)
0x05    u8      seg_type           Segment type enum
0x06    u16     flags              See flags bitfield
0x08    u64     segment_id         Monotonic ordinal
0x10    u64     payload_length     Bytes after header, before footer
0x18    u64     timestamp_ns       UNIX nanoseconds
0x20    u8      checksum_algo      0=CRC32C, 1=XXH3-128, 2=SHAKE-256
0x21    u8      compression        0=none, 1=LZ4, 2=ZSTD, 3=custom
0x22    u16     reserved_0         Must be 0x0000
0x24    u32     reserved_1         Must be 0x00000000
0x28    hash128 content_hash       Payload hash (first 128 bits)
0x38    u32     uncompressed_len   Original payload size (0 if no compression)
0x3C    u32     alignment_pad      Zero padding to 64B boundary

Segment Type Enum

0x00    INVALID         Not a valid segment
0x01    VEC_SEG         Vector payloads
0x02    INDEX_SEG       HNSW adjacency
0x03    OVERLAY_SEG     Graph overlay deltas
0x04    JOURNAL_SEG     Metadata mutations
0x05    MANIFEST_SEG    Segment directory
0x06    QUANT_SEG       Quantization dictionaries
0x07    META_SEG        Key-value metadata
0x08    HOT_SEG         Temperature-promoted data
0x09    SKETCH_SEG      Access counter sketches
0x0A    WITNESS_SEG     Capability manifests
0x0B    PROFILE_SEG     Domain profile declarations
0x0C    CRYPTO_SEG      Key material / certificate anchors
0x0D-0xEF  reserved
0xF0-0xFF  extension    Implementation-specific

Flags Bitfield

Bit  Mask    Name         Meaning
---  ----    ----         -------
0    0x0001  COMPRESSED   Payload compressed per compression field
1    0x0002  ENCRYPTED    Payload encrypted (key in CRYPTO_SEG)
2    0x0004  SIGNED       Signature footer follows payload
3    0x0008  SEALED       Immutable (compaction output)
4    0x0010  PARTIAL      Partial/streaming write
5    0x0020  TOMBSTONE    Logically deletes prior segment
6    0x0040  HOT          Contains hot-tier data
7    0x0080  OVERLAY      Contains overlay/delta data
8    0x0100  SNAPSHOT     Full snapshot (not delta)
9    0x0200  CHECKPOINT   Safe rollback point
10-15        reserved     Must be zero

Present only if SIGNED flag is set. Follows immediately after the payload.

Offset  Type    Field           Notes
------  ----    -----           -----
0x00    u16     sig_algo        0=Ed25519, 1=ML-DSA-65, 2=SLH-DSA-128s
0x02    u16     sig_length      Signature byte length
0x04    u8[]    signature       Signature bytes
var     u32     footer_length   Total footer size (for backward scan)

Signature Algorithm Sizes

Algorithm sig_length Post-Quantum Performance
Ed25519 64 B No ~76,000 sign/s
ML-DSA-65 3,309 B Yes (NIST Level 3) ~4,500 sign/s
SLH-DSA-128s 7,856 B Yes (NIST Level 1) ~350 sign/s

5. VEC_SEG Payload Layout

Vector segments store blocks of vectors in columnar layout for compression.

+------------------------------------------+
| VEC_SEG Payload                          |
+------------------------------------------+
| Block Directory                          |
|   block_count: u32                       |
|   For each block:                        |
|     block_offset: u32 (from payload start)|
|     vector_count: u32                    |
|     dim: u16                             |
|     dtype: u8                            |
|     tier: u8                             |
|   [64B aligned]                          |
+------------------------------------------+
| Block 0                                  |
|   +-- Columnar Vectors --+               |
|   | dim_0[0..count]      |  <- all vals  |
|   | dim_1[0..count]      |     for dim 0 |
|   | ...                  |     then dim 1 |
|   | dim_D[0..count]      |     etc.      |
|   +----------------------+               |
|   +-- ID Map --+                         |
|   | encoding: u8 (0=raw, 1=delta-varint) |
|   | restart_interval: u16                |
|   | id_count: u32                        |
|   | [restart_offsets: u32[]] (if delta)   |
|   | [ids: encoded]                       |
|   +-----------+                          |
|   +-- Block CRC --+                      |
|   | crc32c: u32    |                     |
|   +----------------+                     |
|   [64B padding]                          |
+------------------------------------------+
| Block 1                                  |
|   ...                                    |
+------------------------------------------+

Data Type Enum

0x00    f32     32-bit float
0x01    f16     16-bit float
0x02    bf16    bfloat16
0x03    i8      signed 8-bit integer (scalar quantized)
0x04    u8      unsigned 8-bit integer
0x05    i4      4-bit integer (packed, 2 per byte)
0x06    binary  1-bit (packed, 8 per byte)
0x07    pq      Product-quantized codes
0x08    custom  Custom encoding (see QUANT_SEG)

Columnar vs Interleaved

VEC_SEG (columnar): dim_0[all], dim_1[all], ..., dim_D[all]

  • Better compression (similar values adjacent)
  • Better for batch operations
  • Worse for single-vector random access

HOT_SEG (interleaved): vec_0[all_dims], vec_1[all_dims], ...

  • Better for single-vector access (one cache line per vector)
  • Better for top-K refinement (sequential scan)
  • No compression benefit

6. INDEX_SEG Payload Layout

+------------------------------------------+
| INDEX_SEG Payload                        |
+------------------------------------------+
| Index Header                             |
|   index_type: u8 (0=HNSW, 1=IVF, 2=flat)|
|   layer_level: u8 (A=0, B=1, C=2)       |
|   M: u16 (HNSW max neighbors per layer)  |
|   ef_construction: u32                   |
|   node_count: u64                        |
|   [64B aligned]                          |
+------------------------------------------+
| Restart Point Index                      |
|   restart_interval: u32                  |
|   restart_count: u32                     |
|   [restart_offset: u32] * count          |
|   [64B aligned]                          |
+------------------------------------------+
| Adjacency Data                           |
|   For each node (sorted by node_id):     |
|     layer_count: varint                  |
|     For each layer:                      |
|       neighbor_count: varint             |
|       [delta_neighbor_id: varint] * cnt  |
|   [64B padding per restart group]        |
+------------------------------------------+
| Prefetch Hints (optional)                |
|   hint_count: u32                        |
|   For each hint:                         |
|     node_range_start: u64                |
|     node_range_end: u64                  |
|     page_offset: u64                     |
|     page_count: u32                      |
|     prefetch_ahead: u32                  |
|   [64B aligned]                          |
+------------------------------------------+

7. HOT_SEG Payload Layout

The hot segment stores the most-accessed vectors in interleaved (row-major) layout with their neighbor lists co-located for cache locality.

+------------------------------------------+
| HOT_SEG Payload                          |
+------------------------------------------+
| Hot Header                               |
|   vector_count: u32                      |
|   dim: u16                               |
|   dtype: u8 (f16 or i8)                  |
|   neighbor_M: u16                        |
|   [64B aligned]                          |
+------------------------------------------+
| Interleaved Hot Data                     |
|   For each hot vector:                   |
|     vector_id: u64                       |
|     vector: [dtype * dim]                |
|     neighbor_count: u16                  |
|     [neighbor_id: u64] * neighbor_count  |
|     [64B aligned per entry]              |
+------------------------------------------+

Each hot entry is self-contained: vector + neighbors in one contiguous block. A sequential scan of the HOT_SEG for top-K refinement reads vectors and neighbors without any pointer chasing.

Hot Entry Size Example

For 384-dim fp16 vectors with M=16 neighbors:

8 (id) + 768 (vector) + 2 (count) + 128 (neighbors) = 906 bytes
Padded to 64B: 960 bytes per entry

1000 hot vectors = 960 KB (fits in L2 cache on most CPUs).

8. MANIFEST_SEG Payload Layout

+------------------------------------------+
| MANIFEST_SEG Payload                     |
+------------------------------------------+
| TLV Records (Level 1 manifest)           |
|   For each record:                       |
|     tag: u16                             |
|     length: u32                          |
|     pad: u16 (to 8B alignment)           |
|     value: [u8; length]                  |
|     [8B aligned]                         |
+------------------------------------------+
| Level 0 Root Manifest (last 4096 bytes)  |
|   (See 02-manifest-system.md for layout) |
+------------------------------------------+

9. SKETCH_SEG Payload Layout

+------------------------------------------+
| SKETCH_SEG Payload                       |
+------------------------------------------+
| Sketch Header                            |
|   block_count: u32                       |
|   width: u32 (counters per row)          |
|   depth: u32 (hash functions)            |
|   counter_bits: u8 (8 or 16)            |
|   decay_shift: u8 (aging right-shift)    |
|   total_accesses: u64                    |
|   [64B aligned]                          |
+------------------------------------------+
| Sketch Data                              |
|   For each block:                        |
|     block_id: u32                        |
|     counters: [u8; width * depth]        |
|   [64B aligned per block]               |
+------------------------------------------+

10. QUANT_SEG Payload Layout

+------------------------------------------+
| QUANT_SEG Payload                        |
+------------------------------------------+
| Quant Header                             |
|   quant_type: u8                         |
|     0 = scalar (min-max per dim)         |
|     1 = product quantization             |
|     2 = binary threshold                 |
|     3 = residual PQ                      |
|   tier: u8                               |
|   dim: u16                               |
|   [64B aligned]                          |
+------------------------------------------+
| Type-specific data:                      |
|                                          |
| Scalar (type 0):                         |
|   min: [f32; dim]                        |
|   max: [f32; dim]                        |
|                                          |
| PQ (type 1):                             |
|   M: u16 (subspaces)                     |
|   K: u16 (centroids per sub)             |
|   sub_dim: u16 (dims per sub)            |
|   codebook: [f32; M * K * sub_dim]       |
|                                          |
| Binary (type 2):                         |
|   threshold: [f32; dim]                  |
|                                          |
| Residual PQ (type 3):                    |
|   coarse_centroids: [f32; K_coarse * dim]|
|   residual_codebook: [f32; M * K * sub]  |
|                                          |
| [64B aligned]                            |
+------------------------------------------+

11. Checksum Algorithms

ID Algorithm Output Speed (HW accel) Use Case
0 CRC32C 4 B (stored in 16B field, zero-padded) ~3 GB/s (SSE4.2) Per-block integrity
1 XXH3-128 16 B ~50 GB/s (AVX2) Segment content hash
2 SHAKE-256 16 or 32 B ~1 GB/s Cryptographic verification

Default recommendation:

  • Block-level CRC: CRC32C (fastest, hardware accelerated)
  • Segment content hash: XXH3-128 (fast, good distribution)
  • Crypto witness hashes: SHAKE-256 (post-quantum safe)

12. Compression

ID Algorithm Ratio Decompress Speed Use Case
0 None 1.0x N/A Hot tier
1 LZ4 1.5-3x ~4 GB/s Warm tier, low latency
2 ZSTD 3-6x ~1.5 GB/s Cold tier, high ratio
3 Custom Varies Varies Domain-specific

Compression is applied per-segment payload. Individual blocks within a segment share the same compression.

13. Tail Scan Algorithm

def find_latest_manifest(file):
    file_size = file.seek(0, SEEK_END)

    # Try fast path: last 4096 bytes
    file.seek(file_size - 4096)
    root = file.read(4096)
    if root[0:4] == b'RVM0' and verify_crc(root):
        return parse_root_manifest(root)

    # Slow path: scan backward for MANIFEST_SEG header
    scan_pos = file_size - 64  # Start at last 64B boundary
    while scan_pos >= 0:
        file.seek(scan_pos)
        header = file.read(64)
        if (header[0:4] == b'RVFS' and
            header[5] == 0x05 and  # MANIFEST_SEG
            verify_segment_header(header)):
            return parse_manifest_segment(file, scan_pos)
        scan_pos -= 64  # Previous 64B boundary

    raise CorruptFileError("No valid MANIFEST_SEG found")

Worst case: full backward scan at 64B granularity. For a 4 GB file, this is 67M checks — but each check is a 4-byte comparison, so it completes in ~100ms on a modern CPU with mmap. In practice, the fast path succeeds on the first try for non-corrupt files.