mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-26 16:04:02 +00:00

rUv f8870b3c71 feat(rvf): RuVector Format — Universal Cognitive Container SDK (#166 )

* feat(rvf): add RuVector Format universal substrate specification

Research and design for RVF — a streaming, progressive, adaptive, quantum-secure
binary format for vector intelligence. Covers append-only segment model, two-level
tail manifests, temperature tiering, progressive HNSW indexing, epoch-based overlay
system, SIMD-optimized query paths, WASM microkernel for Cognitum tiles, domain
profiles (RVDNA, RVText, RVGraph, RVVision), and post-quantum cryptography.

https://claude.ai/code/session_01DDqjGE51JpsRE3DgUjFyjW

* feat(rvf): add deletion, filtered search, concurrency, and operations specs

Fill four specification gaps in the RVF format design:
- spec/07: Vector deletion lifecycle, JOURNAL_SEG wire format, deletion bitmaps
- spec/08: Filtered search with META_SEG, METAIDX_SEG, filter expression language
- spec/09: Writer locking, reader-writer coordination, versioning, space reclamation
- spec/10: Batch operations API, error codes, network streaming protocol

Also fixes the segment header field conflict between spec/01 and wire/binary-layout.md
(checksum_algo/compression now u8, adds uncompressed_len at 0x38).

https://claude.ai/code/session_01DDqjGE51JpsRE3DgUjFyjW

* feat(rvf): add RuVector Format SDK, 40 examples, MCP server, and documentation

Complete RVF implementation including:
- 12 Rust crates (rvf-types, rvf-wire, rvf-manifest, rvf-index, rvf-quant,
  rvf-crypto, rvf-runtime, rvf-import, rvf-wasm, rvf-node, rvf-server,
  plus integration tests)
- 40 runnable examples covering core storage, agentic AI, production
  patterns, vertical domains, exotic capabilities, runtime targets,
  network/security, POSIX/systems, and network operations
- TypeScript SDK (npm/packages/rvf) with RvfDatabase class
- MCP server (npm/packages/rvf-mcp-server) with stdio and SSE transports
- Node.js N-API bindings (npm/packages/rvf-node)
- WASM package (npm/packages/rvf-wasm)
- ADR-029 (canonical format), ADR-030 (computational container),
  ADR-031 (example repository)
- DNA-style lineage provenance, computational containers (KERNEL_SEG,
  EBPF_SEG), witness chains, TEE attestation, domain profiles
- Superseded ADR annotations for ADR-001, ADR-005, ADR-006, ADR-018-021

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add CLI, WASM store, generate_all, and 46 output .rvf files

- Add rvf-cli crate (665 lines, 9 subcommands: create/ingest/query/delete/status/inspect/compact/derive/serve)
- Add WASM control plane store (alloc_setup, segment, store modules) for ~46 KB binary
- Add generate_all.rs example producing 46 persistent .rvf files in output/
- Add Node.js N-API bindings for lineage, kernel/eBPF, and inspection
- Add npm TypeScript backend/database/types for RVF integration
- Update READMEs with CLI sections, MCP server docs, and crate map (13 crates)
- All 40 examples verified passing

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add Claude Code appliance, improve Quick Start, fix API docs

- Add claude_code_appliance.rs: self-booting RVF with SSH + Claude Code
  install (curl -fsSL https://claude.ai/install.sh | bash), 3 SSH users,
  eBPF filter, 20-package manifest, witness chain, lineage snapshot
- Improve Quick Start: Install section (crate/CLI/npm/WASM/MCP), WASM
  browser example, generate_all reference, expanded Rust crate deps
- Fix embed_kernel/embed_ebpf API docs to match actual signatures
  (u8 params with `as u8` cast, 6-param kernel, Option<&[u8]> btf)
- Update generate_all.rs: add claude_code_appliance generator (47 files)
- Regenerate all 47 output .rvf files

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add RVCOW branching, real kernel/eBPF/launcher, 795 tests

Vector-native copy-on-write branching (ADR-031) with four new segment
types (COW_MAP 0x20, REFCOUNT 0x21, MEMBERSHIP 0x22, DELTA 0x23),
real Linux microkernel builder, QEMU microVM launcher, real eBPF
programs, and 128-byte KernelBinding for tamper-evident kernel-manifest
linkage.

New crates:
- rvf-kernel: Docker-based kernel build, real cpio/newc initramfs builder,
  SHA3-256 verification, prebuilt kernel support (37 tests)
- rvf-launch: QEMU microVM launcher with QMP shutdown, KVM/TCG detection,
  virtio-blk/net port forwarding, kernel extraction (8 tests)
- rvf-ebpf: 3 real BPF C programs (xdp_distance, socket_filter,
  tc_query_route) with clang compilation support (17 tests)

RVCOW runtime:
- CowEngine with read/write paths, write coalescing, snapshot-freeze
- CowMap (flat-array), MembershipFilter (bitmap), CowCompactor
- 3x read performance via pread optimization (1.3us/vector)
- Branch creation: 2.6ms for 10K vectors, child = 162 bytes

Security: 20-finding audit, 7 fixes applied including division-by-zero
guards, integer overflow checks, and KernelBinding::from_bytes_validated().

CLI: 8 new commands (launch, embed-kernel, embed-ebpf, filter, freeze,
verify-witness, verify-attestation, rebuild-refcounts), serve wired to
real rvf-server.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): update README, add crate/npm READMEs, publish to crates.io and npm

- Rewrite README with cognitive container terminology, grouped features,
  4 comparison tables (vs Docker, Vector DBs, Git LFS, SQLite), updated
  benchmarks, architecture diagram, and 45 examples
- Add READMEs for rvf-kernel, rvf-launch, rvf-ebpf, rvf-import crates
- Add READMEs for @ruvector/rvf, rvf-node, rvf-wasm, rvf-mcp-server npm packages
- Fix Cargo.toml metadata (homepage, readme, categories, keywords) and
  add version specs to all path dependencies for crates.io publishing
- Fix clippy warnings in rvf-kernel/initramfs.rs and rvf-launch/lib.rs
- Published to crates.io: rvf-types, rvf-wire, rvf-manifest, rvf-quant,
  rvf-index, rvf-crypto (remaining crates pending rate limit)
- Published to npm: @ruvector/rvf, @ruvector/rvf-node, @ruvector/rvf-wasm,
  @ruvector/rvf-mcp-server

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: add rvf-kernel, rvf-ebpf, rvf-launch, rvf-server, rvf-import, rvf-cli to workspace

Include all 15 RVF crates plus integration tests and benchmarks in the
root workspace members list so cargo publish can resolve them by name.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(rvf): add published packages, cognitive container branding, grouped capabilities

- Add Published Packages section with 13 crates.io + 4 npm tables
- Add Platform Support table (Linux, macOS, Windows, WASM, no_std)
- Expand capability table from 9 to 15 rows in 4 groups
- Rewrite all "How" descriptions in plain language
- Update .rvf diagram to show all 20 segment types
- Rename ADRs: computational container -> cognitive container
- Add emojis to all section headers

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: update root README with RVF cognitive containers, expanded capabilities

- Update intro: "gets smarter + ships as cognitive container"
- Add self-booting microservice row to Pinecone comparison table
- Expand capabilities from 34 to 42 features with dedicated RVF section
- Update "Think of it as" to include Docker comparison and RVF explanation
- Add RVF collapsed group to Ecosystem (13 crates, 4 npm, install commands)
- Add RVF to Platform & Edge section with install commands
- Add RVF npm packages (4) and Rust crates (13) to package reference
- Add RVF rows to feature comparison table (6 new rows)
- Add ADR-030/031 to ADR list
- Add RVF to Installation table, Project Structure
- Update attention mechanisms count from 39 to 40+
- Update npm count to 49+, Rust crates to 83
- Update footer with crates.io and RVF links

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: expand comparison table with emojis, cost, audit, branching, single-file

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs: rewrite comparison table in plain language

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: clean up empty code change sections in the changes log

---------

Co-authored-by: Claude <noreply@anthropic.com>

2026-02-14 13:14:49 -05:00

12 KiB

Raw Permalink Blame History

RVF Ultra-Fast Query Path

1. CPU Shape Optimization

The block layout determines performance at the hardware level. RVF is designed to match the shape of modern CPUs: wide SIMD, deep caches, hardware prefetch.

Four Optimizations

Strict 64-byte alignment for all numeric arrays
Columnar + interleaved hybrid for compression and speed
Prefetch hints for cache-friendly graph traversal
Dictionary-coded IDs for fast random access

2. Strict Alignment

Every numeric array in RVF starts at a 64-byte aligned offset. This matches:

Target	Register Width	Alignment
AVX-512	512 bits = 64 bytes	64 B
AVX2	256 bits = 32 bytes	64 B (superset)
ARM NEON	128 bits = 16 bytes	64 B (superset)
WASM v128	128 bits = 16 bytes	64 B (superset)
Cache line	Typically 64 bytes	64 B (exact)

By aligning to 64 bytes, RVF ensures:

Zero-copy load into any SIMD register (no unaligned penalty)
No cache-line splits (each access touches exactly one cache line)
Optimal hardware prefetch behavior (prefetcher operates on cache lines)

Alignment in Practice

Segment header:           64 B (naturally aligned, first item in segment)
Block header:             Padded to 64 B boundary
Vector data start:        64 B aligned from block start
Each dimension column:    64 B aligned (columnar VEC_SEG)
Each vector entry:        64 B aligned (interleaved HOT_SEG)
ID map:                   64 B aligned
Restart point index:      64 B aligned

Padding bytes between sections are zero-filled and excluded from checksums.

3. Columnar + Interleaved Hybrid

Columnar Storage (VEC_SEG) — Optimized for Compression

Block layout (1024 vectors, 384 dimensions, fp16):

Offset 0x000:   dim_0[vec_0], dim_0[vec_1], ..., dim_0[vec_1023]   (2048 B)
Offset 0x800:   dim_1[vec_0], dim_1[vec_1], ..., dim_1[vec_1023]   (2048 B)
...
Offset 0xBF800: dim_383[vec_0], ..., dim_383[vec_1023]              (2048 B)

Total: 384 * 2048 = 786,432 bytes (768 KB per block)

Why columnar for cold/warm storage:

Adjacent values in the same dimension are correlated -> higher compression ratio
LZ4 on columnar fp16 achieves 1.5-2.5x compression (vs 1.1-1.3x on interleaved)
ZSTD on columnar fp16 achieves 2.5-4x compression
Batch operations (computing mean, variance) scan one dimension at a time

Interleaved Storage (HOT_SEG) — Optimized for Speed

Entry layout (one hot vector, 384 dim fp16):

Offset 0x000:   vector_id (8 B)
Offset 0x008:   dim_0, dim_1, dim_2, ..., dim_383  (768 B)
Offset 0x308:   neighbor_count (2 B)
Offset 0x30A:   neighbor_0, neighbor_1, ... (8 B each)
Offset 0x38A:   padding to 64B boundary
                --> 960 bytes per entry (at M=16 neighbors)

Why interleaved for hot data:

One vector = one sequential read (no column gathering)
Distance computation: load vector, compute, move to next (streaming pattern)
Neighbors co-located: after finding a good candidate, immediately traverse
960 bytes per entry = 15 cache lines = predictable memory access

When to Use Each

Operation	Layout	Reason
Bulk distance computation	Columnar	SIMD operates on dimension columns
Top-K refinement scan	Interleaved	Sequential scan of candidates
Compression/archival	Columnar	Better ratio
HNSW search (hot region)	Interleaved	Vector + neighbors together
Batch insert	Columnar	Write once, compress well

4. Prefetch Hints

The Problem

HNSW search is pointer-chasing: compute distance at node A, read neighbor list, jump to node B, compute distance, repeat. Each jump is a random memory access. On a 10M vector file, this means:

HNSW search: ~100-200 distance computations per query
Each computation: 1 random read (vector) + 1 random read (neighbors)
Random read latency: 50-100 ns (DRAM), 10-50 μs (SSD)
Total: 10-40 μs (DRAM), 1-10 ms (SSD) without prefetch

The Solution

Store neighbor lists contiguously and add prefetch offsets in the manifest so the runtime can issue prefetch instructions ahead of time.

Prefetch Table Structure

The manifest contains a prefetch table mapping node ID ranges to contiguous page regions:

prefetch_table:
  entry_count: u32
  entries:
    [0]: node_ids 0-9999      -> pages at offset 0x100000, 50 pages, prefetch 3 ahead
    [1]: node_ids 10000-19999  -> pages at offset 0x200000, 50 pages, prefetch 3 ahead
    ...

Runtime Prefetch Strategy

def hnsw_search_with_prefetch(query, entry_point, ef_search):
    candidates = MaxHeap()
    visited = BitSet()
    worklist = MinHeap([(distance(query, entry_point), entry_point)])

    while worklist:
        dist, node = worklist.pop()

        # PREFETCH: while processing this node, prefetch neighbors' data
        neighbors = get_neighbors(node)
        for n in neighbors[:PREFETCH_AHEAD]:
            if n not in visited:
                prefetch_vector(n)      # madvise(WILLNEED) or __builtin_prefetch
                prefetch_neighbors(n)   # prefetch neighbor list page

        # COMPUTE: distance to neighbors (data should be in cache by now)
        for n in neighbors:
            if n not in visited:
                visited.add(n)
                d = distance(query, get_vector(n))
                if d < candidates.max() or len(candidates) < ef_search:
                    candidates.push((d, n))
                    worklist.push((d, n))

    return candidates.top_k(k)

Contiguous Neighbor Layout

HOT_SEG stores vectors and neighbors together. For cold INDEX_SEGs, neighbor lists are laid out in node ID order within contiguous pages:

Page 0:  neighbors[node_0], neighbors[node_1], ..., neighbors[node_63]
Page 1:  neighbors[node_64], ..., neighbors[node_127]
...

Because HNSW search tends to traverse nodes in the same graph neighborhood (spatially close node IDs if data was inserted in order), sequential node IDs tend to be accessed together. Contiguous layout turns random access into sequential reads.

Expected Improvement

Configuration	p95 Latency (10M vectors)
No prefetch, random layout	2.5 ms
No prefetch, contiguous layout	1.2 ms
Prefetch, contiguous layout	0.3 ms
Prefetch, contiguous + hot cache	0.15 ms

5. Dictionary-Coded IDs

The Problem

Vector IDs in neighbor lists and ID maps are 64-bit integers. For 10M vectors, most IDs fit in 24 bits. Storing full 64-bit IDs wastes ~5 bytes per entry.

With M=16 neighbors per node and 10M nodes:

Raw: 10M * 16 * 8 = 1.2 GB of ID data
Desired: < 300 MB

Varint Delta Encoding

IDs within a block or neighbor list are sorted and delta-encoded:

Original IDs:    [1000, 1005, 1008, 1020, 1100]
Deltas:          [1000,    5,    3,   12,   80]
Varint bytes:    [  2B,  1B,  1B,   1B,   1B]  = 6 bytes (vs 40 bytes raw)

Restart Points

Every N entries (default N=64), the delta resets to an absolute value:

Group 0 (entries 0-63):    delta from 0 (absolute start)
Group 1 (entries 64-127):  delta from entry[64] (restart)
Group 2 (entries 128-191): delta from entry[128] (restart)

The restart point index stores the offset of each restart group:

restart_index:
  interval: 64
  offsets: [0, 156, 298, 445, ...]  // byte offsets into encoded data

Random Access

To find the neighbors of node N:

1. group = N / restart_interval            // O(1)
2. offset = restart_index[group]           // O(1)
3. seek to offset in encoded data          // O(1)
4. decode sequentially from restart to N   // O(restart_interval) = O(64)

Total: O(64) varint decodes = ~50-100 ns. Compare with sorted array binary search: O(log N) = O(24) comparisons with cache misses = ~200-500 ns.

SIMD Varint Decoding

Modern SIMD can decode varints in bulk:

AVX-512 VBMI: ~8 varints per cycle using VPERMB + VPSHUFB
Throughput: 2-4 billion integers/second (Lemire et al.)

At 16 neighbors per node, one HNSW search step decodes 16 varints in ~2-4 ns.

Compression Ratio

Encoding	Bytes per ID (avg)	10M * 16 neighbors
Raw u64	8.0 B	1,220 MB
Raw u32	4.0 B	610 MB
Varint (no delta)	3.2 B	488 MB
Varint delta	1.5 B	229 MB
Varint delta + restart	1.6 B	244 MB

Delta encoding with restart points achieves ~5x compression over raw u64 while maintaining fast random access.

6. Cache Behavior Analysis

L1/L2/L3 Working Sets

For a typical query on 10M vectors (384 dim, fp16):

HNSW search:
  ~150 distance computations
  Each computation: 768 B (vector) + ~128 B (neighbor list) ≈ 896 B
  Total working set: 150 * 896 ≈ 131 KB

Top-K refinement (hot cache scan):
  ~1000 candidates checked
  Each: 960 B (interleaved HOT_SEG entry)
  Total: 960 KB

Query vector: 768 B (always in L1)
Quantization tables: 96 KB (PQ codebook, always in L2)

Cache Level	Size	What Fits
L1 (32-48 KB)	Query vector + current node	Always hit
L2 (256 KB-1 MB)	PQ tables + 100-200 hot entries	Usually hit
L3 (8-32 MB)	Hot cache + partial index	Mostly hit
DRAM	Everything	Full dataset

p95 Latency Budget

HNSW traversal:    150 nodes * 100 ns/node = 15 μs (L3 hit)
Distance compute:  150 * 50 ns = 7.5 μs (SIMD)
Top-K refinement:  1000 * 10 ns = 10 μs (hot cache, L2/L3 hit)
Overhead:          5 μs (heap ops, bookkeeping)
                   -------
Total p95:         ~37.5 μs ≈ 0.04 ms

With prefetch:     ~30 μs (hide 25% of traversal latency)

This matches the target of < 0.3 ms p95 on desktop hardware. The dominant cost is memory bandwidth, not computation — which is why cache-friendly layout and prefetch are critical.

7. Distance Function SIMD Implementations

L2 Distance (fp16, 384 dim, AVX-512)

; 384 fp16 values = 768 bytes = 12 ZMM registers
; Process 32 fp16 values per iteration (convert to 16 fp32 per half)

.loop:
    vmovdqu16   zmm0, [rsi + rcx]      ; Load 32 fp16 from A
    vmovdqu16   zmm1, [rdi + rcx]      ; Load 32 fp16 from B
    vcvtph2ps   zmm2, ymm0             ; Convert low 16 to fp32
    vcvtph2ps   zmm3, ymm1
    vsubps      zmm2, zmm2, zmm3       ; diff = A - B
    vfmadd231ps zmm4, zmm2, zmm2       ; acc += diff * diff
    ; Repeat for high 16
    vextracti64x4 ymm0, zmm0, 1
    vextracti64x4 ymm1, zmm1, 1
    vcvtph2ps   zmm2, ymm0
    vcvtph2ps   zmm3, ymm1
    vsubps      zmm2, zmm2, zmm3
    vfmadd231ps zmm4, zmm2, zmm2
    add         rcx, 64
    cmp         rcx, 768
    jl          .loop

; Horizontal sum of zmm4 -> scalar result
; ~12 iterations, ~24 FMA ops, ~12 cycles total

Inner Product (int8, 384 dim, AVX-512 VNNI)

; 384 int8 values = 384 bytes = 6 ZMM registers
; VPDPBUSD: 64 uint8*int8 multiply-adds per cycle

.loop:
    vmovdqu8    zmm0, [rsi + rcx]      ; 64 uint8 from A
    vmovdqu8    zmm1, [rdi + rcx]      ; 64 int8 from B
    vpdpbusd    zmm2, zmm0, zmm1       ; acc += dot(A, B) per 4 bytes
    add         rcx, 64
    cmp         rcx, 384
    jl          .loop

; 6 iterations, 6 VPDPBUSD ops, ~6 cycles
; ~16x faster than fp16 L2

Hamming Distance (binary, 384 dim, AVX-512)

; 384 bits = 48 bytes = 1 partial ZMM load
; VPOPCNTDQ: popcount on 8 x 64-bit words per cycle

    vmovdqu8    zmm0, [rsi]            ; Load 48 bytes (384 bits) from A
    vmovdqu8    zmm1, [rdi]            ; Load 48 bytes from B
    vpxorq      zmm2, zmm0, zmm1       ; XOR -> differing bits
    vpopcntq    zmm3, zmm2             ; Popcount per 64-bit word
    ; Horizontal sum of 6 popcounts -> Hamming distance
    ; ~3 cycles total

8. Summary: Query Path Hot Loop

The complete hot path for one HNSW search step:

1. Load current node's neighbor list       [L2/L3 cache, 128 B, ~5 ns]
2. Issue prefetch for next neighbors       [~1 ns]
3. For each neighbor (M=16):
   a. Check visited bitmap                 [L1, ~1 ns]
   b. Load neighbor vector (hot cache)     [L2/L3, 768 B, ~5-10 ns]
   c. SIMD distance (fp16, 384 dim)        [~12 cycles = ~4 ns]
   d. Heap insert if better                [~5 ns]
4. Total per step: ~300-500 ns
5. Total per query (~150 steps): ~50-75 μs

This achieves 13,000-20,000 QPS per thread on desktop hardware — matching or exceeding dedicated vector databases for in-memory workloads.

12 KiB Raw Permalink Blame History

RVF Ultra-Fast Query Path

1. CPU Shape Optimization

Four Optimizations

2. Strict Alignment

Alignment in Practice

3. Columnar + Interleaved Hybrid

Columnar Storage (VEC_SEG) — Optimized for Compression

Interleaved Storage (HOT_SEG) — Optimized for Speed

When to Use Each

4. Prefetch Hints

The Problem

The Solution

Prefetch Table Structure

Runtime Prefetch Strategy

Contiguous Neighbor Layout

Expected Improvement

5. Dictionary-Coded IDs

The Problem

Varint Delta Encoding

Restart Points

Random Access

SIMD Varint Decoding

Compression Ratio

6. Cache Behavior Analysis

L1/L2/L3 Working Sets

p95 Latency Budget

7. Distance Function SIMD Implementations

L2 Distance (fp16, 384 dim, AVX-512)

Inner Product (int8, 384 dim, AVX-512 VNNI)

Hamming Distance (binary, 384 dim, AVX-512)

8. Summary: Query Path Hot Loop

12 KiB

Raw Permalink Blame History