mirror of https://github.com/ruvnet/RuVector.git synced 2026-07-10 01:38:44 +00:00


ruvnet 3a1afa2284 · 2026-04-23 18:38:49 -04:00

feat(rulake): vector-native federation intermediary - ADR-155 + MVP crate

Implements the M1 scope of docs/research/ruLake/ as an intermediary that fans out vector queries across heterogeneous backends (Parquet, BigQuery, Snowflake, Delta, Iceberg, local) behind a single RVF wire protocol, with a RaBitQ-compressed cache in front.

## What ships

- Research docs under docs/research/ruLake/ (9 files, ~2.5k lines), reframed from the earlier "plug RVF into BigQuery" shape to the intermediary/federation shape. BigQuery-native compute becomes a Tier-2 push-down optimization inside the BigQueryBackend adapter, not a new product shape.
- ADR-155 v2 as "Proposed": captures the seven alternatives considered (plug-in-per-lake, standalone vector DB, Iceberg extension, Trino connector, JVM intermediary, notebook-only, push-through-only), consequences, and eight open questions.
- crates/ruvector-rulake/ (new workspace member):
  - `BackendAdapter` trait with minimum surface (id / list_collections / pull_vectors / generation / supports_pushdown).
  - `LocalBackend` in-memory reference implementation (thread-safe).
  - `VectorCache` wrapping ruvector_rabitq::RabitqPlusIndex, with per-collection generation tracking and `Consistency::{Fresh, Eventual}` policies.
  - `RuLake` entry point: register backends, search single or federated, cache-stats introspection.
  - 7 smoke tests (`tests/federation_smoke.rs`): byte-exact match vs direct RaBitQ, cache-coherence after backend mutation, cross-backend fan-out with correct score ordering, cache-hit-faster-than-miss, three error-path tests.
  - `rulake-demo` bin: unified benchmark producing the same-run table in BENCHMARK.md.

## Measured numbers (LocalBackend, D=128, rerank×20, 300 queries)

| n | direct RaBitQ+ QPS | ruLake Fresh QPS | ruLake Eventual QPS | tax |
|--------:|-------------------:|-----------------:|--------------------:|------:|
| 5,000 | 17,311 | 17,874 | 17,858 | 0.97× |
| 50,000 | 5,162 | 5,123 | 5,050 | 1.01× |
| 100,000 | 3,122 | 3,117 | 3,114 | 1.00× |

Intermediary tax is effectively zero on a local backend. Federated across 2 shards: 2,470 QPS @ n=100k (0.79× of single-shard); 4 shards: 1,781 QPS (0.57×). Fan-out is sequential; parallel merge is the v2 optimisation per ADR-155 §Consequences.

## Build + test status (this crate only)

```
cargo build -p ruvector-rulake --release ✓
cargo test -p ruvector-rulake --release ✓ 7 passed
cargo clippy -p ruvector-rulake --release --all-targets -- -D warnings ✓ clean
cargo fmt -p ruvector-rulake -- --check ✓ clean
cargo run -p ruvector-rulake --release --bin rulake-demo ✓ reproduces BENCHMARK.md
```

## Scope this commit does NOT cover (M2-M5, see 07-implementation-plan.md)

- ParquetBackend, BigQueryBackend, SnowflakeBackend, IcebergBackend, DeltaBackend (real-backend adapters).
- Push-down paths into backends with native vector ops.
- Governance / RBAC / PII / lineage / audit (M4).
- SIFT1M recall measurement on the real-backend path.
- Parallel fan-out via rayon.
- LRU cache eviction.

Co-Authored-By: claude-flow <ruv@ruv.net>

README.md

ruLake — Vector-Native Federation Intermediary

Status: Research spike · proposed

Date: 2026-04-23

Branch: research/rulake-datalake-analysis

Companion ADR: ADR-155

Elevator pitch

ruLake is a vector-native federation intermediary. An application or agent speaks the RVF wire protocol to rvf-server; rvf-server routes each query through a planner that dispatches sub-queries to whichever backend holds the raw vectors — BigQuery, Snowflake, Iceberg, Delta, S3-Parquet, or a local file. A RaBitQ-compressed cache sits between the planner and the backends so the hot working set is answered in memory at ~957 QPS / 100 % recall@10 (per ruvector-rabitq/BENCHMARK.md), while cold reads fall through to the source of truth.

The shape is Trino/Presto-for-vectors, not Pinecone-v2. ruLake does not own the storage; it owns the wire format, the compression, the cache coherence protocol, the query plan, and a single governance choke point across whichever backends are plugged in.

4-layer architecture

┌──────────────────────────────────────────────────────────────────┐
│ L4  Governance                                                   │
│     RBAC · column mask · lineage · GDPR 2-phase delete · audit   │
│     (single choke point across all backends)                     │
├──────────────────────────────────────────────────────────────────┤
│ L3  Query plane                                                  │
│     rvf-server (HTTP/SSE + RVF wire) → planner → router          │
│     Federated ANN: fan-out per backend, merge-by-score, rerank   │
├──────────────────────────────────────────────────────────────────┤
│ L2  Cache + Index                                                │
│     RaBitQ-compressed hot cache · HNSW graph (per collection)    │
│     · deterministic rotation seed · witness-chained manifest     │
├──────────────────────────────────────────────────────────────────┤
│ L1  Backend adapters                                             │
│     ParquetBackend  BigQueryBackend  SnowflakeBackend  …         │
│     IcebergBackend  DeltaBackend     LocalBackend (tests)        │
│     Each adapter: list / pull-vectors / optional-push-down       │
└──────────────────────────────────────────────────────────────────┘
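The L1 adapter contract in the diagram (list / pull-vectors / optional push-down) could be sketched as a Rust trait along the following lines. The commit summary names the methods `id`, `list_collections`, `pull_vectors`, `generation` and `supports_pushdown`; the signatures, the `Vectors` alias, the `put` helper and the backend-wide generation counter below are assumptions made for illustration, not the crate's actual API.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// (vector id, vector) pairs. Hypothetical alias, not taken from the crate.
pub type Vectors = Vec<(u64, Vec<f32>)>;

/// Hypothetical adapter surface. Method names follow the commit summary;
/// every signature here is an assumption.
pub trait BackendAdapter: Send + Sync {
    fn id(&self) -> &str;
    fn list_collections(&self) -> Vec<String>;
    /// Returns None for an unknown collection.
    fn pull_vectors(&self, collection: &str) -> Option<Vectors>;
    /// Counter that changes whenever the backend's data changes.
    fn generation(&self) -> u64;
    /// Push-down is optional, so adapters without native vector ops keep the default.
    fn supports_pushdown(&self) -> bool {
        false
    }
}

/// In-memory adapter intended for tests.
#[derive(Default)]
pub struct LocalBackend {
    collections: RwLock<HashMap<String, Vectors>>,
    generation: AtomicU64,
}

impl LocalBackend {
    /// Append one vector and bump the generation so caches can notice the change.
    pub fn put(&self, collection: &str, id: u64, vector: Vec<f32>) {
        self.collections
            .write()
            .unwrap()
            .entry(collection.to_string())
            .or_default()
            .push((id, vector));
        self.generation.fetch_add(1, Ordering::SeqCst);
    }
}

impl BackendAdapter for LocalBackend {
    fn id(&self) -> &str {
        "local"
    }

    fn list_collections(&self) -> Vec<String> {
        let mut names: Vec<String> =
            self.collections.read().unwrap().keys().cloned().collect();
        names.sort(); // HashMap order is unspecified, so sort for stable output
        names
    }

    fn pull_vectors(&self, collection: &str) -> Option<Vectors> {
        self.collections.read().unwrap().get(collection).cloned()
    }

    fn generation(&self) -> u64 {
        self.generation.load(Ordering::SeqCst)
    }
}
```

Keeping push-down behind a default method means a pull-only adapter has to implement only the four required methods.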

The app talks to L3 over the RVF wire protocol. L3 asks L2 whether the collection is cached and fresh. On a cache miss, L1 pulls the vectors from the authoritative backend, L2 compresses them into RaBitQ codes, and L3 answers the query. On a cache hit, L2 answers directly. L4 instruments the whole path.
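The hit/miss path can be illustrated with a small generation-checked cache. This is a minimal sketch under stated assumptions: the type and method names (`VectorCache`, `nearest`, `CacheEntry`) are hypothetical, the cache stores raw vectors instead of RaBitQ codes, and search is a brute-force top-1 scan. It corresponds to a check-on-every-query policy; a relaxed policy would skip the generation comparison.

```rust
use std::collections::HashMap;

/// Cached state for one collection. A real cache would hold compressed
/// codes; raw vectors keep this sketch short (an assumption).
struct CacheEntry {
    generation: u64,
    vectors: Vec<(u64, Vec<f32>)>,
}

/// Hypothetical cache with hit/miss counters.
#[derive(Default)]
pub struct VectorCache {
    entries: HashMap<String, CacheEntry>,
    pub hits: u64,
    pub misses: u64,
}

impl VectorCache {
    /// Return the id of the vector nearest to `query`.
    /// `backend_generation` is the backend's current version; `pull` is
    /// called only when the cached copy is missing or stale.
    pub fn nearest<F>(
        &mut self,
        collection: &str,
        backend_generation: u64,
        query: &[f32],
        pull: F,
    ) -> Option<u64>
    where
        F: FnOnce() -> Vec<(u64, Vec<f32>)>,
    {
        let fresh = self
            .entries
            .get(collection)
            .map_or(false, |e| e.generation == backend_generation);

        if fresh {
            self.hits += 1;
        } else {
            // Miss or stale entry: reload from the source of truth.
            self.misses += 1;
            let entry = CacheEntry {
                generation: backend_generation,
                vectors: pull(),
            };
            self.entries.insert(collection.to_string(), entry);
        }

        let entry = self.entries.get(collection)?;
        entry
            .vectors
            .iter()
            .map(|(id, v)| (*id, squared_distance(query, v)))
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(id, _)| id)
    }
}

/// Squared Euclidean distance over the shared prefix of both slices.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

Comparing a single generation number per collection keeps the freshness check cheap relative to re-pulling vectors, at the cost of a full reload whenever anything in the collection changes.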

What changed vs the first cut of this spike

The first cut framed ruLake as a plug-in: teach BigQuery to read RVF via external tables, remote functions, UDF-with-a-RaBitQ-kernel. The intermediary reframing (this version) swaps the relationship: ruLake is the front door, backends plug into it. Justifications:

- RVF is already format-native, not storage-native. rvf-runtime, rvf-server, rvf-federation already assume "we speak RVF over whatever bytes you hand us".
- RaBitQ rotation + 1-bit codes are backend-agnostic: compress once, serve from any backend.
- Governance (RBAC, PII, lineage) is a single choke point instead of N parallel integrations.
- The BigQuery-native integration becomes a Tier-2 push-down optimization inside the BigQueryBackend adapter, not a new product shape.
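To make the "1-bit codes" point concrete, here is a simplified sign-bit quantizer. It is not RaBitQ itself: the random rotation and RaBitQ's distance estimator are omitted, and the function names `sign_code` and `hamming` are hypothetical. It only shows why such codes are independent of where the raw vectors live: the code is a pure function of the vector's values.

```rust
/// Pack the sign of each dimension into 64-bit words, one bit per dimension.
/// A set bit means the component is >= 0.
pub fn sign_code(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, x) in v.iter().enumerate() {
        if *x >= 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Number of dimensions whose signs differ between two codes of equal length.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

At D=128 a code occupies two `u64` words (16 bytes), against 512 bytes for the same vector stored as `f32`.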

The cost: ruLake now owns a cache-coherence problem (backend updates under the cache) and a latency hop for cases where the app is fine calling a native vector API directly. Those are named in 05-performance-budget.md §"Intermediary tax".

| File | Role |
|---|---|
| `00-master-plan.md` | Goal tree, 5 milestones, 12-wk timeline, risk register |
| `01-architecture.md` | The four layers in detail; interface contracts; query-path walk |
| `02-datalake-comparison.md` | Per-backend adapter story: BQ, Snowflake, Databricks, Iceberg, Trino, DuckDB |
| `03-bigquery-integration.md` | Tier-2 push-down: what BQ-native compute buys over pure federation |
| `04-governance-and-compliance.md` | The 10 enterprise deal-breakers + what ruLake must own at the choke point |
| `05-performance-budget.md` | Honest numbers (measured vs "target, unmeasured"), intermediary tax analysis |
| `06-positioning.md` | What ruLake is NOT; hype rubric; 3 win / 3 lose shapes |
| `07-implementation-plan.md` | Week-by-week 12-wk plan, acceptance tests per milestone, v2 deferrals |

One-sentence answer to "what is this?"

Trino for vectors: you write one query in RVF; ruLake fans out to every backend that holds a piece of the answer, merges under uniform governance, and hands you top-k from a RaBitQ-compressed cache sitting in front.
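The fan-out-and-merge step in that sentence can be sketched as a merge of per-backend result lists into one global top-k. The `Hit` struct and `merge_top_k` function are hypothetical names, and the sketch assumes that a higher score means a better match; for a distance metric the comparison would be reversed.

```rust
/// One search result as returned by a single backend (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub backend: String,
    pub id: u64,
    pub score: f32,
}

/// Merge per-backend result lists into a global top-k.
/// Assumes higher score = better match; ties are broken by ascending id
/// so the output order is deterministic.
pub fn merge_top_k(per_backend: Vec<Vec<Hit>>, k: usize) -> Vec<Hit> {
    let mut all: Vec<Hit> = per_backend.into_iter().flatten().collect();
    all.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.id.cmp(&b.id))
    });
    all.truncate(k);
    all
}
```

Each backend needs to return only its own top-k, since no result outside a backend's local top-k can appear in the global top-k.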