Implements the M1 scope of docs/research/ruLake/ as an intermediary that
fans out vector queries across heterogeneous backends (Parquet, BigQuery,
Snowflake, Delta, Iceberg, local) behind a single RVF wire protocol, with
a RaBitQ-compressed cache in front.
## What ships
- **Research docs** under docs/research/ruLake/ (9 files, ~2.5k lines),
reframed from the earlier "plug RVF into BigQuery" shape to the
intermediary/federation shape. BigQuery-native compute becomes a Tier-2
push-down optimization inside the BigQueryBackend adapter, not a new
product shape.
- **ADR-155 v2** as "Proposed" — captures the seven alternatives
considered (plug-in-per-lake, standalone vector DB, Iceberg extension,
Trino connector, JVM intermediary, notebook-only, push-through-only),
consequences, and eight open questions.
- **crates/ruvector-rulake/**, a new workspace member:
  - `BackendAdapter` trait with minimum surface (`id` / `list_collections` /
    `pull_vectors` / `generation` / `supports_pushdown`).
  - `LocalBackend` in-memory reference implementation (thread-safe).
  - `VectorCache` wrapping `ruvector_rabitq::RabitqPlusIndex`, with
    per-collection generation tracking and `Consistency::{Fresh, Eventual}`
    policies.
  - `RuLake` entry point: register backends, search single or federated,
    cache-stats introspection.
  - 7 smoke tests (`tests/federation_smoke.rs`): byte-exact match vs
    direct RaBitQ, cache-coherence after backend mutation, cross-backend
    fan-out with correct score ordering, cache-hit-faster-than-miss,
    three error-path tests.
  - `rulake-demo` bin: unified benchmark producing the same-run table in
    BENCHMARK.md.
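To make the adapter surface concrete, here is a minimal sketch of what a trait with those five methods and an in-memory reference backend could look like. Only the method names come from the list above; every signature, the `AdapterError` type, the `Rows` alias, and the `put` helper are assumptions made for illustration and may not match the crate.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical error type; the crate's real error enum may differ.
#[derive(Debug, PartialEq)]
pub enum AdapterError {
    UnknownCollection(String),
}

/// (id, vector) rows as pulled from a backend. Assumed shape.
pub type Rows = Vec<(u64, Vec<f32>)>;

/// Sketch of the minimum adapter surface. Signatures are assumptions.
pub trait BackendAdapter: Send + Sync {
    fn id(&self) -> &str;
    fn list_collections(&self) -> Vec<String>;
    fn pull_vectors(&self, collection: &str) -> Result<Rows, AdapterError>;
    /// Monotonic counter that changes whenever the collection changes.
    fn generation(&self, collection: &str) -> Result<u64, AdapterError>;
    /// Default: no native vector ops, so everything is pulled.
    fn supports_pushdown(&self) -> bool {
        false
    }
}

struct Collection {
    generation: u64,
    rows: Rows,
}

/// In-memory backend; the RwLock makes it shareable across threads.
pub struct LocalBackend {
    id: String,
    collections: RwLock<HashMap<String, Collection>>,
}

impl LocalBackend {
    pub fn new(id: &str) -> Self {
        Self { id: id.to_string(), collections: RwLock::new(HashMap::new()) }
    }

    /// Hypothetical helper: replace a collection and bump its generation.
    pub fn put(&self, name: &str, rows: Rows) {
        let mut map = self.collections.write().unwrap();
        let entry = map
            .entry(name.to_string())
            .or_insert(Collection { generation: 0, rows: Vec::new() });
        entry.generation += 1;
        entry.rows = rows;
    }
}

impl BackendAdapter for LocalBackend {
    fn id(&self) -> &str {
        &self.id
    }

    fn list_collections(&self) -> Vec<String> {
        let map = self.collections.read().unwrap();
        let mut names: Vec<String> = map.keys().cloned().collect();
        names.sort();
        names
    }

    fn pull_vectors(&self, collection: &str) -> Result<Rows, AdapterError> {
        let map = self.collections.read().unwrap();
        match map.get(collection) {
            Some(c) => Ok(c.rows.clone()),
            None => Err(AdapterError::UnknownCollection(collection.to_string())),
        }
    }

    fn generation(&self, collection: &str) -> Result<u64, AdapterError> {
        let map = self.collections.read().unwrap();
        match map.get(collection) {
            Some(c) => Ok(c.generation),
            None => Err(AdapterError::UnknownCollection(collection.to_string())),
        }
    }
}
```

In this sketch a cache would compare the value returned by `generation` with the one it stored at fill time to decide whether a refill is needed.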
## Measured numbers (LocalBackend, D=128, rerank×20, 300 queries)
| n | direct RaBitQ+ QPS | ruLake Fresh QPS | ruLake Eventual QPS | tax (direct ÷ Fresh) |
|--------:|-------------------:|-----------------:|--------------------:|---------------------:|
| 5,000 | 17,311 | 17,874 | 17,858 | 0.97× |
| 50,000 | 5,162 | 5,123 | 5,050 | 1.01× |
| 100,000 | 3,122 | 3,117 | 3,114 | 1.00× |
**Intermediary tax is effectively zero on a local backend.** Federated
across 2 shards: 2,470 QPS @ n=100k (0.79× of single-shard); 4 shards:
1,781 QPS (0.57×). The drop reflects sequential fan-out; parallel merge
is the v2 optimisation per ADR-155 §Consequences.
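The sequential fan-out behind those federated numbers can be illustrated with a small sketch: visit each shard in turn, take its local top-k, then merge by score and truncate. This is not the crate's code. The `Hit` struct, the brute-force `search_shard`, and the squared-L2 scoring are assumptions standing in for the per-shard RaBitQ search.

```rust
/// One scored hit. Assumption: lower squared-L2 distance is better.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub shard: usize,
    pub id: u64,
    pub dist: f32,
}

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force top-k inside one shard (stand-in for the real index).
fn search_shard(shard: usize, rows: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<Hit> {
    let mut hits: Vec<Hit> = rows
        .iter()
        .map(|(id, v)| Hit { shard, id: *id, dist: l2_sq(v, query) })
        .collect();
    hits.sort_by(|a, b| a.dist.total_cmp(&b.dist));
    hits.truncate(k);
    hits
}

/// Sequential fan-out: shards are searched one after another, so total
/// latency is roughly the sum of per-shard latencies plus the merge.
pub fn federated_search(shards: &[Vec<(u64, Vec<f32>)>], query: &[f32], k: usize) -> Vec<Hit> {
    let mut merged = Vec::new();
    for (i, rows) in shards.iter().enumerate() {
        merged.extend(search_shard(i, rows, query, k));
    }
    // Merge-by-score: a global sort over at most shards * k candidates.
    merged.sort_by(|a, b| a.dist.total_cmp(&b.dist));
    merged.truncate(k);
    merged
}
```

Because each shard contributes its own top-k before the merge, the global top-k is exact with respect to the per-shard scores; the cost is that the loop runs the shards back to back.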
## Build + test status (this crate only)
```
cargo build -p ruvector-rulake --release ✓
cargo test -p ruvector-rulake --release ✓ 7 passed
cargo clippy -p ruvector-rulake --release --all-targets -- -D warnings ✓ clean
cargo fmt -p ruvector-rulake -- --check ✓ clean
cargo run -p ruvector-rulake --release --bin rulake-demo ✓ reproduces BENCHMARK.md
```
## Scope this commit does NOT cover (M2-M5, see 07-implementation-plan.md)
- ParquetBackend, BigQueryBackend, SnowflakeBackend, IcebergBackend,
DeltaBackend (real-backend adapters).
- Push-down paths into backends with native vector ops.
- Governance / RBAC / PII / lineage / audit (M4).
- SIFT1M recall measurement on the real-backend path.
- Parallel fan-out via rayon.
- LRU cache eviction.
Co-Authored-By: claude-flow <ruv@ruv.net>
ruLake — Vector-Native Federation Intermediary
Status: Research spike · proposed
Date: 2026-04-23
Branch: research/rulake-datalake-analysis
Companion ADR: ADR-155
## Elevator pitch
ruLake is a vector-native federation intermediary. An application or
agent speaks the RVF wire protocol to rvf-server; rvf-server routes
each query through a planner that dispatches sub-queries to whichever
backend holds the raw vectors — BigQuery, Snowflake, Iceberg, Delta,
S3-Parquet, or a local file. A RaBitQ-compressed cache sits between the
planner and the backends so the hot working set is answered in memory at
~957 QPS at 100% recall@10 (per ruvector-rabitq/BENCHMARK.md),
while cold reads fall through to the source of truth.
The shape is Trino/Presto-for-vectors, not Pinecone-v2. ruLake does not own the storage; it owns the wire format, the compression, the cache coherence protocol, the query plan, and a single governance choke point across whichever backends are plugged in.
## 4-layer architecture
┌──────────────────────────────────────────────────────────────────┐
│ L4 Governance │
│ RBAC · column mask · lineage · GDPR 2-phase delete · audit │
│ (single choke point across all backends) │
├──────────────────────────────────────────────────────────────────┤
│ L3 Query plane │
│ rvf-server (HTTP/SSE + RVF wire) → planner → router │
│ Federated ANN: fan-out per backend, merge-by-score, rerank │
├──────────────────────────────────────────────────────────────────┤
│ L2 Cache + Index │
│ RaBitQ-compressed hot cache · HNSW graph (per collection) │
│ · deterministic rotation seed · witness-chained manifest │
├──────────────────────────────────────────────────────────────────┤
│ L1 Backend adapters │
│ ParquetBackend BigQueryBackend SnowflakeBackend … │
│ IcebergBackend DeltaBackend LocalBackend (tests) │
│ Each adapter: list / pull-vectors / optional-push-down │
└──────────────────────────────────────────────────────────────────┘
App talks to L3 via RVF wire. L3 asks L2 "is this collection cached, fresh?". Cache miss → L1 pulls vectors from the authoritative backend, L2 compresses them into RaBitQ codes, L3 answers the query. Cache hit → L2 answers directly. L4 instruments the whole path.
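The hit/miss walk above can be sketched as a small cache that tracks a generation per collection. This is an illustration only: the `Backend` and `Cache` types are hypothetical, the RaBitQ compression step is skipped (raw rows are stored), and the reading of the two policies is an assumption, namely that `Fresh` compares generations on every query while `Eventual` serves whatever is cached.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Consistency {
    Fresh,
    Eventual,
}

/// Hypothetical stand-in for an L1 backend holding one collection.
pub struct Backend {
    pub generation: u64,
    pub rows: Vec<Vec<f32>>,
    pub pulls: u32,
}

impl Backend {
    fn pull(&mut self) -> (u64, Vec<Vec<f32>>) {
        self.pulls += 1;
        (self.generation, self.rows.clone())
    }
}

struct Entry {
    generation: u64,
    rows: Vec<Vec<f32>>,
}

/// Hypothetical L2 cache with hit/miss counters.
#[derive(Default)]
pub struct Cache {
    entries: HashMap<String, Entry>,
    pub hits: u64,
    pub misses: u64,
}

impl Cache {
    /// Return the rows to search, refilling from the backend when needed.
    pub fn resolve(&mut self, name: &str, backend: &mut Backend, mode: Consistency) -> &[Vec<f32>] {
        let usable = match self.entries.get(name) {
            None => false,
            // Assumed semantics: Eventual skips the generation check.
            Some(e) => mode == Consistency::Eventual || e.generation == backend.generation,
        };
        if usable {
            self.hits += 1;
        } else {
            self.misses += 1;
            let (generation, rows) = backend.pull();
            // The real path would compress rows into RaBitQ codes here.
            self.entries.insert(name.to_string(), Entry { generation, rows });
        }
        &self.entries[name].rows
    }
}
```

Under these assumptions a backend mutation is invisible to `Eventual` readers until some `Fresh` query, or another refill trigger, replaces the cached entry.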
## What changed vs the first cut of this spike
The first cut framed ruLake as a plug-in: teach BigQuery to read RVF via external tables, remote functions, UDF-with-a-RaBitQ-kernel. The intermediary reframing (this version) swaps the relationship: ruLake is the front door, backends plug into it. Justifications:
- RVF is already format-native, not storage-native. `rvf-runtime`, `rvf-server`, and `rvf-federation` already assume "we speak RVF over whatever bytes you hand us".
- RaBitQ rotation + 1-bit codes are backend-agnostic: compress once, serve from any backend.
- Governance (RBAC, PII, lineage) is a single choke point instead of N parallel integrations.
- The BigQuery-native integration becomes a Tier-2 push-down optimization inside the `BigQueryBackend` adapter, not a new product shape.
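As a rough illustration of why seeded 1-bit codes do not depend on where the vectors came from, the sketch below packs sign bits after a seeded per-dimension sign flip. It is deliberately not RaBitQ: the sign flip is a simplified stand-in for the random rotation, there is no distance estimator or rerank, and the function names and the SplitMix64 generator are choices made for this example.

```rust
/// Tiny deterministic PRNG (SplitMix64) so the example needs no crates.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Encode a vector as packed sign bits after a seeded +1/-1 flip per
/// dimension. Same seed and same input always give the same code.
pub fn encode_1bit(v: &[f32], seed: u64) -> Vec<u64> {
    let mut state = seed;
    let mut code = vec![0u64; (v.len() + 63) / 64];
    for (i, x) in v.iter().enumerate() {
        let flip = (splitmix64(&mut state) & 1) == 1;
        let bit = (*x >= 0.0) != flip;
        if bit {
            code[i / 64] |= 1u64 << (i % 64);
        }
    }
    code
}

/// Hamming distance between two codes of equal length.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

The point of the sketch is only the determinism: two nodes that pull the same vector from different backends and share the seed produce identical codes, so a code computed once can be served regardless of which backend supplied the bytes.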
The cost: ruLake now owns a cache-coherence problem (backend updates
under the cache) and a latency hop for cases where the app is fine
calling a native vector API directly. Those are named in
05-performance-budget.md §"Intermediary tax".
## Contents
| File | Role |
|---|---|
| 00-master-plan.md | Goal tree, 5 milestones, 12-wk timeline, risk register |
| 01-architecture.md | The four layers in detail; interface contracts; query-path walk |
| 02-datalake-comparison.md | Per-backend adapter story: BQ, Snowflake, Databricks, Iceberg, Trino, DuckDB |
| 03-bigquery-integration.md | Tier-2 push-down: what BQ-native compute buys over pure federation |
| 04-governance-and-compliance.md | The 10 enterprise deal-breakers + what ruLake must own at the choke point |
| 05-performance-budget.md | Honest numbers (measured vs "target, unmeasured"), intermediary tax analysis |
| 06-positioning.md | What ruLake is NOT; hype rubric; 3 win / 3 lose shapes |
| 07-implementation-plan.md | Week-by-week 12-wk plan, acceptance tests per milestone, v2 deferrals |
## One-sentence answer to "what is this?"
Trino for vectors: you write one query in RVF; ruLake fans out to every backend that holds a piece of the answer, merges under uniform governance, and hands you top-k from a RaBitQ-compressed cache sitting in front.