mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-27 08:45:07 +00:00

ruvnet 3a1afa2284 feat(rulake): vector-native federation intermediary — ADR-155 + MVP crate

Implements the M1 scope of docs/research/ruLake/ as an intermediary that
fans out vector queries across heterogeneous backends (Parquet, BigQuery,
Snowflake, Delta, Iceberg, local) behind a single RVF wire protocol, with
a RaBitQ-compressed cache in front.

## What ships

- **Research docs** under docs/research/ruLake/ (9 files, ~2.5k lines),
  reframed from the earlier "plug RVF into BigQuery" shape to the
  intermediary/federation shape. BigQuery-native compute becomes a Tier-2
  push-down optimization inside the BigQueryBackend adapter, not a new
  product shape.
- **ADR-155 v2** as "Proposed" — captures the seven alternatives
  considered (plug-in-per-lake, standalone vector DB, Iceberg extension,
  Trino connector, JVM intermediary, notebook-only, push-through-only),
  consequences, and eight open questions.
- **crates/ruvector-rulake/** — new workspace member:
  - `BackendAdapter` trait with minimum surface (id / list_collections /
    pull_vectors / generation / supports_pushdown).
  - `LocalBackend` in-memory reference implementation (thread-safe).
  - `VectorCache` wrapping ruvector_rabitq::RabitqPlusIndex, with per-
    collection generation tracking and `Consistency::{Fresh, Eventual}`
    policies.
  - `RuLake` entry point: register backends, search single or federated,
    cache-stats introspection.
  - 7 smoke tests (`tests/federation_smoke.rs`): byte-exact match vs
    direct RaBitQ, cache-coherence after backend mutation, cross-backend
    fan-out with correct score ordering, cache-hit-faster-than-miss,
    three error-path tests.
  - `rulake-demo` bin: unified benchmark producing the same-run table in
    BENCHMARK.md.
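
The adapter surface and the generation-based cache policy listed above can be sketched as follows. This is a minimal illustration, not the crate's code: only the method names (`id`, `list_collections`, `pull_vectors`, `generation`, `supports_pushdown`) and the `Consistency::{Fresh, Eventual}` variants come from the commit message. Every signature, the `ToyBackend` type, the `needs_refresh` helper, and the reading of `Eventual` as "serve the cached entry without re-checking" are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical adapter surface. Method names follow the commit message;
/// all signatures and return types are assumptions.
pub trait BackendAdapter {
    fn id(&self) -> &str;
    fn list_collections(&self) -> Vec<String>;
    /// (vector id, vector) pairs, or None for an unknown collection.
    fn pull_vectors(&self, collection: &str) -> Option<Vec<(u64, Vec<f32>)>>;
    /// Counter the backend bumps on every mutation of the collection.
    fn generation(&self, collection: &str) -> Option<u64>;
    fn supports_pushdown(&self) -> bool {
        false
    }
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Consistency {
    Fresh,
    Eventual,
}

/// Toy in-memory backend for the example (not the crate's LocalBackend).
#[derive(Default)]
pub struct ToyBackend {
    // collection name -> (generation, vectors)
    collections: HashMap<String, (u64, Vec<(u64, Vec<f32>)>)>,
}

impl ToyBackend {
    pub fn put(&mut self, collection: &str, id: u64, v: Vec<f32>) {
        let entry = self
            .collections
            .entry(collection.to_string())
            .or_insert((0, Vec::new()));
        entry.0 += 1; // every write bumps the generation
        entry.1.push((id, v));
    }
}

impl BackendAdapter for ToyBackend {
    fn id(&self) -> &str {
        "toy"
    }
    fn list_collections(&self) -> Vec<String> {
        self.collections.keys().cloned().collect()
    }
    fn pull_vectors(&self, c: &str) -> Option<Vec<(u64, Vec<f32>)>> {
        self.collections.get(c).map(|(_, v)| v.clone())
    }
    fn generation(&self, c: &str) -> Option<u64> {
        self.collections.get(c).map(|(g, _)| *g)
    }
}

/// Assumed policy: Fresh rebuilds when generations differ, Eventual
/// serves whatever is cached, and a cold cache always pulls.
pub fn needs_refresh(cached_gen: Option<u64>, backend_gen: u64, policy: Consistency) -> bool {
    match (cached_gen, policy) {
        (None, _) => true,
        (Some(_), Consistency::Eventual) => false,
        (Some(g), Consistency::Fresh) => g != backend_gen,
    }
}

fn main() {
    let mut backend = ToyBackend::default();
    backend.put("docs", 1, vec![0.0, 1.0]);
    let cached = backend.generation("docs"); // cache primed here
    backend.put("docs", 2, vec![1.0, 0.0]); // backend mutates afterwards
    let now = backend.generation("docs").unwrap();
    println!("Fresh refresh?    {}", needs_refresh(cached, now, Consistency::Fresh));
    println!("Eventual refresh? {}", needs_refresh(cached, now, Consistency::Eventual));
}
```

Comparing a single generation counter keeps the coherence check cheap relative to re-pulling vectors, which is one plausible reason a `generation` method appears in the minimum surface.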

## Measured numbers (LocalBackend, D=128, rerank×20, 300 queries)

| n       | direct RaBitQ+ QPS | ruLake Fresh QPS | ruLake Eventual QPS | tax   |
|--------:|-------------------:|-----------------:|--------------------:|------:|
|   5,000 |             17,311 |           17,874 |              17,858 | 0.97× |
|  50,000 |              5,162 |            5,123 |               5,050 | 1.01× |
| 100,000 |              3,122 |            3,117 |               3,114 | 1.00× |

**Intermediary tax is effectively zero on a local backend** (tax =
direct RaBitQ+ QPS ÷ ruLake Fresh QPS). Federated across 2 shards:
2,470 QPS @ n=100k (0.79× of single-shard); 4 shards: 1,781 QPS (0.57×).
Fan-out is sequential in this commit; parallel merge is the v2
optimisation per ADR-155 §Consequences.
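
To make the sequential fan-out concrete, here is a rough sketch of a federated top-k search that visits shards one after another and merges their hits by distance. It is not the crate's implementation: the `Hit` type, `search_shard`, `federated_search`, and the brute-force squared-L2 scan standing in for the RaBitQ index are all hypothetical names and simplifications chosen for the example.

```rust
/// One search hit; a smaller distance is a better match. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub id: u64,
    pub distance: f32,
}

/// Brute-force top-k over one shard, standing in for a per-backend search.
fn search_shard(shard: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<Hit> {
    let mut hits: Vec<Hit> = shard
        .iter()
        .map(|(id, v)| Hit {
            id: *id,
            // squared L2 distance between the stored vector and the query
            distance: v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum(),
        })
        .collect();
    hits.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    hits.truncate(k);
    hits
}

/// Sequential fan-out: query each shard in turn, then merge to a global top-k.
pub fn federated_search(shards: &[Vec<(u64, Vec<f32>)>], query: &[f32], k: usize) -> Vec<Hit> {
    let mut all = Vec::new();
    for shard in shards {
        // Each shard is searched only after the previous one has returned.
        all.extend(search_shard(shard, query, k));
    }
    all.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    all.truncate(k);
    all
}

fn main() {
    let shards = vec![
        vec![(1, vec![0.0, 0.0]), (2, vec![5.0, 5.0])],
        vec![(3, vec![1.0, 0.0]), (4, vec![0.0, 3.0])],
    ];
    // Ids 1 and 3 are the two vectors nearest to the origin.
    for hit in federated_search(&shards, &[0.0, 0.0], 2) {
        println!("id={} distance={}", hit.id, hit.distance);
    }
}
```

In a loop like this, per-shard overhead is paid once per shard on every query rather than overlapped, which is what a parallel fan-out would address.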

## Build + test status (this crate only)

```
cargo build  -p ruvector-rulake --release                            ✓
cargo test   -p ruvector-rulake --release                            ✓ 7 passed
cargo clippy -p ruvector-rulake --release --all-targets -- -D warnings   ✓ clean
cargo fmt    -p ruvector-rulake -- --check                           ✓ clean
cargo run    -p ruvector-rulake --release --bin rulake-demo          ✓ reproduces BENCHMARK.md
```

## Scope this commit does NOT cover (M2-M5, see 07-implementation-plan.md)

- ParquetBackend, BigQueryBackend, SnowflakeBackend, IcebergBackend,
  DeltaBackend (real-backend adapters).
- Push-down paths into backends with native vector ops.
- Governance / RBAC / PII / lineage / audit (M4).
- SIFT1M recall measurement on the real-backend path.
- Parallel fan-out via rayon.
- LRU cache eviction.

Co-Authored-By: claude-flow <ruv@ruv.net>

2026-04-23 18:38:49 -04:00

8.4 KiB


06 — Positioning

The positioning rule for this spike: every time we are tempted to write "ruLake is …," we write "ruLake is NOT …" first. Most of the ways this kind of project fails are in the gap between what the engineering team thinks they are shipping and what the GTM team sells.

This document is the hype-avoidance rubric for the spike. It ships alongside the v1 release.

What ruLake IS

A read-optimised, vector-native format + catalog + kernel adapter layer that sits between object storage and the SQL engine the enterprise already operates. It ships as:

- A Parquet / Iceberg extension for a vector column.
- A per-engine UDF (BigQuery remote function in v1; DuckDB extension in v1; others deferred).
- A catalog adapter emitting lineage into Dataplex / Unity / Polaris.
- The existing 22-crate RVF workspace, unchanged.

That is the whole product.

What ruLake IS NOT

Not a vector database

ruLake is not a competitor to Pinecone, Weaviate, Milvus, Qdrant, or LanceDB. Those products are systems of record — they run their own cluster, manage their own storage, expose their own API. ruLake runs no cluster of its own and exposes no API of its own beyond the UDF that the host warehouse calls.

If a prospect asks "does ruLake replace Pinecone?", the answer is "only if you were using Pinecone because you wanted a vector column inside your datalake — in which case, yes. If you were using Pinecone because you wanted a managed serving tier with sub-ms p99, ruLake does not displace it."

Not a replacement for BigQuery / Snowflake / Databricks

The host warehouse remains the query planner, the RBAC boundary, the billing boundary, the audit root, and the UI. ruLake adds a UDF and a lineage edge. That is all.

If a prospect asks "should we rip out BigQuery?", the answer is "no, we plug in."

Not a storage system

ruLake has no storage service. Bytes live in S3 / GCS / Azure Blob with the customer's existing bucket-level governance. We do not run replicated storage, we do not manage durability, we do not charge for storage. If the bucket burns, ruLake burns.

Not a new table format

ruLake rides on existing table formats — Iceberg v2 primarily, Delta via Iceberg interop. ruLake adds a convention (a .rvf sidecar referenced by table properties), not a new open-format standard. Talking about "the ruLake table format" is wrong; it is "Iceberg + ruLake sidecars."

Not an embedding / model / featurization service

ruLake does not produce embeddings. Customers bring their own model (OpenAI, Cohere, Vertex AI, open-source, internal). ruLake stores and serves vectors; it does not compute them.

Not a real-time streaming system

ruLake's ingest path is batch-shaped. Append-only segments + daily compaction is OLAP ergonomics. If a customer needs sub-second ingest-to-query, we point them at rvf-server's TCP streaming mode (which is real-time) but that is not the ruLake product.

Not quantum-safe out of the box

rvf-crypto supports ML-DSA-65 and SLH-DSA-128s. ruLake optionally enables them for bundle signing. Per-row encryption at rest with customer-managed PQ keys is a v2 line item. Today's ruLake bundle is signed with PQ signatures but encrypted at rest with GCS-managed (classical) keys. This is fine for 2026 but will need revisiting.

Not a sub-millisecond query serving tier

See 05-performance-budget.md. The BQ Tier-1 path is dominated by HTTPS round-trip, ~30–80 ms warm. For sub-ms use cases, customers embed ruLake (DuckDB extension, WASM tile, or direct rvf-runtime use) — but that is not the BQ story and must be documented separately.

Not a data mesh

We track lineage. We do not build a mesh. Integrations with Starburst's data products, Atlan's metadata catalog, and Soda's data contracts are all out of scope for v1.

Not an MLOps platform

Model registry, feature store, experiment tracking, training pipelines — all out of scope. ruLake plugs into whichever one the customer runs. rvf-federation carries federated-learning primitives today; their exposure as a product is a separate spike.

Not production-ready as of v1 completion

v1 of this spike produces: a working BQ integration on a single region with a single-instance UDF, DuckDB extension, Iceberg manifests, and the governance story. It does not produce: multi-region replication, active-active, HA, at-scale SRE runbooks, or a support organisation. Those are post-spike.

Hype-Avoidance Rubric

If any of these sentences shows up in a talk, a landing page, or a sales deck, flag it:

| Suspect claim | Why it is suspect / what to say instead |
|---|---|
| "Fastest vector database." | We are not a database. Say "fastest embedded vector kernel we have measured on a laptop, 957 QPS at 100% recall@10 at n=100k (see BENCHMARK.md)." |
| "Billion-scale vector search." | We have measured to n=100k. Billion-scale is a 2026-H2 acceptance target. Say "designed for billion-scale, measured at n=100k, SIFT1M benchmark is a tracked follow-up." |
| "Built-in GDPR compliance." | We provide the orchestration. Compliance is the customer's, and legal's, call. Say "GDPR orchestration primitives with a documented two-phase delete SLA." |
| "Zero-ops vector search inside BigQuery." | The Cloud Run UDF is ops. Say "vector search inside BigQuery with one Cloud Run service per region." |
| "Quantum-resistant by default." | Only the signatures are PQ; encryption is classical GCS. Say "post-quantum signatures (ML-DSA-65); classical encryption at rest in v1." |
| "Provably correct query results." | Witness chain proves read integrity, not correctness. Say "witness-chain-backed audit trail for every query." |
| "AI-native data lake." | Say literally anything else. |
| "Eliminates your vector database." | See "Not a vector database" above. Say "alternative to standing up a separate vector database when your requirements are datalake-shaped." |
| "100% recall." | Only on our clustered Gaussian fixture. SIFT1M is unmeasured. Say "100% recall@10 at n=100k on BENCHMARK.md fixture; SIFT1M target, unmeasured." |
| "Drop-in replacement for Pinecone." | See above. Say "complementary to Pinecone for OLAP-shaped vector workloads inside the datalake." |

When in doubt, the grounding test is: can an engineer reproduce the claim from a file in the repo in under 30 minutes? If no, rewrite.

Three Customer Shapes Where ruLake Wins

These are the shapes of prospect where the spike actually produces value. If the prospect does not look like one of these, step away.

1. "We want vector search but our security team said no new systems." The warehouse is BQ or Snowflake, governance lives in Dataplex / Unity, and standing up a Pinecone cluster requires a 6-month security review. ruLake is a Cloud Run service and a remote function, a much smaller attack surface.
2. "We need the same vector index readable from BQ and from laptops." The data-science team runs notebooks with DuckDB against a GCS bucket; the production query path is BQ. ruLake's bundle-plus-UDF shape is the only design that lets both paths read one and the same bundle.
3. "We have to prove to an auditor that vector X was retrieved by job Y on date Z." Witness chains + lineage edges produce cryptographic provenance the auditor can replay offline. BQ's audit logs alone do not do this.

Three Customer Shapes Where ruLake Loses

State these out loud. Walking away early is cheaper.

1. "We need sub-millisecond p99." The BQ path is fundamentally HTTPS-shaped. Point them at embedded ruLake (DuckDB, WASM) or a dedicated vector DB.
2. "We need real-time feature store ingest at 100k rows/s." Append-only + nightly compaction is the wrong shape. Point them at a streaming vector store.
3. "We only use BigQuery and BQ Vector Search meets our needs." Let them. ruLake's portability argument is moot here.

The One-Line Pitch

ruLake is the adapter layer that makes a .rvf vector bundle read like a regular column inside BigQuery, DuckDB, and Iceberg-aware engines — so you do not have to stand up a second system of record to do vector search.

If the one-line pitch starts to grow, it is drifting.
