mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-30 03:53:34 +00:00

ruvnet 3a1afa2284 feat(rulake): vector-native federation intermediary — ADR-155 + MVP crate

Implements the M1 scope of docs/research/ruLake/ as an intermediary that
fans out vector queries across heterogeneous backends (Parquet, BigQuery,
Snowflake, Delta, Iceberg, local) behind a single RVF wire protocol, with
a RaBitQ-compressed cache in front.

## What ships

- **Research docs** under docs/research/ruLake/ (9 files, ~2.5k lines),
  reframed from the earlier "plug RVF into BigQuery" shape to the
  intermediary/federation shape. BigQuery-native compute becomes a Tier-2
  push-down optimization inside the BigQueryBackend adapter, not a new
  product shape.
- **ADR-155 v2** as "Proposed" — captures the seven alternatives
  considered (plug-in-per-lake, standalone vector DB, Iceberg extension,
  Trino connector, JVM intermediary, notebook-only, push-through-only),
  consequences, and eight open questions.
- **crates/ruvector-rulake/** — new workspace member:
  - `BackendAdapter` trait with minimum surface (id / list_collections /
    pull_vectors / generation / supports_pushdown).
  - `LocalBackend` in-memory reference implementation (thread-safe).
  - `VectorCache` wrapping ruvector_rabitq::RabitqPlusIndex, with per-
    collection generation tracking and `Consistency::{Fresh, Eventual}`
    policies.
  - `RuLake` entry point: register backends, search single or federated,
    cache-stats introspection.
  - 7 smoke tests (`tests/federation_smoke.rs`): byte-exact match vs
    direct RaBitQ, cache-coherence after backend mutation, cross-backend
    fan-out with correct score ordering, cache-hit-faster-than-miss,
    three error-path tests.
  - `rulake-demo` bin: unified benchmark producing the same-run table in
    BENCHMARK.md.

## Measured numbers (LocalBackend, D=128, rerank×20, 300 queries)

| n       | direct RaBitQ+ QPS | ruLake Fresh QPS | ruLake Eventual QPS | tax   |
|--------:|-------------------:|-----------------:|--------------------:|------:|
|   5,000 |             17,311 |           17,874 |              17,858 | 0.97× |
|  50,000 |              5,162 |            5,123 |               5,050 | 1.01× |
| 100,000 |              3,122 |            3,117 |               3,114 | 1.00× |

**Intermediary tax is effectively zero on a local backend.** Federated
across 2 shards: 2,470 QPS @ n=100k (0.79× of single-shard); 4 shards:
1,781 QPS (0.57×) — sequential fan-out, parallel merge is the v2
optimisation per ADR-155 §Consequences.

## Build + test status (this crate only)

```
cargo build  -p ruvector-rulake --release                            ✓
cargo test   -p ruvector-rulake --release                            ✓ 7 passed
cargo clippy -p ruvector-rulake --release --all-targets -- -D warnings   ✓ clean
cargo fmt    -p ruvector-rulake -- --check                           ✓ clean
cargo run    -p ruvector-rulake --release --bin rulake-demo          ✓ reproduces BENCHMARK.md
```

## Scope this commit does NOT cover (M2-M5, see 07-implementation-plan.md)

- ParquetBackend, BigQueryBackend, SnowflakeBackend, IcebergBackend,
  DeltaBackend (real-backend adapters).
- Push-down paths into backends with native vector ops.
- Governance / RBAC / PII / lineage / audit (M4).
- SIFT1M recall measurement on the real-backend path.
- Parallel fan-out via rayon.
- LRU cache eviction.

Co-Authored-By: claude-flow <ruv@ruv.net>

2026-04-23 18:38:49 -04:00

13 KiB

Raw Permalink Blame History

03 — BigQuery Integration (Tier 1)

BigQuery is the load-bearing v1 target. This doc enumerates three candidate integration architectures, names their costs and compromises head-on, and picks the primary. The picked architecture is what 07-implementation-plan.md builds.

What We Need BigQuery To Do

Analyst runs a SQL query that looks like:

SELECT id, title, score
FROM `project.ds.embeddings` AS e,
     `project.ds.LIB`.RULAKE_SEARCH(
        @query_vec,
        k => 10,
        rerank_factor => 20
     ) AS s
WHERE e.id = s.id
ORDER BY s.distance
LIMIT 10;

Where RULAKE_SEARCH is whatever UDF / remote function / BigLake table valued function we ship. The exit bar: this query returns correct top-k against a 1 M-vector corpus, 100% recall@10 vs the exact answer, within a latency budget 05-performance-budget.md commits to.

Candidate A — Parquet Bridge + BigLake External Table + Remote Function

GCS bucket
 ├── embeddings_v1/data_*.parquet        (Arrow FixedSizeList<Float32, D>)
 ├── embeddings_v1/vec.rvf, idx.rvf, ... (opaque RVF bundle, read by UDF)
 └── embeddings_v1/table.rulake.json     (bundle manifest)

BigQuery
 ├── BigLake external table over the Parquet files
 ├── Remote function  ruLake.SEARCH(vec, k, rerank)
 │     → HTTPS → Cloud Run service (Rust, stateful across cold-starts)
 │     → Cloud Run service reads the `.rvf` bundle from GCS via range reads
 │     → Runs RaBitQ+ symmetric scan + rerank×N
 │     → Returns top-k {id, distance} as a repeated JSON field
 └── Lineage → Dataplex via our adapter emitting edges from job metadata

Pros

Every piece is a standard BQ primitive. No private APIs.
BigLake + Parquet is the normal enterprise ingest path. Security review is familiar territory.
The .rvf bundle is the same bundle DuckDB, edge, and WASM runtime read. True portability.
RaBitQ warm state (rotation matrix, Level-A hotset) stays in Cloud Run memory; per-call overhead drops after first warm call.

Cons / What we give up

Round-trip latency. Every remote function call is HTTPS ~30–80 ms p50 just for the Cloud Run hop. Not sub-ms. See 05-performance-budget.md.
Payload caps. BQ remote function request and response bodies are capped at 10 MB each (verify before relying; this has changed). For k > ~1000 we must fan out.
Cost. Cloud Run compute + egress. At heavy query load, cheaper than BQ slot time; at light load, more expensive than doing nothing.
Parquet vector column is semi-opaque. BQ will read the vec column as ARRAY<FLOAT64> (or ARRAY<FLOAT32> depending on Parquet logical type support in the current BQ release — verify). Good for display, not useful for native filter pushdown because RaBitQ bits live in the sidecar .rvf, not in the Parquet column.

Engineer-weeks

Parquet bridge: 2.0
Cloud Run service (the UDF): 2.5
BigLake table + SQL TVF glue: 1.0
Total: 5.5 E-wks

Candidate B — Iceberg Catalog + Trino/BigQuery-via-BigLake

GCS bucket
 ├── iceberg/embeddings_v1/metadata/...  (Iceberg v2 tables)
 └── iceberg/embeddings_v1/data/...      (Parquet + .rvf sidecars)

BigQuery (BigLake)
 ├── Iceberg external table via BigLake Iceberg integration
 └── No native kernel integration (just column access)

Separate query engine (Trino / Spark / DuckDB)
 └── Runs the RaBitQ kernel via its own UDF path

Pros

Single source of truth. One Iceberg table is readable by BQ, Trino, Snowflake (via Polaris), Databricks, DuckDB — all at once.
Governance lives in the Iceberg catalog (Polaris / Nessie / Glue / Dataplex). Lineage is clean.
Matches the open-lake trend.

Cons / What we give up

No BQ-native vector search. BQ can read the Iceberg table but cannot call our kernel. Customers wanting "do it in BQ" get nothing.
Requires a second query engine for the acceleration path (Trino/Spark/DuckDB). Doubles the ops surface.
BigLake Iceberg support maturity — has been improving since 2024 but still has edge cases around snapshot evolution and schema changes. Verify before relying each quarter.

Engineer-weeks

Iceberg v2 manifest writer: 1.5
BigLake wiring: 0.5
Separate Trino/DuckDB acceleration path: 3.0
Total: 5.0 E-wks — but only half of a BQ story

Candidate C — Native BQ Kernel via JS UDF + Binary Blob

BigQuery
 ├── Native BQ table with BYTES column = RaBitQ 1-bit codes + rotation metadata
 ├── JavaScript UDF:  ruLake.ANGULAR_DISTANCE(query_code BYTES, stored_code BYTES)
 │     → XNOR-popcount-and-angular-estimator in JS
 └── Standard SQL rewrites: SELECT id, ANGULAR_DISTANCE(...) AS d ORDER BY d LIMIT k

Pros

Zero external services. No Cloud Run, no API Gateway, no sidecar.
Billing is pure BQ slot time — already a budget line item.
Lineage is automatic — it is a native BQ query.

Cons / What we give up

JavaScript UDFs are slow. V8 inside BQ is ~1000× slower than native RaBitQ. At n=1M, D=128, a JS popcount loop will be ~seconds per row, not microseconds. Fundamentally unfit for scale beyond ~100k vectors.
No HNSW. JS UDFs cannot maintain a warm graph structure between calls. We lose the biggest win of RVF's Layer-A/B/C progressive index.
No rotation caching. Every call re-materialises the D×D rotation matrix. At D=1536 that is 3.6 B ops per call. Unusable.
BQ persistent Python UDFs (2024 preview; verify current GA status) could fix some of this — Python in a container, can call native libs via cffi, can cache state per slot. But: a) same cold start cost as remote function, b) BQ pricing for Python UDFs was not clearly better than remote function cost in our last check (verify), c) less control over memory than Cloud Run.

Engineer-weeks

JS UDF kernel: 1.0
BQ schema + ingest path: 0.5
Total: 1.5 E-wks — but does not clear the acceptance bar above ~100k vectors. Only usable as a fallback for tiny tables.

Pick: Candidate A (Primary), With Candidate B as the Lake-Distribution Fallback

Why A as primary:

It is the only option that hits "real BQ query, real scale, real performance" on day one.
Every piece is a standard BQ primitive, which keeps the security review tractable.
The Cloud Run UDF is reusable as-is for Snowflake external functions and for a generic HTTPS vector-search endpoint.
Warm-state caching (rotation matrix, HNSW hotset) maps cleanly onto Cloud Run's concurrency model.

Why B in parallel, not instead:

Iceberg distribution is cheap (1.5 E-wks) and gives us Tier-2 compatibility for free once the Parquet bridge exists. We write the Iceberg manifest as part of M1.
We do not do the "separate Trino/Spark engine" half of Candidate B in v1 — that is deferred.

Why not C:

JS UDF is a tech-demo, not a production path. We document it as a "cheap fallback for < 100 k vectors, no HNSW" in the docs but do not maintain it.

Picked Architecture in Detail

┌──────────────────────────────────────────────────────────────────────────┐
│                        BigQuery                                          │
│ ┌────────────────────────┐   ┌────────────────────────────────────────┐  │
│ │ BigLake external table │   │ Remote function: ruLake.SEARCH(...)   │  │
│ │  (Parquet, `vec`)      │   │  signature: (vec ARRAY<FLOAT64>,       │  │
│ │  schema + RBAC + audit │   │             k INT64,                    │  │
│ │  masking applied here  │   │             rerank_factor INT64)        │  │
│ └────────────┬───────────┘   │  returns: ARRAY<STRUCT<id INT64,        │  │
│              │               │                       distance FLOAT64>>│  │
│              │               └──────────┬──────────────────────────────┘  │
└──────────────┼──────────────────────────┼──────────────────────────────────┘
               │                          │ HTTPS (JSON batch)
               ▼                          ▼
        ┌──────────────────────────────────────────────┐
        │           Cloud Run: ruLake-udf              │
        │  Rust binary, 1 container, stateful warm set │
        │  - HTTP server (axum) on port 8080           │
        │  - Warm state: RVF manifest, rotation matrix,│
        │    Layer-A hotset (~4 MB), RaBitQ codes      │
        │  - Per-call: parse JSON → RaBitQ+ scan →     │
        │    rerank×N → JSON response                  │
        │  - Cold start: pull 4 KB Level-0 root, then   │
        │    4 MB Layer A; subsequent requests warm.   │
        └──────────────┬───────────────────────────────┘
                       │ GCS range reads (HTTP Range)
                       ▼
        ┌─────────────────────────────────────────────┐
        │           GCS bucket: gs://.../embeddings_v1│
        │  Parquet files       (BQ reads these)        │
        │  .rvf bundle         (UDF reads these)       │
        │  table.rulake.json   (bundle manifest)       │
        │  iceberg/metadata/...(Tier-2 fallback)       │
        └─────────────────────────────────────────────┘

Key design decisions in this picture, each with a named cost

Warm state lives in Cloud Run memory, not in a shared Redis. Simpler ops; costs us one cold-start per new container instance.
Parquet and .rvf are the same data, written twice. Doubles storage cost. We eat it for now to keep BQ-native column access.
No row-filter pushdown. BQ evaluates filters after the UDF returns top-k. For WHERE genre = 'sci-fi', the UDF must either over-fetch (return 10× k) or accept filters as a UDF argument. We start with the former; see §"Filter pushdown" below.
One UDF per region. Cloud Run is regional; BQ region must match UDF region. Multi-region BQ datasets need one UDF per region.

Filter pushdown

The UDF accepts an optional filter: STRING argument encoding an rvf-runtime::filter::FilterExpr. The UDF applies it on RVF's MetadataStore before the RaBitQ scan. This requires shipping the filter metadata inside the .rvf bundle (META_SEG + META_IDX_SEG, which RVF already encodes).

The trade-off: we build filter serialization and evaluation ourselves. BQ column-level security and row-level policies are still applied after the UDF returns, so a filter-aware UDF is a performance optimisation, not a security boundary. Security remains in BQ.

Things To Verify Before M2

Every one of these has moved in the last 12 months. Confirm at the start of M1.

BQ remote function request/response body cap (current doc says 10 MB each; we previously saw 1 MB in an old doc).
BQ remote function concurrency model (per-replica vs global).
BQ remote function cold-start latency in the target region.
BQ BigLake Iceberg maturity for schema evolution.
BQ native VECTOR(FLOAT32, D) column type — is there one? (As of early 2026 BQ had ARRAY<FLOAT64> for embeddings but no fixed-size vector type.)
BQ persistent Python UDF GA status and pricing.
Dataplex lineage API stability (REST, not SDK).

Each unresolved item becomes an M1 task.

Acceptance Test (Exit Criterion for M3)

Dataset: 1,000,000 vectors, D=128, clustered Gaussian (same generator as
BENCHMARK.md).

1. Build Parquet + .rvf bundle via `rvf-import` + `rvf-parquet` crates.
2. Upload to GCS (us-central1).
3. Deploy ruLake-udf to Cloud Run us-central1.
4. Register BigLake external table + remote function in BQ us-central1.
5. Run: SELECT ... FROM VECTOR_SEARCH(...) with 200 queries.
6. Compare top-10 IDs against Flat-exact top-10 computed locally.

Pass criteria:
  - recall@10 == 100%  (rerank×20 is sized for this)
  - p50 latency    <= 150 ms warm
  - p95 latency    <= 300 ms warm
  - p50 cold start <= 2 s (container start + Layer-A read)

Targets, unmeasured until M3 runs.

13 KiB Raw Permalink Blame History Unescape Escape

03 — BigQuery Integration (Tier 1)

What We Need BigQuery To Do

Candidate A — Parquet Bridge + BigLake External Table + Remote Function

Pros

Cons / What we give up

Engineer-weeks

Candidate B — Iceberg Catalog + Trino/BigQuery-via-BigLake

Pros

Cons / What we give up

Engineer-weeks

Candidate C — Native BQ Kernel via JS UDF + Binary Blob

Pros

Cons / What we give up

Engineer-weeks

Pick: Candidate A (Primary), With Candidate B as the Lake-Distribution Fallback

Picked Architecture in Detail

Key design decisions in this picture, each with a named cost

Filter pushdown

Things To Verify Before M2

Acceptance Test (Exit Criterion for M3)

13 KiB

Raw Permalink Blame History