ruvector/docs/sdk/02-strategy.md
ruvnet f6c684aba0 docs(sdk): add deep planning review for ruvector Python SDK
Seven-file design review at docs/sdk/ covering the binding strategy,
API surface, M1-M4 milestones, risks, and a one-page decision record
for shipping a Python SDK.

Recommended path: **PyO3 + maturin, single in-tree
`crates/ruvector-py/` cdylib, abi3-py39 wheel via cibuildwheel,
`pyo3-asyncio` over a singleton tokio runtime.**

Why:
- The existing `*-node` NAPI templates (e.g.
  `crates/ruvector-diskann-node/src/lib.rs`) already prove out the
  opaque-handle + `Arc<RwLock<…>>` shape PyO3 mirrors line-for-line —
  ~70% port, ~30% lifetime gymnastics.
- abi3 collapses the wheel matrix from ~25 (cpython36 × 5 platforms)
  to 5 (one wheel per platform, all py3.9+).
- Singleton tokio runtime avoids the "one runtime per call" overhead
  while remaining compatible with asyncio + uvloop.

Milestone shape (each with explicit scope + acceptance tests):

  M1 — RaBitQ-only Python wheel. Just the published
       `ruvector-rabitq` crate exposed via PyO3. Smallest possible
       useful surface. ~600 LoC, 3 weeks.
  M2 — ruLake. Async via pyo3-asyncio. Witness verify exposed.
       ~900 LoC, 4 weeks.
  M3 — Embeddings + ML helpers. Wrap consumer-facing parts of
       `ruvector-cnn` / `ruvllm`. ~700 LoC, 3 weeks.
  M4 — A2A agent client. Wrap `rvagent-a2a` so Python apps can
       dispatch tasks to A2A peers, including signed AgentCard
       discovery. ~800 LoC, 4 weeks.

Three acceptance gates that gate the whole effort:
  1. A Python user can do RAG over 1 M vectors in <5 lines.
  2. An asyncio user can stream A2A task updates without thread
     fights.
  3. `pip install ruvector` takes <10 s on a stock machine.

Top 3 risks identified:
  R1 — tokio runtime + PyO3 + asyncio/uvloop interop. Mitigation:
       single lazy runtime, `pyo3-asyncio` shim.
  R3 — wheel size. M4 budget is 22 MB; A2A deps (axum + reqwest +
       rustls) could blow it. Mitigation: feature-gate axum/reqwest
       behind `agent` extra; default install is rabitq + rulake only.
  R7 — PyPI name squat on `ruvector`. Mitigation: register placeholder
       before M1 ships.

Nuance discovered: `ruvector-rabitq` has **no** sibling `*-node` or
`*-wasm` crate — unlike most consumer crates. M1 is therefore clean
greenfield: no parity-pressure to match a flaky NAPI signature, and
it confirms rabitq alone is the right starter target rather than the
umbrella `ruvector` crate the npm package wraps.

Planning doc only; no implementation.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-04-25 20:28:54 -04:00

8.7 KiB
Raw Permalink Blame History

02 — Binding Strategy

Decision

PyO3 + maturin, single extension module, abi3-py39, with pyo3-asyncio for async bridging and a hand-written .pyi stub. Built and distributed via cibuildwheel in CI, published to PyPI as ruvector. The crate lives in-tree at crates/ruvector-py/.

The rest of this document defends that choice against the four alternatives considered, and locks in the supporting decisions (asyncio, GIL, wheels, stubs).

The choice space

Option Idea Why we are not picking it
A. PyO3 + maturin (chosen) Native Rust extension exposed as a CPython C-API module via pyo3, built with maturin.
B. CFFI over a Rust cdylib Hand-roll a C ABI in ruvector-py/ (or reuse ruvector-router-ffi) and let Python call it via cffi. Loses the rich type story PyO3 gives for free (NumPy buffers, Vec<T> <-> list, Result<T,E> <-> exception, async fn <-> awaitable). Forces us to maintain a C header. We already maintain NAPI bindings; CFFI is a strictly worse parallel surface.
C. ctypes over cbindgen Same as B, but using the stdlib ctypes module instead of cffi. Same loss; less ergonomic; no installer to declare a build dep on; users hit a ctypes.CDLL import error if they pip-install on a platform without a wheel.
D. wasmtime-py over the existing *-wasm crates Reuse ruvector-rabitq via a new ruvector-rabitq-wasm crate, run the WASM in wasmtime-py. Requires writing the missing *-wasm crate first (rabitq has none; rulake has none). Loses 520× perf vs native (no SIMD escape hatch). Tokio doesn't run inside wasm32-wasi. Adds a 6 MB+ wasmtime runtime to every wheel. The whole point of going native is to match the Rust numbers, not lose half of them at the boundary.
E. gRPC / OpenAPI server with thin Python client Stand up ruvector-server over HTTP/gRPC, ship a Python client that hits localhost. Two-process architecture is the wrong default for a library — the user gets to deal with port allocation, server lifecycle, and serialization cost on every call. This is the right shape for a Python service SDK, but a vector index isn't a service; it's a data structure.

Why PyO3 specifically

  1. Surface area parity with NAPI is automatic. PyO3's #[pyclass] maps onto an opaque handle the same way #[napi] does, and #[pymethods] maps onto #[napi] impl blocks. Anyone who maintains crates/ruvector-diskann-node can read and review the PyO3 module in crates/ruvector-py line-for-line.
  2. NumPy zero-copy. pyo3 + numpy (the rust-numpy crate) lets us accept np.ndarray and read it as &[f32] without a copy when the array is contiguous and dtype=float32. RaBitQ search loops on &[f32] already; this is a thin wrap.
  3. abi3 wheels. PyO3 supports the stable ABI (abi3-py39), which means one wheel covers Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 / 3.14. We do not need to ship a wheel per Python version. This collapses the matrix from ~25 wheels (5 versions × 5 platforms) to 5 wheels.
  4. Mature async. pyo3-asyncio (or its successor pyo3-async-runtimes, which we should track) lets a Rust async fn return a Python awaitable that asyncio.run awaits without spawning a thread per call. This is the only practical way to bridge tokio without double-runtime-fights.
  5. Maturin is the de-facto Rust-Python build tool. Used by polars, pydantic-core, cryptography (in part), tokenizers. We are not pioneering anything; we are taking the well-trodden path.

Async story

Native asyncio via pyo3-asyncio. Every Rust async fn we expose becomes an async def in Python by way of a pyo3_asyncio::tokio::future_into_py wrapper. There is exactly one tokio runtime in the process: a multi-thread runtime owned by the extension module, lazily initialized on first use, sized to min(8, os.cpu_count()) worker threads. We do not create a runtime per call.

We do not use asyncio.to_thread or run_in_executor to wrap a sync API. That works but breaks cancellation propagation and tracing context.

The main async surfaces are:

  • RuLake.search_async (M2)
  • A2aClient.send_task / stream_task (M4)
  • Embedder.embed_batch_async (M3, optional — sync is fine for CPU work)

Sync siblings are kept for every async method (e.g. search and search_async). Synchronous calls release the GIL via Python::allow_threads; async calls return immediately and block the tokio runtime, not the calling Python thread.

Compatibility: tested against CPython's default asyncio + uvloop. We do not pin uvloop. We do not invent our own loop policy.

GIL story

Every CPU-bound entry point that takes more than ~50 µs releases the GIL via py.allow_threads(|| { ... }) around the inner Rust call. The list as of M3:

Surface Releases GIL? Why
RabitqIndex.build yes dominant cost is rotation + popcount, all Rust
RabitqIndex.search yes scan loop, no Python interaction
RabitqIndex.add no one vector per call, overhead < release cost
RuLake.search_* yes scan + cache lookup, all Rust
Embedder.embed yes tensor ops
A2aClient.send_task n/a (async) tokio runs without holding the GIL

This is the same calculus polars and tokenizers use. Documenting it explicitly so the next person who adds a method knows the rule.

Wheel distribution matrix

We ship five wheels for each release, all abi3-py39 (works on Python 3.9+):

Platform Triple Built on Notes
Linux x86_64 manylinux_2_28_x86_64 GitHub Actions ubuntu-latest AVX2 baseline; runtime detect AVX-512
Linux aarch64 manylinux_2_28_aarch64 GHA ARM runners or QEMU via cibuildwheel NEON baseline
macOS x86_64 macosx_10_15_x86_64 GHA macos-13 AVX2 baseline; bottlenecking on M-series users is fine, they have an arm64 wheel
macOS aarch64 macosx_11_0_arm64 GHA macos-14 NEON baseline
Windows x86_64 win_amd64 GHA windows-latest AVX2 baseline; runtime detect AVX-512

We drop musllinux, Windows arm64, and 32-bit anything. cibuildwheel configures via [tool.cibuildwheel] in pyproject.toml. A 32-bit user gets pip install falling back to sdist, which fails to build, which is the correct outcome.

SIMD is runtime-detected, not compiled per-platform. ruvector-rabitq is pure Rust without explicit AVX-512 paths today (the kernel.rs VectorKernel trait is the extension point). We ship one binary per platform; if/when we add an AVX-512 kernel it lives behind a runtime CPU-feature check.

Type stubs

Hand-written .pyi stubs, checked in at crates/ruvector-py/python/ruvector/__init__.pyi. Reasons:

  • pyo3-stub-gen is real and improving but generates noisy stubs that need editing anyway (it overstates Any, doesn't infer Optional[...] from Option<T> cleanly).
  • The stub surface is small enough (≤ 4 modules × ≤ 40 methods) that hand-writing is feasible.
  • We control the user-visible API shape, e.g. we want NumPy types in signatures (np.ndarray[np.float32]), not list[float].

A CI job runs mypy --strict tests/ and pyright tests/ against an import ruvector to catch stub regressions.

Source layout

crates/ruvector-py/
  Cargo.toml          # crate-type = ["cdylib"], pyo3 + numpy + pyo3-asyncio
  pyproject.toml      # maturin backend; cibuildwheel config; project metadata
  README.md           # short — links to docs/sdk
  src/
    lib.rs            # PyModule init, re-exports each submodule
    rabitq.rs         # M1
    rulake.rs         # M2
    embed.rs          # M3
    a2a.rs            # M4
    error.rs          # exception hierarchy
    runtime.rs        # the singleton tokio runtime
  python/ruvector/
    __init__.py       # re-exports from the compiled module + small pure-Py helpers
    __init__.pyi      # hand-written stubs
    py.typed          # marker so mypy/pyright recognize stubs
  tests/              # pytest, runs against the installed wheel
  benches/            # asv (airspeed-velocity) over identical workloads to Rust criterion

The python/ruvector/__init__.py re-export pattern lets us add pure-Python helpers (e.g. dataclasses for config) without forcing them through the extension boundary.

What this strategy explicitly does NOT do

  • Does not wrap every workspace crate. We pick four crates over four milestones; everything else stays Rust-only.
  • Does not try to be a Pythonic vector DB framework (chromadb, weaviate, qdrant). We are a thin, fast, typed binding to a specific Rust stack.
  • Does not vendor models. The embedder downloads weights from HuggingFace at first use, the same way ruvector-cnn does in Rust.
  • Does not provide an asyncio-only API. Sync siblings always exist for non-network calls.