Seven-file design review at docs/sdk/ covering the binding strategy,
API surface, M1-M4 milestones, risks, and a one-page decision record
for shipping a Python SDK.
Recommended path: **PyO3 + maturin, single in-tree
`crates/ruvector-py/` cdylib, abi3-py39 wheel via cibuildwheel,
`pyo3-asyncio` over a singleton tokio runtime.**
Why:
- The existing `*-node` NAPI templates (e.g.
`crates/ruvector-diskann-node/src/lib.rs`) already prove out the
opaque-handle + `Arc<RwLock<…>>` shape PyO3 mirrors line-for-line —
~70% port, ~30% lifetime gymnastics.
- abi3 collapses the wheel matrix from ~25 (5 CPython versions × 5
platforms) to 5 (one wheel per platform, all py3.9+).
- Singleton tokio runtime avoids the "one runtime per call" overhead
while remaining compatible with asyncio + uvloop.
Milestone shape (each with explicit scope + acceptance tests):
M1 — RaBitQ-only Python wheel. Just the published
`ruvector-rabitq` crate exposed via PyO3. Smallest possible
useful surface. ~600 LoC, 3 weeks.
M2 — ruLake. Async via pyo3-asyncio. Witness verify exposed.
~900 LoC, 4 weeks.
M3 — Embeddings + ML helpers. Wrap consumer-facing parts of
`ruvector-cnn` / `ruvllm`. ~700 LoC, 3 weeks.
M4 — A2A agent client. Wrap `rvagent-a2a` so Python apps can
dispatch tasks to A2A peers, including signed AgentCard
discovery. ~800 LoC, 4 weeks.
Three acceptance gates for the whole effort:
1. A Python user can do RAG over 1 M vectors in <5 lines.
2. An asyncio user can stream A2A task updates without thread
fights.
3. `pip install ruvector` takes <10 s on a stock machine.
Top 3 risks identified:
R1 — tokio runtime + PyO3 + asyncio/uvloop interop. Mitigation:
single lazy runtime, `pyo3-asyncio` shim.
R3 — wheel size. M4 budget is 22 MB; A2A deps (axum + reqwest +
rustls) could blow it. Mitigation: feature-gate axum/reqwest
behind `agent` extra; default install is rabitq + rulake only.
R7 — PyPI name squat on `ruvector`. Mitigation: register placeholder
before M1 ships.
Nuance discovered: `ruvector-rabitq` has **no** sibling `*-node` or
`*-wasm` crate — unlike most consumer crates. M1 is therefore clean
greenfield: no parity-pressure to match a flaky NAPI signature, and
it confirms rabitq alone is the right starter target rather than the
umbrella `ruvector` crate the npm package wraps.
Planning doc only; no implementation.
Co-Authored-By: claude-flow <ruv@ruv.net>
02 — Binding Strategy
Decision
PyO3 + maturin, single extension module, abi3-py39, with pyo3-asyncio
for async bridging and a hand-written .pyi stub. Built and distributed
via cibuildwheel in CI, published to PyPI as ruvector. The crate lives
in-tree at crates/ruvector-py/.
The rest of this document defends that choice against the four alternatives considered, and locks in the supporting decisions (asyncio, GIL, wheels, stubs).
The choice space
| Option | Idea | Why we are not picking it |
|---|---|---|
| A. PyO3 + maturin (chosen) | Native Rust extension exposed as a CPython C-API module via pyo3, built with maturin. | n/a |
| B. CFFI over a Rust cdylib | Hand-roll a C ABI in ruvector-py/ (or reuse ruvector-router-ffi) and let Python call it via cffi. | Loses the rich type story PyO3 gives for free (NumPy buffers, Vec<T> <-> list, Result<T,E> <-> exception, async fn <-> awaitable). Forces us to maintain a C header. We already maintain NAPI bindings; CFFI is a strictly worse parallel surface. |
| C. ctypes over cbindgen | Same as B, but using the stdlib ctypes module instead of cffi. | Same loss; less ergonomic; no installer to declare a build dep on; users hit a ctypes.CDLL import error if they pip-install on a platform without a wheel. |
| D. wasmtime-py over the existing *-wasm crates | Reuse ruvector-rabitq via a new ruvector-rabitq-wasm crate, run the WASM in wasmtime-py. | Requires writing the missing *-wasm crate first (rabitq has none; rulake has none). Loses 5–20× perf vs native (no SIMD escape hatch). Tokio doesn't run inside wasm32-wasi. Adds a 6 MB+ wasmtime runtime to every wheel. The whole point of going native is to match the Rust numbers, not lose half of them at the boundary. |
| E. gRPC / OpenAPI server with thin Python client | Stand up ruvector-server over HTTP/gRPC, ship a Python client that hits localhost. | Two-process architecture is the wrong default for a library: the user has to deal with port allocation, server lifecycle, and serialization cost on every call. This is the right shape for a Python service SDK, but a vector index isn't a service; it's a data structure. |
Why PyO3 specifically
- Surface area parity with NAPI is automatic. PyO3's `#[pyclass]` maps onto an opaque handle the same way `#[napi]` does, and `#[pymethods]` maps onto `#[napi]` impl blocks. Anyone who maintains `crates/ruvector-diskann-node` can read and review the PyO3 module in `crates/ruvector-py` line-for-line.
- NumPy zero-copy. `pyo3` + `numpy` (the `rust-numpy` crate) lets us accept `np.ndarray` and read it as `&[f32]` without a copy when the array is contiguous and `dtype=float32`. RaBitQ search loops on `&[f32]` already; this is a thin wrap.
- abi3 wheels. PyO3 supports the stable ABI (`abi3-py39`), which means one wheel covers Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 / 3.14. We do not need to ship a wheel per Python version. This collapses the matrix from ~25 wheels (5 versions × 5 platforms) to 5 wheels.
- Mature async. `pyo3-asyncio` (or its successor `pyo3-async-runtimes`, which we should track) lets a Rust `async fn` return a Python awaitable that `asyncio.run` awaits without spawning a thread per call. This is the only practical way to bridge tokio without double-runtime-fights.
- Maturin is the de-facto Rust-Python build tool. Used by polars, pydantic-core, cryptography (in part), tokenizers. We are not pioneering anything; we are taking the well-trodden path.
Async story
Native asyncio via pyo3-asyncio. Every Rust async fn we expose
becomes an async def in Python by way of a pyo3_asyncio::tokio::future_into_py
wrapper. There is exactly one tokio runtime in the process: a multi-thread
runtime owned by the extension module, lazily initialized on first use,
sized to min(8, os.cpu_count()) worker threads. We do not create a
runtime per call.
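The lazy-singleton rule can be pictured with a Python analogue. This is only a sketch of the pattern: the real runtime is a tokio runtime living in Rust, and `ThreadPoolExecutor`, `get_pool`, and `worker_count` below are stand-in names chosen for illustration, not part of the planned API. The one detail taken from the text is the sizing rule, `min(8, os.cpu_count())`.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

MAX_WORKERS = 8  # cap taken from the sizing rule in the text

_pool: Optional[ThreadPoolExecutor] = None
_pool_lock = threading.Lock()


def worker_count() -> int:
    """min(8, cpu_count), treating an unknown CPU count as 1."""
    return min(MAX_WORKERS, os.cpu_count() or 1)


def get_pool() -> ThreadPoolExecutor:
    """Create the process-wide pool on first use and reuse it afterwards."""
    global _pool
    if _pool is None:
        with _pool_lock:
            # Re-check under the lock so two racing callers build only one pool.
            if _pool is None:
                _pool = ThreadPoolExecutor(
                    max_workers=worker_count(),
                    thread_name_prefix="ruvector-rt",  # hypothetical name
                )
    return _pool


# Every call site shares the same pool instead of creating one per call.
same_pool = get_pool() is get_pool()
```

Nothing is allocated at import time; the cost of starting worker threads is paid once, by whichever call arrives first.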
We do not use asyncio.to_thread or run_in_executor to wrap a sync
API. That works but breaks cancellation propagation and tracing context.
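The cancellation problem can be reproduced with the standard library alone. In this sketch `blocking_search` is a hypothetical stand-in for a synchronous native call; the point is that cancelling the awaiting task does not stop the work already running on the worker thread.

```python
import asyncio
import threading
import time

thread_finished = threading.Event()


def blocking_search() -> str:
    """Stand-in for a synchronous native call (hypothetical)."""
    time.sleep(0.2)
    thread_finished.set()  # reached even if the awaiter was cancelled
    return "hits"


async def cancel_wrapped_call() -> bool:
    task = asyncio.create_task(asyncio.to_thread(blocking_search))
    await asyncio.sleep(0.05)  # give the worker thread time to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return True  # the Python side observed the cancellation
    return False


awaiter_cancelled = asyncio.run(cancel_wrapped_call())
# The thread was never interrupted, so the "cancelled" work still completes.
work_still_completed = thread_finished.wait(timeout=2.0)
print(awaiter_cancelled, work_still_completed)
```

The caller sees `CancelledError`, yet the underlying call runs to completion and its result is discarded. A native awaitable can, in principle, forward the cancellation to the Rust future instead.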
The main async surfaces are:
- `RuLake.search_async` (M2)
- `A2aClient.send_task` / `stream_task` (M4)
- `Embedder.embed_batch_async` (M3, optional; sync is fine for CPU work)
Sync siblings are kept for every async method (e.g. search and
search_async). Synchronous calls release the GIL via
Python::allow_threads; async calls return an awaitable immediately, and
the work runs on the tokio runtime instead of blocking the calling
Python thread.
Compatibility: tested against CPython's default asyncio + uvloop. We do not pin uvloop. We do not invent our own loop policy.
GIL story
Every CPU-bound entry point that takes more than ~50 µs releases the
GIL via py.allow_threads(|| { ... }) around the inner Rust call. The
list as of M3:
| Surface | Releases GIL? | Why |
|---|---|---|
| `RabitqIndex.build` | yes | dominant cost is rotation + popcount, all Rust |
| `RabitqIndex.search` | yes | scan loop, no Python interaction |
| `RabitqIndex.add` | no | one vector per call, overhead < release cost |
| `RuLake.search_*` | yes | scan + cache lookup, all Rust |
| `Embedder.embed` | yes | tensor ops |
| `A2aClient.send_task` | n/a (async) | tokio runs without holding the GIL |
This is the same calculus polars and tokenizers use. Documenting it explicitly so the next person who adds a method knows the rule.
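What releasing the GIL buys the caller can be seen from pure Python. The sketch below uses `time.sleep` as a stand-in for a long native call, since `time.sleep` is a standard-library function that releases the GIL while it waits; `native_call_stand_in` is a hypothetical name, not a ruvector function.

```python
import threading
import time

CALL_SECONDS = 0.1


def native_call_stand_in() -> None:
    """Stand-in for a long Rust call wrapped in allow_threads."""
    time.sleep(CALL_SECONDS)  # the GIL is released for the duration


def run_in_threads(n: int) -> float:
    """Run n stand-in calls on n threads and return the wall-clock time."""
    threads = [threading.Thread(target=native_call_stand_in) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_in_threads(4)
serial_cost = 4 * CALL_SECONDS
print(f"4 threads: {elapsed:.2f}s, serial would be {serial_cost:.2f}s")
```

Because the calls overlap, four threads finish in roughly the time of one call. A native function that held the GIL throughout would serialize them, which is why the table defaults to "yes" for anything long-running.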
Wheel distribution matrix
We ship five wheels for each release, all abi3-py39 (works on Python
3.9+):
| Platform | Wheel platform tag | Built on | Notes |
|---|---|---|---|
| Linux x86_64 | manylinux_2_28_x86_64 | GitHub Actions ubuntu-latest | AVX2 baseline; runtime detect AVX-512 |
| Linux aarch64 | manylinux_2_28_aarch64 | GHA ARM runners or QEMU via cibuildwheel | NEON baseline |
| macOS x86_64 | macosx_10_15_x86_64 | GHA macos-13 | AVX2 baseline; being slower for M-series users who end up on this wheel is fine, they have an arm64 wheel |
| macOS aarch64 | macosx_11_0_arm64 | GHA macos-14 | NEON baseline |
| Windows x86_64 | win_amd64 | GHA windows-latest | AVX2 baseline; runtime detect AVX-512 |
We drop musllinux, Windows arm64, and 32-bit anything. cibuildwheel
is configured via [tool.cibuildwheel] in pyproject.toml. On a 32-bit
platform, pip install falls back to the sdist, which fails to build;
that is the correct outcome.
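The matrix arithmetic can be checked by enumerating wheel filenames. The platform tags below are the ones in the table and the `cp39-abi3` tag pair is the standard form for a stable-ABI wheel. The version string `0.1.0` and the choice of exactly five per-version interpreter tags (matching the "5 versions × 5 platforms" estimate) are assumptions for illustration.

```python
from itertools import product
from typing import List

PLATFORM_TAGS = [
    "manylinux_2_28_x86_64",
    "manylinux_2_28_aarch64",
    "macosx_10_15_x86_64",
    "macosx_11_0_arm64",
    "win_amd64",
]

# Assumed set of interpreters for the non-abi3 comparison.
PER_VERSION_TAGS = ["cp39", "cp310", "cp311", "cp312", "cp313"]


def abi3_wheels(name: str, version: str) -> List[str]:
    """One stable-ABI wheel per platform, usable on CPython 3.9+."""
    return [f"{name}-{version}-cp39-abi3-{plat}.whl" for plat in PLATFORM_TAGS]


def per_version_wheels(name: str, version: str) -> List[str]:
    """One wheel per (interpreter, platform) pair, as without abi3."""
    return [
        f"{name}-{version}-{py}-{py}-{plat}.whl"
        for py, plat in product(PER_VERSION_TAGS, PLATFORM_TAGS)
    ]


abi3 = abi3_wheels("ruvector", "0.1.0")
full = per_version_wheels("ruvector", "0.1.0")
print(len(abi3), "abi3 wheels vs", len(full), "per-version wheels")
```

Each new CPython release would add five more files to the per-version list, while the abi3 list stays at five.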
SIMD is runtime-detected, not compiled per-platform. ruvector-rabitq
is pure Rust without explicit AVX-512 paths today (the kernel.rs
VectorKernel trait is the extension point). We ship one binary per
platform; if/when we add an AVX-512 kernel it lives behind a runtime
CPU-feature check.
Type stubs
Hand-written .pyi stubs, checked in at
crates/ruvector-py/python/ruvector/__init__.pyi. Reasons:
- `pyo3-stub-gen` is real and improving but generates noisy stubs that need editing anyway (it overstates `Any`, doesn't infer `Optional[...]` from `Option<T>` cleanly).
- The stub surface is small enough (≤ 4 modules × ≤ 40 methods) that hand-writing is feasible.
- We control the user-visible API shape, e.g. we want NumPy types in signatures (`np.ndarray[np.float32]`), not `list[float]`.
A CI job runs mypy --strict tests/ and pyright tests/ against an
import ruvector to catch stub regressions.
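As an illustration of the intended signature style, a stub for the M1 surface might look roughly like the sketch below. The method names `build`, `search`, and `add` come from the GIL table. Every parameter name, default, and return type is an assumption for illustration, not the settled API. `numpy.typing.NDArray` is used as the concrete spelling of a typed float32 array. The `TYPE_CHECKING` guard is there only so the sketch also runs as plain Python without NumPy installed; a real `.pyi` would import NumPy directly.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import numpy as np
    import numpy.typing as npt


class RabitqIndex:
    """Hypothetical stub shape; bodies are elided as in a .pyi file."""

    @staticmethod
    def build(vectors: npt.NDArray[np.float32]) -> RabitqIndex:
        """Build an index from an (n, dim) float32 array (assumed shape)."""
        ...

    def add(self, vector: npt.NDArray[np.float32]) -> None:
        """Append a single vector (assumed signature)."""
        ...

    def search(
        self, query: npt.NDArray[np.float32], k: int = 10
    ) -> npt.NDArray[np.int64]:
        """Return the ids of the k nearest vectors (assumed return type)."""
        ...
```

With signatures of this kind, `mypy --strict` rejects a call that passes a `list[float]` where an array is expected, which is the regression the CI job is meant to catch.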
Source layout
    crates/ruvector-py/
        Cargo.toml          # crate-type = ["cdylib"], pyo3 + numpy + pyo3-asyncio
        pyproject.toml      # maturin backend; cibuildwheel config; project metadata
        README.md           # short; links to docs/sdk
        src/
            lib.rs          # PyModule init, re-exports each submodule
            rabitq.rs       # M1
            rulake.rs       # M2
            embed.rs        # M3
            a2a.rs          # M4
            error.rs        # exception hierarchy
            runtime.rs      # the singleton tokio runtime
        python/ruvector/
            __init__.py     # re-exports from the compiled module + small pure-Py helpers
            __init__.pyi    # hand-written stubs
            py.typed        # marker so mypy/pyright recognize stubs
        tests/              # pytest, runs against the installed wheel
        benches/            # asv (airspeed-velocity) over identical workloads to Rust criterion
The python/ruvector/__init__.py re-export pattern lets us add pure-Python
helpers (e.g. dataclasses for config) without forcing them through the
extension boundary.
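A pure-Python helper of the kind this layout allows could be as small as a frozen dataclass. `SearchConfig` and its fields are hypothetical names used only to show the pattern; the re-export of the compiled module is left as a comment because the module name is not fixed by this document.

```python
from dataclasses import dataclass

# In the real __init__.py this is where the compiled extension's classes
# would be re-exported; the import is omitted because its name is undecided.


@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical config object validated before crossing into Rust."""

    k: int = 10
    rerank: bool = False

    def __post_init__(self) -> None:
        # Reject bad input in Python so the error is a plain ValueError
        # raised before any native code runs.
        if self.k <= 0:
            raise ValueError(f"k must be positive, got {self.k}")


default_config = SearchConfig()
wide_config = SearchConfig(k=100, rerank=True)
```

Keeping such helpers in Python means they can change without a rebuild of the extension and show up in tracebacks as ordinary Python frames.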
What this strategy explicitly does NOT do
- Does not wrap every workspace crate. We pick four crates over four milestones; everything else stays Rust-only.
- Does not try to be a Pythonic vector DB framework (chromadb, weaviate, qdrant). We are a thin, fast, typed binding to a specific Rust stack.
- Does not vendor models. The embedder downloads weights from HuggingFace at first use, the same way `ruvector-cnn` does in Rust.
- Does not provide an asyncio-only API. Sync siblings always exist for non-network calls.