mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-05-22 19:56:25 +00:00
* ADR-180/181 iter 1: branch off + plan + ServingEngine API audit
New /loop pursues two stacked optimizations on top of the ADR-179
SOTA (20.5 tok/s aggregate):
- Phase A (ADR-180): ServingEngine continuous batching wiring,
target ≥40 tok/s aggregate
- Phase B (ADR-181): in-tree pi_quant Q4 + BitNet b1.58,
target ≥80 tok/s aggregate
Iter 1 lands the plan doc + audits the LlmBackend trait surface
ServingEngine needs. Confirms the `submit_async` async oneshot
flow + the per-request encode/decode path. Wiring shape sketched
for iter 2.
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 2: wire ServingEngine into ruvllm-pi-worker (build green, scheduler stalls)
Replace Mutex<CandleBackend> with Arc<dyn LlmBackend> + Arc<ServingEngine>.
PiEngine::load constructs the engine with max_inflight from env, spawns
the run_async scheduler in a tokio task. PiEngine::generate is now
async — tokenizes via LlmBackend::tokenizer() (encode/decode live on
Tokenizer trait, not LlmBackend itself), submit_async, decode result.
Host build green ✓. Worker starts cleanly: model loaded.
But: single submit_async request hangs 60+s with no result. Hypothesis:
ServingEngine::run_async expects a lower-level executor surface that
CandleBackend doesn't implement (the LlmBackend::generate path is the
high-level escape hatch for non-batched calls; the scheduler likely
needs forward_iteration or similar). Iter 3 audits run_iteration to
find what backend methods it actually calls.
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 3: pivot to N-backend pool (ServingEngine isn't real batching)
Iter-2 audit of ServingEngine::generate_next_token: it dispatches
per-token via self.model.generate(text, max_tokens=1), serializing
on Mutex<CandleBackend> with extra text<->token overhead. ruvllm
2.2.0's serving stack is scaffolding for continuous batching,
not a working implementation.
Pivot: pool of N independent CandleBackend instances, each in its
own tokio::sync::Mutex, gated by a Semaphore. True request-level
parallelism — N requests run concurrently on different threads
with their own model weights + KV state.
Cost: N × ~640 MB Q4_K_M weights. With N=4 that's 2.5 GB on each
Pi 5; 8 GB total leaves ~5 GB for system + embed worker + KV.
Host build green. Smoke running async (b4j4csypc).
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 4: KV-cache statefulness blocks in-process parallelism
ADR-179 iter-16 bug reproduced under iter-3's N-backend pool wiring:
1st request → success, 2nd+ → broadcast shape mismatch from leaked
KV cache. Affects every backend slot in the pool independently —
in-process parallelism cannot work without an upstream ruvllm fix
that resets candle's LlamaModel cache between generate() calls.
Iter 5 pivots to deployment-level parallelism: N independent
ruvllm-pi-worker processes per Pi on adjacent ports, each handling
1 request at a time. Process boundaries enforce request isolation.
Projected aggregate: 4 Pis × 4 workers × 9 tok/s = 144 tok/s.
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 4: root cause = clear_kv_cache is a no-op for Llama
LlmBackend::generate calls self.clear_kv_cache() at start, but for
LoadedModelInner::Llama the impl only resets current_pos=0 and skips
the actual candle Cache (which holds ks/vs Tensor vecs that accumulate
across calls). The comment in candle_backend.rs:933 — "cache state
will be reset when we start from position 0" — is wrong: candle's
Cache doesn't auto-clear on position reset.
This is THE bug torpedoing every multi-request strategy:
- single Mutex<Backend>: 2nd request errors
- N-backend pool: each slot's 2nd request errors
- ServingEngine: same underlying generate() → same bug
Upstream fix path (ruvllm 2.2.1): store llama_config + dtype on
LoadedModel; clear_kv_cache builds a fresh Cache::new() for Llama
arm and replaces the held one. Worker pins 2.2.1, rebuilds, redeploys.
Iter 5 implements the patch.
Co-Authored-By: claude-flow <ruv@ruv.net>
* ruvllm 2.2.1: clear_kv_cache actually resets the Llama Cache
LoadedModelInner::Llama gained two carry fields (Config, DType) so
clear_kv_cache() can rebuild a fresh candle Cache for each new
generate() call. The previous impl only set current_pos=0 and
left the held Cache's ks/vs Tensor vecs untouched — they
accumulated across calls and broke every request after the first
("cannot broadcast [N,N] to [1,H,N,X]" with X = stale seq len).
This unblocks every multi-request strategy (single-Mutex backend,
N-backend pool, ServingEngine wiring) — request isolation now
works as the trait contract implies.
Workspace version: 2.2.0 → 2.2.1. Host builds green.
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 6: deploy ruvllm 2.2.1 cluster-wide; throughput plateau
ruvllm 2.2.1 + ruvllm-cli 2.2.1 published to crates.io (cache-reset fix).
aarch64 worker deployed to all 4 Pis with RUVLLM_MAX_INFLIGHT=4.
Cluster bench (Q4_K_M, 4 Pi × 16 in-flight):
16/16 success, 0 errors (cache-reset works)
aggregate ~16-21 tok/s depending on per-Pi inflight
Multi-inflight per Pi REGRESSES on Cortex-A76:
1 inflight × 16 tok: 21.6 tok/s — best
4 inflight × 4 tok: 16.5 tok/s — CPU contention
candle's matmul saturates Pi 5's 4 cores at 1 generate — extra parallel
calls fight for the same cores via context switching. Per-Pi single-
stream rate IS the ceiling on this hardware.
Win from 2.2.1: operational stability (no KV-leak errors across calls)
+ ability to sustain steady-state without worker restarts. Throughput
unchanged from ADR-179 SOTA.
Strike 1 on convergence (aggregate not exceeded). Iter 7 reverts pool
to N=1 + pivots to ADR-181 (in-tree pi_quant 3-bit weights for the
next jump).
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 7: CONVERGENCE — ruvllm 2.2.1 ships, throughput plateau confirmed
Final bench (4 Pi × 1 in-flight × 16 tok, ruvllm 2.2.1):
wall 2.88s, 64 actual tokens, 22.2 tok/s aggregate
vs iter-26 SOTA 20.5 → +8% (noise)
Strike 2 → converged. The real win is the upstream ruvllm 2.2.1
patch fixing the ADR-179 iter-16 KV-leak bug. Stability +
operational simplicity, throughput unchanged.
Per-Pi ceiling on Cortex-A76 + candle Q4_K_M is ~9 tok/s — hardware
bound (LPDDR4X memory bandwidth + 4-core CPU saturation). Multi-
inflight per Pi REGRESSES due to context switching. Next jumps need
ADR-181 (pi_quant 2-3 bit) or ADR-182 (Hailo-10 onboard DDR).
CronDelete done. Branch push + PR + email follow.
Co-Authored-By: claude-flow <ruv@ruv.net>
* ADR-180 iter 8: fix CI lint — clippy unused_variable + workspace rustfmt drift
Two CI failures on PR #424 blocking merge, both pre-existing drift surfaced
by my iter-3 changes (not new bugs):
1. clippy --all-targets -D warnings (cluster, default features):
unused variable: started — ruvllm-pi-worker.rs:270
`started` is only used inside the #[cfg(feature = "ruvllm-engine")]
timing block. Default cluster build (no feature) treated it as dead.
Fix: gate the let inside the cfg-true arm.
2. rustfmt --check across workspace:
- ruvllm-pi-worker.rs banner format!() + max_tokens chain (mine)
- candle_backend.rs:1244 load_from_hub return cfg arm (mine, ADR-179)
- mmwave-bridge.rs / ruview-csi-bridge.rs / ruvllm-bridge.rs (drift)
- tests/ruview_csi_bridge_cli.rs (drift)
- tests/ruvllm_bridge_cli.rs (drift)
Fix: cargo fmt -p ruvector-hailo-cluster -p ruvllm.
Local verification:
cargo fmt --check -p ruvector-hailo-cluster -p ruvllm → clean
cargo clippy -p ruvector-hailo-cluster --all-targets
-- -D warnings → clean
No behavioral change. Merge unblocker only.
Co-Authored-By: claude-flow <ruv@ruv.net>
---------
Co-authored-by: ruvnet <ruvnet@gmail.com>
RuvLLM CLI
Command-line interface for RuvLLM inference, optimized for Apple Silicon.
Installation
# From crates.io
cargo install ruvllm-cli
# From source (with Metal acceleration)
cargo install --path . --features metal
Commands
Download Models
Download models from HuggingFace Hub:
# Download Qwen with Q4K quantization (default)
ruvllm download qwen
# Download with specific quantization
ruvllm download qwen --quantization q8
ruvllm download mistral --quantization f16
# Force re-download
ruvllm download phi --force
# Download specific revision
ruvllm download llama --revision main
Model Aliases
| Alias | Model ID |
|---|---|
| qwen | Qwen/Qwen2.5-7B-Instruct |
| mistral | mistralai/Mistral-7B-Instruct-v0.3 |
| phi | microsoft/Phi-3-medium-4k-instruct |
| llama | meta-llama/Meta-Llama-3.1-8B-Instruct |
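For scripts that need the full HuggingFace model ID rather than the short alias, the table can be mirrored as a small lookup. This is only a sketch: `resolve_model_alias` is a hypothetical helper, not part of the CLI, and passing unknown names through unchanged is an assumption.

```shell
# Hypothetical lookup mirroring the alias table; not the CLI's own code.
# Unknown names are echoed back unchanged (an assumption, so that full
# HuggingFace model IDs pass through untouched).
resolve_model_alias() {
    case "$1" in
        qwen)    printf '%s\n' "Qwen/Qwen2.5-7B-Instruct" ;;
        mistral) printf '%s\n' "mistralai/Mistral-7B-Instruct-v0.3" ;;
        phi)     printf '%s\n' "microsoft/Phi-3-medium-4k-instruct" ;;
        llama)   printf '%s\n' "meta-llama/Meta-Llama-3.1-8B-Instruct" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}
```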
Quantization Options
| Option | Description | Memory Savings |
|---|---|---|
| q4k | 4-bit quantization (default) | ~75% |
| q8 | 8-bit quantization | ~50% |
| f16 | Half precision | ~50% |
| none | Full precision | 0% |
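The savings percentages depend on which precision is taken as the baseline, so absolute sizes can be easier to reason about. The sketch below estimates weight memory as parameters × bits per weight / 8. The helper name `estimate_weight_gb`, the 7-billion parameter count, and the nominal bit widths are illustrative assumptions. Real quantized formats also store per-block metadata, so actual files are somewhat larger.

```shell
# Hypothetical helper for back-of-the-envelope sizing; not part of ruvllm.
# Usage: estimate_weight_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Prints the approximate weight size in GB (1 GB = 10^9 bytes).
estimate_weight_gb() {
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# Nominal bit widths only; KV cache and activations are not included.
for bits in 4 8 16 32; do
    printf '%2s-bit: %s GB\n' "$bits" "$(estimate_weight_gb 7 "$bits")"
done
```

Under these assumptions a 7-billion-parameter model needs roughly 3.5 GB of weights at 4 bits and 14 GB at 16 bits.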
List Models
# List all available models
ruvllm list
# List only downloaded models
ruvllm list --downloaded
# Detailed listing with sizes
ruvllm list --long
Model Information
# Show model details
ruvllm info qwen
# Output includes:
# - Model architecture
# - Parameter count
# - Download status
# - Disk usage
# - Supported features
Interactive Chat
# Start chat with default settings
ruvllm chat qwen
# With custom system prompt
ruvllm chat qwen --system "You are a helpful coding assistant."
# Adjust generation parameters
ruvllm chat qwen --temperature 0.5 --max-tokens 1024
# Use specific quantization
ruvllm chat qwen --quantization q8
Chat Commands
During chat, use these commands:
| Command | Description |
|---|---|
| /help | Show available commands |
| /clear | Clear conversation history |
| /system <prompt> | Change system prompt |
| /temp <value> | Change temperature |
| /quit or /exit | Exit chat |
Start Server
OpenAI-compatible inference server:
# Start with defaults
ruvllm serve qwen
# Custom host and port
ruvllm serve qwen --host 0.0.0.0 --port 8080
# Configure concurrency
ruvllm serve qwen --max-concurrent 8 --max-context 8192
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List models |
| /health | GET | Health check |
Example Request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 256
}'
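When the message text comes from a variable, the JSON body has to be built with quoting in mind. The following sketch constructs the same request body as the example above. `json_escape` and `chat_payload` are hypothetical helpers, and the escaping covers only backslashes and double quotes, so input containing newlines or control characters needs a proper JSON tool instead.

```shell
# Hypothetical helpers; not part of ruvllm.
# Escape backslashes and double quotes so the text can sit inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Usage: chat_payload MODEL USER_MESSAGE MAX_TOKENS
# Prints a single-message chat completion request body on one line.
chat_payload() {
    printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}\n' \
        "$1" "$(json_escape "$2")" "$3"
}

chat_payload qwen 'Hello!' 256
```

The output can be passed to the request shown above with `-d "$(chat_payload qwen 'Hello!' 256)"`.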
Run Benchmarks
# Basic benchmark
ruvllm benchmark qwen
# Configure benchmark
ruvllm benchmark qwen \
--warmup 5 \
--iterations 20 \
--prompt-length 256 \
--gen-length 128
# Output formats
ruvllm benchmark qwen --format json
ruvllm benchmark qwen --format csv
Benchmark Metrics
- Prefill Latency: Time to process input prompt
- Decode Throughput: Tokens per second during generation
- Time to First Token (TTFT): Latency before first output token
- Memory Usage: Peak GPU/RAM consumption
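Decode throughput is the number of generated tokens divided by the time spent generating them. The sketch below does that arithmetic for a manual measurement. `tokens_per_second` is a hypothetical helper, not something the CLI provides, and the figures (64 tokens in 2.88 s) are only sample inputs.

```shell
# Hypothetical helper; not part of ruvllm.
# Usage: tokens_per_second TOKEN_COUNT ELAPSED_SECONDS
tokens_per_second() {
    awk -v tokens="$1" -v seconds="$2" 'BEGIN {
        # Guard against division by zero or a negative duration.
        if (seconds <= 0) {
            print "elapsed time must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", tokens / seconds
    }'
}

tokens_per_second 64 2.88
```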
Global Options
# Enable verbose logging
ruvllm --verbose <command>
# Disable colored output
ruvllm --no-color <command>
# Custom cache directory
ruvllm --cache-dir /path/to/cache <command>
# Or via environment variable
export RUVLLM_CACHE_DIR=/path/to/cache
Configuration
Cache Directory
Models are cached in:
- macOS: ~/Library/Caches/ruvllm
- Linux: ~/.cache/ruvllm
- Windows: %LOCALAPPDATA%\ruvllm
Override with --cache-dir or RUVLLM_CACHE_DIR.
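A wrapper script that needs to know where models will land can mirror this lookup. The sketch below is an assumption-laden illustration: `resolve_cache_dir` is a hypothetical helper, and letting an explicit flag value take precedence over the environment variable is a common CLI convention rather than something this README states. Windows is left out because the function targets POSIX shells.

```shell
# Hypothetical helper; not part of ruvllm.
# Usage: resolve_cache_dir [FLAG_VALUE]
# Assumed precedence: explicit flag value, then RUVLLM_CACHE_DIR, then the
# platform default listed above.
resolve_cache_dir() {
    flag_value="${1:-}"
    if [ -n "$flag_value" ]; then
        printf '%s\n' "$flag_value"
    elif [ -n "${RUVLLM_CACHE_DIR:-}" ]; then
        printf '%s\n' "$RUVLLM_CACHE_DIR"
    else
        case "$(uname -s)" in
            Darwin) printf '%s\n' "$HOME/Library/Caches/ruvllm" ;;
            *)      printf '%s\n' "$HOME/.cache/ruvllm" ;;
        esac
    fi
}
```

For example, `du -sh "$(resolve_cache_dir)"` reports how much disk space the cached models occupy.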
Logging
Set log level with RUST_LOG:
RUST_LOG=debug ruvllm chat qwen
RUST_LOG=ruvllm=trace ruvllm serve qwen
Examples
Basic Workflow
# 1. Download a model
ruvllm download qwen
# 2. Verify it's downloaded
ruvllm list --downloaded
# 3. Start chatting
ruvllm chat qwen
Server Deployment
# Download model first
ruvllm download qwen --quantization q4k
# Start server with production settings
ruvllm serve qwen \
--host 0.0.0.0 \
--port 8080 \
--max-concurrent 16 \
--max-context 4096 \
--quantization q4k
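Deployment scripts often need to wait until the server is ready before sending traffic. The sketch below polls the `/health` endpoint. It assumes the endpoint answers with a 2xx status once the server can take requests, which this README does not spell out. `wait_for_health` is a hypothetical helper.

```shell
# Hypothetical helper; not part of ruvllm.
# Usage: wait_for_health BASE_URL [MAX_ATTEMPTS]
# Returns 0 as soon as BASE_URL/health responds successfully, 1 otherwise.
wait_for_health() {
    base_url="$1"
    max_attempts="${2:-30}"
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -f: fail on HTTP status >= 400; -s: silent; -o: discard the body.
        if curl -fs -o /dev/null "$base_url/health"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep 1
    done
    return 1
}
```

With the server started in the background, `wait_for_health http://localhost:8080 60 && echo "server ready"` blocks for up to about a minute before giving up.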
Performance Testing
# Run comprehensive benchmarks
ruvllm benchmark qwen \
--warmup 10 \
--iterations 50 \
--prompt-length 512 \
--gen-length 256 \
--format json > benchmark_results.json
Troubleshooting
Out of Memory
# Use smaller quantization
ruvllm chat qwen --quantization q4k
# Or reduce context length
ruvllm serve qwen --max-context 2048
Slow Download
# Resume interrupted download
ruvllm download qwen
# Force fresh download
ruvllm download qwen --force
Metal Issues (macOS)
Ensure Metal is available:
# Check Metal device
system_profiler SPDisplaysDataType | grep Metal
# Try with CPU fallback
RUVLLM_NO_METAL=1 ruvllm chat qwen
Feature Flags
Build with specific features:
# Metal acceleration (macOS)
cargo install ruvllm-cli --features metal
# CUDA acceleration (NVIDIA)
cargo install ruvllm-cli --features cuda
# Both (if available)
cargo install ruvllm-cli --features "metal,cuda"
License
Apache-2.0 / MIT dual license.