mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-22 19:56:25 +00:00

RuvLLM CLI

Command-line interface for RuvLLM inference, optimized for Apple Silicon.

Installation

# From crates.io
cargo install ruvllm-cli

# From source (with Metal acceleration)
cargo install --path . --features metal

Commands

Download Models

Download models from HuggingFace Hub:

# Download Qwen with Q4K quantization (default)
ruvllm download qwen

# Download with specific quantization
ruvllm download qwen --quantization q8
ruvllm download mistral --quantization f16

# Force re-download
ruvllm download phi --force

# Download specific revision
ruvllm download llama --revision main

Model Aliases

Alias	Model ID
`qwen`	`Qwen/Qwen2.5-7B-Instruct`
`mistral`	`mistralai/Mistral-7B-Instruct-v0.3`
`phi`	`microsoft/Phi-3-medium-4k-instruct`
`llama`	`meta-llama/Meta-Llama-3.1-8B-Instruct`

Quantization Options

Option	Description	Memory Savings (vs. full precision)
`q4k`	4-bit quantization (default)	~85%
`q8`	8-bit quantization	~75%
`f16`	Half precision	~50%
`none`	Full precision	0%
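
Weight memory scales with the number of bits stored per weight. The sketch below estimates it as parameters × bits / 8, assuming nominal widths of 4, 8, 16, and 32 bits and ignoring per-block scale metadata, the KV cache, and activations. The helper name `estimate_weights_gb` is hypothetical and is not part of ruvllm.

```shell
# Hypothetical helper (not part of ruvllm): estimate weight memory in GB (10^9 bytes).
# $1 = parameter count in billions, $2 = bits per weight.
estimate_weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# A 7B-parameter model at each nominal width (assumed widths, see above).
for bits in 4 8 16 32; do
  printf '%2s-bit: %s GB\n' "$bits" "$(estimate_weights_gb 7 "$bits")"
done
```

Real quantized files are somewhat larger than this estimate, because block-quantized formats store scale factors alongside the weights.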

List Models

# List all available models
ruvllm list

# List only downloaded models
ruvllm list --downloaded

# Detailed listing with sizes
ruvllm list --long

Model Information

# Show model details
ruvllm info qwen

# Output includes:
# - Model architecture
# - Parameter count
# - Download status
# - Disk usage
# - Supported features

Interactive Chat

# Start chat with default settings
ruvllm chat qwen

# With custom system prompt
ruvllm chat qwen --system "You are a helpful coding assistant."

# Adjust generation parameters
ruvllm chat qwen --temperature 0.5 --max-tokens 1024

# Use specific quantization
ruvllm chat qwen --quantization q8

Chat Commands

During chat, use these commands:

Command	Description
`/help`	Show available commands
`/clear`	Clear conversation history
`/system <prompt>`	Change system prompt
`/temp <value>`	Change temperature
`/quit` or `/exit`	Exit chat

Start Server

Start an OpenAI-compatible inference server:

# Start with defaults
ruvllm serve qwen

# Custom host and port
ruvllm serve qwen --host 0.0.0.0 --port 8080

# Configure concurrency
ruvllm serve qwen --max-concurrent 8 --max-context 8192

API Endpoints

Endpoint	Method	Description
`/v1/chat/completions`	POST	Chat completions
`/v1/completions`	POST	Text completions
`/v1/models`	GET	List models
`/health`	GET	Health check

Example Request

Assuming the server is listening on port 8080 (see the custom host and port example above):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256
  }'

Run Benchmarks

# Basic benchmark
ruvllm benchmark qwen

# Configure benchmark
ruvllm benchmark qwen \
  --warmup 5 \
  --iterations 20 \
  --prompt-length 256 \
  --gen-length 128

# Output formats
ruvllm benchmark qwen --format json
ruvllm benchmark qwen --format csv

Benchmark Metrics

Prefill Latency: Time to process input prompt
Decode Throughput: Tokens per second during generation
Time to First Token (TTFT): Latency before first output token
Memory Usage: Peak GPU/RAM consumption
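
Decode throughput is the number of generated tokens divided by the time spent in the decode phase. The sketch below shows that arithmetic for timings you have measured yourself. The helper name `decode_tps` is hypothetical, and this is not how `ruvllm benchmark` is implemented internally.

```shell
# Hypothetical helper: tokens per second = generated tokens / decode seconds.
# $1 = number of generated tokens, $2 = decode time in seconds.
decode_tps() {
  awk -v tokens="$1" -v seconds="$2" 'BEGIN {
    if (seconds <= 0) { print "seconds must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", tokens / seconds
  }'
}

# Example: 128 tokens generated in 6.4 s of decode time.
decode_tps 128 6.4
```

Prefill time is deliberately excluded here, so a long prompt raises time to first token without lowering the decode figure.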

Global Options

# Enable verbose logging
ruvllm --verbose <command>

# Disable colored output
ruvllm --no-color <command>

# Custom cache directory
ruvllm --cache-dir /path/to/cache <command>

# Or via environment variable
export RUVLLM_CACHE_DIR=/path/to/cache

Configuration

Cache Directory

Models are cached in:

macOS: ~/Library/Caches/ruvllm
Linux: ~/.cache/ruvllm
Windows: %LOCALAPPDATA%\ruvllm

Override with --cache-dir or RUVLLM_CACHE_DIR.
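
The lookup order can be sketched as a small shell function. This is an illustration only: it assumes the `--cache-dir` flag takes precedence over `RUVLLM_CACHE_DIR` (the README does not state the order), it covers only the macOS and Linux defaults, and the name `resolve_cache_dir` is hypothetical.

```shell
# Hypothetical sketch of the lookup order; the real CLI resolves this internally.
# $1 = value passed via --cache-dir (may be empty).
resolve_cache_dir() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"                       # explicit flag (assumed to win)
  elif [ -n "${RUVLLM_CACHE_DIR:-}" ]; then
    printf '%s\n' "$RUVLLM_CACHE_DIR"        # environment override
  else
    case "$(uname -s)" in
      Darwin) printf '%s\n' "$HOME/Library/Caches/ruvllm" ;;  # macOS default
      *)      printf '%s\n' "$HOME/.cache/ruvllm" ;;          # Linux default
    esac
  fi
}

# Print the directory that would be used with no --cache-dir flag.
resolve_cache_dir ""
```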

Logging

Set log level with RUST_LOG:

RUST_LOG=debug ruvllm chat qwen
RUST_LOG=ruvllm=trace ruvllm serve qwen

Examples

Basic Workflow

# 1. Download a model
ruvllm download qwen

# 2. Verify it's downloaded
ruvllm list --downloaded

# 3. Start chatting
ruvllm chat qwen

Server Deployment

# Download model first
ruvllm download qwen --quantization q4k

# Start server with production settings
ruvllm serve qwen \
  --host 0.0.0.0 \
  --port 8080 \
  --max-concurrent 16 \
  --max-context 4096 \
  --quantization q4k
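
A deployment script usually needs to wait until the server accepts requests before sending traffic. The sketch below polls once per second with a bounded number of attempts. It assumes that `/health` returns a successful HTTP status once the server is ready, which the README does not spell out, and the helper name `wait_until` is hypothetical.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the attempt budget is exhausted.
# $1 = maximum attempts, remaining arguments = command to run.
wait_until() {
  attempts="$1"; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0        # command succeeded
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1            # gave up after $attempts tries
}

# Usage (assumes the server above is listening on port 8080):
# wait_until 60 curl -fsS http://localhost:8080/health && echo "server ready"
```

The `-f` flag makes curl exit non-zero on HTTP error statuses, so the loop keeps polling until the endpoint responds successfully.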

Performance Testing

# Run comprehensive benchmarks
ruvllm benchmark qwen \
  --warmup 10 \
  --iterations 50 \
  --prompt-length 512 \
  --gen-length 256 \
  --format json > benchmark_results.json

Troubleshooting

Out of Memory

# Use the smallest quantization (q4k, the default)
ruvllm chat qwen --quantization q4k

# Or reduce context length
ruvllm serve qwen --max-context 2048

Slow Download

# Resume an interrupted download by re-running the command
ruvllm download qwen

# Force fresh download
ruvllm download qwen --force

Metal Issues (macOS)

Ensure Metal is available:

# Check Metal device
system_profiler SPDisplaysDataType | grep Metal

# Try with CPU fallback
RUVLLM_NO_METAL=1 ruvllm chat qwen

Feature Flags

Build with specific features:

# Metal acceleration (macOS)
cargo install ruvllm-cli --features metal

# CUDA acceleration (NVIDIA)
cargo install ruvllm-cli --features cuda

# Both (if available)
cargo install ruvllm-cli --features "metal,cuda"
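
Which feature to enable depends on the machine. The sketch below picks one with two simple heuristics, both assumptions: macOS implies Metal, and the presence of an `nvidia-smi` binary implies CUDA. The helper name `pick_features` is hypothetical, and the script only prints the install command rather than running it.

```shell
# Hypothetical helper: suggest a --features value for the current machine.
# Heuristics (assumptions): macOS implies Metal; an nvidia-smi binary implies CUDA.
pick_features() {
  if [ "$(uname -s)" = "Darwin" ]; then
    echo "metal"
  elif command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"
  else
    echo ""           # no accelerator detected: build with default features
  fi
}

features="$(pick_features)"
if [ -n "$features" ]; then
  echo "cargo install ruvllm-cli --features $features"
else
  echo "cargo install ruvllm-cli"
fi
```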

License

Apache-2.0 / MIT dual license.