ruvector/crates/ruvllm-cli
ruvnet f5c39e5bbe chore(ci): green security audit + split test job into 6 matrix shards
Unblocks the 7 stacked PRs (#381-#387) and turns `main`'s CI green
for the first time in days. Two issues fixed:

## Failure 1 — Security audit (was: 8 vulnerabilities)

`cargo audit` is now exit 0. 4 of the 5 critical advisories were
fixed by version bumps; only the unfixable one is ignored.

**Dep-bumped:**
- `rustls-webpki 0.101.7` + `0.103.10` → `0.103.13` via
  `cargo update -p rustls-webpki@0.103.10`. Patches:
    RUSTSEC-2026-0098 (URI name constraints)
    RUSTSEC-2026-0099 (wildcard name constraints)
    RUSTSEC-2026-0104 (CRL parsing panic)
- `idna 0.5.0` → `1.1.0` via `validator 0.18 → 0.20` in
  `examples/scipix`. Patches RUSTSEC-2024-0421 (Punycode acceptance).
- Bonus: `reqwest 0.11 → 0.12` (in `ruvector-core` + `examples/benchmarks`)
  and `hf-hub 0.3 → 0.4` (in `ruvector-core` + `ruvllm` +
  `ruvllm-cli`). Removes the entire legacy `rustls 0.21` /
  `rustls-webpki 0.101.7` subtree from the lockfile.

**Ignored** (single advisory, with rationale):
- `RUSTSEC-2023-0071` (rsa Marvin timing sidechannel) — no upstream
  fix available; we don't expose RSA decryption services. Documented
  in `.cargo/audit.toml`.

**Unmaintained warnings** (16 total — proc-macro-error, derivative,
instant, paste, bincode 1, pqcrypto-{kyber,dilithium}, rustls-pemfile 1,
rusttype, wee_alloc, number_prefix, rand_os, core2, lru, pprof, rand) —
each given a one-line justification in `.cargo/audit.toml` so CI stays
green on them while the team decides whether to chase upstream
replacements.

## Failure 2 — Tests timeout (was: 30-min job timeout cancellation)

`.github/workflows/ci.yml` `test` job is now a `matrix` with
`fail-fast: false` and `timeout-minutes: 45`. Six parallel shards
under `cargo nextest run` (installed via `taiki-e/install-action@v2`)
plus a separate `cargo test --doc` step (nextest doesn't run
doctests):

  | Shard            | Crates                                      |
  |------------------|---------------------------------------------|
  | vector-index     | rabitq, rulake, diskann, graph, gnn, cnn    |
  | rvagent          | 10 rvagent-* crates                         |
  | ruvix            | 16 ruvix-* crates                           |
  | ruqu-quantum     | 5 ruqu* crates                              |
  | ml-research      | attention, mincut, scipix, fpga-transformer,|
  |                  | sparse-inference, sparsifier, solver,       |
  |                  | graph-transformer, domain-expansion,        |
  |                  | robotics                                    |
  | core-and-rest    | --workspace minus the above                 |

`Swatinem/rust-cache@v2` is keyed per shard. Audit job switched to
`taiki-e/install-action` for `cargo-audit` (faster than
`cargo install --locked`).

## Verification

  cargo audit                                                   → exit 0
  cargo build --workspace --exclude ruvector-postgres           → clean
  cargo clippy --workspace --exclude ruvector-postgres --no-deps -- -D warnings → exit 0
  cargo fmt --all --check                                       → exit 0

## Cargo.lock churn

166-line diff, net ~120 lines removed (more deletions than
additions). Removed: `idna 0.5.0`, `rustls-webpki 0.101.7`,
`validator 0.18`, `validator_derive 0.18`, `proc-macro-error 1.0.4`.
Added: `rustls-webpki 0.103.13`, `validator 0.20`,
`proc-macro-error2`, `hf-hub 0.4.3`, `reqwest 0.12.28`. No
suspicious crates.

## Recommended merge order

1. **This PR first** — unblocks every other PR's CI.
2. After this lands and main is green, rebase the 7 open PRs
   (#381-#387) one at a time. The DiskANN stack (#383→#384→#385→#386)
   must merge in numeric order. #381 (Python SDK), #382 (research),
   #387 (graph property index) are independent and can merge in
   any order after their CI goes green on the rebase.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-04-26 00:17:25 -04:00

RuvLLM CLI

Command-line interface for RuvLLM inference, optimized for Apple Silicon.

Installation

# From crates.io
cargo install ruvllm-cli

# From source (with Metal acceleration)
cargo install --path . --features metal

Commands

Download Models

Download models from HuggingFace Hub:

# Download Qwen with Q4K quantization (default)
ruvllm download qwen

# Download with specific quantization
ruvllm download qwen --quantization q8
ruvllm download mistral --quantization f16

# Force re-download
ruvllm download phi --force

# Download specific revision
ruvllm download llama --revision main

Model Aliases

| Alias   | Model ID                              |
|---------|---------------------------------------|
| qwen    | Qwen/Qwen2.5-7B-Instruct              |
| mistral | mistralai/Mistral-7B-Instruct-v0.3    |
| phi     | microsoft/Phi-3-medium-4k-instruct    |
| llama   | meta-llama/Meta-Llama-3.1-8B-Instruct |

Quantization Options

| Option | Description                  | Memory Savings |
|--------|------------------------------|----------------|
| q4k    | 4-bit quantization (default) | ~75%           |
| q8     | 8-bit quantization           | ~50%           |
| f16    | Half precision               | ~50%           |
| none   | Full precision               | 0%             |
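
For scripted setups it can help to fetch several models in one pass. The following is a minimal sketch that uses only the `download` command and the `--quantization` flag shown above. The function name `download_models` is hypothetical and not part of the CLI.

```bash
# Hypothetical helper (not part of ruvllm): download several model aliases
# with the same quantization, continuing past individual failures.
# Usage: download_models <quantization> <alias>...
download_models() {
  local quant="$1"
  local model
  local failed=0
  shift
  for model in "$@"; do
    if ! ruvllm download "$model" --quantization "$quant"; then
      echo "download failed: $model" >&2
      failed=1
    fi
  done
  # Non-zero if at least one download failed.
  return "$failed"
}

# Example: download_models q8 qwen mistral
```

The helper keeps going after a failed download so that one unavailable model does not block the rest, and reports the overall result through its exit status.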

List Models

# List all available models
ruvllm list

# List only downloaded models
ruvllm list --downloaded

# Detailed listing with sizes
ruvllm list --long

Model Information

# Show model details
ruvllm info qwen

# Output includes:
# - Model architecture
# - Parameter count
# - Download status
# - Disk usage
# - Supported features

Interactive Chat

# Start chat with default settings
ruvllm chat qwen

# With custom system prompt
ruvllm chat qwen --system "You are a helpful coding assistant."

# Adjust generation parameters
ruvllm chat qwen --temperature 0.5 --max-tokens 1024

# Use specific quantization
ruvllm chat qwen --quantization q8

Chat Commands

During chat, use these commands:

| Command          | Description                |
|------------------|----------------------------|
| /help            | Show available commands    |
| /clear           | Clear conversation history |
| /system <prompt> | Change system prompt       |
| /temp <value>    | Change temperature         |
| /quit or /exit   | Exit chat                  |

Start Server

OpenAI-compatible inference server:

# Start with defaults
ruvllm serve qwen

# Custom host and port
ruvllm serve qwen --host 0.0.0.0 --port 8080

# Configure concurrency
ruvllm serve qwen --max-concurrent 8 --max-context 8192

API Endpoints

| Endpoint             | Method | Description      |
|----------------------|--------|------------------|
| /v1/chat/completions | POST   | Chat completions |
| /v1/completions      | POST   | Text completions |
| /v1/models           | GET    | List models      |
| /health              | GET    | Health check     |

Example Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256
  }'
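
When the prompt comes from a shell variable, quoting it inside a single-quoted JSON literal becomes awkward. The sketch below builds the same request body from arguments, assuming only the fields used in the example request. `json_escape` and `chat_body` are hypothetical helpers, and the escaping covers only backslashes and double quotes, not control characters such as newlines.

```bash
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Control characters (newlines, tabs) are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a chat-completions request body.
# Usage: chat_body <model> <prompt> [max_tokens]   (max_tokens defaults to 256)
chat_body() {
  local model="$1"
  local prompt="$2"
  local max_tokens="${3:-256}"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}\n' \
    "$(json_escape "$model")" "$(json_escape "$prompt")" "$max_tokens"
}

# Usage (requires a running server, e.g. `ruvllm serve qwen --port 8080`):
#   chat_body qwen 'Say "hi"' 64 |
#     curl -s http://localhost:8080/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

For prompts with arbitrary content, a dedicated JSON tool is a safer choice than hand-rolled escaping.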

Run Benchmarks

# Basic benchmark
ruvllm benchmark qwen

# Configure benchmark
ruvllm benchmark qwen \
  --warmup 5 \
  --iterations 20 \
  --prompt-length 256 \
  --gen-length 128

# Output formats
ruvllm benchmark qwen --format json
ruvllm benchmark qwen --format csv

Benchmark Metrics

  • Prefill Latency: Time to process input prompt
  • Decode Throughput: Tokens per second during generation
  • Time to First Token (TTFT): Latency before first output token
  • Memory Usage: Peak GPU/RAM consumption
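
Decode throughput as defined above is a simple ratio of generated tokens to decode time. As a worked example with made-up numbers (not measured results), 128 generated tokens over 3.2 seconds of decode time come out to 40 tokens per second. `decode_tps` is a hypothetical helper, not a CLI command.

```bash
# Hypothetical helper: decode throughput in tokens per second.
# Usage: decode_tps <generated_tokens> <decode_seconds>
decode_tps() {
  awk -v tokens="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", tokens / secs }'
}

# Illustrative input values only, not benchmark output.
decode_tps 128 3.2   # prints 40.0
```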

Global Options

# Enable verbose logging
ruvllm --verbose <command>

# Disable colored output
ruvllm --no-color <command>

# Custom cache directory
ruvllm --cache-dir /path/to/cache <command>

# Or via environment variable
export RUVLLM_CACHE_DIR=/path/to/cache

Configuration

Cache Directory

Models are cached in:

  • macOS: ~/Library/Caches/ruvllm
  • Linux: ~/.cache/ruvllm
  • Windows: %LOCALAPPDATA%\ruvllm

Override with --cache-dir or RUVLLM_CACHE_DIR.
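
Scripts that inspect or clean the cache need to resolve the same location. The sketch below mirrors the documented defaults for macOS and Linux. It assumes that an explicit `--cache-dir` value takes precedence over `RUVLLM_CACHE_DIR`, which the text above does not state. The function name `ruvllm_cache_dir` is hypothetical, and Windows is left out.

```bash
# Hypothetical helper: print the cache directory a script should expect.
# Usage: ruvllm_cache_dir [value passed to --cache-dir]
# Assumption: the --cache-dir value wins over RUVLLM_CACHE_DIR.
ruvllm_cache_dir() {
  local flag_value="${1:-}"
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "${RUVLLM_CACHE_DIR:-}" ]; then
    printf '%s\n' "$RUVLLM_CACHE_DIR"
  else
    case "$(uname -s)" in
      Darwin) printf '%s\n' "$HOME/Library/Caches/ruvllm" ;;  # macOS default
      *)      printf '%s\n' "$HOME/.cache/ruvllm" ;;          # Linux default
    esac
  fi
}

# Example: du -sh "$(ruvllm_cache_dir)"
```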

Logging

Set log level with RUST_LOG:

RUST_LOG=debug ruvllm chat qwen
RUST_LOG=ruvllm=trace ruvllm serve qwen

Examples

Basic Workflow

# 1. Download a model
ruvllm download qwen

# 2. Verify it's downloaded
ruvllm list --downloaded

# 3. Start chatting
ruvllm chat qwen

Server Deployment

# Download model first
ruvllm download qwen --quantization q4k

# Start server with production settings
ruvllm serve qwen \
  --host 0.0.0.0 \
  --port 8080 \
  --max-concurrent 16 \
  --max-context 4096 \
  --quantization q4k
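
A deployment script that sends requests right after starting the server may race the model load. One way to avoid that is to poll the `/health` endpoint listed under API Endpoints. The sketch below assumes that `/health` answers with a 2xx status once the server is ready, which this README does not specify. `wait_for_health` is a hypothetical helper.

```bash
# Hypothetical helper: poll <base_url>/health until it succeeds or attempts run out.
# Usage: wait_for_health <base_url> [attempts] [delay_seconds]
wait_for_health() {
  local base_url="$1"
  local attempts="${2:-30}"
  local delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: non-zero exit on HTTP 4xx/5xx; -s: no progress output; -m 2: 2-second timeout.
    if curl -fs -m 2 -o /dev/null "$base_url/health"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage:
#   ruvllm serve qwen --host 0.0.0.0 --port 8080 &
#   wait_for_health http://localhost:8080 60 2 && curl -s http://localhost:8080/v1/models
```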

Performance Testing

# Run comprehensive benchmarks
ruvllm benchmark qwen \
  --warmup 10 \
  --iterations 50 \
  --prompt-length 512 \
  --gen-length 256 \
  --format json > benchmark_results.json

Troubleshooting

Out of Memory

# Use a smaller quantization (q4k is the smallest option listed above)
ruvllm chat qwen --quantization q4k

# Or reduce context length
ruvllm serve qwen --max-context 2048

Slow Download

# Resume interrupted download
ruvllm download qwen

# Force fresh download
ruvllm download qwen --force

Metal Issues (macOS)

Ensure Metal is available:

# Check Metal device
system_profiler SPDisplaysDataType | grep Metal

# Try with CPU fallback
RUVLLM_NO_METAL=1 ruvllm chat qwen
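
The two steps above can be combined into a wrapper that falls back to CPU automatically. This is a sketch built only from the `system_profiler` check and the `RUVLLM_NO_METAL` variable shown above. The function names `metal_available` and `run_ruvllm` are hypothetical.

```bash
# Hypothetical helper: succeed only if system_profiler exists and reports Metal.
metal_available() {
  command -v system_profiler >/dev/null 2>&1 &&
    system_profiler SPDisplaysDataType 2>/dev/null | grep -q Metal
}

# Hypothetical wrapper: run ruvllm normally when Metal is reported,
# otherwise set RUVLLM_NO_METAL=1 for that one invocation.
# Usage: run_ruvllm chat qwen
run_ruvllm() {
  if metal_available; then
    ruvllm "$@"
  else
    RUVLLM_NO_METAL=1 ruvllm "$@"
  fi
}
```

On systems without `system_profiler` (anything other than macOS) the wrapper always sets `RUVLLM_NO_METAL=1`. That is assumed to be harmless there, but it is not confirmed by this README.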

Feature Flags

Build with specific features:

# Metal acceleration (macOS)
cargo install ruvllm-cli --features metal

# CUDA acceleration (NVIDIA)
cargo install ruvllm-cli --features cuda

# Both (if available)
cargo install ruvllm-cli --features "metal,cuda"

License

Dual-licensed under Apache-2.0 and MIT.