mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-26 16:04:02 +00:00

Reuven 383ff5e99f perf(ruvllm): optimize MoE routing with buffer reuse and optional metrics

P0: Router buffer reuse optimization
- Add pre-allocated result_buffer to MemoryAwareRouter
- Eliminate collect() allocation in select_top_k_buffered()
- Use std::mem::take for zero-copy buffer handoff
- Expected savings: 1-2µs per routing call

P1: Optional routing metrics feature flag
- Add 'routing-metrics' feature (enabled by default)
- Conditionally compile Instant::now() and metrics tracking
- Allows production builds to avoid syscall overhead (~0.04-0.08µs)

Performance Analysis Documentation:
- MoE routing optimization analysis report
- Comprehensive architecture review (5 documents)
- Identifies 8 additional optimization opportunities

ADR-092 targets: <10µs routing latency, 70%+ cache hit rate
All 26 MoE router tests pass.

Co-Authored-By: claude-flow <ruv@ruv.net>

2026-03-12 23:27:00 -04:00

RuvLLM Optimization Checklist

Quick reference for implementing performance improvements.

PHASE 1: High-Impact Quick Wins (1-2 weeks)

1. Fix Default Features (15-25% faster builds)

Status: ⚠️ NOT STARTED Effort: 30 minutes Impact: 30-45 second faster builds

Change Cargo.toml:

[features]
-default = ["async-runtime", "candle"]
+default = []
+
+# Full inference stack (heavy)
+full = ["async-runtime", "candle", "tokenizers", "hf-hub"]

# Keep existing feature definitions

Rationale:

Candle = 35MB compiled code
Tokio = 5MB runtime
Users who build RuvLLM as a library are forced to compile all of this
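A minimal sketch of how code behind these features could be gated once `default = []` is in place. The feature names come from the Cargo.toml snippet above; `compiled_features` and `load_candle_backend` are hypothetical helpers used only for illustration, not part of RuvLLM.

```rust
/// Lists the optional subsystems that were compiled into this build.
/// Hypothetical helper: the feature names mirror the Cargo.toml entries.
pub fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(feature = "async-runtime") {
        enabled.push("async-runtime");
    }
    if cfg!(feature = "candle") {
        enabled.push("candle");
    }
    enabled
}

/// Exists only when the `candle` feature is on, so library users who
/// leave the feature off never compile the heavy backend code.
#[cfg(feature = "candle")]
pub fn load_candle_backend() -> &'static str {
    "candle backend"
}
```

With no features enabled, `compiled_features()` returns an empty list and `load_candle_backend` is not compiled at all, which is where the build-time saving would come from.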

Validation:

# Before
cargo clean && time cargo build --release
# Expected: ~180 seconds

# After
cargo clean && time cargo build --release
# Expected: ~140 seconds

2. Reduce Re-export Bloat (8-12% speedup)

Status: ⚠️ NOT STARTED Effort: 2-3 hours Impact: 15-25 second faster type checking

File: crates/ruvllm/src/lib.rs (lines 158-520)

Current State: 362 public re-exports

Action:

Identify most commonly imported items (~50 items)
Keep only high-value re-exports:

// TOP-LEVEL PUBLIC API (keep these)
pub use backends::{LlmBackend, GenerateParams};
pub use serving::{ServingEngine, ServingEngineConfig};
pub use session::{Session, SessionManager};
pub use sona::SonaIntegration;
pub use ruvector_integration::RuvectorIntegration;
pub use error::{Result, RuvLLMError};

// EVERYTHING ELSE: Remove from lib.rs
// Users import directly: use ruvllm::backends::CandleBackend;

Expose the modules themselves so that the direct imports above resolve. A wrapper such as pub mod backends { pub use crate::backends::*; } would refer to itself, so declare the modules directly:

pub mod backends;
pub mod models;
// etc.
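The resulting layout can be sketched in a single file: a small curated set of top-level re-exports, with everything else reached through its module path. Module and type names follow the checklist; the trait method, the `max_tokens` field, and `describe_backend` are hypothetical placeholders.

```rust
// Stand-in for lib.rs. In the real crate `backends` would be a file module.
pub mod backends {
    pub trait LlmBackend {
        fn name(&self) -> &'static str;
    }

    #[derive(Debug, Default)]
    pub struct GenerateParams {
        pub max_tokens: usize, // hypothetical field
    }

    pub struct CandleBackend;

    impl LlmBackend for CandleBackend {
        fn name(&self) -> &'static str {
            "candle"
        }
    }
}

// Curated top-level API: only the most commonly imported items.
pub use backends::{GenerateParams, LlmBackend};

/// Less common items are not re-exported; callers name the module path.
pub fn describe_backend() -> &'static str {
    let backend = backends::CandleBackend;
    backend.name() // trait is in scope through the top-level re-export
}
```

External users would write `use ruvllm::LlmBackend;` for the curated items and `use ruvllm::backends::CandleBackend;` for the rest.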

Validation:

cargo build --lib && wc -l src/lib.rs
# Before: ~994 lines
# After: ~200-300 lines

3. Add Development Profile (4-5x faster dev builds)

Status: ⚠️ NOT STARTED Effort: 15 minutes Impact: 35 seconds → 7 seconds on incremental builds

File: Cargo.toml (workspace root)

Add to workspace:

[profile.release-fast]
inherits = "release"
lto = "thin"              # Instead of "fat"
codegen-units = 16       # Instead of 1
opt-level = 2            # Instead of 3

[profile.release-dev]
inherits = "release-fast"
debug = true             # Include symbols for profiling

Usage:

# Development: 4-5x faster builds
cargo build --profile release-fast

# Production: Optimal performance (use current "release" profile)
cargo build --release

# Benchmarking: Include symbols
cargo build --profile release-dev

Trade-offs:

Profile         Compile Time    Binary Size    Runtime Speed
release         180s            45MB           100% (optimal)
release-fast    40s             48MB           97%
release-dev     42s             65MB           97%

4. Document Unsafe Code (Code Quality)

Status: ⚠️ NOT STARTED Effort: 1 hour Impact: Code clarity, maintainability

Files with unsafe missing SAFETY comments:

crates/ruvllm/src/kernels/attention.rs (lines 439, 461, 701, 784, 846)
crates/ruvllm/src/quantize/pi_quant_simd.rs (multiple)
crates/ruvllm/src/kernels/norm.rs
crates/ruvllm/src/memory_pool.rs

Template for each unsafe block:

// SAFETY: [explain why this is safe]
// - [precondition 1]
// - [precondition 2]
// - [justification]
unsafe {
    // ... code
}

Example:

// SAFETY: q_ptr points to a valid f32 array with length >= len.
// The loop guarantees i + 4 <= len before each load, so the four
// lanes read by vld1q_f32 stay in bounds. vld1q_f32 does not
// require 16-byte alignment, only a valid f32 pointer.
unsafe {
    let v0 = vld1q_f32(q_ptr.add(i));
}
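For a version that runs on any target, the same template can be applied to a plain raw-pointer loop. This is a sketch under stated assumptions: `sum_f32` is a hypothetical function, not taken from the RuvLLM kernels, and it uses scalar reads instead of NEON intrinsics.

```rust
/// Sums a slice four elements at a time through a raw pointer.
/// Hypothetical example of a fully documented unsafe block.
pub fn sum_f32(values: &[f32]) -> f32 {
    let len = values.len();
    let ptr = values.as_ptr();
    let mut total = 0.0f32;
    let mut i = 0;

    while i + 4 <= len {
        // SAFETY: `ptr` comes from `values`, so it is valid for `len` reads.
        // - The loop condition guarantees i + 4 <= len, so offsets
        //   i, i + 1, i + 2, i + 3 are all in bounds.
        // - `values` is borrowed for the whole call, so the memory cannot
        //   be freed or mutated while it is being read.
        unsafe {
            total += *ptr.add(i) + *ptr.add(i + 1) + *ptr.add(i + 2) + *ptr.add(i + 3);
        }
        i += 4;
    }

    // The remaining 0-3 elements use safe indexing.
    for &v in &values[i..] {
        total += v;
    }
    total
}
```

Each bullet in the comment names one precondition that a reviewer can check against the surrounding code, which is the point of the template.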

PHASE 2: Medium-Impact Improvements (2-4 weeks)

5. Split Large Files (5-8% faster incremental builds)

Status: ⚠️ NOT STARTED Effort: 4-6 hours Impact: Better compile parallelism

Files to split:

5a. `autodetect.rs` (1,944 lines) → `autodetect/`

autodetect/
├── mod.rs             (100 lines - main types)
├── system.rs          (400 lines - SystemCapabilities)
├── cpu.rs             (600 lines - CPU feature detection)
├── gpu.rs             (500 lines - GPU capabilities)
└── inference.rs       (344 lines - InferenceConfig)

Steps:

Create src/autodetect/ directory
Move logic to submodules
Update imports in lib.rs

5b. `memory_pool.rs` (1,703 lines) → `memory_pool/`

memory_pool/
├── mod.rs             (100 lines - main interface)
├── arena.rs           (400 lines - ArenaAllocator)
├── buffer_pool.rs     (500 lines - BufferPool)
└── scratch.rs         (703 lines - ScratchSpaceManager)

5c. `kv_cache.rs` (1,527 lines) → `kv_cache/`

kv_cache/
├── mod.rs             (100 lines - main interface)
├── pooled.rs          (500 lines - PooledKvCache)
├── two_tier.rs        (600 lines - TwoTierKvCache)
└── stats.rs           (327 lines - KvCacheStats)
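One way to keep import paths stable during such a split is to re-export the moved types from `mod.rs`. The sketch below uses inline modules to stand in for the separate files. The type names `PooledKvCache` and `KvCacheStats` come from the plan above, while the fields and methods are hypothetical.

```rust
pub mod kv_cache {
    // Would live in kv_cache/stats.rs.
    mod stats {
        #[derive(Debug, Default, Clone, PartialEq)]
        pub struct KvCacheStats {
            pub hits: u64,
            pub misses: u64,
        }
    }

    // Would live in kv_cache/pooled.rs.
    mod pooled {
        use super::stats::KvCacheStats;

        #[derive(Default)]
        pub struct PooledKvCache {
            keys: Vec<u64>,
            stats: KvCacheStats,
        }

        impl PooledKvCache {
            pub fn insert(&mut self, key: u64) {
                self.keys.push(key);
            }

            /// Looks up a key and records a hit or a miss.
            pub fn contains(&mut self, key: u64) -> bool {
                let found = self.keys.contains(&key);
                if found {
                    self.stats.hits += 1;
                } else {
                    self.stats.misses += 1;
                }
                found
            }

            pub fn stats(&self) -> &KvCacheStats {
                &self.stats
            }
        }
    }

    // kv_cache/mod.rs: re-exports keep `kv_cache::PooledKvCache` resolving
    // exactly as it did when everything lived in kv_cache.rs.
    pub use pooled::PooledKvCache;
    pub use stats::KvCacheStats;
}
```

Callers keep writing `kv_cache::PooledKvCache`, so the split should not require changes outside the module.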

Validation:

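# Check that no files exceed 600 lines
find src -name "*.rs" -exec wc -l {} \; | awk '$1 > 600 {print}'

# Should output nothing. As planned above, scratch.rs (703 lines) would
# still be listed, so it needs one more split to meet this target.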

6. Make Tokenizer PCRE Optional (10-15% for lib users)

Status: ⚠️ NOT STARTED Effort: 1 hour Impact: 8-18MB savings for users not needing tokenization

File: crates/ruvllm/Cargo.toml

Current:

tokenizers = { version = "0.20", optional = true, default-features = false, features = ["onig"] }

Problem: The onig feature pulls in the Oniguruma regex engine (a C library, called PCRE in this checklist), which adds ~10MB

Options:

Option A: Make PCRE Optional

tokenizers = { version = "0.20", optional = true, default-features = false }
# Remove onig feature - uses lightweight regex

[features]
tokenizers-pcre = ["tokenizers/onig"]  # Optional PCRE

Option B: Provide Regex Lightweight Alternative

# Use regex-lite instead of onig when not needed
regex-lite = { version = "0.1", optional = true }

[features]
tokenizers-full = ["tokenizers/onig"]             # Heavy PCRE
tokenizers-lite = ["tokenizers", "regex-lite"]    # Lightweight

Analysis:

tokenizers with onig (PCRE): 18MB compiled
tokenizers (no onig): 8MB compiled
regex-lite: 0.5MB compiled
Savings: 10MB for each user who does not need full tokenization

7. Move Always-Used Dependencies to Required

Status: ⚠️ NOT STARTED Effort: 30 minutes Impact: 2-3% code cleanup

File: crates/ruvllm/Cargo.toml

Current: Listed as optional but always imported

tokenizers = { version = "0.20", optional = true, ... }  # Used in tokenizer.rs
hf-hub = { version = "0.3", optional = true, ... }      # Used in hub/

Audit Results:

tokenizers: Imported in tokenizer.rs (module pub) → ALWAYS USED
hf-hub: Imported in hub/download.rs → ALWAYS USED (if hub enabled)
rayon: Only used in kernels/accelerate.rs → OK to keep optional

Action:

[dependencies]
-tokenizers = { version = "0.20", optional = true, ... }
+tokenizers = { version = "0.20", default-features = false, features = ["onig"] }

-hf-hub = { version = "0.3", optional = true, ... }
+hf-hub = { version = "0.3", features = ["tokio"] }

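[features]
# Remove these
-hub = ["hf-hub"]  # No longer optional
# Also drop "tokenizers" and "hf-hub" from any feature list (e.g. full in step 1):
# Cargo only accepts optional dependencies or other features as feature entries.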

8. Reduce Clippy Allowlist (Code Quality)

Status: ⚠️ NOT STARTED Effort: 2-3 hours Impact: Better code quality signals

File: crates/ruvllm/src/lib.rs (lines 41-112)

Current: 72 lint suppressions Target: 8-12 suppressions

Strategy:

Remove global allows (move to specific modules)
Investigate root causes:
- too_many_arguments: Method signature issue?
- type_complexity: Over-generic code?
- unused_*: Dead code?
Add module-level allows only where unavoidable:

// lib.rs: Only essential global allows
#![allow(missing_docs)]
#![warn(clippy::all)]

// backends/mod.rs: Only what's needed
#![allow(clippy::too_many_arguments)]

// quantize/mod.rs: Only what's needed
#![allow(clippy::type_complexity)]
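Where even a module-level allow is broader than needed, the suppression can sit on the single item that triggers the lint. A minimal sketch, assuming a hypothetical `SamplerConfig` type and `sampler_config` constructor that are not part of RuvLLM:

```rust
/// Hypothetical sampling configuration used only for illustration.
#[derive(Debug, PartialEq)]
pub struct SamplerConfig {
    pub temperature: f32,
    pub top_p: f32,
    pub top_k: usize,
    pub max_tokens: usize,
    pub repetition_penalty: f32,
    pub presence_penalty: f32,
    pub frequency_penalty: f32,
    pub seed: u64,
}

// The allow is scoped to this one function; every other function in the
// crate is still checked by clippy::too_many_arguments.
#[allow(clippy::too_many_arguments)]
pub fn sampler_config(
    temperature: f32,
    top_p: f32,
    top_k: usize,
    max_tokens: usize,
    repetition_penalty: f32,
    presence_penalty: f32,
    frequency_penalty: f32,
    seed: u64,
) -> SamplerConfig {
    SamplerConfig {
        temperature,
        top_p,
        top_k,
        max_tokens,
        repetition_penalty,
        presence_penalty,
        frequency_penalty,
        seed,
    }
}
```

An item-level allow also documents exactly which signature is the exception, which makes it easier to revisit later (for example by switching to a builder).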

Validation:

# Check final allowlist
grep '#!\[allow' src/lib.rs | wc -l
# Target: <= 12

# Run clippy
cargo clippy --all-targets
# Fix or locally allow any warnings that surface once the global allows are gone

PHASE 3: Testing & Validation (1 week)

9. Benchmark Before/After

Status: ⚠️ NOT STARTED Effort: 2 hours Impact: Quantify improvements

Benchmark Script:

#!/bin/bash
# benchmark.sh

echo "=== PHASE 1 + 2 OPTIMIZATION BENCHMARKS ==="

# Clean builds (release runs last so target/release exists for the steps below)
for profile in release-fast release; do
    echo ""
    echo "Profile: $profile"
    cargo clean
    time cargo build --profile "$profile" 2>&1 | tail -1
done

# Incremental builds
echo ""
echo "Incremental build (touch one file):"
touch src/lib.rs
time cargo build --release 2>&1 | tail -1

# Binary sizes
echo ""
echo "Binary sizes:"
ls -lh target/release/libruvllm.* | awk '{print $5, $9}'

# Check re-export count
echo ""
echo "Re-export count:"
grep '^pub use' src/lib.rs | wc -l

10. Performance Regression Testing

Status: ⚠️ NOT STARTED Effort: 2 hours Impact: Ensure no runtime degradation

Test:

# Run existing benchmarks
cargo bench --bench serving_bench
cargo bench --bench e2e_bench
cargo bench --bench metal_bench

# Compare before/after
# Expected: <2% difference
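The before/after comparison can be made mechanical. The sketch below compares the median of two sets of benchmark samples against the 2% budget stated above. The function names and the choice of the median as the summary statistic are assumptions for illustration; the benchmark harness itself is not shown.

```rust
/// Median of a set of benchmark samples (for example ms per iteration).
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Relative change of `after` versus `before` in percent.
/// Positive values mean the new build is slower.
pub fn change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Returns Some(true) when the medians differ by less than `max_pct`,
/// and None when either sample set is empty.
pub fn within_budget(before: &[f64], after: &[f64], max_pct: f64) -> Option<bool> {
    let b = median(before)?;
    let a = median(after)?;
    Some(change_pct(b, a).abs() < max_pct)
}
```

For example, medians of 100 ms and 101 ms differ by 1%, which passes a 2% budget, while 100 ms versus 110 ms does not.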

11. Documentation Updates

Status: ⚠️ NOT STARTED Effort: 1 hour Impact: User guidance

Update:

README.md: Build time expectations
CARGO_FEATURES.md: Feature selection guide
Add build profile documentation
Update unsafe code documentation in modules

Quick Reference: Before/After

Build Times

                  Before    After    Improvement
Full Release      180s      140s     22% faster
Incremental       8s        6s       25% faster
Dev Builds        180s      40s      78% faster

Binary Size

                  Before    After    Improvement
Release (opt)     45MB      42MB     7% smaller
Debug symbols     65MB      62MB     5% smaller
Library only      ~30MB     ~27MB    10% smaller

Code Quality

                  Before    After    Improvement
Clippy allows     72        12       83% reduction
SAFETY comments   37/45     45/45    100% coverage
Max file size     1944      600      69% reduction

Implementation Order

Week 1:
- Fix default features
- Add release-fast profile
- Document unsafe code
- Reduce re-export bloat
Week 2-3:
- Split large files
- Make tokenizer PCRE optional
- Reduce clippy allowlist
- Move optional deps to required
Week 4:
- Benchmark & validate
- Performance regression testing
- Documentation updates
- Merge to main

Rollback Plan

Each change is independent and easily reversible:

# If any change causes issues:
git revert <commit-hash>

# Most risky changes:
# 1. Moving optional deps to required (test compatibility)
# 2. Splitting large files (test import paths)
# 3. Reducing re-exports (test external API)

Success Criteria

Build time reduction: >20% (target: 25%)
Binary size reduction: >5% (target: 8%)
Code quality improvement: Clippy allows <15 (target: 12)
No performance regression: <1% (max 2%)
All tests passing
Documentation updated
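These criteria can be expressed as a single check. The sketch below is a hypothetical helper, not part of the repository: the `Results` struct and its field names are assumptions, and the thresholds are the minimum values listed above (with 2% as the regression ceiling).

```rust
/// Percentage reduction from `before` to `after`.
pub fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Hypothetical container for the measurements collected in Phase 3.
pub struct Results {
    pub build_secs: (f64, f64), // (before, after)
    pub binary_mb: (f64, f64),  // (before, after)
    pub clippy_allows: u32,
    pub runtime_regression_pct: f64,
    pub tests_pass: bool,
    pub docs_updated: bool,
}

/// True when every success criterion holds.
pub fn meets_success_criteria(r: &Results) -> bool {
    reduction_pct(r.build_secs.0, r.build_secs.1) > 20.0
        && reduction_pct(r.binary_mb.0, r.binary_mb.1) > 5.0
        && r.clippy_allows < 15
        && r.runtime_regression_pct < 2.0
        && r.tests_pass
        && r.docs_updated
}
```

Plugging in the projected figures from the quick-reference tables (180s to 140s, 45MB to 42MB, 12 allows) gives reductions of about 22.2% and 6.7%, so those projections would satisfy the criteria if the remaining checks pass.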

Estimated Timeline

Phase 1: 1 week (5 days)
Phase 2: 2 weeks (10 days)
Phase 3: 1 week (5 days)
Total: 4 weeks (20 working days)

Questions & Decisions

Q: Should we keep default = [] or provide default = ["full"]?

A: default = [] allows flexible composition. Users can do:

cargo add ruvllm --features full
cargo add ruvllm --features "async-runtime,candle"

Q: Will thin LTO cause noticeable performance loss?

A: Only 2-3% on throughput metrics. Release profile stays as-is for optimal performance.

Q: What if users complain about breaking API changes?

A: Removed re-exports stay public but are marked deprecated. Keep them for 1-2 versions with deprecation warnings, then delete them.
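A transition period of this kind could look like the sketch below. It uses a deprecated type alias at the old top-level path, because `#[deprecated]` is known to produce warnings on aliases; whether a warning fires through a plain `pub use` should be checked on your toolchain. The `layers` field, the note text, and `legacy_caller` are hypothetical.

```rust
pub mod backends {
    #[derive(Debug, Default, PartialEq)]
    pub struct CandleBackend {
        pub layers: usize, // hypothetical field
    }
}

/// Old top-level path, kept for a transition period.
/// Code that names it compiles but gets a deprecation warning.
#[deprecated(note = "import `backends::CandleBackend` instead")]
pub type CandleBackend = backends::CandleBackend;

/// Stand-in for downstream code that has not migrated yet. The allow
/// silences the warning here so the example builds cleanly.
#[allow(deprecated)]
pub fn legacy_caller() -> CandleBackend {
    CandleBackend::default()
}
```

Because the alias names the same type, values created through the old path remain interchangeable with values created through `backends::CandleBackend`.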

Status: Ready for implementation Approved By: [awaiting approval] Last Updated: March 2026
