mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-06-01 23:00:37 +00:00
Add architecture decision records for the 3 critical P0 features needed for production LLM inference parity with vLLM/SGLang: ADR-009: Structured Output (JSON Mode) - Constrained decoding with state machine token filtering - GBNF grammar support for complex schemas - Incremental JSON validation during generation - Performance: <2ms overhead per token ADR-010: Function Calling (Tool Use) - OpenAI-compatible tool definition format - Stop-sequence based argument extraction - Parallel and sequential function execution - Automatic retry with error context ADR-011: Prefix Caching (Radix Tree) - SGLang-style radix tree for prefix matching - Copy-on-write KV cache page sharing - LRU eviction with configurable cache size - 10x speedup target for chat/RAG workloads Also includes: - GitHub issue markdown for tracking implementation - Comprehensive SOTA analysis comparing RuvLLM vs competitors - Detailed roadmap (Q1-Q4 2026) for feature parity Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| ADR-001-ruvector-core-architecture.md | ||
| ADR-002-ruvllm-integration.md | ||
| ADR-003-simd-optimization-strategy.md | ||
| ADR-004-kv-cache-management.md | ||
| ADR-005-wasm-runtime-integration.md | ||
| ADR-006-memory-management.md | ||
| ADR-007-security-review-technical-debt.md | ||
| ADR-008-mistral-rs-integration.md | ||
| ADR-009-structured-output.md | ||
| ADR-010-function-calling.md | ||
| ADR-011-prefix-caching.md | ||