diff --git a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
index 02fa1c05..63bc562c 100644
--- a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
+++ b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
@@ -384,8 +384,8 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
 ### Phase 0.5: RLM Post-Quantization Refinement (NEW — Mac Studio, $0)

 - **Timeline**: 1-3 weeks (overlaps with Phase 0 kernel development)
-- **Cost**: **$0** (runs on Mac Studio, ~2-12 days training wall time)
-- **Platform**: Mac Studio (same as Phase 0)
+- **Cost**: **$0** (runs on Mac Studio, ~2-12 days training wall time with Metal; ~4-24 days SIMD-only)
+- **Platform**: Mac Studio (same as Phase 0) — **supports both Metal GPU and pure SIMD/CPU modes** (see AD-20)
 - **Goal**: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack — **no traditional distillation, no cloud GPU**
 - **Approach**: Freeze ternary weights, train FP16 corrections using RLM components:
   1. **MicroLoRA adapters** (rank 1-2) on each expert FFN — adds small FP16 correction: `Y = BitLinear(X) + LoRA_B @ LoRA_A @ X`
@@ -396,7 +396,7 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
   6. **Policy persistence** via PolicyStore — stores optimized per-layer configurations
 - **Trainable parameters**: ~200-400M (1-2% of 30B total) — router (~30M), MicroLoRA adapters (~50-100M), LM head (~150M), scale factors (~0.1M)
 - **Training data**: 100M-500M tokens (sufficient for <400M trainable params)
-- **Throughput**: ~500-1000 tok/s (Metal) × 100M-500M tokens = **2-12 days on Mac Studio**
+- **Throughput**: ~500-1000 tok/s (Metal) or ~200-500 tok/s (NEON SIMD only) × 100M-500M tokens = **2-12 days (Metal) or 4-24 days (SIMD-only) on Mac Studio**
 - **Deliverables**:
   - RLM-refined GGUF with ternary experts + optimized FP16 components
   - MicroLoRA adapter weights (exportable, ~20-100 MB)
@@ -853,16 +853,18 @@ let expert_results: Vec = experts

 **Throughput and cost comparison:**

-| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost | Phase 0 PTQ? |
-|----------|---------------|--------------------------|------|-------------|
-| **Mac Studio M4 Max (Metal)** | ~500-1000 | ~6.5 years | N/A | **Yes — 1-4 hrs, $0** |
-| **Mac Studio M3 Ultra (Metal)** | ~800-1500 | ~4.2 years | N/A | **Yes — 1-1.5 hrs, $0** |
-| CPU AVX2 (Ryzen 9) | ~50-100 | ~65 years | N/A | Yes — 2-6 hrs, $0 |
-| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 | Yes — 30 min, ~$5 |
-| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 | Overkill for PTQ |
-| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | **~$1,300** | Overkill for PTQ |
-| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 | Overkill for PTQ |
-| 4× H100 (DataCrunch) | ~140,000 | ~16 days | **~$3,200** | Overkill for PTQ |
+| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost | Phase 0 PTQ? | Phase 0.5 RLM? |
+|----------|---------------|--------------------------|------|-------------|---------------|
+| **Mac Studio M4 Max (Metal)** | ~500-1000 | ~6.5 years | N/A | **Yes — 1-4 hrs, $0** | **Yes — 2-12 days, $0** |
+| **Mac Studio M4 Max (NEON SIMD only, no Metal)** | ~200-500 | ~13 years | N/A | **Yes — 2-6 hrs, $0** | **Yes — 4-24 days, $0** |
+| **Mac Studio M3 Ultra (Metal)** | ~800-1500 | ~4.2 years | N/A | **Yes — 1-1.5 hrs, $0** | **Yes — 1.5-8 days, $0** |
+| **Mac Studio M3 Ultra (NEON SIMD only, no Metal)** | ~300-700 | ~9 years | N/A | **Yes — 1.5-3 hrs, $0** | **Yes — 3-16 days, $0** |
+| CPU AVX2 (Ryzen 9) — scalar fallback | ~50-150 | ~43-130 years | N/A | Yes — 2-6 hrs, $0 | Yes — 14-58 days, $0 |
+| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 | Yes — 30 min, ~$5 | Overkill |
+| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 | Overkill for PTQ | Overkill |
+| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | **~$1,300** | Overkill for PTQ | Overkill |
+| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 | Overkill for PTQ | Overkill |
+| 4× H100 (DataCrunch) | ~140,000 | ~16 days | **~$3,200** | Overkill for PTQ | Overkill |

 **Key insight**: Mac Studio is infeasible for Phase 1+ training (years of wall time) but **ideal for Phase 0 PTQ** (hours, $0). This separation justifies the phased approach.
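The wall-time figures in the throughput table appear to follow from dividing a token budget by a sustained training throughput. The sketch below reproduces that arithmetic so the ranges can be re-derived when throughput estimates change. The function names are hypothetical and are not part of the RuvLLM codebase, and the model assumes constant throughput with no restarts or evaluation pauses.

```rust
/// Seconds in one day, used to convert wall time to days.
pub const SECONDS_PER_DAY: f64 = 86_400.0;

/// Estimated wall time in days for `tokens` training tokens at a sustained
/// throughput of `tok_per_s` tokens per second.
/// Hypothetical helper: assumes constant throughput for the whole run.
pub fn wall_time_days(tokens: f64, tok_per_s: f64) -> f64 {
    assert!(tok_per_s > 0.0, "throughput must be positive");
    tokens / tok_per_s / SECONDS_PER_DAY
}

/// Best-case and worst-case wall time in days for a throughput range.
/// The fast end of the range gives the best case, the slow end the worst.
pub fn wall_time_range_days(
    tokens: f64,
    slow_tok_per_s: f64,
    fast_tok_per_s: f64,
) -> (f64, f64) {
    (
        wall_time_days(tokens, fast_tok_per_s),
        wall_time_days(tokens, slow_tok_per_s),
    )
}
```

For example, `wall_time_days(200e9, 50_000.0)` evaluates to roughly 46.3, which matches the ~46 days listed for 4× A100 on the 200B-token Phase 1 budget.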
@@ -872,7 +874,8 @@ let expert_results: Vec = experts
 |-------|----------|----------|----------------|----------|
 | **Phase 0 (PTQ)** | **Mac Studio (M4 Max/M3 Ultra)** | **1-4 hours** | **$0** | **Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass** |
 | Phase 0D (BitDistill Lite, 10B tok) | Mac Studio Metal or 1× A100 spot | 2-4 weeks (local) / 1-2 days (cloud) | $0 (local) / ~$300 (cloud) | Optional quality upgrade if Phase 0C too degraded |
-| **Phase 0.5 (RLM refinement, 100-500M tok)** | **Mac Studio (Metal)** | **3-14 days** | **$0** | **MicroLoRA + router fix + scale opt using existing RLM stack** |
+| **Phase 0.5 (RLM refinement, Metal)** | **Mac Studio (Metal)** | **3-14 days** | **$0** | **MicroLoRA + router fix + scale opt using existing RLM stack** |
+| **Phase 0.5 (RLM refinement, SIMD-only)** | **Mac Studio (NEON CPU)** | **5-28 days** | **$0** | **Same pipeline, no Metal required — pure ndarray + NEON SIMD (see AD-20)** |
 | Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
 | Phase 1 (router validation) | Mac Studio Metal or 1× A100 | ~2-4 hours | $0 (local) / <$10 (cloud) | Contrastive training on router only (~2B params) |
 | Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
@@ -1155,6 +1158,100 @@ All three can be addressed by training only the FP16 components using the existi
 **Reused (100%)**: `MicroLoRA`, `TrainingPipeline`, `EwcRegularizer`, `GrpoOptimizer`, `ContrastiveTrainer`, `MemoryDistiller`, `PolicyStore`, `TrainingConfig`, LR schedules, GGUF export.
 **New (0%)**: No new training code. The only new code is a thin `RlmRefiner` orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.
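The core idea of Phase 0.5, freezing the ternary weights and training only a small additive correction (`Y = BitLinear(X) + LoRA_B @ LoRA_A @ X`), can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `TernaryLinear`, `LoraAdapter`, and `corrected_forward` are hypothetical names and do not reflect the actual `MicroLoRA` or BitLinear APIs, `f32` stands in for FP16, and a single per-tensor scale is assumed.

```rust
/// Frozen ternary layer (hypothetical stand-in for BitLinear).
pub struct TernaryLinear {
    pub rows: usize,
    pub cols: usize,
    /// Row-major weights; each entry is -1, 0, or +1 and is never updated.
    pub weights: Vec<i8>,
    /// Single per-tensor scale (an assumption for this sketch).
    pub scale: f32,
}

impl TernaryLinear {
    /// y = scale * (W x), using only additions and subtractions per weight.
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        assert_eq!(self.weights.len(), self.rows * self.cols);
        self.weights
            .chunks(self.cols)
            .map(|row| {
                let acc = row
                    .iter()
                    .zip(x)
                    .map(|(&w, &xi)| match w {
                        1 => xi,
                        -1 => -xi,
                        _ => 0.0,
                    })
                    .sum::<f32>();
                acc * self.scale
            })
            .collect()
    }
}

/// Trainable low-rank correction (hypothetical stand-in for MicroLoRA).
pub struct LoraAdapter {
    pub rows: usize,
    pub cols: usize,
    pub rank: usize,
    /// A: rank x cols, row-major.
    pub a: Vec<f32>,
    /// B: rows x rank, row-major.
    pub b: Vec<f32>,
}

impl LoraAdapter {
    /// delta = B (A x), computed through the rank-sized intermediate h = A x.
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        assert_eq!(self.a.len(), self.rank * self.cols);
        assert_eq!(self.b.len(), self.rows * self.rank);
        let h: Vec<f32> = self
            .a
            .chunks(self.cols)
            .map(|row| row.iter().zip(x).map(|(a, xi)| a * xi).sum::<f32>())
            .collect();
        self.b
            .chunks(self.rank)
            .map(|row| row.iter().zip(&h).map(|(b, hk)| b * hk).sum::<f32>())
            .collect()
    }
}

/// Y = BitLinear(X) + LoRA_B @ LoRA_A @ X for a single input vector.
pub fn corrected_forward(base: &TernaryLinear, lora: &LoraAdapter, x: &[f32]) -> Vec<f32> {
    let y = base.forward(x);
    let delta = lora.forward(x);
    y.iter().zip(&delta).map(|(yi, di)| yi + di).collect()
}
```

Only `LoraAdapter::a` and `LoraAdapter::b` would receive gradients in this picture. With rank 1 or 2 the adapter adds `rank * (rows + cols)` parameters per layer, which is why the trainable footprint stays small relative to the frozen ternary weights.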
+### AD-20: Phase 0.5 — SIMD-Only Training Mode (No Metal GPU Required)
+
+**Decision**: Phase 0.5 RLM refinement supports a pure SIMD/CPU execution mode with no Metal GPU dependency. Metal is an optional acceleration path (~2-3x faster) but not required.
+
+**Rationale**: Analysis of the RLM training stack reveals that Metal GPU is used by only one component (`RealContrastiveTrainer` via Candle), while all other training components are pure ndarray/CPU. Since Phase 0.5 uses the lightweight `ContrastiveTrainer` (not `RealContrastiveTrainer`) for router repair, and all gradient computation is ndarray-based, the entire pipeline runs on pure CPU with SIMD acceleration for inference forward passes.
+
+**Component-by-component GPU dependency analysis:**
+
+| Component | Source | GPU Dependency | SIMD-Only Mode |
+|-----------|--------|---------------|----------------|
+| `MicroLoRA.forward_simd()` | `lora/micro_lora.rs:279` | **None** — ARM NEON intrinsics with scalar fallback | NEON on aarch64, scalar on x86 |
+| `MicroLoRA.apply_gradients()` | `lora/micro_lora.rs:621+` | **None** — pure ndarray | Works everywhere |
+| `MicroLoRA.apply_gradients_with_ewc()` | `lora/micro_lora.rs:621+` | **None** — pure ndarray | Works everywhere |
+| `TrainingPipeline` | `lora/training.rs` | **None** — pure ndarray CPU | Works everywhere |
+| `EwcRegularizer` | `lora/training.rs` | **None** — pure ndarray CPU | Works everywhere |
+| `GrpoOptimizer` | `training/grpo.rs` | **None** — pure ndarray CPU | Works everywhere |
+| `ContrastiveTrainer` | `training/contrastive.rs:169-175` | **Optional** — `use_metal: true` default, but `Device::new_metal(0).unwrap_or(Device::Cpu)` fallback | Set `use_metal: false` for CPU-only; also has non-Candle pure CPU path (line 475) |
+| `MemoryDistiller` | `reasoning_bank/distillation.rs` | **None** — pure Rust | Works everywhere |
+| `PolicyStore` | `policy_store.rs` | **None** — pure Rust | Works everywhere |
+| **`RealContrastiveTrainer`** | `training/real_trainer.rs:178` | **Yes — Metal/Candle** | **NOT used in Phase 0.5** (used in full distillation only) |
+
+**Inference forward pass (for loss computation) SIMD support:**
+
+| Kernel | NEON (aarch64) | x86 | Source |
+|--------|---------------|-----|--------|
+| GEMM | `gemm_neon` | `gemm_scalar` fallback | `kernels/matmul.rs:520` |
+| GEMV | `gemv_neon` | `gemv_scalar` fallback | `kernels/matmul.rs:184` |
+| SiLU | `silu_neon_impl` (~3.5x speedup) | scalar fallback | `kernels/activations.rs` |
+| GeLU | `gelu_neon_impl` (~3.2x speedup) | scalar fallback | `kernels/activations.rs` |
+| ReLU | `relu_neon_impl` (~4.0x speedup) | scalar fallback | `kernels/activations.rs` |
+| RMSNorm | `rms_norm_neon` | scalar fallback | `kernels/norm.rs` |
+| RoPE | `apply_rope_neon` | scalar fallback | `kernels/rope.rs` |
+| Softmax | `softmax_neon` (~2.8x speedup) | scalar fallback | `kernels/activations.rs` |
+
+**Key observation**: The matmul kernels only dispatch on `target_arch = "aarch64"` vs scalar. There are **no explicit AVX2 or AVX512 SIMD implementations** for x86 in the current kernel codebase. This means:
+- **Apple Silicon (aarch64)**: Full NEON SIMD acceleration — primary target for SIMD-only mode
+- **x86 (AMD/Intel)**: Falls to scalar fallback — works but ~3-5x slower than NEON
+- **Future opportunity**: Adding AVX2/AVX512 kernels to `matmul.rs` would make x86 competitive with NEON
+
+**Throughput comparison for Phase 0.5 (100M tokens, ~200-400M trainable params, 3B active forward):**
+
+| Execution Mode | Forward tok/s | Effective Training tok/s | 100M Tokens | 500M Tokens |
+|---------------|--------------|------------------------|------------|------------|
+| Metal GPU (M4 Max) | ~500-1500 | ~300-700 | ~2-4 days | ~8-19 days |
+| **NEON SIMD only (M4 Max CPU)** | **~200-500** | **~100-300** | **~4-12 days** | **~19-58 days** |
+| **NEON SIMD only (M3 Ultra CPU)** | **~300-700** | **~150-400** | **~3-8 days** | **~14-39 days** |
+| x86 scalar (Ryzen 9, no AVX2 kernels) | ~50-150 | ~30-80 | ~14-39 days | ~72-193 days |
+
+**Why SIMD-only is ~2-3x slower than Metal (not 10x):**
+- Phase 0.5 training is dominated by the forward pass through the frozen 3B active parameters to compute loss against the teacher
+- The forward pass uses SIMD-accelerated GEMM/GEMV (`gemm_neon`/`gemv_neon`) which gets ~60-70% of Metal throughput for these matrix sizes
+- Gradient computation for the ~200-400M trainable params is pure ndarray — identical speed regardless of Metal availability
+- The training bottleneck is I/O (loading teacher activations from mmap) not compute, further narrowing the gap
+
+**Platform portability (bonus of SIMD-only mode):**
+
+SIMD-only mode extends Phase 0.5 beyond Mac Studio to any platform with ndarray support:
+
+| Platform | SIMD Path | Effective tok/s | Feasible? |
+|----------|----------|----------------|-----------|
+| Mac Studio M4 Max (aarch64) | NEON intrinsics | ~100-300 | **Yes — primary target** |
+| Mac Studio M3 Ultra (aarch64) | NEON intrinsics | ~150-400 | **Yes — faster than M4 Max** |
+| Linux ARM64 (Ampere/Graviton) | NEON intrinsics | ~80-200 | **Yes — cloud ARM instances** |
+| Linux x86 (Ryzen/Xeon) | Scalar fallback | ~30-80 | **Marginal — 100M tokens feasible (~14-39 days), 500M not practical** |
+| macOS Intel | Scalar fallback | ~20-50 | **Not recommended** |
+
+**Configuration for SIMD-only mode:**
+
+```rust
+// Phase 0.5 SIMD-only config (no Metal)
+let contrastive_config = ContrastiveConfig {
+    use_metal: false,  // Force CPU path in ContrastiveTrainer
+    ..Default::default()
+};
+
+// MicroLoRA — already pure SIMD/ndarray, no config change needed
+// TrainingPipeline — already pure ndarray
+// GrpoOptimizer — already pure ndarray
+// EwcRegularizer — already pure ndarray
+```
+
+The only config change is `ContrastiveTrainer.use_metal = false`. All other RLM components are GPU-agnostic by design.
+
+**SIMD-only Phase 0.5 exit criteria (in addition to standard Phase 0.5 criteria):**
+- [ ] All training completes without Metal GPU dependency
+- [ ] `ContrastiveTrainer` runs with `use_metal: false` and produces equivalent router accuracy
+- [ ] MicroLoRA `forward_simd()` executes NEON path on aarch64 (verified via `cfg` compile check)
+- [ ] Training throughput measured and documented for SIMD-only vs Metal comparison
+
+**Recommendation**: Use Metal when available (2-3x faster), fall back to SIMD-only when Metal is unavailable or on non-Mac platforms. The training code requires zero changes — only `ContrastiveTrainer.use_metal` needs to be set to `false`.
+
+**Reused**: 100% of existing RLM stack — `MicroLoRA` NEON forward, ndarray training, `ContrastiveTrainer` CPU fallback, all existing SIMD kernels.
+**New**: 0 lines. SIMD-only mode is already supported by the existing code paths; AD-20 documents this capability explicitly.
+
 ---

 ## Consequences
@@ -1177,6 +1274,8 @@ All three can be addressed by training only the FP16 components using the existi
 14. **Existing GGUF ecosystem**: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
 15. **Phase 0.5 RLM refinement at $0**: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
 16. **100% RLM reuse for Phase 0.5**: No new training infrastructure needed — all 7 RLM components are production-tested and wire together directly
+17. **SIMD-only Phase 0.5**: Entire RLM refinement pipeline runs on pure CPU SIMD (NEON on aarch64) without Metal GPU — only ~2-3x slower than Metal, extends platform support to Linux ARM64 and (with scalar fallback) x86
+18. **Zero-config SIMD mode**: All training components (MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer) are already GPU-agnostic; only `ContrastiveTrainer.use_metal = false` needed for full SIMD-only execution

 ### Negative

@@ -1187,6 +1286,7 @@ All three can be addressed by training only the FP16 components using the existi
 5. **Mixed-precision complexity**: Router (FP16) + experts (ternary) + attention (FP16/ternary) adds dispatch complexity
 6. **WASM limitation**: Ternary lookup table kernels may not translate efficiently to WASM SIMD
 7. **RLM scale gap**: Existing `RealContrastiveTrainer` targets 0.5B models (embedding_dim=896); scaling to 30B requires distributed data loading and increased batch sizes
+8. **No x86 SIMD kernels**: Current `kernels/matmul.rs` only implements NEON (aarch64); x86 falls to scalar fallback (~3-5x slower than NEON). Adding AVX2/AVX512 kernels would make x86 SIMD-only mode competitive but is not yet implemented

 ### Risks

@@ -1284,3 +1384,6 @@ All three can be addressed by training only the FP16 components using the existi
 21. STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
 22. Apple Mac Studio Technical Specifications (2025) — https://www.apple.com/mac-studio/specs/
 23. RuvLLM Metal GEMV integration: `crates/ruvllm/src/kernels/matmul.rs:1444-1582`
+24. RuvLLM MicroLoRA NEON SIMD forward: `crates/ruvllm/src/lora/micro_lora.rs:279-390` (forward_simd, forward_simd_neon_impl)
+25. RuvLLM NEON SIMD kernels: `crates/ruvllm/src/kernels/` (matmul: gemm_neon/gemv_neon, activations: silu_neon/gelu_neon/relu_neon, norm: rms_norm_neon, rope: apply_rope_neon)
+26. RuvLLM ContrastiveTrainer CPU fallback: `crates/ruvllm/src/training/contrastive.rs:171-175` (Metal → CPU fallback) and `contrastive.rs:475` (non-Candle pure CPU path)
diff --git a/docs/research/craftsman-ultra-30b-1bit-ddd.md b/docs/research/craftsman-ultra-30b-1bit-ddd.md
index cbf4e7e7..ad94765e 100644
--- a/docs/research/craftsman-ultra-30b-1bit-ddd.md
+++ b/docs/research/craftsman-ultra-30b-1bit-ddd.md
@@ -1,6 +1,6 @@
 # Domain-Driven Design: Craftsman Ultra 30b 1bit

-**Version:** 2.2
+**Version:** 2.3
 **Date:** 2026-02-03
 **Relates to:** ADR-017-craftsman-ultra-30b-1bit-bitnet-integration
 **Status:** Research / Pre-Implementation
@@ -84,6 +84,9 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
 | **Frozen Ternary** | Expert FFN weights locked to their PTQ {-1,0,+1} values during Phase 0.5 refinement — not differentiable, not modified. |
 | **LoRA Correction** | Small FP16 additive output from MicroLoRA that compensates for ternary quantization error: `Y = BitLinear(X) + LoRA(X)`. |
 | **Router Repair** | Contrastive fine-tuning of FP16 router weights to correct misrouting caused by expert output distribution changes after PTQ. |
+| **SIMD-Only Mode** | Phase 0.5 execution mode where all training runs on pure CPU SIMD (NEON on aarch64) without Metal GPU. All RLM components are GPU-agnostic except ContrastiveTrainer which has an explicit CPU fallback path. ~2-3x slower than Metal but extends platform support beyond macOS. |
+| **NEON Intrinsics** | ARM SIMD instruction set used by MicroLoRA's `forward_simd_neon_impl()` for 8x-unrolled forward passes. Available on all Apple Silicon and ARM64 platforms. x86 platforms fall to scalar fallback. |
+| **Scalar Fallback** | Platform-agnostic non-SIMD code path used when NEON (aarch64) is unavailable. Provides identical results at ~3-5x lower throughput. Enables Phase 0.5 on x86 Linux/Windows. |

 ---

@@ -548,7 +551,7 @@ The RLM Training Orchestration Context operates in a **lightweight refinement mo
 | `ContrastiveTrainer` | Router validation | **Router repair** |
 | `GrpoOptimizer` | Per-expert distillation reward | **Scale factor optimization reward** |
 | `EwcRegularizer` | Cross-expert stability | **Cross-step stability** |
-| Platform | Cloud GPU (4× A100) | **Mac Studio (Metal)** |
+| Platform | Cloud GPU (4× A100) | **Mac Studio (Metal or SIMD-only)** |
 | Cost | $1,300+ | **$0** |
 | New code | ~30% new | **~0% new** (only thin orchestrator) |

@@ -939,7 +942,7 @@ Assuming GLM-4.7-Flash architecture with ~3B active parameters per token:

 ## 8.5 Training Infrastructure Model

-### Why Not Local CPU/SIMD
+### Why Not Local CPU/SIMD (for Phase 1+)

 The existing RuvLLM SIMD kernels (`crates/ruvllm/src/kernels/`) are **inference-only** — no backward pass, no gradient computation, no training support. The training code paths are:

@@ -947,7 +950,34 @@ The existing RuvLLM SIMD kernels (`crates/ruvllm/src/kernels/`) are **inference-
 - `EwcRegularizer` / LoRA training: Pure CPU via `ndarray` (no GPU acceleration)
 - SIMD kernels: Forward-pass optimizations only (flash attention, matmul, activations)

-At ~50-100 training tok/s on CPU, 200B tokens would require ~65 years. Not viable.
+At ~50-100 training tok/s on CPU, 200B tokens would require ~65 years. Not viable for Phase 1+.
+
+### Why SIMD-Only Works (for Phase 0.5)
+
+Phase 0.5 is fundamentally different from Phase 1+: it trains only ~200-400M FP16 parameters (1-2% of 30B) using existing RLM components that are already pure ndarray/CPU. The SIMD kernels are used for the forward pass through the frozen model to compute training loss, not for gradient computation.
+
+**GPU dependency analysis of Phase 0.5 components:**
+
+| Component | GPU Required? | SIMD Benefit |
+|-----------|--------------|-------------|
+| MicroLoRA forward pass | No — `forward_simd()` uses NEON intrinsics directly | ~3-4x over scalar |
+| MicroLoRA gradient computation | No — pure ndarray `apply_gradients()` | None (ndarray handles) |
+| TrainingPipeline | No — pure ndarray | None |
+| EwcRegularizer | No — pure ndarray | None |
+| GrpoOptimizer | No — pure ndarray | None |
+| ContrastiveTrainer | Optional — `use_metal: false` forces CPU | Candle CPU tensors |
+| Frozen model forward (loss computation) | No — SIMD inference kernels | NEON GEMM/GEMV ~3x |
+
+**Effective training throughput (SIMD-only, 100M-500M tokens):**
+
+| Platform | SIMD | tok/s | 100M tokens | Feasible? |
+|----------|------|-------|-------------|-----------|
+| Mac Studio M4 Max | NEON | ~100-300 | 4-12 days | **Yes** |
+| Mac Studio M3 Ultra | NEON | ~150-400 | 3-8 days | **Yes** |
+| Linux ARM64 (Graviton3) | NEON | ~80-200 | 6-14 days | **Yes** |
+| Linux x86 (Ryzen 9) | Scalar | ~30-80 | 14-39 days | **Marginal** |
+
+**Platform gap**: No AVX2/AVX512 SIMD kernels exist in `kernels/matmul.rs` — only `target_arch = "aarch64"` (NEON) vs scalar dispatch. x86 therefore falls to scalar, making it ~3-5x slower than NEON. Adding AVX2 kernels is an identified future improvement (see ADR-017 AD-20).

 ### Cloud GPU Distillation Strategy

@@ -987,10 +1017,12 @@ Expert FFN (~1B params):

 | Task | Location | Device | Duration |
 |------|----------|--------|----------|
+| **Phase 0.5 RLM refinement (Metal)** | **Mac Studio** | **Metal GPU + CPU ndarray** | **3-14 days** |
+| **Phase 0.5 RLM refinement (SIMD-only)** | **Mac Studio or Linux ARM64** | **NEON SIMD + CPU ndarray** | **4-24 days** |
 | Expert distillation (Phase 1) | GCP 4×A100 spot | CUDA | ~46 days |
-| Router contrastive validation | GCP 1×A100 or local Mac | CUDA/Metal | Hours |
+| Router contrastive validation | GCP 1×A100 or local Mac | CUDA/Metal/CPU | Hours |
 | Inference benchmark (TL1/TL2) | Local workstation | CPU SIMD (AVX2/NEON) | Minutes |
-| MicroLoRA adaptation | Local / edge | CPU (ndarray) | <1ms/update |
+| MicroLoRA adaptation | Local / edge | CPU (ndarray + NEON SIMD) | <1ms/update |
 | GGUF export | Local | CPU | Minutes |
 | Kernel correctness tests | Local | CPU SIMD | Seconds |

@@ -1089,6 +1121,9 @@ All changes are additive. No existing backend, model, or API is modified. The `B
 | 15 | Optimal MicroLoRA rank for Phase 0.5? | Quality vs speed | Open | Rank-1 is faster, rank-2 is 5% faster due to SIMD but has 2× params. Empirical testing needed. |
 | 16 | LoRA adapter persistence in GGUF? | Export format | Open | Store LoRA A/B matrices as separate tensors in GGUF, or merge into ternary+FP16 hybrid format? |
 | 17 | Phase 0.5 LoRA → Phase 1 distillation init? | Continuity | Open | Can Phase 0.5 LoRA corrections inform Phase 1 shadow weight initialization for faster convergence? |
+| 18 | Add AVX2/AVX512 SIMD kernels to `matmul.rs`? | x86 SIMD-only performance | Open | Current kernels only have NEON (aarch64) + scalar fallback. Adding AVX2 would make x86 SIMD-only Phase 0.5 ~3-5x faster. Is it worth the effort vs just using ARM? |
+| 19 | SIMD-only vs Metal quality equivalence? | Phase 0.5 validation | Open | Does ContrastiveTrainer produce identical router accuracy on CPU vs Metal? Need empirical comparison to confirm no numerical divergence. |
+| 20 | Cloud ARM64 instances for SIMD-only Phase 0.5? | Platform portability | Open | AWS Graviton3/4 or Ampere Altra instances with 128+ GB RAM could run SIMD-only Phase 0.5 without Mac Studio. Cost-competitive? |

 ---

@@ -1111,3 +1146,6 @@ All changes are additive. No existing backend, model, or API is modified. The `B
 - BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025)
 - bartowski, GLM-4.7-Flash-GGUF quantizations: https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
 - llama.cpp IQ1_S blind testing: https://github.com/ggml-org/llama.cpp/discussions/5962
+- RuvLLM MicroLoRA NEON SIMD: `crates/ruvllm/src/lora/micro_lora.rs:279-390`
+- RuvLLM NEON SIMD kernels: `crates/ruvllm/src/kernels/` (gemm_neon, gemv_neon, silu_neon, gelu_neon, relu_neon, rms_norm_neon, apply_rope_neon)
+- RuvLLM ContrastiveTrainer CPU fallback: `crates/ruvllm/src/training/contrastive.rs:171-175`