diff --git a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
index 02fa1c05..63bc562c 100644
--- a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
+++ b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
@@ -384,8 +384,8 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
 
 ### Phase 0.5: RLM Post-Quantization Refinement (NEW — Mac Studio, $0)
 - **Timeline**: 1-3 weeks (overlaps with Phase 0 kernel development)
-- **Cost**: **$0** (runs on Mac Studio, ~2-12 days training wall time)
-- **Platform**: Mac Studio (same as Phase 0)
+- **Cost**: **$0** (runs on Mac Studio, ~2-12 days training wall time with Metal; ~4-24 days SIMD-only)
+- **Platform**: Mac Studio (same as Phase 0) — **supports both Metal GPU and pure SIMD/CPU modes** (see AD-20)
 - **Goal**: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack — **no traditional distillation, no cloud GPU**
 - **Approach**: Freeze ternary weights, train FP16 corrections using RLM components:
   1. **MicroLoRA adapters** (rank 1-2) on each expert FFN — adds small FP16 correction: `Y = BitLinear(X) + LoRA_B @ LoRA_A @ X`
@@ -396,7 +396,7 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
   6. **Policy persistence** via PolicyStore — stores optimized per-layer configurations
 - **Trainable parameters**: ~200-400M (1-2% of 30B total) — router (~30M), MicroLoRA adapters (~50-100M), LM head (~150M), scale factors (~0.1M)
 - **Training data**: 100M-500M tokens (sufficient for <400M trainable params)
-- **Throughput**: ~500-1000 tok/s (Metal) × 100M-500M tokens = **2-12 days on Mac Studio**
+- **Throughput**: 100M-500M tokens at ~500-1000 tok/s (Metal) or ~200-500 tok/s (NEON SIMD only) = **roughly 2-12 days (Metal) or 4-24 days (SIMD-only) on Mac Studio**
 - **Deliverables**:
   - RLM-refined GGUF with ternary experts + optimized FP16 components
   - MicroLoRA adapter weights (exportable, ~20-100 MB)
@@ -853,16 +853,18 @@ let expert_results: Vec<DistillResult> = experts
 
 **Throughput and cost comparison:**
 
-| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost | Phase 0 PTQ? |
-|----------|---------------|--------------------------|------|-------------|
-| **Mac Studio M4 Max (Metal)** | ~500-1000 | ~6.5 years | N/A | **Yes — 1-4 hrs, $0** |
-| **Mac Studio M3 Ultra (Metal)** | ~800-1500 | ~4.2 years | N/A | **Yes — 1-1.5 hrs, $0** |
-| CPU AVX2 (Ryzen 9) | ~50-100 | ~65 years | N/A | Yes — 2-6 hrs, $0 |
-| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 | Yes — 30 min, ~$5 |
-| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 | Overkill for PTQ |
-| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | **~$1,300** | Overkill for PTQ |
-| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 | Overkill for PTQ |
-| 4× H100 (DataCrunch) | ~140,000 | ~16 days | **~$3,200** | Overkill for PTQ |
+| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost | Phase 0 PTQ? | Phase 0.5 RLM? |
+|----------|---------------|--------------------------|------|-------------|---------------|
+| **Mac Studio M4 Max (Metal)** | ~500-1000 | ~6.5 years | N/A | **Yes — 1-4 hrs, $0** | **Yes — 2-12 days, $0** |
+| **Mac Studio M4 Max (NEON SIMD only, no Metal)** | ~200-500 | ~13 years | N/A | **Yes — 2-6 hrs, $0** | **Yes — 4-24 days, $0** |
+| **Mac Studio M3 Ultra (Metal)** | ~800-1500 | ~4.2 years | N/A | **Yes — 1-1.5 hrs, $0** | **Yes — 1.5-8 days, $0** |
+| **Mac Studio M3 Ultra (NEON SIMD only, no Metal)** | ~300-700 | ~9 years | N/A | **Yes — 1.5-3 hrs, $0** | **Yes — 3-16 days, $0** |
+| x86 CPU (Ryzen 9, scalar fallback; no AVX2 kernels) | ~50-150 | ~43-130 years | N/A | Yes — 2-6 hrs, $0 | Yes — 14-58 days, $0 |
+| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 | Yes — 30 min, ~$5 | Overkill |
+| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 | Overkill for PTQ | Overkill |
+| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | **~$1,300** | Overkill for PTQ | Overkill |
+| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 | Overkill for PTQ | Overkill |
+| 4× H100 (DataCrunch) | ~140,000 | ~16 days | **~$3,200** | Overkill for PTQ | Overkill |
 
 **Key insight**: Mac Studio is infeasible for Phase 1+ training (years of wall time) but **ideal for Phase 0 PTQ** (hours, $0). This separation justifies the phased approach.
 
@@ -872,7 +874,8 @@ let expert_results: Vec<DistillResult> = experts
 |-------|----------|----------|----------------|----------|
 | **Phase 0 (PTQ)** | **Mac Studio (M4 Max/M3 Ultra)** | **1-4 hours** | **$0** | **Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass** |
 | Phase 0D (BitDistill Lite, 10B tok) | Mac Studio Metal or 1× A100 spot | 2-4 weeks (local) / 1-2 days (cloud) | $0 (local) / ~$300 (cloud) | Optional quality upgrade if Phase 0C too degraded |
-| **Phase 0.5 (RLM refinement, 100-500M tok)** | **Mac Studio (Metal)** | **3-14 days** | **$0** | **MicroLoRA + router fix + scale opt using existing RLM stack** |
+| **Phase 0.5 (RLM refinement, Metal)** | **Mac Studio (Metal)** | **3-14 days** | **$0** | **MicroLoRA + router fix + scale opt using existing RLM stack** |
+| **Phase 0.5 (RLM refinement, SIMD-only)** | **Mac Studio (NEON CPU)** | **5-28 days** | **$0** | **Same pipeline, no Metal required — pure ndarray + NEON SIMD (see AD-20)** |
 | Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
 | Phase 1 (router validation) | Mac Studio Metal or 1× A100 | ~2-4 hours | $0 (local) / <$10 (cloud) | Contrastive training on router only (~2B params) |
 | Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
@@ -1155,6 +1158,100 @@ All three can be addressed by training only the FP16 components using the existi
 **Reused (100%)**: `MicroLoRA`, `TrainingPipeline`, `EwcRegularizer`, `GrpoOptimizer`, `ContrastiveTrainer`, `MemoryDistiller`, `PolicyStore`, `TrainingConfig`, LR schedules, GGUF export.
 **New (0%)**: No new training code. The only new code is a thin `RlmRefiner` orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.
 
+### AD-20: Phase 0.5 — SIMD-Only Training Mode (No Metal GPU Required)
+
+**Decision**: Phase 0.5 RLM refinement supports a pure SIMD/CPU execution mode with no Metal GPU dependency. Metal is an optional acceleration path (~2-3x faster) but not required.
+
+**Rationale**: Analysis of the RLM training stack shows that only one component requires Metal (`RealContrastiveTrainer`, via Candle). `ContrastiveTrainer` uses Metal optionally and falls back to CPU, and all other training components are pure ndarray/CPU. Since Phase 0.5 uses the lightweight `ContrastiveTrainer` (not `RealContrastiveTrainer`) for router repair, and all gradient computation is ndarray-based, the entire pipeline runs on CPU, with SIMD acceleration for the inference forward passes.
+
+**Component-by-component GPU dependency analysis:**
+
+| Component | Source | GPU Dependency | SIMD-Only Mode |
+|-----------|--------|---------------|----------------|
+| `MicroLoRA.forward_simd()` | `lora/micro_lora.rs:279` | **None** — ARM NEON intrinsics with scalar fallback | NEON on aarch64, scalar on x86 |
+| `MicroLoRA.apply_gradients()` | `lora/micro_lora.rs:621+` | **None** — pure ndarray | Works everywhere |
+| `MicroLoRA.apply_gradients_with_ewc()` | `lora/micro_lora.rs:621+` | **None** — pure ndarray | Works everywhere |
+| `TrainingPipeline` | `lora/training.rs` | **None** — pure ndarray CPU | Works everywhere |
+| `EwcRegularizer` | `lora/training.rs` | **None** — pure ndarray CPU | Works everywhere |
+| `GrpoOptimizer` | `training/grpo.rs` | **None** — pure ndarray CPU | Works everywhere |
+| `ContrastiveTrainer` | `training/contrastive.rs:169-175` | **Optional** — `use_metal: true` default, but `Device::new_metal(0).unwrap_or(Device::Cpu)` fallback | Set `use_metal: false` for CPU-only; also has non-Candle pure CPU path (line 475) |
+| `MemoryDistiller` | `reasoning_bank/distillation.rs` | **None** — pure Rust | Works everywhere |
+| `PolicyStore` | `policy_store.rs` | **None** — pure Rust | Works everywhere |
+| **`RealContrastiveTrainer`** | `training/real_trainer.rs:178` | **Yes — Metal/Candle** | **NOT used in Phase 0.5** (used in full distillation only) |
+
+**Inference forward pass (for loss computation) SIMD support:**
+
+| Kernel | NEON (aarch64) | x86 | Source |
+|--------|---------------|-----|--------|
+| GEMM | `gemm_neon` | `gemm_scalar` fallback | `kernels/matmul.rs:520` |
+| GEMV | `gemv_neon` | `gemv_scalar` fallback | `kernels/matmul.rs:184` |
+| SiLU | `silu_neon_impl` (~3.5x speedup) | scalar fallback | `kernels/activations.rs` |
+| GeLU | `gelu_neon_impl` (~3.2x speedup) | scalar fallback | `kernels/activations.rs` |
+| ReLU | `relu_neon_impl` (~4.0x speedup) | scalar fallback | `kernels/activations.rs` |
+| RMSNorm | `rms_norm_neon` | scalar fallback | `kernels/norm.rs` |
+| RoPE | `apply_rope_neon` | scalar fallback | `kernels/rope.rs` |
+| Softmax | `softmax_neon` (~2.8x speedup) | scalar fallback | `kernels/activations.rs` |
+
+**Key observation**: The matmul kernels only dispatch on `target_arch = "aarch64"` vs scalar. There are **no explicit AVX2 or AVX512 SIMD implementations** for x86 in the current kernel codebase. This means:
+- **Apple Silicon (aarch64)**: Full NEON SIMD acceleration — primary target for SIMD-only mode
+- **x86 (AMD/Intel)**: Falls to scalar fallback — works but ~3-5x slower than NEON
+- **Future opportunity**: Adding AVX2/AVX512 kernels to `matmul.rs` would make x86 competitive with NEON
+
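The `target_arch` dispatch described above can be sketched as follows. This is a minimal illustration, not the code in `kernels/matmul.rs`: the names `dot`, `dot_neon`, and `dot_scalar` are hypothetical, and a dot product stands in for the real GEMM/GEMV kernels. It assumes only the stable NEON intrinsics in `std::arch::aarch64`.

```rust
/// Portable scalar dot product: the fallback used on non-aarch64 targets.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// NEON dot product: 4 lanes per iteration, then a scalar tail.
/// Hypothetical helper; only compiled on aarch64.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn dot_neon(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::{vaddvq_f32, vdupq_n_f32, vfmaq_f32, vld1q_f32};
    let chunks = a.len() / 4;
    let mut acc = vdupq_n_f32(0.0);
    for i in 0..chunks {
        let va = vld1q_f32(a.as_ptr().add(4 * i));
        let vb = vld1q_f32(b.as_ptr().add(4 * i));
        acc = vfmaq_f32(acc, va, vb); // acc += va * vb, lane-wise
    }
    let mut sum = vaddvq_f32(acc); // horizontal add of the 4 lanes
    for i in 4 * chunks..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

/// aarch64 build: route to the NEON implementation.
#[cfg(target_arch = "aarch64")]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    // SAFETY: lengths are equal, so every 4-wide load stays inside both slices.
    unsafe { dot_neon(a, b) }
}

/// Every other target (including x86): route to the scalar fallback.
#[cfg(not(target_arch = "aarch64"))]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    dot_scalar(a, b)
}
```

Callers use `dot(&a, &b)` on every platform; an AVX2 path would be a third `cfg` branch gated on `target_arch = "x86_64"` plus a runtime feature check.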
+**Throughput comparison for Phase 0.5 (100M-500M tokens, ~200-400M trainable params, 3B active forward):**
+
+| Execution Mode | Forward tok/s | Effective Training tok/s | 100M Tokens | 500M Tokens |
+|---------------|--------------|------------------------|------------|------------|
+| Metal GPU (M4 Max) | ~500-1500 | ~300-700 | ~2-4 days | ~8-19 days |
+| **NEON SIMD only (M4 Max CPU)** | **~200-500** | **~100-300** | **~4-12 days** | **~19-58 days** |
+| **NEON SIMD only (M3 Ultra CPU)** | **~300-700** | **~150-400** | **~3-8 days** | **~14-39 days** |
+| x86 scalar (Ryzen 9, no AVX2 kernels) | ~50-150 | ~30-80 | ~14-39 days | ~72-193 days |
+
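The day counts in the table follow from dividing the token budget by the effective training throughput. A small sketch of that arithmetic, with hypothetical helper names (`wall_days`, `wall_days_range`) and assuming sustained throughput with no restarts or I/O stalls:

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Wall-clock days to process `tokens` at a sustained `tok_per_s`.
pub fn wall_days(tokens: f64, tok_per_s: f64) -> f64 {
    tokens / tok_per_s / SECONDS_PER_DAY
}

/// (best, worst) days for a throughput range given as (low, high) tok/s.
/// The best case uses the high end of the range, the worst case the low end.
pub fn wall_days_range(tokens: f64, tok_per_s: (f64, f64)) -> (f64, f64) {
    (
        wall_days(tokens, tok_per_s.1),
        wall_days(tokens, tok_per_s.0),
    )
}
```

For example, `wall_days_range(100e6, (100.0, 300.0))` gives about 3.9 to 11.6 days, which is the "~4-12 days" entry for NEON SIMD only on M4 Max.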
+**Why SIMD-only is ~2-3x slower than Metal (not 10x):**
+- Phase 0.5 training is dominated by the forward pass through the frozen 3B active parameters to compute loss against the teacher
+- The forward pass uses SIMD-accelerated GEMM/GEMV (`gemm_neon`/`gemv_neon`) which gets ~60-70% of Metal throughput for these matrix sizes
+- Gradient computation for the ~200-400M trainable params is pure ndarray — identical speed regardless of Metal availability
+- The training bottleneck is I/O (loading teacher activations from mmap) not compute, further narrowing the gap
+
+**Platform portability (bonus of SIMD-only mode):**
+
+SIMD-only mode extends Phase 0.5 beyond Mac Studio to any platform with ndarray support:
+
+| Platform | SIMD Path | Effective tok/s | Feasible? |
+|----------|----------|----------------|-----------|
+| Mac Studio M4 Max (aarch64) | NEON intrinsics | ~100-300 | **Yes — primary target** |
+| Mac Studio M3 Ultra (aarch64) | NEON intrinsics | ~150-400 | **Yes — faster than M4 Max** |
+| Linux ARM64 (Ampere/Graviton) | NEON intrinsics | ~80-200 | **Yes — cloud ARM instances** |
+| Linux x86 (Ryzen/Xeon) | Scalar fallback | ~30-80 | **Marginal — 100M tokens feasible (~14-39 days), 500M not practical** |
+| macOS Intel | Scalar fallback | ~20-50 | **Not recommended** |
+
+**Configuration for SIMD-only mode:**
+
+```rust
+// Phase 0.5 SIMD-only config (no Metal)
+let contrastive_config = ContrastiveConfig {
+    use_metal: false,    // Force CPU path in ContrastiveTrainer
+    ..Default::default()
+};
+
+// MicroLoRA — already pure SIMD/ndarray, no config change needed
+// TrainingPipeline — already pure ndarray
+// GrpoOptimizer — already pure ndarray
+// EwcRegularizer — already pure ndarray
+```
+
+The only config change is setting `use_metal: false` in `ContrastiveConfig`, which forces the CPU path in `ContrastiveTrainer`. All other RLM components are GPU-agnostic by design.
+
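The Metal-to-CPU fallback that `use_metal` controls can be sketched as below. This is a stand-in model of the selection logic only: the `Device` enum and `select_device` function here are hypothetical and are not Candle's `Device` type or the actual `ContrastiveTrainer` code.

```rust
/// Stand-in for a compute device (hypothetical; not Candle's `Device`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Metal,
    Cpu,
}

/// Pick a device: try Metal only when requested, and fall back to CPU
/// if the probe fails. `probe_metal` models a fallible device constructor.
pub fn select_device(
    use_metal: bool,
    probe_metal: impl FnOnce() -> Result<Device, String>,
) -> Device {
    if use_metal {
        probe_metal().unwrap_or(Device::Cpu)
    } else {
        // SIMD-only mode: never touch the GPU probe at all.
        Device::Cpu
    }
}
```

With `use_metal: false` the probe is never invoked, so the same binary behaves identically on machines with and without a Metal GPU.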
+**SIMD-only Phase 0.5 exit criteria (in addition to standard Phase 0.5 criteria):**
+- [ ] All training completes without Metal GPU dependency
+- [ ] `ContrastiveTrainer` runs with `use_metal: false` and produces equivalent router accuracy
+- [ ] MicroLoRA `forward_simd()` executes NEON path on aarch64 (verified via `cfg` compile check)
+- [ ] Training throughput measured and documented for SIMD-only vs Metal comparison
+
+**Recommendation**: Use Metal when available (~2-3x faster); fall back to SIMD-only when Metal is unavailable or on non-Mac platforms. The training code requires no changes: only `use_metal` in `ContrastiveConfig` needs to be set to `false`.
+
+**Reused**: 100% of existing RLM stack — `MicroLoRA` NEON forward, ndarray training, `ContrastiveTrainer` CPU fallback, all existing SIMD kernels.
+**New**: 0 lines. SIMD-only mode is already supported by the existing code paths; AD-20 documents this capability explicitly.
+
 ---
 
 ## Consequences
@@ -1177,6 +1274,8 @@ All three can be addressed by training only the FP16 components using the existi
 14. **Existing GGUF ecosystem**: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
 15. **Phase 0.5 RLM refinement at $0**: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
 16. **100% RLM reuse for Phase 0.5**: No new training infrastructure needed — all 7 RLM components are production-tested and wire together directly
+17. **SIMD-only Phase 0.5**: Entire RLM refinement pipeline runs on pure CPU SIMD (NEON on aarch64) without Metal GPU — only ~2-3x slower than Metal, extends platform support to Linux ARM64 and (with scalar fallback) x86
+18. **Single-flag SIMD mode**: All training components (MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer) are already GPU-agnostic; only `use_metal: false` in `ContrastiveConfig` is needed for full SIMD-only execution
 
 ### Negative
 
@@ -1187,6 +1286,7 @@ All three can be addressed by training only the FP16 components using the existi
 5. **Mixed-precision complexity**: Router (FP16) + experts (ternary) + attention (FP16/ternary) adds dispatch complexity
 6. **WASM limitation**: Ternary lookup table kernels may not translate efficiently to WASM SIMD
 7. **RLM scale gap**: Existing `RealContrastiveTrainer` targets 0.5B models (embedding_dim=896); scaling to 30B requires distributed data loading and increased batch sizes
+8. **No x86 SIMD kernels**: Current `kernels/matmul.rs` only implements NEON (aarch64); x86 falls to scalar fallback (~3-5x slower than NEON). Adding AVX2/AVX512 kernels would make x86 SIMD-only mode competitive but is not yet implemented
 
 ### Risks
 
@@ -1284,3 +1384,6 @@ All three can be addressed by training only the FP16 components using the existi
 21. STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
 22. Apple Mac Studio Technical Specifications (2025) — https://www.apple.com/mac-studio/specs/
 23. RuvLLM Metal GEMV integration: `crates/ruvllm/src/kernels/matmul.rs:1444-1582`
+24. RuvLLM MicroLoRA NEON SIMD forward: `crates/ruvllm/src/lora/micro_lora.rs:279-390` (forward_simd, forward_simd_neon_impl)
+25. RuvLLM NEON SIMD kernels: `crates/ruvllm/src/kernels/` (matmul: gemm_neon/gemv_neon, activations: silu_neon/gelu_neon/relu_neon, norm: rms_norm_neon, rope: apply_rope_neon)
+26. RuvLLM ContrastiveTrainer CPU fallback: `crates/ruvllm/src/training/contrastive.rs:171-175` (Metal → CPU fallback) and `contrastive.rs:475` (non-Candle pure CPU path)
diff --git a/docs/research/craftsman-ultra-30b-1bit-ddd.md b/docs/research/craftsman-ultra-30b-1bit-ddd.md
index cbf4e7e7..ad94765e 100644
--- a/docs/research/craftsman-ultra-30b-1bit-ddd.md
+++ b/docs/research/craftsman-ultra-30b-1bit-ddd.md
@@ -1,6 +1,6 @@
 # Domain-Driven Design: Craftsman Ultra 30b 1bit
 
-**Version:** 2.2
+**Version:** 2.3
 **Date:** 2026-02-03
 **Relates to:** ADR-017-craftsman-ultra-30b-1bit-bitnet-integration
 **Status:** Research / Pre-Implementation
@@ -84,6 +84,9 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
 | **Frozen Ternary** | Expert FFN weights locked to their PTQ {-1,0,+1} values during Phase 0.5 refinement — not differentiable, not modified. |
 | **LoRA Correction** | Small FP16 additive output from MicroLoRA that compensates for ternary quantization error: `Y = BitLinear(X) + LoRA(X)`. |
 | **Router Repair** | Contrastive fine-tuning of FP16 router weights to correct misrouting caused by expert output distribution changes after PTQ. |
+| **SIMD-Only Mode** | Phase 0.5 execution mode where all training runs on pure CPU SIMD (NEON on aarch64) without Metal GPU. All RLM components are GPU-agnostic except ContrastiveTrainer which has an explicit CPU fallback path. ~2-3x slower than Metal but extends platform support beyond macOS. |
+| **NEON Intrinsics** | Compiler intrinsics for the ARM NEON SIMD instruction set, used by MicroLoRA's `forward_simd_neon_impl()` for 8x-unrolled forward passes. Available on all Apple Silicon and ARM64 platforms. x86 platforms use the scalar fallback. |
+| **Scalar Fallback** | Platform-agnostic non-SIMD code path used when NEON (aarch64) is unavailable. Provides the same results up to floating-point rounding, at ~3-5x lower throughput. Enables Phase 0.5 on x86 Linux/Windows. |
 
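The **Frozen Ternary** and **LoRA Correction** entries combine into `Y = BitLinear(X) + LoRA(X)`. A minimal single-vector sketch, under stated assumptions: weights are stored as `Vec<Vec<i8>>` rows with one per-matrix scale, and the function names are hypothetical rather than the `MicroLoRA` API.

```rust
/// y = scale * (W x) for frozen ternary W with entries in {-1, 0, +1}.
/// No multiplications are needed inside the row sum, only add/subtract/skip.
pub fn ternary_matvec(w: &[Vec<i8>], scale: f32, x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| {
            let acc: f32 = row
                .iter()
                .zip(x)
                .map(|(&t, &xj)| match t {
                    1 => xj,
                    -1 => -xj,
                    _ => 0.0,
                })
                .sum();
            scale * acc
        })
        .collect()
}

/// Plain dense matrix-vector product for the small FP LoRA matrices.
pub fn dense_matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Y = BitLinear(X) + LoRA_B @ (LoRA_A @ X).
/// `lora_a` is rank x in_dim, `lora_b` is out_dim x rank; only these train.
pub fn corrected_forward(
    w: &[Vec<i8>],
    scale: f32,
    lora_a: &[Vec<f32>],
    lora_b: &[Vec<f32>],
    x: &[f32],
) -> Vec<f32> {
    let base = ternary_matvec(w, scale, x);
    let down = dense_matvec(lora_a, x); // project to rank-r space
    let up = dense_matvec(lora_b, &down); // project back to out_dim
    base.iter().zip(&up).map(|(b, c)| b + c).collect()
}
```

Because `w` is only read, never updated, gradients are needed for `lora_a` and `lora_b` alone, which is what keeps the trainable parameter count small.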
 ---
 
@@ -548,7 +551,7 @@ The RLM Training Orchestration Context operates in a **lightweight refinement mo
 | `ContrastiveTrainer` | Router validation | **Router repair** |
 | `GrpoOptimizer` | Per-expert distillation reward | **Scale factor optimization reward** |
 | `EwcRegularizer` | Cross-expert stability | **Cross-step stability** |
-| Platform | Cloud GPU (4× A100) | **Mac Studio (Metal)** |
+| Platform | Cloud GPU (4× A100) | **Mac Studio (Metal or SIMD-only)** |
 | Cost | $1,300+ | **$0** |
 | New code | ~30% new | **~0% new** (only thin orchestrator) |
 
@@ -939,7 +942,7 @@ Assuming GLM-4.7-Flash architecture with ~3B active parameters per token:
 
 ## 8.5 Training Infrastructure Model
 
-### Why Not Local CPU/SIMD
+### Why Not Local CPU/SIMD (for Phase 1+)
 
 The existing RuvLLM SIMD kernels (`crates/ruvllm/src/kernels/`) are **inference-only** — no backward pass, no gradient computation, no training support. The training code paths are:
 
@@ -947,7 +950,34 @@ The existing RuvLLM SIMD kernels (`crates/ruvllm/src/kernels/`) are **inference-
 - `EwcRegularizer` / LoRA training: Pure CPU via `ndarray` (no GPU acceleration)
 - SIMD kernels: Forward-pass optimizations only (flash attention, matmul, activations)
 
-At ~50-100 training tok/s on CPU, 200B tokens would require ~65 years. Not viable.
+At ~50-100 training tok/s on CPU, 200B tokens would require roughly 63-127 years. Not viable for Phase 1+.
+
+### Why SIMD-Only Works (for Phase 0.5)
+
+Phase 0.5 is fundamentally different from Phase 1+: it trains only ~200-400M FP16 parameters (1-2% of 30B) using existing RLM components that are already pure ndarray/CPU. The SIMD kernels are used for the forward pass through the frozen model to compute training loss, not for gradient computation.
+
+**GPU dependency analysis of Phase 0.5 components:**
+
+| Component | GPU Required? | SIMD Benefit |
+|-----------|--------------|-------------|
+| MicroLoRA forward pass | No — `forward_simd()` uses NEON intrinsics directly | ~3-4x over scalar |
+| MicroLoRA gradient computation | No — pure ndarray `apply_gradients()` | None (ndarray handles) |
+| TrainingPipeline | No — pure ndarray | None |
+| EwcRegularizer | No — pure ndarray | None |
+| GrpoOptimizer | No — pure ndarray | None |
+| ContrastiveTrainer | Optional — `use_metal: false` forces CPU | Candle CPU tensors |
+| Frozen model forward (loss computation) | No — SIMD inference kernels | NEON GEMM/GEMV ~3x |
+
+**Effective training throughput (SIMD-only, 100M-500M tokens):**
+
+| Platform | SIMD | tok/s | 100M tokens | Feasible? |
+|----------|------|-------|-------------|-----------|
+| Mac Studio M4 Max | NEON | ~100-300 | 4-12 days | **Yes** |
+| Mac Studio M3 Ultra | NEON | ~150-400 | 3-8 days | **Yes** |
+| Linux ARM64 (Graviton3) | NEON | ~80-200 | 6-14 days | **Yes** |
+| Linux x86 (Ryzen 9) | Scalar | ~30-80 | 14-39 days | **Marginal** |
+
+**Platform gap**: No AVX2/AVX512 SIMD kernels exist in `kernels/matmul.rs` — only `target_arch = "aarch64"` (NEON) vs scalar dispatch. x86 therefore falls to scalar, making it ~3-5x slower than NEON. Adding AVX2 kernels is an identified future improvement (see ADR-017 AD-20).
 
 ### Cloud GPU Distillation Strategy
 
@@ -987,10 +1017,12 @@ Expert FFN (~1B params):
 
 | Task | Location | Device | Duration |
 |------|----------|--------|----------|
+| **Phase 0.5 RLM refinement (Metal)** | **Mac Studio** | **Metal GPU + CPU ndarray** | **3-14 days** |
+| **Phase 0.5 RLM refinement (SIMD-only)** | **Mac Studio or Linux ARM64** | **NEON SIMD + CPU ndarray** | **5-28 days** |
 | Expert distillation (Phase 1) | GCP 4×A100 spot | CUDA | ~46 days |
-| Router contrastive validation | GCP 1×A100 or local Mac | CUDA/Metal | Hours |
+| Router contrastive validation | GCP 1×A100 or local Mac | CUDA/Metal/CPU | Hours |
 | Inference benchmark (TL1/TL2) | Local workstation | CPU SIMD (AVX2/NEON) | Minutes |
-| MicroLoRA adaptation | Local / edge | CPU (ndarray) | <1ms/update |
+| MicroLoRA adaptation | Local / edge | CPU (ndarray + NEON SIMD) | <1ms/update |
 | GGUF export | Local | CPU | Minutes |
 | Kernel correctness tests | Local | CPU SIMD | Seconds |
 
@@ -1089,6 +1121,9 @@ All changes are additive. No existing backend, model, or API is modified. The `B
 | 15 | Optimal MicroLoRA rank for Phase 0.5? | Quality vs speed | Open | Rank-1 is faster, rank-2 is 5% faster due to SIMD but has 2× params. Empirical testing needed. |
 | 16 | LoRA adapter persistence in GGUF? | Export format | Open | Store LoRA A/B matrices as separate tensors in GGUF, or merge into ternary+FP16 hybrid format? |
 | 17 | Phase 0.5 LoRA → Phase 1 distillation init? | Continuity | Open | Can Phase 0.5 LoRA corrections inform Phase 1 shadow weight initialization for faster convergence? |
+| 18 | Add AVX2/AVX512 SIMD kernels to `matmul.rs`? | x86 SIMD-only performance | Open | Current kernels only have NEON (aarch64) + scalar fallback. Adding AVX2 would make x86 SIMD-only Phase 0.5 ~3-5x faster. Is it worth the effort vs just using ARM? |
+| 19 | SIMD-only vs Metal quality equivalence? | Phase 0.5 validation | Open | Does ContrastiveTrainer produce identical router accuracy on CPU vs Metal? Need empirical comparison to confirm no numerical divergence. |
+| 20 | Cloud ARM64 instances for SIMD-only Phase 0.5? | Platform portability | Open | AWS Graviton3/4 or Ampere Altra instances with 128+ GB RAM could run SIMD-only Phase 0.5 without Mac Studio. Cost-competitive? |
 
 ---
 
@@ -1111,3 +1146,6 @@ All changes are additive. No existing backend, model, or API is modified. The `B
 - BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025)
 - bartowski, GLM-4.7-Flash-GGUF quantizations: https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
 - llama.cpp IQ1_S blind testing: https://github.com/ggml-org/llama.cpp/discussions/5962
+- RuvLLM MicroLoRA NEON SIMD: `crates/ruvllm/src/lora/micro_lora.rs:279-390`
+- RuvLLM NEON SIMD kernels: `crates/ruvllm/src/kernels/` (gemm_neon, gemv_neon, silu_neon, gelu_neon, relu_neon, rms_norm_neon, apply_rope_neon)
+- RuvLLM ContrastiveTrainer CPU fallback: `crates/ruvllm/src/training/contrastive.rs:171-175`