diff --git a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
index 6ac2f5b10..03f26f29e 100644
--- a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
+++ b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
@@ -721,6 +721,82 @@ let expert_results: Vec = experts
     .collect();
 ```
 
+### AD-17: Training Infrastructure — Cloud GPU over Local SIMD
+
+**Decision**: Use cloud A100/H100 GPU instances (Google Cloud or a lower-cost provider such as DataCrunch) for distillation training. Reserve local CPU/SIMD for inference validation, MicroLoRA adaptation, and GGUF export only.
+
+**Rationale**: Local CPU/SIMD training is impractical at the 200B+ token scale required for expert distillation. The existing RuvLLM SIMD kernels (`kernels/`) are inference-only: they implement no backpropagation or gradient computation. The training code (`real_trainer.rs:178-184`) supports Metal (macOS) or CPU but not CUDA. At a CPU training throughput of ~50-100 tok/s, 200B tokens would take roughly 65-130 years (~65 years at the optimistic 100 tok/s).
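The duration estimate in the rationale is plain division of token count by throughput. A minimal sketch of that arithmetic, assuming sustained throughput with no restarts or checkpoint overhead; the names `training_seconds` and `training_years` are illustrative and not part of RuvLLM:

```rust
/// Seconds in a 365.25-day year; an assumption used only for unit conversion.
const SECONDS_PER_YEAR: f64 = 365.25 * 86_400.0;

/// Wall-clock seconds to process `tokens` at a sustained `tok_per_s`.
/// Hypothetical helper: ignores restarts, evaluation, and checkpoint overhead.
fn training_seconds(tokens: f64, tok_per_s: f64) -> f64 {
    assert!(tok_per_s > 0.0, "throughput must be positive");
    tokens / tok_per_s
}

/// The same estimate expressed in years.
fn training_years(tokens: f64, tok_per_s: f64) -> f64 {
    training_seconds(tokens, tok_per_s) / SECONDS_PER_YEAR
}
```

With `tokens = 200e9`, `training_years` gives about 63 years at 100 tok/s and about 127 years at 50 tok/s, so the ~65-year figure corresponds to the upper end of the quoted CPU throughput range.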
+
+**Memory analysis (per-expert distillation):**
+
+| Component | Size | Notes |
+|-----------|------|-------|
+| Single expert FFN shadow weights (FP16) | ~2 GB | ~1B params per expert (28B ÷ N experts) |
+| Gradients (FP32) | ~4 GB | Full precision for STE backprop |
+| AdamW optimizer state (2× FP32) | ~8 GB | First + second moment |
+| Teacher activations cache | ~1 GB | Per-batch FP16 |
+| EWC++ Fisher diagonal | ~0.5 GB | Per-expert accumulated |
+| **Per-expert total** | **~15.5 GB** | Fits in A100 40GB with headroom |
+
+**Full model simultaneous (Phase 2+):**
+
+| Component | Size | Notes |
+|-----------|------|-------|
+| 30B shadow weights (FP16) | ~60 GB | Requires A100 80GB or H100 |
+| Gradients + optimizer | ~360 GB | FP32 gradients (~120 GB) + AdamW state (~240 GB); requires multi-GPU parallelism |
+| **Total** | **~430 GB** | Exceeds 4× 80GB (320 GB) as listed; 4-GPU runs need a reduced gradient/optimizer footprint |
+
+**Throughput and cost comparison:**
+
+| Platform | Training tok/s | Time (200B tok, Phase 1) | Est. cost (USD) |
+|----------|---------------|--------------------------|------|
+| CPU AVX2 (Ryzen 9) | ~50-100 | ~65 years | N/A |
+| Apple M4 Max (Metal) | ~500-1000 | ~6.5 years | N/A |
+| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 |
+| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 |
+| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | **~$1,300** |
+| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 |
+| 4× H100 (DataCrunch) | ~140,000 | ~16 days | **~$3,200** |
+
+**Recommended infrastructure per phase:**
+
+| Phase | Instance | Duration | Estimated Cost | Strategy |
+|-------|----------|----------|----------------|----------|
+| Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
+| Phase 1 (router validation) | 1× A100 or local Metal | ~2-4 hours | <$10 | Contrastive training on router only (~2B params) |
+| Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
+| Phase 3 (native training, 4T tok) | 8× H100 cluster | ~90-180 days | $15,000-$30,000 | Full from-scratch; depends on funding |
+| Inference validation | Local CPU (AVX2/NEON) | Continuous | $0 | SIMD kernels validate TL1/TL2/I2_S correctness |
+| MicroLoRA adaptation | Local CPU | <1ms/update | $0 | Existing ndarray-based EWC++ pipeline |
+
+**Required code change**: Add CUDA device dispatch to `RealContrastiveTrainer`:
+```rust
+// Current (real_trainer.rs:178-184):
+let device = if config.use_metal {
+    Device::new_metal(0).unwrap_or(Device::Cpu)
+} else {
+    Device::Cpu
+};
+
+// Required for cloud GPU training:
+let device = if config.use_cuda {
+    Device::new_cuda(config.cuda_device_id).unwrap_or(Device::Cpu)
+} else if config.use_metal {
+    Device::new_metal(0).unwrap_or(Device::Cpu)
+} else {
+    Device::Cpu
+};
+```
+
+This adds two fields to `RealTrainingConfig` (`use_cuda: bool`, `cuda_device_id: usize`) and changes about three lines of device selection. The rest of the Candle training pipeline (tensors, optimizer, loss computation) is device-agnostic and should run unchanged on CPU, Metal, or CUDA.
+
+**Cost optimization strategies:**
+1. **Spot instances**: GCP A100 spot at ~$1/GPU-hr (70% off on-demand); requires checkpointing every 30 min
+2. **DataCrunch / Lambda Labs**: H100 at $1.99-$2.10/hr (40-50% below GCP on-demand)
+3. **Expert-sequential on fewer GPUs**: Distill 1 expert at a time on 1× A100 80GB (~$1.50/hr), increasing wall time but reducing per-hour cost
+4. **Mixed precision training**: FP16 shadow weights + BF16 activations reduce memory, enabling smaller instances
+5. **Gradient checkpointing**: Trade compute for memory to fit on fewer GPUs
+
 ---
 
 ## Consequences
diff --git a/docs/research/craftsman-ultra-30b-1bit-ddd.md b/docs/research/craftsman-ultra-30b-1bit-ddd.md
index 6ab7d1232..5d9920bf9 100644
--- a/docs/research/craftsman-ultra-30b-1bit-ddd.md
+++ b/docs/research/craftsman-ultra-30b-1bit-ddd.md
@@ -880,6 +880,72 @@ Assuming GLM-4.7-Flash architecture with ~3B active parameters per token:
 
 ---
 
+## 8.5 Training Infrastructure Model
+
+### Why Not Local CPU/SIMD
+
+The existing RuvLLM SIMD kernels (`crates/ruvllm/src/kernels/`) are **inference-only**: no backward pass, no gradient computation, no training support. The training code paths are:
+
+- `RealContrastiveTrainer`: Candle tensors on `Device::Metal` or `Device::Cpu` (no CUDA)
+- `EwcRegularizer` / LoRA training: Pure CPU via `ndarray` (no GPU acceleration)
+- SIMD kernels: Forward-pass optimizations only (flash attention, matmul, activations)
+
+At ~50-100 training tok/s on CPU, 200B tokens would take roughly 65-130 years, which is not viable.
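To put the CPU figure in perspective, the same arithmetic can be inverted to ask what sustained throughput a given time budget demands. A minimal sketch; `required_tok_per_s` and `shortfall_factor` are hypothetical helpers, and the 365-day budget used in the note below is an arbitrary example rather than a project target:

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Sustained tokens/second needed to process `tokens` within `budget_days`.
fn required_tok_per_s(tokens: f64, budget_days: f64) -> f64 {
    assert!(budget_days > 0.0, "time budget must be positive");
    tokens / (budget_days * SECONDS_PER_DAY)
}

/// How many times faster than `available_tok_per_s` training must run
/// in order to meet the budget.
fn shortfall_factor(tokens: f64, budget_days: f64, available_tok_per_s: f64) -> f64 {
    assert!(available_tok_per_s > 0.0, "throughput must be positive");
    required_tok_per_s(tokens, budget_days) / available_tok_per_s
}
```

For 200B tokens and a 365-day budget, `required_tok_per_s` comes to roughly 6,300 tok/s, which is about 63 times the upper CPU estimate of 100 tok/s.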
+
+### Cloud GPU Distillation Strategy
+
+**Per-expert distillation fits on a single GPU (A100 40GB or larger):**
+
+```
+Expert FFN (~1B params):
+  Shadow weights (FP16):     2 GB
+  Gradients (FP32):          4 GB
+  AdamW state (2×FP32):      8 GB
+  Teacher activations:       1 GB
+  EWC++ Fisher:            0.5 GB
+  ────────────────────────────────
+  Total per expert:      ~15.5 GB  ✓ Fits A100 40GB
+```
+
+**Expert-parallel: 4 experts distill concurrently on 4× A100/H100:**
+
+```
+┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
+│  GPU 0       │  │  GPU 1       │  │  GPU 2       │  │  GPU 3       │
+│  Expert 0    │  │  Expert 1    │  │  Expert 2    │  │  Expert 3    │
+│  BitLinear   │  │  BitLinear   │  │  BitLinear   │  │  BitLinear   │
+│  + EWC       │  │  + EWC       │  │  + EWC       │  │  + EWC       │
+│  + GRPO      │  │  + GRPO      │  │  + GRPO      │  │  + GRPO      │
+└──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
+       │                 │                 │                 │
+       └─────────────────┴─────────────────┴─────────────────┘
+                                  │
+                     ┌────────────▼───────────┐
+                     │  Fisher Accumulation   │
+                     │  (cross-expert EWC)    │
+                     └────────────────────────┘
+```
+
+### What Runs Where
+
+| Task | Location | Device | Duration |
+|------|----------|--------|----------|
+| Expert distillation (Phase 1) | GCP 4×A100 spot | CUDA | ~46 days |
+| Router contrastive validation | GCP 1×A100 or local Mac | CUDA/Metal | Hours |
+| Inference benchmark (TL1/TL2) | Local workstation | CPU SIMD (AVX2/NEON) | Minutes |
+| MicroLoRA adaptation | Local / edge | CPU (ndarray) | <1ms/update |
+| GGUF export | Local | CPU | Minutes |
+| Kernel correctness tests | Local | CPU SIMD | Seconds |
+
+### Required Code Change
+
+Add CUDA device dispatch to `RealContrastiveTrainer` (`training/real_trainer.rs:178-184`):
+- New config fields: `use_cuda: bool`, `cuda_device_id: usize`
+- Device selection: CUDA → Metal → CPU fallback chain
+- The `candle` and `cuda` Cargo features are already available in `Cargo.toml`
+
+---
+
 ## 9. Testing Strategy
 
 ### Unit Tests (Per Context)