docs: Add AD-17 training infrastructure analysis (cloud GPU vs local SIMD)

ADR-017: Add AD-17 with detailed memory budget analysis showing per-expert
distillation fits in A100 40GB (~15.5GB), full model requires 4×A100 80GB
(~430GB). CPU SIMD training infeasible at 200B+ tokens (~65 years on AVX2).
Recommend GCP 4×A100 spot instances (~$1,300 for Phase 1) or DataCrunch
H100 ($1.99/hr). Includes cost comparison across 6 platforms, per-phase
infrastructure mapping, and required CUDA device dispatch code change for
RealContrastiveTrainer.

DDD: Add section 8.5 Training Infrastructure Model with expert-parallel
GPU topology diagram, what-runs-where matrix, and required code change
summary.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
This commit is contained in:
Claude 2026-02-03 04:44:29 +00:00
parent 261b510d8e
commit d5caa4dacf
2 changed files with 142 additions and 0 deletions

View file

@ -721,6 +721,82 @@ let expert_results: Vec<DistillResult> = experts
.collect();
```
### AD-17: Training Infrastructure — Cloud GPU over Local SIMD
**Decision**: Use Google Cloud A100/H100 GPU instances for distillation training. Reserve local CPU/SIMD for inference validation, MicroLoRA adaptation, and GGUF export only.
**Rationale**: Local CPU/SIMD training is mathematically infeasible at the 200B+ token scale required for expert distillation. The existing RuvLLM SIMD kernels (`kernels/`) are inference-only — no backpropagation or gradient computation. The training code (`real_trainer.rs:178-184`) supports Metal (macOS) or CPU but not CUDA, and CPU throughput at ~50-100 tok/s training would require ~65 years for 200B tokens.
**Memory analysis (per-expert distillation):**
| Component | Size | Notes |
|-----------|------|-------|
| Single expert FFN shadow weights (FP16) | ~2 GB | ~1B params per expert (28B ÷ N experts) |
| Gradients (FP32) | ~4 GB | Full precision for STE backprop |
| AdamW optimizer state (2× FP32) | ~8 GB | First + second moment |
| Teacher activations cache | ~1 GB | Per-batch FP16 |
| EWC++ Fisher diagonal | ~0.5 GB | Per-expert accumulated |
| **Per-expert total** | **~15.5 GB** | Fits in A100 40GB with headroom |
**Full model simultaneous (Phase 2+):**
| Component | Size | Notes |
|-----------|------|-------|
| 30B shadow weights (FP16) | ~60 GB | Requires A100 80GB or H100 |
| Gradients + optimizer | ~360 GB | Requires multi-GPU parallelism |
| **Total** | **~430 GB** | 4× A100 80GB or 4× H100 80GB |
**Throughput and cost comparison:**
| Platform | Training tok/s | Time (200B tok, Phase 1) | Cost |
|----------|---------------|--------------------------|------|
| CPU AVX2 (Ryzen 9) | ~50-100 | ~65 years | N/A |
| Apple M4 Max (Metal) | ~500-1000 | ~6.5 years | N/A |
| 1× A100 80GB (GCP on-demand) | ~15,000 | ~155 days | ~$3,700 |
| 4× A100 80GB (GCP on-demand) | ~50,000 | ~46 days | ~$4,400 |
| 4× A100 80GB (GCP spot) | ~50,000 | ~46 days | **~$1,300** |
| 1× H100 (DataCrunch) | ~40,000 | ~58 days | ~$2,900 |
| 4× H100 (DataCrunch) | ~140,000 | ~16 days | **~$3,200** |
**Recommended infrastructure per phase:**
| Phase | Instance | Duration | Estimated Cost | Strategy |
|-------|----------|----------|----------------|----------|
| Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
| Phase 1 (router validation) | 1× A100 or local Metal | ~2-4 hours | <$10 | Contrastive training on router only (~2B params) |
| Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
| Phase 3 (native training, 4T tok) | 8× H100 cluster | ~90-180 days | $15,000-$30,000 | Full from-scratch; depends on funding |
| Inference validation | Local CPU (AVX2/NEON) | Continuous | $0 | SIMD kernels validate TL1/TL2/I2_S correctness |
| MicroLoRA adaptation | Local CPU | <1ms/update | $0 | Existing ndarray-based EWC++ pipeline |
**Required code change**: Add CUDA device dispatch to `RealContrastiveTrainer`:
```rust
// Current (real_trainer.rs:178-184):
let device = if config.use_metal {
Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
Device::Cpu
};
// Required for cloud GPU training:
let device = if config.use_cuda {
Device::new_cuda(config.cuda_device_id).unwrap_or(Device::Cpu)
} else if config.use_metal {
Device::new_metal(0).unwrap_or(Device::Cpu)
} else {
Device::Cpu
};
```
This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_device_id: usize`) and a 3-line change to device selection. The rest of the Candle training pipeline (tensors, optimizer, loss computation) works identically across CPU/Metal/CUDA.
**Cost optimization strategies:**
1. **Spot instances**: GCP A100 spot at ~$1/GPU-hr (70% off on-demand) — requires checkpointing every 30 min
2. **DataCrunch / Lambda Labs**: H100 at $1.99-$2.10/hr (40-50% below GCP on-demand)
3. **Expert-sequential on fewer GPUs**: Distill 1 expert at a time on 1× A100 80GB (~$1.50/hr), increasing wall time but reducing per-hour cost
4. **Mixed precision training**: FP16 shadow weights + BF16 activations reduces memory, enabling smaller instances
5. **Gradient checkpointing**: Trade compute for memory to fit on fewer GPUs
---
## Consequences

View file

@ -880,6 +880,72 @@ Assuming GLM-4.7-Flash architecture with ~3B active parameters per token:
---
## 8.5 Training Infrastructure Model
### Why Not Local CPU/SIMD
The existing RuvLLM SIMD kernels (`crates/ruvllm/src/kernels/`) are **inference-only** — no backward pass, no gradient computation, no training support. The training code paths are:
- `RealContrastiveTrainer`: Candle tensors on `Device::Metal` or `Device::Cpu` (no CUDA)
- `EwcRegularizer` / LoRA training: Pure CPU via `ndarray` (no GPU acceleration)
- SIMD kernels: Forward-pass optimizations only (flash attention, matmul, activations)
At ~50-100 training tok/s on CPU, 200B tokens would require ~65 years. Not viable.
### Cloud GPU Distillation Strategy
**Per-expert distillation fits in a single A100 80GB:**
```
Expert FFN (~1B params):
Shadow weights (FP16): 2 GB
Gradients (FP32): 4 GB
AdamW state (2×FP32): 8 GB
Teacher activations: 1 GB
EWC++ Fisher: 0.5 GB
────────────────────────────────
Total per expert: ~15.5 GB ✓ Fits A100 40GB
```
**Expert-parallel: 4 experts distill concurrently on 4× A100/H100:**
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │
│ Expert 0 │ │ Expert 1 │ │ Expert 2 │ │ Expert 3 │
│ BitLinear │ │ BitLinear │ │ BitLinear │ │ BitLinear │
│ + EWC │ │ + EWC │ │ + EWC │ │ + EWC │
│ + GRPO │ │ + GRPO │ │ + GRPO │ │ + GRPO │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │ │
└─────────────────┴─────────────────┴─────────────────┘
┌────────────▼───────────┐
│ Fisher Accumulation │
│ (cross-expert EWC) │
└────────────────────────┘
```
### What Runs Where
| Task | Location | Device | Duration |
|------|----------|--------|----------|
| Expert distillation (Phase 1) | GCP 4×A100 spot | CUDA | ~46 days |
| Router contrastive validation | GCP 1×A100 or local Mac | CUDA/Metal | Hours |
| Inference benchmark (TL1/TL2) | Local workstation | CPU SIMD (AVX2/NEON) | Minutes |
| MicroLoRA adaptation | Local / edge | CPU (ndarray) | <1ms/update |
| GGUF export | Local | CPU | Minutes |
| Kernel correctness tests | Local | CPU SIMD | Seconds |
### Required Code Change
Add CUDA device dispatch to `RealContrastiveTrainer` (`training/real_trainer.rs:178-184`):
- New config field: `use_cuda: bool`, `cuda_device_id: usize`
- Device selection: CUDA → Metal → CPU fallback chain
- Existing `candle` + `cuda` Cargo features already available in `Cargo.toml`
---
## 9. Testing Strategy
### Unit Tests (Per Context)