mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-05-28 01:44:41 +00:00
docs: Add Phase 0 PTQ rapid prototype to Craftsman Ultra ADR/DDD
Research post-training quantization feasibility for GLM-4.7-Flash as a low-cost ($100, 2-4 hrs) validation step before full distillation ($1,300+).

ADR-017 changes:
- Restructured Option A from "Rejected" to tiered PTQ analysis (0A-0D)
- Added AD-18: PT-BitNet post-training quantization strategy
- Updated phased decision to A(0C) → D → C → B
- Added Phase 0 exit criteria and validation benchmarks
- Documented existing community GGUFs (bartowski, unsloth, ngxson)
- Identified RuvLLM IQ1_S dequant gap (type 19 parsed, not implemented)
- Added PT-BitNet, BitDistill, and STBLLM references

DDD v2.1 changes:
- Added 6 Phase 0 ubiquitous language terms (PT-BitNet, BITNET_T158, etc.)
- Updated Section 3.4 with dual-mode quantization pipeline (PTQ + distillation)
- Updated compatibility matrix with Phase 0 vs Phase 1+ columns
- Added 3 new open questions (calibration corpus, GGUF type, weight migration)

Key finding: IQ1_S ≠ BitNet b1.58. Generic codebook PTQ produces garbled output; PT-BitNet absmean ternary quantization is viable for kernel validation.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
parent 5ccf99d23b
commit 3686dfc52f
2 changed files with 280 additions and 68 deletions
@ -179,28 +179,87 @@ RuvLLM contains a mature reinforcement-learning-from-model-feedback (RLM) traini
|
|||
|
||||
## Considered Options
|
||||
|
||||
### Option A: Post-Training Quantization of GLM-4.7-Flash to 1-bit
|
||||
### Option A: Post-Training Quantization of GLM-4.7-Flash (PTQ Tiers)
|
||||
|
||||
Take the existing BF16 GLM-4.7-Flash weights and quantize to IQ1_S format.
|
||||
Take the existing BF16 GLM-4.7-Flash weights and quantize to low-bit formats without full distillation training.
|
||||
|
||||
**Approach:**
|
||||
1. Download GLM-4.7-Flash BF16 weights from HuggingFace
|
||||
2. Apply GPTQ/AWQ-style calibration with IQ1_S target
|
||||
3. Serve via existing GGUF pipeline
|
||||
**Critical distinction — IQ1_S ≠ BitNet b1.58:**
|
||||
|
||||
**Pros:**
|
||||
- No training infrastructure needed
|
||||
- Immediate availability
|
||||
- Leverages existing GGUF IQ1_S support
|
||||
| Property | GGUF IQ1_S | BitNet b1.58 |
|----------|-----------|--------------|
| Encoding | Codebook-based importance quantization | Ternary {-1, 0, +1} via absmean |
| Bits/weight | 1.56 bpw | 1.58 bpw |
| Inference | **Dequantize → FP multiply** | **Integer addition only (no multiply)** |
| Speed benefit | Memory bandwidth only | Bandwidth + compute (multiplication-free) |
| How obtained | Post-training quantization | Trained from scratch or distilled |
| Quality at 7B | Near-random / broken outputs | Matches FP16 |
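
To make the "Inference" row concrete, here is a minimal Rust sketch of a ternary dot product whose inner loop uses only additions and subtractions. The names `ternary_dot` and `reference_dot` and the choice of `i8` activations are illustrative assumptions, not RuvLLM or bitnet.cpp APIs; production kernels such as TL1 rely on lookup tables and SIMD rather than a scalar loop.

```rust
/// Dot product of integer activations with ternary weights in {-1, 0, +1}.
/// The inner loop only adds or subtracts; no multiplication is performed.
fn ternary_dot(activations: &[i8], weights: &[i8]) -> i32 {
    assert_eq!(activations.len(), weights.len());
    let mut acc: i32 = 0;
    for (&a, &w) in activations.iter().zip(weights) {
        match w {
            1 => acc += a as i32,
            -1 => acc -= a as i32,
            0 => {} // zero weight: the activation is skipped entirely
            _ => panic!("weight {} is not ternary", w),
        }
    }
    acc
}

/// Reference implementation with explicit multiplication, for comparison.
fn reference_dot(activations: &[i8], weights: &[i8]) -> i32 {
    activations
        .iter()
        .zip(weights)
        .map(|(&a, &w)| a as i32 * w as i32)
        .sum()
}
```

Both functions return the same value for any ternary weight vector. An IQ1_S path, by contrast, would first reconstruct floating-point weights from the codebook and then multiply.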
|
||||
|
||||
**Existing GLM-4.7-Flash GGUF quantizations available** (community-published):
|
||||
|
||||
| Repository | Lowest Quant | Size | Notes |
|-----------|-------------|------|-------|
| [bartowski/zai-org_GLM-4.7-Flash-GGUF](https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF) | IQ2_XXS (2.06 bpw) | 7.62 GB | No IQ1_S published |
| [unsloth/GLM-4.7-Flash-GGUF](https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF) | UD-Q2_K_XL (2.7 bpw dynamic) | ~11 GB | Dynamic quant, recommended |
| [ngxson/GLM-4.7-Flash-GGUF](https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF) | Q4_K_M (4.5 bpw) | 18.1 GB | 55 variants available |
|
||||
|
||||
**No IQ1_S quantization** of GLM-4.7-Flash appears in any of the three repositories listed above. That absence is itself a signal that quantizers consider IQ1_S too aggressive for practical use with this model.
|
||||
|
||||
**Sub-options ranked by increasing effort:**
|
||||
|
||||
**Sub-option 0A: Download existing IQ2_XXS GGUF**
|
||||
- Download bartowski's IQ2_XXS at 7.62 GB
|
||||
- Cost: $0, time: 5 minutes (just download)
|
||||
- Quality: ~75-80% of FP16 (2.06 bpw is usable per community reports)
|
||||
- NOT 1-bit, NOT BitNet — just aggressive 2-bit compression
|
||||
- RuvLLM gap: IQ2_XXS dequantization not implemented (falls to error catch-all in `quantization.rs:358`)
|
||||
- RuvLLM Q2_K dequantization IS implemented and works
|
||||
|
||||
**Sub-option 0B: Quantize to IQ1_S via llama.cpp**
|
||||
- Run `llama-quantize GLM-4.7-Flash-F16.gguf IQ1_S` with importance matrix
|
||||
- Cost: $0, time: ~30 minutes on CPU
|
||||
- Quality: **SEVERE degradation** — blind testing shows IQ1_S is "broken rather than just bad" on 7B; outputs contain garbled text despite acceptable perplexity scores. 30B MoE may survive better due to parameter redundancy, but expert routing is highly sensitive to weight perturbation
|
||||
- RuvLLM gap: IQ1_S dequantization not implemented (`quantization.rs:358` catch-all)
|
||||
- Does NOT achieve BitNet multiplication-free inference
|
||||
|
||||
**Sub-option 0C: PT-BitNet ternary PTQ** (per [PT-BitNet paper](https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X))
|
||||
- Apply absmean ternary quantization (BitNet's native method) to pre-trained weights with calibration data
|
||||
- Cost: ~$50-200 (small GPU calibration run, 1-4 hours on 1× A100)
|
||||
- Quality: ~55-65% downstream accuracy (PT-BitNet reports 61% on 70B; GLM-4.7-Flash's 30B-A3B may differ)
|
||||
- THIS IS proper BitNet ternary format → **enables multiplication-free inference with AD-4 kernels**
|
||||
- Requires implementing absmean ternary quantizer (~200-300 lines of new code)
|
||||
- Requires calibration dataset (WikiText-2 or similar, ~1M tokens)
|
||||
|
||||
**Sub-option 0D: BitDistill Lite (10B tokens)** (per [BitDistill paper](https://arxiv.org/html/2510.13998v1))
|
||||
- 3-stage: SubLN insertion → 10B-token continued pre-training → KL + attention distillation
|
||||
- Cost: ~$200-500 (8× GPU hours on Mi300X/A100 class)
|
||||
- Quality: **~90-95% of FP16** (BitDistill reports 88.17% vs 88.01% FP16 on MNLI at 0.6B)
|
||||
- Near-full quality recovery with only 10B tokens (vs 200B+ for Phase 1 full distillation)
|
||||
- Requires SubLN module insertion + distillation fine-tuning loop
|
||||
- Bridges gap between pure PTQ and full expert distillation (Phase 1)
|
||||
|
||||
**Summary comparison:**
|
||||
|
||||
| Sub-option | Cost | Time | Quality (est.) | BitNet Speedup | RuvLLM Ready |
|-----------|------|------|---------------|----------------|-------------|
| 0A: IQ2_XXS download | $0 | 5 min | ~75-80% | No | No (missing dequant) |
| 0B: IQ1_S quantize | $0 | 30 min | ~40-50% | No | No (missing dequant) |
| 0C: PT-BitNet PTQ | ~$100 | 2-4 hrs | ~55-65% | **Yes** | Needs quantizer impl |
| 0D: BitDistill Lite | ~$300 | 1-2 days | ~90-95% | **Yes** | Needs SubLN + KD loop |
|
||||
|
||||
**Pros (of PTQ approach generally):**
|
||||
- Immediate or near-immediate results ($0-$300, minutes to days)
|
||||
- No large-scale training infrastructure
|
||||
- Validates inference pipeline and kernels before investing in full distillation
|
||||
- Sub-option 0C produces genuine BitNet ternary format for kernel development
|
||||
|
||||
**Cons:**
|
||||
- **Severe quality degradation** — post-training 1-bit quantization loses 30-50% quality
|
||||
- BitNet research explicitly states native training is required for quality parity
|
||||
- MoE routing scores collapse under extreme quantization
|
||||
- Does not achieve BitNet's multiplication-free inference (still uses dequant-then-multiply)
|
||||
- No ternary lookup table optimization possible
|
||||
- Sub-options 0A/0B: Quality too degraded for production coding tasks
|
||||
- Sub-options 0A/0B: No BitNet multiplication-free inference (still dequant-then-multiply)
|
||||
- Sub-option 0C: Significant quality loss (~35-45%) vs teacher — adequate for kernel validation, not production
|
||||
- Sub-option 0D: Requires non-trivial training code (SubLN, KD loss) but much less than full Phase 1
|
||||
- IQ1_S blind test results: statistically indistinguishable from random on smaller models
|
||||
|
||||
**Verdict: Rejected** — Quality loss makes this unsuitable for production coding tasks.
|
||||
**Verdict: Recommended as Phase 0 rapid prototype** — Sub-option 0C (PT-BitNet PTQ) is the optimal entry point: $100, 2-4 hours, produces genuine BitNet ternary format for kernel development and inference validation. Sub-option 0D (BitDistill Lite) bridges to Phase 1 if higher quality is needed before committing to full expert distillation. Sub-options 0A/0B are useful only as baselines for comparison.
|
||||
|
||||
### Option B: Native BitNet Training of GLM-4.7-Flash Architecture (Full)
|
||||
|
||||
|
|
@ -303,22 +362,40 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
|
|||
|
||||
## Decision
|
||||
|
||||
**Phased approach: D → C → B**
|
||||
**Phased approach: A(0C) → D → C → B**
|
||||
|
||||
### Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)
|
||||
- **Timeline**: 1-2 weeks
|
||||
- **Cost**: ~$100 (calibration on 1× A100 spot, 2-4 hours)
|
||||
- **Goal**: Produce a genuine BitNet ternary GGUF of GLM-4.7-Flash for kernel development, inference pipeline validation, and baseline quality measurement
|
||||
- **Deliverables**:
|
||||
- PT-BitNet ternary quantized GLM-4.7-Flash GGUF file (~6-7 GB)
|
||||
- Absmean ternary quantizer implementation (~200-300 lines)
|
||||
- IQ1_S / BITNET_T158 dequantization kernel in RuvLLM
|
||||
- Baseline quality benchmarks (HumanEval, MMLU) to compare against Phase 1+
|
||||
- Functional TL1 kernel validated against ternary model
|
||||
- **Expected quality**: ~55-65% of GLM-4.7-Flash (adequate for kernel validation, not production)
|
||||
- **Key value**: De-risks Phase 1 by validating the entire inference pipeline (GGUF loading → ternary dequant → TL1 kernel → MoE routing → token generation) at near-zero cost before committing to $1,300+ distillation training
|
||||
- **Optional upgrade (0D)**: If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300, 1-2 days) to reach ~90-95% quality
|
||||
|
||||
### Phase 1: BitNet Expert Replacement (Option D)
|
||||
- **Timeline**: 3-4 months
|
||||
- **Goal**: Validate MoE + BitNet integration, build inference kernels
|
||||
- **Cost**: ~$1,300-$2,000 (4× A100 spot, ~46 days)
|
||||
- **Goal**: Full-quality ternary experts via distillation, validated against Phase 0 baseline
|
||||
- **Deliverables**: Working Craftsman Ultra 30b 1bit (mixed: ternary experts, FP16 attention)
|
||||
- **Expected quality**: ~90-95% of GLM-4.7-Flash on coding benchmarks
|
||||
- **Prerequisites**: Phase 0 validates inference pipeline works end-to-end
|
||||
|
||||
### Phase 2: Full BitNet Distillation (Option C)
|
||||
- **Timeline**: 4-6 months after Phase 1
|
||||
- **Cost**: ~$2,500-$5,000 (4× H100, 16-32 days)
|
||||
- **Goal**: Full ternary model with complete BitNet inference optimization
|
||||
- **Deliverables**: Craftsman Ultra 30b 1bit v2 (full ternary except router/embed/head)
|
||||
- **Expected quality**: ~95-98% of GLM-4.7-Flash
|
||||
|
||||
### Phase 3: Native BitNet Training (Option B)
|
||||
- **Timeline**: 6-12 months after Phase 2, contingent on funding/compute
|
||||
- **Cost**: ~$15,000-$30,000 (8× H100 cluster, 90-180 days)
|
||||
- **Goal**: Surpass GLM-4.7-Flash quality with native ternary training
|
||||
- **Deliverables**: Craftsman Ultra 30b 1bit v3 (trained from scratch)
|
||||
- **Expected quality**: 100%+ of GLM-4.7-Flash (BitNet at scale exceeds FP16)
|
||||
|
|
@ -797,6 +874,91 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
|
|||
4. **Mixed precision training**: FP16 shadow weights + BF16 activations reduces memory, enabling smaller instances
|
||||
5. **Gradient checkpointing**: Trade compute for memory to fit on fewer GPUs
|
||||
|
||||
### AD-18: Phase 0 — PT-BitNet Post-Training Quantization Strategy
|
||||
|
||||
**Decision**: Implement a PT-BitNet ternary post-training quantizer as Phase 0, producing a rapid prototype GGUF for inference pipeline validation before investing in full distillation.
|
||||
|
||||
**Rationale**: The original Option A ("Rejected") assumed only generic IQ1_S quantization, which produces garbled outputs at 1.56 bpw. However, PT-BitNet (2025) demonstrates that applying BitNet's native absmean ternary quantization to pre-trained weights with calibration data achieves significantly better results (61% downstream at 70B) than generic codebook PTQ. This produces genuine BitNet ternary format that enables multiplication-free inference with TL1/TL2 kernels — unlike IQ1_S which still requires dequant-then-multiply.
|
||||
|
||||
**Implementation approach**:
|
||||
|
||||
```
Phase 0 Pipeline:
1. Load GLM-4.7-Flash FP16/BF16 weights
2. For each linear layer in expert FFNs:
   a. Compute gamma = mean(|W|) (absmean scale)
   b. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
   c. Store: 2-bit packed ternary weights + FP16 scale per block
3. Calibration pass (optional, improves quality):
   a. Run ~1000 calibration samples through teacher model
   b. Record activation statistics per layer
   c. Optimize scale factors to minimize MSE between teacher and ternary outputs
4. Export to GGUF with BITNET_T158 tensor type + metadata
5. Validate: load in BitNetBackend → TL1 kernel → generate tokens
```
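
Step 3c of the pipeline has a closed-form solution when the ternary assignment is held fixed. The sketch below is a simplification under stated assumptions: it minimizes an importance-weighted squared error in weight space rather than the teacher-vs-ternary output MSE named in step 3c, and `importance` stands in for per-input-channel activation statistics (for example, mean squared activation) recorded in step 3b. The names `optimal_scale` and `weighted_error` are hypothetical.

```rust
/// Least-squares scale for a fixed ternary assignment.
/// Minimizes sum_i importance[i] * (weights[i] - scale * ternary[i])^2 over `scale`,
/// which gives scale = sum(h * w * t) / sum(h * t^2).
fn optimal_scale(weights: &[f32], ternary: &[i8], importance: &[f32]) -> f32 {
    assert_eq!(weights.len(), ternary.len());
    assert_eq!(weights.len(), importance.len());
    let mut num = 0.0f32;
    let mut den = 0.0f32;
    for ((&w, &t), &h) in weights.iter().zip(ternary).zip(importance) {
        let t = t as f32;
        num += h * w * t;
        den += h * t * t;
    }
    // An all-zero ternary block has no meaningful scale; fall back to 0.
    if den > 0.0 { num / den } else { 0.0 }
}

/// Importance-weighted squared reconstruction error for a given scale.
fn weighted_error(weights: &[f32], ternary: &[i8], importance: &[f32], scale: f32) -> f32 {
    let mut err = 0.0f32;
    for ((&w, &t), &h) in weights.iter().zip(ternary).zip(importance) {
        let diff = w - scale * t as f32;
        err += h * diff * diff;
    }
    err
}
```

Comparing `weighted_error` at the raw absmean scale and at `optimal_scale` gives a quick check of whether calibration helps for a given block. With uniform `importance` the optimized scale is the mean absolute value of the weights that were not rounded to zero.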
|
||||
|
||||
**Absmean ternary quantizer (core algorithm)**:
|
||||
```
Input:  W ∈ R^{m×n} (FP16 weight matrix)
Output: W_t ∈ {-1,0,+1}^{m×n}, scale ∈ R (per-block FP16)

For each block of 256 elements:
  1. gamma = mean(|block|) + 1e-8
  2. normalized = block / gamma
  3. ternary = round(clamp(normalized, -1, 1)) → {-1, 0, +1}
  4. Pack: 2 bits per weight (00=-1, 01=0, 10=+1)
  5. Store scale = gamma as FP16
```
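
The algorithm above translates fairly directly into Rust. The following is a sketch under stated assumptions rather than the planned `absmean_ternary()` implementation: it works on `f32` slices, keeps the scale as `f32` because the standard library has no FP16 type, and packs four codes per byte starting from the low bits. That bit order and the helper names `pack_ternary` and `quantize_tensor` are hypothetical.

```rust
/// Weights per quantization block (256, as in the algorithm above).
const BLOCK_SIZE: usize = 256;

/// Quantize one block to ternary values in {-1, 0, +1} plus its absmean scale.
fn absmean_ternary(block: &[f32]) -> (Vec<i8>, f32) {
    assert!(!block.is_empty(), "block must contain at least one weight");
    let mean_abs = block.iter().map(|w| w.abs()).sum::<f32>() / block.len() as f32;
    let gamma = mean_abs + 1e-8; // epsilon guards against an all-zero block
    let ternary = block
        .iter()
        .map(|w| (w / gamma).clamp(-1.0, 1.0).round() as i8)
        .collect();
    (ternary, gamma)
}

/// Pack four ternary values per byte with codes 00 = -1, 01 = 0, 10 = +1.
/// Assumption: value i of each group occupies bits 2i..2i+1 (low bits first).
/// A trailing group shorter than four leaves unused slots as 00.
fn pack_ternary(ternary: &[i8]) -> Vec<u8> {
    ternary
        .chunks(4)
        .map(|chunk| {
            chunk.iter().enumerate().fold(0u8, |byte, (i, &t)| {
                byte | (((t + 1) as u8) << (2 * i))
            })
        })
        .collect()
}

/// Quantize a flat tensor block by block, returning (packed codes, scale) pairs.
fn quantize_tensor(weights: &[f32]) -> Vec<(Vec<u8>, f32)> {
    weights
        .chunks(BLOCK_SIZE)
        .map(|block| {
            let (ternary, gamma) = absmean_ternary(block);
            (pack_ternary(&ternary), gamma)
        })
        .collect()
}
```

With this layout a 256-weight block occupies 64 bytes of codes plus one scale. If the scale is stored as FP16, that is (256 × 2 + 16) / 256 = 2.0625 bits per weight on disk.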
|
||||
|
||||
**What stays FP16** (same as AD-2):
|
||||
- MoE router gating weights
|
||||
- Token embeddings + LM head
|
||||
- RoPE frequencies
|
||||
- LayerNorm/RMSNorm parameters
|
||||
|
||||
**RuvLLM implementation gaps to fill**:
|
||||
|
||||
| Gap | Effort | Details |
|-----|--------|---------|
| Absmean ternary quantizer | ~200-300 lines | New function in `gguf/quantization.rs` or new module |
| IQ1_S / BITNET_T158 dequantization | ~80-120 lines | Add to `dequantize_tensor` match arm (currently falls to error at line 358) |
| GGUF export with ternary metadata | ~100-150 lines | Extend `GgufExportResult` with BitNet metadata keys from AD-5 |
| TL1 kernel smoke test | ~200 lines | Validate ternary GEMM produces correct output on PTQ model |
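
For the dequantization gap, the core of a BITNET_T158 decoder could look like the sketch below. It is illustrative only: `dequantize_block` is a hypothetical name, not the existing `dequantize_tensor` signature, the scale is passed as `f32` rather than read from an FP16 field, and the low-bits-first slot order is an assumption that must match whatever the quantizer writes.

```rust
/// Decode packed 2-bit ternary codes (00 = -1, 01 = 0, 10 = +1) and apply the
/// per-block scale, appending four f32 weights to `out` for every input byte.
/// Assumption: slot i of a byte occupies bits 2i..2i+1 (low bits first).
fn dequantize_block(packed: &[u8], scale: f32, out: &mut Vec<f32>) -> Result<(), String> {
    for &byte in packed {
        for slot in 0..4 {
            let code = (byte >> (2 * slot)) & 0b11;
            let t = match code {
                0 => -1.0f32,
                1 => 0.0,
                2 => 1.0,
                // Code 11 is unused by the ternary encoding, so treat it as corruption.
                _ => return Err(format!("invalid ternary code {} in byte {}", code, byte)),
            };
            out.push(t * scale);
        }
    }
    Ok(())
}
```

Returning an error for the unused code `11` gives a cheap integrity check on loaded tensors. This float path is useful as a reference for the TL1 smoke test, since the multiplication-free kernel should agree with it.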
|
||||
|
||||
**Total new code**: ~600-800 lines (vs ~15,000+ for Phase 1 full distillation pipeline)
|
||||
|
||||
**Quality expectations (conservative estimates for GLM-4.7-Flash 30B-A3B)**:
|
||||
|
||||
| Benchmark | FP16 Baseline | Phase 0 PTQ (est.) | Phase 1 Distill (est.) |
|-----------|--------------|-------------------|----------------------|
| HumanEval pass@1 | ~65% | ~35-45% | ~55-60% |
| MMLU | ~75% | ~45-55% | ~65-70% |
| SWE-bench Verified | 59.2% | ~25-35% | ~50-55% |
| LiveCodeBench v6 | 64.0% | ~30-40% | ~55-60% |
|
||||
|
||||
**Why Phase 0 quality is still useful**:
|
||||
1. **Kernel validation**: Ternary GEMM correctness doesn't depend on model quality
|
||||
2. **Memory profiling**: Real-world memory usage measurement with actual MoE activation patterns
|
||||
3. **Throughput benchmarking**: Measure real tok/s with TL1/TL2/I2_S kernels on target hardware
|
||||
4. **Pipeline testing**: End-to-end GGUF load → inference → token output
|
||||
5. **Baseline measurement**: Quantitative quality floor establishes improvement target for Phase 1
|
||||
6. **Cost**: ~$100 vs ~$1,300 for Phase 1, so the infrastructure is validated before a roughly 13x larger investment
|
||||
|
||||
**Key configuration**:
|
||||
```rust
pub struct PtBitnetConfig {
    pub calibration_samples: usize,    // 1000 default (WikiText-2 or code corpus)
    pub block_size: usize,             // 256 (matches AD-1)
    pub optimize_scales: bool,         // true: MSE-optimized scales; false: raw absmean
    pub layers_to_quantize: LayerMask, // ExpertsOnly (Phase 0) or All (future)
    pub export_format: TernaryFormat,  // BitnetT158 (native) or IQ1S (llama.cpp compat)
    pub router_precision: Precision,   // FP16 (always, per AD-2)
}
```
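
The struct above refers to `LayerMask`, `TernaryFormat`, and `Precision` without defining them. The sketch below fills them in with hypothetical variants so that the configuration compiles on its own, and adds a `Default` plus a `validate` method. The defaults for `optimize_scales` and `export_format`, the variant names, and the validation rules are assumptions drawn from the comments in the struct, not from existing RuvLLM code.

```rust
/// Which layers receive ternary quantization (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LayerMask {
    ExpertsOnly,
    All,
}

/// Output tensor encoding (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TernaryFormat {
    BitnetT158,
    Iq1S,
}

/// Precision kept for non-ternary tensors (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Fp16,
    Bf16,
}

#[derive(Debug, Clone)]
pub struct PtBitnetConfig {
    pub calibration_samples: usize,
    pub block_size: usize,
    pub optimize_scales: bool,
    pub layers_to_quantize: LayerMask,
    pub export_format: TernaryFormat,
    pub router_precision: Precision,
}

impl Default for PtBitnetConfig {
    fn default() -> Self {
        Self {
            calibration_samples: 1000,                  // stated default
            block_size: 256,                            // matches AD-1
            optimize_scales: true,                      // assumed default
            layers_to_quantize: LayerMask::ExpertsOnly, // Phase 0 scope
            export_format: TernaryFormat::BitnetT158,   // assumed default
            router_precision: Precision::Fp16,          // per AD-2
        }
    }
}

impl PtBitnetConfig {
    /// Reject settings that contradict the constraints described in this ADR.
    pub fn validate(&self) -> Result<(), String> {
        if self.block_size == 0 || self.block_size % 4 != 0 {
            return Err("block_size must be a positive multiple of 4 for 2-bit packing".into());
        }
        if self.optimize_scales && self.calibration_samples == 0 {
            return Err("optimize_scales requires at least one calibration sample".into());
        }
        if self.router_precision != Precision::Fp16 {
            return Err("router must stay FP16 (AD-2)".into());
        }
        Ok(())
    }
}
```

A Phase 0 run would start from `PtBitnetConfig::default()` and override individual fields with struct update syntax, calling `validate()` before any weights are loaded.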
|
||||
|
||||
**Reused**: GGUF parser, tensor metadata, `GgufQuantType` enum, export pipeline.
|
||||
**New**: `PtBitnetQuantizer`, `absmean_ternary()`, `BITNET_T158` dequantization kernel.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
|
@ -809,12 +971,14 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
|
|||
4. **Multiplication-free expert GEMM**: Integer addition only in expert forward passes
|
||||
5. **SONA compatibility**: MicroLoRA adaptation preserves per-session learning
|
||||
6. **GGUF ecosystem**: Compatible with existing model distribution infrastructure
|
||||
7. **Incremental path**: Phase 1 delivers value quickly; Phases 2-3 improve quality
|
||||
7. **Incremental path**: Phase 0 validates at ~$100; Phase 1 delivers quality; Phases 2-3 optimize
|
||||
8. **~70% RLM code reuse**: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested — only BitLinear layer and orchestrator are net-new
|
||||
9. **Adaptive distillation**: GRPO reward scaling dynamically focuses compute on hard-to-distill experts
|
||||
10. **Cross-expert stability**: EWC++ Fisher diagonal prevents catastrophic forgetting during sequential expert distillation
|
||||
11. **Learned quantization policies**: PolicyStore persists per-layer ternary scale distributions for reproducible future distillation runs
|
||||
12. **Expert-parallel distillation**: Independent expert FFNs enable rayon-parallel distillation across CPU cores
|
||||
13. **Phase 0 de-risks Phase 1**: ~$100 PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) before committing $1,300+ to distillation
|
||||
14. **Existing GGUF ecosystem**: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
|
||||
|
||||
### Negative
|
||||
|
||||
|
|
@ -830,19 +994,32 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
|
|||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|
||||
|------|-----------|--------|------------|
|
||||
| MoE routing degrades with ternary experts | Medium | High | Phase 1 validates routing; router stays FP16; AD-12 contrastive validation |
|
||||
| bitnet.cpp kernel translation to Rust introduces bugs | Medium | Medium | Extensive kernel unit tests; validate against reference impl |
|
||||
| Phase 0 PTQ quality too low for meaningful testing | Medium | Low | Phase 0 is for kernel/pipeline validation, not quality; upgrade to 0D (BitDistill Lite) if needed |
|
||||
| MoE routing degrades with ternary experts | Medium | High | Phase 0 detects routing issues early; Phase 1 validates routing; router stays FP16; AD-12 contrastive validation |
|
||||
| bitnet.cpp kernel translation to Rust introduces bugs | Medium | Medium | Phase 0 PTQ model provides cheap test fixture; extensive kernel unit tests; validate against reference impl |
|
||||
| Distillation fails to converge for MoE | Low | High | GRPO reward scaling + per-expert distillation fallback; EWC++ stability (AD-13) |
|
||||
| GLM-4.7-Flash architecture changes break compatibility | Low | Medium | Pin to specific HF revision; architecture abstraction layer |
|
||||
| IQ1_S GGUF format insufficient for absmean metadata | Medium | Low | Register custom GGUF type; backward-compatible extension |
|
||||
| IQ1_S GGUF format insufficient for absmean metadata | Medium | Low | Register custom GGUF type (BITNET_T158); backward-compatible extension |
|
||||
| EWC++ Fisher accumulation OOM at 30B scale | Medium | Medium | Sparse Fisher (top-k diagonal entries); per-expert rather than global Fisher |
|
||||
| GRPO reward signal too noisy for distillation | Low | Low | Fall back to static KD loss; GRPO reward as optional multiplier |
|
||||
| `RealContrastiveTrainer` doesn't scale to 30B | Medium | Medium | Extract training loop; replace Candle Linear with BitLinear; keep optimizer/scheduler |
|
||||
| Calibration data bias in Phase 0 PTQ | Low | Low | Use diverse calibration corpus (WikiText + code); measure variance across calibration sets |
|
||||
|
||||
---
|
||||
|
||||
## Validation Criteria
|
||||
|
||||
### Phase 0 Exit Criteria
|
||||
- [ ] Absmean ternary quantizer produces valid {-1, 0, +1} weights from GLM-4.7-Flash FP16
|
||||
- [ ] GGUF export with BITNET_T158 tensor type loads without error in BitNetBackend
|
||||
- [ ] TL1 kernel produces non-zero, bounded output on PTQ ternary weights
|
||||
- [ ] MoE routing selects experts (not all-zero or all-same-expert degenerate routing)
|
||||
- [ ] End-to-end token generation produces coherent (if degraded) text
|
||||
- [ ] Memory usage measured and documented for real MoE activation patterns
|
||||
- [ ] Throughput measured: tok/s on target CPU (AVX2 and/or NEON)
|
||||
- [ ] Baseline quality benchmarks recorded (HumanEval, MMLU) as Phase 1 improvement target
|
||||
- [ ] Total Phase 0 cost < $200
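
The routing criterion in the checklist above can be turned into an automated check. The sketch below assumes the inference loop can log the expert index chosen for each routed token; `routing_is_degenerate` and the `max_share` threshold are hypothetical and would need tuning against the FP16 model's own routing distribution.

```rust
/// Returns true when routing looks degenerate: no selections at all, an
/// out-of-range expert id, or a single expert receiving more than `max_share`
/// of all selections.
fn routing_is_degenerate(selected: &[usize], num_experts: usize, max_share: f32) -> bool {
    if selected.is_empty() || num_experts == 0 {
        return true;
    }
    let mut counts = vec![0usize; num_experts];
    for &expert in selected {
        if expert >= num_experts {
            return true;
        }
        counts[expert] += 1;
    }
    // Share of selections that went to the most frequently chosen expert.
    let top = counts.iter().copied().max().unwrap_or(0);
    (top as f32) / (selected.len() as f32) > max_share
}
```

Running the same check on the unquantized teacher gives a reference share, which helps distinguish collapse caused by ternary experts from skew that the original router already exhibits.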
|
||||
|
||||
### Phase 1 Exit Criteria
|
||||
- [ ] BitNet backend loads GGUF with ternary expert weights
|
||||
- [ ] TL1 kernel produces bit-exact output vs reference float implementation
|
||||
|
|
@ -887,3 +1064,9 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
|
|||
13. RuvLLM Memory Distillation: `crates/ruvllm/src/reasoning_bank/distillation.rs`
|
||||
14. RuvLLM Policy Store: `crates/ruvllm/src/policy_store.rs`
|
||||
15. RuvLLM Contrastive Training: `crates/ruvllm/src/training/contrastive.rs`
|
||||
16. PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025) — https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
|
||||
17. BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025) — https://arxiv.org/html/2510.13998v1
|
||||
18. bartowski, GLM-4.7-Flash-GGUF quantizations — https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
|
||||
19. unsloth, GLM-4.7-Flash-GGUF dynamic quantizations — https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
|
||||
20. llama.cpp IQ1_S blind testing (Discussion #5962) — https://github.com/ggml-org/llama.cpp/discussions/5962
|
||||
21. STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
# Domain-Driven Design: Craftsman Ultra 30b 1bit
|
||||
|
||||
**Version:** 2.0
|
||||
**Version:** 2.1
|
||||
**Date:** 2026-02-03
|
||||
**Relates to:** ADR-017-craftsman-ultra-30b-1bit-bitnet-integration
|
||||
**Status:** Research / Pre-Implementation
|
||||
|
|
@ -75,6 +75,11 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
|
|||
| **Contrastive Router Validation** | Post-ternary-conversion check that MoE routing still selects correct experts, using triplet loss on expert embeddings. |
|
||||
| **Knowledge Distillation Loss** | `alpha * KL(teacher/T, student/T) + (1-alpha) * CE(labels, student)`. Core training objective for ternary student. |
|
||||
| **Distillation Trajectory** | Sequence of training steps for one expert, recorded as ReasoningBank `Trajectory` for quality analysis. |
|
||||
| **PT-BitNet** | Post-Training BitNet quantization: applying absmean ternary conversion to pre-trained FP16 weights with optional calibration. No training loop — just quantize and export. |
|
||||
| **Calibration Pass** | Forward pass of ~1000 samples through the teacher model to record activation statistics used to optimize ternary scale factors. |
|
||||
| **IQ1_S** | llama.cpp's 1.56 bpw importance quantization format. Codebook-based, dequant-then-multiply — NOT multiplication-free like BitNet. |
|
||||
| **BITNET_T158** | Proposed GGUF tensor type for native BitNet b1.58 ternary weights (2-bit packed + FP16 per-block absmean scale). Distinct from IQ1_S. |
|
||||
| **Phase 0 Prototype** | PT-BitNet quantized model used for inference pipeline validation and kernel testing, not production quality. |
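
The **Knowledge Distillation Loss** entry in the table above can be sketched for a single token as follows. This is a plain-Rust illustration under assumptions: logits are `f32` slices, the KL term is taken as KL(teacher ‖ student) over temperature-softened distributions, and no T² rescaling is applied because the glossary formula does not include one. `softmax` and `kd_loss` are hypothetical helper names, not RuvLLM functions.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&z| ((z - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// alpha * KL(teacher/T, student/T) + (1 - alpha) * CE(label, student)
/// for one token position.
fn kd_loss(teacher: &[f32], student: &[f32], label: usize, alpha: f32, temperature: f32) -> f32 {
    assert_eq!(teacher.len(), student.len());
    let p = softmax(teacher, temperature);
    let q = softmax(student, temperature);
    let mut kl = 0.0f32;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            kl += pi * (pi / qi).ln();
        }
    }
    // Cross-entropy against the hard label uses the untempered student distribution.
    let ce = -softmax(student, 1.0)[label].ln();
    alpha * kl + (1.0 - alpha) * ce
}
```

With `alpha = 1.0` the loss reduces to the soft-target KL term, and with `alpha = 0.0` it is ordinary cross-entropy on the labels.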
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -241,16 +246,22 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
|
|||
|
||||
### 3.4 Quantization Pipeline Context (Supporting)
|
||||
|
||||
**Responsibility**: Convert full-precision weights to ternary format during training/distillation. **Delegates training orchestration to the RLM Training Orchestration Context** (3.8), which provides GRPO rewards, EWC++ stability, and quality tracking.
|
||||
**Responsibility**: Convert full-precision weights to ternary format. Supports two modes:
|
||||
1. **Phase 0 (PTQ)**: Direct absmean ternary quantization with optional calibration — no training loop
|
||||
2. **Phase 1+ (Distillation)**: Full training pipeline with STE, shadow weights, and RLM orchestration
|
||||
|
||||
**Delegates training orchestration to the RLM Training Orchestration Context** (3.8) for Phase 1+ distillation, which provides GRPO rewards, EWC++ stability, and quality tracking.
|
||||
|
||||
**Owns:**
|
||||
- Absmean quantization implementation
|
||||
- Straight-through estimator for backpropagation
|
||||
- Shadow weight management (FP16 ↔ ternary)
|
||||
- GGUF export with ternary tensor metadata
|
||||
- Absmean quantization implementation (shared by Phase 0 and Phase 1+)
|
||||
- PT-BitNet quantizer for Phase 0 rapid prototype (no training loop)
|
||||
- Straight-through estimator for backpropagation (Phase 1+ only)
|
||||
- Shadow weight management (FP16 ↔ ternary, Phase 1+ only)
|
||||
- Calibration pass for scale factor optimization (Phase 0)
|
||||
- GGUF export with ternary tensor metadata (BITNET_T158 type)
|
||||
- Calibration dataset management
|
||||
|
||||
**Delegates to RLM Training (3.8):**
|
||||
**Delegates to RLM Training (3.8) — Phase 1+ only:**
|
||||
- Distillation loss computation with GRPO reward scaling
|
||||
- Cross-expert stability via EWC++ regularization
|
||||
- Router validation via contrastive training
|
||||
|
|
@ -258,43 +269,53 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
|
|||
- Per-layer policy persistence via PolicyStore
|
||||
|
||||
**Key Entities:**
|
||||
- `BitLinearTrainer` — BitLinear layer with shadow weights and STE (NEW)
|
||||
- `AbsmeanQuantizer` — Converts FP16 block → ternary + scale (NEW)
|
||||
- `PtBitnetQuantizer` — Phase 0: direct FP16 → ternary conversion with calibration (NEW, ~200-300 lines)
|
||||
- `AbsmeanQuantizer` — Converts FP16 block → ternary + scale (NEW, shared by Phase 0 and 1+)
|
||||
- `CalibrationRunner` — Phase 0: runs calibration samples to optimize scale factors (NEW, ~100 lines)
|
||||
- `BitLinearTrainer` — Phase 1+: BitLinear layer with shadow weights and STE (NEW)
|
||||
- `TeacherModel` — FP16 GLM-4.7-Flash reference model (NEW)
|
||||
- `CalibrationDataset` — Token sequences for quantization calibration (NEW)
|
||||
- `GrpoOptimizer` — Per-expert reward scaling (REUSED from `training/grpo.rs`)
|
||||
- `EwcRegularizer` — Cross-expert forgetting prevention (REUSED from `lora/training.rs`)
|
||||
- `GrpoOptimizer` — Per-expert reward scaling, Phase 1+ only (REUSED from `training/grpo.rs`)
|
||||
- `EwcRegularizer` — Cross-expert forgetting prevention, Phase 1+ only (REUSED from `lora/training.rs`)
|
||||
|
||||
**Invariants:**
|
||||
- Shadow weights are FP16 throughout training (never accumulated in ternary)
|
||||
- Quantization is deterministic: same FP16 input → same ternary output
|
||||
- Teacher model is frozen during distillation (no gradient updates)
|
||||
- Distillation loss = KD_base * GRPO_scale + EWC_penalty (see ADR-017 AD-11, AD-13)
|
||||
- Phase 0: No shadow weights — direct one-shot quantization
|
||||
- Phase 1+: Shadow weights are FP16 throughout training (never accumulated in ternary)
|
||||
- Phase 1+: Teacher model is frozen during distillation (no gradient updates)
|
||||
- Phase 1+: Distillation loss = KD_base * GRPO_scale + EWC_penalty (see ADR-017 AD-11, AD-13)
|
||||
|
||||
**Interfaces:**
|
||||
- **Inbound**: Teacher model weights + training dataset
|
||||
- **Outbound**: Trained ternary weights exported as GGUF
|
||||
- **Inbound**: Teacher model weights (FP16/BF16) + calibration or training dataset
|
||||
- **Outbound**: Ternary weights exported as GGUF with BITNET_T158 tensor type
|
||||
- **Downstream**: Feeds Model Lifecycle Context with final artifacts
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Quantization Pipeline Context │
|
||||
│ │
|
||||
│ ┌──────────────┐ ┌──────────────────┐ │
|
||||
│ │TeacherModel │───▶│DistillPipeline │ │
|
||||
│ │(GLM-4.7-Flash│ │(KD loss + STE) │ │
|
||||
│ └──────────────┘ └────────┬─────────┘ │
|
||||
│ │ │
|
||||
│ ┌──────────────┐ ┌───────▼──────────┐ │
|
||||
│ │AbsmeanQuant │◀───│BitLinearTrainer │ │
|
||||
│ │(FP16→ternary)│ │(shadow weights) │ │
|
||||
│ └──────┬───────┘ └──────────────────┘ │
|
||||
│ │ │
|
||||
│ ┌──────▼───────────────────────────────┐ │
|
||||
│ │ GGUFExporter │ │
|
||||
│ │ (ternary tensors + metadata) │ │
|
||||
│ └──────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────┘
|
||||
┌──────────────────────────────────────────────────────────┐
|
||||
│ Quantization Pipeline Context │
|
||||
│ │
|
||||
│ Phase 0 (PTQ): │
|
||||
│ ┌──────────────┐ ┌──────────────────┐ │
|
||||
│ │ FP16 Weights │───▶│PtBitnetQuantizer │ │
|
||||
│ │(GLM-4.7-Flash│ │(absmean + calib) │ │
|
||||
│ └──────────────┘ └────────┬─────────┘ │
|
||||
│ │ │
|
||||
│ Phase 1+ (Distillation): │ │
|
||||
│ ┌──────────────┐ ┌───────┼──────────┐ │
|
||||
│ │TeacherModel │───▶│DistillPipeline │ │
|
||||
│ │(GLM-4.7-Flash│ │(KD loss + STE) │ │
|
||||
│ └──────────────┘ └────────┬─────────┘ │
|
||||
│ │ │
|
||||
│ ┌──────────────┐ ┌───────▼──────────┐ │
|
||||
│ │AbsmeanQuant │◀───│BitLinearTrainer │ │
|
||||
│ │(FP16→ternary)│ │(shadow weights) │ │
|
||||
│ └──────┬───────┘ └──────────────────┘ │
|
||||
│ │ │
|
||||
│ ┌──────▼───────────────────────────────┐ Both paths: │
|
||||
│ │ GGUFExporter │◀──────────┘ │
|
||||
│ │ (BITNET_T158 tensors + metadata) │ │
|
||||
│ └──────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
|
@ -991,18 +1012,19 @@ Add CUDA device dispatch to `RealContrastiveTrainer` (`training/real_trainer.rs:
|
|||
|
||||
### Compatibility Matrix
|
||||
|
||||
| Existing Feature | Impact | Action |
|
||||
|-----------------|--------|--------|
|
||||
| GGUF parser | Low | Add BITNET_T158 type to `GgufQuantType` enum |
|
||||
| `InferenceBackend` trait | None | New `BitNetBackend` implements existing trait |
|
||||
| KV cache (`kv_cache.rs`) | None | Reused as-is (FP16/Q8 cache unchanged) |
|
||||
| Autodetect (`autodetect.rs`) | Low | Add ternary kernel capability flags |
|
||||
| SIMD kernels (`kernels/`) | Medium | New ternary kernels alongside existing |
|
||||
| MicroLoRA (`lora/`) | Low | Adapter applied to BitLinear output |
|
||||
| SONA (`sona/`) | None | Instant loop drives adapter feedback |
|
||||
| Claude Flow (`claude_flow/`) | Low | Add `BitNetModel` to model router |
|
||||
| NAPI bindings | Low | Expose `BitNetBackend` via existing pattern |
|
||||
| tokenizer | None | Reused (GLM-4 tokenizer, 151K vocab) |
|
||||
| Existing Feature | Impact | Phase 0 | Phase 1+ |
|
||||
|-----------------|--------|---------|----------|
|
||||
| GGUF parser | Low | Add BITNET_T158 type to `GgufQuantType` enum | Same |
|
||||
| `dequantize_tensor` | **Medium** | **Implement IQ1_S/BITNET_T158 dequant** (currently returns error at line 358) | Same |
|
||||
| `InferenceBackend` trait | None | New `BitNetBackend` implements existing trait | Same |
|
||||
| KV cache (`kv_cache.rs`) | None | Reused as-is | Reused as-is |
|
||||
| Autodetect (`autodetect.rs`) | Low | Add ternary kernel capability flags | Same |
|
||||
| SIMD kernels (`kernels/`) | **Medium** | TL1 kernel minimum viable for validation | Full TL1/TL2/I2_S suite |
|
||||
| MicroLoRA (`lora/`) | None (Phase 0) | Not needed for PTQ | Adapter applied to BitLinear output |
|
||||
| SONA (`sona/`) | None | Not needed for PTQ | Instant loop drives adapter feedback |
|
||||
| Claude Flow (`claude_flow/`) | Low | Add `BitNetModel` to model router | Same |
|
||||
| NAPI bindings | Low | Expose `BitNetBackend` via existing pattern | Same |
|
||||
| tokenizer | None | Reused (GLM-4 tokenizer, 151K vocab) | Same |
|
||||
|
||||
### Non-Breaking Changes
|
||||
|
||||
|
|
@ -1025,6 +1047,9 @@ All changes are additive. No existing backend, model, or API is modified. The `B
|
|||
| 9 | EWC++ Fisher OOM at 30B scale? | RLM Training | Open | May need sparse Fisher (top-k diagonal entries per expert) |
|
||||
| 10 | GRPO group_size = num_experts or per-layer? | RLM Training | Open | Per-layer groups provide finer reward signal but more compute |
|
||||
| 11 | Expert-parallel distillation rayon thread count? | RLM Training | Open | Balance CPU cores between rayon parallelism and ternary GEMM |
|
||||
| 12 | Phase 0 PTQ calibration corpus choice? | Phase 0 quality | Open | WikiText-2 vs code-specific corpus (e.g., The Stack) — code corpus may preserve coding ability better |
|
||||
| 13 | IQ1_S vs BITNET_T158 GGUF type for Phase 0? | GGUF compatibility | Open | IQ1_S (type 19) exists but block format may differ from absmean; custom BITNET_T158 avoids confusion but breaks llama.cpp compat |
|
||||
| 14 | Phase 0 → Phase 1 weight migration path? | Efficiency | Open | Can Phase 0 PTQ weights serve as initialization for Phase 1 distillation shadow weights? |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -1043,3 +1068,7 @@ All changes are additive. No existing backend, model, or API is modified. The `B
|
|||
- RuvLLM Memory Distillation: `crates/ruvllm/src/reasoning_bank/distillation.rs`
|
||||
- RuvLLM Policy Store: `crates/ruvllm/src/policy_store.rs`
|
||||
- RuvLLM Contrastive Training: `crates/ruvllm/src/training/contrastive.rs`
|
||||
- PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025)
|
||||
- BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025)
|
||||
- bartowski, GLM-4.7-Flash-GGUF quantizations: https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
|
||||
- llama.cpp IQ1_S blind testing: https://github.com/ggml-org/llama.cpp/discussions/5962
|
||||