docs: Add Phase 0 PTQ rapid prototype to Craftsman Ultra ADR/DDD

Research post-training quantization feasibility for GLM-4.7-Flash as a low-cost ($100, 2-4 hrs) validation step before full distillation ($1,300+).

ADR-017 changes:
- Restructured Option A from "Rejected" to tiered PTQ analysis (0A-0D)
- Added AD-18: PT-BitNet post-training quantization strategy
- Updated phased decision to A(0C) → D → C → B
- Added Phase 0 exit criteria and validation benchmarks
- Documented existing community GGUFs (bartowski, unsloth, ngxson)
- Identified RuvLLM IQ1_S dequant gap (type 19 parsed, not implemented)
- Added PT-BitNet, BitDistill, and STBLLM references

DDD v2.1 changes:
- Added 5 Phase 0 ubiquitous language terms (PT-BitNet, Calibration Pass, IQ1_S, BITNET_T158, Phase 0 Prototype)
- Updated Section 3.4 with dual-mode quantization pipeline (PTQ + distillation)
- Updated compatibility matrix with Phase 0 vs Phase 1+ columns
- Added 3 new open questions (calibration corpus, GGUF type, weight migration)

Key finding: IQ1_S ≠ BitNet b1.58. Generic codebook PTQ produces garbled output; PT-BitNet absmean ternary quantization is viable for kernel validation.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
2026-05-28 01:44:41 +00:00 · 2026-02-03 04:56:28 +00:00 · 2026-02-03 04:56:28 +00:00 · 3686dfc52f
commit 3686dfc52f
parent 5ccf99d23b
2 changed files with 280 additions and 68 deletions
--- a/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
+++ b/docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md
@@ -179,28 +179,87 @@ RuvLLM contains a mature reinforcement-learning-from-model-feedback (RLM) traini

 ## Considered Options

-### Option A: Post-Training Quantization of GLM-4.7-Flash to 1-bit
+### Option A: Post-Training Quantization of GLM-4.7-Flash (PTQ Tiers)

-Take the existing BF16 GLM-4.7-Flash weights and quantize to IQ1_S format.
+Take the existing BF16 GLM-4.7-Flash weights and quantize to low-bit formats without full distillation training.

-**Approach:**
-1. Download GLM-4.7-Flash BF16 weights from HuggingFace
-2. Apply GPTQ/AWQ-style calibration with IQ1_S target
-3. Serve via existing GGUF pipeline
+**Critical distinction — IQ1_S ≠ BitNet b1.58:**

-**Pros:**
- No training infrastructure needed
- Immediate availability
- Leverages existing GGUF IQ1_S support
+| Property | GGUF IQ1_S | BitNet b1.58 |
+|----------|-----------|--------------|
+| Encoding | Codebook-based importance quantization | Ternary {-1, 0, +1} via absmean |
+| Bits/weight | 1.56 bpw | 1.58 bpw information content (log2 3); ~2.06 bpw as stored with 2-bit packing plus a per-block FP16 scale |
+| Inference | **Dequantize → FP multiply** | **Integer addition only (no multiply)** |
+| Speed benefit | Memory bandwidth only | Bandwidth + compute (multiplication-free) |
+| How obtained | Post-training quantization | Trained from scratch or distilled |
+| Quality at 7B | Near-random / broken outputs | Matches FP16 |
+
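The "Inference" row of the table above can be illustrated with a minimal Rust sketch. `ternary_dot` and `dequant_dot` are hypothetical names, not RuvLLM APIs, and the `i32` activations are an assumption chosen to keep the accumulator free of overflow. A real TL1 kernel would work on packed weights and lookup tables; this only shows why ternary weights need no per-weight multiply.

```rust
/// Dot product with ternary weights in {-1, 0, +1} (hypothetical helper).
/// Activations are only added or subtracted; the single multiply by
/// `scale` happens once per block, not once per weight.
pub fn ternary_dot(weights: &[i8], activations: &[i32], scale: f32) -> f32 {
    assert_eq!(weights.len(), activations.len());
    let mut acc: i32 = 0;
    for (&w, &x) in weights.iter().zip(activations) {
        match w {
            1 => acc += x,
            -1 => acc -= x,
            _ => {} // a zero weight contributes nothing
        }
    }
    acc as f32 * scale
}

/// Dequantize-then-multiply path, as needed for codebook formats such as
/// IQ1_S: every weight is first expanded to a float, then multiplied.
pub fn dequant_dot(weights: &[f32], activations: &[f32]) -> f32 {
    weights.iter().zip(activations).map(|(w, x)| w * x).sum()
}
```

Both functions compute the same value when the float weights equal `scale` times the ternary codes; the difference is only in which arithmetic the inner loop performs.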
+**Existing GLM-4.7-Flash GGUF quantizations available** (community-published):
+
+| Repository | Lowest Quant | Size | Notes |
+|-----------|-------------|------|-------|
+| [bartowski/zai-org_GLM-4.7-Flash-GGUF](https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF) | IQ2_XXS (2.06 bpw) | 7.62 GB | No IQ1_S published |
+| [unsloth/GLM-4.7-Flash-GGUF](https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF) | UD-Q2_K_XL (2.7 bpw dynamic) | ~11 GB | Dynamic quant, recommended |
+| [ngxson/GLM-4.7-Flash-GGUF](https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF) | Q4_K_M (4.5 bpw) | 18.1 GB | 55 variants available |
+
+**No IQ1_S quantization** has been published for GLM-4.7-Flash by any community quantizer. That absence is itself a signal: the format is likely considered too aggressive for practical use.
+
+**Sub-options ranked by increasing effort:**
+
+**Sub-option 0A: Download existing IQ2_XXS GGUF**
+- Download bartowski's IQ2_XXS at 7.62 GB
+- Cost: $0, time: 5 minutes (just download)
+- Quality: ~75-80% of FP16 (2.06 bpw is usable per community reports)
+- NOT 1-bit, NOT BitNet — just aggressive 2-bit compression
+- RuvLLM gap: IQ2_XXS dequantization not implemented (falls to error catch-all in `quantization.rs:358`)
+- RuvLLM Q2_K dequantization IS implemented and works
+
+**Sub-option 0B: Quantize to IQ1_S via llama.cpp**
+- Run `llama-quantize GLM-4.7-Flash-F16.gguf IQ1_S` with importance matrix
+- Cost: $0, time: ~30 minutes on CPU
+- Quality: **SEVERE degradation**. Blind testing (llama.cpp Discussion #5962) found IQ1_S "broken rather than just bad" on 7B models: outputs contain garbled text despite acceptable perplexity scores. A 30B MoE may survive better due to parameter redundancy, but expert routing is highly sensitive to weight perturbation
+- RuvLLM gap: IQ1_S dequantization not implemented (`quantization.rs:358` catch-all)
+- Does NOT achieve BitNet multiplication-free inference
+
+**Sub-option 0C: PT-BitNet ternary PTQ** (per [PT-BitNet paper](https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X))
+- Apply absmean ternary quantization (BitNet's native method) to pre-trained weights with calibration data
+- Cost: ~$50-200 (small GPU calibration run, 1-4 hours on 1× A100)
+- Quality: ~55-65% downstream accuracy (PT-BitNet reports 61% on 70B; GLM-4.7-Flash's 30B-A3B may differ)
+- THIS IS proper BitNet ternary format → **enables multiplication-free inference with AD-4 kernels**
+- Requires implementing absmean ternary quantizer (~200-300 lines of new code)
+- Requires calibration dataset (WikiText-2 or similar, ~1M tokens)
+
+**Sub-option 0D: BitDistill Lite (10B tokens)** (per [BitDistill paper](https://arxiv.org/html/2510.13998v1))
+- 3-stage: SubLN insertion → 10B-token continued pre-training → KL + attention distillation
+- Cost: ~$200-500 (8× GPU hours on Mi300X/A100 class)
+- Quality: **~90-95% of FP16** (BitDistill reports 88.17% vs 88.01% FP16 on MNLI at 0.6B)
+- Near-full quality recovery with only 10B tokens (vs 200B+ for Phase 1 full distillation)
+- Requires SubLN module insertion + distillation fine-tuning loop
+- Bridges gap between pure PTQ and full expert distillation (Phase 1)
+
+**Summary comparison:**
+
+| Sub-option | Cost | Time | Quality (est.) | BitNet Speedup | RuvLLM Ready |
+|-----------|------|------|---------------|----------------|-------------|
+| 0A: IQ2_XXS download | $0 | 5 min | ~75-80% | No | No (missing dequant) |
+| 0B: IQ1_S quantize | $0 | 30 min | ~40-50% | No | No (missing dequant) |
+| 0C: PT-BitNet PTQ | ~$100 | 2-4 hrs | ~55-65% | **Yes** | Needs quantizer impl |
+| 0D: BitDistill Lite | ~$300 | 1-2 days | ~90-95% | **Yes** | Needs SubLN + KD loop |
+
+**Pros (of PTQ approach generally):**
+- Immediate or near-immediate results ($0-$300, minutes to days)
+- No large-scale training infrastructure
+- Validates inference pipeline and kernels before investing in full distillation
+- Sub-option 0C produces genuine BitNet ternary format for kernel development

 **Cons:**
- **Severe quality degradation** — post-training 1-bit quantization loses 30-50% quality
- BitNet research explicitly states native training is required for quality parity
- MoE routing scores collapse under extreme quantization
- Does not achieve BitNet's multiplication-free inference (still uses dequant-then-multiply)
- No ternary lookup table optimization possible
+- Sub-options 0A/0B: Quality too degraded for production coding tasks
+- Sub-options 0A/0B: No BitNet multiplication-free inference (still dequant-then-multiply)
+- Sub-option 0C: Significant quality loss (~35-45%) vs teacher — adequate for kernel validation, not production
+- Sub-option 0D: Requires non-trivial training code (SubLN, KD loss) but much less than full Phase 1
+- IQ1_S blind test results: statistically indistinguishable from random on smaller models

-**Verdict: Rejected** — Quality loss makes this unsuitable for production coding tasks.
+**Verdict: Recommended as Phase 0 rapid prototype** — Sub-option 0C (PT-BitNet PTQ) is the optimal entry point: $100, 2-4 hours, produces genuine BitNet ternary format for kernel development and inference validation. Sub-option 0D (BitDistill Lite) bridges to Phase 1 if higher quality is needed before committing to full expert distillation. Sub-options 0A/0B are useful only as baselines for comparison.

 ### Option B: Native BitNet Training of GLM-4.7-Flash Architecture (Full)

@@ -303,22 +362,40 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine

 ## Decision

-**Phased approach: D → C → B**
+**Phased approach: A(0C) → D → C → B**
+
+### Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)
+- **Timeline**: 1-2 weeks
+- **Cost**: ~$100 (calibration on 1× A100 spot, 2-4 hours)
+- **Goal**: Produce a genuine BitNet ternary GGUF of GLM-4.7-Flash for kernel development, inference pipeline validation, and baseline quality measurement
+- **Deliverables**:
+  - PT-BitNet ternary quantized GLM-4.7-Flash GGUF file (~6-7 GB)
+  - Absmean ternary quantizer implementation (~200-300 lines)
+  - IQ1_S / BITNET_T158 dequantization kernel in RuvLLM
+  - Baseline quality benchmarks (HumanEval, MMLU) to compare against Phase 1+
+  - Functional TL1 kernel validated against ternary model
+- **Expected quality**: ~55-65% of GLM-4.7-Flash (adequate for kernel validation, not production)
+- **Key value**: De-risks Phase 1 by validating the entire inference pipeline (GGUF loading → ternary dequant → TL1 kernel → MoE routing → token generation) for ~$100 before committing to $1,300+ distillation training
+- **Optional upgrade (0D)**: If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300, 1-2 days) to reach ~90-95% quality

 ### Phase 1: BitNet Expert Replacement (Option D)
 - **Timeline**: 3-4 months
- **Goal**: Validate MoE + BitNet integration, build inference kernels
+- **Cost**: ~$1,300-$2,000 (4× A100 spot, ~46 days)
+- **Goal**: Full-quality ternary experts via distillation, validated against Phase 0 baseline
 - **Deliverables**: Working Craftsman Ultra 30b 1bit (mixed: ternary experts, FP16 attention)
 - **Expected quality**: ~90-95% of GLM-4.7-Flash on coding benchmarks
+- **Prerequisites**: Phase 0 validates inference pipeline works end-to-end

 ### Phase 2: Full BitNet Distillation (Option C)
 - **Timeline**: 4-6 months after Phase 1
+- **Cost**: ~$2,500-$5,000 (4× H100, 16-32 days)
 - **Goal**: Full ternary model with complete BitNet inference optimization
 - **Deliverables**: Craftsman Ultra 30b 1bit v2 (full ternary except router/embed/head)
 - **Expected quality**: ~95-98% of GLM-4.7-Flash

 ### Phase 3: Native BitNet Training (Option B)
 - **Timeline**: 6-12 months after Phase 2, contingent on funding/compute
+- **Cost**: ~$15,000-$30,000 (8× H100 cluster, 90-180 days)
 - **Goal**: Surpass GLM-4.7-Flash quality with native ternary training
 - **Deliverables**: Craftsman Ultra 30b 1bit v3 (trained from scratch)
 - **Expected quality**: 100%+ of GLM-4.7-Flash (BitNet at scale exceeds FP16)
@@ -797,6 +874,91 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
 4. **Mixed precision training**: FP16 shadow weights + BF16 activations reduces memory, enabling smaller instances
 5. **Gradient checkpointing**: Trade compute for memory to fit on fewer GPUs

+### AD-18: Phase 0 — PT-BitNet Post-Training Quantization Strategy
+
+**Decision**: Implement a PT-BitNet ternary post-training quantizer as Phase 0, producing a rapid prototype GGUF for inference pipeline validation before investing in full distillation.
+
+**Rationale**: The original Option A ("Rejected") assumed only generic IQ1_S quantization, which produces garbled outputs at 1.56 bpw. However, PT-BitNet (2025) demonstrates that applying BitNet's native absmean ternary quantization to pre-trained weights with calibration data achieves significantly better results (61% downstream at 70B) than generic codebook PTQ. This produces genuine BitNet ternary format that enables multiplication-free inference with TL1/TL2 kernels — unlike IQ1_S which still requires dequant-then-multiply.
+
+**Implementation approach**:
+
+```
+Phase 0 Pipeline:
+  1. Load GLM-4.7-Flash FP16/BF16 weights
+  2. For each linear layer in expert FFNs:
+     a. Compute gamma = mean(|W|)  (absmean scale)
+     b. W_ternary = RoundClip(W / (gamma + epsilon), -1, 1)
+     c. Store: 2-bit packed ternary weights + FP16 scale per block
+  3. Calibration pass (optional, improves quality):
+     a. Run ~1000 calibration samples through teacher model
+     b. Record activation statistics per layer
+     c. Optimize scale factors to minimize MSE between teacher and ternary outputs
+  4. Export to GGUF with BITNET_T158 tensor type + metadata
+  5. Validate: load in BitNetBackend → TL1 kernel → generate tokens
+```
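Step 3c of the pipeline above does not say how the scale factors are optimized. As one assumed starting point: if the ternary codes are held fixed, the scale `s` that minimizes the squared error between a weight block and `s * codes` has the closed form `s = Σ(w·t) / Σ(t²)`. This is a weight-space proxy, simpler than the activation-space MSE that step 3c names, and `least_squares_scale` is a hypothetical helper rather than anything taken from PT-BitNet or RuvLLM.

```rust
/// Least-squares scale for fixed ternary codes (hypothetical helper).
/// Minimizes sum((w_i - s * t_i)^2) over s, giving s = sum(w*t) / sum(t*t).
/// Returns None when every code is zero, since any scale fits equally well.
pub fn least_squares_scale(block: &[f32], codes: &[i8]) -> Option<f32> {
    assert_eq!(block.len(), codes.len());
    let mut num = 0.0f32;
    let mut den = 0.0f32;
    for (&w, &t) in block.iter().zip(codes) {
        let t = t as f32;
        num += w * t;
        den += t * t;
    }
    if den > 0.0 {
        Some(num / den)
    } else {
        None
    }
}
```

For the block `[0.9, -0.05, -1.2, 0.3]` with codes `[1, 0, -1, 0]`, the raw absmean scale is 0.6125, while the least-squares scale is 1.05, which reconstructs the two large weights more closely.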
+
+**Absmean ternary quantizer (core algorithm)**:
+```
+Input:  W ∈ R^{m×n} (FP16 weight matrix)
+Output: W_t ∈ {-1,0,+1}^{m×n}, scale ∈ R (per-block FP16)
+
+For each block of 256 elements:
+  1. gamma = mean(|block|) + 1e-8   (accumulate in FP32; 1e-8 underflows to 0 in FP16)
+  2. normalized = block / gamma
+  3. ternary = round(clamp(normalized, -1, 1))  → {-1, 0, +1}
+  4. Pack: 2 bits per weight (00=-1, 01=0, 10=+1)
+  5. Store scale = gamma as FP16
+```
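The per-block algorithm above can be sketched in Rust as follows. This is a minimal illustration, not the planned `PtBitnetQuantizer`: the function names, the use of `f32` inputs, and the lowest-bits-first packing order are all assumptions. The 2-bit code mapping (00 = -1, 01 = 0, 10 = +1) is the one given in step 4.

```rust
/// Quantize one block to ternary codes plus its absmean scale.
/// Returns (codes in {-1, 0, +1}, gamma).
pub fn absmean_ternary(block: &[f32]) -> (Vec<i8>, f32) {
    // Accumulate in f32: the 1e-8 guard would underflow to zero in FP16.
    let mean_abs = block.iter().map(|w| w.abs()).sum::<f32>() / block.len().max(1) as f32;
    let gamma = mean_abs + 1e-8;
    let codes = block
        .iter()
        .map(|w| (w / gamma).clamp(-1.0, 1.0).round() as i8)
        .collect();
    (codes, gamma)
}

/// Pack four ternary codes per byte as 2-bit fields (stored value = code + 1).
/// Assumption: the first code occupies the lowest two bits of each byte.
pub fn pack_ternary(codes: &[i8]) -> Vec<u8> {
    codes
        .chunks(4)
        .map(|chunk| {
            (0..4).fold(0u8, |byte, i| {
                // Pad a short final chunk with 0 (bits 01), not with -1 (bits 00).
                let c = chunk.get(i).copied().unwrap_or(0);
                byte | (((c + 1) as u8) << (2 * i))
            })
        })
        .collect()
}
```

Padding with the zero code matters because an all-zero bit pattern decodes to -1 under this mapping, so naive zero-filled padding would inject spurious weights.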
+
+**What stays FP16** (same as AD-2):
+- MoE router gating weights
+- Token embeddings + LM head
+- RoPE frequencies
+- LayerNorm/RMSNorm parameters
+
+**RuvLLM implementation gaps to fill**:
+
+| Gap | Effort | Details |
+|-----|--------|---------|
+| Absmean ternary quantizer | ~200-300 lines | New function in `gguf/quantization.rs` or new module |
+| IQ1_S / BITNET_T158 dequantization | ~80-120 lines | Add to `dequantize_tensor` match arm (currently falls to error at line 358) |
+| GGUF export with ternary metadata | ~100-150 lines | Extend `GgufExportResult` with BitNet metadata keys from AD-5 |
+| TL1 kernel smoke test | ~200 lines | Validate ternary GEMM produces correct output on PTQ model |
+
+**Total new code**: ~600-800 lines (vs ~15,000+ for Phase 1 full distillation pipeline)
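To give a sense of scale for the dequantization gap in the table above, here is a minimal sketch of unpacking one ternary block. `dequantize_ternary_block` is a hypothetical function, not the existing `dequantize_tensor` API; it assumes the 2-bit mapping from AD-18 (00 = -1, 01 = 0, 10 = +1), four codes per byte with the first code in the lowest bits, and a scale already converted from FP16 to `f32`.

```rust
/// Expand `len` packed 2-bit ternary codes into floats scaled by `scale`.
/// Panics if `packed` holds fewer than ceil(len / 4) bytes.
pub fn dequantize_ternary_block(packed: &[u8], scale: f32, len: usize) -> Vec<f32> {
    (0..len)
        .map(|i| {
            let bits = (packed[i / 4] >> (2 * (i % 4))) & 0b11;
            // 00 -> -1, 01 -> 0, 10 -> +1; the unused pattern 11 is treated as 0 here.
            let code = match bits {
                0 => -1.0,
                2 => 1.0,
                _ => 0.0,
            };
            code * scale
        })
        .collect()
}
```

This float path is only useful for validation against a reference implementation; the point of the ternary format is that the production kernel should not need it.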
+
+**Quality expectations (conservative estimates for GLM-4.7-Flash 30B-A3B)**:
+
+| Benchmark | FP16 Baseline | Phase 0 PTQ (est.) | Phase 1 Distill (est.) |
+|-----------|--------------|-------------------|----------------------|
+| HumanEval pass@1 | ~65% | ~35-45% | ~55-60% |
+| MMLU | ~75% | ~45-55% | ~65-70% |
+| SWE-bench Verified | 59.2% | ~25-35% | ~50-55% |
+| LiveCodeBench v6 | 64.0% | ~30-40% | ~55-60% |
+
+**Why Phase 0 quality is still useful**:
+1. **Kernel validation**: Ternary GEMM correctness doesn't depend on model quality
+2. **Memory profiling**: Real-world memory usage measurement with actual MoE activation patterns
+3. **Throughput benchmarking**: Measure real tok/s with TL1/TL2/I2_S kernels on target hardware
+4. **Pipeline testing**: End-to-end GGUF load → inference → token output
+5. **Baseline measurement**: Quantitative quality floor establishes improvement target for Phase 1
+6. **Cost**: ~$100 vs ~$1,300+ for Phase 1, so the infrastructure is validated before a roughly 13x larger investment
+
+**Key configuration**:
+```rust
+pub struct PtBitnetConfig {
+    pub calibration_samples: usize,     // 1000 default (WikiText-2 or code corpus)
+    pub block_size: usize,              // 256 (matches AD-1)
+    pub optimize_scales: bool,          // true: MSE-optimized scales; false: raw absmean
+    pub layers_to_quantize: LayerMask,  // ExpertsOnly (Phase 0) or All (future)
+    pub export_format: TernaryFormat,   // BitnetT158 (native) or IQ1S (llama.cpp compat)
+    pub router_precision: Precision,    // FP16 (always, per AD-2)
+}
+```
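A possible way to wire up defaults and basic validation for this struct is sketched below. The enum definitions are assumptions reconstructed from the field comments; the `optimize_scales: true` default and the multiple-of-4 check (four 2-bit codes per byte) are illustrative choices, not decisions recorded in this ADR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LayerMask { ExpertsOnly, All }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TernaryFormat { BitnetT158, IQ1S }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision { FP16 }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PtBitnetConfig {
    pub calibration_samples: usize,
    pub block_size: usize,
    pub optimize_scales: bool,
    pub layers_to_quantize: LayerMask,
    pub export_format: TernaryFormat,
    pub router_precision: Precision,
}

impl Default for PtBitnetConfig {
    fn default() -> Self {
        Self {
            calibration_samples: 1000,
            block_size: 256,
            optimize_scales: true, // assumed default; the ADR does not state one
            layers_to_quantize: LayerMask::ExpertsOnly,
            export_format: TernaryFormat::BitnetT158,
            router_precision: Precision::FP16,
        }
    }
}

impl PtBitnetConfig {
    /// Reject configurations the packing or calibration steps cannot honor.
    pub fn validate(&self) -> Result<(), String> {
        if self.block_size == 0 || self.block_size % 4 != 0 {
            return Err(format!(
                "block_size {} must be a positive multiple of 4",
                self.block_size
            ));
        }
        if self.optimize_scales && self.calibration_samples == 0 {
            return Err("optimize_scales requires calibration_samples > 0".to_string());
        }
        Ok(())
    }
}
```

Keeping `router_precision` as a single-variant enum makes the AD-2 constraint explicit in the type while leaving room for a later variant.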
+
+**Reused**: GGUF parser, tensor metadata, `GgufQuantType` enum, export pipeline.
+**New**: `PtBitnetQuantizer`, `absmean_ternary()`, `BITNET_T158` dequantization kernel.
+
 ---

 ## Consequences
@@ -809,12 +971,14 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
 4. **Multiplication-free expert GEMM**: Integer addition only in expert forward passes
 5. **SONA compatibility**: MicroLoRA adaptation preserves per-session learning
 6. **GGUF ecosystem**: Compatible with existing model distribution infrastructure
-7. **Incremental path**: Phase 1 delivers value quickly; Phases 2-3 improve quality
+7. **Incremental path**: Phase 0 validates at ~$100; Phase 1 delivers quality; Phases 2-3 optimize
 8. **~70% RLM code reuse**: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested — only BitLinear layer and orchestrator are net-new
 9. **Adaptive distillation**: GRPO reward scaling dynamically focuses compute on hard-to-distill experts
 10. **Cross-expert stability**: EWC++ Fisher diagonal prevents catastrophic forgetting during sequential expert distillation
 11. **Learned quantization policies**: PolicyStore persists per-layer ternary scale distributions for reproducible future distillation runs
 12. **Expert-parallel distillation**: Independent expert FFNs enable rayon-parallel distillation across CPU cores
+13. **Phase 0 de-risks Phase 1**: ~$100 PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) before committing $1,300+ to distillation
+14. **Existing GGUF ecosystem**: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines

 ### Negative

@@ -830,19 +994,32 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_

 | Risk | Likelihood | Impact | Mitigation |
 |------|-----------|--------|------------|
-| MoE routing degrades with ternary experts | Medium | High | Phase 1 validates routing; router stays FP16; AD-12 contrastive validation |
-| bitnet.cpp kernel translation to Rust introduces bugs | Medium | Medium | Extensive kernel unit tests; validate against reference impl |
+| Phase 0 PTQ quality too low for meaningful testing | Medium | Low | Phase 0 is for kernel/pipeline validation, not quality; upgrade to 0D (BitDistill Lite) if needed |
+| MoE routing degrades with ternary experts | Medium | High | Phase 0 detects routing issues early; Phase 1 validates routing; router stays FP16; AD-12 contrastive validation |
+| bitnet.cpp kernel translation to Rust introduces bugs | Medium | Medium | Phase 0 PTQ model provides cheap test fixture; extensive kernel unit tests; validate against reference impl |
 | Distillation fails to converge for MoE | Low | High | GRPO reward scaling + per-expert distillation fallback; EWC++ stability (AD-13) |
 | GLM-4.7-Flash architecture changes break compatibility | Low | Medium | Pin to specific HF revision; architecture abstraction layer |
-| IQ1_S GGUF format insufficient for absmean metadata | Medium | Low | Register custom GGUF type; backward-compatible extension |
+| IQ1_S GGUF format insufficient for absmean metadata | Medium | Low | Register custom GGUF type (BITNET_T158); backward-compatible extension |
 | EWC++ Fisher accumulation OOM at 30B scale | Medium | Medium | Sparse Fisher (top-k diagonal entries); per-expert rather than global Fisher |
 | GRPO reward signal too noisy for distillation | Low | Low | Fall back to static KD loss; GRPO reward as optional multiplier |
 | `RealContrastiveTrainer` doesn't scale to 30B | Medium | Medium | Extract training loop; replace Candle Linear with BitLinear; keep optimizer/scheduler |
+| Calibration data bias in Phase 0 PTQ | Low | Low | Use diverse calibration corpus (WikiText + code); measure variance across calibration sets |

 ---

 ## Validation Criteria

+### Phase 0 Exit Criteria
+- [ ] Absmean ternary quantizer produces valid {-1, 0, +1} weights from GLM-4.7-Flash FP16
+- [ ] GGUF export with BITNET_T158 tensor type loads without error in BitNetBackend
+- [ ] TL1 kernel produces non-zero, bounded output on PTQ ternary weights
+- [ ] MoE routing selects experts (not all-zero or all-same-expert degenerate routing)
+- [ ] End-to-end token generation produces coherent (if degraded) text
+- [ ] Memory usage measured and documented for real MoE activation patterns
+- [ ] Throughput measured: tok/s on target CPU (AVX2 and/or NEON)
+- [ ] Baseline quality benchmarks recorded (HumanEval, MMLU) as Phase 1 improvement target
+- [ ] Total Phase 0 cost < $200
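Two of the criteria above ("non-zero, bounded output" and non-degenerate routing) lend themselves to automated smoke checks. The following is a sketch under assumptions: both function names are hypothetical, the bound is caller-supplied because the ADR does not define one, and "degenerate" is interpreted as at most one expert ever being selected.

```rust
/// True when every value is finite and within `bound`, and at least one
/// value is non-zero. NaN and infinities fail the check.
pub fn output_is_nonzero_and_bounded(values: &[f32], bound: f32) -> bool {
    values.iter().all(|v| v.is_finite() && v.abs() <= bound)
        && values.iter().any(|v| *v != 0.0)
}

/// True when routing never fired (all counts zero) or always picked the
/// same single expert. `expert_counts[i]` is how often expert i was selected.
pub fn routing_is_degenerate(expert_counts: &[u64]) -> bool {
    let active = expert_counts.iter().filter(|&&c| c > 0).count();
    active <= 1
}
```

A stricter check could compare the selection histogram against the teacher's, but that requires running the FP16 model and is outside a cheap smoke test.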
+
 ### Phase 1 Exit Criteria
 - [ ] BitNet backend loads GGUF with ternary expert weights
 - [ ] TL1 kernel produces bit-exact output vs reference float implementation
@@ -887,3 +1064,9 @@ This is a single-line addition to `RealTrainingConfig` (`use_cuda: bool`, `cuda_
 13. RuvLLM Memory Distillation: `crates/ruvllm/src/reasoning_bank/distillation.rs`
 14. RuvLLM Policy Store: `crates/ruvllm/src/policy_store.rs`
 15. RuvLLM Contrastive Training: `crates/ruvllm/src/training/contrastive.rs`
+16. PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025) — https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
+17. BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025) — https://arxiv.org/html/2510.13998v1
+18. bartowski, GLM-4.7-Flash-GGUF quantizations — https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
+19. unsloth, GLM-4.7-Flash-GGUF dynamic quantizations — https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
+20. llama.cpp IQ1_S blind testing (Discussion #5962) — https://github.com/ggml-org/llama.cpp/discussions/5962
+21. STBLLM: "Breaking the 1-bit Barrier" (ICLR 2025) — https://proceedings.iclr.cc/paper_files/paper/2025/file/ff997469ac66cf893c4183efeb22212a-Paper-Conference.pdf
--- a/docs/research/craftsman-ultra-30b-1bit-ddd.md
+++ b/docs/research/craftsman-ultra-30b-1bit-ddd.md
@@ -1,6 +1,6 @@
 # Domain-Driven Design: Craftsman Ultra 30b 1bit

-**Version:** 2.0
+**Version:** 2.1
 **Date:** 2026-02-03
 **Relates to:** ADR-017-craftsman-ultra-30b-1bit-bitnet-integration
 **Status:** Research / Pre-Implementation
@@ -75,6 +75,11 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
 | **Contrastive Router Validation** | Post-ternary-conversion check that MoE routing still selects correct experts, using triplet loss on expert embeddings. |
 | **Knowledge Distillation Loss** | `alpha * KL(teacher/T, student/T) + (1-alpha) * CE(labels, student)`. Core training objective for ternary student. |
 | **Distillation Trajectory** | Sequence of training steps for one expert, recorded as ReasoningBank `Trajectory` for quality analysis. |
+| **PT-BitNet** | Post-Training BitNet quantization: applying absmean ternary conversion to pre-trained FP16 weights with optional calibration. No training loop — just quantize and export. |
+| **Calibration Pass** | Forward pass of ~1000 samples through the teacher model to record activation statistics used to optimize ternary scale factors. |
+| **IQ1_S** | llama.cpp's 1.56 bpw importance quantization format. Codebook-based, dequant-then-multiply — NOT multiplication-free like BitNet. |
+| **BITNET_T158** | Proposed GGUF tensor type for native BitNet b1.58 ternary weights (2-bit packed + FP16 per-block absmean scale). Distinct from IQ1_S. |
+| **Phase 0 Prototype** | PT-BitNet quantized model used for inference pipeline validation and kernel testing, not production quality. |

 ---

@@ -241,16 +246,22 @@ The following terms have precise meaning within the Craftsman Ultra domain. All

 ### 3.4 Quantization Pipeline Context (Supporting)

-**Responsibility**: Convert full-precision weights to ternary format during training/distillation. **Delegates training orchestration to the RLM Training Orchestration Context** (3.8), which provides GRPO rewards, EWC++ stability, and quality tracking.
+**Responsibility**: Convert full-precision weights to ternary format. Supports two modes:
+1. **Phase 0 (PTQ)**: Direct absmean ternary quantization with optional calibration — no training loop
+2. **Phase 1+ (Distillation)**: Full training pipeline with STE, shadow weights, and RLM orchestration
+
+**Delegates training orchestration to the RLM Training Orchestration Context** (3.8) for Phase 1+ distillation, which provides GRPO rewards, EWC++ stability, and quality tracking.

 **Owns:**
- Absmean quantization implementation
- Straight-through estimator for backpropagation
- Shadow weight management (FP16 ↔ ternary)
- GGUF export with ternary tensor metadata
+- Absmean quantization implementation (shared by Phase 0 and Phase 1+)
+- PT-BitNet quantizer for Phase 0 rapid prototype (no training loop)
+- Straight-through estimator for backpropagation (Phase 1+ only)
+- Shadow weight management (FP16 ↔ ternary, Phase 1+ only)
+- Calibration pass for scale factor optimization (Phase 0)
+- GGUF export with ternary tensor metadata (BITNET_T158 type)
 - Calibration dataset management

-**Delegates to RLM Training (3.8):**
+**Delegates to RLM Training (3.8) — Phase 1+ only:**
 - Distillation loss computation with GRPO reward scaling
 - Cross-expert stability via EWC++ regularization
 - Router validation via contrastive training
@@ -258,43 +269,53 @@ The following terms have precise meaning within the Craftsman Ultra domain. All
 - Per-layer policy persistence via PolicyStore

 **Key Entities:**
- `BitLinearTrainer` — BitLinear layer with shadow weights and STE (NEW)
- `AbsmeanQuantizer` — Converts FP16 block → ternary + scale (NEW)
+- `PtBitnetQuantizer` — Phase 0: direct FP16 → ternary conversion with calibration (NEW, ~200-300 lines)
+- `AbsmeanQuantizer` — Converts FP16 block → ternary + scale (NEW, shared by Phase 0 and 1+)
+- `CalibrationRunner` — Phase 0: runs calibration samples to optimize scale factors (NEW, ~100 lines)
+- `BitLinearTrainer` — Phase 1+: BitLinear layer with shadow weights and STE (NEW)
 - `TeacherModel` — FP16 GLM-4.7-Flash reference model (NEW)
 - `CalibrationDataset` — Token sequences for quantization calibration (NEW)
- `GrpoOptimizer` — Per-expert reward scaling (REUSED from `training/grpo.rs`)
- `EwcRegularizer` — Cross-expert forgetting prevention (REUSED from `lora/training.rs`)
+- `GrpoOptimizer` — Per-expert reward scaling, Phase 1+ only (REUSED from `training/grpo.rs`)
+- `EwcRegularizer` — Cross-expert forgetting prevention, Phase 1+ only (REUSED from `lora/training.rs`)

 **Invariants:**
- Shadow weights are FP16 throughout training (never accumulated in ternary)
 - Quantization is deterministic: same FP16 input → same ternary output
- Teacher model is frozen during distillation (no gradient updates)
- Distillation loss = KD_base * GRPO_scale + EWC_penalty (see ADR-017 AD-11, AD-13)
+- Phase 0: No shadow weights — direct one-shot quantization
+- Phase 1+: Shadow weights are FP16 throughout training (never accumulated in ternary)
+- Phase 1+: Teacher model is frozen during distillation (no gradient updates)
+- Phase 1+: Distillation loss = KD_base * GRPO_scale + EWC_penalty (see ADR-017 AD-11, AD-13)

 **Interfaces:**
- **Inbound**: Teacher model weights + training dataset
- **Outbound**: Trained ternary weights exported as GGUF
+- **Inbound**: Teacher model weights (FP16/BF16) + calibration or training dataset
+- **Outbound**: Ternary weights exported as GGUF with BITNET_T158 tensor type
 - **Downstream**: Feeds Model Lifecycle Context with final artifacts

 ```
-┌─────────────────────────────────────────────┐
-│      Quantization Pipeline Context          │
-│                                             │
-│  ┌──────────────┐    ┌──────────────────┐   │
-│  │TeacherModel  │───▶│DistillPipeline   │   │
-│  │(GLM-4.7-Flash│    │(KD loss + STE)   │   │
-│  └──────────────┘    └────────┬─────────┘   │
-│                               │              │
-│  ┌──────────────┐    ┌───────▼──────────┐   │
-│  │AbsmeanQuant  │◀───│BitLinearTrainer  │   │
-│  │(FP16→ternary)│    │(shadow weights)  │   │
-│  └──────┬───────┘    └──────────────────┘   │
-│         │                                    │
-│  ┌──────▼───────────────────────────────┐   │
-│  │         GGUFExporter                 │   │
-│  │  (ternary tensors + metadata)        │   │
-│  └──────────────────────────────────────┘   │
-└─────────────────────────────────────────────┘
+┌──────────────────────────────────────────────────────────┐
+│           Quantization Pipeline Context                  │
+│                                                          │
+│  Phase 0 (PTQ):                                          │
+│  ┌──────────────┐    ┌──────────────────┐                │
+│  │ FP16 Weights │───▶│PtBitnetQuantizer │──┐             │
+│  │(GLM-4.7-Flash│    │(absmean + calib) │  │             │
+│  └──────────────┘    └──────────────────┘  │             │
+│                                            │             │
+│  Phase 1+ (Distillation):                  │             │
+│  ┌──────────────┐    ┌──────────────────┐  │             │
+│  │TeacherModel  │───▶│DistillPipeline   │  │             │
+│  │(GLM-4.7-Flash│    │(KD loss + STE)   │  │             │
+│  └──────────────┘    └────────┬─────────┘  │             │
+│                               │            │             │
+│  ┌──────────────┐    ┌────────▼─────────┐  │             │
+│  │AbsmeanQuant  │◀───│BitLinearTrainer  │  │             │
+│  │(FP16→ternary)│    │(shadow weights)  │  │             │
+│  └──────┬───────┘    └──────────────────┘  │             │
+│         │                                  │             │
+│  ┌──────▼───────────────────────────────┐  │             │
+│  │         GGUFExporter                 │◀─┘ Both paths  │
+│  │  (BITNET_T158 tensors + metadata)    │                │
+│  └──────────────────────────────────────┘                │
+└──────────────────────────────────────────────────────────┘
 ```

 ---
@@ -991,18 +1012,19 @@ Add CUDA device dispatch to `RealContrastiveTrainer` (`training/real_trainer.rs:

 ### Compatibility Matrix

-| Existing Feature | Impact | Action |
-|-----------------|--------|--------|
-| GGUF parser | Low | Add BITNET_T158 type to `GgufQuantType` enum |
-| `InferenceBackend` trait | None | New `BitNetBackend` implements existing trait |
-| KV cache (`kv_cache.rs`) | None | Reused as-is (FP16/Q8 cache unchanged) |
-| Autodetect (`autodetect.rs`) | Low | Add ternary kernel capability flags |
-| SIMD kernels (`kernels/`) | Medium | New ternary kernels alongside existing |
-| MicroLoRA (`lora/`) | Low | Adapter applied to BitLinear output |
-| SONA (`sona/`) | None | Instant loop drives adapter feedback |
-| Claude Flow (`claude_flow/`) | Low | Add `BitNetModel` to model router |
-| NAPI bindings | Low | Expose `BitNetBackend` via existing pattern |
-| tokenizer | None | Reused (GLM-4 tokenizer, 151K vocab) |
+| Existing Feature | Impact | Phase 0 | Phase 1+ |
+|-----------------|--------|---------|----------|
+| GGUF parser | Low | Add BITNET_T158 type to `GgufQuantType` enum | Same |
+| `dequantize_tensor` | **Medium** | **Implement IQ1_S/BITNET_T158 dequant** (currently returns error at line 358) | Same |
+| `InferenceBackend` trait | None | New `BitNetBackend` implements existing trait | Same |
+| KV cache (`kv_cache.rs`) | None | Reused as-is | Reused as-is |
+| Autodetect (`autodetect.rs`) | Low | Add ternary kernel capability flags | Same |
+| SIMD kernels (`kernels/`) | **Medium** | TL1 kernel minimum viable for validation | Full TL1/TL2/I2_S suite |
+| MicroLoRA (`lora/`) | None (Phase 0), Low (Phase 1+) | Not needed for PTQ | Adapter applied to BitLinear output |
+| SONA (`sona/`) | None | Not needed for PTQ | Instant loop drives adapter feedback |
+| Claude Flow (`claude_flow/`) | Low | Add `BitNetModel` to model router | Same |
+| NAPI bindings | Low | Expose `BitNetBackend` via existing pattern | Same |
+| tokenizer | None | Reused (GLM-4 tokenizer, 151K vocab) | Same |

 ### Non-Breaking Changes

@@ -1025,6 +1047,9 @@ All changes are additive. No existing backend, model, or API is modified. The `B
 | 9 | EWC++ Fisher OOM at 30B scale? | RLM Training | Open | May need sparse Fisher (top-k diagonal entries per expert) |
 | 10 | GRPO group_size = num_experts or per-layer? | RLM Training | Open | Per-layer groups provide finer reward signal but more compute |
 | 11 | Expert-parallel distillation rayon thread count? | RLM Training | Open | Balance CPU cores between rayon parallelism and ternary GEMM |
+| 12 | Phase 0 PTQ calibration corpus choice? | Phase 0 quality | Open | WikiText-2 vs code-specific corpus (e.g., The Stack) — code corpus may preserve coding ability better |
+| 13 | IQ1_S vs BITNET_T158 GGUF type for Phase 0? | GGUF compatibility | Open | IQ1_S (type 19) exists but block format may differ from absmean; custom BITNET_T158 avoids confusion but breaks llama.cpp compat |
+| 14 | Phase 0 → Phase 1 weight migration path? | Efficiency | Open | Can Phase 0 PTQ weights serve as initialization for Phase 1 distillation shadow weights? |

 ---

@@ -1043,3 +1068,7 @@ All changes are additive. No existing backend, model, or API is modified. The `B
 - RuvLLM Memory Distillation: `crates/ruvllm/src/reasoning_bank/distillation.rs`
 - RuvLLM Policy Store: `crates/ruvllm/src/policy_store.rs`
 - RuvLLM Contrastive Training: `crates/ruvllm/src/training/contrastive.rs`
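+- PT-BitNet: "Scaling up the 1-Bit large language model with post-training quantization" (2025): https://www.sciencedirect.com/science/article/abs/pii/S089360802500735X
+- BitDistill: "BitNet Distillation" (arXiv:2510.13998, Oct 2025): https://arxiv.org/html/2510.13998v1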
+- bartowski, GLM-4.7-Flash-GGUF quantizations: https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
+- llama.cpp IQ1_S blind testing: https://github.com/ggml-org/llama.cpp/discussions/5962