WFGY/ProblemMap/predeploy-collapse.md
2025-08-15 23:21:37 +08:00


# 📒 Problem #16 · Pre-Deploy Collapse Problem Map

*“Everything looked fine in CI… until **nothing** booted in prod.”*

Pre-deploy collapse happens **before** a single user query is served.
Migrations pass, tests are green, images ship, yet the first container
crashes or the very first LLM call returns a 500. Common root causes:

* Model checkpoint ≠ tokenizer version
* Misspelled env vars (e.g., `OPEN_API_KEY` vs `OPENAI_API_KEY`)
* Hidden f-strings / Jinja placeholders left unresolved
* GPU driver mismatch with the container base image (CUDA 11 vs 12)

WFGY's *pre-flight sanity layer* runs a semantic diff between the **declared**
runtime and the **effective** runtime, catching mismatches before traffic starts.

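
As a minimal sketch of the declared-versus-effective idea (not WFGY's actual implementation), the misspelled env var case can be caught by comparing the names a service declares against the names actually present and flagging near-misses. The `check_env` helper and the sample values are hypothetical; a real check would read `os.environ` instead of a plain dict.

```python
import difflib

def check_env(declared, effective):
    """Return (missing, suspects).

    missing  -- declared names that are absent from the effective environment
    suspects -- maps each missing name to a similarly spelled name that IS present
    """
    missing = [name for name in declared if name not in effective]
    suspects = {}
    for name in missing:
        # A close match among the effective names hints at a typo.
        close = difflib.get_close_matches(name, list(effective), n=1, cutoff=0.8)
        if close:
            suspects[name] = close[0]
    return missing, suspects

# Hypothetical declared names and a stand-in for os.environ.
declared = ["OPENAI_API_KEY", "MODEL_PATH"]
effective = {"OPEN_API_KEY": "dummy", "MODEL_PATH": "/models/llm"}

missing, suspects = check_env(declared, effective)
print(missing)   # ['OPENAI_API_KEY']
print(suspects)  # {'OPENAI_API_KEY': 'OPEN_API_KEY'}
```
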
---
## 🚨 Typical Pre-Deploy Collapses
| Pattern                               | Real-World Fallout                          |
| ------------------------------------- | ------------------------------------------- |
| Tokenizer / checkpoint version skew   | Garbage embeddings; queries get 0% recall   |
| Missing secret in KV store            | First API call fails with 401 or segfaults  |
| CUDA / driver mismatch                | GPU container exits with code 139           |
| Unresolved template vars (`{{ }}`)    | Prompt crashes, empty completions           |
---
## 🛡️ WFGY Pre-Flight Guards
| Collapse Pattern | Guard Module | Remedy | Status |
| --------------------- | ----------------------- | ------------------------------------ | ------ |
| Version skew | **SemVers Diff** | Abort deploy if `model.json` and runtime mismatch | ✅ Stable |
| Missing secret | **Boot Checkpoint** | Block start until secret present | ✅ Stable |
| Driver mismatch | **CudaProbe** | Warn & fall back to CPU safe mode | ⚠️ Beta |
| Stray `{{var}}` tokens| **Prompt Lint** | Fail CI; highlight undeclared vars | ✅ Stable |
---
## 📝 How It Works
1. **SemVers Diff**
   Parses `modelcard.json` and compares `tokenizer_sha`, `pytorch_sha`,
   `cuda`, etc., against the container runtime; throws on any mismatch
   unless `--force` is passed.
2. **Boot Checkpoint** *(shared)*
   A Kubernetes init container polls the secret store and fails the pod
   after `secret_timeout`.
3. **CudaProbe**
   Runs a minimal `nvidia-smi` check; if the driver does not match the
   compiled CUDA version, WFGY sets `CUDA_VISIBLE_DEVICES=""` in the env
   and logs the downgrade.
4. **Prompt Lint**
   CI step that scans prompts for `{{ }}` or `${}` tokens lacking a default in
   `prompt_vars.yaml`.
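
The Prompt Lint step can be approximated with a single regular expression. This is a sketch under assumptions, not WFGY's actual linter: `lint_prompt` is a hypothetical helper, and the `defaults` dict stands in for the parsed contents of `prompt_vars.yaml`, whose real schema is not shown on this page.

```python
import re

# Matches Jinja-style {{ name }} and shell-style ${name} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}|\$\{(\w+)\}")

def lint_prompt(text, defaults):
    """Return the sorted placeholder names in `text` that have no default."""
    # findall yields (jinja_name, shell_name) tuples; exactly one is non-empty.
    found = {jinja or shell for jinja, shell in PLACEHOLDER.findall(text)}
    return sorted(found - set(defaults))

prompt = "You are {{ persona }}. Answer in ${language}. Context: {{context}}"
defaults = {"persona": "a helpful assistant", "language": "English"}

undeclared = lint_prompt(prompt, defaults)
print(undeclared)  # ['context']
```
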
---
## ✍️ Demo: Tokenizer Version Skew
```bash
$ wgfy preflight
✔ env vars ............... OK
✖ checkpoint / tokenizer . MISMATCH
  • model:   facebook/llama-2-7b-chat-hf  tokenizer-sha = `ad4c1b9`
  • runtime: tokenizer-sha = `9e7f02d`
→ Aborting deploy (use --force to override)
```
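
The abort in the demo above can be approximated as a field-by-field comparison. This is a sketch under assumptions: `diff_runtime`, `VersionSkewError`, and the dictionary layout are hypothetical stand-ins rather than WFGY's API; only the field names and the two hashes come from this page.

```python
class VersionSkewError(RuntimeError):
    """Raised when declared and effective runtime fields disagree."""

def diff_runtime(declared, effective, force=False):
    """Compare declared fields with the effective runtime.

    Returns a dict of {field: (declared_value, effective_value)} for every
    field that differs. Raises VersionSkewError on any difference unless
    `force` is true, mirroring the --force override.
    """
    mismatches = {
        key: (want, effective.get(key))
        for key, want in declared.items()
        if effective.get(key) != want
    }
    if mismatches and not force:
        raise VersionSkewError(f"runtime mismatch: {mismatches}")
    return mismatches

declared = {"tokenizer_sha": "ad4c1b9", "cuda": "12"}
effective = {"tokenizer_sha": "9e7f02d", "cuda": "12"}

# With force=True the skew is reported instead of aborting.
mismatches = diff_runtime(declared, effective, force=True)
print(mismatches)  # {'tokenizer_sha': ('ad4c1b9', '9e7f02d')}
```
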
---
## 🗺️ Module CheatSheet
| Module | Role |
| ------------------- | ------------------------------- |
| **SemVers Diff** | Catch model / tokenizer skew |
| **Boot Checkpoint** | Ensure secrets & config exist |
| **CudaProbe** | Verify GPU driver compatibility |
| **Prompt Lint** | Fail CI on stray template vars |
---
## 📊 Implementation Status
| Feature | State |
| ------------------------ | -------- |
| SemVers diff | ✅ Stable |
| Boot checkpoint | ✅ Stable |
| CudaProbe fallback | ⚠️ Beta |
| Prompt lint in CI action | ✅ Stable |
---
## 📝 Tips & Limits
* Add `ignore_versions: ["minor"]` in `wgfy.yaml` to allow 1-patch drifts.
* Set `secret_timeout = 90s` for slower vaults.
* Falling back to CPU adds \~0.4s latency per request; tune `cuda_probe.mode`.
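
To illustrate what `secret_timeout` bounds, here is a hedged sketch of a poll-until-deadline loop like the one a Boot Checkpoint init container might run. `parse_duration` and `wait_for_secret` are hypothetical helpers, and the `fetch` callable stands in for whatever client reads the secret store. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time

def parse_duration(text):
    """Convert '90s' or '2m' to seconds (only the s and m suffixes are handled)."""
    units = {"s": 1, "m": 60}
    return float(text[:-1]) * units[text[-1]]

def wait_for_secret(fetch, timeout, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until it returns a non-None value or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if clock() >= deadline:
            raise TimeoutError(f"secret not available after {timeout:g}s")
        sleep(interval)

# Simulated store: the secret appears on the third poll.
attempts = iter([None, None, "s3cr3t"])
secret = wait_for_secret(lambda: next(attempts),
                         parse_duration("90s"),
                         sleep=lambda _: None)  # skip real sleeping in the demo
print(secret)  # s3cr3t
```
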
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” and the OS boots instantly |
---
### 🧭 Explore More
| Module | Description | Link |
|-----------------------|----------------------------------------------------------|----------|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
&nbsp;
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
&nbsp;
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
&nbsp;
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
&nbsp;
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
&nbsp;
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
&nbsp;
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
&nbsp;
</div>