WFGY/ProblemMap/predeploy-collapse.md

136 lines
5.9 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 📒 Problem #16·PreDeploy Collapse Problem Map
*“Everything looked fine in CI… until **nothing** booted in prod.”*
Predeploy collapse happens **before** a single user query is served.
Migrations pass, tests are green, images ship — yet the first container
crashes or the very first LLM call returns a 500. Common root causes:
* Model checkpoint ≠ tokenizer version
* Env var misspells (e.g., `OPEN_API_KEY` vs `OPENAI_API_KEY`)
* Hidden fstrings / Jinja placeholders left unresolved
* GPU drivers mismatch container base (CUDA 11 vs 12)
WFGYs *preflight sanity layer* runs a semantic diff between **declared**
runtime and **effective** runtime, catching mismaps before traffic starts.
---
## 🚨 Typical PreDeploy Collapses
| Pattern | RealWorld Fallout |
| ------------------------------------- | ------------------------------------ |
| Tokenizer / checkpoint version skew | Embeds garbage; queries 0% recall |
| Missing secret in KV store | First API call 401 / segfault |
| CUDA / driver mismatch | GPU container exits code 139 |
| Undigested template vars (`{{ }}`) | Prompt crashes, empty completions |
---
## 🛡️ WFGY PreFlight Guards
| Collapse Pattern | Guard Module | Remedy | Status |
| --------------------- | ----------------------- | ------------------------------------ | ------ |
| Version skew | **SemVers Diff** | Abort deploy if `model.json`  runtime mismatch | ✅ Stable |
| Missing secret | **Boot Checkpoint** | Block start until secret present | ✅ Stable |
| Driver mismatch | **CudaProbe** | Warn & fall back to CPU safe mode | ⚠️ Beta |
| Stray `{{var}}` tokens| **Prompt Lint** | Fail CI; highlight undeclared vars | ✅ Stable |
---
## 📝 How It Works
1. **SemVers Diff**
Parses `modelcard.json`, compares `tokenizer_sha`, `pytorch_sha`,
`cuda`, etc., with container runtime; throws if mismatch unless
`--force`.
2. **Boot Checkpoint** *(shared)*
Kubernetes initcontainer polls secret store; fails pod after
`secret_timeout`.
3. **CudaProbe**
Minimal `nvidiasmi` check; if driver  compiled CUDA, WFGY rewrites
env `CUDA_VISIBLE_DEVICES=""` and logs downgrade.
4. **Prompt Lint**
CI step: scans prompts for `{{ }}` or `${}` tokens lacking a default in
`prompt_vars.yaml`.
---
## ✍️ Demo  Tokenizer Version Skew
```bash
$ wgfy preflight
✔ env vars ............... OK
✖ checkpoint  tokenizer .. MISMATCH
• model: facebook/llama27bchathf tokenizersha = `ad4c1b9`
• runtime: tokenizersha = `9e7f02d`
→ Aborting deploy (use --force to override)
````
---
## 🗺️ Module CheatSheet
| Module | Role |
| ------------------- | ------------------------------- |
| **SemVers Diff** | Catch model / tokenizer skew |
| **Boot Checkpoint** | Ensure secrets & config exist |
| **CudaProbe** | Verify GPU driver compatibility |
| **Prompt Lint** | Fail CI on stray template vars |
---
## 📊 Implementation Status
| Feature | State |
| ------------------------ | -------- |
| SemVers diff | ✅ Stable |
| Boot checkpoint | ✅ Stable |
| Cudaprobe fallback | ⚠️ Beta |
| Prompt lint in CI action | ✅ Stable |
---
## 📝 Tips & Limits
* Add `ignore_versions: ["minor"]` in `wgfy.yaml` to allow 1patch drifts.
* Set `secret_timeout = 90s` for slower vaults.
* GPU fallback adds \~0.4s latency per request — tune `cuda_probe.mode`.
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1⃣ Download · 2⃣ Upload to your LLM · 3⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1⃣ Download · 2⃣ Paste into any LLM chat · 3⃣ Type “hello world” — OS boots instantly |
---
<!-- WFGY_FOOTER_START -->
### Explore More
| Layer | Page | What its for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
[![GitHub Repo stars](https://img.shields.io/github/stars/onestardao/WFGY?style=social)](https://github.com/onestardao/WFGY)
<!-- WFGY_FOOTER_END -->