# Local Deploy & Inference — Global Fix Map
🏥 Quick Return to Emergency Room
> You are in a specialist desk.
> For full triage and doctors on duty, return here:
>
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/README.md)
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md)
>
> Think of this page as a sub-room.
> If you want full consultation and prescriptions, go back to the Emergency Room lobby.
A beginner-friendly hub to **stabilize locally hosted LLMs** on your own machine or cluster.
Use this folder when it looks like the “model is broken” but the **real cause is infra settings**: tokenizer mismatch, rope scaling, kv-cache size, build flags, or server parameters.
Every guide links back to WFGY with measurable acceptance targets. No infra rebuild required.
---
## When to use this folder
- Local server gives fluent answers but citations point to the wrong snippet
- Same input produces different outputs on each run
- JSON mode fails on long answers, or tool calls loop endlessly
- Latency keeps growing after a few turns, or context cuts off too early
- Quantized model outputs diverge heavily from fp16 baseline
- Retrieval quality drops after switching loaders or UIs
---
## Open these first
- Recovery map: [RAG Architecture & Recovery](../../rag-architecture-and-recovery.md)
- Retrieval knobs: [Retrieval Playbook](../../retrieval-playbook.md)
- Traceability schema: [Retrieval Traceability](../../retrieval-traceability.md)
- Meaning vs similarity: [Embedding ≠ Semantic](../../embedding-vs-semantic.md)
- Rank ordering: [Rerankers](../../rerankers.md)
- Drift in long runs: [Context Drift](../../context-drift.md), [Entropy Collapse](../../entropy-collapse.md)
- Logic collapse and repair: [Logic Collapse](../../logic-collapse.md)
- Guarding against bad prompts: [Prompt Injection](../../prompt-injection.md)
- Contract schema for snippets: [Data Contracts](../../data-contracts.md)
---
## Acceptance targets
- ΔS(question, retrieved) ≤ **0.45**
- Coverage of target section ≥ **0.70**
- λ convergent across 3 paraphrases × 2 seeds
- E_resonance stays flat on long windows
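The first two targets can be checked mechanically. The sketch below is a minimal illustration, not the WFGY implementation: it assumes ΔS is computed as 1 minus cosine similarity between two embedding vectors, and it assumes coverage is the fraction of target-section tokens that appear in the retrieved text. The function names (`delta_s`, `coverage`, `meets_targets`) are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero-length vector")
    return dot / (norm_a * norm_b)


def delta_s(question_vec: Sequence[float], retrieved_vec: Sequence[float]) -> float:
    # Assumption: ΔS = 1 - cosine similarity of the two embeddings.
    return 1.0 - cosine(question_vec, retrieved_vec)


def coverage(retrieved_tokens: Sequence[str], target_tokens: Sequence[str]) -> float:
    # Assumption: coverage = share of distinct target tokens found in the retrieved text.
    target = set(target_tokens)
    if not target:
        return 0.0
    return len(target & set(retrieved_tokens)) / len(target)


def meets_targets(ds: float, cov: float,
                  max_delta_s: float = 0.45, min_coverage: float = 0.70) -> bool:
    """True when both thresholds from the list above are satisfied."""
    return ds <= max_delta_s and cov >= min_coverage
```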
---
## Quick routes to per-tool pages
- [ollama.md](./ollama.md)
- [vllm.md](./vllm.md)
- [llama_cpp.md](./llama_cpp.md)
- [tgi.md](./tgi.md)
- [lmstudio.md](./lmstudio.md)
- [koboldcpp.md](./koboldcpp.md)
- [openwebui.md](./openwebui.md)
- [oobabooga.md](./oobabooga.md)
---
## Common local causes & fixes
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| Wrong snippet despite high similarity | Tokenizer mismatch, analyzer drift | Align tokenizer files, check retriever metric, use [Embedding ≠ Semantic](../../embedding-vs-semantic.md) |
| JSON tool calls unstable | Schema drift, free text in outputs | Enforce [Data Contracts](../../data-contracts.md), apply [Logic Collapse](../../logic-collapse.md) |
| Outputs flip each run | Context order drift, variance | Clamp header order, use [Context Drift](../../context-drift.md), enforce trace table |
| Hybrid retrieval worse than single | Ranker instability | Split parsing → [pattern_query_parsing_split.md](../../patterns/pattern_query_parsing_split.md) |
| Fixed hallucination returns later | Long chain decay | [hallucination-reentry.md](../../patterns/pattern_hallucination_reentry.md) |
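For the "JSON tool calls unstable" row, one way to catch schema drift and stray prose is to validate every tool-call output before acting on it. This is a rough sketch using only the Python standard library; the contract shown (`tool` and `arguments` keys) is a hypothetical example, not the schema defined in Data Contracts.

```python
import json

# Hypothetical contract: exactly these keys, with these types.
REQUIRED_KEYS = {"tool": str, "arguments": dict}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Return (ok, reason). Rejects any free text around the JSON payload."""
    try:
        obj = json.loads(raw)  # fails on leading or trailing prose
    except json.JSONDecodeError as exc:
        return False, f"not strict JSON: {exc.msg}"
    if not isinstance(obj, dict):
        return False, "top level must be an object"
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in obj:
            return False, f"missing key: {key}"
        if not isinstance(obj[key], expected_type):
            return False, f"wrong type for {key}"
    extra = set(obj) - set(REQUIRED_KEYS)
    if extra:
        return False, f"unexpected keys: {sorted(extra)}"
    return True, "ok"
```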
---
## Local-specific guardrails
- **Model format**: GGUF vs safetensors vs HF transformers → use same tokenizer and rope scale
- **Quantization**: Compare q4/q8 vs fp16; if ΔS drifts, tune kv_cache and sampling params
- **Server flags**: Align defaults (temp, top_p, penalties, stop tokens) across servers
- **Tokenizer & casing**: Keep analyzers consistent across retrievers, rerankers, HyDE
- **Batching**: Fix batch size during eval; dynamic batching can shift numerics between runs and look like “randomness”
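To apply the server-flags guardrail, it can help to list the effective sampling settings of each server side by side and flag any difference. The sketch below assumes you have already collected each server's settings into a plain dict; the server names and parameter keys are illustrative, since real parameter names vary by server.

```python
def diff_sampling(configs: dict[str, dict]) -> dict[str, dict]:
    """Return every parameter whose value differs (or is unset) across servers."""
    all_keys = set().union(*(cfg.keys() for cfg in configs.values()))
    mismatches = {}
    for key in sorted(all_keys):
        values = {name: cfg.get(key, "<unset>") for name, cfg in configs.items()}
        # Compare by repr so unhashable values such as stop-token lists work.
        if len({repr(v) for v in values.values()}) > 1:
            mismatches[key] = values
    return mismatches


# Illustrative settings only; fill in what your servers actually use.
example = {
    "server_a": {"temperature": 0.7, "top_p": 0.9, "stop": ["</s>"]},
    "server_b": {"temperature": 0.8, "top_p": 0.9},
}
for param, per_server in diff_sampling(example).items():
    print(param, per_server)
```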
---
## 60-second fix checklist
1. Compute ΔS(question, retrieved) and ΔS(retrieved, anchor)
   - <0.40 = stable, 0.40–0.60 = risky, ≥0.60 = broken
2. Probe λ_observe at k=5,10,20; if ΔS flat & high → metric/index bug
3. Apply modules:
   - Retrieval drift → BBMC + Data Contracts
   - Collapse in reasoning → BBCR + BBAM
   - Dead ends in long runs → BBPF alternate paths
4. Verify coverage ≥0.70 and λ convergent on 2 seeds
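Steps 1 and 2 of the checklist can be sketched as two small helpers. The band edges come from step 1; the meaning of "flat & high" is not defined on this page, so the `high` and `flat_tol` defaults below are assumptions you should tune, and both function names are hypothetical.

```python
def classify_delta_s(ds: float) -> str:
    """Map a ΔS value to the bands from step 1."""
    if ds < 0.40:
        return "stable"
    if ds < 0.60:
        return "risky"
    return "broken"


def flat_and_high(ds_by_k: dict[int, float],
                  high: float = 0.60, flat_tol: float = 0.05) -> bool:
    """Step 2 heuristic: ΔS stays high and barely moves as k grows.

    Assumption: "high" means every value is at least `high`, and "flat"
    means the spread between largest and smallest is within `flat_tol`.
    """
    values = list(ds_by_k.values())
    if not values:
        raise ValueError("need at least one ΔS measurement")
    return min(values) >= high and (max(values) - min(values)) <= flat_tol


# Illustrative numbers only: ΔS measured at k=5, 10, 20.
probe = {5: 0.71, 10: 0.70, 20: 0.72}
if flat_and_high(probe):
    print("ΔS is flat and high: suspect the metric or index, not the model")
```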
---
## Copy-paste prompt for local servers
```
I have TXT OS + WFGY loaded.
Local setup:
* server: <...>
* model: <...>, quant=<...>, ctx=<...>, rope=<...>
* sampling: temp=<...>, top_p=<...>, max_tokens=<...>
* retriever: <...>, <...>, k=<...>
Tell me:
1. which layer is failing and why
2. which WFGY page to open
3. steps to push ΔS ≤ 0.45 and keep λ convergent
4. reproducible test to confirm
```
---
## FAQ (Beginner-Friendly)
**Q: Why does my local model give fluent text but wrong citations?**
A: Usually the cause is not the model but a tokenizer or retriever mismatch. Fix it by aligning tokenizer files and checking ΔS against the gold section.
**Q: Why does JSON mode fail locally but work on cloud APIs?**
A: Local servers often don’t enforce schema strictly. Apply [Data Contracts](../../data-contracts.md) and disallow free-form prose in tool outputs.
**Q: My quantized model is much worse — is quantization broken?**
A: Not always. Small kv_cache or rope mis-scaling causes drift. Compare fp16 vs quant on a gold set before blaming quantization.
**Q: Why do answers flip between runs?**
A: Header order, batching, or randomness. Use variance clamps (BBAM) and fix batch size during tests.
**Q: Which numbers matter for stability?**
A: ΔS ≤ 0.45, coverage ≥0.70, λ convergent across paraphrases, flat E_resonance over long docs.
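The advice in the quantization answer, comparing fp16 and quantized outputs on a gold set, can be sketched as follows. This assumes you have already collected answers from both builds for the same questions; exact match after whitespace and case normalization is a deliberately crude scoring rule chosen for illustration, and `compare_on_gold` is a hypothetical helper.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def compare_on_gold(gold: list[str], fp16_out: list[str],
                    quant_out: list[str]) -> dict[str, float]:
    """Score both builds against the gold answers and against each other."""
    n = len(gold)
    if n == 0 or len(fp16_out) != n or len(quant_out) != n:
        raise ValueError("all three lists must be non-empty and the same length")
    g = [normalize(x) for x in gold]
    f = [normalize(x) for x in fp16_out]
    q = [normalize(x) for x in quant_out]
    return {
        "fp16_accuracy": sum(a == b for a, b in zip(f, g)) / n,
        "quant_accuracy": sum(a == b for a, b in zip(q, g)) / n,
        # Low agreement alongside similar accuracy points at settings, not quantization.
        "agreement": sum(a == b for a, b in zip(f, q)) / n,
    }
```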
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <...>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
---
### Explore More
| Layer | Page | What it’s for |
| --- | --- | --- |
| Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| Engine | [WFGY 1.0](/legacy/README.md) | Original PDF based tension engine |
| Engine | [WFGY 2.0](/core/README.md) | Production tension kernel and math engine for RAG and agents |
| Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine, 131 S class set |
| Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure checklist and fix map |
| Map | [Problem Map 2.0](/ProblemMap/rag-architecture-and-recovery.md) | RAG focused recovery pipeline |
| Map | [Problem Map 3.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card, image as a debug protocol layer |
| Map | [Semantic Clinic](/ProblemMap/SemanticClinicIndex.md) | Symptom to family to exact fix |
| Map | [Grandma’s Clinic](/ProblemMap/GrandmaClinic/README.md) | Plain language stories mapped to Problem Map 1.0 |
| Onboarding | [Starter Village](/StarterVillage/README.md) | Guided tour for newcomers |
| App | [TXT OS](/OS/README.md) | TXT semantic OS, fast boot |
| App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q and A built on TXT OS |
| App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image with semantic control |
| App | [Blow Blow Blow](/OS/BlowBlowBlow/README.md) | Reasoning game engine and memory demo |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.