# Getting Started — Apply the WFGY Problem Map in real projects (no SDK)
This page shows how to turn the Problem Map into a working pipeline. No framework lock-in. No custom SDK. Everything is minimal, testable, and easy to remove if you do not like it.
You will build
- A stable retrieval chain that resists chunk drift and reduces hallucinations.
- A thin guard that forces answers to stay inside evidence.
- A trace log so you can locate failure points and map them to Problem Map items.
- A small evaluation harness so you can prove improvements.
Hardware: a laptop with 16 to 32 GB of RAM is enough for a 300 to 800 page corpus using CPU embeddings and FAISS or SQLite VSS.
References
- Problem Map index: ProblemMap/README.md
- RAG Architecture and Recovery: ProblemMap/rag-architecture-and-recovery.md
- Math layer and rules (PDF): Download Now
## 0. Principles you will enforce
These map directly to common failures in the Problem Map.
- Retrieval never returns context that crosses entity or constraint boundaries. This reduces No.1 Hallucination and Chunk Drift.
- Generation must quote only the provided evidence. If not provable, it must say "not in context".
- Every answer writes a machine readable trace with query, chunk ids, and scores. This turns debugging from guesswork into data.
- Optional rerank step promotes passages that match constraints, not only keywords.
- When storage or write fails, do not advance offsets or cursors. This prevents No.11 Symbolic Collapse type off by one bugs in pipelines.
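The evidence-only principle can be checked mechanically on every answer. A minimal sketch, assuming the `- citations: [id,...]` answer format used later in this guide (`citations_ok` is an illustrative helper, not part of the repo):

```python
import re

def citations_ok(answer_text, allowed_ids):
    """True if the answer refuses, or cites only ids from the retrieved chunks."""
    if "not in context" in answer_text.lower():
        return True  # an explicit refusal is always acceptable
    m = re.search(r"citations:\s*\[([^\]]*)\]", answer_text, re.IGNORECASE)
    if not m:
        return False  # a claim with no citation line fails the guard
    cited = {s.strip() for s in m.group(1).split(",") if s.strip()}
    # every cited id must come from the evidence actually shown to the model
    return bool(cited) and cited <= set(allowed_ids)
```

Reject or retry any answer where this returns False before it reaches the user.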
## 1. Corpus preparation and chunking (Python)
Install minimal tools.
```bash
pip install pypdf rank-bm25 sentence-transformers faiss-cpu rapidfuzz
```
Extract text per page. Save page spans to preserve boundaries.
```python
# tools/extract_pdf.py
from pypdf import PdfReader
import json, sys

def extract(pdf_path, out_json):
    reader = PdfReader(pdf_path)
    docs = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        docs.append({"id": f"p{i+1}", "page": i+1, "text": text})
    with open(out_json, "w", encoding="utf8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    extract(sys.argv[1], sys.argv[2])
```
Chunk with a constraint-aware strategy. Keep headings or definition lines attached to their paragraphs. Merge short lines, but never cross section boundaries like References or Footnotes.
```python
# tools/chunk.py
import re, json, sys

MAX_TOKENS = 350  # target size per chunk

def split_into_sentences(txt):
    # light sentence split. replace with a better splitter if you have one.
    return re.split(r'(?<=[.!?])\s+', txt.strip())

def is_boundary(line):
    head = line.lower().strip()
    return bool(re.match(r'^(abstract|introduction|conclusion|references?)\b', head))

def chunk_page(doc):
    lines = [l for l in doc["text"].splitlines() if l.strip()]
    chunks, buf, tokens, mark_boundary = [], [], 0, False
    for ln in lines:
        if is_boundary(ln):
            mark_boundary = True
        sents = split_into_sentences(ln)
        for s in sents:
            t = max(1, len(s.split()))
            if tokens + t > MAX_TOKENS or mark_boundary:
                if buf:
                    chunks.append(" ".join(buf))
                buf, tokens, mark_boundary = [s], t, False
            else:
                buf.append(s); tokens += t
    if buf:
        chunks.append(" ".join(buf))
    out = []
    for j, c in enumerate(chunks):
        out.append({
            "id": f'{doc["id"]}#{j+1}',
            "page": doc["page"],
            "text": c
        })
    return out

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]
    docs = json.load(open(src, encoding="utf8"))
    out = []
    for d in docs:
        out.extend(chunk_page(d))
    json.dump(out, open(dst, "w", encoding="utf8"), ensure_ascii=False, indent=2)
```
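A quick sanity check on the chunker output catches boundary bugs early, before they poison the index. A minimal sketch (`validate_chunks` is an illustrative helper; the 2x size tolerance is an arbitrary starting point):

```python
def validate_chunks(chunks, max_tokens=350):
    """Return a list of (id, problem) pairs for suspicious chunks."""
    problems = []
    for c in chunks:
        # a chunk far over the target size usually means a missed split
        if len(c["text"].split()) > 2 * max_tokens:
            problems.append((c["id"], "oversized"))
        # ids should follow the page#seq convention from chunk.py
        if "#" not in c["id"]:
            problems.append((c["id"], "bad id"))
    return problems
```

Run it over `data/chunks.json` after every chunker change and fail the build if it returns anything.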
## 2. Build a hybrid index: BM25 + embeddings + optional rerank (Python)
```bash
pip install numpy
```
```python
# index/build.py
import json, os, sys
import numpy as np
import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def build(corpus_json, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    chunks = json.load(open(corpus_json, encoding="utf8"))
    texts = [c["text"] for c in chunks]
    ids = [c["id"] for c in chunks]
    # lexical. BM25 is rebuilt from the tokenized dump at query time.
    tokenized = [t.lower().split() for t in texts]
    bm25 = BM25Okapi(tokenized)
    # semantic
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    np.save(os.path.join(out_dir, "embs.npy"), embs)
    json.dump(chunks, open(os.path.join(out_dir, "chunks.json"), "w", encoding="utf8"), ensure_ascii=False)
    json.dump(ids, open(os.path.join(out_dir, "ids.json"), "w", encoding="utf8"))
    bm25_dump = {
        "ids": ids,
        "tokenized": tokenized
    }
    json.dump(bm25_dump, open(os.path.join(out_dir, "bm25.json"), "w", encoding="utf8"))
    faiss.write_index(index, os.path.join(out_dir, "faiss.index"))

if __name__ == "__main__":
    build(sys.argv[1], sys.argv[2])
```
Retriever with hybrid scoring and a simple rerank. You can replace the rerank with your favorite cross-encoder later.
```python
# index/retrieve.py
import json
import numpy as np
import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

class HybridRetriever:
    def __init__(self, idx_dir):
        self.chunks = json.load(open(f"{idx_dir}/chunks.json", encoding="utf8"))
        ids = json.load(open(f"{idx_dir}/ids.json"))
        bm = json.load(open(f"{idx_dir}/bm25.json", encoding="utf8"))
        self.ids = ids
        self.bm25 = BM25Okapi(bm["tokenized"])
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        self.index = faiss.read_index(f"{idx_dir}/faiss.index")
        self.embs = np.load(f"{idx_dir}/embs.npy")

    def retrieve(self, q, topk=12):
        toks = q.lower().split()
        bm_scores = self.bm25.get_scores(toks)
        bm_top = np.argsort(bm_scores)[::-1][:topk * 4]
        qv = self.model.encode([q], normalize_embeddings=True)
        sims, idxs = self.index.search(qv.astype(np.float32), topk * 4)
        sem_top = idxs[0]
        # intersect first; widen to the union if too few candidates survive
        cand = list(set(bm_top).intersection(set(sem_top)))
        if len(cand) < topk:
            cand = list(set(list(bm_top) + list(sem_top)))
        # simple rerank by cosine against the query vector
        cand = np.array(cand)
        scores = self.embs[cand] @ qv[0]
        order = np.argsort(scores)[::-1][:topk]
        picks = cand[order]
        out = []
        for i, ix in enumerate(picks):
            c = self.chunks[ix]
            out.append({
                "id": c["id"], "text": c["text"],
                "score": float(scores[order][i])
            })
        return out
```
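If you later swap in a cross encoder, the rerank step changes in one place. A sketch with the scorer injected as a callable, so the wiring stays model agnostic (`rerank` is an illustrative name; for a real cross encoder you might pass `lambda q, t: ce.predict([(q, t)])[0]` where `ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` from sentence-transformers):

```python
def rerank(query, candidates, score_fn, topk=8):
    """Re-score candidates with any pairwise scorer and keep the best topk.

    candidates: list of {"id", "text", "score"} as returned by
    HybridRetriever.retrieve. score_fn(query, text) -> float.
    """
    scored = [(c, score_fn(query, c["text"])) for c in candidates]
    scored.sort(key=lambda x: -x[1])  # highest relevance first
    return [dict(c, score=float(s)) for c, s in scored[:topk]]
```

Keeping the scorer injectable also lets you unit test the rerank with a cheap lexical-overlap stub.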
## 3. Guard the answer and write a trace (Python)
You can use any LLM. The guard is model agnostic.
```bash
pip install openai  # or any client you prefer
```
```python
# pipeline/answer_py.py
import json, os, time
from typing import Dict, List

from openai import OpenAI

from index.retrieve import HybridRetriever

OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # example. replace as needed
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

def call_llm(prompt: str) -> str:
    client = OpenAI(api_key=OPENAI_API_KEY)
    out = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return out.choices[0].message.content.strip()

def build_prompt(question: str, chunks: List[Dict]) -> str:
    ctx = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Use only the evidence. If not provable, reply exactly: not in context.\n"
        "Answer format:\n"
        "- claim\n- citations: [id,...]\n\n"
        f"Question: {question}\n\nEvidence:\n{ctx}\n"
    )

def answer(question: str, retriever: HybridRetriever, topk=8) -> Dict:
    chunks = retriever.retrieve(question, topk=topk)
    prompt = build_prompt(question, chunks)
    txt = call_llm(prompt)
    ok = "not in context" in txt.lower() or "citations:" in txt.lower()
    record = {
        "ts": int(time.time()),
        "q": question,
        "chunks": [{"id": c["id"], "score": c["score"]} for c in chunks],
        "answer": txt,
        "ok": ok
    }
    os.makedirs("runs", exist_ok=True)
    with open("runs/trace.jsonl", "a", encoding="utf8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

if __name__ == "__main__":
    import sys
    retriever = HybridRetriever(sys.argv[1])
    print(answer(sys.argv[2], retriever))
```
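The trace file makes triage scriptable: refusal rate and weak retrievals fall out of a few lines. A minimal reader sketch (`summarize_trace` is an illustrative helper; the 0.3 weak-score threshold is an arbitrary starting point to tune, not a Problem Map constant):

```python
import json

def summarize_trace(lines):
    """lines: iterable of JSON strings, one per record, as in runs/trace.jsonl."""
    recs = [json.loads(l) for l in lines]
    n = len(recs)
    refusals = sum("not in context" in r["answer"].lower() for r in recs)
    # queries whose best retrieval score is low are retrieval failures, not generation failures
    weak = [r["q"] for r in recs
            if r["chunks"] and max(c["score"] for c in r["chunks"]) < 0.3]
    return {
        "n": n,
        "refusal_rate": refusals / n if n else 0.0,
        "weak_retrieval": weak,
    }
```

Call it with `summarize_trace(open("runs/trace.jsonl", encoding="utf8"))` and inspect `weak_retrieval` first; those queries map to the retrieval-side items in the Problem Map.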
Run end to end.
```bash
python tools/extract_pdf.py your.pdf data/pages.json
python tools/chunk.py data/pages.json data/chunks.json
python index/build.py data/chunks.json index_out
OPENAI_API_KEY=sk-... python pipeline/answer_py.py index_out "What is the author’s definition of X and on which page?"
```
## 4. Node.js variant
Install minimal tools.
```bash
npm install node-fetch @xenova/transformers faiss-node bm25
```
A simple retriever. This Node sketch is embedding-only; add a BM25 stage to fully mirror the Python hybrid version. @xenova/transformers runs CPU embeddings in process. For large corpora, precompute and store embeddings offline.
```js
// index/retrieve.js
import { pipeline } from "@xenova/transformers";
import faiss from "faiss-node";
import fs from "node:fs";

export class HybridRetriever {
  constructor(dir) {
    this.chunks = JSON.parse(fs.readFileSync(`${dir}/chunks.json`, "utf8"));
    this.ids = JSON.parse(fs.readFileSync(`${dir}/ids.json`, "utf8"));
    this.embs = Float32Array.from(
      Object.values(JSON.parse(fs.readFileSync(`${dir}/embs.json`, "utf8")))
    );
    this.d = this.embs.length / this.ids.length;
    this.index = new faiss.IndexFlatIP(this.d);
    this.index.add(Array.from(this.embs)); // faiss-node expects a plain number array
  }

  async init() {
    this.embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  }

  async retrieve(q, topk = 12) {
    const out = await this.embed(q, { pooling: "mean", normalize: true });
    const qv = Array.from(out.data);
    const { distances, labels } = this.index.search(qv, topk * 2);
    const picks = labels.slice(0, topk);
    return picks.map((ix, i) => ({
      id: this.chunks[ix].id,
      text: this.chunks[ix].text,
      score: distances[i]
    }));
  }
}
```
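Note that the Node loader reads `embs.json` while the Python builder in section 2 writes `embs.npy`. A one-off converter sketch in Python bridges the two (`npy_to_json` is an illustrative helper, not part of the repo):

```python
import json
import numpy as np

def npy_to_json(npy_path, json_path):
    """Flatten embs.npy (rows x dim) into the flat JSON list the Node loader expects.

    The Node side recovers the dimension as embs.length / ids.length,
    so the list must be row-major with no nesting.
    """
    embs = np.load(npy_path).astype("float32")
    with open(json_path, "w") as f:
        json.dump(embs.reshape(-1).tolist(), f)
```

Run it once after `index/build.py`, e.g. `npy_to_json("index_out/embs.npy", "index_out/embs.json")`.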
Guarded answer with your LLM client.
```js
// pipeline/answer_node.js
import fs from "node:fs";
import fetch from "node-fetch";
import { HybridRetriever } from "../index/retrieve.js";

async function callLLM(prompt) {
  const r = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify({
      model: process.env.OPENAI_MODEL || "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      temperature: 0
    })
  });
  const j = await r.json();
  return j.choices[0].message.content.trim();
}

function buildPrompt(question, chunks) {
  const ctx = chunks.map(c => `[${c.id}] ${c.text}`).join("\n\n");
  return `Use only the evidence. If not provable, reply exactly: not in context.
Answer format:
- claim
- citations: [id,...]

Question: ${question}

Evidence:
${ctx}
`;
}

export async function answer(indexDir, question) {
  const retr = new HybridRetriever(indexDir);
  await retr.init();
  const chunks = await retr.retrieve(question, 8);
  const prompt = buildPrompt(question, chunks);
  const txt = await callLLM(prompt);
  // same guard heuristic as the Python pipeline
  const ok = txt.toLowerCase().includes("not in context") || txt.toLowerCase().includes("citations:");
  const rec = {
    ts: Date.now(),
    q: question,
    chunks: chunks.map(c => ({ id: c.id, score: c.score })),
    answer: txt,
    ok
  };
  fs.mkdirSync("runs", { recursive: true });
  fs.appendFileSync("runs/trace.jsonl", JSON.stringify(rec) + "\n");
  return rec;
}

// run: node pipeline/answer_node.js index_out "your question"
if (import.meta.url === `file://${process.argv[1]}`) {
  answer(process.argv[2], process.argv.slice(3).join(" ")).then(x => console.log(x));
}
```
## 5. Evaluation harness
Create ten questions with ground truth spans. Keep the spans inside your chunks to avoid ambiguity.
`eval/qaset.json`:

```json
[
  {
    "qid": "q1",
    "q": "State the definition of X and cite the page.",
    "gold_ids": ["p12#2", "p12#3"],
    "gold_answer_contains": ["X is ..."]
  }
]
```
Scorer.
```python
# eval/score.py
import json, re, sys

def norm(s):
    return re.sub(r'\W+', ' ', s.lower()).strip()

def score(run_file, qaset):
    preds = [json.loads(l) for l in open(run_file, encoding="utf8")]
    gold = {q["qid"]: q for q in json.load(open(qaset, encoding="utf8"))}
    ok, refuse, cite_hit, n = 0, 0, 0, 0
    for p in preds:
        n += 1
        if "not in context" in p["answer"].lower():
            refuse += 1
        # citation overlap
        qid = p.get("qid", f"q{n}")
        g = gold.get(qid)
        if g:
            cites = {c["id"] for c in p["chunks"]}
            if set(g["gold_ids"]) & cites:
                cite_hit += 1
            if all(norm(x) in norm(p["answer"]) for x in g["gold_answer_contains"]):
                ok += 1
    return {"n": n, "exact_like": ok, "refusal": refuse, "cite_hit": cite_hit}

if __name__ == "__main__":
    print(score(sys.argv[1], sys.argv[2]))
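score.py matches predictions to gold by `qid`, which the `answer()` pipeline does not write into its trace records. A small runner sketch that injects it; `run_eval` is an illustrative helper, and `answer_fn` stands for any callable with the same shape as `answer()` from section 3:

```python
import json

def run_eval(qaset, answer_fn, out_path="runs/eval_trace.jsonl"):
    """qaset: list of {"qid", "q", ...}; answer_fn(q) -> record dict.

    Writes one trace line per question with the qid attached,
    so eval/score.py can join predictions to gold.
    """
    with open(out_path, "w", encoding="utf8") as f:
        for item in qaset:
            rec = answer_fn(item["q"])
            rec["qid"] = item["qid"]  # score.py keys on qid
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Then score with `python eval/score.py runs/eval_trace.jsonl eval/qaset.json`.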
Run a baseline then the guarded pipeline. Compare exact_like, refusal, and cite_hit.
Expect the guarded version to increase refusal when information is missing. That is good. It means fewer fabricated answers.
## 6. Recipes for common failures and how to fix them
These map directly to Problem Map items.
### No.1 Hallucination and Chunk Drift

Symptoms
- Answers include facts that are near the topic but not in your chunks.

Fix
- Reduce chunk size to keep entity and constraint within the same chunk.
- Use the intersection of BM25 and embeddings, then rerank by cosine.
- Keep a strict answer template and allow refusal.
### No.2 Query Parsing and Intent Split

Symptoms
- Multi part questions pull only one sub topic.

Fix
- Split questions on explicit markers. Run retrieval per subquestion. Merge results.
- Rerank with a feature that rewards coverage across sub parts.
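The split-and-merge fix can be sketched as follows. The `and`-based splitter is deliberately crude; swap in your own marker list, and note that `split_subquestions` and `retrieve_multi` are illustrative names:

```python
import re

def split_subquestions(q):
    """Split on explicit multi-part markers; fall back to the whole question."""
    parts = re.split(r"\band\b|;|\?(?=.)", q)
    parts = [p.strip(" ?") for p in parts if p.strip(" ?")]
    return parts or [q]

def retrieve_multi(retriever, q, topk=8):
    """Retrieve per subquestion, then merge with id-level deduplication."""
    seen, merged = set(), []
    for sub in split_subquestions(q):
        for c in retriever.retrieve(sub, topk=topk):
            if c["id"] not in seen:
                seen.add(c["id"])
                merged.append(c)
    return merged
```

`retrieve_multi` drops straight in wherever `retriever.retrieve` was called before.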
### No.3 Index Schema Drift

Symptoms
- Chunks produced with a different preprocessor than the one used at query time.

Fix
- Store the chunker version and tokenization rule with the index. Refuse to answer on a version mismatch.
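A minimal version-stamp sketch; `write_meta` and `check_meta` are illustrative names, and the meta file path is an assumption about your index layout:

```python
import json

def write_meta(meta_path, chunker_version, tokenizer_rule):
    """Record, at build time, exactly how the index was produced."""
    with open(meta_path, "w", encoding="utf8") as f:
        json.dump({"chunker": chunker_version, "tokenizer": tokenizer_rule}, f)

def check_meta(meta_path, chunker_version, tokenizer_rule):
    """At query time, raise instead of silently answering from a drifted index."""
    meta = json.load(open(meta_path, encoding="utf8"))
    expected = {"chunker": chunker_version, "tokenizer": tokenizer_rule}
    if meta != expected:
        raise RuntimeError(f"index schema drift: built with {meta}, query side expects {expected}")
```

Call `write_meta` at the end of `index/build.py` and `check_meta` in the retriever constructor.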
### No.4 Over Retrieval and Tail Noise

Symptoms
- Too many low score chunks enter the prompt and drown the relevant ones.

Fix
- Hard cut after the score knee. Never include tail beyond top 20 before rerank and top 8 after rerank.
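One way to find the score knee, sketched; `knee_cut` is an illustrative helper and the `drop_ratio` of 0.5 is a starting point to tune per corpus:

```python
def knee_cut(scores, max_keep=20, drop_ratio=0.5):
    """Return how many candidates to keep.

    scores: sorted descending. Keep candidates until the first one
    falls below drop_ratio of the top score, capped at max_keep.
    """
    keep = 1
    for i in range(1, min(len(scores), max_keep)):
        if scores[i] < scores[0] * drop_ratio:
            break  # the knee: scores collapse relative to the best hit
        keep = i + 1
    return keep
```

Apply it to the rerank scores before building the prompt, e.g. `chunks = chunks[:knee_cut([c["score"] for c in chunks])]`.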
### No.11 Symbolic Collapse

Symptoms
- Pipeline marks success while storage or catalog updates failed. Users later see missing answers or duplicates.

Fix
- Make commit of offsets or cursors contingent on durable writes.
- Example in Python:

```python
def flush_and_commit(batch):
    try:
        write(batch)
        update_catalog()
        return next_offset()  # expose N+1 only after both steps succeed
    except Exception:
        return None  # block commit so batch replays
```
### No.14 Bootstrap Ordering

Symptoms
- Components start in the wrong order and initial calls see empty resources.

Fix
- Warm the retriever and embedding model before serving the first query.
- Cache the model and index in memory with a readiness flag.

```python
class Service:
    def __init__(self):
        self.ready = False

    def warm(self):
        self.retr = HybridRetriever("index_out")
        _ = self.retr.retrieve("warmup", 1)
        self.ready = True
```
## 7. Operational guidance
- Store `runs/trace.jsonl` in your logs. Add query id and user id if appropriate.
- Sample one in ten queries for manual inspection.
- Track these metrics:
  - refusal rate
  - citation overlap rate
  - first token latency
  - rerank time
  - chunk size distribution
- Capacity:
  - FAISS flat inner product on CPU handles tens of thousands of chunks easily.
  - For millions of chunks, move to IVF or HNSW and precompute embeddings offline.
## 8. Minimal quick start summary
```bash
# one time
pip install pypdf rank-bm25 sentence-transformers faiss-cpu rapidfuzz openai
python tools/extract_pdf.py your.pdf data/pages.json
python tools/chunk.py data/pages.json data/chunks.json
python index/build.py data/chunks.json index_out

# answering
OPENAI_API_KEY=sk-... python pipeline/answer_py.py index_out "Your question here"

# traces at runs/trace.jsonl
```
## 9. What to do if results are still unstable
Open an issue with three things. This makes triage fast and maps to the Problem Map.
- The query and your expected answer.
- The top 10 retrieved chunk ids and scores.
- The generated answer.
We will tell you which Problem Map item it matches and give exact steps that apply to your repo.
## 10. Roadmap for practice-oriented docs
We will add
- A small corpus with prebuilt index for quick experiments.
- Node scripts that mirror the Python chunker and builder.
- Optional cross encoder rerankers that still run on CPU.
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
### 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.