# Getting Started — Apply the WFGY Problem Map in real projects (no SDK)
This page shows how to turn the Problem Map into a working pipeline. No framework lock-in. No custom SDK. Everything is minimal, testable, and easy to remove if you do not like it.
You will build
- A stable retrieval chain that resists chunk drift and reduces hallucinations.
- A thin guard that forces answers to stay inside evidence.
- A trace log so you can locate failure points and map them to Problem Map items.
- A small evaluation harness so you can prove improvements.
Hardware: a laptop with 16 to 32 GB of RAM is enough for a 300 to 800 page corpus using CPU embeddings and FAISS or SQLite VSS.
References
- Problem Map index: ProblemMap/README.md
- RAG Architecture and Recovery: ProblemMap/rag-architecture-and-recovery.md
- Math layer and rules (PDF): Download Now
## 0. Principles you will enforce
These map directly to common failures in the Problem Map.
- Retrieval never returns context that crosses entity or constraint boundaries. This reduces No.1 Hallucination and Chunk Drift.
- Generation must quote only the provided evidence. If not provable, it must say "not in context".
- Every answer writes a machine readable trace with query, chunk ids, and scores. This turns debugging from guesswork into data.
- Optional rerank step promotes passages that match constraints, not only keywords.
- When storage or write fails, do not advance offsets or cursors. This prevents No.11 Symbolic Collapse type off by one bugs in pipelines.
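The evidence-only principle can be checked mechanically on every answer. A minimal sketch, assuming the `- citations: [id,...]` answer format used later in this guide (`citations_ok` is an illustrative helper, not part of the repo):

```python
import re

def citations_ok(answer_text, allowed_ids):
    """True if the answer refuses, or cites only ids from the retrieved chunks."""
    if "not in context" in answer_text.lower():
        return True  # an explicit refusal is always acceptable
    m = re.search(r"citations:\s*\[([^\]]*)\]", answer_text, re.IGNORECASE)
    if not m:
        return False  # a claim with no citation line fails the guard
    cited = {s.strip() for s in m.group(1).split(",") if s.strip()}
    # every cited id must come from the evidence actually shown to the model
    return bool(cited) and cited <= set(allowed_ids)
```

Reject or retry any answer where this returns False before it reaches the user.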
## 1. Corpus preparation and chunking (Python)
Install minimal tools.
```bash
pip install pypdf rank-bm25 sentence-transformers faiss-cpu rapidfuzz
```
Extract text per page. Save page spans to preserve boundaries.
```python
# tools/extract_pdf.py
from pypdf import PdfReader
import json, sys

def extract(pdf_path, out_json):
    reader = PdfReader(pdf_path)
    docs = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        docs.append({"id": f"p{i+1}", "page": i+1, "text": text})
    with open(out_json, "w", encoding="utf8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    extract(sys.argv[1], sys.argv[2])
```
Chunk with a constraint-aware strategy. Keep headings or definition lines attached to their paragraphs. Merge short lines, but never cross section boundaries like References or Footnotes.
```python
# tools/chunk.py
import re, json, sys

MAX_TOKENS = 350  # target size per chunk

def split_into_sentences(txt):
    # light sentence split. replace with a better splitter if you have one.
    return re.split(r'(?<=[.!?])\s+', txt.strip())

def is_boundary(line):
    head = line.lower().strip()
    return bool(re.match(r'^(abstract|introduction|conclusion|references?)\b', head))

def chunk_page(doc):
    lines = [l for l in doc["text"].splitlines() if l.strip()]
    chunks, buf, tokens, mark_boundary = [], [], 0, False
    for ln in lines:
        if is_boundary(ln):
            mark_boundary = True
        sents = split_into_sentences(ln)
        for s in sents:
            t = max(1, len(s.split()))
            if tokens + t > MAX_TOKENS or mark_boundary:
                if buf:
                    chunks.append(" ".join(buf))
                buf, tokens, mark_boundary = [s], t, False
            else:
                buf.append(s); tokens += t
    if buf:
        chunks.append(" ".join(buf))
    out = []
    for j, c in enumerate(chunks):
        out.append({
            "id": f'{doc["id"]}#{j+1}',
            "page": doc["page"],
            "text": c
        })
    return out

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]
    docs = json.load(open(src, encoding="utf8"))
    out = []
    for d in docs:
        out.extend(chunk_page(d))
    json.dump(out, open(dst, "w", encoding="utf8"), ensure_ascii=False, indent=2)
```
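A quick sanity check on the chunker output catches boundary bugs early, before they poison the index. A minimal sketch (`validate_chunks` is an illustrative helper; the 2x size tolerance is an arbitrary starting point):

```python
def validate_chunks(chunks, max_tokens=350):
    """Return a list of (id, problem) pairs for suspicious chunks."""
    problems = []
    for c in chunks:
        # a chunk far over the target size usually means a missed split
        if len(c["text"].split()) > 2 * max_tokens:
            problems.append((c["id"], "oversized"))
        # ids should follow the page#seq convention from chunk.py
        if "#" not in c["id"]:
            problems.append((c["id"], "bad id"))
    return problems
```

Run it over `data/chunks.json` after every chunker change and fail the build if it returns anything.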
## 2. Build a hybrid index: BM25 + embeddings + optional rerank (Python)
```bash
pip install numpy
```
```python
# index/build.py
import json, os, sys
import numpy as np
import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def build(corpus_json, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    chunks = json.load(open(corpus_json, encoding="utf8"))
    texts = [c["text"] for c in chunks]
    ids = [c["id"] for c in chunks]
    # lexical. BM25 is rebuilt from the tokenized dump at query time.
    tokenized = [t.lower().split() for t in texts]
    bm25 = BM25Okapi(tokenized)
    # semantic
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    np.save(os.path.join(out_dir, "embs.npy"), embs)
    json.dump(chunks, open(os.path.join(out_dir, "chunks.json"), "w", encoding="utf8"), ensure_ascii=False)
    json.dump(ids, open(os.path.join(out_dir, "ids.json"), "w", encoding="utf8"))
    bm25_dump = {
        "ids": ids,
        "tokenized": tokenized
    }
    json.dump(bm25_dump, open(os.path.join(out_dir, "bm25.json"), "w", encoding="utf8"))
    faiss.write_index(index, os.path.join(out_dir, "faiss.index"))

if __name__ == "__main__":
    build(sys.argv[1], sys.argv[2])
```
Retriever with hybrid scoring and a simple rerank. You can replace the rerank with your favorite cross-encoder later.
```python
# index/retrieve.py
import json
import numpy as np
import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

class HybridRetriever:
    def __init__(self, idx_dir):
        self.chunks = json.load(open(f"{idx_dir}/chunks.json", encoding="utf8"))
        ids = json.load(open(f"{idx_dir}/ids.json"))
        bm = json.load(open(f"{idx_dir}/bm25.json", encoding="utf8"))
        self.ids = ids
        self.bm25 = BM25Okapi(bm["tokenized"])
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        self.index = faiss.read_index(f"{idx_dir}/faiss.index")
        self.embs = np.load(f"{idx_dir}/embs.npy")

    def retrieve(self, q, topk=12):
        toks = q.lower().split()
        bm_scores = self.bm25.get_scores(toks)
        bm_top = np.argsort(bm_scores)[::-1][:topk * 4]
        qv = self.model.encode([q], normalize_embeddings=True)
        sims, idxs = self.index.search(qv.astype(np.float32), topk * 4)
        sem_top = idxs[0]
        # intersect first; widen to the union if too few candidates survive
        cand = list(set(bm_top).intersection(set(sem_top)))
        if len(cand) < topk:
            cand = list(set(list(bm_top) + list(sem_top)))
        # simple rerank by cosine against the query vector
        cand = np.array(cand)
        scores = self.embs[cand] @ qv[0]
        order = np.argsort(scores)[::-1][:topk]
        picks = cand[order]
        out = []
        for i, ix in enumerate(picks):
            c = self.chunks[ix]
            out.append({
                "id": c["id"], "text": c["text"],
                "score": float(scores[order][i])
            })
        return out
```
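If you later swap in a cross encoder, the rerank step changes in one place. A sketch with the scorer injected as a callable, so the wiring stays model agnostic (`rerank` is an illustrative name; for a real cross encoder you might pass `lambda q, t: ce.predict([(q, t)])[0]` where `ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` from sentence-transformers):

```python
def rerank(query, candidates, score_fn, topk=8):
    """Re-score candidates with any pairwise scorer and keep the best topk.

    candidates: list of {"id", "text", "score"} as returned by
    HybridRetriever.retrieve. score_fn(query, text) -> float.
    """
    scored = [(c, score_fn(query, c["text"])) for c in candidates]
    scored.sort(key=lambda x: -x[1])  # highest relevance first
    return [dict(c, score=float(s)) for c, s in scored[:topk]]
```

Keeping the scorer injectable also lets you unit test the rerank with a cheap lexical-overlap stub.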
## 3. Guard the answer and write a trace (Python)
You can use any LLM. The guard is model agnostic.
```bash
pip install openai  # or any client you prefer
```
```python
# pipeline/answer_py.py
import json, os, time
from typing import Dict, List

from openai import OpenAI

from index.retrieve import HybridRetriever

OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # example. replace as needed
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

def call_llm(prompt: str) -> str:
    client = OpenAI(api_key=OPENAI_API_KEY)
    out = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return out.choices[0].message.content.strip()

def build_prompt(question: str, chunks: List[Dict]) -> str:
    ctx = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Use only the evidence. If not provable, reply exactly: not in context.\n"
        "Answer format:\n"
        "- claim\n- citations: [id,...]\n\n"
        f"Question: {question}\n\nEvidence:\n{ctx}\n"
    )

def answer(question: str, retriever: HybridRetriever, topk=8) -> Dict:
    chunks = retriever.retrieve(question, topk=topk)
    prompt = build_prompt(question, chunks)
    txt = call_llm(prompt)
    ok = "not in context" in txt.lower() or "citations:" in txt.lower()
    record = {
        "ts": int(time.time()),
        "q": question,
        "chunks": [{"id": c["id"], "score": c["score"]} for c in chunks],
        "answer": txt,
        "ok": ok
    }
    os.makedirs("runs", exist_ok=True)
    with open("runs/trace.jsonl", "a", encoding="utf8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

if __name__ == "__main__":
    import sys
    retriever = HybridRetriever(sys.argv[1])
    print(answer(sys.argv[2], retriever))
```
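The trace file makes triage scriptable: refusal rate and weak retrievals fall out of a few lines. A minimal reader sketch (`summarize_trace` is an illustrative helper; the 0.3 weak-score threshold is an arbitrary starting point to tune, not a Problem Map constant):

```python
import json

def summarize_trace(lines):
    """lines: iterable of JSON strings, one per record, as in runs/trace.jsonl."""
    recs = [json.loads(l) for l in lines]
    n = len(recs)
    refusals = sum("not in context" in r["answer"].lower() for r in recs)
    # queries whose best retrieval score is low are retrieval failures, not generation failures
    weak = [r["q"] for r in recs
            if r["chunks"] and max(c["score"] for c in r["chunks"]) < 0.3]
    return {
        "n": n,
        "refusal_rate": refusals / n if n else 0.0,
        "weak_retrieval": weak,
    }
```

Call it with `summarize_trace(open("runs/trace.jsonl", encoding="utf8"))` and inspect `weak_retrieval` first; those queries map to the retrieval-side items in the Problem Map.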
Run end to end.
```bash
python tools/extract_pdf.py your.pdf data/pages.json
python tools/chunk.py data/pages.json data/chunks.json
python index/build.py data/chunks.json index_out
OPENAI_API_KEY=sk-... python pipeline/answer_py.py index_out "What is the author’s definition of X and on which page?"
```
## 4. Node.js variant
Install minimal tools.
```bash
npm install node-fetch @xenova/transformers faiss-node bm25
```
A simple retriever. This Node sketch is embedding-only; add a BM25 stage to fully mirror the Python hybrid version. @xenova/transformers runs CPU embeddings in process. For large corpora, precompute and store embeddings offline.
```js
// index/retrieve.js
import { pipeline } from "@xenova/transformers";
import faiss from "faiss-node";
import fs from "node:fs";

export class HybridRetriever {
  constructor(dir) {
    this.chunks = JSON.parse(fs.readFileSync(`${dir}/chunks.json`, "utf8"));
    this.ids = JSON.parse(fs.readFileSync(`${dir}/ids.json`, "utf8"));
    this.embs = Float32Array.from(
      Object.values(JSON.parse(fs.readFileSync(`${dir}/embs.json`, "utf8")))
    );
    this.d = this.embs.length / this.ids.length;
    this.index = new faiss.IndexFlatIP(this.d);
    this.index.add(Array.from(this.embs)); // faiss-node expects a plain number array
  }

  async init() {
    this.embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  }

  async retrieve(q, topk = 12) {
    const out = await this.embed(q, { pooling: "mean", normalize: true });
    const qv = Array.from(out.data);
    const { distances, labels } = this.index.search(qv, topk * 2);
    const picks = labels.slice(0, topk);
    return picks.map((ix, i) => ({
      id: this.chunks[ix].id,
      text: this.chunks[ix].text,
      score: distances[i]
    }));
  }
}
```
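Note that the Node loader reads `embs.json` while the Python builder in section 2 writes `embs.npy`. A one-off converter sketch in Python bridges the two (`npy_to_json` is an illustrative helper, not part of the repo):

```python
import json
import numpy as np

def npy_to_json(npy_path, json_path):
    """Flatten embs.npy (rows x dim) into the flat JSON list the Node loader expects.

    The Node side recovers the dimension as embs.length / ids.length,
    so the list must be row-major with no nesting.
    """
    embs = np.load(npy_path).astype("float32")
    with open(json_path, "w") as f:
        json.dump(embs.reshape(-1).tolist(), f)
```

Run it once after `index/build.py`, e.g. `npy_to_json("index_out/embs.npy", "index_out/embs.json")`.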
Guarded answer with your LLM client.
```js
// pipeline/answer_node.js
import fs from "node:fs";
import fetch from "node-fetch";
import { HybridRetriever } from "../index/retrieve.js";

async function callLLM(prompt) {
  const r = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify({
      model: process.env.OPENAI_MODEL || "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      temperature: 0
    })
  });
  const j = await r.json();
  return j.choices[0].message.content.trim();
}

function buildPrompt(question, chunks) {
  const ctx = chunks.map(c => `[${c.id}] ${c.text}`).join("\n\n");
  return `Use only the evidence. If not provable, reply exactly: not in context.
Answer format:
- claim
- citations: [id,...]

Question: ${question}

Evidence:
${ctx}
`;
}

export async function answer(indexDir, question) {
  const retr = new HybridRetriever(indexDir);
  await retr.init();
  const chunks = await retr.retrieve(question, 8);
  const prompt = buildPrompt(question, chunks);
  const txt = await callLLM(prompt);
  // same guard heuristic as the Python pipeline
  const ok = txt.toLowerCase().includes("not in context") || txt.toLowerCase().includes("citations:");
  const rec = {
    ts: Date.now(),
    q: question,
    chunks: chunks.map(c => ({ id: c.id, score: c.score })),
    answer: txt,
    ok
  };
  fs.mkdirSync("runs", { recursive: true });
  fs.appendFileSync("runs/trace.jsonl", JSON.stringify(rec) + "\n");
  return rec;
}

// run: node pipeline/answer_node.js index_out "your question"
if (import.meta.url === `file://${process.argv[1]}`) {
  answer(process.argv[2], process.argv.slice(3).join(" ")).then(x => console.log(x));
}
```
## 5. Evaluation harness
Create ten questions with ground truth spans. Keep the spans inside your chunks to avoid ambiguity.
`eval/qaset.json`:

```json
[
  {
    "qid": "q1",
    "q": "State the definition of X and cite the page.",
    "gold_ids": ["p12#2", "p12#3"],
    "gold_answer_contains": ["X is ..."]
  }
]
```
Scorer.
```python
# eval/score.py
import json, re, sys

def norm(s):
    return re.sub(r'\W+', ' ', s.lower()).strip()

def score(run_file, qaset):
    preds = [json.loads(l) for l in open(run_file, encoding="utf8")]
    gold = {q["qid"]: q for q in json.load(open(qaset, encoding="utf8"))}
    ok, refuse, cite_hit, n = 0, 0, 0, 0
    for p in preds:
        n += 1
        if "not in context" in p["answer"].lower():
            refuse += 1
        # citation overlap
        qid = p.get("qid", f"q{n}")
        g = gold.get(qid)
        if g:
            cites = {c["id"] for c in p["chunks"]}
            if set(g["gold_ids"]) & cites:
                cite_hit += 1
            if all(norm(x) in norm(p["answer"]) for x in g["gold_answer_contains"]):
                ok += 1
    return {"n": n, "exact_like": ok, "refusal": refuse, "cite_hit": cite_hit}

if __name__ == "__main__":
    print(score(sys.argv[1], sys.argv[2]))
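score.py matches predictions to gold by `qid`, which the `answer()` pipeline does not write into its trace records. A small runner sketch that injects it; `run_eval` is an illustrative helper, and `answer_fn` stands for any callable with the same shape as `answer()` from section 3:

```python
import json

def run_eval(qaset, answer_fn, out_path="runs/eval_trace.jsonl"):
    """qaset: list of {"qid", "q", ...}; answer_fn(q) -> record dict.

    Writes one trace line per question with the qid attached,
    so eval/score.py can join predictions to gold.
    """
    with open(out_path, "w", encoding="utf8") as f:
        for item in qaset:
            rec = answer_fn(item["q"])
            rec["qid"] = item["qid"]  # score.py keys on qid
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Then score with `python eval/score.py runs/eval_trace.jsonl eval/qaset.json`.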
Run a baseline then the guarded pipeline. Compare exact_like, refusal, and cite_hit.
Expect the guarded version to increase refusal when information is missing. That is good. It means fewer fabricated answers.
## 6. Recipes for common failures and how to fix them
These map directly to Problem Map items.
### No.1 Hallucination and Chunk Drift

Symptoms
- Answers include facts that are near the topic but not in your chunks.

Fix
- Reduce chunk size to keep entity and constraint within the same chunk.
- Use the intersection of BM25 and embeddings, then rerank by cosine.
- Keep a strict answer template and allow refusal.
### No.2 Query Parsing and Intent Split

Symptoms
- Multi part questions pull only one sub topic.

Fix
- Split questions on explicit markers. Run retrieval per subquestion. Merge results.
- Rerank with a feature that rewards coverage across sub parts.
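The split-and-merge fix can be sketched as follows. The `and`-based splitter is deliberately crude; swap in your own marker list, and note that `split_subquestions` and `retrieve_multi` are illustrative names:

```python
import re

def split_subquestions(q):
    """Split on explicit multi-part markers; fall back to the whole question."""
    parts = re.split(r"\band\b|;|\?(?=.)", q)
    parts = [p.strip(" ?") for p in parts if p.strip(" ?")]
    return parts or [q]

def retrieve_multi(retriever, q, topk=8):
    """Retrieve per subquestion, then merge with id-level deduplication."""
    seen, merged = set(), []
    for sub in split_subquestions(q):
        for c in retriever.retrieve(sub, topk=topk):
            if c["id"] not in seen:
                seen.add(c["id"])
                merged.append(c)
    return merged
```

`retrieve_multi` drops straight in wherever `retriever.retrieve` was called before.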
### No.3 Index Schema Drift

Symptoms
- Chunks produced with a different preprocessor than the one used at query time.

Fix
- Store the chunker version and tokenization rule with the index. Refuse to answer on a version mismatch.
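A minimal version-stamp sketch; `write_meta` and `check_meta` are illustrative names, and the meta file path is an assumption about your index layout:

```python
import json

def write_meta(meta_path, chunker_version, tokenizer_rule):
    """Record, at build time, exactly how the index was produced."""
    with open(meta_path, "w", encoding="utf8") as f:
        json.dump({"chunker": chunker_version, "tokenizer": tokenizer_rule}, f)

def check_meta(meta_path, chunker_version, tokenizer_rule):
    """At query time, raise instead of silently answering from a drifted index."""
    meta = json.load(open(meta_path, encoding="utf8"))
    expected = {"chunker": chunker_version, "tokenizer": tokenizer_rule}
    if meta != expected:
        raise RuntimeError(f"index schema drift: built with {meta}, query side expects {expected}")
```

Call `write_meta` at the end of `index/build.py` and `check_meta` in the retriever constructor.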
### No.4 Over Retrieval and Tail Noise

Symptoms
- Too many low score chunks enter the prompt and drown the relevant ones.

Fix
- Hard cut after the score knee. Never include tail beyond top 20 before rerank and top 8 after rerank.
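One way to find the score knee, sketched; `knee_cut` is an illustrative helper and the `drop_ratio` of 0.5 is a starting point to tune per corpus:

```python
def knee_cut(scores, max_keep=20, drop_ratio=0.5):
    """Return how many candidates to keep.

    scores: sorted descending. Keep candidates until the first one
    falls below drop_ratio of the top score, capped at max_keep.
    """
    keep = 1
    for i in range(1, min(len(scores), max_keep)):
        if scores[i] < scores[0] * drop_ratio:
            break  # the knee: scores collapse relative to the best hit
        keep = i + 1
    return keep
```

Apply it to the rerank scores before building the prompt, e.g. `chunks = chunks[:knee_cut([c["score"] for c in chunks])]`.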
### No.11 Symbolic Collapse

Symptoms
- Pipeline marks success while storage or catalog updates failed. Users later see missing answers or duplicates.

Fix
- Make commit of offsets or cursors contingent on durable writes.
- Example in Python:

```python
def flush_and_commit(batch):
    try:
        write(batch)
        update_catalog()
        return next_offset()  # expose N+1 only after both steps succeed
    except Exception:
        return None  # block commit so batch replays
```
### No.14 Bootstrap Ordering

Symptoms
- Components start in the wrong order and initial calls see empty resources.

Fix
- Warm the retriever and embedding model before serving the first query.
- Cache the model and index in memory with a readiness flag.

```python
class Service:
    def __init__(self):
        self.ready = False

    def warm(self):
        self.retr = HybridRetriever("index_out")
        _ = self.retr.retrieve("warmup", 1)
        self.ready = True
```
## 7. Operational guidance
- Store `runs/trace.jsonl` in your logs. Add query id and user id if appropriate.
- Sample one in ten queries for manual inspection.
- Track these metrics:
  - refusal rate
  - citation overlap rate
  - first token latency
  - rerank time
  - chunk size distribution
- Capacity:
  - FAISS flat inner product on CPU handles tens of thousands of chunks easily.
  - For millions of chunks, move to IVF or HNSW and precompute embeddings offline.
## 8. Minimal quick start summary
```bash
# one time
pip install pypdf rank-bm25 sentence-transformers faiss-cpu rapidfuzz openai
python tools/extract_pdf.py your.pdf data/pages.json
python tools/chunk.py data/pages.json data/chunks.json
python index/build.py data/chunks.json index_out

# answering
OPENAI_API_KEY=sk-... python pipeline/answer_py.py index_out "Your question here"

# traces at runs/trace.jsonl
```
## 9. What to do if results are still unstable
Open an issue with three things. This makes triage fast and maps to the Problem Map.
- The query and your expected answer.
- The top 10 retrieved chunk ids and scores.
- The generated answer.
We will tell you which Problem Map item it matches and give exact steps that apply to your repo.
## 10. Roadmap for practice-oriented docs
We will add
- A small corpus with prebuilt index for quick experiments.
- Node scripts that mirror the Python chunker and builder.
- Optional cross encoder rerankers that still run on CPU.
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
### 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.