# Retrieval Traceability

🧭 **Quick Return to Map**

You are in a sub-page of **Retrieval**. To reorient, go back here:

- **Retrieval** — information access and knowledge lookup
- **WFGY Global Fix Map** — the main Emergency Room, with 300+ structured fixes
- **WFGY Problem Map 1.0** — 16 reproducible failure modes

Think of this page as a desk within a ward. If you need the full triage and all prescriptions, return to the Emergency Room lobby.
A citation and audit schema that makes every answer explainable. Install it when answers look correct but citations are missing, misaligned, or not reproducible. The schema works with any store or framework.
**Use this page with**

- Contracts for payload shape → Data Contracts
- ΔS probes for semantic fit → deltaS_probes.md
- Chunk window parity → chunk_alignment.md
- Ordering control → rerankers.md
- Store fences → store_agnostic_guardrails.md
## Acceptance targets

- ΔS(question, retrieved) ≤ 0.45
- Coverage ≥ 0.70 to the intended section
- λ convergent across 3 paraphrases and 2 seeds
- Citation match rate ≥ 0.95 on the gold set
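These targets can be enforced mechanically at the end of a run. A minimal sketch, assuming the run produces a metrics dict with these illustrative field names:

```python
def acceptance_gate(metrics):
    """Return the list of failed acceptance targets; an empty list means the run passes."""
    failures = []
    if metrics["deltaS"] > 0.45:                      # ΔS(question, retrieved) ≤ 0.45
        failures.append("deltaS")
    if metrics["coverage"] < 0.70:                    # coverage to the intended section
        failures.append("coverage")
    if any(s != "convergent" for s in metrics["lambda_states"]):
        failures.append("lambda")                     # λ must converge on every variant
    if metrics["citation_match_rate"] < 0.95:         # gold-set citation match rate
        failures.append("citation_match_rate")
    return failures
```

Wire this into CI so a regression on any one target blocks the release, not just the aggregate score.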
## Citation payload schema

Every retrieved unit must carry a portable citation payload. The payload is used in prompts, logs, and evals.

### Required fields
| Field | Type | Meaning |
|---|---|---|
| `doc_id` | string | Stable document id across versions |
| `section_id` | string | Human-legible section key or path |
| `snippet_id` | string | Unique id for this chunk or span |
| `source_url` | string | Canonical source link or URI |
| `offsets` | object | Character or token start and end |
| `tokens` | integer | Token count of the snippet |
| `index_hash` | string | Write-time hash of the index |
| `embed_model` | string | Embedding model id and pooling |
| `analyzer` | string | Analyzer and casing policy |
| `rev` | string | Document revision or commit hash |
### Optional but recommended

| Field | Type | Meaning |
|---|---|---|
| `window` | object | Pre- and post-context sizes |
| `page` | integer | Page number when applicable |
| `score_raw` | number | Raw similarity from the store |
| `score_norm` | number | Store-independent score in [0, 1] |
| `rerank_score` | number | Cross-encoder score if used |
| `k_pos` | integer | Rank position before rerank |
| `k_final` | integer | Rank position after rerank |
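The `score_norm` field has to be produced somewhere; one simple option is per-query min-max normalization of the store's raw scores. A sketch, not the only valid calibration:

```python
def normalize_scores(raw_scores):
    """Min-max normalize raw store scores into [0, 1] so runs are
    comparable across stores. Degenerate case: an all-tie list maps to 1.0."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [1.0 for _ in raw_scores]
    return [(s - lo) / (hi - lo) for s in raw_scores]
```

Whatever method you choose, record it alongside `score_norm` so two runs are never compared across different normalizations.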
### JSON shape

```json
{
  "doc_id": "handbook_2024",
  "section_id": "policies/security/keys",
  "snippet_id": "hb24-sec-keys-0007",
  "source_url": "https://example.org/handbook#security-keys",
  "offsets": {"start": 2310, "end": 2688, "unit": "char"},
  "tokens": 188,
  "index_hash": "faiss:2f7d...9a",
  "embed_model": "bge-large-en-v1.5:mean",
  "analyzer": "lowercase+ascii_fold",
  "rev": "3b91f2a",
  "window": {"pre": 120, "post": 120},
  "page": 14,
  "score_raw": 0.8421,
  "score_norm": 0.78,
  "rerank_score": 0.67,
  "k_pos": 18,
  "k_final": 3
}
```
## Validation rules

- **Cite first, then explain.** The model must emit citations before or together with the answer, never only after it.
- **Offsets are monotonic.** `start < end`. Units are consistent within a run; character and token units cannot be mixed in the same field.
- **No cross-section reuse by default.** One answer segment can only use citations from the same `section_id`, unless a whitelist is provided.
- **Analyzer parity.** `analyzer` must match the read path. If the analyzer differs at read time, the run fails fast.
- **Deterministic tiebreak.** When scores tie, final order uses `(score_norm desc, section_id asc, snippet_id asc)`.
- **Index identity.** `index_hash` in citations must match the live index hash logged by the pipeline.
- **Revision lock.** `rev` must match the document revision that produced the chunk.
- **Trace completeness.** For each citation, the log must include `score_raw` or `score_norm`, `k_pos`, and, when present, `k_final`.
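The deterministic-tiebreak rule maps directly onto a sort key. A minimal Python sketch:

```python
def order_citations(items):
    """Deterministic final ordering: score_norm descending, then
    section_id and snippet_id ascending as tiebreaks."""
    return sorted(items, key=lambda x: (-x["score_norm"],
                                        x["section_id"],
                                        x["snippet_id"]))
```

Because the tiebreaks are total over unique snippet ids, two runs with identical scores always emit identical citation order.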
## Prompt contract

Embed the citation schema directly in the prompt and require validation behavior:

```txt
Contract:
- You must cite before you explain.
- Each citation item = {snippet_id, section_id, source_url, offsets, tokens}.
- All citations in one segment must share the same section_id unless I set allow_cross_section=true.
- If any field is missing or inconsistent, stop and return {error: "<field>", fix: "<page to open>"}.

Return JSON:
{
  "citations": [ ... ],
  "answer": "...",
  "ΔS": 0.xx,
  "λ": "<state>"
}
```
Pair this with the post-step validator below.
## Minimal validators

Python pseudocode:

```python
def validate_citations(items, allow_cross=False):
    if not items:
        return "empty_citations"
    sect = items[0]["section_id"]
    for x in items:
        for f in ["snippet_id", "section_id", "source_url", "offsets", "tokens",
                  "index_hash", "embed_model", "analyzer", "rev"]:
            if f not in x:
                return f"missing_{f}"
        if x["offsets"]["start"] >= x["offsets"]["end"]:
            return "bad_offsets"
        if not allow_cross and x["section_id"] != sect:
            return "cross_section_reuse"
        if "score_raw" not in x and "score_norm" not in x:
            return "missing_score"
    return "ok"
```
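The index-identity and analyzer-parity rules are easiest to enforce in a separate fail-fast check run alongside `validate_citations`. A sketch in the same return-code style; the function name is an assumption:

```python
def validate_run_identity(item, live_index_hash, read_analyzer):
    """Fail fast when a citation's write-time identity does not match
    the live read path (index identity and analyzer parity rules)."""
    if item["index_hash"] != live_index_hash:
        return "mismatch_index_hash"
    if item["analyzer"] != read_analyzer:
        return "analyzer_mismatch"
    return "ok"
```

Run this before any scoring or reranking, so a stale index aborts the request instead of producing a plausible but untraceable answer.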
### Return codes

| Code | Meaning | Next step |
|---|---|---|
| `empty_citations` | Model skipped the citation block | Enforce cite-first in the prompt contract |
| `missing_score` | Store did not carry score metadata | Add score projection in the adapter |
| `bad_offsets` | Span math is broken | Rebuild chunk offsets and verify the unit |
| `cross_section_reuse` | Mixed sources in one segment | Forbid reuse or add an explicit whitelist |
| `mismatch_index_hash` | Read and write paths do not match | Rebuild the index or fix boot ordering |
| `analyzer_mismatch` | Analyzer differs at read time | Align analyzer and normalization |
## Log format

Log one line per question and per segment. Keep it grep-friendly.

```txt
ts=2025-08-28 qid=Q001 seg=1
deltaS=0.37 lambda=convergent
k=10 store=faiss index_hash=2f7d...9a embed=bge-large-en-v1.5
citations=[hb24-sec-keys-0007,hb24-sec-keys-0008]
scores=[0.8421,0.8034] kpos=[18,7] kfinal=[3,1]
section_id=policies/security/keys rev=3b91f2a
```
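A small formatter keeps these lines consistent across emitters. A sketch covering a subset of the keys above; the record field names are assumptions:

```python
def format_trace_line(rec):
    """Render one grep-friendly key=value trace line per segment."""
    cites = ",".join(rec["citations"])
    return (f"ts={rec['ts']} qid={rec['qid']} seg={rec['seg']} "
            f"deltaS={rec['deltaS']} lambda={rec['lambda']} "
            f"citations=[{cites}]")
```

Keeping every key in `key=value` form means `grep 'qid=Q001'` or `grep 'lambda=divergent'` works with no log parser at all.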
You can mirror this to a table. Example DDL follows.
## SQL table

```sql
CREATE TABLE rag_traces (
  ts TIMESTAMP NOT NULL,
  qid TEXT NOT NULL,
  seg INTEGER NOT NULL,
  deltaS REAL NOT NULL,
  lambda_state TEXT NOT NULL,
  store TEXT,
  index_hash TEXT,
  embed_model TEXT,
  analyzer TEXT,
  section_id TEXT,
  rev TEXT,
  citations JSONB,
  scores JSONB,
  kpos JSONB,
  kfinal JSONB
);
```
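To smoke-test the mirror locally you can use SQLite, serializing the JSONB columns to TEXT since SQLite has no JSONB type. A sketch with an abbreviated column set:

```python
import json
import sqlite3

# Abbreviated local mirror of the Postgres DDL; JSON columns become TEXT.
DDL = """CREATE TABLE IF NOT EXISTS rag_traces (
    ts TEXT NOT NULL, qid TEXT NOT NULL, seg INTEGER NOT NULL,
    deltaS REAL NOT NULL, lambda_state TEXT NOT NULL,
    citations TEXT, scores TEXT)"""

def mirror_trace(conn, row):
    """Insert one trace row, serializing list-valued fields to JSON text."""
    conn.execute(
        "INSERT INTO rag_traces (ts, qid, seg, deltaS, lambda_state, citations, scores) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (row["ts"], row["qid"], row["seg"], row["deltaS"], row["lambda_state"],
         json.dumps(row["citations"]), json.dumps(row["scores"])),
    )
```

In production the same row shape goes to the Postgres table above, with the JSON fields bound as real JSONB values.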
## Auditable answer format

Force the model to output a machine-checkable block.

```json
{
  "citations": [
    {
      "snippet_id": "hb24-sec-keys-0007",
      "section_id": "policies/security/keys",
      "source_url": "https://example.org/handbook#security-keys",
      "offsets": {"start": 2310, "end": 2688, "unit": "char"},
      "tokens": 188
    }
  ],
  "answer": "Hardware keys are required for all admin roles. The policy defines rotation and recovery.",
  "ΔS": 0.37,
  "λ": "convergent"
}
```
## Typical failure modes and exact fixes

- **Citations look right but the text is from another section.** Install the cross-section fence and recheck analyzer parity. Open: Data Contracts · chunk_alignment.md
- **Citation block is present but fields are missing.** Add the adapter projection and enforce required fields at the chain boundary. Open: store_agnostic_guardrails.md
- **Hybrid retrieval changes the final order every run.** Add reranking with a fixed model and seed. Log `k_pos` and `k_final`. Open: rerankers.md
- **High similarity but wrong meaning.** Verify analyzer, pooling, and metric. Rebuild if mismatched. Open: Embedding ≠ Semantic
## Drop-in adapters

### LangChain LCEL

```python
def lc_project(doc):
    """Project a LangChain Document's metadata into the citation payload."""
    return {
        "doc_id": doc.metadata["doc_id"],
        "section_id": doc.metadata["section_id"],
        "snippet_id": doc.metadata["snippet_id"],
        "source_url": doc.metadata.get("source_url", ""),
        "offsets": {"start": doc.metadata["start"], "end": doc.metadata["end"], "unit": "char"},
        "tokens": doc.metadata["tokens"],
        "index_hash": doc.metadata["index_hash"],
        "embed_model": doc.metadata["embed_model"],
        "analyzer": doc.metadata["analyzer"],
        "rev": doc.metadata["rev"],
        "score_raw": doc.metadata.get("score", None),
        "k_pos": doc.metadata.get("k", None),
    }
```
### LlamaIndex

```python
# idx_hash, embed_id, analyzer_id, doc_rev, and score come from the
# surrounding pipeline context; node is a retrieved NodeWithScore's node.
proj = {
    "doc_id": node.ref_doc_id,
    "section_id": node.metadata.get("section_id", ""),
    "snippet_id": node.node_id,
    "source_url": node.metadata.get("source_url", ""),
    "offsets": {"start": node.start_char_idx, "end": node.end_char_idx, "unit": "char"},
    "tokens": node.metadata.get("tokens", 0),
    "index_hash": idx_hash,
    "embed_model": embed_id,
    "analyzer": analyzer_id,
    "rev": doc_rev,
    "score_raw": score,
}
```
### Semantic Kernel

```csharp
record Citation(string doc_id, string section_id, string snippet_id,
                string source_url, Offsets offsets, int tokens,
                string index_hash, string embed_model, string analyzer, string rev,
                double? score_raw = null, double? score_norm = null,
                int? k_pos = null, int? k_final = null);
```
## Evaluation recipe

- Prepare a gold set with 3 to 5 questions per section.
- Run 3 paraphrases and 2 seeds per question.
- Check citation match rate and coverage.
- If the match rate falls below 0.95, inspect validator return codes first.
- Only then tune retrieval and rerank knobs.

More detail → retrieval_eval_recipes.md
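Citation match rate can be computed as the fraction of gold questions whose predicted citations hit at least one gold snippet id. A minimal sketch; your gold-set format may differ:

```python
def citation_match_rate(predicted, gold):
    """predicted and gold map question id -> list of snippet ids.
    A question counts as a hit if any predicted id is in its gold set."""
    hits = sum(1 for qid, want in gold.items()
               if set(predicted.get(qid, [])) & set(want))
    return hits / len(gold)
```

Compare the result against the 0.95 acceptance target; a stricter variant could require all gold snippets per question, which is a policy choice, not part of the schema.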
## Copy-paste test

Use this to smoke-test your chain:

```txt
You have TXTOS and the WFGY Problem Map loaded.

Task: validate the citations using the schema defined in my contract.
Input: {the model JSON}
If the payload is valid, reply "OK" and give ΔS and λ.
If invalid, return {error: "<code>", field: "<name>", tip: "<page to open>"}.
Use pages: retrieval-traceability, data-contracts, chunk-alignment, rerankers.
```
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.