🧭 Not sure where to start? Open the WFGY Engine Compass
WFGY System Map
(One place to see everything; links open the relevant section.)
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF-based tension engine blueprint |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel and math engine for RAG and agents |
| ⚙️ Engine | WFGY 3.0 | TXT-based Singularity tension engine (131 S-class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16-problem RAG failure checklist and fix map |
| 🗺️ Map | Problem Map 2.0 | RAG-focused recovery pipeline |
| 🗺️ Map | Problem Map 3.0 | Global Debug Card - image as a debug protocol layer - 🔴 YOU ARE HERE 🔴 |
| 🗺️ Map | Semantic Clinic | Symptom → family → exact fix |
| 🧓 Map | Grandma’s Clinic | Plain-language stories, mapped to PM 1.0 |
| 🏡 Onboarding | Starter Village | Guided tour for newcomers |
| 🧰 App | TXT OS | .txt semantic OS - 60-second boot |
| 🧰 App | Blah Blah Blah | Abstract/paradox Q&A (built on TXT OS) |
| 🧰 App | Blur Blur Blur | Text-to-image with semantic control |
| 🧰 App | Blow Blow Blow | Reasoning game engine & memory demo |
| 🧪 Research | Semantic Blueprint | Modular layer structures (future) |
| 🧪 Research | Benchmarks | Comparisons & how to reproduce |
| 🧪 Research | Value Manifest | Why this engine creates $-scale value |
🚀 WFGY 3.0 · RAG 16 Problem Map · Global Debug Card
Image as a Structured Debug Prompt for RAG / Agent Pipelines
🌐 Recognition & ecosystem integration
As of 2026-03, the WFGY RAG 16 Problem Map line has been adopted or referenced by
20+ frameworks, academic labs, and curated lists in the RAG and agent ecosystem.
Most external references use the WFGY ProblemMap as a diagnostic layer for RAG / agent pipelines,
not the full WFGY product stack.
A smaller but growing set also uses WFGY 3.0 · Singularity Demo as a long-horizon TXT stress test.
Some representative integrations:
| Project | Segment | How it uses WFGY ProblemMap | Proof (PR / doc) |
|---|---|---|---|
| RAGFlow | Mainstream RAG engine | Adds a RAG failure-modes checklist guide to its official docs, adapted from the WFGY 16-problem failure map for step-by-step RAG pipeline diagnostics. | PR #13204 |
| LlamaIndex | Mainstream RAG infra | Integrates the WFGY 16-problem RAG failure checklist into its official RAG troubleshooting docs as a structured failure-mode reference. | PR #20760 |
| ToolUniverse (Harvard MIMS Lab) | Academic lab / tools | Provides a WFGY_triage_llm_rag_failure tool that wraps the 16-mode map for incident triage. | PR #75 |
| Rankify (Univ. of Innsbruck) | Academic lab / system | Uses the 16 failure patterns in RAG and re-ranking troubleshooting docs. | PR #76 |
| Multimodal RAG Survey (QCRI LLM Lab) | Academic lab / survey | Cites WFGY as a practical diagnostic resource for multimodal RAG. | PR #4 |
For the complete 20+ project list (frameworks, benchmarks, curated lists), see the 👉 WFGY Recognition Map
If your project uses the WFGY ProblemMap and you would like to be listed,
feel free to open an issue or pull request in this repository.
A production-first failure map for RAG / agent pipelines, compressed into one portable image.
The poster defines objects, metrics, ΔS zones, failure types, and mode patterns in a single unified view.
Feed a failing run (Q, E, P, A) plus this card to any LLM.
It returns type → mode(s) → fixes → verification tests.
No external framework required.
How to use (60 seconds)
- Download the full-resolution Debug Card from this repository.
- Upload the image to any strong LLM.
- Paste your failing run context:
  - `Q` = user question
  - `E` = top-k retrieved evidence
  - `P` = final prompt sent to the model
  - `A` = model output
  - plus logs/metrics if available
- Ask the model to diagnose using the 16-Problem Map and propose fixes with one test per fix.
Expected output:
- ΔS zones (or best-effort estimate)
- failure type: `R` / `P` / `S` / `I`
- 1–3 matching modes (1–16)
- structural fixes + verification tests
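The 60-second flow above can be sketched as a small helper that assembles the text to paste alongside the Debug Card image. This is a hypothetical illustration: the prompt wording, field order, and the `build_debug_prompt` name are assumptions made for this example, not part of the card itself.

```python
# Hypothetical helper: assemble the run context you would paste alongside
# the Debug Card image. Field names mirror the (Q, E, P, A) objects.
def build_debug_prompt(q: str, e: str, p: str, a: str, logs: str = "") -> str:
    sections = [
        f"Q (user question):\n{q}",
        f"E (top-k retrieved evidence):\n{e}",
        f"P (final prompt sent to the model):\n{p}",
        f"A (model output):\n{a}",
    ]
    if logs:
        sections.append(f"Logs / metrics:\n{logs}")
    sections.append(
        "Using the 16-Problem Map on the attached card, report: "
        "DeltaS zones (or a best-effort estimate), failure type (R/P/S/I), "
        "1-3 matching modes (1-16), and structural fixes with one "
        "verification test per fix."
    )
    return "\n\n".join(sections)
```

The output of this helper is exactly the "paste your failing run context + ask for a diagnosis" step, collapsed into one reusable string.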
📥 Download the original full-resolution image directly from this repository.
If this helps you, feel free to leave a ⭐ on GitHub.
Colab MVP (Work in Progress)
This section is a placeholder for the first runnable notebook release.
Planned direction:
- run one failing case from `(Q, E, P, A)`
- emit a structured debug packet
- apply one constrained patch
- re-check verification signals
Expected first-wave scope:
- No.1 retrieval wrong or off-topic
- No.5 semantic vs embedding mismatch
- No.8 missing evidence visibility
- No.2 interpretation collapse
Status:
- placeholder only
- notebook structure and examples coming soon
Machine-Readable JSON MVP (Work in Progress)
This section is a placeholder for the machine-readable protocol layer.
Planned spec set:
- `wfgy_problem_catalog_v1.json`
- `wfgy_debug_packet_v1.json`
Intended use:
- make the 16-mode map consumable by tools
- let agent runners ingest one debug result as a stable packet
- support reproducible before/after repair loops
Status:
- placeholder only
- schema, examples, and integration notes coming soon
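Since the schemas are still unreleased, here is a purely speculative sketch of what one `wfgy_debug_packet_v1.json` record might look like. Every field name below is an assumption made for illustration, not the published spec.

```python
import json

# Speculative sketch of a wfgy_debug_packet_v1.json record. The real schema
# is not published yet; all key names here are assumptions.
packet = {
    "schema": "wfgy_debug_packet_v1",
    "case_id": "example-001",
    "failure_type": "R",                 # one of R / P / S / I
    "modes": [1, 5],                     # likely modes from the 16-mode map
    "fixes": [
        {
            "action": "query rewrite",
            "verify": "re-run retrieval and check top-k overlap with Q",
        }
    ],
}
serialized = json.dumps(packet, indent=2)
```

The point of a stable record like this is that two diagnosis runs become directly comparable, which free-form LLM output is not.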
Environment Coverage
- MLflow · Experiment tracking & RAG debugging · Medium Article
- Dask · Distributed execution & task orchestration · Medium Article
- Kedro · Pipeline structuring & ML workflow management · Medium Article
FAQ
Do I need all four objects `(Q, E, P, A)` to use the Global Debug Card?
The best results come from having all four objects:
- `Q` = user question
- `E` = retrieved evidence
- `P` = final prompt sent to the model
- `A` = model output
This gives the card the full diagnostic chain from intent to evidence to prompt to answer.
If one object is missing, the card can still be used for partial triage:
- missing `P` usually means rough classification is still possible
- missing `E` often points to observability or infra-side issues
- missing both `E` and `P` reduces accuracy, but the card can still help identify likely failure families
In short: all four objects are ideal, but partial inputs can still be useful for first-pass diagnosis.
Can the Global Debug Card partially automate RAG debugging?
Yes, partially.
The card is well suited for:
- structured triage
- first-pass failure classification
- matching likely problem modes
- proposing repair directions
- defining one verification test per fix
This means it can help automate the early part of debugging very well.
What it does not guarantee is a one-click, fully automatic repair for every RAG failure.
Some cases still require human review, domain context, or deeper system changes.
The current design goal is not "magic auto-fix."
The goal is to turn messy debugging into a repeatable protocol.
Can tools like OpenClaw use this card to automatically resolve RAG issues?
Yes, for some classes of issues.
A tool like OpenClaw can use the Global Debug Card as a diagnostic layer:
- collect `(Q, E, P, A)`
- run classification
- emit a structured debug packet
- apply one constrained repair action
- re-run a verification check
This is most realistic for problems where the repair loop is narrow and testable.
Best early targets include:
- No.1 retrieval wrong or off-topic
- No.5 semantic vs embedding mismatch
- No.8 missing evidence visibility
- parts of No.2 interpretation collapse
These are good candidates because they often respond to:
- retrieval parameter changes
- query rewrite
- re-ranking
- logging and observability upgrades
- constrained prompt repair
Harder cases, such as long-horizon reasoning errors or deep architectural flaws, usually should not be treated as fully automatic fixes in the first wave.
What can the first Colab MVP actually automate?
The first Colab MVP is designed as a first-pass repair loop, not a full autonomous debugger.
Its planned job is:
- accept one failing case from `(Q, E, P, A)`
- compute or estimate the diagnostic structure
- emit a machine-readable debug packet
- apply one constrained patch
- re-check one or more verification signals
This makes the first public notebook useful for:
- reproducible triage
- small repair experiments
- before/after comparisons
- documenting how one failure was improved
The first Colab MVP is intended to prove the protocol works in practice.
It is not meant to solve every failure mode automatically on day one.
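The planned classify → patch → verify loop can be sketched in a few lines. The function names (`repair_loop`, `classify`, `apply_patch`, `verify`) are stand-ins invented for this example; none of them come from a released WFGY API.

```python
# Minimal sketch of the planned first-pass repair loop, assuming pluggable
# classify / apply_patch / verify callables (names invented for this sketch).
def repair_loop(case, classify, apply_patch, verify):
    diagnosis = classify(case)               # emit a structured debug packet
    patched = apply_patch(case, diagnosis)   # apply one constrained patch
    return diagnosis, verify(patched)        # re-check verification signals

# Toy stand-ins that mimic mode No.8 (missing evidence visibility):
case = {"Q": "q", "E": "", "P": "p", "A": "a"}
diagnosis, passed = repair_loop(
    case,
    classify=lambda c: {"mode": 8 if not c["E"] else None},
    apply_patch=lambda c, d: {**c, "E": "logged evidence"} if d["mode"] == 8 else c,
    verify=lambda c: bool(c["E"]),
)
```

The key design point is the single constrained patch: one repair action, one verification signal, so the before/after delta stays attributable.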
Which problem modes are best suited for automated debugging first?
The best first-wave targets are the modes that are both:
- easy to observe
- narrow enough to repair with a constrained loop
The strongest early candidates are:
- **No.1 retrieval wrong or off-topic**
  Good for top-k sweeps, query rewrite, filter tightening, and re-ranking.
- **No.5 semantic vs embedding mismatch**
  Good for embedding checks, chunking review, normalization review, and retrieval alignment tests.
- **No.8 missing evidence visibility**
  Good for observability upgrades, evidence logging, and packet completeness checks.
- **No.2 interpretation collapse**
  Good for constrained prompt repair, evidence-first answer patterns, and citation-before-claim checks.
These modes are ideal for Wave 0 because they can often be improved without rebuilding the entire stack.
What will the machine-readable JSON MVP be used for?
The JSON MVP is the protocol layer for tools.
It is meant to make the card usable by:
- agent runners
- CI workflows
- internal debugging tools
- replay and comparison pipelines
The planned JSON layer has two main roles:
- **Problem catalog JSON**
  A machine-readable version of the 16-mode map, including mode IDs, signals, and default repair directions.
- **Debug packet JSON**
  A stable result format for one diagnosis run, including the detected type, likely modes, fixes, and verification steps.
The JSON layer does not replace the image workflow.
It makes the same logic consumable by software.
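To make the two roles concrete, here is a speculative sketch of one catalog entry and how a debug packet would point back at it. The schemas are unreleased, so every key name here is an assumption for illustration only.

```python
# Speculative shape of one wfgy_problem_catalog_v1.json entry; all key
# names are assumptions, since the real schema is not published yet.
catalog_entry = {
    "mode_id": 1,
    "name": "retrieval wrong or off-topic",
    "signals": ["answer cites unrelated chunks", "low Q/E overlap"],
    "default_repairs": ["top-k sweep", "query rewrite", "re-ranking"],
}

# A debug packet would reference the catalog by mode_id rather than by prose,
# so tools can join a diagnosis back to default repair directions:
packet_fragment = {
    "modes": [catalog_entry["mode_id"]],
    "fixes": catalog_entry["default_repairs"][:1],
}
```

Keeping the catalog and the packet as separate documents means the 16-mode map can evolve without breaking stored diagnosis results.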
Can I plug this into my own agent runner, CI, or internal tooling?
Yes, that is one of the intended directions.
If your system can provide a minimal case record such as:
`Q`, `E`, `P`, `A`
then it can likely be adapted to this protocol.
Additional metadata can improve results, for example:
- retrieval scores
- model name
- chunk IDs
- timing logs
- prompt template version
- index configuration
The final goal is simple:
your system emits one case, the protocol emits one debug packet.
That makes the card easier to integrate into real workflows than a free-form debugging conversation.
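One possible shape for that minimal case record, as a plain dict your runner could emit. Only the four objects are essential; the `meta` keys are optional and invented here to illustrate the metadata list above.

```python
# One possible shape for a minimal case record. Only Q/E/P/A are required;
# the "meta" keys are optional and purely illustrative.
case_record = {
    "Q": "user question",
    "E": ["chunk 12 text", "chunk 47 text"],
    "P": "final prompt sent to the model",
    "A": "model output",
    "meta": {
        "retrieval_scores": [0.81, 0.63],
        "model": "example-model",
        "chunk_ids": [12, 47],
        "prompt_template_version": "v3",
    },
}

# A runner only needs to guarantee the four required objects are present:
required_present = all(k in case_record for k in ("Q", "E", "P", "A"))
```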
Is this meant to replace human debugging?
No.
The purpose of the Global Debug Card is to improve the first stage of debugging:
- shared vocabulary
- structured triage
- clearer diagnosis
- repeatable verification steps
It is not meant to remove human judgment.
Human review is still important for:
- high-risk changes
- domain-specific correctness
- architecture-level redesign
- policy or safety decisions
- ambiguous multi-cause failures
A good way to think about it is:
this card reduces debugging chaos, but humans still decide the final repair strategy.
Do I need embeddings to use this page?
No, but embeddings make the protocol stronger.
The full diagnostic model uses structured similarity and boundary reasoning.
That works best when a fixed embedding model is available.
However, the card is still useful without embeddings:
- a strong LLM can estimate likely drift patterns from the visible objects
- the image workflow can still guide manual or semi-structured triage
- rough type and mode guesses can still be generated from the failure pattern
So there are two valid usage modes:
- image + reasoning workflow for fast manual triage
- embedding + packet workflow for stronger, machine-runnable diagnosis
What is the difference between the image workflow and the JSON workflow?
They use the same logic, but they serve different users.
Image workflow
- optimized for humans
- easy to use immediately
- works by uploading the card and pasting `(Q, E, P, A)`
- good for fast triage with any strong LLM
JSON workflow
- optimized for tools
- intended for automation and integration
- produces stable, machine-readable outputs
- good for replay, comparison, and agent orchestration
In short:
- the image workflow is the human entry point
- the JSON workflow is the machine entry point
Both are two views of the same protocol.
Why not just paste logs into any strong LLM and ask it to debug?
You can do that, and sometimes it helps.
But without a shared protocol, the output is often inconsistent.
Raw free-form debugging usually has these problems:
- no shared object model
- no fixed failure vocabulary
- no stable mapping from symptoms to modes
- no required verification step
- hard to compare two runs consistently
The Global Debug Card adds structure:
- the same four objects
- the same failure families
- the same mode vocabulary
- the same output expectation
- the same repair-and-test pattern
That makes debugging more reproducible and much easier to turn into automation.
How far can the Global Debug Card be pushed?
The short answer is: much further than a static poster.
At minimum, it is already useful as:
- an image-based triage layer
- a shared RAG debugging vocabulary
- a structured prompt for diagnosis
With the next layers added, it can grow into:
- a Colab-based repair loop
- a machine-readable incident format
- a replayable before/after comparison tool
- an agent-facing debug protocol
- a lightweight regression-check layer for RAG systems
So the long-term direction is not "just a card."
The long-term direction is:
- from image prompt
- to structured diagnosis
- to partial automation
- to reproducible repair loops
- to a reusable protocol for RAG debugging
Can this help reduce hallucinations without changing the whole stack?
Often, yes.
One of the main strengths of the card is that it works as a diagnostic layer, not a full replacement stack.
That means teams can often improve reliability by first identifying:
- whether the problem is retrieval
- whether the prompt is the main source of drift
- whether evidence visibility is missing
- whether the issue is state or infra-related
This allows smaller, more targeted fixes:
- retrieval tuning
- prompt repair
- re-ranking
- observability upgrades
- packet completeness checks
In many real cases, that is much cheaper than rebuilding the whole system.
Can this support replay, A/B comparison, and before/after repair loops?
Yes, and that is one of the strongest next-step directions.
Once a failure case is captured in a structured form, the same case can be replayed under different settings:
- different top-k values
- different retrievers
- different chunking strategies
- different prompt templates
- different repair actions
This makes it possible to compare:
- before vs after
- setup A vs setup B
- manual fix vs automated fix
That is one reason the Colab MVP and JSON MVP matter.
They move the card from "diagnostic poster" toward "replayable debugging protocol."
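A before/after comparison over one replayed case could look like the sketch below, assuming a `run(case, settings)` callable that returns a metrics dict. The metric name `evidence_overlap`, the settings keys, and the toy `run` are all invented for this example.

```python
# Sketch of a before/after comparison over one replayed case, assuming a
# run(case, settings) callable returning a metrics dict. The metric name
# "evidence_overlap" and the settings keys are invented for this example.
def compare(case, run, before_settings, after_settings, metric="evidence_overlap"):
    before = run(case, before_settings)[metric]
    after = run(case, after_settings)[metric]
    return {"before": before, "after": after, "improved": after > before}

# Toy run() where a larger top-k recovers more relevant evidence:
result = compare(
    case={"Q": "q"},
    run=lambda case, s: {"evidence_overlap": 0.4 if s["top_k"] == 3 else 0.7},
    before_settings={"top_k": 3},
    after_settings={"top_k": 10},
)
```

Because the case is fixed and only the settings vary, any metric movement is attributable to the change under test, which is the core of a replayable repair loop.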
What should I submit if I want my failing case to become a reproducible example?
The best submission is a minimal but complete failure packet.
Recommended input:
- `Q` = the original user question
- `E` = retrieved evidence or top-k chunks
- `P` = the final prompt sent to the model
- `A` = the model output
- any relevant logs or metrics
- optional notes about the expected correct behavior
You should remove private or sensitive data before sharing.
A good reproducible example does not need to be large.
It just needs to preserve the failure pattern clearly enough for the protocol to analyze.
That kind of example is ideal for future Colab demos, test cases, and repair-loop comparisons.
