vrr/WFGY

Fork 0

mirror of https://github.com/onestardao/WFGY.git synced 2026-05-01 21:11:11 +00:00

PSBigBig + MiniPS 191f33758b

Update README.md

2026-03-15 20:08:38 +08:00

16 KiB

Raw Blame History

Demo 1 · F1 Grounding Anchor Recovery

Problem Map 3.0 Troubleshooting Atlas

Official flagship demo for grounding-first repair

Quick links:

Open in Colab if you want live reproduction.
The README, replay artifacts, and replay mode are already enough to understand the core teaching pattern.

Replay + live MVP
Replay is readable without any API key

This is the first flagship demo in the official runnable demo pack.

If the AI eval snapshots suggest that better routing may reduce hidden debugging waste, this page goes one step further.

This demo is meant to make the mechanism-level claim visible:

the atlas does not only name failures
it changes the first repair move

It was chosen first for a simple reason:

many systems look wrong in a way people casually call hallucination
but the first real failure is often grounding

This demo is designed to make that difference visible.

It shows that once a case is routed as F1 Grounding & Evidence Integrity, the first repair move changes from vague prompt tweaking to explicit anchor recovery.

Quick start

If you want the shortest path through this demo, use this order:

read Section 1 · What this demo proves
read Section 3 · Why not F5 first
read Section 5 · First repair move
open replay_outputs.json
inspect the notebook only if you want replay execution or optional live reproduction

If you only want the takeaway, this demo teaches one clean lesson:

if the answer lost the right evidence anchor, fix the anchor first

1. What this demo proves

This demo proves four things.

A. A fluent wrong answer is not enough as a diagnosis

A system can sound smooth and still be wrong for a very specific reason:

it attached itself to the wrong evidence
it followed the wrong chunk
it lost the right target-reference link
it answered without stable evidence-anchor integrity

This is different from saying only:

the model hallucinated
the reasoning was bad
the system is low quality

B. Correct routing changes the first repair move

If the case is routed correctly into F1, the first repair move becomes:

re-grounding
chunk-to-target trace
evidence verification
anchor re-check

This is very different from trying to fix style, verbosity, or high-level reasoning first.

C. Replay mode is enough to teach the core pattern

A user should be able to understand this demo without running anything.

The replay artifacts are part of the official design, not an afterthought.

D. Live mode is useful, but the lesson does not depend on it

If the user wants to reproduce the same pattern with a real model call, the live mode is available.

But the core lesson of the demo must remain understandable even without execution.

2. Family route

Primary family

F1 · Grounding & Evidence Integrity

Secondary family

F5 · Observability & Diagnosability Integrity

Why F1 is primary

The main failure is that the answer no longer remains tied to the correct evidence anchor.

The case may also contain diagnosability pressure, especially if the retrieval path is not obvious.
But the first broken invariant is still grounding-first.

Short routing statement

the answer looks plausible
the evidence attachment is wrong
grounding fails before observability becomes the main issue

Best current fit

F1_N01 Retrieval Anchor Drift

In some variants this may also approach:

F1_N02 Semantic Grounding Mismatch

But the flagship teaching version stays centered on retrieval-anchor drift.

3. Why not F5 first

The main tempting neighboring cut is F5.

That temptation is understandable because many real systems make it hard to inspect why the answer went wrong.

But this demo is not mainly about a hidden failure path.

It is mainly about this:

the answer attached itself to the wrong evidence source

That means the first failure is not:

I cannot yet inspect the system clearly enough

It is:

the output lost the correct anchor

Wrong cut

“This is mainly a black-box debugging problem.”

Better cut

“This is a grounding-first failure with possible observability pressure at the edge.”

That distinction matters because the first repair move changes immediately.

4. Baseline failure

The baseline case is intentionally simple and easy to inspect.

Core pattern

A user asks a question that should be answerable from a small evidence set.

The system receives:

a question
several chunks or candidate sources
one truly relevant anchor
one or more misleading but superficially similar chunks

The baseline answer then:

sounds fluent
looks reasonable on the surface
cites or follows the wrong chunk
produces a wrong answer because the anchor path is off

What the baseline should make visible

The user should be able to see:

the question
the available chunks
which chunk was actually relevant
which chunk the baseline answer attached to
how that led to the wrong answer

Important design note

Do not make the baseline too chaotic.

The goal is not to create a bizarre edge case.
The goal is to create a clean teaching case for grounding drift.

5. First repair move

Once the case is routed to F1, the first repair move should be simple and explicit.

Recommended first repair stack

chunk-to-target trace
Check which chunk is being treated as the evidence anchor.
evidence verification
Compare the answer against the actually relevant evidence.
anchor re-check
Force the response pipeline to reconnect to the correct source.
re-grounding pass
Rebuild the answer after the anchor is corrected.

What should not happen first

Do not start with:

style rewriting
confidence rewriting
chain-of-thought expansion
“be more careful” style prompt band-aids

Those may change the surface, but they do not repair the anchor.

First repair principle

if the anchor is wrong, repair the anchor first

That is the teaching core of this demo.

6. Optional WFGY 3.0 escalation

This demo can remain useful without deeper escalation.

But if the user wants to explore a stronger structural repair path, this is where WFGY 3.0 becomes useful.

When escalation makes sense

Escalate beyond the simple first repair move if:

the answer keeps drifting even after obvious re-grounding
the chunk surface is noisy or highly confusable
the retrieval target and semantic target diverge
the system needs deeper structural diagnosis rather than one-shot repair

What WFGY 3.0 can add

A WFGY 3.0 handoff can help explore:

deeper target / proxy separation
stronger target-anchoring language
more explicit retrieval surface design
semantic target stabilization
experimental variants for anchor stress

Correct relationship

The right sequence is:

atlas route
first repair move
deeper WFGY exploration if needed

The atlas remains the router.
WFGY 3.0 is the deeper experimental engine.

7. Replay mode

Replay mode is the default public reading mode.

It requires no API key and no notebook execution.

Replay mode should show

the baseline case
the baseline answer
the family route
the why-primary-not-secondary statement
the broken invariant
the first repair move
the replayed improved answer
a short explanation of what changed

What replay mode proves

Replay mode proves that:

the routing logic is understandable
the repair logic is understandable
the before / after difference is visible
the demo can teach without requiring execution

Why replay mode matters

Most readers will not run the notebook first.

If the replay is weak, the demo is weak.

Replay mode is not a fallback.
It is part of the official design.

Core replay artifacts:

8. Live mode

Live mode is optional.

It exists for users who want to reproduce the same pattern with a real model call.

Live mode should do

load the case
show the baseline prompt or baseline conditions
run the baseline path
apply the route-first repair move
run the repaired path
compare outputs

Live mode should not pretend to do

giant benchmark coverage
full production evaluation
final proof of universal robustness

It is a reproduction layer, not a universal benchmark.

Live mode design rule

If live execution introduces too much noise, the notebook should favor clarity over realism.

This is a flagship demo, not a stress benchmark.

Current MVP interpretation

The current MVP version already includes a live runnable notebook.

That does not mean the baseline must always fail in the exact same way.
It means the live run is designed to reproduce the core pattern:

baseline moves in the wrong direction
repaired moves toward the correct answer
anchor correction changes the result path

That is enough for MVP success.

9. API key note

The live variant may require an API key.

If so, the rule remains simple:

the key is entered only at run time
the key is never stored in the repository
replay mode remains readable without any secret

Important note for users

This demo is designed for understanding and reproduction.

You do not need to run the notebook in order to understand the point of the case.

A strong demo should still teach through:

README
JSON fixtures
replay outputs

The live rerun is optional.

10. Files in this folder

Core files

Optional future additions

helper utilities from the shared folder
notebook variants
patch notes for stronger baseline contrast

11. How to run the notebook

Replay-only reading path

Open the notebook in Colab
Keep MODE = "replay"
Click Run all
Read the baseline / repaired contrast

Live rerun path

Open the notebook in Colab
Change MODE = "live"
Click Run all
Enter your API key when prompted
Compare the two final output cells:
- baseline model output
- repaired model output

What to look for

The notebook does not require a magical result.

You only need to verify that:

the baseline is more vulnerable to the wrong-anchor path
the repaired version is more likely to move toward the correct answer
the shift comes from anchor correction, not generic prompt expansion

12. Expected outcome

If the demo works, the user should walk away with the following understanding:

the baseline answer looked reasonable but followed the wrong evidence
the atlas routed the case to F1, not F5
the first repair move was re-grounding, not generic prompt tweaking
after the anchor was corrected, the answer moved in the expected direction

The outcome does not need to look magical.

It only needs to make the repair logic visibly different and more correct.

That is enough.

13. Limits of this demo

This demo has real limits, and those limits should be stated clearly.

It does not prove

that all hallucination-like cases are the same
that all F1 cases are solved by one simple repair
that observability never matters in grounding cases
that this single example covers all retrieval systems

It does prove

that some fluent wrong answers are grounding-first
that route-first diagnosis changes the repair move
that grounding repair can be made visible and teachable

These are already strong claims.
There is no need to overclaim.

14. Community extension ideas

This demo is also a seed template for future community work.

Possible extensions include:

swapping in a different retrieval dataset
using different chunking strategies
testing multiple models on the same fixture
comparing naive prompt fixes versus grounding-first fixes
adding stronger evidence-trace visualization
linking the same case to deeper WFGY 3.0 experimental variants

Community boundary rule

Contributors should preserve the official logic:

route first
explain why primary not secondary
show the first repair move
keep replay outputs understandable
avoid turning the demo into an unreadable benchmark swamp

For contribution structure, see:

Short summary

This demo teaches one clean lesson: