vrr/WFGY

Fork 0

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig + MiniPS 726043d2c6

Create flagship-fix-demos-v1.md

2026-03-12 12:07:34 +08:00

16 KiB

Raw Blame History

Flagship Fix Demos v1 🧪

Problem Map 3.0 Troubleshooting Atlas

Official first runnable demo pack design

0. Document Status 🚦

This document defines the first official flagship demo pack for the atlas fix layer.

Its purpose is simple:

show that the atlas does not only classify failures
it also leads to better first repair moves that can be demonstrated in runnable form

This pack is frozen as Flagship Fix Demos v1.

It is frozen not because all future demos are complete, but because the first official demo strategy is now clear enough to:

guide implementation
guide Colab creation
guide JSON fixture design
guide expected-output design
support public demos
support community expansion later

1. Why flagship demos matter 💥

Without demos, the atlas can still look like “only a clever taxonomy.”

With strong demos, people can see:

a failure enters
the atlas routes it
the first repair move changes
the output or workflow improves
deeper WFGY escalation becomes optional instead of mysterious

That is the point of this pack.

The demos are meant to prove:

route-first diagnosis changes the repair path
the first repair move is not arbitrary
the atlas is usable in real workflows
the bridge into WFGY 3.0 is practical, not just theoretical

2. Should flagship demos use Colab? ✅

Yes. At least the flagship demos should have a Colab-compatible implementation path.

Why:

Colab is easy to share
people can run it without setting up a local environment first
screenshots and outputs are easy to compare
contributors can fork and extend demos quickly
it lowers trust friction

But not every fix note needs a notebook.

The right split is:

Official flagship demos

Should include:

a markdown demo card
a Colab notebook or notebook-ready script
input JSON fixture
expected output JSON
short success criteria

Long-tail or domain-specific demos

Can be contributed later by the community.

Short version:

official flagship demos should be runnable
not every future fix recipe needs to be official or notebook-first

3. Demo design standard 🔧

Each official demo should contain five parts.

Part A · Case framing

Define:

failure name
primary family
secondary family
broken invariant
best current fit
why this case is a good demo

Part B · Baseline failure

Show the broken version first.

This is important. A demo is much stronger when people can see the wrong behavior before the fix.

Part C · First repair move

Apply the family-level first repair move from the official fix surface.

Part D · Optional WFGY escalation

If useful, show how the same case can be pushed one level deeper with WFGY 3.0.

Part E · Evaluation result

Show concrete before / after comparison using simple metrics or checks.

4. Asset standard 📦

Each flagship demo should eventually have the following files.

4.1 Demo card

A markdown file that explains:

the case
the routing
the fix logic
the result
the lesson

4.2 Colab notebook

A runnable notebook that reproduces:

baseline failure
first repair move
after-fix behavior

4.3 Input JSON fixture

A small structured input pack for the demo.

4.4 Expected output JSON

A compact expected result file for comparison.

4.5 Optional screenshot or GIF

Useful for quick public-facing presentation.

5. Recommended folder pattern 🗂️

The first official demos should follow a predictable structure.

Suggested future companion assets:

Suggested file naming style:

f1-retrieval-anchor-drift-demo-v1.md
f1-retrieval-anchor-drift-demo-v1.ipynb
f1-retrieval-anchor-drift-input-v1.json
f1-retrieval-anchor-drift-expected-v1.json

And similarly for F4, F5, F7.

6. Why these first demos were chosen 🎯

The first demo pack should not try to cover every family equally at once.

The correct first wave should optimize for:

high clarity
high reproducibility
visible before / after effect
easy sharing
strong teaching value

That is why the first official pack should focus on four demos:

F1 grounding
F4 execution closure
F5 observability
F7 representation fidelity

Why these four first:

they are easier to make runnable
they show clearly different types of repair
they are easier to teach publicly
they are less likely to dissolve into endless philosophical interpretation

F2, F3, and F6 can follow in later waves or in deeper expansions.

Part I · Official Demo 1

F1 Retrieval Anchor Drift Demo 🌍

1.1 Goal

Show that a retrieval-like failure that looks like “the model answered badly” is actually a grounding problem first.

1.2 Family routing

Primary Family: F1 Grounding & Evidence Integrity
Secondary Family: F5 Observability & Diagnosability Integrity
Broken Invariant: evidence-anchor integrity broken
Best Current Fit: F1_N01 Retrieval Anchor Drift

1.3 Baseline failure design

Build a tiny corpus with:

one target passage
several semantically similar distractor passages
one query that requires the correct anchor

The broken baseline should use:

naive retrieval
weak source checking
no source-to-answer trace

This should produce a plausible but unsupported answer often enough to be visibly wrong.

1.4 First repair move

Apply:

anchor tracing
source-to-answer verification
explicit support checking
optional simple reranking by support quality

1.5 Suggested evaluation fields

Measure:

evidence_hit_rate
answer_support_rate
wrong_anchor_rate
citation_presence

1.6 Suggested JSON fixture

Input JSON should contain:

corpus
query
expected_anchor_id
distractor_ids
baseline_config
fix_config

Expected output JSON should contain:

retrieved_ids_before
retrieved_ids_after
predicted_answer_before
predicted_answer_after
support_verdict_before
support_verdict_after

1.7 Why this demo matters

This is one of the best first demos because it proves a key atlas claim:

many “hallucination-like” failures are grounding failures first

Part II · Official Demo 2

F4 Readiness and Closure Demo ⚙️

2.1 Goal

Show that a workflow can fail not because the model reasons badly, but because the execution skeleton never properly closes.

2.2 Family routing

Primary Family: F4 Execution & Contract Integrity
Secondary Family: F3 State & Continuity Integrity
Broken Invariant: execution skeleton closure broken
Best Current Fit: F4_N03 Pre-Readiness Execution Failure or F4_N02 Deployment Deadlock, depending on implementation

2.3 Baseline failure design

Build a simple staged workflow:

ingest
index
query

The broken baseline should deliberately let query run before index readiness is confirmed.

Optional variants:

missing artifact
stale index
hidden dependency
bridge value not yet written

2.4 First repair move

Apply:

readiness check
ordering validation
bridge existence check
fallback path or fail-fast branch

2.5 Suggested evaluation fields

Measure:

successful_run_rate
precondition_failure_count
closure_success_before
closure_success_after
fallback_triggered

2.6 Suggested JSON fixture

Input JSON should contain:

pipeline_steps
required_artifacts
baseline_order
corrected_order
readiness_gate_config

Expected output JSON should contain:

step_status_before
step_status_after
failure_stage_before
failure_stage_after
closure_verdict_before
closure_verdict_after

2.7 Why this demo matters

This demo is strong because it proves another key atlas claim:

some failures should be repaired at the execution skeleton layer before anyone blames reasoning

Part III · Official Demo 3

F5 Failure Path Visibility Demo 🔎

3.1 Goal

Show that some systems are not fixable first by “being smarter,” but by becoming diagnosable.

3.2 Family routing

Primary Family: F5 Observability & Diagnosability Integrity
Secondary Family: F4 Execution & Contract Integrity
Broken Invariant: failure-path visibility broken
Best Current Fit: F5_N01 Failure Path Opacity

3.3 Baseline failure design

Use a simple multi-step pipeline with weak or absent tracing.

The broken baseline should have:

no step-level trace
no structured failure log
no coherence or stage probe
failure visible only at the final output

3.4 First repair move

Apply:

trace IDs
step-level logging
structured error summary
minimal coherence or stage probe

3.5 Suggested evaluation fields

Measure:

trace_completeness
stage_localization_success
diagnosable_run_rate
mean_failure_visibility_score

3.6 Suggested JSON fixture

Input JSON should contain:

pipeline_steps
hidden_failure_stage
baseline_observability_config
upgraded_observability_config

Expected output JSON should contain:

visible_failure_stage_before
visible_failure_stage_after
trace_fields_before
trace_fields_after
diagnosable_verdict_before
diagnosable_verdict_after

3.7 Why this demo matters

This demo proves that:

some first repair moves should expose the failure before attempting deep intervention

It is also highly useful for public teaching.

Part IV · Official Demo 4

F7 Structure Fidelity Demo 🧱

4.1 Goal

Show that some failures come from a broken carrier, not from “not enough reasoning.”

4.2 Family routing

Primary Family: F7 Representation & Localization Integrity
Secondary Family: F2 Reasoning & Progression Integrity
Broken Invariant: representation container fidelity broken
Best Current Fit: F7_N01 Symbolic Representation Fidelity Failure or F7_N01_A Logic Descriptor Fidelity Failure

4.3 Baseline failure design

Use a structured object such as:

logic rules
constrained table
schema-bound task
hierarchical instruction object

The broken baseline should flatten or distort the carrier, for example by:

lossy serialization
schema field drop
hierarchy collapse
descriptor simplification

4.4 First repair move

Apply:

descriptor fidelity audit
schema validation
hierarchy preservation check
local-anchor or field-preservation verification

4.5 Suggested evaluation fields

Measure:

schema_pass_rate
field_loss_count
constraint_preservation_score
descriptor_fidelity_before
descriptor_fidelity_after

4.6 Suggested JSON fixture

Input JSON should contain:

source_structure
expected_constraints
baseline_representation
repaired_representation

Expected output JSON should contain:

lost_fields_before
lost_fields_after
schema_pass_before
schema_pass_after
constraint_check_before
constraint_check_after

4.7 Why this demo matters

This demo proves a key atlas insight:

richer reasoning does not help much when the structure-carrying container is already broken

7. What should be official in v1, and what should wait ⏳

The first official demo pack should stop at four demos.

That is enough to prove:

the atlas can route
the fix surface changes first moves
some cases are visibly repairable
Colab-backed reproduction is possible

The first official wave should not try to cover every family immediately.

The better strategy is:

Official v1 demo pack

Later official wave or community wave

F2 deeper reasoning collapse demos
F3 continuity and ownership demos
F6 collective boundary or safe-corridor demos

This keeps the first wave strong and realistic.

8. Colab implementation guidance 💻

The official demo design should assume that each flagship demo eventually has a Colab companion.

A good notebook should include:

Section A

Case setup

Section B

Broken baseline run

Section C

Routed family and broken invariant

Section D

First repair move

Section E

After-fix rerun

Section F

Before / after comparison

Section G

Optional WFGY escalation prompt

The notebook should avoid:

too much hidden magic
giant dependency chains
unnecessary infra complexity
unclear evaluation logic

The best official demos should feel:

light
reproducible
inspectable
easy to fork

9. Optional WFGY escalation in demos 🌊

Each flagship demo may include a small optional section:

“Go deeper with WFGY 3.0”

This section should not replace the official first repair move.

Its purpose is only to show how the same demo can be explored more deeply through:

stronger structural diagnosis
alternative recovery hypotheses
problem-specific experiment cuts
more advanced recovery exploration

This is the cleanest way to demonstrate that:

the atlas gives the first move
WFGY 3.0 gives deeper exploration

10. Community extension path 🤝

Once the first four official demos exist, the community layer can grow much faster.

Community contributors can extend:

new Colab variants
JSON fixtures for related cases
prompt-based reproductions
workflow-specific versions
benchmark reruns
narrower domain-specific repair demos

The official demo pack should therefore be treated as:

the clean seed set
not the full future library

11. Public demo usage pattern 📣

A strong public-facing demo flow should look like this:

show the broken baseline
show atlas routing
show the first repair move
show the improved result
optionally show the WFGY 3.0 deeper path
point to community extensions

This is much stronger than only saying:

“we have a framework” or
“we have a taxonomy”

Because people can actually see:

what changed
why it changed
which repair layer did the work

12. Patch protocol 🔄

Flagship Fix Demos v1 is frozen, but not closed.

Small patch

Use for:

wording refinement
better metrics
clearer demo instructions
improved JSON examples

Medium patch

Use for:

adding one more official demo
improving Colab flow
improving expected-output structure
adding stronger evaluation notes

Large patch

Only use if:

the first official demo strategy proves structurally misleading
the current demo set fails to demonstrate atlas value clearly
the route-first to fix-first logic breaks down in practice

Current status

No large-patch pressure is currently justified.

13. Official status

The correct formal statement is:

Flagship Fix Demos v1 is the first frozen official runnable demo design pack for the Atlas fix layer.
It defines the first set of high-value demonstrations that connect atlas routing, first repair moves, reproducible execution, and optional deeper WFGY exploration.

14. One-line version

Flagship Fix Demos v1 defines the first official runnable demos that prove the atlas can guide better first repair moves.

15. Closing note ✨

A strong system should not only tell people how to think.

It should also give them a few clean things they can actually run.

That is what this file is for.

The first official demos should be small, sharp, reproducible, and easy to extend.

16 KiB Raw Blame History

Flagship Fix Demos v1 🧪

Problem Map 3.0 Troubleshooting Atlas

Official first runnable demo pack design

0. Document Status 🚦

1. Why flagship demos matter 💥

2. Should flagship demos use Colab? ✅

Official flagship demos

Long-tail or domain-specific demos

3. Demo design standard 🔧

Part A · Case framing

Part B · Baseline failure

Part C · First repair move

Part D · Optional WFGY escalation

Part E · Evaluation result

4. Asset standard 📦

4.1 Demo card

4.2 Colab notebook

4.3 Input JSON fixture

4.4 Expected output JSON

4.5 Optional screenshot or GIF

5. Recommended folder pattern 🗂️

6. Why these first demos were chosen 🎯

Part I · Official Demo 1

F1 Retrieval Anchor Drift Demo 🌍

1.1 Goal

1.2 Family routing

1.3 Baseline failure design

1.4 First repair move

1.5 Suggested evaluation fields

1.6 Suggested JSON fixture

1.7 Why this demo matters

Part II · Official Demo 2

F4 Readiness and Closure Demo ⚙️

2.1 Goal

2.2 Family routing

2.3 Baseline failure design

2.4 First repair move

2.5 Suggested evaluation fields

2.6 Suggested JSON fixture

2.7 Why this demo matters

Part III · Official Demo 3

F5 Failure Path Visibility Demo 🔎

3.1 Goal

3.2 Family routing

3.3 Baseline failure design

3.4 First repair move

3.5 Suggested evaluation fields

3.6 Suggested JSON fixture

3.7 Why this demo matters

Part IV · Official Demo 4

F7 Structure Fidelity Demo 🧱

4.1 Goal

4.2 Family routing

4.3 Baseline failure design

4.4 First repair move

4.5 Suggested evaluation fields

4.6 Suggested JSON fixture

4.7 Why this demo matters

7. What should be official in v1, and what should wait ⏳

Official v1 demo pack

Later official wave or community wave

8. Colab implementation guidance 💻

Section A

Section B

Section C

Section D

Section E

Section F

Section G

9. Optional WFGY escalation in demos 🌊

“Go deeper with WFGY 3.0”

10. Community extension path 🤝

11. Public demo usage pattern 📣

12. Patch protocol 🔄

Small patch

Medium patch

Large patch

Current status

13. Official status

16 KiB

Raw Blame History