Flagship Runnable Demo Pack
Problem Map 3.0 Troubleshooting Atlas
Official MVP demo entry for route-first repair
This folder contains the first flagship runnable demos for the Atlas fix system.
These demos are not meant to prove every possible case.
They are meant to prove something more important:
the atlas does not only name failures
it changes the first repair move
That is the entire reason this demo pack exists.
If the atlas were only a naming system, it would be interesting at best.
Once it shows that different routing decisions lead to different repair decisions, it becomes useful.
This demo pack is the smallest runnable surface that makes that claim visible.
Each demo is built around the same high-signal pattern:
- baseline failure
- atlas routing
- first repair move
- visible outcome shift
- optional deeper WFGY 3.0 exploration
The goal is not scale.
The goal is clarity.
This folder should be read as the official public demo surface of the first Atlas fix release.
It is not a giant benchmark zoo.
It is a compact, high-signal proof layer.
What this demo pack proves
This demo pack is designed to make four claims visible.
1. Different failures should not be repaired the same way
Many systems fail with surface similarities.
A fluent wrong answer, a broken workflow, a symbolic collapse, and a black-box debugging problem can all feel like “the system is bad.”
The atlas says that is too coarse.
This demo pack shows that once a case is routed into the right family, the first repair move changes.
2. The atlas is not just a checklist
A checklist can name symptoms.
A troubleshooting atlas should help you decide:
- what kind of failure this is
- why this family is primary
- why a neighboring family is secondary
- what should be repaired first
- what should not be repaired first
These demos are built to make that difference visible.
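The five decisions listed above can be written down as a small record. The following is a minimal sketch, not the official atlas schema: the class name `RouteDecision`, its field names, and the `summarize` helper are all hypothetical. The example values come from the Demo 1 description in this document, which does not name a neighboring family, so that field is left unset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RouteDecision:
    """Hypothetical record of one routing decision; not the official atlas schema."""

    primary_family: str                        # what kind of failure this is
    why_primary: str                           # why this family is primary
    repair_first: str                          # what should be repaired first
    do_not_repair_first: Tuple[str, ...] = ()  # what should not be repaired first
    secondary_family: Optional[str] = None     # neighboring family, if one applies
    why_secondary: Optional[str] = None        # why the neighbor is only secondary


def summarize(decision: RouteDecision) -> str:
    """Render the decision as a short text block a reviewer can scan."""
    lines = [
        f"primary family: {decision.primary_family}",
        f"why primary: {decision.why_primary}",
        f"repair first: {decision.repair_first}",
    ]
    if decision.do_not_repair_first:
        lines.append("do not repair first: " + ", ".join(decision.do_not_repair_first))
    if decision.secondary_family:
        lines.append(f"secondary family: {decision.secondary_family} ({decision.why_secondary})")
    return "\n".join(lines)


# Example values taken from the Demo 1 description in this document.
demo_1 = RouteDecision(
    primary_family="F1 Grounding & Evidence Integrity",
    why_primary="a fluent answer is attached to the wrong evidence anchor",
    repair_first="re-grounding",
    do_not_repair_first=("style rewriting",),
)
print(summarize(demo_1))
```

Keeping "what should not be repaired first" as an explicit field is the part a plain symptom checklist usually lacks.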
3. Route-first repair is practical
The purpose of these notebooks is not to simulate a giant production stack.
The purpose is to show a minimal but convincing pattern:
- baseline failure
- atlas routing
- first repair move
- result shift
- optional deeper WFGY 3.0 exploration
That is enough to make the system feel real.
4. Community growth becomes much easier once the flagship set exists
These four demos are also templates.
They are not only proofs.
They are seed assets for future contributed demos in:
- Colab
- JSON fixtures
- prompt packs
- workflow reproductions
- benchmark reruns
This is why the first official set matters so much.
Demo overview
| Demo | Family | Core proof | Recommended entry |
|---|---|---|---|
| Demo 1 | F1 Grounding & Evidence Integrity | re-grounding changes the first repair path | demo_01_f1_grounding_anchor_recovery_live.ipynb |
| Demo 2 | F5 Observability & Diagnosability Integrity | visibility uplift comes before answer repair | demo_02_f5_observability_first_replay_v2.ipynb |
| Demo 3 | F4 Execution & Contract Integrity | execution closure repair comes before reasoning-level repair | demo_03_f4_execution_closure_replay_v2.ipynb |
| Demo 4 | F7 Representation & Localization Integrity | container repair changes what can stabilize next | demo_04_f7_container_fidelity_replay_v2.ipynb |
Current MVP status
The first flagship demo pack is now in a usable MVP state.
At the current stage:
- Demo 1 includes a live notebook and replay support
- Demo 2 is replay-first, with v2 now serving as the recommended replay notebook
- Demo 3 is replay-first, with v2 now serving as the recommended replay notebook
- Demo 4 is replay-first, with v2 now serving as the recommended replay notebook
This is intentional.
The current pack is designed to prove the strongest teaching pattern in the clearest possible way.
That means:
- use live mode where live comparison adds real proof value
- use replay mode where replay is clearer, safer, and more honest
This is not a shortcut.
It is a deliberate MVP teaching decision.
At the current release stage, this pack should be treated as the official recommended first demo set for route-first repair.
Current official notebook choices
At the current MVP stage, the recommended notebook entry points are:
- Demo 1
  - official live notebook: demo_01_f1_grounding_anchor_recovery_live.ipynb
- Demo 2
  - official recommended replay notebook: demo_02_f5_observability_first_replay_v2.ipynb
  - original replay notebook retained as: demo_02_f5_observability_first_replay.ipynb
- Demo 3
  - official recommended replay notebook: demo_03_f4_execution_closure_replay_v2.ipynb
  - original replay notebook retained as: demo_03_f4_execution_closure_replay.ipynb
- Demo 4
  - official recommended replay notebook: demo_04_f7_container_fidelity_replay_v2.ipynb
  - original replay notebook retained as: demo_04_f7_container_fidelity_replay.ipynb
The rule is simple:
original notebooks are preserved as first-pass MVP assets
v2 notebooks are the cleaner recommended replay versions for Demo 2, Demo 3, and Demo 4
If multiple notebooks exist for the same demo, the README and this page should always make the recommended entry point explicit.
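One way to keep the recommended entry point explicit is to declare it in a small table rather than infer it from file names. The sketch below is illustrative only: the notebook names are the ones listed on this page, but the dictionary layout and the helpers `entry_point` and `find_ambiguous` are assumptions, not part of the shared support layer.

```python
from typing import List

# Notebook names are taken from this page; the dict layout itself is hypothetical.
OFFICIAL_NOTEBOOKS = {
    "Demo 1": {
        "recommended": "demo_01_f1_grounding_anchor_recovery_live.ipynb",
        "retained": [],
    },
    "Demo 2": {
        "recommended": "demo_02_f5_observability_first_replay_v2.ipynb",
        "retained": ["demo_02_f5_observability_first_replay.ipynb"],
    },
    "Demo 3": {
        "recommended": "demo_03_f4_execution_closure_replay_v2.ipynb",
        "retained": ["demo_03_f4_execution_closure_replay.ipynb"],
    },
    "Demo 4": {
        "recommended": "demo_04_f7_container_fidelity_replay_v2.ipynb",
        "retained": ["demo_04_f7_container_fidelity_replay.ipynb"],
    },
}


def entry_point(demo: str) -> str:
    """Return the explicitly declared recommended notebook for a demo."""
    return OFFICIAL_NOTEBOOKS[demo]["recommended"]


def find_ambiguous(table: dict) -> List[str]:
    """List demos whose recommended entry point is missing or contradictory."""
    problems = []
    for demo, entry in table.items():
        recommended = entry.get("recommended")
        if not recommended:
            problems.append(f"{demo}: no recommended notebook declared")
        elif recommended in entry.get("retained", []):
            problems.append(f"{demo}: recommended notebook is also listed as retained")
    return problems


print(entry_point("Demo 2"))
print(find_ambiguous(OFFICIAL_NOTEBOOKS))
```

An explicit table avoids guessing from a `_v2` suffix, which would break for Demo 1, where the recommended notebook is the live one.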
Current shared support layer
The demo pack also includes a small official shared support layer in the shared folder.
At the current MVP stage, that folder already includes:
- README.md
- demo_utils.py
- display_helpers.py
- routing_schema.md
These files exist to keep the official demos more aligned, more readable, and easier to audit.
They are not meant to turn the demo pack into a hidden mini-framework.
In short:
the shared layer already exists
but it remains intentionally small
Why these four demos were chosen
The first flagship set uses four families:
- F1 Grounding & Evidence Integrity
- F5 Observability & Diagnosability Integrity
- F4 Execution & Contract Integrity
- F7 Representation & Localization Integrity
This combination was chosen on purpose.
F1 is the best entry point
It is easy to understand and immediately useful.
People instantly understand what it means when an answer looks fluent but is attached to the wrong evidence.
F5 makes engineers pay attention
This demo proves the atlas is not limited to answer quality.
Sometimes the first repair move is not “fix the answer.”
Sometimes the first repair move is “make the failure visible.”
That is a mature debugging idea.
F4 proves the atlas can touch workflow skeletons
This demo shows that the atlas is not only about content generation.
It can also classify and repair problems involving:
- readiness
- ordering
- bridge integrity
- liveness
- closure
That gives the system real architectural weight.
F7 gives the atlas its sharpest identity
This is one of the most distinctive cuts in the whole map.
It shows that some failures are not reasoning-first or grounding-first.
Sometimes the container that carries structure fails first.
That is a powerful and memorable cut.
The four official demos
This folder is organized around four flagship demos.
Demo 1 · F1 Grounding Anchor Recovery
Theme
A fluent answer fails because it is attached to the wrong evidence anchor.
What this demo proves
- the failure is grounding-first
- the problem is not mainly “the model is dumb”
- the first repair move should be re-grounding, not style rewriting
- evidence verification changes the repair path
Who this demo will hit hardest
- RAG builders
- retrieval engineers
- enterprise QA builders
- doc QA users
- people tired of shallow hallucination discourse
Main lesson
Not all wrong answers are “hallucination” in the same way.
Some are evidence-anchor failures first.
Folder
Demo 1 · F1 Grounding Anchor Recovery
Official notebook
demo_01_f1_grounding_anchor_recovery_live.ipynb
Demo 2 · F5 Observability First
Theme
A failing workflow cannot be repaired correctly because its failure path is still hidden.
What this demo proves
- the first failure is diagnosability
- the correct first repair move is observability insertion
- fixing the answer too early is the wrong move
- visibility changes the repair landscape
Who this demo will hit hardest
- agent builders
- workflow orchestrators
- evaluation engineers
- anyone who has said “I know it is broken, but I cannot see why”
Main lesson
Sometimes the first repair is not “repair the system.”
Sometimes the first repair is “make the system visible.”
Folder
Demo 2 · F5 Observability First
Official notebook
demo_02_f5_observability_first_replay_v2.ipynb
Original notebook retained
demo_02_f5_observability_first_replay.ipynb
Demo 3 · F4 Execution Closure
Theme
A system fails because execution skeleton closure breaks before reasoning quality even matters.
What this demo proves
- the problem is not primarily memory or reasoning
- the problem is readiness, ordering, bridge, or liveness
- the correct first repair move is execution closure repair
- system structure can fail before model reasoning becomes the limiting factor
Who this demo will hit hardest
- AI workflow engineers
- multi-step system builders
- pipeline designers
- tool-calling framework users
- anyone who has seen “it failed because the sequence itself was wrong”
Main lesson
Some failures are caused by the workflow skeleton, not by intelligence quality.
Folder
Demo 3 · F4 Execution Closure
Official notebook
demo_03_f4_execution_closure_replay_v2.ipynb
Original notebook retained
demo_03_f4_execution_closure_replay.ipynb
Demo 4 · F7 Container Fidelity
Theme
A task looks like reasoning failure, but the structure carrier fails first.
What this demo proves
- the problem is not purely progression-first
- symbolic or formal containers can fail before reasoning becomes the main issue
- the first repair move should target descriptor fidelity or formal adequacy
- container-first repair changes what the system can stably do next
Who this demo will hit hardest
- structured output builders
- JSON and schema users
- code and symbolic reasoning users
- OCR or layout-sensitive pipeline users
- anyone interested in the atlas’s most distinctive knife-cut
Main lesson
Sometimes the system does not fail because it “cannot think.”
Sometimes it fails because the box carrying the thinking is already broken.
Folder
Demo 4 · F7 Container Fidelity
Official notebook
demo_04_f7_container_fidelity_replay_v2.ipynb
Original notebook retained
demo_04_f7_container_fidelity_replay.ipynb
Demo modes
The flagship pack currently uses two practical modes plus one growth mode.
Mode A · Replay mode
This is the default and most important public mode.
It works without any API key.
The user can inspect:
- the case
- the baseline
- the atlas route
- the first repair move
- the replayed before / after outputs
- the explanation of what changed
A person should be able to understand the demo even without running anything.
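As a rough illustration of what key-free replay can look like, the sketch below renders a stored record into the inspectable items listed above. The key names and placeholder values are assumptions made for this example; the actual layout of replay_outputs.json may differ.

```python
import json

# Hypothetical replay record. Key names are illustrative and mirror the
# inspectable items listed above; the real replay_outputs.json may differ.
REPLAY_JSON = """
{
  "case": "placeholder case description",
  "baseline": "placeholder baseline behaviour",
  "atlas_route": "F1 Grounding & Evidence Integrity",
  "first_repair_move": "re-grounding",
  "before": "placeholder output before the repair",
  "after": "placeholder output after the repair",
  "what_changed": "placeholder explanation of the shift"
}
"""

FIELDS = (
    "case",
    "baseline",
    "atlas_route",
    "first_repair_move",
    "before",
    "after",
    "what_changed",
)


def render_replay(record: dict) -> str:
    """Format a replay record as plain text; no model call or API key involved."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise KeyError("replay record is missing: " + ", ".join(missing))
    return "\n".join(f"{name.replace('_', ' ')}: {record[name]}" for name in FIELDS)


print(render_replay(json.loads(REPLAY_JSON)))
```

Because the function only reads stored data, the same output can be pasted into a README for readers who never open the notebook.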
Mode B · Live reproduction mode
This is optional and only used when live execution adds real value.
If it exists, it should be clearly treated as:
- optional
- for reproduction
- not required to understand the demo
- not required to evaluate the atlas concept
Mode C · Community extension mode
This is the growth mode.
Once the official demo exists, contributors should be able to:
- swap the input case
- swap the model
- swap the prompt
- swap the fixture
- extend the repair path
- compare their result to the official version
This is how the long tail grows.
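Comparing a community run to the official version can be as simple as a field-by-field diff. This is a minimal sketch under stated assumptions: `changed_fields` is a hypothetical helper, and the keys and values in the example (including `model`) are invented for illustration rather than taken from any official fixture.

```python
def changed_fields(official: dict, community: dict) -> dict:
    """Return {key: (official_value, community_value)} for every key that differs."""
    keys = sorted(set(official) | set(community))
    return {
        key: (official.get(key), community.get(key))
        for key in keys
        if official.get(key) != community.get(key)
    }


# Illustrative records only; real fixtures may use different keys.
official_run = {
    "atlas_route": "F1 Grounding & Evidence Integrity",
    "first_repair_move": "re-grounding",
    "model": "model-a",
}
community_run = {
    "atlas_route": "F1 Grounding & Evidence Integrity",
    "first_repair_move": "re-grounding",
    "model": "model-b",
}

print(changed_fields(official_run, community_run))
```

A diff that shows only the swapped element and leaves the route and first repair move unchanged suggests the extension stayed within the official pattern.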
Why only Demo 1 has live mode in the first MVP
This point matters and should be explicit.
In the first MVP release:
- Demo 1 includes a live notebook
- Demo 2, Demo 3, and Demo 4 are intentionally replay-first, and their current recommended notebooks are the v2 replay notebooks
This is not because the other demos are weaker.
It is because the first thing they need to prove is different.
Why Demo 1 gets live mode first
Demo 1 is the cleanest place to show a real before / after answer shift.
Its teaching value becomes stronger when a reader can see:
- baseline answer
- repaired answer
- anchor correction
- result movement
That makes live reproduction especially worthwhile.
Why Demo 2 stays replay-first in the first MVP
Demo 2 is about failure-path visibility.
Its first teaching job is not to show a model being more impressive.
Its first teaching job is to show that:
the system was too opaque to diagnose safely
and the first repair move was visibility uplift
Replay mode is already enough to teach that clearly.
Why Demo 3 stays replay-first in the first MVP
Demo 3 is about execution skeleton closure.
Its teaching center is:
- readiness
- ordering
- bridge integrity
- closure
These are structural logic shifts, not model-performance showpieces.
Replay mode is the cleanest and most honest way to teach that in the first release.
Why Demo 4 stays replay-first in the first MVP
Demo 4 is about container fidelity.
Its first teaching job is to make one thing visible:
the form was already failing before deeper reasoning could stabilize
This is mostly a structure-comparison demo, not a live-performance demo.
Replay mode is enough for the first public proof.
The honest design rule
The first MVP should choose the mode that best teaches the pattern.
That means:
- use live mode where live comparison adds real proof value
- use replay mode where replay is clearer, safer, and more honest
This is not a shortcut.
It is a deliberate teaching design.
API key policy
Some live notebooks may require an API key.
If so, the policy is simple:
- no hard-coded keys
- no saved secrets in the repository
- key entry should happen only at run time
- replay mode should still remain readable without a key
Recommended pattern for notebooks:
- ask for the key at execution time
- keep replay mode readable without key access
- clearly state that the notebook is for reproduction, not mandatory usage
This matters because the demos are designed to be understandable even when not executed.
They are proofs of use, not mandatory benchmark rituals.
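The policy above can be sketched with the standard library alone. The environment variable name `DEMO_API_KEY` and the helpers `resolve_api_key` and `choose_mode` are hypothetical; the point is only that the key is supplied at run time, never stored in the notebook, and that a blank answer falls back to replay.

```python
import os
from getpass import getpass
from typing import Callable, Optional


def resolve_api_key(
    env_var: str = "DEMO_API_KEY",           # hypothetical variable name
    prompt: Callable[[str], str] = getpass,  # hidden input; echoes nothing
) -> Optional[str]:
    """Read the key from the environment, else ask at run time. Never persist it."""
    key = os.environ.get(env_var) or prompt(f"{env_var} (leave blank for replay mode): ")
    key = key.strip()
    return key or None


def choose_mode(api_key: Optional[str]) -> str:
    """Live mode needs a key; replay mode stays available without one."""
    return "live" if api_key else "replay"
```

In a notebook cell, `mode = choose_mode(resolve_api_key())` would prompt once when executed and continue in replay mode if the reader simply presses Enter.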
Minimal asset structure for each demo
Each flagship demo folder should contain the following.
Required
- README.md
- input_case.json
- replay_outputs.json
- expected_output.json
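The required file list above can be checked mechanically. The following is a minimal sketch: the four file names come from this page, while the helper name `missing_required` and the example path are assumptions.

```python
from pathlib import Path
from typing import List

# File names are taken from the required list above.
REQUIRED_FILES = (
    "README.md",
    "input_case.json",
    "replay_outputs.json",
    "expected_output.json",
)


def missing_required(demo_dir) -> List[str]:
    """Return the required files that are absent from a demo folder."""
    folder = Path(demo_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

For example, `missing_required("path/to/demo_folder")` returns an empty list when the folder is complete, which makes it easy to use as a pre-review check.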
Recommended
- notebook file
- optional prompt file
- optional lightweight helper reference
- optional screenshot or output snapshot if replay is easier to inspect that way
Shared support
The folder shared contains the small official support layer for:
- formatting
- simple output display
- schema handling
- compact route presentation
- optional run-time utilities
This keeps each notebook smaller and easier to audit.
If multiple notebooks exist in one demo folder, the README should clearly identify which one is the recommended official entry point.
What each demo README should explain
Each demo folder README should follow a stable structure.
Required sections
- what this demo proves
- family route
- why not neighbor
- baseline failure
- first repair move
- optional WFGY 3.0 escalation
- replay mode
- files in this folder
- expected outcome
- limits of this demo
- community extension ideas
This is important because many readers will understand the system from the README alone, without opening the notebook.
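A lightweight check can flag a demo README that drops one of these sections. This sketch is deliberately loose: it uses a case-insensitive substring match, so it does not depend on any particular heading syntax, and it can be fooled if a section name appears only in body text. The helper name `missing_sections` is hypothetical.

```python
from typing import List

# Section names are taken from the required list above.
REQUIRED_SECTIONS = (
    "what this demo proves",
    "family route",
    "why not neighbor",
    "baseline failure",
    "first repair move",
    "optional WFGY 3.0 escalation",
    "replay mode",
    "files in this folder",
    "expected outcome",
    "limits of this demo",
    "community extension ideas",
)


def missing_sections(readme_text: str) -> List[str]:
    """Return required section names that never appear in the README text."""
    lowered = readme_text.lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lowered]


print(missing_sections("What this demo proves\nFamily route\nReplay mode"))
```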
Official vs community scope
This folder is the official flagship pack.
That means it should stay:
- small
- sharp
- readable
- high-signal
- reviewable
The official goal is not to cover everything.
The official goal is to provide the strongest first proofs.
Long-tail expansion belongs to the community structure.
That is intentional.
Official demos prove the core.
Community demos scale the edge.
Relationship to WFGY 3.0
These demos sit in the middle of a larger repair flow.
Atlas layer
The atlas routes the failure.
Fix surface layer
The official fix surface suggests the first repair move.
WFGY 3.0 layer
WFGY 3.0 supports deeper structural and experimental exploration.
That means these demos should not pretend to be the final end of repair logic.
Instead, they should clearly show:
- what the first move is
- what changes after that move
- when deeper WFGY exploration becomes appropriate
This is why each demo may include an optional section called:
Optional WFGY 3.0 escalation
That section should remain compact and honest.
Why these demos matter
These demos matter because they turn the atlas from:
- a strong classification system
into:
- a visible troubleshooting system
They help a reader feel, not just believe, that:
- different routes lead to different repairs
- different repairs produce different outcomes
- the atlas changes what happens next
That is the real threshold.
Once that becomes visible, the project stops feeling like a theory-only system.
It starts feeling like a real operating layer.
What this pack does not claim
This pack does not claim that:
- four demos are enough to cover the whole atlas
- every family already has a runnable asset
- every demo must be live-run to be meaningful
- replay mode is inferior
- deeper repair is already fully solved
- community growth is no longer needed
This pack claims only that:
a first official set of flagship demos now exists to prove that route-first repair can be made visible, teachable, and reproducible
Recommended reading order
If you are new, use this order:
1. Problem Map 3.0 Troubleshooting Atlas
2. AI Eval Evidence
3. Atlas Hub
4. Atlas Final Freeze v1
5. Family Fix Surface v1
6. this demo pack
7. individual demo folders
8. optional deeper bridge through Atlas to WFGY Bridge v1
What to explore next
If this demo pack helped you understand the Atlas, consider:
- starring the WFGY repo
- opening an issue
- testing the demo folders
- contributing a clean community extension later
One-line version
This demo pack is the first official proof that Atlas routing changes the first repair move in visible, teachable, and reproducible ways.
Closing note
These four demos are small on purpose.
They are not trying to be a giant benchmark.
They are trying to be the strongest first signal.
If they work, people will immediately understand three things:
- the atlas can classify failures more cleanly
- the classification changes what should be repaired first
- the system can grow far beyond these four examples
That is enough for a flagship MVP.