<!--
AI_NOTE_START
Document role:
This page is the public screenshot-first AI evaluation surface for the WFGY 4.0 Twin Atlas Engine.
What this page is for:
1. Give first-time readers the fastest visible entry into the current WFGY 4.0 public proof surface.
2. Show how the same public governance stress surface changes visible model behavior across different model families.
3. Connect screenshots to raw runs, results summary, and the shortest rerun path.
4. Help readers verify that WFGY 4.0 is not being presented as theory alone.
What this page is not:
1. It is not a universal benchmark certification page.
2. It is not the full experiment archive.
3. It is not a replacement for raw runs.
4. It is not a replacement for results summary.
5. It is not a claim that every model behaves identically under WFGY 4.0.
Reading strategy:
1. Read the opening summary first.
2. Use the gallery as the first-contact layer.
3. Open raw runs if you want original output wording.
4. Open Results Summary if you want the aggregate interpretation.
5. Open Reproduce in 60 Seconds if you want to test the same public surface yourself.
Important boundary:
This page is a public proof surface for a custom governance stress demo.
It is designed to be visible, inspectable, and reproducible.
It is not a universal benchmark claim page.
AI_NOTE_END
-->
# 📊 AI Eval
> A screenshot-first public proof surface for the WFGY 4.0 Twin Atlas Engine.
This page exists for one simple reason:
**some readers should not have to read the whole engine first just to see whether the governance shift is real.**
The current AI Eval surface is built around a narrower but important question:
**what changes when a model is pushed to conclude too early, too strongly, or beyond the evidence boundary?**
---
This is not a universal benchmark page.
It is a public comparison surface for the current WFGY 4.0 governance stress demo.
## What you should look for
A good rerun is not just one where the AFTER answer looks more careful.
The question is not:

- “did the model become more polite?”
- “did the model become more cautious?”
- “did the answer get softer?”
The real question is:
**did the model stop turning plausibility into public conclusion too early?**
That is the core public test.
A good rerun should make at least one of these shifts visible:
- less illegal commitment
- less evidence-boundary crossing
- less single-cause compression
- less contradiction suppression
- more lawful downgrade
- stronger preservation of still-live ambiguity
- stronger preservation of still-live competing explanations
The right reading lens is not **softer vs louder** but **more lawful vs more premature**.
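If you want a quick pre-read before comparing two passes by eye, the sketch below is one crude way to do it. This is an illustrative heuristic only, not part of the stress suite: the marker list and function names are assumptions chosen for demonstration.

```python
# Illustrative heuristic only, not part of the governance stress suite.
# It pre-screens a BEFORE/AFTER pair by counting over-commitment phrasing;
# the marker list below is an assumption chosen for demonstration.
OVERCOMMIT_MARKERS = [
    "definitely", "proves that", "the only explanation",
    "without a doubt", "must have been", "beyond question",
]

def overcommit_score(answer: str) -> int:
    """Rough count of premature-conclusion phrasing in one answer."""
    text = answer.lower()
    return sum(text.count(marker) for marker in OVERCOMMIT_MARKERS)

def prescreen(before: str, after: str) -> None:
    """Print marker counts for a BEFORE pass and an AFTER pass."""
    b, a = overcommit_score(before), overcommit_score(after)
    print(f"BEFORE markers: {b} | AFTER markers: {a}")
    # A drop in markers is a hint, not a verdict: read both passes and
    # check that the shift is lawful downgrade, not just softer wording.
```

A lower AFTER count is only a hint. The lens above still applies, so read both passes before concluding anything.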
## Why this page matters
The current WFGY 4.0 public surface already includes:
- a public Twin Atlas runtime TXT
- a public governance stress suite TXT
- screenshot comparisons across the current public model set
- model-specific raw runs
- a results-summary layer
- deeper flagship evidence pages
That means this page should be read as a visible proof surface, not as a one-off screenshot wall.
If you want the aggregate read, go to [Results Summary](../evidence/results-summary.md).
If you want the original outputs, go to [Raw Runs](../evidence/raw-runs/).
If you want to run the same public surface yourself, go to [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md).
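For orientation, here is roughly how those artifacts sit relative to this page, inferred only from the relative links used above; nothing outside these paths is claimed.

```text
./                                   this page's directory
├── ai-eval.md
├── reproduce-in-60-seconds.md
├── prompts/
│   ├── wfgy-4_0-twin-atlas-runtime.txt
│   └── wfgy-4_0-governance-stress-suite.txt
└── screenshots/
    └── <model>_before-after.png     one per model in the gallery below
../evidence/
├── raw-runs/                        original outputs, one TXT per model
├── results-summary.md
├── governance-stress-suite.md
└── flagship-cases.md
```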
## Current Public Gallery
### ChatGPT
**Why it matters**
A strong public example of lawful downgrade without collapsing into a blanket stop system.
**Best first use**
Open this if you want to see how a strong default assistant shifts from premature closure toward authorization-aware restraint.
- [Raw run](../evidence/raw-runs/chatgpt.txt)
- [Screenshot](./screenshots/chatgpt_before-after.png)
---
### Claude
**Why it matters**
One of the clearest public examples of ambiguity preservation and conflict-sensitive restraint under pressure.
**Best first use**
Open this if you want to see how WFGY 4.0 resists collapsing mixed evidence into one over-confident narrative.
- [Raw run](../evidence/raw-runs/claude.txt)
- [Screenshot](./screenshots/claude_before-after.png)
---
### Gemini
**Why it matters**
A strong example of downgrade discipline under thin evidence and forced-choice pressure.
**Best first use**
Open this if you want to see how a model stops treating pressure as permission to conclude.
- [Raw run](../evidence/raw-runs/gemini.txt)
- [Screenshot](./screenshots/gemini_before-after.png)
---
### Grok
**Why it matters**
A useful public comparison point for attribution pressure, authenticity pressure, and over-commitment control.
**Best first use**
Open this if you want to see how visible output strength changes when route and authorization are no longer allowed to collapse.
- [Raw run](../evidence/raw-runs/grok.txt)
- [Screenshot](./screenshots/grok_before-after.png)
---
### DeepSeek
**Why it matters**
A clear public case for stronger evidence-boundary discipline and attribution restraint.
**Best first use**
Open this if you want to see how the same cases change when the model is no longer allowed to turn circumstantial pressure into hard public naming.
- [Raw run](../evidence/raw-runs/deepseek.txt)
- [Screenshot](./screenshots/deepseek_before-after.png)
---
### Kimi
**Why it matters**
A strong before-after separation in several pressure-heavy business and evidence-chain cases.
**Best first use**
Open this if you want to inspect a screenshot layer where the governance shift is easy to see quickly.
- [Raw run](../evidence/raw-runs/kimi.txt)
- [Screenshot](./screenshots/kimi_before-after.png)
---
### Mistral
**Why it matters**
A useful comparison point for visible output-strength reduction under the same stress surface.
**Best first use**
Open this if you want to compare how governance discipline changes the public answer profile across model families.
- [Raw run](../evidence/raw-runs/mistral.txt)
- [Screenshot](./screenshots/mistral_before-after.png)
---
### Perplexity
**Why it matters**
An important public outlier.
**Best first use**
Open this if you want to see why this page is a proof surface instead of a polished promo wall. This run is useful precisely because it makes over-downgrade and blanket-refusal drift inspectable rather than hidden.
- [Raw run](../evidence/raw-runs/perplexity.txt)
- [Screenshot](./screenshots/perplexity_before-after.png)
## What changed most across the current public screenshot layer
Across the current public runs, the most consistent visible shift is not that the AFTER answer becomes “nicer.”
The more important shift is that the AFTER pass becomes less willing to:
- convert a plausible route into an authorized conclusion
- treat appearance as if it were already proof
- erase live ambiguity just to satisfy pressure
- compress multi-factor situations into one exact cause
- speak above the lawful output ceiling just because the user demanded it
That is the real use of this page.
It is not here to prove that every model became perfect.
It is here to show that the current public WFGY 4.0 surface produces a visible governance shift that readers can inspect for themselves.
## Want to run the same public surface yourself?
Use these two files:
- [Twin Atlas Runtime TXT](./prompts/wfgy-4_0-twin-atlas-runtime.txt)
- [Governance Stress Suite TXT](./prompts/wfgy-4_0-governance-stress-suite.txt)
Then follow:
- [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md)
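If you would rather script the two-pass run than paste by hand, a minimal sketch follows. It assumes an OpenAI-compatible chat endpoint via the `openai` Python package; the model name is a placeholder, and the TXT paths match the links above.

```python
# Minimal two-pass rerun sketch. Assumes an OpenAI-compatible endpoint
# (pip install openai) with OPENAI_API_KEY set; "gpt-4o" is a placeholder.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

runtime = Path("prompts/wfgy-4_0-twin-atlas-runtime.txt").read_text()
suite = Path("prompts/wfgy-4_0-governance-stress-suite.txt").read_text()

# Pass 1: load the Twin Atlas runtime into the conversation.
messages = [{"role": "user", "content": runtime}]
first = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant",
                 "content": first.choices[0].message.content})

# Pass 2: run the governance stress suite in the same conversation.
# For a BEFORE baseline, run the suite alone in a fresh conversation
# without the runtime, then compare the two outputs.
messages.append({"role": "user", "content": suite})
second = client.chat.completions.create(model="gpt-4o", messages=messages)
print(second.choices[0].message.content)
```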
## Important boundary
This page is useful because it is visible, repeatable, and easy to inspect.
It does **not** by itself prove universal superiority in every domain, every workflow, or every deployment environment.
For broader interpretation, use:
- [AI Eval](./ai-eval.md)
- [Screenshots](./screenshots/)
- [Raw Runs](../evidence/raw-runs/)
- [Results Summary](../evidence/results-summary.md)
---
## Where to go next
### If you want the aggregate interpretation
- [Results Summary](../evidence/results-summary.md)
### If you want the original model wording
- [Raw Runs](../evidence/raw-runs/)
### If you want the shortest rerun path
- [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md)
### If you want the flagship example cases
- [Flagship Cases](../evidence/flagship-cases.md)