<!-- Create ai-eval.md (PSBigBig + MiniPS, 2026-03-30, commit 99d4efa940, parent 81fb2221f7) -->

<!--
AI_NOTE_START
Document role:
This page is the public AI evaluation surface for WFGY 4.0 Twin Atlas Engine.
What this page is for:
1. Give first-time readers a screenshot-first entry into the current public evidence layer.
2. Show how different model families visibly change under the same governance stress surface.
3. Connect screenshots to raw runs, results summary, and the rerun path.
4. Make the public proof layer easy to inspect without forcing readers into the full architecture first.
What this page is not:
1. It is not a universal benchmark claim page.
2. It is not the full experiment archive.
3. It is not a replacement for raw runs or results summary.
4. It is not a claim that every model behaves identically under WFGY 4.0.
Reading strategy:
1. Scan the opening summary first.
2. Use the screenshot gallery as the first-contact layer.
3. Open raw runs if you want the original model wording.
4. Open Reproduce in 60 Seconds if you want to rerun the same public surface yourself.
Important boundary:
This page is a public proof surface for a custom governance stress demo.
It is designed to be inspectable and reproducible.
It is not a universal benchmark certification page.
AI_NOTE_END
-->
# 📊 AI Eval
> A screenshot-first public proof surface for WFGY 4.0 Twin Atlas Engine.

This page exists for one simple reason:
**some readers should not have to read the whole engine first just to see whether the governance shift is real.**
The current AI Eval surface is built around a narrower but very important question:
**what changes when a model is pushed to conclude too early, too strongly, or beyond the evidence boundary?**
This is not a universal benchmark page.
It is a visible comparison layer built around the current public WFGY 4.0 governance stress surface.

---
## 🧭 How to read this page
A good rerun is not just one where the AFTER answer looks more careful; it should make at least one of these shifts visible:
- less illegal commitment
- less evidence-boundary crossing
- less single-cause compression
- less contradiction suppression
- more lawful downgrade
- stronger preservation of still-live competing explanations
The right reading lens is not **softer vs louder** but **more lawful vs more premature**.

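
The six shifts above can double as a small manual rubric. A minimal sketch in Python, assuming a reader tallies by hand which shifts they observed in their own rerun; the `score_rerun` helper and its one-shift threshold are illustrative, not part of the published WFGY 4.0 surface:

```python
# Hypothetical manual-scoring rubric for a BEFORE/AFTER rerun.
# The shift names mirror the list above; the tallying logic is illustrative.

SHIFTS = [
    "less illegal commitment",
    "less evidence-boundary crossing",
    "less single-cause compression",
    "less contradiction suppression",
    "more lawful downgrade",
    "stronger preservation of still-live competing explanations",
]

def score_rerun(observed):
    """Return which listed shifts were observed, and whether the rerun
    counts as 'good' (at least one shift visible, as defined above)."""
    visible = [s for s in SHIFTS if s in observed]
    return {"visible": visible, "good_rerun": len(visible) >= 1}

# Example: a reader who saw two of the shifts in their own rerun.
print(score_rerun({"more lawful downgrade", "less illegal commitment"})["good_rerun"])  # True
```

The point of the threshold is the same as the prose above: one clearly visible lawful shift is enough to count a rerun as informative; perfection on all six is not required.
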
---
## 📌 What this page is showing
The screenshots on this page are drawn from the same public release surface:
- the public Twin Atlas runtime TXT
- the public governance stress suite TXT
- model-specific public runs
- the current screenshot layer
- the current results-summary layer
That means this page should be read as a visible proof surface, not as a one-off visual anecdote.
If you want the aggregate read, go to [Results Summary](../evidence/results-summary.md).
If you want the original outputs, go to [Raw Runs](../evidence/raw-runs/).
If you want to test the same public surface yourself, go to [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md).

---
## 🖼️ Current Public Gallery
### ChatGPT
**Why it matters:**
A strong public example of visible lawful downgrade without collapsing into a blanket stop system.
**Use this if you want to see:**
How a strong default assistant shifts from premature closure toward authorization-aware restraint.
- [View raw run](../evidence/raw-runs/chatgpt.txt)
- [Open screenshot](./screenshots/chatgpt_before-after.png)
---
### Claude
**Why it matters:**
One of the clearest public examples of ambiguity preservation and conflict-preserving output under pressure.
**Use this if you want to see:**
How WFGY 4.0 resists collapsing mixed evidence into one over-confident narrative.
- [View raw run](../evidence/raw-runs/claude.txt)
- [Open screenshot](./screenshots/claude_before-after.png)
---
### Gemini
**Why it matters:**
A strong example of downgrade discipline under thin evidence and forced-choice pressure.
**Use this if you want to see:**
How a model stops treating pressure as permission to conclude.
- [View raw run](../evidence/raw-runs/gemini.txt)
- [Open screenshot](./screenshots/gemini_before-after.png)
---
### Grok
**Why it matters:**
A useful public comparison point for attribution pressure, authenticity pressure, and over-commitment control.
**Use this if you want to see:**
How visible output strength changes when route and authorization are no longer allowed to collapse.
- [View raw run](../evidence/raw-runs/grok.txt)
- [Open screenshot](./screenshots/grok_before-after.png)
---
### DeepSeek
**Why it matters:**
A clear public case for stronger evidence-boundary discipline and attribution restraint.
**Use this if you want to see:**
How the same cases change when the model is no longer allowed to turn circumstantial pressure into hard public naming.
- [View raw run](../evidence/raw-runs/deepseek.txt)
- [Open screenshot](./screenshots/deepseek_before-after.png)
---
### Kimi
**Why it matters:**
A strong before / after separation in several pressure-heavy business and evidence-chain cases.
**Use this if you want to see:**
How governance framing changes the output shape in a way that is easy to inspect from screenshots first.
- [View raw run](../evidence/raw-runs/kimi.txt)
- [Open screenshot](./screenshots/kimi_before-after.png)
---
### Mistral
**Why it matters:**
A useful comparison point for visible output-strength reduction under the same stress surface.
**Use this if you want to see:**
How governance discipline alters the public answer profile even when the model family differs.
- [View raw run](../evidence/raw-runs/mistral.txt)
- [Open screenshot](./screenshots/mistral_before-after.png)
---
### Perplexity
**Why it matters:**
An important public outlier.
**Use this if you want to see:**
Why this page is a proof surface instead of a polished promo wall. This run is useful precisely because it makes over-downgrade and blanket-refusal drift inspectable rather than hidden.
- [View raw run](../evidence/raw-runs/perplexity.txt)
- [Open screenshot](./screenshots/perplexity_before-after.png)
---
## 🔍 What changed most across the public screenshot layer
Across the current public runs, the most consistent visible shift is not “the answer got nicer.”
The more important shift is that the AFTER pass becomes less willing to:
- convert a plausible route into an authorized conclusion
- treat appearance as if it were proof
- erase live ambiguity just to satisfy pressure
- compress multi-factor situations into one exact cause
- speak above the lawful output ceiling just because the user demanded it
That is the real use of this page.
It is not here to prove that every model became perfect.
It is here to show that the current public WFGY 4.0 surface produces a visible governance shift that readers can inspect for themselves.

---
## 🧪 Want to run the same public surface yourself?
Use these two files:
- [Twin Atlas Runtime TXT](./prompts/wfgy-4_0-twin-atlas-runtime.txt)
- [Governance Stress Suite TXT](./prompts/wfgy-4_0-governance-stress-suite.txt)
Then follow:
- [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md)
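
If you prefer to assemble the rerun prompt programmatically rather than by pasting, a minimal sketch is below. It assumes both TXT files have been downloaded locally; the local file paths and the `build_prompt` helper are illustrative, and the canonical procedure remains the one in the rerun guide:

```python
# Illustrative only: join the downloaded runtime TXT and the stress suite
# TXT into one prompt for a fresh chat session, runtime first.
from pathlib import Path

def build_prompt(runtime_path, suite_path):
    """Read both public TXT files and concatenate them, runtime first."""
    runtime = Path(runtime_path).read_text(encoding="utf-8")
    suite = Path(suite_path).read_text(encoding="utf-8")
    return runtime.strip() + "\n\n" + suite.strip()
```

A reasonable precaution (an assumption, not a stated requirement of the guide) is to keep the BEFORE and AFTER passes in separate sessions, so the runtime cannot leak into the baseline run.
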
---
## 🧭 Where to go next
### If you want the aggregate interpretation
- [Results Summary](../evidence/results-summary.md)
### If you want the original model wording
- [Raw Runs](../evidence/raw-runs/)
### If you want the shortest rerun path
- [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md)
### If you want the flagship example cases
- [Flagship Cases](../evidence/flagship-cases.md)