# πŸ“Š AI Eval

> A screenshot-first public proof surface for the WFGY 4.0 Twin Atlas Engine.

This page exists for one simple reason: **some readers should not have to read the whole engine first just to see whether the governance shift is real.**

The current AI Eval surface is built around a narrower but very important question: **what changes when a model is pushed to conclude too early, too strongly, or beyond the evidence boundary?**

This is not a universal benchmark page. It is a visible comparison layer built around the current public WFGY 4.0 governance stress surface.

---

## 🧭 How to read this page

A good rerun is not just one where the AFTER answer looks more careful. A good rerun should make at least one of these shifts visible:

- less illegal commitment
- less evidence-boundary crossing
- less single-cause compression
- less contradiction suppression
- more lawful downgrade
- stronger preservation of still-live competing explanations

The right reading lens is:

**not softer vs louder, but more lawful vs more premature**

---

## πŸ“Œ What this page is showing

The screenshots on this page are drawn from the same public release surface:

- the public Twin Atlas runtime TXT
- the public governance stress suite TXT
- model-specific public runs
- the current screenshot layer
- the current results-summary layer

That means this page should be read as a visible proof surface, not as a one-off visual anecdote.

- If you want the aggregate read, go to [Results Summary](../evidence/results-summary.md).
- If you want the original outputs, go to [Raw Runs](../evidence/raw-runs/).
- If you want to test the same public surface yourself, go to [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md).

---

## πŸ–ΌοΈ Current Public Gallery

### ChatGPT

**Why it matters:** A strong public example of visible lawful downgrade without collapsing into a blanket stop system.
**Use this if you want to see:** How a strong default assistant shifts from premature closure toward authorization-aware restraint.

- [View raw run](../evidence/raw-runs/chatgpt.txt)
- [Open screenshot](./screenshots/chatgpt_before-after.png)

---

### Claude

**Why it matters:** One of the clearest public examples of ambiguity preservation and conflict-preserving output under pressure.

**Use this if you want to see:** How WFGY 4.0 resists collapsing mixed evidence into one over-confident narrative.

- [View raw run](../evidence/raw-runs/claude.txt)
- [Open screenshot](./screenshots/claude_before-after.png)

---

### Gemini

**Why it matters:** A strong example of downgrade discipline under thin evidence and forced-choice pressure.

**Use this if you want to see:** How a model stops treating pressure as permission to conclude.

- [View raw run](../evidence/raw-runs/gemini.txt)
- [Open screenshot](./screenshots/gemini_before-after.png)

---

### Grok

**Why it matters:** A useful public comparison point for attribution pressure, authenticity pressure, and over-commitment control.

**Use this if you want to see:** How visible output strength changes when route and authorization are no longer allowed to collapse.

- [View raw run](../evidence/raw-runs/grok.txt)
- [Open screenshot](./screenshots/grok_before-after.png)

---

### DeepSeek

**Why it matters:** A clear public case for stronger evidence-boundary discipline and attribution restraint.

**Use this if you want to see:** How the same cases change when the model is no longer allowed to turn circumstantial pressure into hard public naming.

- [View raw run](../evidence/raw-runs/deepseek.txt)
- [Open screenshot](./screenshots/deepseek_before-after.png)

---

### Kimi

**Why it matters:** A strong before / after separation in several pressure-heavy business and evidence-chain cases.

**Use this if you want to see:** How governance framing changes the output shape in a way that is easy to inspect from screenshots first.
- [View raw run](../evidence/raw-runs/kimi.txt)
- [Open screenshot](./screenshots/kimi_before-after.png)

---

### Mistral

**Why it matters:** A useful comparison point for visible output-strength reduction under the same stress surface.

**Use this if you want to see:** How governance discipline alters the public answer profile even when the model family differs.

- [View raw run](../evidence/raw-runs/mistral.txt)
- [Open screenshot](./screenshots/mistral_before-after.png)

---

### Perplexity

**Why it matters:** An important public outlier.

**Use this if you want to see:** Why this page is a proof surface instead of a polished promo wall. This run is useful precisely because it makes over-downgrade and blanket-refusal drift inspectable rather than hidden.

- [View raw run](../evidence/raw-runs/perplexity.txt)
- [Open screenshot](./screenshots/perplexity_before-after.png)

---

## πŸ” What changed most across the public screenshot layer

Across the current public runs, the most consistent visible shift is not β€œthe answer got nicer.” The more important shift is that the AFTER pass becomes less willing to:

- convert a plausible route into an authorized conclusion
- treat appearance as if it were proof
- erase live ambiguity just to satisfy pressure
- compress multi-factor situations into one exact cause
- speak above the lawful output ceiling just because the user demanded it

That is the real use of this page. It is not here to prove that every model became perfect. It is here to show that the current public WFGY 4.0 surface produces a visible governance shift that readers can inspect for themselves.
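The "less willing to" patterns can also be read as a simple before/after checklist when inspecting a screenshot pair. The following is an illustrative sketch only: the flag strings and the `score_rerun` helper are hypothetical and are not part of the published WFGY surface or its stress suite.

```python
# Hypothetical reading aid for a BEFORE/AFTER screenshot pair.
# Flag names below are illustrative, not part of any WFGY spec.
GOVERNANCE_FLAGS = [
    "converts plausible route into authorized conclusion",
    "treats appearance as proof",
    "erases live ambiguity under pressure",
    "compresses multi-factor situation into one exact cause",
    "speaks above the lawful output ceiling on demand",
]

def score_rerun(before_flags: set, after_flags: set) -> dict:
    """Compare which failure patterns disappear (or appear) between
    the BEFORE and AFTER passes of the same case."""
    removed = before_flags - after_flags      # patterns the AFTER pass dropped
    introduced = after_flags - before_flags   # new failure patterns (should be empty)
    return {
        "removed": sorted(removed),
        "introduced": sorted(introduced),
        # a "good rerun" in this sketch: at least one shift, no regressions
        "visible_shift": len(removed) > 0 and len(introduced) == 0,
    }
```

A reviewer could fill `before_flags` and `after_flags` by hand while reading a screenshot pair; the point of the sketch is only that "better" here means fewer failure patterns, not softer wording.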
---

## πŸ§ͺ Want to run the same public surface yourself?

Use these two files:

- [Twin Atlas Runtime TXT](./prompts/wfgy-4_0-twin-atlas-runtime.txt)
- [Governance Stress Suite TXT](./prompts/wfgy-4_0-governance-stress-suite.txt)

Then follow:

- [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md)

---

## 🧭 Where to go next

### If you want the aggregate interpretation

- [Results Summary](../evidence/results-summary.md)

### If you want the original model wording

- [Raw Runs](../evidence/raw-runs/)

### If you want the shortest rerun path

- [Reproduce in 60 Seconds](./reproduce-in-60-seconds.md)

### If you want the flagship example cases

- [Flagship Cases](../evidence/flagship-cases.md)
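As a footnote to the two-file rerun path above ("Want to run the same public surface yourself"), one way to stitch the runtime TXT and a single stress case into a paste-ready prompt can be sketched as below. The helper name, the separator line, and the file-reading pattern are all assumptions for illustration; the canonical procedure is the one in Reproduce in 60 Seconds.

```python
def build_rerun_prompt(runtime_txt: str, stress_case: str) -> str:
    """Combine the Twin Atlas runtime TXT with one governance stress case
    into a single paste-ready prompt. Layout is illustrative only."""
    return (
        runtime_txt.strip()
        + "\n\n--- STRESS CASE ---\n\n"   # hypothetical separator, not from the suite
        + stress_case.strip()
    )

# Typical use (paths relative to this page; uncomment after downloading both files):
# with open("prompts/wfgy-4_0-twin-atlas-runtime.txt") as f:
#     runtime = f.read()
# with open("prompts/wfgy-4_0-governance-stress-suite.txt") as f:
#     suite = f.read()
# print(build_rerun_prompt(runtime, suite))
```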