diff --git a/ProblemMap/Twin_Atlas/evidence/evidence/results-summary.md b/ProblemMap/Twin_Atlas/evidence/evidence/results-summary.md
new file mode 100644
index 00000000..7e92a479
--- /dev/null
+++ b/ProblemMap/Twin_Atlas/evidence/evidence/results-summary.md
@@ -0,0 +1,325 @@
+<!--
+AI_NOTE_START
+
+Document role:
+This page is the public-facing results summary for the WFGY 4.0 governance evidence surface.
+
+What this page is for:
+1. Give readers the fastest before/after view of what changed under WFGY 4.0.
+2. Show the strongest current directional evidence without forcing readers into full protocol detail first.
+3. Help beginners understand why these results matter in real-world terms.
+4. Route readers toward deeper protocol, cases, and runtime pages.
+
+What this page is not:
+1. It is not a universal benchmark claim.
+2. It is not the full experiment archive.
+3. It is not the raw evaluator log dump.
+4. It is not proof that every model behaves identically under WFGY 4.0.
+5. It is not a replacement for the full methodology page.
+
+Reading order:
+1. Read the main Twin Atlas README first.
+2. Read this page if you want the shortest evidence summary.
+3. Go to Governance Stress Suite for the protocol.
+4. Go to Flagship Cases if you want the strongest story-level examples.
+5. Go to Methodology Boundary if you want the claim limits.
+
+Important boundary:
+This page summarizes a targeted governance stress surface.
+It supports a meaningful public claim about WFGY 4.0 under forced-decision pressure, but it does not claim universal proof across all domains, all models, or all task types.
+
+AI_NOTE_END
+-->
+
+# 📈 WFGY 4.0 Results Summary
+
+> WFGY 4.0 does not make models weaker. It prevents unauthorized conclusions under pressure.
+
+This page is the shortest public answer to one question:
+
+**What actually changes when WFGY 4.0 is applied?**
+
+The short answer is:
+
+Under forced-choice pressure, baseline systems often commit too early, cross evidence boundaries, compress live alternatives into one story, and mistake surface form for proof.  
+With WFGY 4.0 applied, the output usually moves back toward lawful downgrade, ambiguity preservation, and ceiling-respecting release.
+
+This is the core behavioral shift.
+
+---
+
+## 🌍 What these results are really measuring
+
+These results are not about generic intelligence.
+
+They are about a narrower but very important failure class:
+
+- being pushed to choose one answer too early  
+- being pressured to sound final before the evidence is ready  
+- confusing a plausible route with an authorized conclusion  
+- treating polished appearance like proof  
+- smoothing over unresolved contradiction just to stay “helpful”  
+
+That is why these results matter.
+
+WFGY 4.0 is not trying to make a model more dramatic, more cautious, or more verbose.  
+It is trying to change the conditions under which a conclusion is allowed to be released.
+
+---
+
+## ⚡ At a glance
+
+A representative 12-case summarized run currently shows:
+
+| Metric | Before | After | Change |
+|---|---:|---:|---:|
+| Illegal Commitment | 10 | 0 | -10 |
+| Evidence Boundary Violation | 10 | 0 | -10 |
+| Single-Cause Compression | 5 | 0 | -5 |
+| Appearance-as-Evidence Failure | 3 | 0 | -3 |
+| Contradiction Suppression | 7 | 0 | -7 |
+| Lawful Downgrade | 2 | 12 | +10 |
+
+In that same run, the total count across the five failure dimensions drops from **35** to **0**.
+
+That is the most important first impression of this page.
+
+---
+
+## 🧠 What changed in plain language
+
+Before WFGY 4.0, the model often behaved like this:
+
+- “This looks likely, so I should commit.”
+- “The user wants one answer, so I should pick one.”
+- “The details sound professional, so it is probably real.”
+- “There are several factors, but I should compress them into one root cause.”
+- “The evidence is incomplete, but I should still sound decisive.”
+
+After WFGY 4.0, the model more often behaves like this:
+
+- “A route can be plausible without being authorized.”
+- “Competing explanations are still alive, so I cannot pretend they are dead.”
+- “Appearance is not enough to count as proof.”
+- “The evidence ceiling does not allow a stronger answer yet.”
+- “A lawful downgrade is better than an illegal completion.”
+
+That is why the AFTER outputs look different.  
+They are not merely softer. They are more disciplined.
+
+---
+
+## 🔥 The strongest public headline
+
+If you only remember one result, remember this:
+
+**In the current runs, WFGY 4.0 consistently reduces unauthorized commitment under pressure.**
+
+That is the headline.
+
+Not “it answers everything better.”  
+Not “it wins every benchmark.”  
+Not “it turns every model into a perfect reasoner.”
+
+The most defensible headline is simpler and stronger:
+
+**It reduces the chance that pressure, plausibility, or polished appearance gets released as if it were already a lawful conclusion.**
+
+---
+
+## 🧪 Multi-model directional consistency
+
+The evidence is not limited to one run style.
+
+In the currently available model-specific runs, a broad pattern appears repeatedly:
+
+- **ChatGPT**: Illegal Commitment 8 → 0, Evidence Boundary Violation 8 → 0, Lawful Downgrade 0 → 8, Unnecessary Refusal 0 → 0  
+- **Claude**: Illegal Commitment 8 → 0, Evidence Boundary Violation 8 → 0, Lawful Downgrade 0 → 8, Unnecessary Refusal 0 → 0  
+- **Gemini**: Illegal Commitment 8 → 0, Evidence Boundary Violation 8 → 0, Lawful Downgrade 0 → 8, Unnecessary Refusal 0 → 0  
+- **Qwen**: Illegal Commitment 8 → 0, Evidence Boundary Violation 8 → 0, Lawful Downgrade 0 → 8, Unnecessary Refusal 0 → 0  
+- **Grok**: Illegal Commitment 8 → 0, Evidence Boundary Violation 8 → 0, Lawful Downgrade 0 → 8, Unnecessary Refusal 0 → 0  
+
+This does **not** mean every model behaves identically.  
+It means the directional pattern is already strong enough to matter.
+
+The main shared shift is this:
+
+**BEFORE often over-commits. AFTER usually respects the ceiling.**
+
+---
+
+## ⚠️ The important outlier
+
+This page should not hide the outlier.
+
+At least one visible model run shows a different risk shape:
+
+- **Perplexity** still improves legality, but collapses into blanket refusal, with Unnecessary Refusal rising from 0 to 8.
+
+This is important for two reasons.
+
+First, reporting the outlier openly strengthens the credibility of the other results.  
+Second, it tells us something real:
+
+**WFGY 4.0 can improve legality while still being internalized differently by different model families.**
+
+Most current runs suggest lawful downgrade without blanket refusal.  
+At least one outlier shows that stronger governance can also overshoot into “stop everything” behavior.
+
+That is not a reason to hide the evidence.  
+It is part of the real picture.
+
+---
+
+## 🧨 What kinds of failures dropped the most
+
+The current results show the biggest change in a few especially dangerous failure classes.
+
+### 1. Illegal Commitment
+This is the system saying “yes,” naming one person, or selecting one root cause before the evidence has lawfully earned that move.
+
+This is the most important failure class to kill first.
+
+### 2. Evidence Boundary Violation
+This is the system crossing from “plausible” into “treated as established” without enough proof.
+
+This is where a lot of high-risk AI harm begins.
+
+### 3. Single-Cause Compression
+This is the system taking a multi-factor situation and forcing it into one root cause because one story feels cleaner.
+
+This matters a lot in executive, legal, and incident-review contexts.
+
+### 4. Appearance-as-Evidence Failure
+This is the system treating screenshots, polished formatting, logos, expert names, or surface coherence as if they were already proof.
+
+This is one of the most underestimated AI failure modes.
+
+### 5. Contradiction Suppression
+This is the system smoothing over conflict to keep the answer looking unified.
+
+In high-risk domains, that can be worse than saying “not enough yet.”
+
+---
+
+## 🃏 Three flagship case shapes
+
+The strongest public cases are not random.  
+They are easy to explain, high-risk, and immediately legible to ordinary readers.
+
+### 🔐 Security Attribution
+**Before:** a person gets named  
+**After:** `NOT AUTHORIZED TO CONCLUDE`  
+**Why it matters:** suspicious timing is not the same thing as a lawful blame chain
+
+### 💸 Payment Confirmation
+**Before:** payment is treated as confirmed  
+**After:** `EVIDENCE CHAIN NOT SUFFICIENT`  
+**Why it matters:** aligned screenshots and emails are not the same thing as bank-side proof
+
+### 📉 Executive Root Cause
+**Before:** one exact cause is declared  
+**After:** `COMPETING EXPLANATIONS REMAIN LIVE`  
+**Why it matters:** multi-causal business events should not be collapsed into one clean boardroom story
+
+These three case families are the easiest way to understand why WFGY 4.0 matters.
+
+---
+
+## 🏥 Where these results matter most
+
+These results matter most in domains where false certainty is expensive.
+
+That includes:
+
+- medical triage  
+- medication safety  
+- finance and payment confirmation  
+- legal and HR review  
+- security and incident attribution  
+- executive decision-making  
+- public-information and research credibility checks  
+
+In these spaces, the problem is often not that the model is slow.
+
+The problem is that it can sound finished before it is authorized to be finished.
+
+---
+
+## 🧱 What this page does and does not claim
+
+### ✅ This page supports a real claim
+
+It supports the claim that:
+
+**WFGY 4.0 provides a reproducible governance stress surface that exposes a real failure class in modern assistants and shows that a route/authorization split can produce more lawful outputs under pressure.**
+
+### ❌ This page does not claim
+
+- universal benchmark supremacy  
+- identical behavior across all models  
+- full domain completeness  
+- zero future failure  
+- blanket refusal as a desired default outcome  
+
+This page is strongest when it stays inside that boundary.
+
+---
+
+## 🧭 How to read the rest of the evidence section
+
+If this page is the public scoreboard, the other pages answer the obvious next questions:
+
+### 📘 Governance Stress Suite
+Read this if you want to know how the cases and rubric were designed.
+
+### 🟢 Basic Repro Demo
+Read this if you want the fastest reproducible before/after surface.
+
+### 🔵 Advanced Clean Protocol
+Read this if you want the cleaner, more protocol-defensible version.
+
+### 🃏 Flagship Cases
+Read this if you want the strongest story-level examples.
+
+### 🧭 Methodology Boundary
+Read this if you want the honesty page for what these results do and do not prove.
+
+---
+
+## 🚀 Final takeaway
+
+The point of WFGY 4.0 is not to make the model timid.
+
+The point is to stop the model from releasing a stronger conclusion than the current evidence has earned.
+
+That is why these results matter.
+
+They do not merely show “more caution.”  
+They show a meaningful shift in release discipline.
+
+---
+
+## 🔗 Quick Links
+
+### 🏠 Main entry
+- [Twin Atlas README](../README.md)
+
+### 🧭 Family surfaces
+- [Troubleshooting Atlas / Forward Atlas](../../wfgy-ai-problem-map-troubleshooting-atlas.md)
+- [Inverse Atlas](../../Inverse_Atlas/README.md)
+
+### 🧪 Evidence surfaces
+- [Evidence Hub](./README.md)
+- [Governance Stress Suite](./governance-stress-suite.md)
+- [Basic Repro Demo](./basic-repro-demo.md)
+- [Advanced Clean Protocol](./advanced-clean-protocol.md)
+- [Flagship Cases](./flagship-cases.md)
+- [Methodology Boundary](./methodology-boundary.md)
+
+### ⚙️ Engine surfaces
+- [Runtime README](../runtime/README.md)
+- [Bridge README](../Bridge/README.md)
+
+### 🗺️ Next recommended page
+- [Governance Stress Suite](./governance-stress-suite.md)