<!--
AI_NOTE_START
Document role:
This page is the public raw-runs index for the current WFGY 4.0 Twin Atlas evidence surface.
What this page is for:
1. Give readers a clean index into the current model-specific raw runs.
2. Explain why raw runs matter and how they differ from screenshots and results summary.
3. Connect each raw run to the surrounding proof surface.
4. Make the current raw-run layer readable instead of leaving it as a bare file folder.
What this page is not:
1. It is not the aggregate headline page.
2. It is not the screenshot-first gallery.
3. It is not the shortest rerun path.
4. It is not a universal benchmark scoreboard.
Reading strategy:
1. Read the short explanation first.
2. Use the model index to open the raw run you want.
3. Use AI Eval if you want screenshots first.
4. Use Results Summary if you want the aggregate read first.
Important boundary:
Raw runs matter because they preserve original output shape.
They are part of the public proof surface, but they are not a universal benchmark by themselves.
AI_NOTE_END
-->
# 🧾 Raw Runs

> The original model-specific outputs behind the current WFGY 4.0 public proof surface.

Screenshots are useful.
Aggregate summaries are useful.
But raw runs matter for a different reason:
**they preserve the actual output shape.**

That means readers can inspect what each model really said, how it scored the cases, and whether the visible screenshot story matches the original output.

This page exists to make that raw layer readable.

---
## 🧭 Why this page matters

If you only read the screenshots, you see the visual contrast.
If you only read the results summary, you see the aggregate headline.
If you read the raw runs, you see the actual model-specific wording, scoring pattern, and final judgment shape.

That is why raw runs are a critical part of the WFGY 4.0 public evidence surface.

---
## 📂 Current public raw-run index

| Model | Raw run | Best reason to open it |
|---|---|---|
| ChatGPT | [chatgpt.txt](./chatgpt.txt) | strong public example of lawful downgrade without full collapse into blanket refusal |
| Claude | [claude.txt](./claude.txt) | strong example of ambiguity preservation and conflict-sensitive restraint |
| Gemini | [gemini.txt](./gemini.txt) | useful example of thin-evidence downgrade discipline |
| Grok | [grok.txt](./grok.txt) | good for attribution and authenticity pressure comparison |
| DeepSeek | [deepseek.txt](./deepseek.txt) | useful for evidence-boundary tightening and attribution restraint |
| Kimi | [kimi.txt](./kimi.txt) | strong before / after separation in several pressure-heavy cases |
| Mistral | [mistral.txt](./mistral.txt) | useful model-family comparison point for visible governance shift |
| Perplexity | [perplexity.txt](./perplexity.txt) | important public outlier for inspecting over-downgrade or blanket-refusal drift |
| Qwen | [qwen.txt](./qwen.txt) | currently available as a raw-run asset, even though it is not always foregrounded in the main public screenshot layer |
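
If you want a quick local overview of all nine files before opening any single one, a minimal sketch along these lines works. It only assumes the files sit next to this README under the names listed in the table above; nothing else about their format.

```python
# Minimal sketch: print a quick size overview of the raw-run files indexed above.
# Assumes this script runs from the same directory as the .txt files.
from pathlib import Path

RAW_RUNS = [
    "chatgpt.txt", "claude.txt", "gemini.txt", "grok.txt", "deepseek.txt",
    "kimi.txt", "mistral.txt", "perplexity.txt", "qwen.txt",
]

for name in RAW_RUNS:
    path = Path(name)
    if not path.exists():
        print(f"{name:16} missing")
        continue
    text = path.read_text(encoding="utf-8", errors="replace")
    print(f"{name:16} {len(text.splitlines()):5d} lines  {len(text):8d} chars")
```

Line and character counts are only a rough way to see which runs are longest; the content itself still has to be read.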
---
## 🔍 How to use this page

### If you want the screenshot layer first

Use:
- [AI Eval](../../demos/ai-eval.md)

### If you want the aggregate interpretation first

Use:
- [Results Summary](../results-summary.md)

### If you want the shortest rerun path first

Use:
- [Reproduce in 60 Seconds](../../demos/reproduce-in-60-seconds.md)

### If you want the original wording and scoring shape

Stay here and open the raw runs directly.

---
## 📌 What this raw layer is good for

This layer is especially useful if you want to inspect:

- whether the AFTER pass preserved ambiguity instead of just hiding it
- whether a model downgraded lawfully or merely refused everything
- whether the screenshot impression matches the original output
- whether a model-specific run looks representative or idiosyncratic
- whether the public evidence surface is preserving outliers honestly

That last point matters.
A serious governance release should not only preserve its strongest examples.
It should also preserve the runs that expose boundary behavior.
That is part of why the raw-run layer matters.
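
One crude way to triage a run before reading it in full is to count refusal-like phrasing, for example in the Perplexity run flagged above as a possible over-downgrade outlier. The sketch below is only an illustration: the phrase list is a placeholder guess, not the actual wording of any raw run, so treat any count as a prompt to read the file, not as a verdict.

```python
# Minimal sketch: count refusal-like phrasing in one raw-run file.
# The phrase list is a placeholder, not the actual wording of any run;
# replace it with whatever wording you see when you open the file.
from pathlib import Path

REFUSAL_PHRASES = ["i cannot help", "i can't assist", "i will not provide"]  # placeholder guesses

def refusal_count(path: str) -> int:
    text = Path(path).read_text(encoding="utf-8", errors="replace").lower()
    return sum(text.count(phrase) for phrase in REFUSAL_PHRASES)

print("perplexity.txt:", refusal_count("perplexity.txt"))
```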
---
## 🧭 Where to go next

### If you want visible proof first

- [AI Eval](../../demos/ai-eval.md)

### If you want the aggregate signal

- [Results Summary](../results-summary.md)

### If you want the shortest public rerun path

- [Reproduce in 60 Seconds](../../demos/reproduce-in-60-seconds.md)

### If you want the flagship example cases

- [Flagship Cases](../flagship-cases.md)