<!--
AI_NOTE_START
Document role:
This page is the public raw-runs index for the current WFGY 4.0 Twin Atlas evidence surface.
What this page is for:
1. Give readers a clean index into the current model-specific raw runs.
2. Explain why raw runs matter and how they differ from screenshots and results summary.
3. Connect each raw run to the surrounding proof surface.
4. Make the current raw-run layer readable instead of leaving it as a bare file folder.
What this page is not:
1. It is not the aggregate headline page.
2. It is not the screenshot-first gallery.
3. It is not the shortest rerun path.
4. It is not a universal benchmark scoreboard.
Reading strategy:
1. Read the short explanation first.
2. Use the model index to open the raw run you want.
3. Use AI Eval if you want screenshots first.
4. Use Results Summary if you want the aggregate read first.
Important boundary:
Raw runs matter because they preserve original output shape.
They are part of the public proof surface, but they are not a universal benchmark by themselves.
AI_NOTE_END
-->
# 🧾 Raw Runs

> The original model-specific outputs behind the current WFGY 4.0 public proof surface.

Screenshots are useful.
Aggregate summaries are useful.
But raw runs matter for a different reason:
**they preserve the actual output shape.**

That means readers can inspect what each model really said, how it scored the cases, and whether the visible screenshot story matches the original output.

This page exists to make that raw layer readable.

---
## 🧭 Why this page matters

If you only read the screenshots, you see the visual contrast.
If you only read the results summary, you see the aggregate headline.
If you read the raw runs, you see the actual model-specific wording, scoring pattern, and final judgment shape.

That is why raw runs are a critical part of the WFGY 4.0 public evidence surface.

---
## 📂 Current public raw-run index

| Model | Raw run | Best reason to open it |
|---|---|---|
| ChatGPT | [chatgpt.txt](./chatgpt.txt) | strong public example of lawful downgrade without full collapse into blanket refusal |
| Claude | [claude.txt](./claude.txt) | strong example of ambiguity preservation and conflict-sensitive restraint |
| Gemini | [gemini.txt](./gemini.txt) | useful example of thin-evidence downgrade discipline |
| Grok | [grok.txt](./grok.txt) | good for attribution and authenticity pressure comparison |
| DeepSeek | [deepseek.txt](./deepseek.txt) | useful for evidence-boundary tightening and attribution restraint |
| Kimi | [kimi.txt](./kimi.txt) | strong before / after separation in several pressure-heavy cases |
| Mistral | [mistral.txt](./mistral.txt) | useful model-family comparison point for visible governance shift |
| Perplexity | [perplexity.txt](./perplexity.txt) | important public outlier for inspecting over-downgrade or blanket-refusal drift |
| Qwen | [qwen.txt](./qwen.txt) | currently available as a raw-run asset, even though it is not always foregrounded in the main public screenshot layer |
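
If you want a quick local overview of all nine files before opening any single one, a minimal sketch along these lines works. It only assumes the files sit next to this README under the names listed in the table above; nothing else about their format.

```python
# Minimal sketch: print a quick size overview of the raw-run files indexed above.
# Assumes this script runs from the same directory as the .txt files.
from pathlib import Path

RAW_RUNS = [
    "chatgpt.txt", "claude.txt", "gemini.txt", "grok.txt", "deepseek.txt",
    "kimi.txt", "mistral.txt", "perplexity.txt", "qwen.txt",
]

for name in RAW_RUNS:
    path = Path(name)
    if not path.exists():
        print(f"{name:16} missing")
        continue
    text = path.read_text(encoding="utf-8", errors="replace")
    print(f"{name:16} {len(text.splitlines()):5d} lines  {len(text):8d} chars")
```

Line and character counts are only a rough way to see which runs are longest; the content itself still has to be read.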
---
## 🔍 How to use this page

### If you want the screenshot layer first

Use:
- [AI Eval](../../demos/ai-eval.md)

### If you want the aggregate interpretation first

Use:
- [Results Summary](../results-summary.md)

### If you want the shortest rerun path first

Use:
- [Reproduce in 60 Seconds](../../demos/reproduce-in-60-seconds.md)

### If you want the original wording and scoring shape

Stay here and open the raw runs directly.

---
## 📌 What this raw layer is good for

This layer is especially useful if you want to inspect:

- whether the AFTER pass preserved ambiguity instead of just hiding it
- whether a model downgraded lawfully or merely refused everything
- whether the screenshot impression matches the original output
- whether a model-specific run looks representative or idiosyncratic
- whether the public evidence surface is preserving outliers honestly

That last point matters.
A serious governance release should not only preserve its strongest examples.
It should also preserve the runs that expose boundary behavior.
That is part of why the raw-run layer matters.
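
One crude way to triage a run before reading it in full is to count refusal-like phrasing, for example in the Perplexity run flagged above as a possible over-downgrade outlier. The sketch below is only an illustration: the phrase list is a placeholder guess, not the actual wording of any raw run, so treat any count as a prompt to read the file, not as a verdict.

```python
# Minimal sketch: count refusal-like phrasing in one raw-run file.
# The phrase list is a placeholder, not the actual wording of any run;
# replace it with whatever wording you see when you open the file.
from pathlib import Path

REFUSAL_PHRASES = ["i cannot help", "i can't assist", "i will not provide"]  # placeholder guesses

def refusal_count(path: str) -> int:
    text = Path(path).read_text(encoding="utf-8", errors="replace").lower()
    return sum(text.count(phrase) for phrase in REFUSAL_PHRASES)

print("perplexity.txt:", refusal_count("perplexity.txt"))
```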
---
## 🧭 Where to go next

### If you want visible proof first

- [AI Eval](../../demos/ai-eval.md)

### If you want the aggregate signal

- [Results Summary](../results-summary.md)

### If you want the shortest public rerun path

- [Reproduce in 60 Seconds](../../demos/reproduce-in-60-seconds.md)

### If you want the flagship example cases

- [Flagship Cases](../flagship-cases.md)