Create README.md

This commit is contained in: parent eedc21fd7f, commit d914fe3130

1 changed file with 129 additions and 0 deletions

ProblemMap/Twin_Atlas/evidence/raw-runs/README.md (new file, +129)

@@ -0,0 +1,129 @@
<!--
AI_NOTE_START

Document role:
This page is the public raw-runs index for the current WFGY 4.0 Twin Atlas evidence surface.

What this page is for:
1. Give readers a clean index into the current model-specific raw runs.
2. Explain why raw runs matter and how they differ from screenshots and the results summary.
3. Connect each raw run to the surrounding proof surface.
4. Make the current raw-run layer readable instead of leaving it as a bare file folder.

What this page is not:
1. It is not the aggregate headline page.
2. It is not the screenshot-first gallery.
3. It is not the shortest rerun path.
4. It is not a universal benchmark scoreboard.

Reading strategy:
1. Read the short explanation first.
2. Use the model index to open the raw run you want.
3. Use AI Eval if you want screenshots first.
4. Use Results Summary if you want the aggregate read first.

Important boundary:
Raw runs matter because they preserve original output shape.
They are part of the public proof surface, but they are not a universal benchmark by themselves.

AI_NOTE_END
-->
# 🧾 Raw Runs

> The original model-specific outputs behind the current WFGY 4.0 public proof surface.

Screenshots are useful.
Aggregate summaries are useful.
But raw runs matter for a different reason:

**they preserve the actual output shape.**

That means readers can inspect what each model really said, how it scored the cases, and whether the visible screenshot story matches the original output.

This page exists to make that raw layer readable.

---
## 🧭 Why this page matters

If you only read screenshots, you see the visual contrast.

If you only read the results summary, you see the aggregate headline.

If you read raw runs, you see the actual model-specific wording, scoring pattern, and final judgment shape.

That is why raw runs are a critical part of the WFGY 4.0 public evidence surface.

---
## 📂 Current public raw-run index

| Model | Raw run | Best reason to open it |
|---|---|---|
| ChatGPT | [chatgpt.txt](./chatgpt.txt) | strong public example of lawful downgrade without full collapse into blanket refusal |
| Claude | [claude.txt](./claude.txt) | strong example of ambiguity preservation and conflict-sensitive restraint |
| Gemini | [gemini.txt](./gemini.txt) | useful example of thin-evidence downgrade discipline |
| Grok | [grok.txt](./grok.txt) | good for attribution and authenticity pressure comparison |
| DeepSeek | [deepseek.txt](./deepseek.txt) | useful for evidence-boundary tightening and attribution restraint |
| Kimi | [kimi.txt](./kimi.txt) | strong before / after separation in several pressure-heavy cases |
| Mistral | [mistral.txt](./mistral.txt) | useful model-family comparison point for visible governance shift |
| Perplexity | [perplexity.txt](./perplexity.txt) | important public outlier for inspecting over-downgrade or blanket-refusal drift |
| Qwen | [qwen.txt](./qwen.txt) | currently available as a raw run asset even if not always foregrounded in the main public screenshot layer |
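If you would rather skim all nine runs locally instead of opening them one by one, a minimal fetch sketch follows. It assumes the files sit under this directory on the repository's default branch and are reachable through raw.githubusercontent.com; the `main` branch name and URL layout are assumptions, not something this page guarantees.

```python
# Minimal sketch: download the current raw runs for local inspection.
# Assumption: the default branch is "main" and the files live under the
# raw-runs directory shown in this index. Adjust BASE if the repo differs.
import urllib.request
from pathlib import Path

BASE = ("https://raw.githubusercontent.com/onestardao/WFGY/main/"
        "ProblemMap/Twin_Atlas/evidence/raw-runs")
MODELS = ["chatgpt", "claude", "gemini", "grok", "deepseek",
          "kimi", "mistral", "perplexity", "qwen"]

out_dir = Path("raw-runs")
out_dir.mkdir(exist_ok=True)
for model in MODELS:
    # Fetch each model's raw run and save it with its original name.
    text = urllib.request.urlopen(f"{BASE}/{model}.txt").read().decode("utf-8")
    (out_dir / f"{model}.txt").write_text(text, encoding="utf-8")
    print(f"{model}: {len(text.splitlines())} lines")
```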
---
## 🔍 How to use this page

### If you want the screenshot layer first

Use:

- [AI Eval](../../demos/ai-eval.md)

### If you want the aggregate interpretation first

Use:

- [Results Summary](../results-summary.md)

### If you want the shortest rerun path first

Use:

- [Reproduce in 60 Seconds](../../demos/reproduce-in-60-seconds.md)

### If you want the original wording and scoring shape

Stay here and open the raw runs directly.

---
## 📌 What this raw layer is good for

This layer is especially useful if you want to inspect the following (a rough triage sketch follows this list):

- whether the AFTER pass preserved ambiguity instead of just hiding it
- whether a model downgraded lawfully or merely refused everything
- whether the screenshot impression matches the original output
- whether a model-specific run looks representative or idiosyncratic
- whether the public evidence surface is preserving outliers honestly
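None of those checks need tooling, but a crude script can help you decide which run to read closely first. The sketch below is a hedged first pass only: the BEFORE / AFTER labels and the refusal phrases it searches for are assumptions about the plain-text run format, not the project's actual rubric.

```python
# Minimal sketch: rough triage of one downloaded raw run.
# Assumptions: runs are plain text, passes are labeled with the words
# "BEFORE" / "AFTER", and the phrases below are illustrative guesses
# at blanket-refusal wording, not the project's scoring rubric.
import re
from pathlib import Path

REFUSAL_HINTS = ["i can't help", "i cannot help", "unable to assist"]

def rough_triage(path: str) -> None:
    text = Path(path).read_text(encoding="utf-8")
    lines = text.splitlines()
    # Count lines that look like BEFORE / AFTER pass labels.
    markers = [ln.strip() for ln in lines
               if re.match(r"(?i)\s*(before|after)\b", ln)]
    print(f"{path}: {len(lines)} lines, {len(markers)} before/after markers")
    # Count refusal-flavored phrases as a crude blanket-refusal signal.
    lowered = text.lower()
    for hint in REFUSAL_HINTS:
        n = lowered.count(hint)
        if n:
            print(f"  refusal hint {hint!r}: {n} occurrence(s)")

# Perplexity is called out above as the blanket-refusal outlier,
# so it is a natural first file to triage.
rough_triage("raw-runs/perplexity.txt")
```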
That last point matters.

A serious governance release should not only preserve its strongest examples.
It should also preserve the runs that expose boundary behavior.

That is part of why the raw-run layer matters.

---
## 🧭 Where to go next

### If you want visible proof first

- [AI Eval](../../demos/ai-eval.md)

### If you want the aggregate signal

- [Results Summary](../results-summary.md)

### If you want the shortest public rerun path

- [Reproduce in 60 Seconds](../../demos/reproduce-in-60-seconds.md)

### If you want the flagship example cases

- [Flagship Cases](../flagship-cases.md)