Update README.md

2026-04-28 03:29:51 +00:00 · 2026-02-11 20:50:08 +08:00 · 78c6f9f772
commit 78c6f9f772
parent 1739894d24
1 changed file with 102 additions and 41 deletions
--- a/TensionUniverse/Experiments/Q124_MVP/README.md
+++ b/TensionUniverse/Experiments/Q124_MVP/README.md
@@ -16,11 +16,11 @@ Use: When a user asks about TU Q124 oversight experiments or wants runnable

 _Status: work in progress. This page records early MVP experiments and may change as the TU Q124 program evolves._

-> This page documents the first effective layer MVP experiments for TU Q124
+> This page documents the first effective-layer MVP experiments for TU Q124
 > on scalable oversight and evaluation.
 > It does not claim that Q124 is solved as a mathematical problem
 > or as a full benchmark.
-> The scripts here are small and fully inspectable. You can re run them with your own
+> The scripts here are small and fully inspectable. You can re-run them with your own
 > OpenAI API key to reproduce the qualitative patterns, but the exact numbers will drift.

 ---
@@ -32,6 +32,26 @@ _Status: work in progress. This page records early MVP experiments and may chang

 ---

+## Quick start (Colab)
+
+You can run the exact notebook used for this MVP directly in Colab:
+
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/onestardao/WFGY/blob/main/TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb)
+
+The notebook is completely self-contained:
+
+- It prints the same header text that you see on this page.  
+- It first looks for an `OPENAI_API_KEY` environment variable.  
+- If no key is found, it will **ask for a key only if you actually want to run the experiment**.  
+- If you do not provide a key, it stops with a clear message and points back to this README.  
+
+You can therefore:
+
+- treat it as a pure reading / inspection artifact, **or**  
+- paste an API key once and reproduce the experiment end-to-end.
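
The key-handling behaviour listed above could be sketched roughly as follows. This is not the notebook's actual code: the function name `resolve_api_key` and the `want_run` flag are hypothetical, and only the `OPENAI_API_KEY` variable name comes from this page.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    want_run: bool = False,
    prompt: Callable[[str], str] = getpass,
) -> Optional[str]:
    """Return an API key, or None when the experiment should not run.

    Hypothetical helper: the notebook may structure this differently.
    """
    # 1. Prefer a key that is already present in the environment.
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # 2. Only ask for a key if the reader actually wants to run the experiment.
    if not want_run:
        return None
    # 3. Prompt once; an empty answer is treated as "no key".
    key = prompt("Paste your OpenAI API key: ").strip()
    return key or None


if __name__ == "__main__":
    api_key = resolve_api_key(want_run=False)
    if api_key is None:
        print("No API key provided; skipping the live experiment. See the README.")
    else:
        print("API key found; the experiment can run.")
```

Passing `env` and `prompt` as parameters keeps the sketch testable without touching the real environment or blocking on input.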
+
+---
+
 ## 0. What this page is about

 TU Q124 treats "scalable oversight and evaluation" as a tension problem between three elements:
@@ -44,7 +64,7 @@ This MVP does not try to cover the full Q124 program.

 Instead, it focuses on a narrow and fully inspectable slice:

-- A finite set of synthetic "worlds" or task clusters where evaluation is non trivial.  
+- A finite set of synthetic "worlds" or task clusters where evaluation is non-trivial.  
 - Two evaluation modes that operate on the same worlds:

  - a baseline evaluator that uses a short, underspecified rubric,  
@@ -64,36 +84,36 @@ The goal of this MVP is to show that even in very small toy worlds:

 ## 1. Experiment A: toy oversight ladders on synthetic tasks

-This is the main level 1 MVP for Q124. It is intentionally small and easy to audit.
+This is the main level-1 MVP for Q124. It is intentionally small and easy to audit.

 ### 1.1 Research question

 In a small set of synthetic oversight worlds:

 - Can we define a scalar tension observable `T_oversight` that increases when
-  the evaluation layer is clearly out of its depth relative to the underlying task difficulty.  
+  the evaluation layer is clearly out of its depth relative to the underlying task difficulty?  
 - When we compare a baseline evaluator and a guided evaluator on the same worlds:

-  - Do we see different error rates `B_baseline` and `B_guided`.  
-  - Do we see a consistent shift in the tension profiles.  
-  - Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance.
+  - Do we see different error rates `B_baseline` and `B_guided`?  
+  - Do we see a consistent shift in the tension profiles?  
+  - Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance?

-In effective layer language:
+In effective-layer language:

 > Does a simple tension geometry for oversight let us see, in a reproducible way,
-> where naive evaluation is likely to fail, before we look at long term metrics.
+> where naive evaluation is likely to fail, before we look at long-term metrics?

 ### 1.2 Setup

 Experiment A uses:

-- A finite set of `SCENARIOS` with size between 5 and 12.  
+- A finite set of `SCENARIOS` (in the current MVP, 8 cases).  
  Each scenario corresponds to a small batch of tasks that must be evaluated.  
  Every scenario carries:

-  - a short category label such as `easy_aligned`, `subtle_failure`, `adversarial`,  
-  - a free text description used by the evaluators,  
-  - reference values such as an effective "difficulty" or OOD measure `delta_ref`,  
+  - a short category label such as `easy_math_correct`, `safety_violation`, `bias_stereotype`,  
+  - a free-text description used by the evaluators,  
+  - a reference "difficulty" or OOD measure `delta_ref`,  
  - a ground truth quality scalar `rule_score` in `[0, 1]`.

 - Two evaluation modes for every scenario:
@@ -101,18 +121,19 @@ Experiment A uses:
  - `baseline` mode:

    - uses a minimal rubric with a few lines of instruction,  
-    - sees the scenario description and example outputs.
+    - sees the scenario description and the model answer,  
+    - must output a label and a coarse score quickly.

  - `guided` mode:

    - receives the same inputs as baseline,  
-    - plus extra structured checks such as explicit sub questions,  
-    - or a step by step evaluation template.
+    - plus a more structured rubric that explicitly separates correctness, safety and fairness,  
+    - then compresses this back into a label and a score.

-- A judge that turns raw evaluation outputs into:
+- Each mode directly returns:

-  - a discrete label `label in {CORRECT, INCORRECT}`,  
-  - a confidence or quality score `score in [0, 1]`.
+  - a discrete label `label in {GOOD, BAD}`,  
+  - a quality score `score in [0, 1]`.

 All tasks and rubrics are synthetic and are defined directly in the notebook.
 There are no external datasets.
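
As a reading aid, the per-scenario fields and per-mode outputs described above could be modelled with two small dataclasses. This is a minimal sketch, not the notebook's data model: the class names `Scenario` and `EvalResult` and the example values are hypothetical, while the field names and value ranges follow the lists in this section.

```python
from dataclasses import dataclass
from typing import Literal

Label = Literal["GOOD", "BAD"]
Mode = Literal["baseline", "guided"]


def _check_unit_interval(name: str, value: float) -> None:
    """Raise if a scalar that should live in [0, 1] falls outside it."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")


@dataclass(frozen=True)
class Scenario:
    category: str       # e.g. "easy_math_correct", "safety_violation"
    description: str    # free-text description shown to the evaluators
    delta_ref: float    # reference "difficulty" or OOD measure
    rule_score: float   # ground truth quality scalar in [0, 1]

    def __post_init__(self) -> None:
        _check_unit_interval("rule_score", self.rule_score)


@dataclass(frozen=True)
class EvalResult:
    mode: Mode          # which evaluator produced this result
    label: Label        # discrete label, GOOD or BAD
    score: float        # quality score in [0, 1]

    def __post_init__(self) -> None:
        _check_unit_interval("score", self.score)


# Hypothetical example values, not taken from the notebook.
example = Scenario(
    category="safety_violation",
    description="The model answer includes unsafe instructions.",
    delta_ref=0.8,
    rule_score=0.0,
)
result = EvalResult(mode="guided", label="BAD", score=0.1)
```

Validating the `[0, 1]` ranges at construction time makes a malformed scenario fail early instead of silently skewing the tension values.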
@@ -120,8 +141,8 @@ There are no external datasets.
 The notebook only uses:

 - Python standard library,  
-- `numpy`, `pandas`, `matplotlib`,  
-- `openai` SDK if a live LLM evaluator is used.
+- `pandas`, `matplotlib`,  
+- `openai` SDK when a live LLM evaluator is used.

 The code is written so that:

@@ -139,6 +160,7 @@ After one full run of the notebook, we obtain:
  - `category`  
  - `delta_ref`  
  - `rule_score`  
+  - `rule_label`  

  and for each evaluation mode `<mode> in {baseline, guided}`:

@@ -155,37 +177,65 @@ After one full run of the notebook, we obtain:
  - `B_guided` guided error rate,  
  - `delta_B = B_baseline - B_guided`,  
  - an aggregate tension contrast `rho_tension` that summarizes how far apart
-    the two tension profiles are.
+    the two tension profiles are,  
+  - `B_arb` and `T_mean_*` for a simple arbiter that always picks the lower-tension mode.

-The target qualitative pattern for a successful MVP is:
+#### Concrete snapshot from one run

-- `B_guided` is not worse than `B_baseline` on this toy set,  
-- the guided mode tends to reduce tension on the highest tension scenarios,  
-- the arbitration rule that picks the mode with lower `T_oversight`
-  matches or beats the better of the two modes on most scenarios.
+On one concrete run using `gpt-4o-mini` for both modes (8 cases), we observed:

-Once we have a stable run, we will paste the concrete numbers here as a short table,
-so that readers can see at a glance what the toy experiment actually does
-without opening the notebook.
+- `B_baseline ≈ 0.125` (1 / 8 cases counted incorrect)  
+- `B_guided ≈ 0.250` (2 / 8 cases counted incorrect)  
+- `B_arb ≈ 0.125` (arbiter not worse than the better mode)  

-For now this section documents the intended structure and observables.
+and mean tensions:
+
+- `T_mean_baseline ≈ 0.218`  
+- `T_mean_guided ≈ 0.303`  
+- `T_mean_arb ≈ 0.205`  
+
+The guided rubric does not automatically dominate the baseline on this tiny set.
+It slightly over-corrects on some cases, but the `T_oversight` geometry still lets
+a simple arbiter pick a mixture of modes that is **no worse than the better one**
+while achieving a slightly lower mean tension.
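
The arbitration rule and the summary metrics can be illustrated with a small, self-contained sketch. It assumes per-case tensions and correctness flags are already available; the column names (`T_baseline`, `correct_guided`, and so on), the tie-breaking choice and the toy rows are all hypothetical, and the toy numbers are not the values from the run reported above.

```python
from typing import Dict, List, Sequence

MODES = ("baseline", "guided")

# Hypothetical toy rows; NOT the numbers from the reported run.
TOY_ROWS: List[Dict[str, object]] = [
    {"T_baseline": 0.10, "T_guided": 0.12, "correct_baseline": True, "correct_guided": True},
    {"T_baseline": 0.20, "T_guided": 0.40, "correct_baseline": False, "correct_guided": True},
    {"T_baseline": 0.15, "T_guided": 0.35, "correct_baseline": True, "correct_guided": False},
    {"T_baseline": 0.30, "T_guided": 0.25, "correct_baseline": True, "correct_guided": True},
]


def error_rate(correct_flags: Sequence[bool]) -> float:
    """Fraction of cases counted incorrect."""
    return sum(1 for ok in correct_flags if not ok) / len(correct_flags)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def arbitrate(rows: List[Dict[str, object]]) -> List[str]:
    """Pick, per case, the mode with the lower tension.

    Ties go to the first mode in MODES (baseline); that is an assumption.
    """
    return [min(MODES, key=lambda m: row[f"T_{m}"]) for row in rows]


def summarize(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Compute error rates and mean tensions for both modes and the arbiter."""
    picks = arbitrate(rows)
    summary: Dict[str, float] = {}
    for m in MODES:
        summary[f"B_{m}"] = error_rate([row[f"correct_{m}"] for row in rows])
        summary[f"T_mean_{m}"] = mean([row[f"T_{m}"] for row in rows])
    summary["delta_B"] = summary["B_baseline"] - summary["B_guided"]
    # The arbiter inherits the correctness and tension of whichever mode it picked.
    summary["B_arb"] = error_rate([row[f"correct_{p}"] for row, p in zip(rows, picks)])
    summary["T_mean_arb"] = mean([row[f"T_{p}"] for row, p in zip(rows, picks)])
    return summary


if __name__ == "__main__":
    for name, value in summarize(TOY_ROWS).items():
        print(f"{name}: {value:.3f}")
```

Because the arbiter takes the lower tension on every case, its mean tension can never exceed that of either mode; its error rate carries no such guarantee, which is what the experiment is probing.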
+
+Below are the corresponding terminal snapshot and tension plot.
+
+![Q124 per-case summary (baseline vs guided)](./Q124A.png)
+
+*Per-case summary table. Columns include `rule_score`, `delta_ref`, per-mode labels,
+scores, tensions and correctness flags at the effective layer.*
+
+![Q124 baseline vs guided T_oversight per case](./Q124A2.png)
+
+*Baseline vs guided `T_oversight` per case. The curves are close on easy cases,
+and diverge modestly on the more difficult or safety-sensitive ones. The arbiter
+operates only on these scalar tensions.*
+
+The target qualitative pattern for a successful MVP is not that guided always wins,
+but that:
+
+- the geometry makes evaluation drift visible on specific cases,  
+- cheap arbitration based on `T_oversight` is already competitive with the better mode,  
+- everything is small enough that misbehaviour can be audited line by line.

 ### 1.4 How to reproduce

 1. Open the notebook

-   - `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`
+   - `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`  
+   - or click the Colab badge at the top of this page.

-2. Provide an OpenAI API key
+2. Provide an OpenAI API key (only if you want to run it)

   - If you already have `OPENAI_API_KEY` set in your environment, the notebook will use it.  
   - Otherwise, the first code cell will prompt you to paste an API key once.  
-   - If you do not want to call a live model, you can still read this README
-     and inspect the tension geometry design, but the experiment will not run.
+   - If you do **not** want to call a live model, you can still read this README
+     and inspect the tension geometry design; the experiment will simply not execute.

 3. Install dependencies if needed

-   - `numpy`, `pandas`, `matplotlib`, `openai`  
+   - `pandas`, `matplotlib`, `openai`  

   The notebook includes a single `pip` cell that you can run in a clean Colab runtime.

@@ -193,7 +243,7 @@ For now this section documents the intended structure and observables.

   - the script will define the `SCENARIOS`,  
   - run both `baseline` and `guided` evaluators,  
-   - compute per scenario metrics and tension scores,  
+   - compute per-scenario metrics and tension scores,  
   - assemble the `DataFrame`,  
   - print a compact summary block,  
   - and draw a simple tension plot.
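
The flow of the steps above can be sketched without any API calls by swapping the live evaluators for stubs. Everything below is illustrative rather than the notebook's code: the stub evaluators, the `rule_label` threshold of 0.5, the per-mode column names and the placeholder tension `|score - rule_score|` are assumptions, since the actual `T_oversight` formula is defined in the notebook and not reproduced on this page.

```python
from typing import Callable, Dict, List, Tuple

Evaluator = Callable[[str], Tuple[str, float]]  # description -> (label, score)

# Hypothetical toy scenarios; the notebook defines its own SCENARIOS.
SCENARIOS: List[Dict[str, object]] = [
    {"category": "easy_math_correct", "description": "2 + 2 = 4, answered correctly",
     "delta_ref": 0.1, "rule_score": 1.0},
    {"category": "safety_violation", "description": "answer contains unsafe instructions",
     "delta_ref": 0.8, "rule_score": 0.0},
]


def baseline_stub(description: str) -> Tuple[str, float]:
    """Stand-in for the short-rubric evaluator: approves everything."""
    return "GOOD", 0.8


def guided_stub(description: str) -> Tuple[str, float]:
    """Stand-in for the structured-rubric evaluator: flags unsafe answers."""
    if "unsafe" in description:
        return "BAD", 0.2
    return "GOOD", 0.9


def rule_label(rule_score: float, threshold: float = 0.5) -> str:
    """Assumed mapping from rule_score to a reference label."""
    return "GOOD" if rule_score >= threshold else "BAD"


def placeholder_tension(score: float, rule_score: float) -> float:
    """Placeholder only; NOT the notebook's T_oversight definition."""
    return abs(score - rule_score)


def run_experiment(
    scenarios: List[Dict[str, object]],
    evaluators: Dict[str, Evaluator],
) -> List[Dict[str, object]]:
    """Run every evaluator on every scenario and collect one row per scenario."""
    rows = []
    for sc in scenarios:
        row = dict(sc)
        row["rule_label"] = rule_label(sc["rule_score"])
        for mode, evaluate in evaluators.items():
            label, score = evaluate(sc["description"])
            row[f"label_{mode}"] = label
            row[f"score_{mode}"] = score
            row[f"T_{mode}"] = placeholder_tension(score, sc["rule_score"])
            row[f"correct_{mode}"] = label == row["rule_label"]
        rows.append(row)
    return rows


rows = run_experiment(SCENARIOS, {"baseline": baseline_stub, "guided": guided_stub})

if __name__ == "__main__":
    for row in rows:
        print(row["category"], row["correct_baseline"], row["correct_guided"])
```

A list of row dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to get the per-case table, and replacing the stubs with functions that call a live model leaves the rest of the loop unchanged.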
@@ -245,9 +295,9 @@ Instead, it gives a concrete example of:
 - how to compare different oversight designs by looking at both error rates
  and tension profiles.

-The same pattern can be reused across other S class problems in this pack:
+The same pattern can be reused across other S-class problems in this pack:

-- in some problems, the "worlds" are scientific projects or long horizon policies,  
+- in some problems, the "worlds" are scientific projects or long-horizon policies,  
 - in others, they are synthetic AI tasks or games,  
 - but in all cases the oversight layer is treated as a system
  with its own tension geometry, not as a black box.
@@ -266,6 +316,17 @@ This MVP should be read together with the core Tension Universe charters.
 - [TU Encoding and Fairness Charter](../../Charters/TU_ENCODING_AND_FAIRNESS_CHARTER.md)  
 - [TU Tension Scale Charter](../../Charters/TU_TENSION_SCALE_CHARTER.md)

-These charters define how effective layer claims, encodings and tension scales are supposed
+These charters define how effective-layer claims, encodings and tension scales are supposed
 to behave across the whole project. The experiments on this page are written to stay inside
 those boundaries.
+
+---
+
+### Repo link and stars
+
+The full WFGY project, including the Tension Universe experiment pack, lives at:
+
+- https://github.com/onestardao/WFGY
+
+If this experiment or the TU pack is useful to you, a star on the repo makes it easier
+for other researchers to discover and audit the work.