Update README.md

PSBigBig × MiniPS 2026-02-11 20:50:08 +08:00 committed by GitHub
parent 1739894d24
commit 78c6f9f772


@ -16,11 +16,11 @@ Use: When a user asks about TU Q124 oversight experiments or wants runnable
_Status: work in progress. This page records early MVP experiments and may change as the TU Q124 program evolves._
> This page documents the first effective layer MVP experiments for TU Q124
> This page documents the first effective-layer MVP experiments for TU Q124
> on scalable oversight and evaluation.
> It does not claim that Q124 is solved as a mathematical problem
> or as a full benchmark.
> The scripts here are small and fully inspectable. You can re run them with your own
> The scripts here are small and fully inspectable. You can re-run them with your own
> OpenAI API key to reproduce the qualitative patterns, but the exact numbers will drift.
---
@ -32,6 +32,26 @@ _Status: work in progress. This page records early MVP experiments and may chang
---
## Quick start (Colab)
You can run the exact notebook used for this MVP directly in Colab:
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/onestardao/WFGY/blob/main/TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb)
The notebook is completely self-contained:
- It prints the same header text that you see on this page.
- It first looks for an `OPENAI_API_KEY` environment variable.
- If no key is found, it will **ask for a key only if you actually want to run the experiment**.
- If you do not provide a key, it stops with a clear message and points back to this README.
You can therefore:
- treat it as a pure reading / inspection artifact, **or**
- paste an API key once and reproduce the experiment end-to-end.
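The key-handling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `resolve_api_key`, the `want_to_run` flag and the injected `prompt` callable are all assumptions made for this example. Only the `OPENAI_API_KEY` variable name comes from this page.

```python
from typing import Callable, Mapping, Optional


def resolve_api_key(
    env: Mapping[str, str],
    want_to_run: bool,
    prompt: Callable[[], str],
) -> Optional[str]:
    """Return an API key, or None when the reader only wants to inspect.

    Hypothetical helper: the real notebook may structure this differently.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key  # an existing environment variable is used as-is
    if not want_to_run:
        return None  # never prompt unless a run was actually requested
    key = prompt().strip()
    if not key:
        # Stop with a clear message that points back to the README.
        raise SystemExit("No API key provided. See the README for details.")
    return key
```

In a notebook cell this could be called as `resolve_api_key(os.environ, want_to_run=True, prompt=lambda: getpass.getpass("OpenAI API key: "))`, so that the pasted key is not echoed to the output.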
---
## 0. What this page is about
TU Q124 treats "scalable oversight and evaluation" as a tension problem between three elements:
@ -44,7 +64,7 @@ This MVP does not try to cover the full Q124 program.
Instead, it focuses on a narrow and fully inspectable slice:
- A finite set of synthetic "worlds" or task clusters where evaluation is non trivial.
- A finite set of synthetic "worlds" or task clusters where evaluation is non-trivial.
- Two evaluation modes that operate on the same worlds:
- a baseline evaluator that uses a short, underspecified rubric,
@ -64,36 +84,36 @@ The goal of this MVP is to show that even in very small toy worlds:
## 1. Experiment A: toy oversight ladders on synthetic tasks
This is the main level 1 MVP for Q124. It is intentionally small and easy to audit.
This is the main level-1 MVP for Q124. It is intentionally small and easy to audit.
### 1.1 Research question
In a small set of synthetic oversight worlds:
- Can we define a scalar tension observable `T_oversight` that increases when
the evaluation layer is clearly out of its depth relative to the underlying task difficulty.
the evaluation layer is clearly out of its depth relative to the underlying task difficulty?
- When we compare a baseline evaluator and a guided evaluator on the same worlds:
- Do we see different error rates `B_baseline` and `B_guided`.
- Do we see a consistent shift in the tension profiles.
- Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance.
- Do we see different error rates `B_baseline` and `B_guided`?
- Do we see a consistent shift in the tension profiles?
- Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance?
In effective layer language:
In effective-layer language:
> Does a simple tension geometry for oversight let us see, in a reproducible way,
> where naive evaluation is likely to fail, before we look at long term metrics.
> where naive evaluation is likely to fail, before we look at long-term metrics?
### 1.2 Setup
Experiment A uses:
- A finite set of `SCENARIOS` with size between 5 and 12.
- A finite set of `SCENARIOS` (in the current MVP, 8 cases).
Each scenario corresponds to a small batch of tasks that must be evaluated.
Every scenario carries:
- a short category label such as `easy_aligned`, `subtle_failure`, `adversarial`,
- a free text description used by the evaluators,
- reference values such as an effective "difficulty" or OOD measure `delta_ref`,
- a short category label such as `easy_math_correct`, `safety_violation`, `bias_stereotype`,
- a free-text description used by the evaluators,
- a reference "difficulty" or OOD measure `delta_ref`,
- a ground truth quality scalar `rule_score` in `[0, 1]`.
- Two evaluation modes for every scenario:
@ -101,18 +121,19 @@ Experiment A uses:
- `baseline` mode:
- uses a minimal rubric with a few lines of instruction,
- sees the scenario description and example outputs.
- sees the scenario description and the model answer,
- must quickly output a label and a coarse score.
- `guided` mode:
- receives the same inputs as baseline,
- plus extra structured checks such as explicit sub questions,
- or a step by step evaluation template.
- plus a more structured rubric that explicitly separates correctness, safety and fairness,
- then compresses this back into a label and a score.
- A judge that turns raw evaluation outputs into:
- Each mode directly returns:
- a discrete label `label in {CORRECT, INCORRECT}`,
- a confidence or quality score `score in [0, 1]`.
- a discrete label `label in {GOOD, BAD}`,
- a quality score `score in [0, 1]`.
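One way to picture the records described in this setup is a pair of small dataclasses. The sketch below is an assumption about shape only: the class names `Scenario` and `EvalResult` and the validation rules are hypothetical, while the field names `category`, `delta_ref`, `rule_score`, `label` and `score` follow the list above.

```python
from dataclasses import dataclass

LABELS = ("GOOD", "BAD")
MODES = ("baseline", "guided")


@dataclass(frozen=True)
class Scenario:
    category: str      # short label, e.g. "easy_math_correct"
    description: str   # free-text description shown to the evaluators
    delta_ref: float   # reference difficulty or OOD measure
    rule_score: float  # ground truth quality scalar in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.rule_score <= 1.0:
            raise ValueError("rule_score must lie in [0, 1]")


@dataclass(frozen=True)
class EvalResult:
    mode: str     # "baseline" or "guided"
    label: str    # "GOOD" or "BAD"
    score: float  # quality score in [0, 1]

    def __post_init__(self) -> None:
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")
```

Validating at construction time means a malformed evaluator reply fails loudly before it reaches any metric, which suits an experiment meant to be audited line by line.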
All tasks and rubrics are synthetic and are defined directly in the notebook.
There are no external datasets.
@ -120,8 +141,8 @@ There are no external datasets.
The notebook only uses:
- Python standard library,
- `numpy`, `pandas`, `matplotlib`,
- `openai` SDK if a live LLM evaluator is used.
- `pandas`, `matplotlib`,
- `openai` SDK when a live LLM evaluator is used.
The code is written so that:
@ -139,6 +160,7 @@ After one full run of the notebook, we obtain:
- `category`
- `delta_ref`
- `rule_score`
- `rule_label`
and for each evaluation mode `<mode> in {baseline, guided}`:
@ -155,37 +177,65 @@ After one full run of the notebook, we obtain:
- `B_guided` guided error rate,
- `delta_B = B_baseline - B_guided`,
- an aggregate tension contrast `rho_tension` that summarizes how far apart
the two tension profiles are.
the two tension profiles are,
- `B_arb` and `T_mean_*` for a simple arbiter that always picks the lower-tension mode.
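As a rough illustration of the error-rate observables, the sketch below assumes that a case counts as incorrect when a mode's label differs from `rule_label`. That rule, the helper name `error_rate` and the toy labels are assumptions for this example, not values or code taken from the notebook.

```python
from typing import Sequence


def error_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of cases where the predicted label differs from the reference."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)


# Illustrative labels only; these are NOT the results of any real run.
rule_label = ["GOOD", "BAD", "GOOD", "BAD"]
baseline_label = ["GOOD", "GOOD", "GOOD", "BAD"]  # one mismatch
guided_label = ["GOOD", "BAD", "GOOD", "BAD"]     # no mismatch

B_baseline = error_rate(baseline_label, rule_label)  # 1 / 4
B_guided = error_rate(guided_label, rule_label)      # 0 / 4
delta_B = B_baseline - B_guided                      # positive when guided is better
```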
The target qualitative pattern for a successful MVP is:
#### Concrete snapshot from one run
- `B_guided` is not worse than `B_baseline` on this toy set,
- the guided mode tends to reduce tension on the highest tension scenarios,
- the arbitration rule that picks the mode with lower `T_oversight`
matches or beats the better of the two modes on most scenarios.
On one concrete run using `gpt-4o-mini` for both modes (8 cases), we observed:
Once we have a stable run, we will paste the concrete numbers here as a short table,
so that readers can see at a glance what the toy experiment actually does
without opening the notebook.
- `B_baseline = 0.125` (1 / 8 cases counted incorrect)
- `B_guided = 0.250` (2 / 8 cases counted incorrect)
- `B_arb = 0.125` (the arbiter matches the better mode)
For now this section documents the intended structure and observables.
and mean tensions:
- `T_mean_baseline ≈ 0.218`
- `T_mean_guided ≈ 0.303`
- `T_mean_arb ≈ 0.205`
On this tiny set the guided rubric does not dominate the baseline: its error rate is higher
(2 / 8 vs 1 / 8), and it slightly over-corrects on some cases. Even so, the `T_oversight`
geometry lets a simple arbiter pick a mixture of modes that is **no worse than the better one**
while achieving a slightly lower mean tension.
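The arbiter itself can be sketched in a few lines. The version below assumes per-case tensions and correctness flags are already computed; the dictionary keys (`T_baseline`, `T_guided`, `ok_baseline`, `ok_guided`), the tie-break toward `baseline` and the toy numbers are hypothetical. How `T_oversight` is actually computed is defined in the notebook and is not reproduced here.

```python
from statistics import mean
from typing import Any, List, Mapping, Sequence, Tuple


def arbitrate(cases: Sequence[Mapping[str, Any]]) -> List[str]:
    """Pick, per case, the mode with the lower tension (ties go to baseline)."""
    return [
        "baseline" if c["T_baseline"] <= c["T_guided"] else "guided"
        for c in cases
    ]


def summarize(
    cases: Sequence[Mapping[str, Any]], picks: Sequence[str]
) -> Tuple[float, float]:
    """Return (error rate, mean tension) of the arbiter's chosen modes."""
    errors = sum(not c[f"ok_{m}"] for c, m in zip(cases, picks))
    t_mean = mean(c[f"T_{m}"] for c, m in zip(cases, picks))
    return errors / len(cases), t_mean


# Toy values for illustration only; NOT taken from the run reported above.
cases = [
    {"T_baseline": 0.10, "T_guided": 0.20, "ok_baseline": True, "ok_guided": True},
    {"T_baseline": 0.50, "T_guided": 0.30, "ok_baseline": False, "ok_guided": True},
    {"T_baseline": 0.25, "T_guided": 0.25, "ok_baseline": True, "ok_guided": False},
]
picks = arbitrate(cases)  # ["baseline", "guided", "baseline"]
B_arb, T_mean_arb = summarize(cases, picks)
```

Note that the arbiter never looks at the correctness flags when choosing; they are used only afterwards to score its picks, which matches the description of an arbiter that operates on scalar tensions alone.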
Below are the corresponding terminal snapshot and tension plot.
![Q124 per-case summary (baseline vs guided)](./Q124A.png)
*Per-case summary table. Columns include `rule_score`, `delta_ref`, per-mode labels,
scores, tensions and correctness flags at the effective layer.*
![Q124 baseline vs guided T_oversight per case](./Q124A2.png)
*Baseline vs guided `T_oversight` per case. The curves are close on easy cases,
and diverge modestly on the more difficult or safety-sensitive ones. The arbiter
operates only on these scalar tensions.*
The target qualitative pattern for a successful MVP is not that guided always wins,
but that:
- the geometry makes evaluation drift visible on specific cases,
- cheap arbitration based on `T_oversight` is already competitive with the better mode,
- everything is small enough that misbehaviour can be audited line by line.
### 1.4 How to reproduce
1. Open the notebook
- `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`
- `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`
- or click the Colab badge at the top of this page.
2. Provide an OpenAI API key
2. Provide an OpenAI API key (only if you want to run it)
- If you already have `OPENAI_API_KEY` set in your environment, the notebook will use it.
- Otherwise, the first code cell will prompt you to paste an API key once.
- If you do not want to call a live model, you can still read this README
and inspect the tension geometry design, but the experiment will not run.
- If you do **not** want to call a live model, you can still read this README
and inspect the tension geometry design; the experiment will simply not execute.
3. Install dependencies if needed
- `numpy`, `pandas`, `matplotlib`, `openai`
- `pandas`, `matplotlib`, `openai`
The notebook includes a single `pip` cell that you can run in a clean Colab runtime.
@ -193,7 +243,7 @@ For now this section documents the intended structure and observables.
- the script will define the `SCENARIOS`,
- run both `baseline` and `guided` evaluators,
- compute per scenario metrics and tension scores,
- compute per-scenario metrics and tension scores,
- assemble the `DataFrame`,
- print a compact summary block,
- and draw a simple tension plot.
@ -245,9 +295,9 @@ Instead, it gives a concrete example of:
- how to compare different oversight designs by looking at both error rates
and tension profiles.
The same pattern can be reused across other S class problems in this pack:
The same pattern can be reused across other S-class problems in this pack:
- in some problems, the "worlds" are scientific projects or long horizon policies,
- in some problems, the "worlds" are scientific projects or long-horizon policies,
- in others, they are synthetic AI tasks or games,
- but in all cases the oversight layer is treated as a system
with its own tension geometry, not as a black box.
@ -266,6 +316,17 @@ This MVP should be read together with the core Tension Universe charters.
- [TU Encoding and Fairness Charter](../../Charters/TU_ENCODING_AND_FAIRNESS_CHARTER.md)
- [TU Tension Scale Charter](../../Charters/TU_TENSION_SCALE_CHARTER.md)
These charters define how effective layer claims, encodings and tension scales are supposed
These charters define how effective-layer claims, encodings and tension scales are supposed
to behave across the whole project. The experiments on this page are written to stay inside
those boundaries.
---
### Repo link and stars
The full WFGY project, including the Tension Universe experiment pack, lives at:
- https://github.com/onestardao/WFGY
If this experiment or the TU pack is useful to you, a star on the repo makes it easier
for other researchers to discover and audit the work.