mirror of https://github.com/onestardao/WFGY.git
synced 2026-04-28 03:29:51 +00:00

Update README.md

parent 1739894d24
commit 78c6f9f772

1 changed file with 102 additions and 41 deletions
@@ -16,11 +16,11 @@ Use: When a user asks about TU Q124 oversight experiments or wants runnable

 _Status: work in progress. This page records early MVP experiments and may change as the TU Q124 program evolves._

-> This page documents the first effective layer MVP experiments for TU Q124
+> This page documents the first effective-layer MVP experiments for TU Q124
 > on scalable oversight and evaluation.
 > It does not claim that Q124 is solved as a mathematical problem
 > or as a full benchmark.
-> The scripts here are small and fully inspectable. You can re run them with your own
+> The scripts here are small and fully inspectable. You can re-run them with your own
 > OpenAI API key to reproduce the qualitative patterns, but the exact numbers will drift.

 ---
@@ -32,6 +32,26 @@ _Status: work in progress. This page records early MVP experiments and may chang

 ---

+## Quick start (Colab)
+
+You can run the exact notebook used for this MVP directly in Colab:
+
+[](https://colab.research.google.com/github/onestardao/WFGY/blob/main/TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb)
+
+The notebook is completely self-contained:
+
+- It prints the same header text that you see on this page.
+- It first looks for an `OPENAI_API_KEY` environment variable.
+- If no key is found, it will **ask for a key only if you actually want to run the experiment**.
+- If you do not provide a key, it stops with a clear message and points back to this README.
+
+You can therefore:
+
+- treat it as a pure reading / inspection artifact, **or**
+- paste an API key once and reproduce the experiment end-to-end.
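
The key-handling order described in the Quick start bullets can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `resolve_api_key` and `require_key`, their arguments, and the message strings are all hypothetical. Only the `OPENAI_API_KEY` variable name and the overall order (environment first, prompt only on request, clear stop otherwise) come from the text above.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    want_to_run: bool = True,
    ask: Callable[[str], str] = getpass,
) -> Optional[str]:
    """Return an API key, or None if the notebook should stay read-only.

    Hypothetical helper: names and messages are illustrative assumptions.
    """
    # 1. The environment variable takes priority.
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # 2. Never prompt a reader who only wants to inspect the notebook.
    if not want_to_run:
        return None
    # 3. Otherwise ask once; an empty answer means "no key".
    key = ask("Paste an OpenAI API key (leave empty to skip): ").strip()
    return key or None


def require_key(key: Optional[str]) -> str:
    """Stop with a clear message when no key is available."""
    if key is None:
        raise SystemExit("No API key provided. See the README for a read-only walkthrough.")
    return key
```

Passing `env` and `ask` as parameters keeps the sketch testable without touching the real environment or blocking on input.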
+
+---
+
 ## 0. What this page is about

 TU Q124 treats "scalable oversight and evaluation" as a tension problem between three elements:
@@ -44,7 +64,7 @@ This MVP does not try to cover the full Q124 program.

 Instead, it focuses on a narrow and fully inspectable slice:

-- A finite set of synthetic "worlds" or task clusters where evaluation is non trivial.
+- A finite set of synthetic "worlds" or task clusters where evaluation is non-trivial.
 - Two evaluation modes that operate on the same worlds:

 - a baseline evaluator that uses a short, underspecified rubric,
@@ -64,36 +84,36 @@ The goal of this MVP is to show that even in very small toy worlds:

 ## 1. Experiment A: toy oversight ladders on synthetic tasks

-This is the main level 1 MVP for Q124. It is intentionally small and easy to audit.
+This is the main level-1 MVP for Q124. It is intentionally small and easy to audit.

 ### 1.1 Research question

 In a small set of synthetic oversight worlds:

 - Can we define a scalar tension observable `T_oversight` that increases when
-the evaluation layer is clearly out of its depth relative to the underlying task difficulty.
+the evaluation layer is clearly out of its depth relative to the underlying task difficulty?
 - When we compare a baseline evaluator and a guided evaluator on the same worlds:

-- Do we see different error rates `B_baseline` and `B_guided`.
-- Do we see a consistent shift in the tension profiles.
-- Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance.
+- Do we see different error rates `B_baseline` and `B_guided`?
+- Do we see a consistent shift in the tension profiles?
+- Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance?

-In effective layer language:
+In effective-layer language:

 > Does a simple tension geometry for oversight let us see, in a reproducible way,
-> where naive evaluation is likely to fail, before we look at long term metrics.
+> where naive evaluation is likely to fail, before we look at long-term metrics?

 ### 1.2 Setup

 Experiment A uses:

-- A finite set of `SCENARIOS` with size between 5 and 12.
+- A finite set of `SCENARIOS` (in the current MVP, 8 cases).
 Each scenario corresponds to a small batch of tasks that must be evaluated.
 Every scenario carries:

-- a short category label such as `easy_aligned`, `subtle_failure`, `adversarial`,
-- a free text description used by the evaluators,
-- reference values such as an effective "difficulty" or OOD measure `delta_ref`,
+- a short category label such as `easy_math_correct`, `safety_violation`, `bias_stereotype`,
+- a free-text description used by the evaluators,
+- a reference "difficulty" or OOD measure `delta_ref`,
 - a ground truth quality scalar `rule_score` in `[0, 1]`.

 - Two evaluation modes for every scenario:
@@ -101,18 +121,19 @@ Experiment A uses:
 - `baseline` mode:

 - uses a minimal rubric with a few lines of instruction,
-- sees the scenario description and example outputs.
+- sees the scenario description and the model answer,
+- must very quickly output a label and a coarse score.

 - `guided` mode:

 - receives the same inputs as baseline,
-- plus extra structured checks such as explicit sub questions,
-- or a step by step evaluation template.
+- plus a more structured rubric that explicitly separates correctness, safety and fairness,
+- then compresses this back into a label and a score.

-- A judge that turns raw evaluation outputs into:
+- Each mode directly returns:

-- a discrete label `label in {CORRECT, INCORRECT}`,
-- a confidence or quality score `score in [0, 1]`.
+- a discrete label `label in {GOOD, BAD}`,
+- a quality score `score in [0, 1]`.

 All tasks and rubrics are synthetic and are defined directly in the notebook.
 There are no external datasets.
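
One way to picture the per-scenario fields and the per-mode outputs listed in the Setup section is as two small record types. This is only a sketch under stated assumptions: the class names `Scenario` and `Evaluation`, the validation helper, and in particular the 0.5 threshold used to turn `rule_score` into a reference label are hypothetical and are not taken from the notebook. The field names `category`, `delta_ref`, `rule_score`, `label`, `score` and the label set `{GOOD, BAD}` come from the text.

```python
from dataclasses import dataclass

LABELS = ("GOOD", "BAD")
MODES = ("baseline", "guided")


def _check_unit_interval(name: str, value: float) -> None:
    """Raise if a score falls outside [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must lie in [0, 1], got {value}")


@dataclass(frozen=True)
class Scenario:
    category: str      # short label, e.g. "easy_math_correct"
    description: str   # free-text description shown to the evaluators
    delta_ref: float   # reference difficulty / OOD measure
    rule_score: float  # ground truth quality in [0, 1]

    def __post_init__(self) -> None:
        _check_unit_interval("rule_score", self.rule_score)

    def rule_label(self, threshold: float = 0.5) -> str:
        # ASSUMPTION: the 0.5 cut-off is illustrative only.
        return "GOOD" if self.rule_score >= threshold else "BAD"


@dataclass(frozen=True)
class Evaluation:
    mode: str     # "baseline" or "guided"
    label: str    # "GOOD" or "BAD"
    score: float  # quality score in [0, 1]

    def __post_init__(self) -> None:
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode}")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        _check_unit_interval("score", self.score)

    def agrees_with(self, scenario: Scenario) -> bool:
        """True when the evaluator label matches the reference label."""
        return self.label == scenario.rule_label()
```

Validating at construction time means a malformed evaluator reply fails loudly instead of silently skewing the error rates.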
@@ -120,8 +141,8 @@ There are no external datasets.
 The notebook only uses:

 - Python standard library,
-- `numpy`, `pandas`, `matplotlib`,
-- `openai` SDK if a live LLM evaluator is used.
+- `pandas`, `matplotlib`,
+- `openai` SDK when a live LLM evaluator is used.

 The code is written so that:
@@ -139,6 +160,7 @@ After one full run of the notebook, we obtain:
 - `category`
 - `delta_ref`
 - `rule_score`
+- `rule_label`

 and for each evaluation mode `<mode> in {baseline, guided}`:
@@ -155,37 +177,65 @@ After one full run of the notebook, we obtain:
 - `B_guided` guided error rate,
 - `delta_B = B_baseline - B_guided`,
 - an aggregate tension contrast `rho_tension` that summarizes how far apart
-the two tension profiles are.
+the two tension profiles are,
+- `B_arb` and `T_mean_*` for a simple arbiter that always picks the lower-tension mode.
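
The summary quantities listed above can be sketched on plain per-case records. This is a hedged illustration, not the notebook's implementation: the record keys (`T_baseline`, `T_guided`, `baseline_correct`, `guided_correct`), the function name `summarize`, and the tie-break toward baseline are assumptions. How `T_oversight` itself is computed, and how `rho_tension` is defined, are not shown in this excerpt, so tensions are taken as given inputs and `rho_tension` is left out.

```python
from statistics import mean
from typing import Dict, List, Union

Row = Dict[str, Union[float, bool]]  # hypothetical per-case record


def summarize(rows: List[Row]) -> Dict[str, float]:
    """Error rates and mean tensions for baseline, guided and a lower-tension arbiter."""
    if not rows:
        raise ValueError("need at least one case")

    arb_tensions = []
    arb_errors = 0
    for r in rows:
        # Arbiter: pick the mode with the lower tension (ASSUMPTION: ties go to baseline).
        mode = "baseline" if r["T_baseline"] <= r["T_guided"] else "guided"
        arb_tensions.append(r[f"T_{mode}"])
        arb_errors += not r[f"{mode}_correct"]

    n = len(rows)
    b_baseline = sum(not r["baseline_correct"] for r in rows) / n
    b_guided = sum(not r["guided_correct"] for r in rows) / n
    return {
        "B_baseline": b_baseline,
        "B_guided": b_guided,
        "delta_B": b_baseline - b_guided,
        "B_arb": arb_errors / n,
        "T_mean_baseline": mean(r["T_baseline"] for r in rows),
        "T_mean_guided": mean(r["T_guided"] for r in rows),
        "T_mean_arb": mean(arb_tensions),
    }
```

Because the arbiter only reads the two scalar tensions, it can be audited case by case without re-running either evaluator.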

-The target qualitative pattern for a successful MVP is:
+#### Concrete snapshot from one run

-- `B_guided` is not worse than `B_baseline` on this toy set,
-- the guided mode tends to reduce tension on the highest tension scenarios,
-- the arbitration rule that picks the mode with lower `T_oversight`
-matches or beats the better of the two modes on most scenarios.
+On one concrete run using `gpt-4o-mini` for both modes (8 cases), we observed:

-Once we have a stable run, we will paste the concrete numbers here as a short table,
-so that readers can see at a glance what the toy experiment actually does
-without opening the notebook.
+- `B_baseline ≈ 0.125` (1 / 8 cases counted incorrect)
+- `B_guided ≈ 0.250` (2 / 8 cases counted incorrect)
+- `B_arb ≈ 0.125` (arbiter not worse than the better mode)

-For now this section documents the intended structure and observables.
+and mean tensions:
+
+- `T_mean_baseline ≈ 0.218`
+- `T_mean_guided ≈ 0.303`
+- `T_mean_arb ≈ 0.205`
+
+The guided rubric does not automatically dominate the baseline on this tiny set.
+It slightly over-corrects on some cases, but the `T_oversight` geometry still lets
+a simple arbiter pick a mixture of modes that is **no worse than the better one**
+while achieving a slightly lower mean tension.
+
+Below are the corresponding terminal snapshot and tension plot.
+
+
+
+*Per-case summary table. Columns include `rule_score`, `delta_ref`, per-mode labels,
+scores, tensions and correctness flags at the effective layer.*
+
+
+
+*Baseline vs guided `T_oversight` per case. The curves are close on easy cases,
+and diverge modestly on the more difficult or safety-sensitive ones. The arbiter
+operates only on these scalar tensions.*
+
+The target qualitative pattern for a successful MVP is not that guided always wins,
+but that:
+
+- the geometry makes evaluation drift visible on specific cases,
+- cheap arbitration based on `T_oversight` is already competitive with the better mode,
+- everything is small enough that misbehaviour can be audited line by line.

 ### 1.4 How to reproduce

 1. Open the notebook

-- `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`
+- `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`
+- or click the Colab badge at the top of this page.

-2. Provide an OpenAI API key
+2. Provide an OpenAI API key (only if you want to run it)

 - If you already have `OPENAI_API_KEY` set in your environment, the notebook will use it.
 - Otherwise, the first code cell will prompt you to paste an API key once.
-- If you do not want to call a live model, you can still read this README
-and inspect the tension geometry design, but the experiment will not run.
+- If you do **not** want to call a live model, you can still read this README
+and inspect the tension geometry design; the experiment will simply not execute.

 3. Install dependencies if needed

-- `numpy`, `pandas`, `matplotlib`, `openai`
+- `pandas`, `matplotlib`, `openai`

 The notebook includes a single `pip` cell that you can run in a clean Colab runtime.

@@ -193,7 +243,7 @@ For now this section documents the intended structure and observables.

 - the script will define the `SCENARIOS`,
 - run both `baseline` and `guided` evaluators,
-- compute per scenario metrics and tension scores,
+- compute per-scenario metrics and tension scores,
 - assemble the `DataFrame`,
 - print a compact summary block,
 - and draw a simple tension plot.
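
Before running these steps outside Colab, it can help to confirm that the packages named in step 3 (`pandas`, `matplotlib`, `openai`) are importable. The check below is a small sketch using only the standard library. The function name `missing_packages` is hypothetical, and the notebook's own `pip` cell may work differently.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Package names taken from the dependency list in this README.
REQUIRED = ("pandas", "matplotlib", "openai")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

`find_spec` only locates a package without importing it, so the check stays fast and has no side effects.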
@@ -245,9 +295,9 @@ Instead, it gives a concrete example of:
 - how to compare different oversight designs by looking at both error rates
 and tension profiles.

-The same pattern can be reused across other S class problems in this pack:
+The same pattern can be reused across other S-class problems in this pack:

-- in some problems, the "worlds" are scientific projects or long horizon policies,
+- in some problems, the "worlds" are scientific projects or long-horizon policies,
 - in others, they are synthetic AI tasks or games,
 - but in all cases the oversight layer is treated as a system
 with its own tension geometry, not as a black box.
@@ -266,6 +316,17 @@ This MVP should be read together with the core Tension Universe charters.
 - [TU Encoding and Fairness Charter](../../Charters/TU_ENCODING_AND_FAIRNESS_CHARTER.md)
 - [TU Tension Scale Charter](../../Charters/TU_TENSION_SCALE_CHARTER.md)

-These charters define how effective layer claims, encodings and tension scales are supposed
+These charters define how effective-layer claims, encodings and tension scales are supposed
 to behave across the whole project. The experiments on this page are written to stay inside
 those boundaries.
+
+---
+
+### Repo link and stars
+
+The full WFGY project, including the Tension Universe experiment pack, lives at:
+
+- https://github.com/onestardao/WFGY
+
+If this experiment or the TU pack is useful to you, a star on the repo makes it easier
+for other researchers to discover and audit the work.