Update README.md

2026-04-28 03:29:51 +00:00 · 2026-02-11 16:23:42 +08:00 · 2026-02-11 16:23:42 +08:00 · f1f75cc009
commit f1f75cc009
parent d13f60368b
1 changed files with 195 additions and 88 deletions
--- a/TensionUniverse/Experiments/Q124_MVP/README.md
+++ b/TensionUniverse/Experiments/Q124_MVP/README.md
@ -12,153 +12,260 @@ Use: When a user asks about TU Q124 oversight experiments or wants runnable
      examples, load this page first, then follow its notebook links.
 -->

-# TU Q124 MVP: scalable oversight tension
+# TU Q124 MVP: Scalable oversight and evaluation

-_Status: work in progress. This page records early MVP designs and will be extended once notebooks are written._
+_Status: work in progress. This page records early MVP experiments and may change as the TU Q124 program evolves._

-> This page sketches toy experiments for TU Q124.  
-> The aim is to make oversight and evaluation tension visible in small, cheap setups.
+> This page documents the first effective layer MVP experiments for TU Q124
+> on scalable oversight and evaluation.
+> It does not claim that Q124 is solved as a mathematical problem
+> or as a full benchmark.
+> The scripts here are small and fully inspectable. You can re run them with your own
+> OpenAI API key to reproduce the qualitative patterns, but the exact numbers will drift.
+
+---

 **Navigation**

- [← Back to Experiments index](../README.md)
+- [← Back to Experiments index](../README.md)  
 - [← Back to Event Horizon (WFGY 3.0)](../../EventHorizon/README.md)

 ---

 ## 0. What this page is about

-TU Q124 looks at scalable oversight and evaluation.
+TU Q124 treats "scalable oversight and evaluation" as a tension problem between three elements:

-We focus on simple tasks where:
+1. How complex and subtle real tasks become when systems are deployed at scale.  
+2. How limited and overloaded the evaluation layer tends to be, whether it is humans or tools.  
+3. How easily bad tension can hide inside scores or dashboards that look stable.

- a model produces answers,
- multiple evaluation schemes score them,
- and tension arises when cheap metrics disagree with rich oversight.
+This MVP does not try to cover the full Q124 program.

-The MVP experiments use:
+Instead, it focuses on a narrow and fully inspectable slice:

- synthetic tasks with clear ground truth,
- baseline metrics such as accuracy,
- richer metrics that consider hidden constraints.
+- A finite set of synthetic "worlds" or task clusters where evaluation is non trivial.  
+- Two evaluation modes that operate on the same worlds:
+
+  - a baseline evaluator that uses a short, underspecified rubric,  
+  - a guided evaluator that receives additional structured context.
+
+- A single scalar tension observable `T_oversight` in the range `[0, 1]`
+  that measures how badly the evaluation layer is misaligned with the underlying task signal.
+
+The goal of this MVP is to show that even in very small toy worlds:
+
+- we can encode oversight as a state space with explicit observables,  
+- we can define a simple tension functional for the evaluation layer,  
+- we can observe systematic differences between evaluation designs,
+  using both error rates and tension profiles.

 ---

-## 1. Experiment A: cheap metric versus rich metric
+## 1. Experiment A: toy oversight ladders on synthetic tasks
+
+This is the main level 1 MVP for Q124. It is intentionally small and easy to audit.

 ### 1.1 Research question

-In a simple text task, can we define a scalar observable T_oversight that
+In a small set of synthetic oversight worlds:

- is small when cheap metrics and rich oversight agree,
- grows when cheap metrics reward answers that violate hidden constraints.
+- Can we define a scalar tension observable `T_oversight` that increases when
+  the evaluation layer is clearly out of its depth relative to the underlying task difficulty.  
+- When we compare a baseline evaluator and a guided evaluator on the same worlds:
+
+  - Do we see different error rates `B_baseline` and `B_guided`.  
+  - Do we see a consistent shift in the tension profiles.  
+  - Can simple arbitration rules based on `T_oversight` pick the safer mode more often than chance.
+
+In effective layer language:
+
+> Does a simple tension geometry for oversight let us see, in a reproducible way,
+> where naive evaluation is likely to fail, before we look at long term metrics.

 ### 1.2 Setup

-The notebook will:
+Experiment A uses:

- Define a small dataset of questions where:
+- A finite set of `SCENARIOS` with size between 5 and 12.  
+  Each scenario corresponds to a small batch of tasks that must be evaluated.  
+  Every scenario carries:

-  - there is a correct literal answer,
-  - there is also a hidden constraint or safety rule.
+  - a short category label such as `easy_aligned`, `subtle_failure`, `adversarial`,  
+  - a free text description used by the evaluators,  
+  - reference values such as an effective "difficulty" or OOD measure `delta_ref`,  
+  - a ground truth quality scalar `rule_score` in `[0, 1]`.

- Have a model (or simple scripted agent) produce answers.
- Compute:
+- Two evaluation modes for every scenario:

-  - a cheap metric such as exact match accuracy,
-  - a rich oversight metric using a second pass judge that checks constraints.
+  - `baseline` mode:

-Define T_oversight from:
+    - uses a minimal rubric with a few lines of instruction,  
+    - sees the scenario description and example outputs.

- cases where cheap metric is high but rich metric is low,
- misranking between answers under the two metrics.
+  - `guided` mode:

-### 1.3 Expected pattern
+    - receives the same inputs as baseline,  
+    - plus extra structured checks such as explicit sub questions,  
+    - or a step by step evaluation template.

-We expect:
+- A judge that turns raw evaluation outputs into:

- low T_oversight when cheap metrics are aligned with rich oversight,
- higher T_oversight when cheap metrics hide violations or pathologies.
+  - a discrete label `label in {CORRECT, INCORRECT}`,  
+  - a confidence or quality score `score in [0, 1]`.
+
+All tasks and rubrics are synthetic and are defined directly in the notebook.
+There are no external datasets.
+
+The notebook only uses:
+
+- Python standard library,  
+- `numpy`, `pandas`, `matplotlib`,  
+- `openai` SDK if a live LLM evaluator is used.
+
+The code is written so that:
+
+- it first looks for an `OPENAI_API_KEY` environment variable,  
+- if the key is missing, it will ask the user to paste the key interactively,  
+- if no key is provided, it will stop with a clear message and refer back to this README.
+
+### 1.3 Representative results
+
+After one full run of the notebook, we obtain:
+
+- a `DataFrame` where each row is one scenario, with at least the following columns:
+
+  - `scenario_id`  
+  - `category`  
+  - `delta_ref`  
+  - `rule_score`  
+
+  and for each evaluation mode `<mode> in {baseline, guided}`:
+
+  - `<mode>_label`  
+  - `<mode>_score`  
+  - `<mode>_delta_ground`  
+  - `<mode>_delta_outcome`  
+  - `<mode>_tension` (this is `T_oversight` for that mode and scenario)  
+  - `<mode>_is_correct`  
+
+- a summary dictionary with scalar indicators:
+
+  - `B_baseline` baseline error rate,  
+  - `B_guided` guided error rate,  
+  - `delta_B = B_baseline - B_guided`,  
+  - an aggregate tension contrast `rho_tension` that summarizes how far apart
+    the two tension profiles are.
+
+The target qualitative pattern for a successful MVP is:
+
+- `B_guided` is not worse than `B_baseline` on this toy set,  
+- the guided mode tends to reduce tension on the highest tension scenarios,  
+- the arbitration rule that picks the mode with lower `T_oversight`
+  matches or beats the better of the two modes on most scenarios.
+
+Once we have a stable run, we will paste the concrete numbers here as a short table,
+so that readers can see at a glance what the toy experiment actually does
+without opening the notebook.
+
+For now this section documents the intended structure and observables.

 ### 1.4 How to reproduce

-After `Q124_A.ipynb` exists:
+1. Open the notebook

-1. Open the notebook.
-2. Inspect the task, constraints and evaluation functions.
-3. Run the scoring and compute T_oversight across answers or models.
-4. Compare patterns.
+   - `TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb`
+
+2. Provide an OpenAI API key
+
+   - If you already have `OPENAI_API_KEY` set in your environment, the notebook will use it.  
+   - Otherwise, the first code cell will prompt you to paste an API key once.  
+   - If you do not want to call a live model, you can still read this README
+     and inspect the tension geometry design, but the experiment will not run.
+
+3. Install dependencies if needed
+
+   - `numpy`, `pandas`, `matplotlib`, `openai`  
+
+   The notebook includes a single `pip` cell that you can run in a clean Colab runtime.
+
+4. Run all cells from top to bottom
+
+   - the script will define the `SCENARIOS`,  
+   - run both `baseline` and `guided` evaluators,  
+   - compute per scenario metrics and tension scores,  
+   - assemble the `DataFrame`,  
+   - print a compact summary block,  
+   - and draw a simple tension plot.
+
+5. Inspect the outputs
+
+   - The final cell calls:
+
+     ```python
+     results_df, results_summary = run_experiment()
+     plot_tension(results_df)
+     ```
+
+   - You can scroll to inspect the printed table,  
+   - and you can visually compare the two tension curves on the plot.

 ---

-## 2. Experiment B: oversight budget scaling
+## 2. Experiment B: reserved for future extensions

-### 2.1 Research question
+This section is intentionally left light for the first pass.

-Can we see a controlled tradeoff between oversight budget and tension, by defining T_budget that decreases as we allocate more rich oversight under a fixed budget.
+Once Experiment A is stable, Experiment B can host a slightly more advanced variant, for example:

-### 2.2 Setup
+- increasing the number or diversity of scenarios,  
+- adding a third evaluation mode such as "stacked tools" or "committee oversight",  
+- or testing a different definition of `T_oversight` that emphasizes different observables.

-The notebook will:
-
- Assume a fixed number of model outputs to evaluate.
- Define a budget in terms of:
-
-  - number of rich oversight calls allowed,
-  - cost per call.
-
- Implement simple policies such as:
-
-  - random sampling for rich oversight,
-  - risk based sampling guided by cheap metrics.
-
-For each policy compute:
-
- overall error rate under rich oversight,
- T_oversight as in Experiment A,
- an aggregate T_budget that captures residual tension given the budget.
-
-### 2.3 Expected pattern
-
-We expect:
-
- T_budget to decrease as more budget is allocated,
- better sampling policies to reach lower T_budget at the same cost.
-
-### 2.4 How to reproduce
-
-Once `Q124_B.ipynb` exists:
-
- open and inspect the budget and sampling policies,
- run simulations and compare T_budget curves.
+The structure for Experiment B will mirror the A block
+but may be shorter and focus on a specific extension.

 ---

-## 3. How this MVP fits into Tension Universe
+## 3. How this MVP fits into the Tension Universe

-TU Q124 treats scalable oversight as a tension between:
+At the Tension Universe level, Q124 connects several clusters:

- cheap metrics and rich oversight,
- limited budgets and target reliability.
+- AI alignment and control questions (see Q121 and Q122),  
+- interpretability and internal representation questions (Q123),  
+- data quality and truth extraction from synthetic worlds (Q127),  
+- and social oversight structures that come from complex systems and governance.

-This MVP offers:
+This MVP does not try to answer any of the large questions directly.

- a simple metric comparison experiment with T_oversight,
- a budget scaling experiment with T_budget.
+Instead, it gives a concrete example of:

-These are designed as small, re runnable notebooks.
+- how to encode oversight as a finite state space of worlds and modes,  
+- how to define a scalar tension functional for an evaluation layer,  
+- how to compare different oversight designs by looking at both error rates
+  and tension profiles.

-For context:
+The same pattern can be reused across other S class problems in this pack:

- [Experiments index](../README.md)
- [Event Horizon (WFGY 3.0)](../../EventHorizon/README.md)
+- in some problems, the "worlds" are scientific projects or long horizon policies,  
+- in others, they are synthetic AI tasks or games,  
+- but in all cases the oversight layer is treated as a system
+  with its own tension geometry, not as a black box.
+
+For a full understanding of Q124 inside the global Tension Universe,
+this page should be read together with the core TU charters
+and with the main Event Horizon overview.

 ---

 ### Charters and formal context

-This page follows:
+This MVP should be read together with the core Tension Universe charters.

- [TU Effective Layer Charter](../../Charters/TU_EFFECTIVE_LAYER_CHARTER.md)
- [TU Encoding and Fairness Charter](../../Charters/TU_ENCODING_AND_FAIRNESS_CHARTER.md)
+- [TU Effective Layer Charter](../../Charters/TU_EFFECTIVE_LAYER_CHARTER.md)  
+- [TU Encoding and Fairness Charter](../../Charters/TU_ENCODING_AND_FAIRNESS_CHARTER.md)  
 - [TU Tension Scale Charter](../../Charters/TU_TENSION_SCALE_CHARTER.md)
+
+These charters define how effective layer claims, encodings and tension scales are supposed
+to behave across the whole project. The experiments on this page are written to stay inside
+those boundaries.