WFGY/TensionUniverse/Experiments/Q124_MVP/README.md
PSBigBig × MiniPS fc05f5e367
Update README.md
2026-02-11 22:07:12 +08:00

12 KiB

TU Q124 MVP: Scalable oversight and evaluation

Status: work in progress. This page records early MVP experiments and may change as the TU Q124 program evolves.

This page documents the first effective-layer MVP experiments for TU Q124 on scalable oversight and evaluation. It does not claim that Q124 is solved as a mathematical problem or as a full benchmark. The scripts here are small and fully inspectable. You can re-run them with your own OpenAI API key to reproduce the qualitative patterns, but the exact numbers will drift.


Navigation


Quick start (Colab)

You can run the exact notebook used for this MVP directly in Colab:

Open in Colab

The notebook is completely self-contained:

  • It prints the same header text that you see on this page.
  • It first looks for an OPENAI_API_KEY environment variable.
  • If no key is found, it will ask for a key only if you actually want to run the experiment.
  • If you do not provide a key, it stops with a clear message and points back to this README.

You can therefore:

  • treat it as a pure reading / inspection artifact, or
  • paste an API key once and reproduce the experiment end-to-end.

0. What this page is about

TU Q124 treats "scalable oversight and evaluation" as a tension problem between three elements:

  1. How complex and subtle real tasks become when systems are deployed at scale.
  2. How limited and overloaded the evaluation layer tends to be, whether it is humans or tools.
  3. How easily bad tension can hide inside scores or dashboards that look stable.

This MVP does not try to cover the full Q124 program.

Instead, it focuses on a narrow and fully inspectable slice:

  • A finite set of synthetic "worlds" or task clusters where evaluation is non-trivial.

  • Two evaluation modes that operate on the same worlds:

    • a baseline evaluator that uses a short, underspecified rubric,
    • a guided evaluator that receives additional structured context.
  • A single scalar tension observable T_oversight in the range [0, 1] that measures how badly the evaluation layer is misaligned with the underlying task signal.

The goal of this MVP is to show that even in very small toy worlds:

  • we can encode oversight as a state space with explicit observables,
  • we can define a simple tension functional for the evaluation layer,
  • we can observe systematic differences between evaluation designs, using both error rates and tension profiles.

1. Experiment A: toy oversight ladders on synthetic tasks

This is the main level-1 MVP for Q124. It is intentionally small and easy to audit.

1.1 Research question

In a small set of synthetic oversight worlds:

  • Can we define a scalar tension observable T_oversight that increases when the evaluation layer is clearly out of its depth relative to the underlying task difficulty?

  • When we compare a baseline evaluator and a guided evaluator on the same worlds:

    • Do we see different error rates B_baseline and B_guided?
    • Do we see a consistent shift in the tension profiles?
    • Can simple arbitration rules based on T_oversight pick the safer mode more often than chance?

In effective-layer language:

Does a simple tension geometry for oversight let us see, in a reproducible way, where naive evaluation is likely to fail, before we look at long-term metrics?

1.2 Setup

Experiment A uses:

  • A finite set of SCENARIOS (in the current MVP, 8 cases).
    Each scenario corresponds to a small batch of tasks that must be evaluated.
    Every scenario carries:

    • a short category label such as easy_math_correct, safety_violation, bias_stereotype,
    • a free-text description used by the evaluators,
    • a reference "difficulty" or OOD measure delta_ref,
    • a ground truth quality scalar rule_score in [0, 1].
  • Two evaluation modes for every scenario:

    • baseline mode:

      • uses a minimal rubric with a few lines of instruction,
      • sees the scenario description and the model answer,
      • must very quickly output a label and a coarse score.
    • guided mode:

      • receives the same inputs as baseline,
      • plus a more structured rubric that explicitly separates correctness, safety and fairness,
      • then compresses this back into a label and a score.
  • Each mode directly returns:

    • a discrete label label in {GOOD, BAD},
    • a quality score score in [0, 1].

All tasks and rubrics are synthetic and are defined directly in the notebook. There are no external datasets.

The notebook only uses:

  • Python standard library,
  • pandas, matplotlib,
  • openai SDK when a live LLM evaluator is used.

The code is written so that:

  • it first looks for an OPENAI_API_KEY environment variable,
  • if the key is missing, it will ask the user to paste the key interactively,
  • if no key is provided, it will stop with a clear message and refer back to this README.

1.3 Representative results

After one full run of the notebook, we obtain:

  • a DataFrame where each row is one scenario, with at least the following columns:

    • scenario_id
    • category
    • delta_ref
    • rule_score
    • rule_label

    and for each evaluation mode <mode> in {baseline, guided}:

    • <mode>_label
    • <mode>_score
    • <mode>_delta_ground
    • <mode>_delta_outcome
    • <mode>_tension (this is T_oversight for that mode and scenario)
    • <mode>_is_correct
  • a summary dictionary with scalar indicators:

    • B_baseline baseline error rate,
    • B_guided guided error rate,
    • delta_B = B_baseline - B_guided,
    • an aggregate tension contrast rho_tension that summarizes how far apart the two tension profiles are,
    • B_arb and T_mean_* for a simple arbiter that always picks the lower-tension mode.

Concrete snapshot from one run

On one concrete run using gpt-4o-mini for both modes (8 cases), we observed:

  • B_baseline ≈ 0.125 (1 / 8 cases counted incorrect)
  • B_guided ≈ 0.250 (2 / 8 cases counted incorrect)
  • B_arb ≈ 0.125 (arbiter not worse than the better mode)

and mean tensions:

  • T_mean_baseline ≈ 0.218
  • T_mean_guided ≈ 0.303
  • T_mean_arb ≈ 0.205

The guided rubric does not automatically dominate the baseline on this tiny set. It slightly over-corrects on some cases, but the T_oversight geometry still lets a simple arbiter pick a mixture of modes that is no worse than the better one while achieving a slightly lower mean tension.

Below are the corresponding terminal snapshot and tension plot.

Q124 per-case summary (baseline vs guided)

Per-case summary table. Columns include rule_score, delta_ref, per-mode labels, scores, tensions and correctness flags at the effective layer.

Q124 baseline vs guided T_oversight per case

Baseline vs guided T_oversight per case. The curves are close on easy cases, and diverge modestly on the more difficult or safety-sensitive ones. The arbiter operates only on these scalar tensions.

The target qualitative pattern for a successful MVP is not that guided always wins, but that:

  • the geometry makes evaluation drift visible on specific cases,
  • cheap arbitration based on T_oversight is already competitive with the better mode,
  • everything is small enough that misbehaviour can be audited line by line.

1.4 How to reproduce

  1. Open the notebook

    • TensionUniverse/Experiments/Q124_MVP/Q124_A.ipynb
    • or click the Colab badge at the top of this page.
  2. Provide an OpenAI API key (only if you want to run it)

    • If you already have OPENAI_API_KEY set in your environment, the notebook will use it.
    • Otherwise, the first code cell will prompt you to paste an API key once.
    • If you do not want to call a live model, you can still read this README and inspect the tension geometry design; the experiment will simply not execute.
  3. Install dependencies if needed

    • pandas, matplotlib, openai

    The notebook includes a single pip cell that you can run in a clean Colab runtime.

  4. Run all cells from top to bottom

    • the script will define the SCENARIOS,
    • run both baseline and guided evaluators,
    • compute per-scenario metrics and tension scores,
    • assemble the DataFrame,
    • print a compact summary block,
    • and draw a simple tension plot.
  5. Inspect the outputs

    • The final cell calls:

      results_df, results_summary = run_experiment()
      plot_tension(results_df)
      
    • You can scroll to inspect the printed table,

    • and you can visually compare the two tension curves on the plot.


2. Experiment B: reserved for future extensions

This section is intentionally left light for the first pass.

Once Experiment A is stable, Experiment B can host a slightly more advanced variant, for example:

  • increasing the number or diversity of scenarios,
  • adding a third evaluation mode such as "stacked tools" or "committee oversight",
  • or testing a different definition of T_oversight that emphasizes different observables.

The structure for Experiment B will mirror the A block but may be shorter and focus on a specific extension.


3. How this MVP fits into the Tension Universe

At the Tension Universe level, Q124 connects several clusters:

  • AI alignment and control questions (see Q121 and Q122),
  • interpretability and internal representation questions (Q123),
  • data quality and truth extraction from synthetic worlds (Q127),
  • and social oversight structures that come from complex systems and governance.

This MVP does not try to answer any of the large questions directly.

Instead, it gives a concrete example of:

  • how to encode oversight as a finite state space of worlds and modes,
  • how to define a scalar tension functional for an evaluation layer,
  • how to compare different oversight designs by looking at both error rates and tension profiles.

The same pattern can be reused across other S-class problems in this pack:

  • in some problems, the "worlds" are scientific projects or long-horizon policies,
  • in others, they are synthetic AI tasks or games,
  • but in all cases the oversight layer is treated as a system with its own tension geometry, not as a black box.

For a full understanding of Q124 inside the global Tension Universe, this page should be read together with the core TU charters and with the main Event Horizon overview.


Charters and formal context

This MVP should be read together with the core Tension Universe charters.

These charters define how effective-layer claims, encodings and tension scales are supposed to behave across the whole project. The experiments on this page are written to stay inside those boundaries.


The full WFGY project, including the Tension Universe experiment pack, lives at:

If this experiment or the TU pack is useful to you, a star on the repo makes it easier for other researchers to discover and audit the work.