vrr/WFGY

Fork 0

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig + MiniPS e491eba4ea

Create phase-overview.md

2026-03-24 11:20:29 +08:00

13 KiB

Raw Blame History

Phase Overview · Experiment Structure and Comparison Groups

How the current Inverse Atlas MVP experiments are organized

This page explains the current phase structure of the Inverse Atlas experiments layer.

The goal of this structure is simple:

start with a fast reality check
move into harder stress cases
then test whether the system stays lawful across longer multi-turn contexts

So the current design is not random.

It is intentionally layered.

The present MVP experiment structure has three main phases:

Smoke Phase
Core Stress Phase
Long-Context Phase

It also uses comparison groups that help separate baseline behavior from inverse-governed behavior, and later from dual-layer behavior.

Quick Links

Section	Link
Inverse Atlas Home	Inverse Atlas README
Versions	Versions
Quick Start	Quick Start
Experiments Home	Experiments
Repro in 60 Seconds	Repro in 60 Seconds
Runtime Guide	Runtime Guide
Status and Boundaries	Status and Boundaries
Twin Atlas	Twin Atlas

The shortest version

If you only want the fast summary, it is this:

Smoke Phase

Checks whether the MVP is visibly alive.

Core Stress Phase

Checks whether the legality layer still helps when the cases get harder.

Long-Context Phase

Checks whether the system can stay lawful across multi-turn contamination and drift.

And the current comparison groups are:

A = baseline
B = inverse only
D = forward plus inverse

That is the core structure.

Why the experiments are split into phases

The phases exist because not all failures appear at the same difficulty level.

Some failures show up immediately in a short comparison.

Some only become clear when:

ambiguity increases
route competition becomes sharper
fake repair becomes tempting
context grows longer
early weak assumptions start contaminating later reasoning

If everything is mixed into one giant bucket, it becomes harder to tell what the framework is already doing well and what still needs work.

So the phase structure is meant to answer three different questions in order:

Is the MVP visibly doing anything real yet
Does it still help when pressure gets harder
Can it remain lawful across longer context chains

Smoke Phase

Current size

8 cases

Main purpose

Smoke Phase is the first standing check of the product line.

It exists to answer a simple but very important question:

does the current MVP already behave differently enough to be worth testing seriously

This phase is intentionally small.

It is not supposed to prove everything. It is supposed to confirm that the artifact layer is already alive enough to show visible contrast.

What Smoke Phase is good for

Smoke Phase is especially useful for:

first dry runs
quick product sanity checks
visible baseline vs inverse contrast
public demo selection
catching obvious failure patterns fast

What Smoke Phase is not for

Smoke Phase is not a finished benchmark. It is not the final empirical story. It is the first gate.

Typical things to watch for

In Smoke Phase, you mainly watch for whether the inverse-governed pass already reduces:

early over-resolution
fake certainty
false completion
cosmetic repair inflation
obvious public overclaim

If you cannot see a difference here, the MVP probably needs more work before harder testing.

Core Stress Phase

Current size

32 cases

Main purpose

Core Stress Phase is where the framework gets pushed into more contested territory.

This phase matters because many interesting failures do not appear in easy prompts.

They appear when the model is forced into situations involving:

route contestability
weak evidence
dangerous confidence
repair temptation
illegal specificity pressure
structure that looks “almost settled” but is not

What Core Stress Phase is good for

This phase is the main middle layer of the MVP experiments structure.

It is useful for:

testing legality discipline under stronger pressure
comparing baseline vs inverse on more serious cases
selecting representative showcase examples
seeing whether the inverse layer is only helpful in easy cases or still useful in hard ones

What Core Stress Phase is not for

It is still not the same thing as a final full-scale benchmark suite.

It is a larger and more meaningful stress layer, but still part of the MVP stage.

Typical things to watch for

In Core Stress Phase, you care more about:

neighboring-route honesty
refusal of illegal detail escalation
downgrade into COARSE or UNRESOLVED when needed
better distinction between structural repair and cosmetic repair
stronger control of visible confidence ceiling

This phase is where the legality logic should start showing real teeth.

Long-Context Phase

Current size

12 multi-turn cases

Main purpose

Long-Context Phase exists because some of the most expensive failures are not single-turn failures.

They emerge when:

earlier weak assumptions linger
provisional claims start behaving like settled facts
contamination grows across turns
the model forgets that prior steps were only tentative
false completion becomes stronger over time

What Long-Context Phase is good for

This phase is useful for:

testing contamination resistance
testing multi-turn legality discipline
seeing whether provisional uncertainty is preserved
observing whether early weak claims get silently upgraded into later false certainty

What Long-Context Phase is not for

It is not just “more tokens.” It is a structurally different pressure field.

That is why it deserves its own phase rather than being mixed into the same bucket as short-case stress tests.

Typical things to watch for

In Long-Context Phase, you care about:

context drift
contamination
false memory of prior certainty
late-stage overclaim
unresolved structure pretending to be resolved due to repetition

This phase matters a lot for later portability and real-world use.

Comparison groups

The current MVP structure uses three important comparison groups.

A · Baseline

This is the ordinary direct-answer condition.

No Inverse Atlas layer is applied.

A is important because it shows what a strong model tends to do when there is no legitimacy-first governance layer.

It is not there to be insulted. It is there to establish the control condition.

B · Inverse Only

This is the cleanest test of the Inverse Atlas line itself.

Only Inverse Atlas is applied.

B matters a lot because it helps answer the question:

is the inverse legality gate already doing real work by itself

If B is already meaningfully better than A on legality-centered pressure, that is strong evidence that the inverse line already stands as a real MVP product.

D · Forward + Inverse

This is the dual-layer direction.

The forward Atlas contributes route-first guidance. Inverse Atlas contributes legality-first governance.

But one rule must remain explicit:

the forward output is a weak prior, not an authorization source

That rule should never be blurred.

The forward side may help orient the structure, but lawful output still has to be governed by the inverse side.

Why there is no C group here

At the current MVP stage, the main comparison interest is not to create a giant matrix for its own sake.

The current structure is centered on the most meaningful contrast lines:

direct baseline
inverse-only
forward-plus-inverse

That is why A, B, and D are the high-value groups right now.

The point is to keep the MVP readable and useful.

If more groups become valuable later, they can be added in a cleaner next phase rather than bloating the present layer too early.

What the current phase structure is trying to show

The current phase structure is meant to show something very specific:

that Inverse Atlas should not only look good on paper, but should change behavior under increasingly serious pressure.

So the phases are arranged to make visible whether the framework:

works at all
still works when cases get harder
still works when context gets longer
still works when route ambiguity and repair temptation both increase

This is the real reason the phase design matters.

How this relates to Colab

A Colab version makes sense for this experiments layer.

But the role of Colab should stay simple:

Colab is mainly for reproduction

It is not required for understanding the project.

It is not the only way to see the value.

A later Colab can help users:

choose a version
run a small comparison
inspect result shape
reproduce the same case structure more easily

But it is perfectly acceptable for the repository itself to already show:

the phase structure
the representative cases
the current observed findings
the expected output pattern

That means people do not have to run the notebook just to understand what the framework is supposed to do.

Important honesty rule

If a future Colab page shows expected result patterns, those should be labeled clearly as expected or illustrative.

If a page shows actual observed findings, those should be labeled clearly as current findings or observed results.

Do not mix those two categories.

That distinction keeps the experiments layer trustworthy.

What should count as current findings vs expected pattern

This distinction matters a lot.

Current findings

These are things you have already seen in dry runs, MVP passes, or documented comparisons.

Examples:

B appears to suppress a meaningful class of illegal escalation behavior
D appears stronger than B in some cases, provided the forward side remains only a weak prior
A often over-resolves, over-completes, or overclaims under the same pressure

Expected pattern

These are things you believe the framework should show if a reproduction is run properly.

Examples:

Advanced should usually feel more balanced than Basic
Strict should usually remain more conservative under legality pressure
Long-context runs should reveal contamination resistance differences more clearly than short single-turn runs

Both are useful.

But they are not the same thing, and they should always be labeled differently.

What this page does not claim

This page does not claim that:

every phase has already been completed at final scale
the current MVP phase design is already the final benchmark design
the existence of a Colab notebook automatically proves the claims
expected result shape is the same thing as actual measured results
the dual-layer direction already means the full Bridge operating layer is complete

This page only defines the current experiment structure and how it should be read.

If you need one sentence for outside use

If you want one compact sentence, use this:

The current Inverse Atlas MVP experiments are organized into Smoke, Core Stress, and Long-Context phases, using baseline, inverse-only, and dual-layer comparison groups to test legality-centered behavior under increasing pressure.

That sentence is short, clear, and honest.

Final Note

The phase structure matters because it gives the project a real testing spine.

It lets the current Inverse Atlas MVP be checked in layers:

first for visible life
then for harder stress behavior
then for long-context discipline

That makes the project much easier to reproduce, explain, and extend later.

And if a Colab layer is added later, its job should stay what it should be:

a reproduction tool, not a substitute for clear documentation.

13 KiB Raw Blame History

Phase Overview · Experiment Structure and Comparison Groups

Quick Links

The shortest version

Smoke Phase

Core Stress Phase

Long-Context Phase

Why the experiments are split into phases

Smoke Phase

Current size

Main purpose

What Smoke Phase is good for

What Smoke Phase is not for

Typical things to watch for

Core Stress Phase

Current size

Main purpose

What Core Stress Phase is good for

What Core Stress Phase is not for

Typical things to watch for

Long-Context Phase

Current size

Main purpose

What Long-Context Phase is good for

What Long-Context Phase is not for

Typical things to watch for

Comparison groups

A · Baseline

B · Inverse Only

D · Forward + Inverse

Why there is no C group here

What the current phase structure is trying to show

How this relates to Colab

Important honesty rule

What should count as current findings vs expected pattern

Current findings

Expected pattern

What this page does not claim

Recommended reading order

If you need one sentence for outside use

Final Note

13 KiB

Raw Blame History