vrr/WFGY


mirror of https://github.com/onestardao/WFGY.git synced 2026-04-30 12:39:55 +00:00

PSBigBig + MiniPS 8e5785a674

Update planner-test-note-v1.md

2026-03-18 11:47:08 +08:00

17 KiB


Planner Test Note v1 🧪

Compact integration checks for Atlas Auto Repair planner behavior


If the planner spec explains how the planner should work, this note checks whether the planner, validation logic, and rollback logic can actually line up on the same small cases. 🧭

Its purpose is practical:

show how the repair planner, validation logic, and rollback logic connect on the same small cases

This file does not claim to be a full planner benchmark.

It claims something smaller and more useful:

the project now has a first compact test note for checking whether the planner behaves in a structured, conservative, validation-aware way

Quick start 🚀

I want the shortest test path

Use this path:

read the test case summary
compare planner output against expected planner shape
check good behavior versus bad behavior
inspect validation expectation and rollback risk
write a short verdict

I want the stronger planner-check path

Use this page together with:

Short version:

keep the family
keep the action small
name a real validation target
leave room for rollback or escalation ✨

1. Why this note exists

The Auto Repair layer already has:

planner specification
planner prompt
action schema
validation loop
rollback policy
safe early action catalog
tiny validation examples
tiny rollback examples

But these components still need a small integration check.

This note exists to answer a simple question:

if the planner sees a routed case, does it produce the kind of repair plan that later validation and rollback logic can actually support?

That is the core purpose of this file.

In short:

this is the first small integration note for planner quality

2. What this test note is trying to check

This note is not testing whether the system can solve all repair problems.

It is testing whether the planner can do the following correctly:

stay aligned with the routed family
choose a reasonable first repair action
avoid obvious misrepair directions
define a real first validation target
choose a reasonable next step
remain conservative under uncertainty
leave room for rollback or escalation when needed

That is enough for a first planner test layer.

3. Planner test quick map 🗂️

| Test area | Main question |
| --- | --- |
| family alignment | does the planner stay with the routed family |
| action quality | are the actions local, plausible, and limited |
| validation awareness | is there a real first validation target |
| misrepair awareness | does the planner name the likely wrong path |
| scope discipline | is the repair scope reasonable for the case risk |
| outcome discipline | does the planner give a usable next step |

This page is the right place when the question is whether the planner behaves correctly on small routed cases, not whether the whole repair universe is complete.

4. Minimum planner quality criteria

A planner output in this note should be judged against these criteria.

Criterion 1 · Family alignment

Does the selected repair family remain aligned with the routed case?

Criterion 2 · Action quality

Are the candidate actions small, local, and plausible?

Criterion 3 · Validation awareness

Does the plan name a real first validation target?

Criterion 4 · Misrepair awareness

Does the plan explicitly name the most likely wrong repair direction?

Criterion 5 · Scope discipline

Does the plan stay within a suitable scope?

Criterion 6 · Outcome discipline

Does the plan recommend a reasonable next step such as:

validate-first-action
revise-routing
escalate-to-review
escalate-to-wfgy
stop-and-wait

These six criteria are the basic planner test frame for v1.
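The six criteria can be sketched as a small review helper. This is only an illustration, not part of the Auto Repair layer: the field names (`selected_repair_family`, `plan_scope`, `candidate_actions`, and so on) and the function `review_plan` are hypothetical, and the scope labels are the ones that appear in the tests later in this note. Action quality is approximated by the 1 to 3 action limit from section 6, which is a proxy and cannot judge whether an action is truly local or plausible.

```python
# Next-step labels copied from Criterion 6.
ALLOWED_NEXT_STEPS = frozenset({
    "validate-first-action",
    "revise-routing",
    "escalate-to-review",
    "escalate-to-wfgy",
    "stop-and-wait",
})

# Scope labels as they appear in the tests of this note.
ALLOWED_SCOPES = frozenset({"minimal", "constrained", "planner-only"})


def review_plan(plan: dict, routed_family: str) -> dict:
    """Return one True/False entry per criterion (hypothetical field names)."""
    actions = plan.get("candidate_actions", [])
    return {
        "family_alignment": plan.get("selected_repair_family") == routed_family,
        # Proxy only: counts actions, does not judge locality or plausibility.
        "action_quality": 1 <= len(actions) <= 3,
        "validation_awareness": bool(plan.get("primary_validation_target")),
        "misrepair_awareness": bool(plan.get("misrepair_risk")),
        "scope_discipline": plan.get("plan_scope") in ALLOWED_SCOPES,
        "outcome_discipline": plan.get("recommended_next_step") in ALLOWED_NEXT_STEPS,
    }


example_plan = {
    "selected_repair_family": "F1",
    "plan_scope": "minimal",
    "candidate_actions": ["F1_RG_001"],
    "primary_validation_target": "anchor alignment",
    "misrepair_risk": "over-tightening representation",
    "recommended_next_step": "validate-first-action",
}
example_review = review_plan(example_plan, routed_family="F1")
```

A human reviewer would still read the plan itself. The dictionary only records which criteria deserve a closer look.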

5. Test format

Each tiny planner test uses the same compact structure:

Test ID
Routed case summary
Expected planner shape
Good planner behavior
Bad planner behavior
Validation expectation
Rollback risk
Why the test matters
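As a rough sketch, the eight-part structure above could be held in a small record type. The class name `PlannerTest`, the attribute names, and the `missing_sections` helper are assumptions made for illustration; the note itself only defines the section titles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannerTest:
    """One tiny planner test, mirroring the eight sections listed above."""
    test_id: str
    routed_case_summary: str
    expected_planner_shape: tuple
    good_planner_behavior: tuple
    bad_planner_behavior: tuple
    validation_expectation: tuple
    rollback_risk: str
    why_the_test_matters: str

    def missing_sections(self) -> list:
        """Names of sections that were left empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


draft = PlannerTest(
    test_id="PTN_F1_001",
    routed_case_summary="The case is routed to F1.",
    expected_planner_shape=("stay in F1", "choose a small grounding action"),
    good_planner_behavior=("selected repair family = F1",),
    bad_planner_behavior=("selecting F7 first without clear container failure",),
    validation_expectation=("before anchor state", "after anchor state"),
    rollback_risk="",  # left empty on purpose to show the completeness check
    why_the_test_matters="Keeps a grounding problem as a grounding problem.",
)
```

Calling `draft.missing_sections()` on this draft reports that the rollback section still needs to be written.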

This is not a heavy benchmark format.

It is a compact quality-control note.

Test 1 · F1 Planner Test

Test ID

PTN_F1_001

Routed case summary

The case is routed to F1.

The answer is fluent but grounded in a semantically adjacent source rather than the correct evidence anchor.

Routed summary:

primary family: F1
broken invariant: evidence-anchor integrity broken
best current fit: Retrieval Anchor Drift
fix surface direction: re-grounding or anchor re-check

Expected planner shape

The planner should:

stay in F1
choose a small grounding action
prefer local re-grounding or anchor filtering
set anchor alignment as the first validation target
warn against drifting into F7-first or style-first repair

Good planner behavior

Good output would look like:

selected repair family = F1
plan scope = minimal
candidate actions include F1_RG_001 or F1_AF_001
primary validation target = anchor alignment
misrepair risk = over-tightening representation while the real issue remains grounding
recommended next step = validate-first-action

Bad planner behavior

Bad output would look like:

selecting F7 first without clear container failure
proposing schema tightening as the first move
proposing too many actions
failing to name a validation target
treating fluency as evidence that grounding is already fine
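The good and bad behavior lists for this test can be turned into a small check, shown here as a sketch. The plan field names and the function `f1_plan_problems` are hypothetical. The action IDs `F1_RG_001` and `F1_AF_001` come from the list above, and the check assumes that an action ID starts with its family label, as those IDs do.

```python
# Grounding actions named in the good-behavior list for PTN_F1_001.
F1_GROUNDING_ACTIONS = {"F1_RG_001", "F1_AF_001"}


def f1_plan_problems(plan: dict) -> list:
    """List the bad-behavior signs found in a plan for PTN_F1_001."""
    problems = []
    actions = plan.get("candidate_actions", [])
    if plan.get("selected_repair_family") != "F1":
        problems.append("left the routed family")
    if not F1_GROUNDING_ACTIONS.intersection(actions):
        problems.append("no grounding action among the candidates")
    # Assumes the action ID prefix names the family (e.g. "F7_...").
    if actions and actions[0].startswith("F7_"):
        problems.append("schema-style action proposed first")
    if len(actions) > 3:
        problems.append("too many actions")
    if plan.get("primary_validation_target") != "anchor alignment":
        problems.append("validation target is not anchor alignment")
    return problems


good_plan = {
    "selected_repair_family": "F1",
    "candidate_actions": ["F1_RG_001"],
    "primary_validation_target": "anchor alignment",
}
bad_plan = {
    "selected_repair_family": "F7",
    "candidate_actions": ["F7_SC_001"],
}
```

Here `f1_plan_problems(good_plan)` returns an empty list, while `bad_plan` triggers four of the five signs. The last bad behavior in the list, treating fluency as evidence of grounding, is a judgment about reasoning and is left to the human reviewer.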

Validation expectation

Validation should be able to compare:

before anchor state
after anchor state
whether the answer is now tied to the correct source
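A minimal sketch of that comparison, assuming each anchor state can be reduced to a single source identifier. The function name, the identifiers, and the three verdict labels are illustrative and are not defined by the validation loop.

```python
def anchor_verdict(before_source: str, after_source: str, correct_source: str) -> str:
    """Compare the anchor before and after the first repair action."""
    was_correct = before_source == correct_source
    is_correct = after_source == correct_source
    if is_correct and not was_correct:
        return "improved"
    if was_correct and not is_correct:
        return "regressed"  # the situation where rollback becomes relevant
    return "unchanged"


# The answer was tied to an adjacent source and is now tied to the right one.
verdict = anchor_verdict("doc-adjacent", "doc-correct", "doc-correct")
```

In practice the anchor state is likely richer than one identifier, so this only shows the shape of a before/after check.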

Rollback risk

Rollback may be needed if filtering removes truly relevant evidence and makes semantic alignment worse.

Why the test matters

This test checks whether the planner can keep a grounding problem as a grounding problem.

Test 2 · F4 Planner Test

Test ID

PTN_F4_001

Routed case summary

The case is routed to F4.

A downstream step is executing before upstream readiness is complete.

Routed summary:

primary family: F4
secondary family: F3
broken invariant: deployment liveness closure broken
best current fit: Pre-Readiness Execution Failure
fix surface direction: readiness audit or gate insertion

Expected planner shape

The planner should:

stay in F4
prefer a gate or ordering repair
avoid trying to repair the case through memory pressure or general reasoning pressure
choose readiness state as the first validation target
warn about continuity-versus-closure confusion

Good planner behavior

Good output would look like:

selected repair family = F4
plan scope = constrained
candidate actions include F4_GT_001 or F4_OC_001
primary validation target = readiness state
misrepair risk = treating the case as continuity-only
recommended next step = validate-first-action

Bad planner behavior

Bad output would look like:

choosing F3-first without strong continuity evidence
adding more instructions instead of closure logic
proposing broad workflow rewrite too early
failing to name readiness state or gate behavior as the validation target

Validation expectation

Validation should be able to compare:

before execution ordering
after execution ordering
whether the downstream step is now correctly blocked until ready
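The ordering comparison can be sketched over a simple event trace. The event names `upstream_ready` and `downstream_start` and the function below are assumptions for illustration; a real workflow would expose its own events.

```python
def downstream_blocked_until_ready(
    events: list,
    ready: str = "upstream_ready",
    start: str = "downstream_start",
) -> bool:
    """True if the downstream step never starts before upstream is ready."""
    if start not in events:
        return True  # downstream never ran, so nothing executed early
    if ready not in events:
        return False  # downstream ran although readiness was never reached
    return events.index(ready) < events.index(start)


before_trace = ["job_created", "downstream_start", "upstream_ready"]
after_trace = ["job_created", "upstream_ready", "downstream_start"]
```

With these traces the check fails before the gate and passes after it. A trace where `upstream_ready` appears but `downstream_start` never does also passes this check, so it cannot detect a gate that blocks too much. That over-blocking case needs a separate look.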

Rollback risk

Rollback may be needed if the inserted gate blocks valid downstream behavior beyond intended scope.

Why the test matters

This test checks whether the planner can recognize that some failures need workflow repair first, not intelligence pressure.

Test 3 · F7 Planner Test

Test ID

PTN_F7_001

Routed case summary

The case is routed to F7.

The content is partly correct, but the structured shell is broken and cannot be consumed reliably.

Routed summary:

primary family: F7
broken invariant: formal descriptor fidelity broken or container fidelity broken
best current fit: Symbolic Representation Fidelity Failure
fix surface direction: schema tightening or shell correction

Expected planner shape

The planner should:

stay in F7
prefer a local structure or schema action
avoid jumping directly to reasoning pressure
choose schema validity or shell integrity as the first validation target
warn about hiding a deeper grounding problem under cleaner structure

Good planner behavior

Good output would look like:

selected repair family = F7
plan scope = constrained
candidate actions include F7_SC_001 or F7_SH_001
primary validation target = schema validity or shell integrity
misrepair risk = cleaner shell but weaker semantic fit
recommended next step = validate-first-action

Bad planner behavior

Bad output would look like:

selecting F2 first without strong progression evidence
proposing abstract reasoning improvements instead of container repair
ignoring validation target
pretending structure is a minor cosmetic issue

Validation expectation

Validation should be able to compare:

before schema state
after schema state
whether downstream consumption becomes possible
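As an illustration only, the sketch below assumes the structured shell is a JSON object with a known set of required keys. The note does not say which format the shell uses, so both the JSON choice and the key names are assumptions.

```python
import json


def shell_is_consumable(raw: str, required_keys: set) -> bool:
    """True if the output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


REQUIRED = {"answer", "source"}  # hypothetical downstream contract

before_output = '{"answer": "42", "source": '  # truncated shell
after_output = '{"answer": "42", "source": "doc-7"}'

before_ok = shell_is_consumable(before_output, REQUIRED)
after_ok = shell_is_consumable(after_output, REQUIRED)
```

A passing shell check says nothing about whether the content inside is still right, which is why the semantic fit is reviewed separately.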

Rollback risk

Rollback may be needed if structure improves while semantic task fit drops too much.

Why the test matters

This test checks whether the planner understands the difference between repairing the box and repairing the thought inside the box.

Test 4 · F5 Cautious Planner Test

Test ID

PTN_F5_001

Routed case summary

The case is routed to F5.

The system is hard to inspect, and the failure path remains opaque, but there is also some neighboring pressure from F6.

Routed summary:

primary family: F5
secondary family: F6
broken invariant: failure-path visibility broken
best current fit: Failure Path Opacity or Early Warning Deficit
fix surface direction: trace exposure or observability insertion

Expected planner shape

The planner should:

stay in F5 and stay cautious
prefer a narrow visibility-first move
avoid jumping directly into strong boundary intervention
choose failure-path visibility as the first validation target
explicitly warn that better visibility is not full repair

Good planner behavior

Good output would look like:

selected repair family = F5
plan scope = minimal or planner-only
candidate actions include F5_TE_001 or F5_LU_001
primary validation target = failure-path visibility
misrepair risk = adding more noise without improving diagnosability
recommended next step = validate-first-action or escalate-to-review

Bad planner behavior

Bad output would look like:

jumping directly to F6 intervention
proposing broad system mutation
treating observability uplift as if the case were fully solved
escalating action scope under weak evidence

Validation expectation

Validation should be able to compare:

before visibility state
after visibility state
whether the exposed signal is more useful rather than only more abundant
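One rough way to separate "more useful" from "more abundant" is to count the log lines that actually mention the failure path and compare their share before and after. The marker string, the verdict labels, and the ratio itself are hypothetical choices for this sketch, not a metric defined by the project.

```python
def useful_count(lines: list, marker: str) -> int:
    """Number of lines that mention the failure path marker."""
    return sum(1 for line in lines if marker in line)


def visibility_verdict(before: list, after: list, marker: str = "failure_path") -> str:
    """Compare useful signal, not just log volume, before and after."""
    useful_before = useful_count(before, marker)
    useful_after = useful_count(after, marker)
    ratio_before = useful_before / len(before) if before else 0.0
    ratio_after = useful_after / len(after) if after else 0.0
    if useful_after > useful_before and ratio_after >= ratio_before:
        return "improved"
    if len(after) > len(before) and ratio_after < ratio_before:
        return "noisier"  # more lines, smaller share of useful ones
    return "unchanged"


quiet = ["step ok", "step ok"]
traced = ["step ok", "step ok", "failure_path: gate timeout"]
chatty = ["failure_path: gate timeout", "step ok", "debug", "debug", "debug"]
```

Going from `quiet` to `traced` counts as improved, while going from `traced` to `chatty` counts as noisier, which is the case the rollback note for this test has in mind.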

Rollback risk

Rollback may be needed if added logging increases noise and reduces practical diagnosability.

Why the test matters

This test checks whether the planner can stay disciplined near F5 and F6 edges.

6. What a passing planner should look like

Across these tests, a passing planner should show these repeated traits:

it respects the routed family
it proposes only 1 to 3 actions
it keeps actions local
it defines a clear first validation target
it names a realistic misrepair risk
it stays conservative near dangerous boundaries
it leaves room for rollback or escalation

If those traits are present consistently, the planner is behaving in the intended v1 style.

7. What a failing planner should look like

Across these tests, failing planner behavior usually includes one or more of these signs:

too many candidate actions
broad intervention under weak evidence
no explicit validation target
no explicit misrepair risk
choosing neighbor-family repair too early
overconfident tone in boundary-heavy cases
pretending visibility or structure alone is full repair

These are exactly the behaviors this note is meant to expose early.

8. How to use this note

This note can be used in at least three ways.

A. Manual planner review

A human reviewer compares planner output against these tiny tests.

B. Prompt tuning support

The planner prompt can be adjusted if it repeatedly fails the same tiny cases.
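Spotting "repeatedly fails the same tiny cases" can be sketched with a simple counter over review rounds. The data shape (one list of failed test IDs per round) and the threshold of two rounds are assumptions for illustration.

```python
from collections import Counter


def repeated_failures(rounds: list, min_repeats: int = 2) -> list:
    """Test IDs that failed in at least `min_repeats` review rounds."""
    counts = Counter(
        test_id
        for failed in rounds
        for test_id in set(failed)  # count each test once per round
    )
    return sorted(t for t, n in counts.items() if n >= min_repeats)


review_rounds = [
    ["PTN_F4_001"],
    ["PTN_F4_001", "PTN_F7_001"],
    [],
]
needs_prompt_attention = repeated_failures(review_rounds)
```

Here only `PTN_F4_001` is flagged, which suggests looking at how the prompt handles gate and ordering repairs before changing anything else.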

C. Early demo support

These tests can be reused in tiny repair demos and future validation demos.

This makes the note small but reusable.

9. What this note does not yet include

Planner Test Note v1 does not yet include:

large planner evaluation datasets
score aggregation
pass or fail automation
model-to-model planner comparison
benchmark metrics
cross-family chain tests

Those can come later.

This note is intentionally small and practical.

10. Recommended next step

Once this note exists, the next useful follow-up is one of these:

create a tiny semi-auto demo spec using one F1, one F4, and one F7 action
create a planner review checklist
create a tiny planner output examples pack

The strongest immediate next step is probably:

create a tiny planner output examples pack

That would make planner behavior even easier to compare and reuse.

11. Next steps ✨

After this page, most readers continue with:

If you want the broader product surface:

12. One-line summary 🌍

Planner Test Note v1 provides the first compact integration tests for whether the Atlas Auto Repair planner produces structured, conservative, validation-aware first repair plans.
