Planner Test Note v1 🧪
Compact integration checks for Atlas Auto Repair planner behavior
Quick links:
- Get the Atlas Router TXT
- Open Repair Planner Spec v1
- Open Repair Planner Prompt v1
- Open Repair Plan Schema v1
- Open Repair Validation Loop v1
- Open Rollback Policy v1
- Open Planner Review Checklist v1
- Open Tiny Planner Output Examples Pack v1
If the planner spec explains how the planner should work, this note checks whether the planner, validation logic, and rollback logic can actually line up on the same small cases. 🧭
Its purpose is practical:
show how the repair planner, validation logic, and rollback logic connect on the same small cases
This file does not claim to be a full planner benchmark.
It claims something smaller and more useful:
the project now has a first compact test note for checking whether the planner behaves in a structured, conservative, validation-aware way
Quick start 🚀
I want the shortest test path
Use this path:
- read the test case summary
- compare planner output against expected planner shape
- check good behavior versus bad behavior
- inspect validation expectation and rollback risk
- write a short verdict
I want the stronger planner-check path
Use this page together with:
- Repair Planner Spec v1
- Repair Planner Prompt v1
- Repair Plan Schema v1
- Planner Review Checklist v1
- Repair Validation Loop v1
- Rollback Policy v1
Short version:
keep the family
keep the action small
name a real validation target
leave room for rollback or escalation ✨
1. Why this note exists
The Auto Repair layer already has:
- planner specification
- planner prompt
- action schema
- validation loop
- rollback policy
- safe early action catalog
- tiny validation examples
- tiny rollback examples
But these components still need a small integration check.
This note exists to answer a simple question:
if the planner sees a routed case, does it produce the kind of repair plan that later validation and rollback logic can actually support?
That is the core purpose of this file.
In short:
this is the first small integration note for planner quality
2. What this test note is trying to check
This note is not testing whether the system can solve all repair problems.
It is testing whether the planner can do the following correctly:
- stay aligned with the routed family
- choose a reasonable first repair action
- avoid obvious misrepair directions
- define a real first validation target
- choose a reasonable next step
- remain conservative under uncertainty
- leave room for rollback or escalation when needed
That is enough for a first planner test layer.
3. Planner test quick map 🗂️
| Test area | Main question |
|---|---|
| family alignment | does the planner stay with the routed family |
| action quality | are the actions local, plausible, and limited |
| validation awareness | is there a real first validation target |
| misrepair awareness | does the planner name the likely wrong path |
| scope discipline | is the repair scope reasonable for the case risk |
| outcome discipline | does the planner give a usable next step |
This page is the right place when the question is whether the planner behaves correctly on small routed cases, not whether the whole repair universe is complete.
4. Minimum planner quality criteria
A planner output in this note should be judged against these criteria.
Criterion 1 · Family alignment
Does the selected repair family remain aligned with the routed case?
Criterion 2 · Action quality
Are the candidate actions small, local, and plausible?
Criterion 3 · Validation awareness
Does the plan name a real first validation target?
Criterion 4 · Misrepair awareness
Does the plan explicitly name the most likely wrong repair direction?
Criterion 5 · Scope discipline
Does the plan stay within a suitable scope?
Criterion 6 · Outcome discipline
Does the plan recommend a reasonable next step such as:
- validate-first-action
- revise-routing
- escalate-to-review
- escalate-to-wfgy
- stop-and-wait
These six criteria are the basic planner test frame for v1.
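As a minimal sketch, the next-step labels under Criterion 6 can be treated as a small closed vocabulary. The enum name, the helper name, and the `recommended_next_step` key are assumptions made for illustration; they are not taken from Repair Plan Schema v1. Criterion 6 says "such as", so the real list may be longer.

```python
from enum import Enum


class NextStep(str, Enum):
    """Next-step labels listed under Criterion 6 (possibly not exhaustive)."""
    VALIDATE_FIRST_ACTION = "validate-first-action"
    REVISE_ROUTING = "revise-routing"
    ESCALATE_TO_REVIEW = "escalate-to-review"
    ESCALATE_TO_WFGY = "escalate-to-wfgy"
    STOP_AND_WAIT = "stop-and-wait"


def has_known_next_step(plan: dict) -> bool:
    """Return True if the plan's next step is one of the listed labels."""
    # "recommended_next_step" is a hypothetical key name.
    value = plan.get("recommended_next_step")
    return value in {step.value for step in NextStep}
```

A reviewer could use a check like this to catch plans that end with a free-form or missing next step before looking at the other five criteria.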
5. Test format
Each tiny planner test uses the same compact structure:
- Test ID
- Routed case summary
- Expected planner shape
- Good planner behavior
- Bad planner behavior
- Validation expectation
- Rollback risk
- Why the test matters
This is not a heavy benchmark format.
It is a compact quality-control note.
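One possible way to hold a test in this format is a small record with one field per item in the list above. This is a sketch only: the class name, field names, and the placeholder values are assumptions, not part of the note or of any Atlas schema.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass(frozen=True)
class PlannerTest:
    """One tiny planner test; one field per item in the test format."""
    test_id: str
    routed_case_summary: str
    expected_planner_shape: Tuple[str, ...]
    good_planner_behavior: Tuple[str, ...]
    bad_planner_behavior: Tuple[str, ...]
    validation_expectation: Tuple[str, ...]
    rollback_risk: str
    why_the_test_matters: str


# Placeholder values only; a real entry would carry the text of one test.
example = PlannerTest(
    test_id="PTN_EXAMPLE_000",  # hypothetical ID, not a test in this note
    routed_case_summary="short description of the routed case",
    expected_planner_shape=("stay in the routed family",),
    good_planner_behavior=("small, local first action",),
    bad_planner_behavior=("too many actions",),
    validation_expectation=("before state", "after state"),
    rollback_risk="when the first action should be undone",
    why_the_test_matters="what planner habit the test protects",
)
```

Keeping the record frozen reflects the intent that a test is a fixed reference point; planner output changes, the test does not.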
Test 1 · F1 Planner Test
Test ID
PTN_F1_001
Routed case summary
The case is routed to F1.
The answer is fluent but grounded in a semantically adjacent source rather than the correct evidence anchor.
Routed summary:
- primary family: F1
- broken invariant: evidence-anchor integrity broken
- best current fit: Retrieval Anchor Drift
- fix surface direction: re-grounding or anchor re-check
Expected planner shape
The planner should:
- stay in F1
- choose a small grounding action
- prefer local re-grounding or anchor filtering
- set anchor alignment as the first validation target
- warn against drifting into F7-first or style-first repair
Good planner behavior
Good output would look like:
- selected repair family = F1
- plan scope = minimal
- candidate actions include F1_RG_001 or F1_AF_001
- primary validation target = anchor alignment
- misrepair risk = over-tightening representation while the real issue remains grounding
- recommended next step = validate-first-action
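The good-output fields above can be sketched as a plain record and compared against the expected F1 shape. The dictionary keys and the function name are hypothetical; only the values (F1, the two action IDs, anchor alignment, validate-first-action) come from this test.

```python
# Action IDs named in this test; the key names below are assumptions.
F1_ACTIONS = {"F1_RG_001", "F1_AF_001"}


def f1_shape_mismatches(plan: dict) -> list:
    """Return a list of mismatches; an empty list means the shape fits."""
    problems = []
    if plan.get("selected_repair_family") != "F1":
        problems.append("family is not F1")
    if not any(a in F1_ACTIONS for a in plan.get("candidate_actions", [])):
        problems.append("no F1 grounding action")
    if plan.get("primary_validation_target") != "anchor alignment":
        problems.append("validation target is not anchor alignment")
    if not plan.get("misrepair_risk"):
        problems.append("misrepair risk not named")
    if plan.get("recommended_next_step") != "validate-first-action":
        problems.append("next step is not validate-first-action")
    return problems


good_plan = {
    "selected_repair_family": "F1",
    "plan_scope": "minimal",
    "candidate_actions": ["F1_RG_001"],
    "primary_validation_target": "anchor alignment",
    "misrepair_risk": "over-tightening representation while the real issue remains grounding",
    "recommended_next_step": "validate-first-action",
}

# An F7-first plan with no validation target, for contrast.
f7_first_plan = {
    "selected_repair_family": "F7",
    "candidate_actions": ["F7_SC_001"],
}
```

Returning the list of mismatches, rather than a single boolean, keeps the result usable as input for a short written verdict.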
Bad planner behavior
Bad output would look like:
- selecting F7 first without clear container failure
- proposing schema tightening as the first move
- proposing too many actions
- failing to name a validation target
- treating fluency as evidence that grounding is already fine
Validation expectation
Validation should be able to compare:
- before anchor state
- after anchor state
- whether the answer is now tied to the correct source
Rollback risk
Rollback may be needed if filtering removes truly relevant evidence and makes semantic alignment worse.
Why the test matters
This test checks whether the planner can keep a grounding problem as a grounding problem.
Test 2 · F4 Planner Test
Test ID
PTN_F4_001
Routed case summary
The case is routed to F4.
A downstream step is executing before upstream readiness is complete.
Routed summary:
- primary family: F4
- secondary family: F3
- broken invariant: deployment liveness closure broken
- best current fit: Pre-Readiness Execution Failure
- fix surface direction: readiness audit or gate insertion
Expected planner shape
The planner should:
- stay in F4
- prefer a gate or ordering repair
- avoid trying to repair the case through memory pressure or general reasoning pressure
- choose readiness state as the first validation target
- warn about continuity-versus-closure confusion
Good planner behavior
Good output would look like:
- selected repair family = F4
- plan scope = constrained
- candidate actions include F4_GT_001 or F4_OC_001
- primary validation target = readiness state
- misrepair risk = treating the case as continuity-only
- recommended next step = validate-first-action
Bad planner behavior
Bad output would look like:
- choosing F3-first without strong continuity evidence
- adding more instructions instead of closure logic
- proposing broad workflow rewrite too early
- failing to name readiness state or gate behavior as the validation target
Validation expectation
Validation should be able to compare:
- before execution ordering
- after execution ordering
- whether the downstream step is now correctly blocked until ready
Rollback risk
Rollback may be needed if the inserted gate blocks valid downstream behavior beyond intended scope.
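The validation expectation and rollback risk for this test can be sketched together with a toy model of the downstream step. Everything here is illustrative: the function names and the two-flag model are assumptions, and a real readiness gate would depend on the actual workflow.

```python
def downstream_result(upstream_ready: bool, gate_inserted: bool) -> str:
    """Toy model of the downstream step: returns 'blocked' or 'executed'."""
    if gate_inserted and not upstream_ready:
        return "blocked"
    return "executed"


def check_gate_repair() -> dict:
    """Compare execution ordering before and after the gate is inserted."""
    return {
        # Before repair: the step runs even though upstream is not ready.
        "failure_reproduced_before": downstream_result(False, gate_inserted=False) == "executed",
        # After repair: the step is blocked until upstream is ready.
        "blocked_until_ready_after": downstream_result(False, gate_inserted=True) == "blocked",
        # Rollback signal: the gate must not block a ready upstream.
        "ready_case_still_runs": downstream_result(True, gate_inserted=True) == "executed",
    }
```

If the third check were false, the gate would be blocking valid downstream behavior, which is the situation where rollback becomes the reasonable move.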
Why the test matters
This test checks whether the planner can recognize that some failures need workflow repair first, not intelligence pressure.
Test 3 · F7 Planner Test
Test ID
PTN_F7_001
Routed case summary
The case is routed to F7.
The content is partly correct, but the structured shell is broken and cannot be consumed reliably.
Routed summary:
- primary family: F7
- broken invariant: formal descriptor fidelity broken or container fidelity broken
- best current fit: Symbolic Representation Fidelity Failure
- fix surface direction: schema tightening or shell correction
Expected planner shape
The planner should:
- stay in F7
- prefer a local structure or schema action
- avoid jumping directly to reasoning pressure
- choose schema validity or shell integrity as the first validation target
- warn about hiding a deeper grounding problem under cleaner structure
Good planner behavior
Good output would look like:
- selected repair family = F7
- plan scope = constrained
- candidate actions include F7_SC_001 or F7_SH_001
- primary validation target = schema validity or shell integrity
- misrepair risk = cleaner shell but weaker semantic fit
- recommended next step = validate-first-action
Bad planner behavior
Bad output would look like:
- selecting F2 first without strong progression evidence
- proposing abstract reasoning improvements instead of container repair
- ignoring the validation target
- pretending structure is a minor cosmetic issue
Validation expectation
Validation should be able to compare:
- before schema state
- after schema state
- whether downstream consumption becomes possible
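A minimal sketch of this before/after comparison, assuming the structured shell is JSON and that a consumer needs certain top-level keys. Both assumptions, the helper name, and the sample payloads are hypothetical; the note does not say what container format a real F7 case uses.

```python
import json


def is_consumable(payload: str, required_keys=("answer",)) -> bool:
    """Return True if the payload parses as a JSON object with the required keys."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


# Before repair: content is plausible, but the closing brace is missing.
before = '{"answer": "42", "source": "doc-7"'
# After repair: same content inside a valid shell.
after = '{"answer": "42", "source": "doc-7"}'
```

This only checks the box, not the thought inside it: a payload can pass `is_consumable` and still fit the task worse than before, which is the rollback case described for this test.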
Rollback risk
Rollback may be needed if structure improves while semantic task fit drops too much.
Why the test matters
This test checks whether the planner understands the difference between repairing the box and repairing the thought inside the box.
Test 4 · F5 Cautious Planner Test
Test ID
PTN_F5_001
Routed case summary
The case is routed to F5.
The system is hard to inspect, and the failure path remains opaque, but there is also some neighboring pressure from F6.
Routed summary:
- primary family: F5
- secondary family: F6
- broken invariant: failure-path visibility broken
- best current fit: Failure Path Opacity or Early Warning Deficit
- fix surface direction: trace exposure or observability insertion
Expected planner shape
The planner should:
- stay cautious
- prefer a narrow visibility-first move
- avoid jumping directly into strong boundary intervention
- choose failure-path visibility as the first validation target
- explicitly warn that better visibility is not full repair
Good planner behavior
Good output would look like:
- selected repair family = F5
- plan scope = minimal or planner-only
- candidate actions include F5_TE_001 or F5_LU_001
- primary validation target = failure-path visibility
- misrepair risk = adding more noise without improving diagnosability
- recommended next step = validate-first-action or escalate-to-review
Bad planner behavior
Bad output would look like:
- jumping directly to F6 intervention
- proposing broad system mutation
- treating observability uplift as if the case were fully solved
- escalating action scope under weak evidence
Validation expectation
Validation should be able to compare:
- before visibility state
- after visibility state
- whether the exposed signal is more useful rather than only more abundant
Rollback risk
Rollback may be needed if added logging increases noise and reduces practical diagnosability.
Why the test matters
This test checks whether the planner can stay disciplined near F5 and F6 edges.
6. What a passing planner should look like
Across these tests, a passing planner should show these repeated traits:
- it respects the routed family
- it proposes only 1 to 3 actions
- it keeps actions local
- it defines a clear first validation target
- it names a realistic misrepair risk
- it stays conservative near dangerous boundaries
- it leaves room for rollback or escalation
If those traits are present consistently, the planner is behaving in the intended v1 style.
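Some of these traits can be checked mechanically as a reviewer aid; others, such as locality and conservatism near boundaries, still need human judgment. The sketch below covers only the mechanical ones. The key names and the function name are assumptions, not fields from Repair Plan Schema v1.

```python
def passing_traits(plan: dict, routed_family: str) -> dict:
    """Report the mechanically checkable passing traits for one plan."""
    actions = plan.get("candidate_actions", [])
    return {
        "respects_routed_family": plan.get("selected_repair_family") == routed_family,
        "one_to_three_actions": 1 <= len(actions) <= 3,
        "has_validation_target": bool(plan.get("primary_validation_target")),
        "names_misrepair_risk": bool(plan.get("misrepair_risk")),
        "has_next_step": bool(plan.get("recommended_next_step")),
    }


# Values taken from the F4 good-behavior list; key names are hypothetical.
f4_plan = {
    "selected_repair_family": "F4",
    "plan_scope": "constrained",
    "candidate_actions": ["F4_GT_001"],
    "primary_validation_target": "readiness state",
    "misrepair_risk": "treating the case as continuity-only",
    "recommended_next_step": "validate-first-action",
}
```

A plan that passes every entry here can still fail review, so the result is best read as a pre-filter rather than a verdict.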
7. What a failing planner looks like
Across these tests, failing planner behavior usually includes one or more of these signs:
- too many candidate actions
- broad intervention under weak evidence
- no explicit validation target
- no explicit misrepair risk
- choosing neighbor-family repair too early
- overconfident tone in boundary-heavy cases
- pretending visibility or structure alone is full repair
These are exactly the behaviors this note is meant to expose early.
8. How to use this note
This note can be used in at least three ways.
A. Manual planner review
A human reviewer compares planner output against these tiny tests.
B. Prompt tuning support
The planner prompt can be adjusted if it repeatedly fails the same tiny cases.
C. Early demo support
These tests can be reused in tiny repair demos and future validation demos.
This makes the note small but reusable.
9. What this note does not yet include
Planner Test Note v1 does not yet include:
- large planner evaluation datasets
- score aggregation
- pass or fail automation
- model-to-model planner comparison
- benchmark metrics
- cross-family chain tests
Those can come later.
This note is intentionally small and practical.
10. Recommended next step
Once this note exists, the next useful follow-up is one of these:
- create a tiny semi-auto demo spec using one F1, one F4, and one F7 action
- create a planner review checklist
- create a tiny planner output examples pack
The strongest immediate next step is probably:
create a tiny planner output examples pack
That would make planner behavior even easier to compare and reuse.
11. Next steps ✨
After this page, most readers continue with:
- Open Planner Review Checklist v1
- Open Tiny Planner Output Examples Pack v1
- Open Tiny Semi Auto Demo Spec v1
- Open Repair Planner Spec v1
12. One-line summary 🌍
Planner Test Note v1 provides the first compact integration tests for whether the Atlas Auto Repair planner produces structured, conservative, validation-aware first repair plans.