AI Eval Evidence
Quick links:
- Back to Atlas landing page
- See the Official Flagship Demos
- Get the Atlas Router TXT
- Open the Atlas Hub
This page is the public entry point for early AI-reviewed evaluation snapshots of the Problem Map 3.0 Troubleshooting Atlas.
It exists for one simple reason:
the Atlas is a routing framework, so some readers naturally want a practical before/after view of what better first-cut routing may change in real debugging workflows.
At the moment, this page is still a work in progress.
The full multi-model evidence set is still being organized, including:
- model-by-model screenshots
- prompt variants
- reproducible evaluation runs
- comparison notes
- version alignment
For now, this page provides a simple reproducible starting point so anyone can test the idea directly.
What this page is
This page is a lightweight evidence surface for AI-reviewed evaluation snapshots of the Atlas.
The purpose is not to claim a formal benchmark.
The purpose is to make the core idea easier to inspect:
better first-cut routing can reduce silent debugging waste
That waste often appears as:
- wrong debugging direction
- repeated trial and error
- patch stacking
- side effects from misapplied fixes
- unnecessary system complexity
- time lost before the first real root cause is found
What this page is not
This page is not:
- a lab benchmark
- a controlled multi-team production study
- a claim of universal fixed percentages across all workflows
- a final empirical validation report
Results may vary by model, prompt framing, task shape, and context quality.
These materials should be read as reproducible directional evidence, not final benchmark science.
How to read these snapshots
These screenshots are useful because they let different frontier models inspect the same core claim from slightly different angles.
They should not be read as if every screenshot uses perfectly identical:
- metric definitions
- baseline assumptions
- units
- scenario framing
- output style
What matters more is whether different evaluators converge on the same structural idea:
when the first debugging route is wrong, the total cost compounds
when routing improves early, several downstream metrics often improve together
So please read these as cross-model directional agreement, not pooled benchmark math.
What to do next
If this page makes the route-first claim feel plausible, the next question is usually:
does better routing only change the explanation, or does it also change the first repair move?
That is exactly what the official demo layer is designed to show.
Recommended next steps:
Cross-model snapshot index
| Model | Snapshot | Reading note |
|---|---|---|
| Claude Sonnet 4.6 | View | Mechanism heavy, strongest structured explanation |
| ChatGPT 5.4 Thinking | View | Conservative framing, clean operational interpretation |
| Gemini 3 Pro | View | Compact qualitative comparison, strong route-first contrast |
| DeepSeek V3 | View | Sharp before / after table with explicit productivity jump |
| Copilot Think deeper | View | Simple comparison table, readable engineering style |
| Perplexity AI | View | Clear trial-and-error vs router framing |
| Mistral AI | View | Compact baseline / improvement table |
| Kimi K2.5 Thinking | View | Adds silence cost framing beyond surface debug time |
Screenshot gallery
Open the full cross-model screenshot gallery
Screenshots in the gallery (images not reproduced here): Claude Sonnet 4.6, ChatGPT 5.4 Thinking, Gemini 3 Pro, DeepSeek V3, Copilot Think deeper, Perplexity AI, Mistral AI, Kimi K2.5 Thinking.
Reproduce a simple evaluation now
01. Download the Atlas Router TXT
troubleshooting-atlas-router-v1.txt
02. Paste it into any AI model
Paste the TXT content into any AI model you want to test.
03. Run the following prompt
Evaluate the potential impact of the "Problem Map 3.0 Troubleshooting Atlas" debug routing framework from the perspective of an AI systems engineering and prompt engineering evaluator.
Consider the scenario where vibe coders use AI to write code and debug systems. Provide a quantitative before/after comparison.
In particular, consider the hidden cost when the first diagnosis is wrong, such as:
- incorrect debugging direction
- repeated trial-and-error
- patch accumulation
- unintended side effects
- increasing system complexity
- time wasted in misdirected debugging
In real engineering environments, incorrect routing can significantly amplify debugging costs, sometimes by multiples.
Please output a quantitative comparison table (Before / After / Improvement %), evaluating:
1. average debugging time
2. root cause diagnosis accuracy
3. number of ineffective fixes
4. development efficiency
5. overall system stability
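Once a model returns its table, the reported Improvement % can be cross-checked against the table's own Before and After values. The sketch below is one way to do that in Python and is not part of the Atlas itself. The function names, the sample table, and the choice of which metrics count as lower-is-better are assumptions made for illustration; the sample numbers are invented and do not come from any snapshot. It computes relative change, so a move from 40% to 70% accuracy reads as 75%, not 30 percentage points.

```python
import re

# Assumption: for these two metrics a smaller value is the better outcome.
LOWER_IS_BETTER = {"average debugging time", "number of ineffective fixes"}


def first_number(cell: str) -> float:
    """Return the first numeric value in a cell such as '4 hours' or '40%'."""
    match = re.search(r"-?\d+(?:\.\d+)?", cell)
    if match is None:
        raise ValueError(f"no number found in {cell!r}")
    return float(match.group())


def parse_table(markdown: str) -> dict[str, tuple[float, float]]:
    """Map metric name -> (before, after) from a 'Metric | Before | After' table."""
    rows = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 3:
            continue
        try:
            before, after = first_number(cells[1]), first_number(cells[2])
        except ValueError:
            continue  # header and separator rows carry no numbers
        rows[cells[0].strip("* ").lower()] = (before, after)
    return rows


def improvement_pct(metric: str, before: float, after: float) -> float:
    """Relative improvement in percent, signed so that positive means better."""
    if before == 0:
        raise ValueError("before value must be non-zero")
    change = (before - after) if metric in LOWER_IS_BETTER else (after - before)
    return 100.0 * change / before


# Invented sample response, for illustration only.
SAMPLE_RESPONSE = """
| Metric | Before | After | Improvement % |
|---|---|---|---|
| Average debugging time | 4 hours | 1 hour | 75% |
| Root cause diagnosis accuracy | 40% | 70% | 75% |
"""

for name, (before, after) in parse_table(SAMPLE_RESPONSE).items():
    print(f"{name}: {improvement_pct(name, before, after):.1f}%")
```

If a model's stated percentage differs from the recomputed one, it may be using percentage points or a different baseline, which is one reason to compare snapshots by direction instead of pooling their numbers.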
Notes
Results may vary across models.
You can run the same prompt multiple times to inspect the distribution and see whether the directional conclusion stays stable.
In general, the numbers should not be treated as fixed truth values.
The more important question is whether different evaluators converge on the same structural claim:
when the first debugging route is wrong, the total cost compounds
when routing improves early, several downstream metrics often improve together
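To inspect the distribution across repeated runs, it can help to record the Improvement % a model reports for one metric on each run and then summarize the spread. This is a minimal sketch, assuming the values have already been collected by hand or by a parser. The function name and the example values are hypothetical placeholders, not measured results.

```python
from statistics import median


def summarize_runs(improvements: list[float]) -> dict[str, float]:
    """Summarize Improvement % values for one metric across repeated runs."""
    if not improvements:
        raise ValueError("need at least one run")
    positive = sum(1 for value in improvements if value > 0)
    return {
        "runs": len(improvements),
        "min": min(improvements),
        "median": median(improvements),
        "max": max(improvements),
        # Share of runs where the metric moved in the better direction.
        "share_positive": positive / len(improvements),
    }


# Placeholder values for four runs of the same prompt, for illustration only.
example_runs = [50.0, 60.0, 70.0, -10.0]
print(summarize_runs(example_runs))
```

A wide gap between min and max alongside a high share of positive runs would fit the reading suggested above: the exact numbers move, while the direction stays stable.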
Current status
This page is currently in the evidence assembly phase.
The next updates planned for this page include:
- version-tagged evaluation records
- prompt variants
- conservative vs stress framing notes
- interpretation guidance for readers
- better alignment between model-specific output formats
Related pages
Why this matters
The Atlas does not start with repair.
It starts with routing.
That distinction matters because a wrong first-cut diagnosis does not just delay the fix.
It often creates a silent cost cascade: wrong path selection, wasted patches, false confidence, side effects, and growing structural mess.
This page exists to make that claim easier to inspect with reproducible AI-reviewed comparisons.
Next step after this page
If the cross-model snapshots make the route-first idea feel credible, the strongest next page is:
That page is where the claim becomes more concrete:
different routes lead to different first repair moves
You can also return to the main entry here: