WFGY/ProblemMap/Atlas/ai-eval-evidence.md
2026-03-15 17:29:43 +08:00

11 KiB

AI Eval Evidence

Quick links:


This page is the public evidence entry for early AI reviewed evaluation snapshots of Problem Map 3.0 Troubleshooting Atlas.

It exists for one simple reason:

the Atlas is a routing framework, so some readers naturally want to see a practical before / after view of what better first cut routing may change in real debugging workflows.

At the moment, this page is still a work in progress.

The full multi model evidence set is still being organized, including:

  • model by model screenshots
  • prompt variants
  • reproducible evaluation runs
  • comparison notes
  • version alignment

For now, this page provides a simple reproducible starting point so anyone can test the idea directly.


What this page is

This page is a lightweight evidence surface for AI reviewed evaluation snapshots of the Atlas.

The purpose is not to claim a formal benchmark.

The purpose is to make the core idea easier to inspect:

better first cut routing can reduce silent debugging waste

That waste often appears as:

  • wrong debugging direction
  • repeated trial and error
  • patch stacking
  • side effects from misapplied fixes
  • unnecessary system complexity
  • time lost before the first real root cause is found

What this page is not

This page is not:

  • a lab benchmark
  • a controlled multi team production study
  • a claim of universal fixed percentages across all workflows
  • a final empirical validation report

Results may vary by model, prompt framing, task shape, and context quality.

These materials should be read as reproducible directional evidence, not final benchmark science.


How to read these snapshots

These screenshots are useful because they let different frontier models inspect the same core claim from slightly different angles.

They should not be read as if every screenshot uses perfectly identical:

  • metric definitions
  • baseline assumptions
  • units
  • scenario framing
  • output style

What matters more is whether different evaluators converge on the same structural idea:

when the first debugging route is wrong, the total cost compounds
when routing improves early, several downstream metrics often improve together

So please read these as cross model directional agreement, not pooled benchmark math.


What to do next

If this page makes the route-first claim feel plausible, the next question is usually:

does better routing only change the explanation, or does it also change the first repair move?

That is exactly what the official demo layer is designed to show.

Recommended next steps:

  1. See the Official Flagship Demos
  2. Go back to the Atlas landing page
  3. Try the Atlas Router TXT yourself

Cross model snapshot index

Model Snapshot Reading note
Claude Sonnet 4.6 View Mechanism heavy, strongest structured explanation
ChatGPT 5.4 Thinking View Conservative framing, clean operational interpretation
Gemini 3 Pro View Compact qualitative comparison, strong route first contrast
DeepSeek V3 View Sharp before / after table with explicit productivity jump
Copilot Think deeper View Simple comparison table, readable engineering style
Perplexity AI View Clear trial and error vs router framing
Mistral AI View Compact baseline / improvement table
Kimi K2.5 Thinking View Adds silence cost framing beyond surface debug time

Open the full cross model screenshot gallery

Click any image to open the full size version.

Claude Sonnet 4.6

Claude Sonnet 4.6 evaluation snapshot

ChatGPT 5.4 Thinking

ChatGPT 5.4 Thinking evaluation snapshot

Gemini 3 Pro

Gemini 3 Pro evaluation snapshot

DeepSeek V3

DeepSeek V3 evaluation snapshot

Copilot Think deeper

Copilot Think deeper evaluation snapshot

Perplexity AI

Perplexity AI evaluation snapshot

Mistral AI

Mistral AI evaluation snapshot

Kimi K2.5 Thinking

Kimi K2.5 Thinking evaluation snapshot

Reproduce a simple evaluation now

01. Download the Atlas Router TXT

troubleshooting-atlas-router-v1.txt

02. Paste it into any AI model

Paste the TXT content into any AI model you want to test.

03. Run the following prompt

Evaluate the potential impact of the "Problem Map 3.0 Troubleshooting Atlas" debug routing framework from the perspective of an AI systems engineering and prompt engineering evaluator.

Consider the scenario where vibe coders use AI to write code and debug systems. Provide a quantitative before/after comparison.

In particular, consider the hidden cost when the first diagnosis is wrong, such as:
- incorrect debugging direction
- repeated trial-and-error
- patch accumulation
- unintended side effects
- increasing system complexity
- time wasted in misdirected debugging

In real engineering environments, incorrect routing can significantly amplify debugging costs, sometimes by multiples.

Please output a quantitative comparison table (Before / After / Improvement %), evaluating:

1. average debugging time
2. root cause diagnosis accuracy
3. number of ineffective fixes
4. development efficiency
5. overall system stability

Notes

Results may vary across models.

You can run the same prompt multiple times to inspect the distribution and see whether the directional conclusion stays stable.

In general, the numbers should not be treated as fixed truth values.

The more important question is whether different evaluators converge on the same structural claim:

when the first debugging route is wrong, the total cost compounds when routing improves early, several downstream metrics often improve together


Current status

This page is currently in the evidence assembly phase.

The next updates planned for this page include:

  • version tagged evaluation records
  • prompt variants
  • conservative vs stress framing notes
  • interpretation guidance for readers
  • better alignment between model specific output formats


Why this matters

The Atlas does not start with repair.

It starts with routing.

That distinction matters because wrong first cut diagnosis does not just delay the fix.

It often creates a silent cost cascade: wrong path selection, wasted patches, false confidence, side effects, and growing structural mess.

This page exists to make that claim easier to inspect with reproducible AI reviewed comparisons.


Next step after this page

If the cross-model snapshots make the route-first idea feel credible, the strongest next page is:

Official Flagship Demos

That page is where the claim becomes more concrete:

different routes lead to different first repair moves

You can also return to the main entry here:

Problem Map 3.0 Troubleshooting Atlas