vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig + MiniPS 96509e7bf9

Update ai-eval-evidence.md

2026-03-15 17:29:43 +08:00

AI Eval Evidence

This page is the public evidence entry point for early AI-reviewed evaluation snapshots of the Problem Map 3.0 Troubleshooting Atlas.

It exists for one simple reason:

the Atlas is a routing framework, so some readers naturally want a practical before / after view of what better first-cut routing may change in real debugging workflows.

At the moment, this page is still a work in progress.

The full multi-model evidence set is still being organized, including:

model-by-model screenshots
prompt variants
reproducible evaluation runs
comparison notes
version alignment

For now, this page provides a simple reproducible starting point so anyone can test the idea directly.

What this page is

This page is a lightweight evidence surface for AI-reviewed evaluation snapshots of the Atlas.

The purpose is not to claim a formal benchmark.

The purpose is to make the core idea easier to inspect:

better first-cut routing can reduce silent debugging waste

That waste often appears as:

wrong debugging direction
repeated trial and error
patch stacking
side effects from misapplied fixes
unnecessary system complexity
time lost before the first real root cause is found

What this page is not

This page is not:

a lab benchmark
a controlled multi-team production study
a claim of universal fixed percentages across all workflows
a final empirical validation report

Results may vary by model, prompt framing, task shape, and context quality.

These materials should be read as reproducible directional evidence, not final benchmark science.

How to read these snapshots

These screenshots are useful because they let different frontier models inspect the same core claim from slightly different angles.

They should not be read as if every screenshot uses perfectly identical:

metric definitions
baseline assumptions
units
scenario framing
output style

What matters more is whether different evaluators converge on the same structural idea:

when the first debugging route is wrong, the total cost compounds
when routing improves early, several downstream metrics often improve together

So please read these as cross-model directional agreement, not as pooled benchmark math.
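One way to check directional agreement without pooling numbers is to reduce each evaluator's before / after pair to a sign and compare only the signs. The following is a minimal sketch, not part of the Atlas itself: the function names, the `LOWER_IS_BETTER` set, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: for these two metrics a lower value is the better outcome.
LOWER_IS_BETTER = {"average debugging time", "number of ineffective fixes"}

Snapshot = Dict[str, Tuple[float, float]]  # metric -> (before, after)


def direction(metric: str, before: float, after: float) -> int:
    """Return +1 if the metric improved, -1 if it worsened, 0 if unchanged."""
    delta = after - before
    if delta == 0:
        return 0
    improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
    return 1 if improved else -1


def directional_agreement(snapshots: Dict[str, Snapshot]) -> Dict[str, bool]:
    """For each metric, report whether every evaluator saw an improvement.

    Only the sign of the change is compared, so evaluators may use
    different units or baselines without distorting the result.
    """
    per_metric: Dict[str, List[int]] = {}
    for metrics in snapshots.values():
        for metric, (before, after) in metrics.items():
            per_metric.setdefault(metric, []).append(direction(metric, before, after))
    return {metric: all(d == 1 for d in dirs) for metric, dirs in per_metric.items()}


# Hypothetical values: evaluator_a reports hours and percent,
# evaluator_b reports minutes and a 0-1 fraction.
example = {
    "evaluator_a": {
        "average debugging time": (10.0, 6.0),
        "root cause diagnosis accuracy": (50.0, 75.0),
    },
    "evaluator_b": {
        "average debugging time": (120.0, 90.0),
        "root cause diagnosis accuracy": (0.60, 0.55),
    },
}
print(directional_agreement(example))
# {'average debugging time': True, 'root cause diagnosis accuracy': False}
```

Because magnitudes are never averaged, mismatched units and baselines cannot produce a misleading pooled figure. The trade-off is that the check says nothing about effect size.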

What to do next

If this page makes the route-first claim feel plausible, the next question is usually:

does better routing only change the explanation, or does it also change the first repair move?

That is exactly what the official demo layer is designed to show.

Cross-model snapshot index

Model	Snapshot	Reading note
Claude Sonnet 4.6	View	Mechanism heavy, strongest structured explanation
ChatGPT 5.4 Thinking	View	Conservative framing, clean operational interpretation
Gemini 3 Pro	View	Compact qualitative comparison, strong route-first contrast
DeepSeek V3	View	Sharp before / after table with explicit productivity jump
Copilot Think deeper	View	Simple comparison table, readable engineering style
Perplexity AI	View	Clear trial-and-error vs. router framing
Mistral AI	View	Compact baseline / improvement table
Kimi K2.5 Thinking	View	Adds silent-cost framing beyond surface debug time

Screenshot gallery

Open the full cross-model screenshot gallery

[Screenshot gallery, one image per model: Claude Sonnet 4.6, ChatGPT 5.4 Thinking, Gemini 3 Pro, DeepSeek V3, Copilot Think deeper, Perplexity AI, Mistral AI, Kimi K2.5 Thinking]

Reproduce a simple evaluation now

01. Download the Atlas Router TXT

troubleshooting-atlas-router-v1.txt

02. Paste it into any AI model

Paste the TXT content into any AI model you want to test.

03. Run the following prompt

Evaluate the potential impact of the "Problem Map 3.0 Troubleshooting Atlas" debug routing framework from the perspective of an AI systems engineering and prompt engineering evaluator.

Consider the scenario where vibe coders use AI to write code and debug systems. Provide a quantitative before/after comparison.

In particular, consider the hidden cost when the first diagnosis is wrong, such as:
- incorrect debugging direction
- repeated trial-and-error
- patch accumulation
- unintended side effects
- increasing system complexity
- time wasted in misdirected debugging

In real engineering environments, incorrect routing can significantly amplify debugging costs, sometimes by multiples.

Please output a quantitative comparison table (Before / After / Improvement %), evaluating:

1. average debugging time
2. root cause diagnosis accuracy
3. number of ineffective fixes
4. development efficiency
5. overall system stability
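For readers who prefer to script steps 01 to 03 instead of pasting by hand, the flow can be sketched as below. This is only an illustration under stated assumptions: `ask_model` is a hypothetical callable that you supply for whichever model you are testing, and joining the router text and the prompt into a single message is one possible reading of "paste, then run the prompt". Only the file name comes from step 01.

```python
from pathlib import Path
from typing import Callable, List


def load_router(path: str = "troubleshooting-atlas-router-v1.txt") -> str:
    """Read the downloaded Atlas Router TXT from disk."""
    return Path(path).read_text(encoding="utf-8")


def build_input(router_text: str, prompt: str) -> str:
    """Join the router text and the evaluation prompt into one message.

    Assumption: the model receives both in a single message, separated
    by a blank line. Sending them as two turns is an equally valid setup.
    """
    return f"{router_text.strip()}\n\n{prompt.strip()}"


def run_trials(
    ask_model: Callable[[str], str],
    router_text: str,
    prompt: str,
    n_runs: int = 3,
) -> List[str]:
    """Send the same input n_runs times and collect the raw replies."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    message = build_input(router_text, prompt)
    return [ask_model(message) for _ in range(n_runs)]
```

A typical call would look like `run_trials(my_client, load_router(), prompt_text, n_runs=5)`, where `my_client` wraps the chat interface or SDK of the model under test. Keeping the model call behind a plain callable means the same script can be pointed at different models without changes.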

Notes

Results may vary across models.

You can run the same prompt multiple times to inspect the distribution and see whether the directional conclusion stays stable.

In general, the numbers should not be treated as fixed truth values.

The more important question is whether different evaluators converge on the same structural claim:

when the first debugging route is wrong, the total cost compounds
when routing improves early, several downstream metrics often improve together
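To inspect the distribution across repeated runs, the improvement column of each reply can be collected and summarized per metric. The sketch below assumes the model returns a pipe-delimited table whose last cell in each data row holds the improvement percentage. Real outputs vary in format, so the parser will likely need adjusting. All names and sample numbers are hypothetical.

```python
import re
import statistics
from typing import Dict, List

# Matches a signed or unsigned percentage such as "40%", "+12.5 %" or "-8%".
PERCENT = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*%")


def improvement_by_metric(table_text: str) -> Dict[str, float]:
    """Extract {metric: improvement %} from a pipe-delimited table.

    Assumes data rows look like `| metric | before | after | improvement |`.
    Header and separator rows are skipped because their last cell
    contains no numeric percentage.
    """
    result: Dict[str, float] = {}
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 4:
            continue
        match = PERCENT.search(cells[-1])
        if match:
            result[cells[0].lower()] = float(match.group(1))
    return result


def summarize_runs(outputs: List[str]) -> Dict[str, Dict[str, float]]:
    """Summarize the spread of improvement percentages over several runs."""
    collected: Dict[str, List[float]] = {}
    for text in outputs:
        for metric, pct in improvement_by_metric(text).items():
            collected.setdefault(metric, []).append(pct)
    return {
        metric: {
            "runs": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for metric, values in collected.items()
    }


# Hypothetical replies from three runs of the same prompt.
HEADER = "| Metric | Before | After | Improvement % |\n|---|---|---|---|\n"
runs = [
    HEADER + "| Average debugging time | 10 h | 6 h | 40% |",
    HEADER + "| Average debugging time | 8 h | 6 h | 25% |",
    HEADER + "| Average debugging time | 10 h | 7 h | 30% |",
]
print(summarize_runs(runs))
# {'average debugging time': {'runs': 3, 'min': 25.0, 'median': 30.0, 'max': 40.0}}
```

Reporting the minimum, median, and maximum keeps the spread visible instead of collapsing it into a single average, so the figures are not mistaken for fixed truth values.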

Current status

This page is currently in the evidence assembly phase.

The next updates planned for this page include:

version-tagged evaluation records
prompt variants
conservative vs stress framing notes
interpretation guidance for readers
better alignment between model-specific output formats

Why this matters

The Atlas does not start with repair.

It starts with routing.

That distinction matters because a wrong first-cut diagnosis does not just delay the fix.

It often creates a silent cost cascade: wrong path selection, wasted patches, false confidence, side effects, and growing structural mess.

This page exists to make that claim easier to inspect with reproducible AI-reviewed comparisons.

Next step after this page

If the cross-model snapshots make the route-first idea feel credible, the strongest next page is:

Official Flagship Demos

That page is where the claim becomes more concrete:

different routes lead to different first repair moves

You can also return to the main entry here:

Problem Map 3.0 Troubleshooting Atlas
