WFGY/ProblemMap/Atlas/Fixes/community/benchmark-reruns/README.md
PSBigBig + MiniPS e58de5cd82
Update README.md
2026-03-17 19:04:38 +08:00

6.7 KiB

Community Benchmark Reruns 📊

Rerun packs, comparisons, and route-aware benchmark evidence

Quick links:


If the Community Fix Lab is the broader entry page for community repair assets, this folder is the rerun lane for compact benchmark-style evidence, before-and-after comparisons, and repeatable troubleshooting checks. 🧭

Use this folder when a contribution is mainly about showing a controlled comparison or rerun result, not when the contribution is mainly a notebook, JSON fixture, prompt pack, workflow recipe, or official atlas claim.

Short version:

official layer gives the repair grammar
this folder helps test that grammar in compact repeatable comparison settings


Quick start 🚀

I want to contribute a rerun

Use this path:

  1. decide whether the asset is really a rerun or comparison contribution
  2. route the case first with the atlas
  3. keep the rerun scoped to one family, one task, or one troubleshooting slice
  4. make the setup and comparison method explicit
  5. report the result with limitations, not hype

I want to browse rerun assets

Use this path:

  1. open one rerun with a clear target task or failure family
  2. inspect the rerun setup and variable control
  3. inspect the baseline behavior
  4. inspect the routed or repaired behavior
  5. check the result summary and method limits

Short version:

route first
keep the rerun controlled
make the comparison legible


Benchmark reruns quick map 🗂️

If your asset is mainly... Best folder
a controlled rerun or comparison slice Benchmark Reruns
a runnable notebook walkthrough Colab
a structured fixture or machine-readable case JSON
a route-aware prompt asset or repair prompt pack Prompts
a step-by-step repair sequence Workflows
a portable one-case bundle Reproduction Packs

This folder is the right place when the comparison itself is the main evidence surface.


What belongs here

Good rerun contributions include:

  • one small benchmark slice
  • one clear rerun protocol
  • one route-aware before and after comparison
  • one compact result table with method note
  • one reproducible troubleshooting benchmark example

A good rerun contribution should be:

  • scoped
  • method-aware
  • explicit about data source
  • explicit about limits
  • tied to atlas routing

What does not belong here 🚫

Please do not use this folder for:

  • unsupported score claims
  • screenshots with no method note
  • giant benchmark reports with no case framing
  • unclear comparisons with moving variables
  • claims that a rerun proves the whole atlas by itself
  • leaderboards with no explanation of what changed

A rerun asset should help someone inspect one comparison more clearly, not manufacture broad claims from thin evidence.


Suggested rerun pattern 🧩

A useful rerun contribution usually includes:

  1. target task or failure family
  2. rerun setup
  3. baseline behavior
  4. routed or repaired behavior
  5. compact result summary
  6. limitations

That is enough to make the rerun informative.


Suggested naming style 📌

Examples:

  • f1-grounding-rerun-v1.md
  • f5-trace-uplift-rerun-v1.md
  • f7-structured-output-rerun-v1.md

If code or notebooks are included, place them in a clearly named subfolder.

Keep names readable and compact.


What a good first rerun looks like 🌱

A strong first contribution usually looks like this:

  • one task
  • one failure family
  • one rerun setup
  • one baseline view
  • one routed or repaired view
  • one short result note with limits

Small, clean reruns are much better than giant noisy reports.


Before contributing 📚

Please read:

This helps keep rerun contributions aligned with atlas routing instead of drifting into vague benchmark theater.


Review standard

A rerun contribution is much more likely to be accepted if it is:

  • clearly named
  • easy to inspect
  • method-aware
  • connected to atlas routing
  • explicit about what changed
  • honest about limitations

Messy evidence is still messy.
Clean scoped reruns are more valuable.


Next steps

After this page, most contributors continue with:

  1. Back to Community Fix Lab
  2. Open Contribution Checklist
  3. Open the Flagship Runnable Demo Pack
  4. Back to Official Fixes

If you want the broader product surface:


One-line status 🌍

This folder holds community reruns that test atlas-guided troubleshooting in compact, repeatable benchmark-style settings.