Community Benchmark Reruns 📊
Rerun packs, comparisons, and route-aware benchmark evidence
Quick links:
- Back to Community Fix Lab
- Back to Official Fixes
- Back to Fixes Hub
- Back to Atlas landing page
- Back to AI Eval Evidence
- Back to Atlas Hub
- Open the Flagship Runnable Demo Pack
- Open Templates
- Get the Atlas Router TXT
Where the Community Fix Lab is the broader entry page for community repair assets, this folder is the rerun lane: compact benchmark-style evidence, before-and-after comparisons, and repeatable troubleshooting checks. 🧭
Use this folder when a contribution is mainly about showing a controlled comparison or rerun result, not when the contribution is mainly a notebook, JSON fixture, prompt pack, workflow recipe, or official atlas claim.
Short version:
- the official layer gives the repair grammar
- this folder tests that grammar in compact, repeatable comparison settings
Quick start 🚀
I want to contribute a rerun
Use this path:
- decide whether the asset is really a rerun or comparison contribution
- route the case first with the atlas
- keep the rerun scoped to one family, one task, or one troubleshooting slice
- make the setup and comparison method explicit
- report the result with limitations, not hype
I want to browse rerun assets
Use this path:
- open one rerun with a clear target task or failure family
- inspect the rerun setup and variable control
- inspect the baseline behavior
- inspect the routed or repaired behavior
- check the result summary and method limits
Short version:
- route first
- keep the rerun controlled
- make the comparison legible ✨
Benchmark reruns quick map 🗂️
| If your asset is mainly... | Best folder |
|---|---|
| a controlled rerun or comparison slice | Benchmark Reruns |
| a runnable notebook walkthrough | Colab |
| a structured fixture or machine-readable case | JSON |
| a route-aware prompt asset or repair prompt pack | Prompts |
| a step-by-step repair sequence | Workflows |
| a portable one-case bundle | Reproduction Packs |
This folder is the right place when the comparison itself is the main evidence surface.
What belongs here ✅
Good rerun contributions include:
- one small benchmark slice
- one clear rerun protocol
- one route-aware before and after comparison
- one compact result table with method note
- one reproducible troubleshooting benchmark example
A good rerun contribution should be:
- scoped
- method-aware
- explicit about data source
- explicit about limits
- tied to atlas routing
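As a sketch of the "compact result table with method note" item above, a minimal table might look like this (the numbers are placeholders and the `atlas f1` label is illustrative, not a real result):

```markdown
| condition         | pass rate | n  |
|-------------------|-----------|----|
| baseline          | 8 / 20    | 20 |
| routed (atlas f1) | 15 / 20   | 20 |

Method note: same prompts, same model version, same decoding settings;
only the routing step changed. Scores are from a single rerun, not averaged.
```

The method note is what makes the table evidence rather than a bare score claim.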
What does not belong here 🚫
Please do not use this folder for:
- unsupported score claims
- screenshots with no method note
- giant benchmark reports with no case framing
- unclear comparisons with moving variables
- claims that a rerun proves the whole atlas by itself
- leaderboards with no explanation of what changed
A rerun asset should help someone inspect one comparison more clearly, not manufacture broad claims from thin evidence.
Suggested rerun pattern 🧩
A useful rerun contribution usually includes:
- target task or failure family
- rerun setup
- baseline behavior
- routed or repaired behavior
- compact result summary
- limitations
That is enough to make the rerun informative.
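The pattern above can be sketched as a tiny rerun harness. This is a minimal, self-contained illustration, not atlas tooling: the case IDs, condition names, and canned pass/fail outcomes are all hypothetical stand-ins for real model reruns.

```python
# Hypothetical rerun harness sketch: one fixed case slice, two conditions,
# one compact result summary. In a real rerun, RESULTS would come from
# re-running the model with everything else held constant.

CASES = ["c1", "c2", "c3", "c4"]

# Canned pass/fail outcomes per condition (placeholder data).
RESULTS = {
    "baseline": {"c1": False, "c2": True, "c3": False, "c4": True},
    "routed":   {"c1": True,  "c2": True, "c3": False, "c4": True},
}

def pass_rate(condition: str) -> float:
    """Fraction of cases passing under one condition, over the same slice."""
    outcomes = RESULTS[condition]
    return sum(outcomes[c] for c in CASES) / len(CASES)

def summary() -> str:
    """Compact result table comparing both conditions on identical cases."""
    lines = ["condition   pass_rate  n"]
    for cond in ("baseline", "routed"):
        lines.append(f"{cond:<10} {pass_rate(cond):>9.2f}  {len(CASES)}")
    return "\n".join(lines)

print(summary())
```

The point of the sketch is the shape: same cases in both conditions, one variable changed, and a summary small enough to inspect by eye.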
Suggested naming style 📌
Examples:
- f1-grounding-rerun-v1.md
- f5-trace-uplift-rerun-v1.md
- f7-structured-output-rerun-v1.md
If code or notebooks are included, place them in a clearly named subfolder.
Keep names readable and compact.
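As a layout sketch, a rerun that ships a small script might look like this (the names are illustrative, following the filename examples above):

```
f5-trace-uplift-rerun-v1.md
f5-trace-uplift-rerun-v1/
  rerun.py
  cases.json
```

The markdown file stays the main evidence surface; the subfolder holds anything needed to repeat the comparison.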
What a good first rerun looks like 🌱
A strong first contribution usually looks like this:
- one task
- one failure family
- one rerun setup
- one baseline view
- one routed or repaired view
- one short result note with limits
Small, clean reruns are much better than giant noisy reports.
Before contributing 📚
Please read the Contribution Checklist and the Community Fix Lab guidance first.
This helps keep rerun contributions aligned with atlas routing instead of drifting into vague benchmark theater.
Review standard ✅
A rerun contribution is much more likely to be accepted if it is:
- clearly named
- easy to inspect
- method-aware
- connected to atlas routing
- explicit about what changed
- honest about limitations
Messy evidence is still messy at any scale.
Clean, scoped reruns are far more valuable.
Next steps ✨
After this page, most contributors continue with:
- Back to Community Fix Lab
- Open Contribution Checklist
- Open the Flagship Runnable Demo Pack
- Back to Official Fixes
If you want the broader product surface, start from the Atlas landing page or the Atlas Hub.
One-line status 🌍
This folder holds community reruns that test atlas-guided troubleshooting in compact, repeatable benchmark-style settings.