vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig + MiniPS e58de5cd82

Update README.md

2026-03-17 19:04:38 +08:00


Community Benchmark Reruns 📊

Rerun packs, comparisons, and route-aware benchmark evidence

Quick links:

If the Community Fix Lab is the broader entry page for community repair assets, this folder is the rerun lane for compact benchmark-style evidence, before-and-after comparisons, and repeatable troubleshooting checks. 🧭

Use this folder when the main evidence in a contribution is a controlled comparison or rerun result. If the contribution is mainly a notebook, JSON fixture, prompt pack, workflow recipe, or official atlas claim, it belongs in a different folder.

Short version:

the official layer gives the repair grammar
this folder helps test that grammar in compact, repeatable comparison settings

Quick start 🚀

I want to contribute a rerun

Use this path:

decide whether the asset is really a rerun or comparison contribution
route the case first with the atlas
keep the rerun scoped to one family, one task, or one troubleshooting slice
make the setup and comparison method explicit
report the result with limitations, not hype

I want to browse rerun assets

Use this path:

open one rerun with a clear target task or failure family
inspect the rerun setup and variable control
inspect the baseline behavior
inspect the routed or repaired behavior
check the result summary and method limits

Short version:

route first
keep the rerun controlled
make the comparison legible ✨

Benchmark reruns quick map 🗂️

If your asset is mainly...	Best folder
a controlled rerun or comparison slice	Benchmark Reruns
a runnable notebook walkthrough	Colab
a structured fixture or machine-readable case	JSON
a route-aware prompt asset or repair prompt pack	Prompts
a step-by-step repair sequence	Workflows
a portable one-case bundle	Reproduction Packs

This folder is the right place when the comparison itself is the main evidence surface.

What belongs here ✅

Good rerun contributions include:

one small benchmark slice
one clear rerun protocol
one route-aware before-and-after comparison
one compact result table with a method note
one reproducible troubleshooting benchmark example

A good rerun contribution should be:

scoped
method-aware
explicit about data source
explicit about limits
tied to atlas routing

What does not belong here 🚫

Please do not use this folder for:

unsupported score claims
screenshots with no method note
giant benchmark reports with no case framing
unclear comparisons with moving variables
claims that a rerun proves the whole atlas by itself
leaderboards with no explanation of what changed

A rerun asset should help someone inspect one comparison more clearly, not manufacture broad claims from thin evidence.
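The "moving variables" problem listed above can be checked mechanically before a rerun is submitted. The sketch below is only an illustration, not part of any official tooling: it assumes the baseline and repaired setups are each written down as a flat dictionary, and the function name `changed_variables` and the example keys are hypothetical.

```python
def changed_variables(baseline: dict, repaired: dict) -> list:
    """Return the sorted names of settings that differ between two setups."""
    missing = object()  # sentinel so a key absent on one side counts as a change
    all_keys = baseline.keys() | repaired.keys()
    return sorted(
        key for key in all_keys
        if baseline.get(key, missing) != repaired.get(key, missing)
    )


# Hypothetical setups: only the prompt is meant to change between runs.
baseline_setup = {"model": "model-a", "temperature": 0.0, "prompt": "baseline"}
repaired_setup = {"model": "model-a", "temperature": 0.0, "prompt": "routed"}

diff = changed_variables(baseline_setup, repaired_setup)
if len(diff) != 1:
    print("more than one variable moved:", diff)
else:
    print("controlled comparison, changed variable:", diff[0])
```

If the list has more than one entry, the comparison has moving variables and the result note should either say so or the rerun should be narrowed.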

Suggested rerun pattern 🧩

A useful rerun contribution usually includes:

target task or failure family
rerun setup
baseline behavior
routed or repaired behavior
compact result summary
limitations

That is enough to make the rerun informative.
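As one possible way to keep those six parts together, here is a minimal Python sketch that models a rerun report and renders a skeleton for it. The class name `RerunReport`, the field names, and the use of `##` headings in the output are all assumptions made for illustration; the folder does not prescribe a template format.

```python
from dataclasses import dataclass, fields

# Section titles taken from the suggested pattern; the keys are hypothetical field names.
TITLES = {
    "target": "Target task or failure family",
    "setup": "Rerun setup",
    "baseline": "Baseline behavior",
    "repaired": "Routed or repaired behavior",
    "summary": "Compact result summary",
    "limitations": "Limitations",
}


@dataclass
class RerunReport:
    target: str = ""
    setup: str = ""
    baseline: str = ""
    repaired: str = ""
    summary: str = ""
    limitations: str = ""


def missing_sections(report: RerunReport) -> list:
    """List the titles of sections that are still empty."""
    return [TITLES[f.name] for f in fields(report) if not getattr(report, f.name).strip()]


def render(report: RerunReport) -> str:
    """Render the report as text, marking empty sections with TODO."""
    parts = []
    for f in fields(report):
        body = getattr(report, f.name).strip() or "TODO"
        parts.append(f"## {TITLES[f.name]}\n\n{body}\n")
    return "\n".join(parts)


draft = RerunReport(target="one grounding task", setup="same model, same inputs")
print(missing_sections(draft))
print(render(draft))
```

Running `missing_sections` before submitting is a cheap way to confirm that the limitations and baseline sections were not skipped.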

Suggested naming style 📌

Examples:

f1-grounding-rerun-v1.md
f5-trace-uplift-rerun-v1.md
f7-structured-output-rerun-v1.md

If code or notebooks are included, place them in a clearly named subfolder.

Keep names readable and compact.
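For contributors who want to check a file name before opening a pull request, the sketch below encodes the shape of the three examples as a regular expression. The pattern is inferred from those examples only and is an assumption, not a documented rule; in particular, reading the leading `f` number as a family number and the function name `parse_rerun_name` are hypothetical.

```python
import re

# Assumed shape, inferred from the examples: f<number>-<slug>-rerun-v<version>.md
RERUN_NAME = re.compile(
    r"f(?P<family>\d+)-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)-rerun-v(?P<version>\d+)\.md"
)


def parse_rerun_name(filename: str):
    """Return the parts of a rerun file name, or None if it does not fit the assumed shape."""
    match = RERUN_NAME.fullmatch(filename)
    if match is None:
        return None
    return {
        "family": int(match["family"]),
        "slug": match["slug"],
        "version": int(match["version"]),
    }


for name in ["f1-grounding-rerun-v1.md", "f5-trace-uplift-rerun-v1.md", "Final Results (2).md"]:
    print(name, "->", parse_rerun_name(name))
```

A name that returns `None` is not necessarily wrong, since the examples are suggestions, but it is a hint that the name may be harder to scan next to the others.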

What a good first rerun looks like 🌱

A strong first contribution usually looks like this:

one task
one failure family
one rerun setup
one baseline view
one routed or repaired view
one short result note with limits

Small, clean reruns are much better than giant noisy reports.

Before contributing 📚

Please read:

This helps keep rerun contributions aligned with atlas routing instead of drifting into vague benchmark theater.

Review standard ✅

A rerun contribution is much more likely to be accepted if it is:

clearly named
easy to inspect
method-aware
connected to atlas routing
explicit about what changed
honest about limitations

Messy evidence stays messy, however much of it there is.
Clean, scoped reruns are more valuable.

Next steps ✨

After this page, most contributors continue with:

If you want the broader product surface:

One-line status 🌍

This folder holds community reruns that test atlas-guided troubleshooting in compact, repeatable benchmark-style settings.
