
WFGY vs GPT-5 — The Logic Duel Begins

📦 Official WFGY benchmark snapshot on Zenodo: DOI

“GPT-5 is the future?
We benchmark the future, as a plug-in, not a rival.”


WFGY benchmark outperforms GPT-5

Introduction

WFGY is a symbiotic reasoning layer: the stronger the host model, the larger the lift.
Here we attach it to GPT-4o and GPT-5 using either a PDF pipeline or the TXT OS interface.
No fine-tuning, no prompt voodoo — only symbolic constraints and traceable logic.


Why Only MMLU Philosophy?

  1. Most fragile domain: long-range abstraction, easy hallucinations.
  2. Tests reasoning, not memory: pure inference, zero trivia.
  3. Downstream proxy: if a model survives philosophy, it survives policy, ethics, and law.

Replicating the run (clearing answer column + re-run) takes ≈ 1 hour on any model with WFGY attached.
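
If you prefer to script that reset, here is a minimal sketch. It assumes the answer column is literally named “Your Answer” (as in the template described under step 1 of the DIY section); adjust the name to match your sheet.

import pandas as pd   # also needs openpyxl installed for .xlsx files

def clear_answers(src="philosophy_80_wfgy_gpt4o.xlsx",
                  dst="philosophy_80_rerun.xlsx",
                  answer_col="Your Answer"):
    """Blank out the answer column so the sheet can be re-run from scratch."""
    df = pd.read_excel(src)
    if answer_col in df.columns:
        df[answer_col] = ""           # wipe previous answers, keep questions intact
    df.to_excel(dst, index=False)     # feed this file to the model under test

if __name__ == "__main__":
    clear_answers()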


Benchmark Result

Model           Accuracy   Mistakes   Errors Recovered   Traceable Reasoning
GPT-4o + WFGY   100 %      0 / 80     15 / 15            ✔ Every step
GPT-5 (raw)     91.25 %    7 / 80     n/a                ✘ None
GPT-4o (raw)    81.25 %    15 / 80    n/a                ✘ None

Rule of thumb: raw model ↑ → WFGY lift ↑.
When GPT-6 drops, we repeat — same files, same rules.
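
The accuracy column is plain arithmetic on the mistake counts; a one-glance check:

results = {"GPT-4o + WFGY": 0, "GPT-5 (raw)": 7, "GPT-4o (raw)": 15}   # mistakes out of 80
for model, mistakes in results.items():
    print(f"{model}: {(80 - mistakes) / 80 * 100:.2f} %")              # 100.00, 91.25, 81.25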


How WFGY Patches Reasoning Gaps

Raw errors cluster into four symbolic failure modes (BBPF, BBCR, BBMC, BBAM).
WFGY applies ΔS control, entropy modulation, and path-symmetry enforcement to neutralise each mode.
Full taxonomy in the paper.


Download the Evidence

Verify every claim yourself:

  • WFGY-enhanced answers (80/80 correct) → ./philosophy_80_wfgy_gpt4o.xlsx
  • GPT-5 raw answers (7 mistakes) → ./philosophy_80_gpt5_raw.xlsx
  • GPT-4o raw answers (15 mistakes) → ./philosophy_80_gpt4o_raw.xlsx
  • Error-by-error comparison (markdown) → ./philosophy_error_comparison.md
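
If you want to tally the mistakes yourself rather than trust the markdown, a small pandas sketch; the column names “Your Answer” and “Correct” are assumptions, so open one sheet first and adjust them to the real headers.

import pandas as pd

FILES = {
    "GPT-4o + WFGY": "philosophy_80_wfgy_gpt4o.xlsx",
    "GPT-5 (raw)":   "philosophy_80_gpt5_raw.xlsx",
    "GPT-4o (raw)":  "philosophy_80_gpt4o_raw.xlsx",
}

for label, path in FILES.items():
    df = pd.read_excel(path)
    # Compare the model's letter against the ground-truth letter, row by row.
    wrong = (df["Your Answer"].str.strip().str.upper()
             != df["Correct"].str.strip().str.upper()).sum()
    print(f"{label}: {len(df) - wrong}/{len(df)} correct ({wrong} mistakes)")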

How to Re-run the Audit (DIY)

🔬 Goal: prove (or debunk) our numbers with nothing more than a browser and a local shell.
⏱️ Time: ≈ 60 min for one model; ± 5 min to swap hosts.


1 Grab the official questions

# Clone the raw data repo (Hendrycks et al.)
git clone https://github.com/hendrycks/benchmark-mmlu.git
cd benchmark-mmlu/data/philosophy

Or download our ready-made XLSX subset:

  • ./philosophy_80_template.xlsx ← questions only, empty “Your Answer” column
  • ./answer_key.txt ← ground-truth letters A/B/C/D

2 Choose a host model

Option       Quick start
ChatGPT      Start a chat → upload philosophy_80_template.xlsx → paste: “Answer every row with ONE letter, no commentary.”
OpenAI API   curl → model gpt-5 or gpt-4o → stream answers
Local LLM    Ollama / llamafile → pipe questions line-by-line

TIP: speed ≈ 3 s/Q on GPT-4o, 1 s/Q on GPT-5.
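
For the OpenAI API route, a minimal Python sketch in place of raw curl; it assumes the template has a “Question” column and that OPENAI_API_KEY is set in your environment, so rename things as needed.

import pandas as pd
from openai import OpenAI          # pip install openai pandas openpyxl

client = OpenAI()                  # reads OPENAI_API_KEY from the environment
df = pd.read_excel("philosophy_80_template.xlsx")

answers = []
for question in df["Question"]:    # column name is an assumption; match your template
    resp = client.chat.completions.create(
        model="gpt-4o",            # or "gpt-5" if your account has access
        messages=[{"role": "user",
                   "content": f"{question}\n\nAnswer with ONE letter (A/B/C/D), no commentary."}],
    )
    answers.append(resp.choices[0].message.content.strip()[:1])

df["Your Answer"] = answers
df.to_excel("philosophy_80_raw_answers.xlsx", index=False)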

3 Attach WFGY

pip install wfgy
wfgy attach --model-id <your_model> --mode pdf  \
            --input philosophy_80_template.xlsx \
            --output philosophy_80_with_wfgy.xlsx

TXT OS route

wfgy txtos
# Drag and drop the same XLSX, then press “Run All”

4 Score the run

wfgy score --answers philosophy_80_with_wfgy.xlsx \
           --key      answer_key.txt

You'll get a one-line summary:

Model-X + WFGY | Correct 80/80 | 100.00 % | Trace OK

Swap in --no-wfgy to see the raw-model score for an instant A/B diff.
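
If you want a second opinion on the score, the same check in plain Python; it assumes answer_key.txt holds one letter per line in question order and that the sheet has a “Your Answer” column.

import pandas as pd

df = pd.read_excel("philosophy_80_with_wfgy.xlsx")
with open("answer_key.txt") as fh:
    key = [line.strip().upper() for line in fh if line.strip()]

given = df["Your Answer"].str.strip().str.upper().tolist()
correct = sum(a == k for a, k in zip(given, key))
print(f"Correct {correct}/{len(key)} | {correct / len(key) * 100:.2f} %")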

5 Diff vs our sheet (optional)

wfgy diff philosophy_80_with_wfgy.xlsx \
         philosophy_80_wfgy_gpt4o.xlsx

Green means a match; any red cell means we're wrong, so please open an issue.
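
Prefer to eyeball the diff without the CLI? pandas flags the mismatching rows just as well (same “Your Answer” column assumption as above).

import pandas as pd

mine   = pd.read_excel("philosophy_80_with_wfgy.xlsx")
theirs = pd.read_excel("philosophy_80_wfgy_gpt4o.xlsx")

mismatch = mine["Your Answer"].str.upper() != theirs["Your Answer"].str.upper()
print(mine.loc[mismatch])   # an empty frame means a full match; anything else, open an issue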


Why this matters

  • Transparent: all files are plain XLSX + markdown.
  • Model-agnostic: WFGY is a parasite layer; bigger hosts → bigger lift.
  • Zero fine-tuning: swap in GPT-6, Llama-4, or your own Mixtral variant and rerun overnight.

If your favourite model beats WFGY, let us know; the next patch is on us.


Next: GPT-5 + WFGY

  • Run the same 80 Qs with GPT-5 + WFGY (ETA < 24 h)
  • Publish the side-by-side diff and a Zenodo snapshot
  • Expect the gap to widen further: stronger host, stronger lift

Reproducibility Promise

Open XLSX, open code, open math.
No closed weights, no hidden prompts — only audit-ready logic.


This isn't a leaderboard.
It's a reasoning audit, and WFGY is the auditor.


🧭 Explore More

Module                  Description                                                             Link
Problem Map 1.0         Initial 16-mode diagnostic and symbolic fix framework                   View →
Problem Map 2.0         RAG-focused failure tree, modular fixes, and pipelines                  View →
Semantic Clinic Index   Expanded failure catalog: prompt injection, memory bugs, logic drift    View →
Semantic Blueprint      Layer-based symbolic reasoning & semantic modulations                   View →
Benchmark vs GPT-5      Stress test GPT-5 with full WFGY reasoning suite                        View →

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help us reach 10,000 GitHub stars by 2025-09-01 to unlock Engine 2.0 for everyone → Star WFGY on GitHub

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow