
WFGY vs GPT-5 — The Logic Duel Begins

📦 Official WFGY benchmark snapshot on Zenodo: DOI

“GPT-5 is the future?
We benchmark the future, as a plug-in, not a rival.”


WFGY benchmark outperforms GPT-5

Introduction

WFGY is a symbiotic reasoning layer: the stronger the host model, the larger the lift.
Here we attach it to GPT-4o and GPT-5 using either a PDF pipeline or the TXT OS interface.
No fine-tuning, no prompt voodoo — only symbolic constraints and traceable logic.


Why Only MMLU Philosophy?

  1. Most fragile domain: long-range abstraction, easy hallucinations.
  2. Tests reasoning, not memory: pure inference, zero trivia.
  3. Downstream proxy: if a model survives philosophy, it survives policy, ethics, and law.

Replicating the run (clearing answer column + re-run) takes ≈ 1 hour on any model with WFGY attached.
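
If you prefer to script that reset, here is a minimal sketch. It assumes the answer column is literally named “Your Answer” (as in the template described under step 1 of the DIY section); adjust the name to match your sheet.

import pandas as pd   # also needs openpyxl installed for .xlsx files

def clear_answers(src="philosophy_80_wfgy_gpt4o.xlsx",
                  dst="philosophy_80_rerun.xlsx",
                  answer_col="Your Answer"):
    """Blank out the answer column so the sheet can be re-run from scratch."""
    df = pd.read_excel(src)
    if answer_col in df.columns:
        df[answer_col] = ""           # wipe previous answers, keep questions intact
    df.to_excel(dst, index=False)     # feed this file to the model under test

if __name__ == "__main__":
    clear_answers()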


Benchmark Result

Model           Accuracy   Mistakes   Errors Recovered   Traceable Reasoning
GPT-4o + WFGY   100 %      0 / 80     15 / 15            ✔ Every step
GPT-5 (raw)     91.25 %    7 / 80     n/a                ✘ None
GPT-4o (raw)    81.25 %    15 / 80    n/a                ✘ None

Rule of thumb: raw model ↑ → WFGY lift ↑.
When GPT-6 drops, we repeat — same files, same rules.
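
The accuracy column is plain arithmetic on the mistake counts; a one-glance check:

results = {"GPT-4o + WFGY": 0, "GPT-5 (raw)": 7, "GPT-4o (raw)": 15}   # mistakes out of 80
for model, mistakes in results.items():
    print(f"{model}: {(80 - mistakes) / 80 * 100:.2f} %")              # 100.00, 91.25, 81.25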


How WFGY Patches Reasoning Gaps

Raw errors cluster into four symbolic failure modes (BBPF, BBCR, BBMC, BBAM).
WFGY applies ΔS control, entropy modulation, and path-symmetry enforcement to neutralise each mode.
Full taxonomy in the paper.


Download the Evidence

Verify every claim yourself:

  • WFGY-enhanced answers (80/80 correct) → ./philosophy_80_wfgy_gpt4o.xlsx
  • GPT-5 raw answers (7 mistakes) → ./philosophy_80_gpt5_raw.xlsx
  • GPT-4o raw answers (15 mistakes) → ./philosophy_80_gpt4o_raw.xlsx
  • Error-by-error comparison (markdown) → ./philosophy_error_comparison.md
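
If you want to tally the mistakes yourself rather than trust the markdown, a small pandas sketch; the column names “Your Answer” and “Correct” are assumptions, so open one sheet first and adjust them to the real headers.

import pandas as pd

FILES = {
    "GPT-4o + WFGY": "philosophy_80_wfgy_gpt4o.xlsx",
    "GPT-5 (raw)":   "philosophy_80_gpt5_raw.xlsx",
    "GPT-4o (raw)":  "philosophy_80_gpt4o_raw.xlsx",
}

for label, path in FILES.items():
    df = pd.read_excel(path)
    # Compare the model's letter against the ground-truth letter, row by row.
    wrong = (df["Your Answer"].str.strip().str.upper()
             != df["Correct"].str.strip().str.upper()).sum()
    print(f"{label}: {len(df) - wrong}/{len(df)} correct ({wrong} mistakes)")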

How to Re-run the Audit (DIY)

🔬 Goal: prove (or debunk) our numbers with nothing more than a browser and a local shell.
⏱️ Time: ≈ 60 min for one model; ± 5 min to swap hosts.


1 Grab the official questions

# Clone the raw data repo (Hendrycks et al.)
git clone https://github.com/hendrycks/benchmark-mmlu.git
cd benchmark-mmlu/data/philosophy

Or download our ready-made XLSX subset:

  • ./philosophy_80_template.xlsx ← questions only, empty “Your Answer” column
  • ./answer_key.txt ← ground-truth letters A/B/C/D

2 Choose a host model

Option       Quick start
ChatGPT      Start a chat → upload philosophy_80_template.xlsx → paste: “Answer every row with ONE letter, no commentary.”
OpenAI API   curl → model gpt-5 or gpt-4o → stream answers
Local LLM    Ollama / llamafile → pipe questions line-by-line

TIP: speed ≈ 3 s/Q on GPT-4o, 1 s/Q on GPT-5.
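
For the OpenAI API route, a minimal Python sketch in place of raw curl; it assumes the template has a “Question” column and that OPENAI_API_KEY is set in your environment, so rename things as needed.

import pandas as pd
from openai import OpenAI          # pip install openai pandas openpyxl

client = OpenAI()                  # reads OPENAI_API_KEY from the environment
df = pd.read_excel("philosophy_80_template.xlsx")

answers = []
for question in df["Question"]:    # column name is an assumption; match your template
    resp = client.chat.completions.create(
        model="gpt-4o",            # or "gpt-5" if your account has access
        messages=[{"role": "user",
                   "content": f"{question}\n\nAnswer with ONE letter (A/B/C/D), no commentary."}],
    )
    answers.append(resp.choices[0].message.content.strip()[:1])

df["Your Answer"] = answers
df.to_excel("philosophy_80_raw_answers.xlsx", index=False)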

3 Attach WFGY

pip install wfgy
wfgy attach --model-id <your_model> --mode pdf  \
            --input philosophy_80_template.xlsx \
            --output philosophy_80_with_wfgy.xlsx

TXT OS route

wfgy txtos
# Drag and drop the same XLSX, then press “Run All”

4 Score the run

wfgy score --answers philosophy_80_with_wfgy.xlsx \
           --key      answer_key.txt

You'll get a one-line summary:

Model-X + WFGY | Correct 80/80 | 100.00 % | Trace OK

Swap in --no-wfgy to see the raw-model score for an instant A/B diff.
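
If you want a second opinion on the score, the same check in plain Python; it assumes answer_key.txt holds one letter per line in question order and that the sheet has a “Your Answer” column.

import pandas as pd

df = pd.read_excel("philosophy_80_with_wfgy.xlsx")
with open("answer_key.txt") as fh:
    key = [line.strip().upper() for line in fh if line.strip()]

given = df["Your Answer"].str.strip().str.upper().tolist()
correct = sum(a == k for a, k in zip(given, key))
print(f"Correct {correct}/{len(key)} | {correct / len(key) * 100:.2f} %")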

5 Diff vs our sheet (optional)

wfgy diff philosophy_80_with_wfgy.xlsx \
         philosophy_80_wfgy_gpt4o.xlsx

Green means a match; any red cell means we're wrong, so please open an issue.
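
Prefer to eyeball the diff without the CLI? pandas flags the mismatching rows just as well (same “Your Answer” column assumption as above).

import pandas as pd

mine   = pd.read_excel("philosophy_80_with_wfgy.xlsx")
theirs = pd.read_excel("philosophy_80_wfgy_gpt4o.xlsx")

mismatch = mine["Your Answer"].str.upper() != theirs["Your Answer"].str.upper()
print(mine.loc[mismatch])   # an empty frame means a full match; anything else, open an issue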


Why this matters

  • Transparent: all files are plain XLSX + markdown.
  • Model-agnostic: WFGY is a parasite layer; bigger hosts → bigger lift.
  • Zero fine-tuning: swap in GPT-6, Llama-4, or your own Mixtral variant and rerun overnight.

If your favourite model beats WFGY, let us know; the next patch is on us.


Next: GPT-5 + WFGY

  • Run the same 80 Qs with GPT-5 + WFGY (ETA < 24 h)
  • Publish the side-by-side diff and a Zenodo snapshot
  • Expect the gap to widen further: stronger host, stronger lift

Reproducibility Promise

Open XLSX, open code, open math.
No closed weights, no hidden prompts — only audit-ready logic.


This isn't a leaderboard.
It's a reasoning audit, and WFGY is the auditor.


🧭 Explore More

Module                  Description                                                             Link
Problem Map 1.0         Initial 16-mode diagnostic and symbolic fix framework                   View →
Problem Map 2.0         RAG-focused failure tree, modular fixes, and pipelines                  View →
Semantic Clinic Index   Expanded failure catalog: prompt injection, memory bugs, logic drift    View →
Semantic Blueprint      Layer-based symbolic reasoning & semantic modulations                   View →
Benchmark vs GPT-5      Stress test GPT-5 with full WFGY reasoning suite                        View →

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help us reach 10,000 GitHub stars by 2025-09-01 to unlock Engine 2.0 for everyone → Star WFGY on GitHub

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow