
WFGY vs GPT-5 — The Logic Duel Begins

📦 Official WFGY benchmark snapshot on Zenodo: DOI

“GPT-5 is the future?
We benchmark the future — as a plug-in, not a rival.”


WFGY benchmark outperforms GPT-5

Introduction

WFGY is a symbiotic reasoning layer: the stronger the host model, the larger the lift.
Here we attach it to GPT-4o and GPT-5 using either a PDF pipeline or the TXT OS interface.
No fine-tuning, no prompt voodoo — only symbolic constraints and traceable logic.


Why Only MMLU Philosophy?

  1. Most fragile domain: long-range abstraction, easy hallucinations.
  2. Tests reasoning, not memory: pure inference, zero trivia.
  3. Downstream proxy: survive philosophy, and you survive policy, ethics, law.

Replicating the run (clearing answer column + re-run) takes ≈ 1 hour on any model with WFGY attached.
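The replication loop above (clear the answer column, refill it with the model attached, re-score) can be sketched as follows. This is a minimal stdlib-only illustration; the row layout and the field names (`question`, `model_answer`, `correct`) are assumptions, not the actual XLSX headers.

```python
# Sketch of the replication loop: clear answers, refill, re-score.
# Field names below are illustrative, not the real sheet headers.

def clear_answers(rows):
    """Blank the model-answer field so the run starts fresh."""
    return [{**row, "model_answer": ""} for row in rows]

def score(rows):
    """Accuracy (%) once the answer column has been refilled."""
    correct = sum(1 for r in rows if r["model_answer"] == r["correct"])
    return 100.0 * correct / len(rows)

# Two stand-in rows in place of the 80-question sheet.
rows = [
    {"question": "Q1", "model_answer": "B", "correct": "B"},
    {"question": "Q2", "model_answer": "C", "correct": "A"},
]
fresh = clear_answers(rows)
assert all(r["model_answer"] == "" for r in fresh)
```

In the real run, the refill step is the model (with WFGY attached) answering each cleared question; only scoring and bookkeeping are shown here.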


Benchmark Result

| Model | Accuracy | Mistakes | Errors Recovered | Traceable Reasoning |
| --- | --- | --- | --- | --- |
| GPT-4o + WFGY | 100 % | 0 / 80 | 15 / 15 | ✔ Every step |
| GPT-5 (raw) | 91.25 % | 7 / 80 | n/a | ✘ None |
| GPT-4o (raw) | 81.25 % | 15 / 80 | n/a | ✘ None |

Rule of thumb: raw model ↑ → WFGY lift ↑.
When GPT-6 drops, we repeat — same files, same rules.


How WFGY Patches Reasoning Gaps

Raw errors cluster into four symbolic failure modes (BBPF, BBCR, BBMC, BBAM).
WFGY applies ΔS control, entropy modulation, and path-symmetry enforcement to neutralise each mode.
Full taxonomy in the paper.


Download the Evidence

Verify every claim yourself:

  • WFGY-enhanced answers (80/80 correct) → ./philosophy_80_wfgy_gpt4o.xlsx
  • GPT-5 raw answers (7 mistakes) → ./philosophy_80_gpt5_raw.xlsx
  • GPT-4o raw answers (15 mistakes) → ./philosophy_80_gpt4o_raw.xlsx
  • Error-by-error comparison (markdown) → ./philosophy_error_comparison.md
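The error-by-error comparison reduces to diffing each answer sheet against the key. A minimal sketch, assuming each XLSX has been loaded as a `{question_id: answer}` mapping; the small dicts below stand in for the files listed above.

```python
# Toy stand-ins for the answer sheets; real data comes from the XLSX files.
key      = {"Q1": "A", "Q2": "B", "Q3": "C"}
wfgy     = {"Q1": "A", "Q2": "B", "Q3": "C"}   # no mistakes
gpt5_raw = {"Q1": "A", "Q2": "D", "Q3": "C"}   # one mistake

def mistakes(answers, key):
    """Question IDs where the model's answer diverges from the key."""
    return sorted(q for q, a in answers.items() if a != key[q])

print(mistakes(wfgy, key))      # []
print(mistakes(gpt5_raw, key))  # ['Q2']
```

The markdown comparison file is this diff written out question by question.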

Next: GPT-5 + WFGY

  • Run same 80 Qs with GPT-5 + WFGY (ETA < 24 h)
  • Publish side-by-side diff & Zenodo snapshot
  • Expect further gap widening — stronger host, stronger lift

Reproducibility Promise

Open XLSX, open code, open math.
No closed weights, no hidden prompts — only audit-ready logic.


This isn't a leaderboard.
It's a reasoning audit — and WFGY is the auditor.


🧭 Explore More

| Module | Description | Link |
| --- | --- | --- |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone. Star WFGY on GitHub.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow