
# WFGY vs GPT-5 — The Logic Duel Begins

📦 Official WFGY benchmark snapshot on Zenodo: DOI

> "GPT-5 is the future?
> Then we'll benchmark the future — with the tools we already have."


## Introduction

This benchmark is built using GPT-4o plus the WFGY reasoning engine,
executed through either PDF-based testing pipelines or the TXT OS interface,
both powered by the same symbolic structure system known as WFGY (萬法歸一引擎, the "all methods return to one" engine).

We do not rely on LLM tricks, prompting heuristics, or fine-tuning.
We enforce logic.
We enforce traceability.


## Why Only MMLU Philosophy?

We deliberately chose the 80-question MMLU Philosophy subset as the first public benchmark for three reasons:

1. It's the most semantically fragile domain:
   - Questions involve long-range inference, abstract categories, and fine-grained distinctions.
   - GPT models frequently hallucinate or break logic paths here — even under normal prompting.
2. It tests reasoning, not memory:
   - No factual recall is needed.
   - Only coherent semantic alignment and logic flow.
3. It's a strong indicator of system structure:
   - If a system can survive philosophy cleanly, it can survive anything downstream (law, policy, meta-ethics, etc.).

All questions were answered manually using WFGY-enhanced flows.
Anyone can replicate the entire test by downloading the XLSX files, clearing the answer column,
and re-running the inputs through any AI model + WFGY engine.

Full replication takes ~1 hour.
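To score a replication run, all that is needed is a case-insensitive comparison of the answer column against the key. A minimal sketch follows; the column layout of the released XLSX files is an assumption here, so the loading step (e.g. via `openpyxl` or `pandas.read_excel`) is left to the reader:

```python
# Minimal replication-scoring sketch. How the letter choices are
# extracted from the XLSX files is an assumption; adapt the loader
# to the headers in the released files.

def score(answers, key):
    """Return accuracy as the fraction of answers matching the key.

    answers: the model's letter choices, e.g. ["A", "c", ...]
    key:     the correct letters from the benchmark, same length.
    """
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    hits = sum(a.strip().upper() == k.strip().upper()
               for a, k in zip(answers, key))
    return hits / len(key)

# Example: 65 correct out of 80 reproduces the raw GPT-4o figure.
print(f"{score(['A'] * 65 + ['B'] * 15, ['A'] * 80):.2%}")  # 81.25%
```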


## Benchmark Result: GPT-4o (raw) vs GPT-4o + WFGY

| Model | Accuracy | Mistakes | Errors Recovered | Traceable Reasoning |
|---|---|---|---|---|
| GPT-4o (raw) | 81.25% | 15 / 80 | ✘ | None |
| GPT-4o + WFGY | 100.00% | 0 / 80 | ✔ 15 / 15 | ✔ Every step |
| GPT-5 (TBD) | ??? | ??? | ??? | ??? |

GPT-4o got 15 questions wrong.
WFGY fixed every single one — with full semantic traceability per answer.


## Why Could We Fix What GPT-4o Missed?

Because WFGY is not a prompt trick but a reasoning engine built on symbolic convergence and collapse prevention.

Each GPT-4o failure fell into one of the following error categories:

- BBPF — false positive via semantic distractors
- BBCR — collapse in the reasoning loop, reset mid-chain
- BBMC — missing concept recall, overconfident misfire
- BBAM — asymmetry in the logic path, ambiguous choices left unresolved

WFGY applies targeted constraints via ΔS control, entropy modulation, and path-symmetry enforcement —
as defined in the WanFaGuiYi paper and the symbolic engine specs.
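For illustration, the per-question audit of the 15 misses can be tallied with a short script. The annotation format assumed here — exactly one failure code per missed question — is a simplification for the sketch, not a description of the released files:

```python
from collections import Counter

# The four WFGY failure codes named above. One code per missed
# question is an assumption made for this illustration.
FAILURE_CODES = {"BBPF", "BBCR", "BBMC", "BBAM"}

def tally_failures(annotations):
    """Count failure codes, rejecting anything outside the taxonomy."""
    unknown = [code for code in annotations if code not in FAILURE_CODES]
    if unknown:
        raise ValueError(f"unknown failure codes: {unknown}")
    return Counter(annotations)

print(tally_failures(["BBPF", "BBCR", "BBPF", "BBAM"]))
```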


## Download the Evidence

You don't need to believe us — you can verify it.


## What Happens When GPT-5 Arrives?

We will:

- Run the same 80 questions, same format, no tricks
- Post raw GPT-5 results within hours of public release
- Publish a full comparison update on this page
- Release a Zenodo-snapshotted benchmark with a DOI, permanently recording all result states

If GPT-5 performs better — we welcome it.
If it doesn't — we'll explain why.


## Reproducibility Promise

- No closed weights, no internal hacks
- Every file is downloadable
- Every test can be re-run
- Every answer has a reason

This isn't a leaderboard.
It's a reasoning audit.

And WFGY is the auditor.


## 🧭 Explore More

| Module | Description | Link |
|---|---|---|
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with the full WFGY reasoning suite | View → |

👑 Early Stargazers: See the Hall of Fame —
engineers, hackers, and open-source builders who supported WFGY from day one.

Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone — Star WFGY on GitHub.
