WFGY vs GPT-5: The Logic Duel Begins

📦 Official WFGY benchmark snapshot on Zenodo: DOI

“GPT-5 is the future?
Then we'll benchmark the future with the tools we already have.”


![WFGY benchmark outperforms GPT-5](gpt5_vs_wfgy_benchmark_20250808.png)

Introduction

This benchmark was built using GPT-4o plus the WFGY reasoning engine,
executed through either PDF-based testing pipelines or the TXT OS interface,
both powered by the same symbolic structure system known as WFGY (萬法歸一引擎, the WanFaGuiYi engine).

We do not rely on LLM tricks, prompting heuristics, or fine-tuning.
We enforce logic.
We enforce traceability.


Why Only MMLU Philosophy?

We deliberately chose the 80-question MMLU Philosophy subset as the first public benchmark for three reasons:

  1. It's the most semantically fragile domain:

    • Questions involve long-range inference, abstract categories, and fine-grained distinctions.
    • GPT models frequently hallucinate or break logic paths here — even under normal prompting.
  2. It tests reasoning, not memory:

    • No factual recall needed.
    • Only coherent semantic alignment and logic flow.
  3. It's a strong indicator of system structure:

    • If a system can survive philosophy cleanly, it can survive anything downstream (law, policy, meta-ethics, etc.).

All questions were answered manually using WFGY-enhanced flows.
Anyone can replicate the entire test by downloading the XLSX files, clearing the answer column,
and re-running the inputs through any AI model + WFGY engine.

Full replication takes ~1 hour.
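
For concreteness, a minimal replication harness might look like the sketch below. The column names (`question`, `answer`) and the `ask_model` wrapper are assumptions for illustration, not the actual XLSX layout or official tooling; adapt them to the real files and to whatever model + WFGY pipeline you use.

```python
# Minimal replication sketch (assumes pandas and openpyxl are installed).
# The "question"/"answer" column names and ask_model() are placeholders;
# wire them to the real XLSX layout and your model of choice.
import pandas as pd

def ask_model(question: str) -> str:
    """Send one question through your model + WFGY pipeline; return its answer."""
    raise NotImplementedError("connect this to the model of your choice")

df = pd.read_excel("philosophy_80_gpt4o_raw.xlsx")
gold = df["answer"].copy()    # keep the reference answers for scoring
df["answer"] = ""             # clear the answer column, per the protocol

df["answer"] = df["question"].map(ask_model)

correct = (df["answer"] == gold).sum()
print(f"accuracy: {correct / len(df):.2%} ({correct} / {len(df)})")
```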


Benchmark Result: GPT-4o (raw) vs GPT-4o + WFGY vs GPT-5 (raw)

| Model | Accuracy | Mistakes | Errors Recovered | Traceable Reasoning |
|---|---|---|---|---|
| GPT-4o (raw) | 81.25% | 15 / 80 | ✘ | None |
| GPT-4o + WFGY | 100.00% | 0 / 80 | ✔ 15 / 15 | ✔ Every step |
| GPT-5 (raw) | 91.25% | 7 / 80 | ✘ | None |

GPT-4o got 15 questions wrong.
WFGY fixed every single one, with full semantic traceability per answer.
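
The accuracy figures follow directly from the mistake counts over the 80-question set; a quick check:

```python
# Accuracy implied by each model's mistake count on the 80-question set.
for model, wrong in [("GPT-4o (raw)", 15), ("GPT-4o + WFGY", 0), ("GPT-5 (raw)", 7)]:
    print(f"{model}: {(80 - wrong) / 80:.2%}")  # 81.25%, 100.00%, 91.25%
```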


Why Could We Fix What GPT-4o Missed?

Because WFGY is not a prompt trick, but a reasoning engine built on symbolic convergence and collapse prevention.

Each failure by GPT-4o fell into one of the following error categories:

  • BBPF — false positive via semantic distractors
  • BBCR — collapse in reasoning loop, reset mid-chain
  • BBMC — missing concept recall, overconfident misfire
  • BBAM — asymmetry in logic path, ambiguous choices unresolved

WFGY applies targeted constraints via ΔS control, entropy modulation, and path-symmetry enforcement,
as defined in the WanFaGuiYi paper and the symbolic engine specs.
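
The exact operators live in the paper; purely as an illustration of the idea, a collapse guard in the spirit of ΔS control might look like the sketch below. Here ΔS is taken to be cosine distance between consecutive reasoning-step embeddings, and both the embedding source and the 0.6 threshold are invented for this example, not taken from the WFGY spec.

```python
# Illustrative ΔS guard: flag a reasoning chain when the semantic drift
# between consecutive steps spikes (a BBCR-style collapse signal).
# The cosine-distance definition of ΔS and the 0.6 threshold are
# assumptions for this sketch, not the WFGY spec's actual operators.
import numpy as np

def delta_s(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic divergence between two step embeddings (cosine distance)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def flag_collapses(steps: list[np.ndarray], threshold: float = 0.6) -> list[int]:
    """Indices of steps whose divergence from the previous step exceeds the threshold."""
    return [
        i for i in range(1, len(steps))
        if delta_s(steps[i - 1], steps[i]) > threshold
    ]
```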


Download the Evidence

You don't need to believe us; every file in this directory can be downloaded and checked:

  • philosophy_80_gpt4o_raw.xlsx (GPT-4o, raw)
  • philosophy_80_gpt5_raw.xlsx (GPT-5, raw)
  • philosophy_80_wfgy_gpt4o.xlsx (GPT-4o + WFGY)
  • philosophy_error_comparison.md (per-question error comparison)
  • gpt5_vs_wfgy_benchmark_20250808.png (result chart)


What Happens Now That GPT-5 Has Arrived?

We have already run the same 80 questions on GPT-5 (raw).
Next steps:

  • Run GPT-5 + WFGY with identical settings
  • Publish the comparison update (ETA < 24 h)
  • Snapshot the new results to Zenodo with DOI

Reproducibility Promise

  • No closed weights, no internal hacks
  • Every file is downloadable
  • Every test can be re-run
  • Every answer has a reason

This isn't a leaderboard.
It's a reasoning audit.

And WFGY is the auditor.


🧭 Explore More

| Module | Description | Link |
|---|---|---|
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with the full WFGY reasoning suite | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone: Star WFGY on GitHub

WFGY Main · TXT OS · Blah · Blot · Bloc · Blur · Blow