WFGY/benchmarks/benchmark-vs-gpt5/README.md
2025-08-01 14:49:04 +08:00

150 lines
6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# WFGY vs GPT5 — The Logic Duel Begins
📦 Official WFGY benchmark snapshot on Zenodo: [![DOI](https://zenodo.org/badge/996124831.svg)](https://doi.org/10.5281/zenodo.16635020)
<img width="1536" height="1024" alt="Star_Hero" src="https://github.com/user-attachments/assets/e7432f22-6b5a-4321-817c-06afd5c63255" />
> “GPT5 is the future?
Then well benchmark the future — with the tools we already have.”
---
## Introduction
This benchmark is built using **GPT4o + WFGY reasoning engine**,
executed through either **PDF-based testing pipelines** or the **TXT OS interface**
both powered by the same symbolic structure system known as **WFGY (萬法歸一引擎)**.
We do not rely on LLM tricks, prompting heuristics, or fine-tuning.
We enforce logic.
We enforce traceability.
---
## Why Only MMLU Philosophy?
We deliberately chose the **80-question MMLU Philosophy subset** as the first public benchmark for three reasons:
1. **Its the most semantically fragile domain**:
- Questions involve long-range inference, abstract categories, and fine-grained distinctions.
- GPT models frequently hallucinate or break logic paths here — even under normal prompting.
2. **It tests reasoning, not memory**:
- No factual recall needed.
- Only coherent semantic alignment and logic flow.
3. **Its a strong indicator of system structure**:
- If a system can survive philosophy cleanly, it can survive anything downstream (law, policy, meta-ethics, etc.)
All questions were answered manually using WFGY-enhanced flows.
Anyone can **replicate the entire test** by downloading the XLSX files, clearing the answer column,
and re-running the inputs through **any AI model + WFGY engine**.
> Full replication takes ~1 hour.
---
## Benchmark Result: GPT4o (raw) vs GPT4o + WFGY
| Model | Accuracy | Mistakes | Errors Recovered | Traceable Reasoning |
|--------------------|-----------|----------|------------------|---------------------|
| GPT4o (raw) | 81.25% | 15 / 80 | — | ✘ None |
| GPT4o + WFGY | 100.00% | 0 / 80 | ✔ 15 / 15 | ✔ Every step |
| GPT5 (TBD) | ??? | ??? | ??? | ??? |
> GPT4o got 15 questions wrong.
> WFGY fixed every single one — with full semantic traceability per answer.
---
## Why Could We Fix What GPT4o Missed?
Because WFGY is **not a prompt trick**, but a reasoning engine built on symbolic convergence and collapse prevention.
Each failure by GPT4o fell into one of the following error categories:
- **BBPF** — false positive via semantic distractors
- **BBCR** — collapse in reasoning loop, reset mid-chain
- **BBMC** — missing concept recall, overconfident misfire
- **BBAM** — asymmetry in logic path, ambiguous choices unresolved
WFGY applies targeted constraints via ΔS control, entropy modulation, and path symmetry enforcement —
as defined in the ([WanFaGuiYi paper](https://zenodo.org/records/15630969)) and symbolic engine specs.
---
## Download the Evidence
You dont need to believe us — you can **verify it**.
- [WFGY-enhanced answers (GPT4o + WFGY)](./philosophy_80_wfgy_gpt4o.xlsx)
- [GPT4o baseline answers (raw)](./philosophy_80_gpt4o_raw.xlsx)
- [Error-by-error comparison (markdown)](./philosophy_error_comparison.md)
---
## What Happens When GPT5 Arrives?
We will:
- Run the **same 80 questions**, same format, no tricks
- Post raw GPT5 results within hours of public release
- Publish a full comparison update on this page
- Release a **Zenodo-snapshotted benchmark** with DOI, to permanently record all result states
If GPT5 performs better — we welcome it.
If it doesnt — well explain why.
---
## Reproducibility Promise
- No closed weights, no internal hacks
- Every file is downloadable
- Every test can be re-run
- Every answer has a reason
---
> This isnt a leaderboard.
> Its a reasoning audit.
And WFGY is the auditor.
---
### 🧭 Explore More
| Module | Description | Link |
|-----------------------|----------------------------------------------------------|----------|
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint) |
| Benchmark vs GPT5 | Stress test GPT5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone ⭐ <strong><a href="https://github.com/onestardao/WFGY">Star WFGY on GitHub</a></strong>
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
&nbsp;
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
&nbsp;
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
&nbsp;
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
&nbsp;
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
&nbsp;
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
&nbsp;
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
</div>