WFGY/benchmarks/benchmark-vs-gpt5/README.md
2025-08-11 14:42:39 +08:00

📌 WFGY vs GPT-5 — The Logic Duel Begins

WFGY Family 🪱 is the parasite pack for LLMs. It latches onto any model and grows as the host grows.
Your LLM gets stronger, we get stronger. No retraining, no settings, no updates.
Every release in the family works the same way — the WFGY PDF is just one of them.

🪱 Parasite Principle — How it works

Think of any LLM as a giant host organism 🧠.
Normally, to make it smarter, you need to change the host itself — retrain, fine-tune, or patch.

WFGY Family is different: it lives outside the host.
It hooks into the reasoning process, corrects mistakes in real time, and strengthens the host's logic without touching its parameters.

  • 🪱 Attach → works with any LLM you point it at
  • 📈 Scale → host gets stronger, parasite benefits instantly
  • No decay → never needs retraining or updates

Result: the host evolves, the parasite evolves — and your reasoning scores jump without lifting a finger.

Upload the WFGY PDF (Zenodo DOI) to GPT-5 and paste the prompt.
No fine-tuning, no hidden configs, no hype. Just reproducible logic.


🗺️ Quick-Sim vs Full-MMLU: what's the difference?

| Section | Goal | Dataset | Runtime | How to reproduce |
| --- | --- | --- | --- | --- |
| A. Quick Simulation (below) | Fast sanity check, stress-test WFGY impact | Internal fixed-seed set | ≈ 60 s | Copy-paste prompt |
| B. 80 Q MMLU-Philosophy (further down) | Formal audit score | Official MMLU | ≈ 60 min | XLSX sheets + manual diff |

A. 🔍 Quick Simulation — reasoning scores by setup (≈ 60 s)

One-shot simulation using GPT-5 + WFGY PDF.
This run does not use the actual 80 MMLU questions; it mirrors the same axes:
Reasoning · Knowledge Recall · Hallucination Resistance · Multi-Step Logic.

Use GPT-5 to benchmark GPT-4, GPT-5, GPT-4 + WFGY, and GPT-5 + WFGY  
on the same test set with fixed seeds.  
Score: Reasoning, Knowledge Recall, Hallucination Resistance, Multi-Step Logic, Overall (0–100).  
Output a Markdown table and a Markdown-ready bar chart for Overall.

Reminder: For questions involving self-reference, paradoxes, or constraint logic, it's critical to ensure the model has access to the symbolic PDF.
Without it, the model may generate answers that sound fluent but collapse semantically — classic hallucinations masked as reasoning.
Always verify that the AI has properly loaded the tool before testing. No tool, no defense.
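
The Quick Simulation prompt asks for a Markdown table plus a Markdown-ready bar chart of the Overall score. As a rough sketch of what such output could look like when rendered locally, the Python below formats per-setup scores into both. The aggregation rule (Overall as the unweighted mean of the four axes), the names `setup_a` and `setup_b`, and all numbers are placeholders assumed for illustration; they are not benchmark results and not part of WFGY.

```python
AXES = ["Reasoning", "Knowledge Recall", "Hallucination Resistance", "Multi-Step Logic"]


def overall(scores):
    """Assumed aggregation: unweighted mean of the four axis scores (0-100)."""
    return sum(scores[axis] for axis in AXES) / len(AXES)


def markdown_table(results):
    """Render one row per setup, with the four axes and the Overall score."""
    header = "| Setup | " + " | ".join(AXES) + " | Overall |"
    divider = "|" + " --- |" * (len(AXES) + 2)
    rows = [
        f"| {name} | " + " | ".join(str(s[axis]) for axis in AXES) + f" | {overall(s):.1f} |"
        for name, s in results.items()
    ]
    return "\n".join([header, divider] + rows)


def bar_chart(results, width=20):
    """Render a text bar per setup, scaled so that 100 fills the full width."""
    lines = []
    for name, s in results.items():
        filled = round(overall(s) / 100 * width)
        lines.append(f"{name:<10} {'█' * filled}{'░' * (width - filled)} {overall(s):.1f}")
    return "\n".join(lines)


# Placeholder numbers for illustration only, not benchmark results.
results = {
    "setup_a": {"Reasoning": 60, "Knowledge Recall": 70,
                "Hallucination Resistance": 50, "Multi-Step Logic": 60},
    "setup_b": {"Reasoning": 80, "Knowledge Recall": 90,
                "Hallucination Resistance": 70, "Multi-Step Logic": 80},
}

print(markdown_table(results))
print()
print(bar_chart(results))
```

Swapping in the scores returned by the model gives a table and chart that can be pasted straight into a Markdown file.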


B. 🧪 Full 80 Q MMLU-Philosophy Benchmark (≈ 60 min)

1. Replicate it yourself

  1. Get the dataset: the MMLU philosophy subset, for example via the EleutherAI lm-evaluation-harness.
  2. Grab our answer sheets (.xlsx):
  3. Run the 80 questions on any model (no retries) → fill your own .xlsx.
  4. Manual diff: open two sheets side-by-side (or use any spreadsheet “compare” plug-in) to count mismatches.

🔓 No tricks — every answer traceable, every miss explainable.
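
The manual diff in step 4 can also be scripted. The sketch below is one possible approach, not the project's own tooling: it assumes each .xlsx sheet has been exported to CSV with two hypothetical columns, `id` and `answer`, and it counts the questions where the two sheets disagree. The sample rows are made up to show the behaviour.

```python
import csv
import io


def load_answers(csv_text, id_col="id", answer_col="answer"):
    """Map question id -> answer letter (upper-cased) from CSV text.

    The column names are assumptions; adjust them to match your export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col].strip(): row[answer_col].strip().upper() for row in reader}


def diff_sheets(reference, candidate):
    """Return the sorted question ids whose answer differs or is missing."""
    return sorted(
        qid for qid, expected in reference.items()
        if candidate.get(qid) != expected
    )


# Made-up rows for illustration only, not benchmark data.
reference_csv = "id,answer\nq1,A\nq2,C\nq3,B\n"
candidate_csv = "id,answer\nq1,A\nq2,D\nq3,b\n"

mismatches = diff_sheets(load_answers(reference_csv), load_answers(candidate_csv))
print(f"{len(mismatches)} mismatch(es): {mismatches}")
```

For real sheets, read the exported files with `open(path, newline="")` and pass the file contents to `load_answers`.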

2. Result table

| Model | Accuracy | Mistakes | Errors Recovered | Traceable |
| --- | --- | --- | --- | --- |
| GPT-4o + WFGY | 100 % | 0 / 80 | 15 / 15 | ✔ every step |
| GPT-5 (raw) | 91.25 % | 7 / 80 | | ✘ none |
| GPT-4o (raw) | 81.25 % | 15 / 80 | | ✘ none |

Rule of thumb: stronger host → bigger WFGY lift. GPT-6? Same files, same rules.
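
The accuracy figures in the result table follow directly from the mistake counts over 80 questions. A minimal check of that arithmetic, using only the numbers reported in the table (the helper name `accuracy_percent` is ours):

```python
TOTAL_QUESTIONS = 80


def accuracy_percent(mistakes, total=TOTAL_QUESTIONS):
    """Accuracy in percent given the number of wrong answers."""
    return 100.0 * (total - mistakes) / total


# Mistake counts as reported in the result table.
reported = [("GPT-4o + WFGY", 0), ("GPT-5 (raw)", 7), ("GPT-4o (raw)", 15)]

for model, mistakes in reported:
    print(f"{model}: {accuracy_percent(mistakes):.2f} %")
```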

3. Why philosophy?

  1. Most fragile domain — long-range abstraction.
  2. Tests reasoning, not trivia.
  3. Downstream proxy — pass philosophy, survive policy & ethics.

💬 TL;DR

WFGY isn't a model; it's a math-based sanity layer you can slap onto any LLM.
Use GPT-4o, GPT-5, or whatever's next: WFGY is your reasoning booster.

Start with the WFGY PDF or GitHub and replicate.


📌 Introduction

WFGY is a symbiotic reasoning layer: stronger host ⇒ larger lift.
Here we attach it to GPT-4o and GPT-5 via either the PDF pipeline or the TXT OS interface.
No fine-tuning, no prompt voodoo: only symbolic constraints and traceable logic.


📌 Benchmark result details

Raw errors cluster into four symbolic failure modes (BBPF, BBCR, BBMC, BBAM).
WFGY applies ΔS control, entropy modulation, and path-symmetry enforcement.
Full taxonomy in the paper.


📌 Download the evidence


🧭 Explore More

| Module | Description | Link |
| --- | --- | --- |
| WFGY Core | Standalone semantic reasoning engine for any LLM | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 Early Stargazers: Hall of Fame
Star the repo → help us hit 10 k by 2025-09-01 to unlock Engine 2.0!

WFGY   TXT OS   Blah   Blot   Bloc   Blur   Blow