WFGY/archive/benchmarks_archive/semantic-drift-demo
2026-03-04 06:53:04 +00:00
..
data archive: move benchmarks folder to archive/benchmarks_archive 2026-03-04 04:52:53 +00:00
images archive: move benchmarks folder to archive/benchmarks_archive 2026-03-04 04:52:53 +00:00
scripts archive: move benchmarks folder to archive/benchmarks_archive 2026-03-04 04:52:53 +00:00
README.md docs: replace Explore More footer with unified navigation block 2026-03-04 06:53:04 +00:00
requirements.txt archive: move benchmarks folder to archive/benchmarks_archive 2026-03-04 04:52:53 +00:00

SemanticDrift Demo 📐

A minimal, fullyreproducible experiment to prove how the WFGY framework cuts semantic drift in multistep reasoning.

This demo compares plain LLM answers (Baseline) to WFGY✚DrunkMode on 30 carefullycrafted prompts.
The prompts come from the WFGY 1.0 All Principles Return to One public PDF, and specifically target longchain reasoning weaknesses documented in Section 3 of that paper.

Unlike generic QA tests, this benchmark does not evaluate factual correctness or syntax.
Instead, it tests semantic integrity — whether the model preserves meaning over multi-hop chains.
It was derived directly from Section 3 of the WFGY1.0 paper and quantifies how ΔS and λ_observe
reflect a models ability to avoid drift as reasoning unfolds.


1. Why run this experiment? 🎯

Large language models often look correct but secretly drift—mixing facts, skipping steps, or hallucinating logic as the chain gets longer.
WFGY introduces four closedloop modules (BBMC / BBPF / BBCR / BBAM) to selfheal those drifts in real time.

This repo lets anyone:

  • Quantify drift with two simple metrics (ΔS, λ_observe).
  • Visualise the gap instantly (two PNG charts).
  • Swap in any model (or any guard framework) and reproduce the numbers in <1min.

2. Metrics 📊

Metric Meaning Good?
ΔS Prompttoanswer semantic distance (0 = perfect) lower
λ_observe Percentage of answers with ΔS<0.4 (passrate) higher


* **Left chart** average ΔS: green (WFGY) bar is lower ⇒ answers wander off topic less. * **Right chart** λ_observe passrate: green hits 100% ⇒ WFGY beats baseline on **every** prompt.

3. Quick Start 

3line local run

pip install -r requirements.txt          # sklearn · pandas · matplotlib · statsmodels
python scripts/run_eval.py               # → data/metrics.csv
python scripts/plot_results.py           # → images/ refreshed charts

Oneclick Colab

  1. Open https://colab.research.google.com/
  2. !git clone <YOURREPOURL>
  3. Run the same three lines above.

4. Swap in your own model 🔄

  1. Put your outputs in

    • data/baseline_answers.txt  (WFGY OFF)
    • data/wfgydrunk_answers.txt (WFGY ON) *✧ One answer block per prompt, separated by a blank line.
  2. Rerun the two scripts charts update automatically.

  3. Interpret: green lower ΔS & higher λ=your guard beats raw model; if not, drift remains.

(Optional) Human κ agreement

# create data/error_annotations.csv  with columns: Q#,rater1,rater2,rater3 (ok / drift)
python scripts/compute_kappa.py         # prints Fleiss κ

5. Folder layout 🗂️

semantic-drift-demo/
├─ data/
│   ├─ test_prompts.json      # 30 prompts (from WFGY PDF, Section 3)
│   ├─ baseline_answers.txt   # answers with WFGY OFF
│   ├─ wfgydrunk_answers.txt  # answers with WFGY ON
│   └─ metrics.csv            # autogenerated
├─ scripts/
│   ├─ run_eval.py            # computes ΔS & λ_observe
│   ├─ plot_results.py        # draws the two PNG charts
│   └─ compute_kappa.py       # optional Fleiss κ
├─ images/
│   ├─ drift_comparison.png   # ΔS chart
│   └─ lambda_pass.png        # λ chart
└─ requirements.txt

6. How the code works 🔍

  1. TFIDF ΔS

    • We embed each prompt and answer with TFIDF;  cosine = ΔS.
    • Swap to sentencetransformers in run_eval.py for higherfidelity embeddings.
  2. λ_observe

    • If ΔS < threshold (default 0.4) → pass for that answer.
    • λ=(# passes)/30.
  3. plot_results.py

    • Saves two charts in images/ (scaled to 420px width for GitHub dark mode).
  4. compute_kappa.py

    • Reads three human labels per answer and outputs Fleiss κ (agreement score).

7. Background: WFGY in one paragraph 📚

WFGY 1.0 (paper DOI 10.5281/zenodo.15630969) unifies four modules:

Module Function
BBMC Measures semantic residue (meaning gap) and minimises it.
BBPF Perturbs reasoning paths, encouraging convergent refinement.
BBCR Detects collapse, resets, and rebirths the chain midinference.
BBAM Dampens noisy attention spikes, boosting crossmodal alignment.

The paper reports +22% semantic accuracy and 3.6×stability. This repo isolates the semanticdrift aspect so anyone can reproduce a slice of those gains without full training.


8. FAQ 🙋‍♂️

Question Answer
Why 30 prompts? Enough to visualise trends; small for fast Colab runs. Extend easily by appending prompts.
Can I use GPT4/Claude outputs? Yes—paste them into the two answer files.
Where is the prompt list from? Adapted from Section 3 “Stress Tests” of the WFGY 1.0 PDF.
Charts look blank? Ensure images are committed; GitHub caches aggressively—hardrefresh if needed.
ΔS too close between models? Switch to sentencetransformer embeddings (use_embed=True in run_eval.py) for finer deltas.

9. License 📜

Code released under MIT; prompt set under CCBY 4.0 (credit “PS BigBig, WFGY 1.0 PDF”). See LICENSE for details.


Clone, run, swap, publish—prove your model drifts less. For questions or pullrequests, open an issue or ping @PSBigBig. Good luck & happy benchmarking! 🚂💨


Explore More

Layer Page What its for
Proof WFGY Recognition Map External citations, integrations, and ecosystem proof
Engine WFGY 1.0 Original PDF based tension engine
Engine WFGY 2.0 Production tension kernel and math engine for RAG and agents
Engine WFGY 3.0 TXT based Singularity tension engine, 131 S class set
Map Problem Map 1.0 Flagship 16 problem RAG failure checklist and fix map
Map Problem Map 2.0 RAG focused recovery pipeline
Map Problem Map 3.0 Global Debug Card, image as a debug protocol layer
Map Semantic Clinic Symptom to family to exact fix
Map Grandmas Clinic Plain language stories mapped to Problem Map 1.0
Onboarding Starter Village Guided tour for newcomers
App TXT OS TXT semantic OS, fast boot
App Blah Blah Blah Abstract and paradox Q and A built on TXT OS
App Blur Blur Blur Text to image with semantic control
App Blow Blow Blow Reasoning game engine and memory demo

If this repository helped, starring it improves discovery so more builders can find the docs and tools. GitHub Repo stars