Eval Observability — Metrics and Logging

🧭 Quick Return to Map

You are in a sub-page of Eval_Observability.
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A baseline schema and checklist for logging semantic metrics (ΔS, λ, coverage, E_resonance) during live runs.
Use this page to enforce consistent telemetry so that offline eval and online observability align.


Why log metrics?

  • Drift detection: A high ΔS or a divergent λ state flags retrieval or logic errors early.
  • Comparability: Same schema across providers, stores, and orchestration layers.
  • Debug loops: Logged traces accelerate reproduction and diagnosis.
  • Regression guards: Simple thresholds protect pipelines before release.

Core metrics to capture

| Metric | Definition | Thresholds |
|---|---|---|
| ΔS(question, retrieved) | Semantic distance between query and retrieved snippet | Stable ≤ 0.45, Transitional 0.45 to 0.60, Risk ≥ 0.60 |
| Coverage | Fraction of gold/target section retrieved | ≥ 0.70 |
| λ_observe | State of reasoning flow (→ convergent, ← divergent, <> transitional, × collapse) | Must stay convergent across 3 paraphrases |
| E_resonance | Long-window entropy of reasoning steps | Should remain flat without spikes |
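
The thresholds above can be expressed as a small regression gate. The following is a minimal sketch, not part of WFGY itself: the function names `classify_delta_s`, `lambda_is_stable`, and `passes_gate` are hypothetical, ΔS is assumed to be a plain float, and the boundary value 0.45 (listed under both Stable and Transitional) is assigned to Stable by assumption. Treating only the Risk zone as a gate failure is also an assumption.

```python
def classify_delta_s(delta_s: float) -> str:
    """Map a ΔS value to the zone names used in the thresholds table."""
    if delta_s >= 0.60:
        return "risk"
    if delta_s <= 0.45:  # assumption: the shared boundary counts as stable
        return "stable"
    return "transitional"


def lambda_is_stable(states: list[str]) -> bool:
    """True when at least three paraphrases all reported the convergent state."""
    return len(states) >= 3 and all(s == "→" for s in states)


def passes_gate(delta_s: float, coverage: float, states: list[str]) -> bool:
    """Combine the ΔS, coverage, and λ thresholds into one check."""
    return (
        classify_delta_s(delta_s) != "risk"  # assumption: only Risk fails the gate
        and coverage >= 0.70
        and lambda_is_stable(states)
    )
```

A gate like this could run once per eval batch, with the three λ states taken from the paraphrase runs.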

Logging schema (JSON example)

{
  "trace_id": "uuid",
  "timestamp": "2025-08-29T12:34:56Z",
  "question": "...",
  "retrieved": [
    {
      "snippet_id": "s1",
      "section": "intro",
      "source": "docA",
      "offsets": [120, 160],
      "ΔS": 0.42
    }
  ],
  "ΔS_overall": 0.44,
  "coverage": 0.72,
  "λ_state": "→",
  "E_resonance": 0.03,
  "index_hash": "abc123",
  "dedupe_key": "sha256(...)" 
}
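
One way to emit records in this shape is sketched below. It only uses the field names from the JSON example; the helper names `new_record` and `to_json_line` are hypothetical, and the metric values are assumed to be computed elsewhere and passed in. Serializing with `ensure_ascii=False` keeps keys such as `ΔS_overall` and `λ_state` readable in the log instead of escaping them.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED_KEYS = {
    "trace_id", "timestamp", "question", "retrieved",
    "ΔS_overall", "coverage", "λ_state", "E_resonance",
    "index_hash", "dedupe_key",
}
VALID_LAMBDA = {"→", "←", "<>", "×"}  # the four states listed for λ_observe


def new_record(question, retrieved, delta_s_overall, coverage,
               lambda_state, e_resonance, index_hash, dedupe_key):
    """Assemble one trace record using the field names from the schema."""
    if lambda_state not in VALID_LAMBDA:
        raise ValueError(f"unknown λ_state: {lambda_state!r}")
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "question": question,
        "retrieved": retrieved,
        "ΔS_overall": delta_s_overall,
        "coverage": coverage,
        "λ_state": lambda_state,
        "E_resonance": e_resonance,
        "index_hash": index_hash,
        "dedupe_key": dedupe_key,
    }


def to_json_line(record):
    """Serialize a record to one JSON line, rejecting incomplete records."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return json.dumps(record, ensure_ascii=False)
```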

Quick probes

  • ΔS probe: Recompute ΔS on each retrieval call. Alert if ≥ 0.60.
  • λ probe: Run three paraphrases per eval batch and log the λ_state sequence.
  • Coverage probe: Compare retrieved sections against gold or expected anchors.
  • E_resonance probe: Smooth entropy over 50 to 100 steps, alert if spike > 2× baseline.
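
The coverage and E_resonance probes above could look roughly like the sketch below. It rests on two assumptions that the page does not spell out: coverage is computed at section level as the share of gold sections that appear among the retrieved ones, and the E_resonance baseline is the mean of the previous full window. The names `section_coverage` and `EntropySpikeProbe` are hypothetical.

```python
from collections import deque


def section_coverage(retrieved_sections, gold_sections):
    """Fraction of gold sections that were retrieved (section-level assumption)."""
    gold = set(gold_sections)
    if not gold:
        raise ValueError("gold_sections must not be empty")
    return len(gold & set(retrieved_sections)) / len(gold)


class EntropySpikeProbe:
    """Flag an E_resonance value that exceeds factor × the rolling baseline."""

    def __init__(self, window=50, factor=2.0):
        self.values = deque(maxlen=window)
        self.factor = factor

    def update(self, value):
        """Record a value; return True if it spikes above the baseline."""
        # No baseline is available until the window has filled once.
        full = len(self.values) == self.values.maxlen
        baseline = sum(self.values) / len(self.values) if full else None
        self.values.append(value)
        return baseline is not None and value > self.factor * baseline
```

The probe stays silent until the window has filled once, so the first alerts can only fire after `window` steps.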

Storage tips

  • Write logs to an append-only store (e.g., KV or time-series DB).
  • Deduplicate with dedupe_key = sha256(question + index_hash + snippet_id).
  • Keep a rolling window of 30 to 90 days for regression analysis.
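
As a rough illustration of the dedupe rule, the sketch below hashes the plain concatenation given in the formula and appends records to a local JSON Lines file. The JSONL file stands in for whatever append-only store is actually used, and the in-memory `seen` set is a simplification; both are assumptions, as are the helper names `dedupe_key` and `append_unique`.

```python
import hashlib
import json


def dedupe_key(question, index_hash, snippet_id):
    """sha256 over question + index_hash + snippet_id, as a hex digest."""
    payload = (question + index_hash + snippet_id).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_unique(path, record, seen):
    """Append a record as one JSON line unless its dedupe_key was already seen."""
    key = record["dedupe_key"]
    if key in seen:
        return False
    # Mode "a" only ever adds to the end of the file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    seen.add(key)
    return True
```

Plain concatenation follows the formula as written. If field boundaries can be ambiguous in practice, joining the parts with a separator is a common variation.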
