Update eval_benchmarking.md

@@ -16,6 +16,13 @@
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
</details>
> **Evaluation disclaimer (benchmarking)**
> This document discusses benchmarking strategies for AI systems and RAG pipelines.
> The examples, scores, and comparison plots are scenario-specific: they depend on the exact models, prompts, datasets, and hardware used.
> They are intended as engineering guidance for local decision-making, not as an official leaderboard or proof that one model is better in every setting.
> When you publish results based on these ideas, clearly state the scope and limitations of your benchmark and avoid overclaiming what the numbers show.
---
This page defines a clean, repeatable way to benchmark your pipeline and prove that a fix actually improved behavior. It uses the same WFGY instruments as everywhere else: ΔS for semantic stress, λ\_observe for stability, and E\_resonance for coherence over long windows.
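
Below is a minimal sketch of what such a repeatable harness could look like. It assumes ΔS is computed as 1 minus the cosine similarity between the question and the retrieved context; the `embed` function is a toy stand-in for a real embedding model, and the 0.60 failure threshold is illustrative, not taken from this page.

```python
"""Minimal ΔS benchmark harness (sketch, not the official WFGY tooling).

Assumptions: ΔS = 1 - cosine(question, retrieved context); `embed` is a
toy deterministic stand-in; the 0.60 threshold is illustrative.
"""
import zlib
import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy embedding: deterministic hashed character-trigram counts,
    # so repeated runs produce identical vectors. Swap in your real
    # embedding model in practice.
    v = np.zeros(dim)
    for i in range(max(len(text) - 2, 1)):
        v[zlib.crc32(text[i:i + 3].encode()) % dim] += 1.0
    return v


def delta_s(question: str, context: str) -> float:
    # ΔS as semantic stress: 1 - cos(question, context).
    q, c = embed(question), embed(context)
    denom = float(np.linalg.norm(q) * np.linalg.norm(c))
    return 1.0 - float(np.dot(q, c)) / denom if denom else 1.0


def run_benchmark(cases, retrieve, threshold=0.60):
    # A fixed case set plus a fixed threshold gives a repeatable
    # pass/fail gate: run it before and after a fix and compare.
    scores = [delta_s(q, retrieve(q)) for q in cases]
    return {
        "mean_delta_s": float(np.mean(scores)),
        "failures": sum(s >= threshold for s in scores),
        "total": len(cases),
    }
```

Run it once on your current pipeline and once after the fix, over the same case set, and compare the two summaries; as the disclaimer above says, only those local deltas are meaningful, not the absolute numbers.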