Update eval_harness.md

PSBigBig × MiniPS 2026-02-26 15:32:45 +08:00 committed by GitHub
parent bcc48b4572
commit 9fc01c3de9

@ -16,6 +16,13 @@
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
</details>
> **Evaluation disclaimer (eval harness)**
> This page sketches a harness for running structured evaluations on AI pipelines.
> Any metrics or labels that pass through such a harness remain heuristic outputs of models, scripts and annotators.
> They do not become scientific proof just because they flow through this structure.
> Use the harness to compare variants inside a controlled scenario, and avoid presenting those numbers as universal claims about model quality beyond that scenario.
---
A minimal yet strict harness for running repeatable evaluations on RAG and agent pipelines. It fixes the two usual failure modes: non-reproducible runs, and noisy metrics that cannot explain drift. Everything here maps to WFGY pages with measurable targets.
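To make those two failure modes concrete, here is a minimal sketch of what such a harness could look like: the seed is pinned so runs are reproducible, and the config is hashed into every run record so metric drift can be traced to a concrete config change. The function name `run_eval`, the record fields, and the accuracy metric are illustrative assumptions, not part of the WFGY pages.

```python
import hashlib
import json
import random

def run_eval(cases, predict, config, seed=0):
    """Run a repeatable evaluation over a fixed list of cases.

    cases:   list of {"id", "question", "expected"} dicts (illustrative schema)
    predict: callable mapping a question string to an answer string
    config:  JSON-serializable pipeline config, hashed into the run record
    """
    random.seed(seed)  # pin any sampling the pipeline does during the run
    # Hash a canonical serialization of the config so two runs are
    # comparable only when their config hashes match.
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    records = []
    for case in cases:
        answer = predict(case["question"])
        records.append({"id": case["id"],
                        "correct": answer == case["expected"]})
    accuracy = sum(r["correct"] for r in records) / len(records)
    return {"config_hash": config_hash, "seed": seed,
            "accuracy": accuracy, "records": records}
```

With per-case records plus the seed and config hash stored together, a metric shift between two runs is either attributable (different hash or seed) or a genuine pipeline regression, which is the property the paragraph above asks for.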