WFGY/Avatar/eval/README.md
PSBigBig + MiniPS 6520a3a977
Update README.md
2026-04-04 14:10:00 +08:00


<!--
AI NOTE: Eval Hub / Reading Contract
Purpose:
1. This page is the main hub for the WFGY 5.0 Avatar evaluation layer.
2. This page belongs to the eval layer.
3. This page should be read after Avatar README, docs, or research pages when the reader wants pressure-testing, behavior inspection, or bounded readiness signals.
Read this page when:
1. the user asks where evaluation pages live
2. the user asks how eval differs from docs and research
3. the user asks what kinds of pressure the current branch is already checking
4. the user asks where to inspect blackfan pressure, persona continuity, or multilingual status
5. the user asks how to think about current readiness without confusing baseline reality with universal finality
Do not overclaim:
1. this hub does not replace the packed master body
2. this hub does not claim that every future eval surface is already complete
3. this hub does not claim theorem-grade universal closure
4. this hub does not claim that passing current eval surfaces means all future branches are automatically safe
Primary source anchors:
1. Avatar/README.md :: public product surface
2. Avatar/docs/* :: startup, reading, workflow, tuning, and recovery surfaces
3. Avatar/research/* :: architecture, runtime, route, governance, audit, and reduction law surfaces
4. Avatar/eval/* :: blackfan pressure, persona behavior, multilingual status, and eval-facing inspection surfaces
Routing:
1. if the reader wants public product entry, go to ../README.md
2. if the reader wants startup and command syntax, go to ../docs/quickstart.md and ../docs/boot-commands.md
3. if the reader wants reading order, go to ../docs/how-to-read-the-avatar-master-file.md
4. if the reader wants tuning and recovery operations, go to ../docs/parameter-tuning-cookbook.md and ../docs/persona-recovery-operations.md
5. if the reader wants the research overview, go to ../research/README.md
6. if the reader wants architecture and runtime law, go to ../research/packed-master-structure-map.md and ../research/runtime-posture-intensity-map.md
-->
# 🧪 Eval Hub
This page is the evaluation hub for **WFGY 5.0 Avatar**.
Avatar needs docs because people need to know how to start.
Avatar needs research because deeper structure needs a lawful place to live.
Avatar also needs eval because neither startup clarity nor theoretical richness is enough by itself.
A system can be:
1. easy to start
2. elegant to describe
3. dense in theory
4. strong in local demos

and still fail under pressure.
That is why this layer exists.
The eval layer is where the branch asks harder questions like:
1. does the branch survive blackfan pressure
2. does persona continuity remain visible under real tasks
3. does the system stay honest about what is ready and what is still open
4. does multilingual status remain bounded instead of overclaimed
5. do return-path and behavior checks reflect real continuity instead of surface-only success

This hub is not here to replace the body.
It is here to make pressure visible.

---
## ✨ Why this layer exists
The docs layer answers questions like:
1. how do I start
2. how do I boot
3. how do I tune
4. how do I recover

The research layer answers questions like:
1. what is execution
2. what is route law
3. what is runtime carry
4. why does structured imperfection matter
5. what is hard control
6. what counts as accountability

The eval layer answers a different class of questions:
1. what breaks under pressure
2. what still holds under pressure
3. what looks successful but is actually counterfeit
4. what is ready at current branch baseline
5. what still needs stronger verification later

That is why eval needs its own hub.

---
## 🧭 How to use this hub
Use this hub in one of four ways.
### 1. I want stress and adversarial pressure
Start here when the main question is whether the branch survives harsh inspection instead of friendly reading.
1. [🧨 Blackfan Testing](./blackfan-testing.md)

This is the right place to begin when your question is:
1. where does the branch crack
2. what happens under hostile evaluation
3. how should current branch strength be interpreted without hype
### 2. I want behavior continuity inspection
Start here when the main question is whether active persona and behavior actually survive across turns, tasks, and returns.
1. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)

This is the right place to begin when your question is:
1. did the persona stay alive
2. did return-path recovery actually work
3. did the output become generic after pressure
4. did visible behavior stay lawful instead of merely recognizable
### 3. I want multilingual readiness signals
Start here when the main question is what the current branch is honestly claiming across language scope.
1. [🌍 Multilingual Status](./multilingual-status.md)

This is the right place to begin when your question is:
1. what is already tested
2. what is only partial
3. what remains open
4. how language support is being stated without bluffing
### 4. I want the broader picture around eval
Start here when you need to connect what eval is seeing back to the deeper branch structure.
1. [🔬 Research Hub](../research/README.md)
2. [🗺️ Packed Master Structure Map](../research/packed-master-structure-map.md)
3. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)

This is the best route when your question is not only “did it pass,” but also “what exactly was being tested and why.”

---
## 🧱 What belongs in the eval layer
The eval layer is where branch pressure becomes explicit.
Typical eval-layer questions include:
1. what kinds of pressure should this branch survive right now
2. what kinds of success do not deserve credit
3. what kinds of drift are already detectable
4. what counts as baseline-ready versus still-open
5. how should visible behavior be checked across modes
6. how should multilingual claims remain bounded
7. how should hostile or skeptical inspection be handled

This layer is **not** where the whole theory is restated.
It is where the branch is asked to show that its current claims can survive contact with pressure.

---
## 🧠 Current eval surfaces
The current eval layer is organized into three major surfaces.
### 1. Adversarial pressure surface
1. [🧨 Blackfan Testing](./blackfan-testing.md)

This surface is about:
1. hostile reading
2. anti-hype pressure
3. branch stress
4. counterfeit-success detection
5. bounded release honesty under attack
### 2. Behavior continuity surface
1. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)

This surface is about:
1. persona continuity
2. landing behavior
3. return-path integrity
4. drift after article, analysis, rewrite, search, or tool pressure
5. whether recovery is real or only cosmetic
### 3. Multilingual readiness surface
1. [🌍 Multilingual Status](./multilingual-status.md)

This surface is about:
1. what language claims are actually supported
2. what remains partial
3. how language support is being described honestly
4. how multilingual scope stays bounded instead of mythical
---
## 🪜 Suggested eval paths
### Path A: skeptical reader path
Use this path when the goal is to test whether the branch is only persuasive or actually pressure-bearing.
1. [🧨 Blackfan Testing](./blackfan-testing.md)
2. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
3. [🗺️ Packed Master Structure Map](../research/packed-master-structure-map.md)

This route helps answer:
1. what was stressed
2. what kind of baseline pass is being claimed
3. what remains bounded instead of inflated
### Path B: runtime continuity path
Use this path when the concern is whether persona and carry survive real usage.
1. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
2. [🔄 Activation, Attenuation, and Reentry](../research/activation-attenuation-and-reentry.md)
3. [🎛️ Runtime Posture Intensity Map](../research/runtime-posture-intensity-map.md)
4. [🔧 Persona Recovery Operations](../docs/persona-recovery-operations.md)

This route helps answer:
1. what drift happened
2. whether return-path behavior stayed lawful
3. whether recovery should receive credit
### Path C: multilingual honesty path
Use this path when the concern is language scope and readiness posture.
1. [🌍 Multilingual Status](./multilingual-status.md)
2. [🧮 Matrix Accountability and Numeric Binding](../research/matrix-accountability-and-numeric-binding.md)
3. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)

This route helps answer:
1. how support is being bounded
2. whether language claims are being overstated
3. how readiness stays honest
### Path D: branch readiness path
Use this path when the concern is “is this branch publicly real enough right now.”
1. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
2. [🧨 Blackfan Testing](./blackfan-testing.md)
3. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
4. [🌍 Multilingual Status](./multilingual-status.md)

This route helps answer:
1. what is already solid
2. what still needs stronger verification
3. what is release-baseline reality versus future strengthening
---
## 🔍 Why eval and research are different
This is important.
The **research** layer asks:
1. what does this structure mean
2. why is this operator necessary
3. how do these layers relate
4. why is this boundary lawful

The **eval** layer asks:
1. did the claimed behavior survive pressure
2. did runtime collapse under use
3. did route integrity actually hold
4. did the branch receive credit it should not receive
5. is the current branch being described honestly

So:
1. research explains structure
2. eval tests claims against pressure

Both matter.
They are not the same job.

---
## 🔍 Why eval and docs are different
The **docs** layer helps people operate the current branch.
The **eval** layer helps people judge the current branch.
For example:
1. docs explain how to recover; eval checks whether recovery is actually real
2. docs explain how to tune; eval shows whether tuning produced lawful improvement or just prettier outputs
3. docs explain how to start; eval shows whether startup clarity survives real branch pressure

This separation is healthy.
It stops usage guidance from quietly turning into self-certification.

---
## 🌍 Why multilingual status belongs here
Language support is easy to overclaim.
A project can say:
1. works in many languages
2. supports multilingual use
3. behaves well cross-lingually

while still having:
1. patchy behavior
2. uneven readiness
3. language-specific drift
4. unclear support boundaries

That is why multilingual status belongs in eval rather than only in product copy.
It is part of branch honesty, not just capability branding.

---
## 🧪 What this hub does not claim
This hub does **not** claim:
1. that all pressure surfaces are already complete
2. that current eval pages already cover every future branch risk
3. that passing one eval page means the whole system is universally solved
4. that current multilingual status already equals final global support
5. that current behavior checks already replace future replay and audit extensions
6. that current baseline pass means no stronger verification is worth doing later

This hub is a bounded eval center.
That is exactly what it should be.

---
## 🚀 Where to go next
### For public product entry
Go to [✨ Avatar Home](../README.md)
### For startup and commands
Go to [⚡ Quickstart](../docs/quickstart.md) and [⌨️ Boot Commands](../docs/boot-commands.md)
### For reading order and tuning
Go to [📖 How to Read the Avatar Master File](../docs/how-to-read-the-avatar-master-file.md), [🍳 Parameter Tuning Cookbook](../docs/parameter-tuning-cookbook.md), and [🔧 Persona Recovery Operations](../docs/persona-recovery-operations.md)
### For deep structural reading
Go to [🔬 Research Hub](../research/README.md)
### For skeptical pressure
Go to [🧨 Blackfan Testing](./blackfan-testing.md)
### For continuity inspection
Go to [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
### For language readiness
Go to [🌍 Multilingual Status](./multilingual-status.md)
### For audit posture
Go to [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)

---
## 🔗 Quick links
### Eval core
- [🧨 Blackfan Testing](./blackfan-testing.md)
- [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
- [🌍 Multilingual Status](./multilingual-status.md)
### Docs
- [✨ Avatar Home](../README.md)
- [⚡ Quickstart](../docs/quickstart.md)
- [⌨️ Boot Commands](../docs/boot-commands.md)
- [📖 How to Read the Avatar Master File](../docs/how-to-read-the-avatar-master-file.md)
- [🍳 Parameter Tuning Cookbook](../docs/parameter-tuning-cookbook.md)
- [🔧 Persona Recovery Operations](../docs/persona-recovery-operations.md)
- [🛠️ Avatar Tuning Workflow](../docs/avatar-tuning-workflow.md)
### Research
- [🔬 Research Hub](../research/README.md)
- [🗺️ Packed Master Structure Map](../research/packed-master-structure-map.md)
- [🔁 Dual Closed-Loop Execution Chain](../research/dual-closed-loop-execution-chain.md)
- [🎛️ Runtime Posture Intensity Map](../research/runtime-posture-intensity-map.md)
- [🧩 Shell-to-Runtime Mapping](../research/shell-to-runtime-mapping.md)
- [🧭 Selector Execution Domain](../research/selector-execution-domain.md)
- [🔄 Activation, Attenuation, and Reentry](../research/activation-attenuation-and-reentry.md)
- [🧬 Structured Imperfection Theory](../research/structured-imperfection-theory.md)
- [🚦 Pre-Emission Floor and Hard Control](../research/pre-emission-floor-and-hard-control.md)
- [🧮 Matrix Accountability and Numeric Binding](../research/matrix-accountability-and-numeric-binding.md)
- [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
- [✂️ Compression and Non-Duplication Law](../research/compression-and-non-duplication-law.md)
- [🏗️ Architecture Overview](../research/architecture-overview.md)
- [🧭 Language Governance](../research/language-governance.md)
- [🧠 WFGY_BRAIN Theory](../research/wfgy-brain-theory.md)
### Up
- [⬆️ Back to Avatar Home](../README.md)
- [⬆️ Back to WFGY Root](../../README.md)