WFGY/Avatar/eval/README.md
PSBigBig + MiniPS 6c035c28c1
Update README.md
2026-04-04 15:14:48 +08:00

<!--
AI_NOTE_START
Document role:
This page is the main hub for the WFGY 5.0 Avatar eval layer.
Primary mission:
1. Present Eval as the pressure and inspection layer of Avatar.
2. Make clear that Docs show operation, Research shows structure, and Eval shows whether the branch survives pressure.
3. Route readers toward blackfan stress, persona continuity inspection, and multilingual readiness signals.
4. Prevent friendly product language or elegant theory from substituting for branch pressure.
5. Preserve bounded readiness honesty without inflating current pass signals into universal finality.
Reader intent routing:
1. If the reader wants hostile or skeptical stress, route to Blackfan Testing.
2. If the reader wants persona continuity and return-path inspection, route to Persona Behavior Checks.
3. If the reader wants multilingual readiness posture, route to Multilingual Status.
4. If the reader wants the deeper explanation of what Eval is testing, route to Research Hub and Blackfan Audit Baseline.
5. If the reader wants operational help rather than inspection, route back to Docs.
Reading priority:
1. Read this hub as an inspection map, not as a theory page.
2. Read Blackfan Testing first when the main concern is pressure, hype resistance, or skeptical inspection.
3. Read Persona Behavior Checks first when the main concern is drift, return-path continuity, or recovery credit.
4. Read Multilingual Status first when the main concern is support scope and language honesty.
5. Read Research pages when the reader needs to understand why a failure matters structurally.
Allowed use:
1. Pressure-layer orientation
2. Behavior inspection routing
3. Bounded readiness interpretation
4. Branch honesty explanation
5. Cross-linking from Docs and Research into Eval surfaces
Do not do:
1. Do not treat this hub as the packed master body.
2. Do not treat current Eval surfaces as proof that every future branch is safe.
3. Do not turn Eval into product hype.
4. Do not use one successful test page as proof of universal closure.
5. Do not confuse current baseline readiness with final all-conditions completion.
Layer relation:
1. Product landing presents the flagship surface.
2. Docs supports operation.
3. Research explains structure and law.
4. Eval inspects pressure, drift, continuity, and readiness under stress.
Boundary honesty:
1. This hub reflects the current public Eval layer of Avatar.
2. It is strong enough to support branch inspection and bounded readiness interpretation.
3. It does not claim that all future Eval surfaces are already complete.
4. It does not claim theorem-grade universal closure.
5. It does not claim that current pass signals eliminate the need for later stronger verification.
Next-page routing:
1. For hostile stress, go to Blackfan Testing.
2. For behavior continuity, go to Persona Behavior Checks.
3. For multilingual scope, go to Multilingual Status.
4. For deeper structural explanation, go to Research Hub and Blackfan Audit Baseline.
AI_NOTE_END
-->
# 🧪 Eval Hub
This page is the evaluation hub for **WFGY 5.0 Avatar**.
Avatar needs Docs because people need to know how to start.
Avatar needs Research because deeper structure needs a lawful place to live.
Avatar also needs Eval because neither startup clarity nor theoretical richness is enough by itself.
A system can be:

1. easy to start
2. elegant to describe
3. dense in theory
4. strong in local demos

and still fail under pressure.

That is why this layer exists.
The Eval layer is where the branch asks harder questions like:
1. does the branch survive blackfan pressure
2. does persona continuity remain visible under real tasks
3. does the system stay honest about what is ready and what is still open
4. does multilingual status remain bounded instead of overclaimed
5. do return-path and behavior checks reflect real continuity instead of surface-only success

This hub is not here to replace the packed master body.
It is here to make pressure visible.

---
## ✨ Why this layer exists
The Docs layer answers questions like:
1. how do I start
2. how do I boot
3. how do I tune
4. how do I recover

The Research layer answers questions like:
1. what is execution
2. what is route law
3. what is runtime carry
4. why does structured imperfection matter
5. what is hard control
6. what counts as accountability

The Eval layer answers a different class of questions:
1. what breaks under pressure
2. what still holds under pressure
3. what looks successful but is actually counterfeit
4. what is ready at current branch baseline
5. what still needs stronger verification later

That is why Eval needs its own hub.

---
## 🧭 How to use this hub
Use this hub in one of four ways.
### 1. I want stress and adversarial pressure
Start here when the main question is whether the branch survives harsh inspection instead of friendly reading.
1. [🧨 Blackfan Testing](./blackfan-testing.md)

This is the right place to begin when your question is:
1. where does the branch crack
2. what happens under hostile evaluation
3. how should current branch strength be interpreted without hype
### 2. I want behavior continuity inspection
Start here when the main question is whether active persona and behavior actually survive across turns, tasks, and returns.
1. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)

This is the right place to begin when your question is:
1. did the persona stay alive
2. did return-path recovery actually work
3. did the output become generic after pressure
4. did visible behavior stay lawful instead of merely recognizable
### 3. I want multilingual readiness signals
Start here when the main question is what the current branch is honestly claiming across language scope.
1. [🌍 Multilingual Status](./multilingual-status.md)

This is the right place to begin when your question is:
1. what is already tested
2. what is only partial
3. what remains open
4. how language support is being stated without bluffing
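The tested / partial / open split above can be kept explicit in a small status record. The following is a minimal sketch, not Avatar's actual status format: the `Readiness` enum, the `describe_support` helper, and the `lang-a` style labels are all hypothetical placeholders.

```python
from enum import Enum


class Readiness(Enum):
    """Three bounded states, mirroring tested / partial / open (hypothetical)."""
    TESTED = "tested"
    PARTIAL = "partial"
    OPEN = "open"


def describe_support(status: dict[str, Readiness]) -> str:
    """Build a support statement that reports each state separately.

    Languages that are not TESTED are never folded into the tested group.
    """
    parts = []
    for state in Readiness:  # Enum iteration follows definition order
        langs = sorted(lang for lang, s in status.items() if s is state)
        if langs:
            parts.append(f"{state.value}: {', '.join(langs)}")
    return "; ".join(parts) if parts else "no language claims recorded"


# Placeholder labels only; this is not real Avatar status data.
example = {
    "lang-a": Readiness.TESTED,
    "lang-b": Readiness.PARTIAL,
    "lang-c": Readiness.OPEN,
}
print(describe_support(example))
```

Keeping three separate states, rather than a single "supported" flag, is what makes it hard to describe a partial language as fully ready.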
### 4. I want the broader picture around Eval
Start here when you need to connect what Eval is seeing back to the deeper branch structure.
1. [🔬 Research Hub](../research/README.md)
2. [🗺️ Packed Master Structure Map](../research/packed-master-structure-map.md)
3. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)

This is the best route when your question is not only “did it pass,” but also “what exactly was being tested and why.”

---
## 🧱 What belongs in the Eval layer
The Eval layer is where branch pressure becomes explicit.
Typical Eval-layer questions include:
1. what kinds of pressure should this branch survive right now
2. what kinds of success do not deserve credit
3. what kinds of drift are already detectable
4. what counts as baseline-ready versus still-open
5. how should visible behavior be checked across modes
6. how should multilingual claims remain bounded
7. how should hostile or skeptical inspection be handled

This layer is not where the whole theory is restated.
It is where the branch is asked to show that its current claims can survive contact with pressure.

---
## 🧠 Current eval surfaces
The current Eval layer is organized into three major surfaces.
### 1. Adversarial pressure surface
1. [🧨 Blackfan Testing](./blackfan-testing.md)

This surface is about:
1. hostile reading
2. anti-hype pressure
3. branch stress
4. counterfeit-success detection
5. bounded release honesty under attack
### 2. Behavior continuity surface
1. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)

This surface is about:
1. persona continuity
2. landing behavior
3. return-path integrity
4. drift after article, analysis, rewrite, search, or tool pressure
5. whether recovery is real or only cosmetic
### 3. Multilingual readiness surface
1. [🌍 Multilingual Status](./multilingual-status.md)

This surface is about:
1. what language claims are actually supported
2. what remains partial
3. how language support is being described honestly
4. how multilingual scope stays bounded instead of mythical
---
## 🪜 Suggested eval paths
### Path A: skeptical reader path
Use this path when the goal is to test whether the branch is only persuasive or actually pressure-bearing.
1. [🧨 Blackfan Testing](./blackfan-testing.md)
2. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
3. [🗺️ Packed Master Structure Map](../research/packed-master-structure-map.md)

This route helps answer:
1. what was stressed
2. what kind of baseline pass is being claimed
3. what remains bounded instead of inflated
### Path B: runtime continuity path
Use this path when the concern is whether persona and carry survive real usage.
1. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
2. [🔄 Activation, Attenuation, and Reentry](../research/activation-attenuation-and-reentry.md)
3. [🎛️ Runtime Posture Intensity Map](../research/runtime-posture-intensity-map.md)
4. [🔧 Persona Recovery Operations](../docs/persona-recovery-operations.md)

This route helps answer:
1. what drift happened
2. whether return-path behavior stayed lawful
3. whether recovery should receive credit
### Path C: multilingual honesty path
Use this path when the concern is language scope and readiness posture.
1. [🌍 Multilingual Status](./multilingual-status.md)
2. [🧮 Matrix Accountability and Numeric Binding](../research/matrix-accountability-and-numeric-binding.md)
3. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)

This route helps answer:
1. how support is being bounded
2. whether language claims are being overstated
3. how readiness stays honest
### Path D: branch readiness path
Use this path when the concern is “is this branch publicly real enough right now.”
1. [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
2. [🧨 Blackfan Testing](./blackfan-testing.md)
3. [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
4. [🌍 Multilingual Status](./multilingual-status.md)

This route helps answer:
1. what is already solid
2. what still needs stronger verification
3. what is release-baseline reality versus future strengthening
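The four paths above are ordered reading lists, so they can be written down as data. This is an illustrative sketch only: the keys `"A"` to `"D"`, the `EVAL_PATHS` table, and the `reading_order` helper are hypothetical names, while the relative paths are the links listed in this hub.

```python
# Hypothetical table of the suggested paths; only the file paths come from this hub.
EVAL_PATHS = {
    "A": [  # skeptical reader path
        "./blackfan-testing.md",
        "../research/blackfan-audit-baseline.md",
        "../research/packed-master-structure-map.md",
    ],
    "B": [  # runtime continuity path
        "./persona-behavior-checks.md",
        "../research/activation-attenuation-and-reentry.md",
        "../research/runtime-posture-intensity-map.md",
        "../docs/persona-recovery-operations.md",
    ],
    "C": [  # multilingual honesty path
        "./multilingual-status.md",
        "../research/matrix-accountability-and-numeric-binding.md",
        "../research/blackfan-audit-baseline.md",
    ],
    "D": [  # branch readiness path
        "../research/blackfan-audit-baseline.md",
        "./blackfan-testing.md",
        "./persona-behavior-checks.md",
        "./multilingual-status.md",
    ],
}


def reading_order(path_id: str) -> list[str]:
    """Return the pages of one suggested path, in reading order (hypothetical helper)."""
    key = path_id.strip().upper()
    if key not in EVAL_PATHS:
        known = ", ".join(sorted(EVAL_PATHS))
        raise ValueError(f"unknown path {path_id!r}; expected one of: {known}")
    return list(EVAL_PATHS[key])  # copy, so callers cannot mutate the table


for step, page in enumerate(reading_order("b"), start=1):
    print(step, page)
```

Every path in this sketch includes at least one page from the Eval layer itself (a `./` link), which matches the intent that Research and Docs pages support inspection rather than replace it.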
---
## 🔍 Why eval and research are different
This is important.
The **Research** layer asks:
1. what does this structure mean
2. why is this operator necessary
3. how do these layers relate
4. why is this boundary lawful
The **Eval** layer asks:
1. did the claimed behavior survive pressure
2. did runtime collapse under use
3. did route integrity actually hold
4. did the branch receive credit it should not receive
5. is the current branch being described honestly
So:

1. Research explains structure
2. Eval tests claims against pressure

Both matter.
They are not the same job.

---
## 🔍 Why eval and docs are different
The **Docs** layer helps people operate the current branch.
The **Eval** layer helps people judge the current branch.
For example:

1. Docs explain how to recover; Eval checks whether recovery is actually real.
2. Docs explain how to tune; Eval shows whether tuning produced lawful improvement or just prettier outputs.
3. Docs explain how to start; Eval shows whether startup clarity survives real branch pressure.

This separation is healthy.
It stops usage guidance from quietly turning into self-certification.

---
## 🌍 Why multilingual status belongs here
Language support is easy to overclaim.
A project can say:

1. works in many languages
2. supports multilingual use
3. behaves well cross-lingually

while still having:

1. patchy behavior
2. uneven readiness
3. language-specific drift
4. unclear support boundaries

That is why multilingual status belongs in Eval rather than only in product copy.
It is part of branch honesty, not just capability branding.

---
## 🧪 What this hub does not claim
This hub does **not** claim:
1. that all pressure surfaces are already complete
2. that current Eval pages already cover every future branch risk
3. that passing one Eval page means the whole system is universally solved
4. that current multilingual status already equals final global support
5. that current behavior checks already replace future replay and audit extensions
6. that current baseline pass means no stronger verification is worth doing later
This hub is a bounded Eval center.
That is exactly what it should be.
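The non-claims above can be read as a rule for combining pass signals: one passing surface never upgrades the whole branch, and the strongest label on offer is a bounded baseline. A minimal sketch under stated assumptions: the surface identifiers, the `summarize` helper, and the label strings are hypothetical and are not part of any WFGY tooling.

```python
# Hypothetical identifiers for the three Eval surfaces listed in this hub.
SURFACES = ("blackfan-testing", "persona-behavior-checks", "multilingual-status")


def summarize(results: dict[str, bool]) -> str:
    """Combine per-surface pass signals into a bounded label.

    The strongest label is "baseline-ready"; this sketch deliberately has
    no "final" or "complete" label.
    """
    missing = [s for s in SURFACES if s not in results]
    if missing:
        return "open: not yet inspected: " + ", ".join(missing)
    failed = [s for s in SURFACES if not results[s]]
    if failed:
        return "open: failed under pressure: " + ", ".join(failed)
    return "baseline-ready (stronger verification still applies)"


# One passing page says nothing about the surfaces that were not inspected.
print(summarize({"blackfan-testing": True}))
```

Checking for missing surfaces before failed ones means an uninspected surface is reported as open rather than silently counted as a pass.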
---
## 🚀 Where to go next
### For public product entry
Go to [✨ Avatar Home](../README.md)
### For startup and commands
Go to [⚡ Quickstart](../docs/quickstart.md) and [⌨️ Boot Commands](../docs/boot-commands.md)
### For reading order and tuning
Go to [📖 How to Read the Avatar Master File](../docs/how-to-read-the-avatar-master-file.md), [🍳 Parameter Tuning Cookbook](../docs/parameter-tuning-cookbook.md), and [🔧 Persona Recovery Operations](../docs/persona-recovery-operations.md)
### For deep structural reading
Go to [🔬 Research Hub](../research/README.md)
### For skeptical pressure
Go to [🧨 Blackfan Testing](./blackfan-testing.md)
### For continuity inspection
Go to [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
### For language readiness
Go to [🌍 Multilingual Status](./multilingual-status.md)
### For audit posture
Go to [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
---
## 🔗 Quick links
### Eval core
- [🧨 Blackfan Testing](./blackfan-testing.md)
- [🧭 Persona Behavior Checks](./persona-behavior-checks.md)
- [🌍 Multilingual Status](./multilingual-status.md)
### Docs
- [✨ Avatar Home](../README.md)
- [⚡ Quickstart](../docs/quickstart.md)
- [⌨️ Boot Commands](../docs/boot-commands.md)
- [📖 How to Read the Avatar Master File](../docs/how-to-read-the-avatar-master-file.md)
- [🍳 Parameter Tuning Cookbook](../docs/parameter-tuning-cookbook.md)
- [🔧 Persona Recovery Operations](../docs/persona-recovery-operations.md)
- [🛠️ Avatar Tuning Workflow](../docs/avatar-tuning-workflow.md)
### Research
- [🔬 Research Hub](../research/README.md)
- [🗺️ Packed Master Structure Map](../research/packed-master-structure-map.md)
- [🔁 Dual Closed-Loop Execution Chain](../research/dual-closed-loop-execution-chain.md)
- [🎛️ Runtime Posture Intensity Map](../research/runtime-posture-intensity-map.md)
- [🧩 Shell-to-Runtime Mapping](../research/shell-to-runtime-mapping.md)
- [🧭 Selector Execution Domain](../research/selector-execution-domain.md)
- [🔄 Activation, Attenuation, and Reentry](../research/activation-attenuation-and-reentry.md)
- [🧬 Structured Imperfection Theory](../research/structured-imperfection-theory.md)
- [🚦 Pre-Emission Floor and Hard Control](../research/pre-emission-floor-and-hard-control.md)
- [🧮 Matrix Accountability and Numeric Binding](../research/matrix-accountability-and-numeric-binding.md)
- [🧪 Blackfan Audit Baseline](../research/blackfan-audit-baseline.md)
- [✂️ Compression and Non-Duplication Law](../research/compression-and-non-duplication-law.md)
- [🏗️ Architecture Overview](../research/architecture-overview.md)
- [🧭 Language Governance](../research/language-governance.md)
- [🧠 WFGY_BRAIN Theory](../research/wfgy-brain-theory.md)
### Up
- [⬆️ Back to Avatar Home](../README.md)
- [⬆️ Back to WFGY Root](../../README.md)