PSBigBig + MiniPS · 6c035c28c1 · Update README.md · 2026-04-04 15:14:48 +08:00

🧪 Eval Hub

This page is the evaluation hub for WFGY 5.0 Avatar.

Avatar needs Docs because people need to know how to start.
Avatar needs Research because deeper structure needs a lawful place to live.
Avatar also needs Eval because neither startup clarity nor theoretical richness is enough by itself.

A system can be:

  1. easy to start
  2. elegant to describe
  3. dense in theory
  4. strong in local demos

and still fail under pressure.

That is why this layer exists.

The Eval layer is where the branch asks harder questions like:

  1. does the branch survive Blackfan pressure
  2. does persona continuity remain visible under real tasks
  3. does the system stay honest about what is ready and what is still open
  4. does multilingual status remain bounded instead of overclaimed
  5. do return-path and behavior checks reflect real continuity instead of surface-only success

This hub is not here to replace the body.
It is here to make pressure visible.


Why this layer exists

The Docs layer answers questions like:

  1. how do I start
  2. how do I boot
  3. how do I tune
  4. how do I recover

The Research layer answers questions like:

  1. what is execution
  2. what is route law
  3. what is runtime carry
  4. why does structured imperfection matter
  5. what is hard control
  6. what counts as accountability

The Eval layer answers a different class of questions:

  1. what breaks under pressure
  2. what still holds under pressure
  3. what looks successful but is actually counterfeit
  4. what is ready at current branch baseline
  5. what still needs stronger verification later

That is why Eval needs its own hub.


🧭 How to use this hub

Use this hub in one of four ways.

1. I want stress and adversarial pressure

Start here when the main question is whether the branch survives harsh inspection instead of friendly reading.

  1. 🧨 Blackfan Testing

This is the right place to begin when your question is:

  1. where does the branch crack
  2. what happens under hostile evaluation
  3. how should current branch strength be interpreted without hype

2. I want behavior continuity inspection

Start here when the main question is whether active persona and behavior actually survive across turns, tasks, and returns.

  1. 🧭 Persona Behavior Checks

This is the right place to begin when your question is:

  1. did the persona stay alive
  2. did return-path recovery actually work
  3. did the output become generic after pressure
  4. did visible behavior stay lawful instead of merely recognizable
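The questions above can be made concrete as a small script. This is a minimal sketch, not the branch's actual harness: the marker tags, the transcript format, and the drift rule are all illustrative assumptions.

```python
# Hypothetical sketch: check whether persona markers survive across turns.
# Marker names and the transcript format are illustrative assumptions.

PERSONA_MARKERS = {"voice:first-person", "scope:bounded-claims", "tone:direct"}

def markers_present(turn_output: str, markers: set[str]) -> set[str]:
    """Return the subset of markers whose tag appears in the output."""
    return {m for m in markers if f"[{m}]" in turn_output}

def continuity_report(turns: list[str]) -> list[set[str]]:
    """Per-turn record of which persona markers survived."""
    return [markers_present(t, PERSONA_MARKERS) for t in turns]

def drifted(report: list[set[str]]) -> bool:
    """Drift = any later turn carries fewer markers than the first turn."""
    if not report:
        return False
    baseline = report[0]
    return any(not baseline <= turn for turn in report[1:])

turns = [
    "[voice:first-person] [scope:bounded-claims] [tone:direct] answer one",
    "[voice:first-person] [tone:direct] answer two",  # lost a marker
]
report = continuity_report(turns)
print(drifted(report))  # a lost marker after turn one counts as drift
```

The point of the sketch is the shape of the check, not the marker set: continuity is judged against the first turn, so a later output that merely "sounds right" while dropping markers still registers as drift.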

3. I want multilingual readiness signals

Start here when the main question is what the current branch is honestly claiming across language scope.

  1. 🌍 Multilingual Status

This is the right place to begin when your question is:

  1. what is already tested
  2. what is only partial
  3. what remains open
  4. how language support is being stated without bluffing
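"Bounded instead of overclaimed" can itself be mechanized. A minimal sketch, where the language codes, status labels, and support matrix are illustrative assumptions rather than the branch's actual readiness data:

```python
# Hypothetical sketch: keep multilingual claims bounded by recorded status.
# The status matrix below is illustrative, not the branch's real matrix.

STATUS = {
    "en": "tested",
    "zh": "partial",
    "ja": "open",
}

CLAIMABLE = {"tested"}  # only fully tested languages may be claimed

def claim(lang: str) -> str:
    """State support for a language without overclaiming it."""
    status = STATUS.get(lang, "open")  # unknown languages default to open
    if status in CLAIMABLE:
        return f"{lang}: supported (tested)"
    return f"{lang}: not claimed (status: {status})"

print(claim("en"))
print(claim("ja"))
```

The design choice worth noting is the default: a language absent from the matrix is reported as open, so the claim surface can only shrink relative to what was actually tested.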

4. I want the broader picture around Eval

Start here when you need to connect what Eval is seeing back to the deeper branch structure.

  1. 🔬 Research Hub
  2. 🗺️ Packed Master Structure Map
  3. 🧪 Blackfan Audit Baseline

This is the best route when your question is not only “did it pass,” but also “what exactly was being tested and why.”


🧱 What belongs in the Eval layer

The Eval layer is where branch pressure becomes explicit.

Typical Eval-layer questions include:

  1. what kinds of pressure should this branch survive right now
  2. what kinds of success do not deserve credit
  3. what kinds of drift are already detectable
  4. what counts as baseline-ready versus still-open
  5. how should visible behavior be checked across modes
  6. how should multilingual claims remain bounded
  7. how should hostile or skeptical inspection be handled

This layer is not where the whole theory is restated.
It is where the branch is asked to show that its current claims can survive contact with pressure.


🧠 Current eval surfaces

The current Eval layer is organized into three major surfaces.

1. Adversarial pressure surface

  1. 🧨 Blackfan Testing

This surface is about:

  1. hostile reading
  2. anti-hype pressure
  3. branch stress
  4. counterfeit-success detection
  5. bounded release honesty under attack

2. Behavior continuity surface

  1. 🧭 Persona Behavior Checks

This surface is about:

  1. persona continuity
  2. landing behavior
  3. return-path integrity
  4. drift after article, analysis, rewrite, search, or tool pressure
  5. whether recovery is real or only cosmetic

3. Multilingual readiness surface

  1. 🌍 Multilingual Status

This surface is about:

  1. what language claims are actually supported
  2. what remains partial
  3. how language support is being described honestly
  4. how multilingual scope stays bounded instead of mythical

🪜 Suggested eval paths

Path A: skeptical reader path

Use this path when the goal is to test whether the branch is only persuasive or actually pressure-bearing.

  1. 🧨 Blackfan Testing
  2. 🧪 Blackfan Audit Baseline
  3. 🗺️ Packed Master Structure Map

This route helps answer:

  1. what was stressed
  2. what kind of baseline pass is being claimed
  3. what remains bounded instead of inflated

Path B: runtime continuity path

Use this path when the concern is whether persona and carry survive real usage.

  1. 🧭 Persona Behavior Checks
  2. 🔄 Activation, Attenuation, and Reentry
  3. 🎛️ Runtime Posture Intensity Map
  4. 🔧 Persona Recovery Operations

This route helps answer:

  1. what drift happened
  2. whether return-path behavior stayed lawful
  3. whether recovery should receive credit

Path C: multilingual honesty path

Use this path when the concern is language scope and readiness posture.

  1. 🌍 Multilingual Status
  2. 🧮 Matrix Accountability and Numeric Binding
  3. 🧪 Blackfan Audit Baseline

This route helps answer:

  1. how support is being bounded
  2. whether language claims are being overstated
  3. how readiness stays honest

Path D: branch readiness path

Use this path when the concern is “is this branch publicly real enough right now.”

  1. 🧪 Blackfan Audit Baseline
  2. 🧨 Blackfan Testing
  3. 🧭 Persona Behavior Checks
  4. 🌍 Multilingual Status

This route helps answer:

  1. what is already solid
  2. what still needs stronger verification
  3. what is release-baseline reality versus future strengthening

🔍 Why Eval and Research are different

This is important.

The Research layer asks:

  1. what does this structure mean
  2. why is this operator necessary
  3. how do these layers relate
  4. why is this boundary lawful

The Eval layer asks:

  1. did the claimed behavior survive pressure
  2. did runtime collapse under use
  3. did route integrity actually hold
  4. did the branch receive credit it should not receive
  5. is the current branch being described honestly

So:

  1. Research explains structure
  2. Eval tests claims against pressure

Both matter.
They are not the same job.


🔍 Why Eval and Docs are different

The Docs layer helps people operate the current branch.

The Eval layer helps people judge the current branch.

For example:

  1. Docs explain how to recover; Eval checks whether recovery is actually real
  2. Docs explain how to tune; Eval shows whether tuning produced lawful improvement or just prettier outputs
  3. Docs explain how to start; Eval shows whether startup clarity survives real branch pressure

This separation is healthy.
It stops usage guidance from quietly turning into self-certification.


🌍 Why multilingual status belongs here

Language support is easy to overclaim.

A project can say:

  1. works in many languages
  2. supports multilingual use
  3. behaves well cross-lingually

while still having:

  1. patchy behavior
  2. uneven readiness
  3. language-specific drift
  4. unclear support boundaries

That is why multilingual status belongs in Eval rather than only in product copy.

It is part of branch honesty, not just capability branding.


🧪 What this hub does not claim

This hub does not claim:

  1. that all pressure surfaces are already complete
  2. that current Eval pages already cover every future branch risk
  3. that passing one Eval page means the whole system is universally solved
  4. that current multilingual status already equals final global support
  5. that current behavior checks already replace future replay and audit extensions
  6. that current baseline pass means no stronger verification is worth doing later

This hub is a bounded Eval center.

That is exactly what it should be.


🚀 Where to go next

For public product entry

Go to Avatar Home

For startup and commands

Go to Quickstart and ⌨️ Boot Commands

For reading order and tuning

Go to 📖 How to Read the Avatar Master File, 🍳 Parameter Tuning Cookbook, and 🔧 Persona Recovery Operations

For deep structural reading

Go to 🔬 Research Hub

For skeptical pressure

Go to 🧨 Blackfan Testing

For continuity inspection

Go to 🧭 Persona Behavior Checks

For language readiness

Go to 🌍 Multilingual Status

For audit posture

Go to 🧪 Blackfan Audit Baseline
