PSBigBig + MiniPS · 6c035c28c1 · Update README.md · 2026-04-04 15:14:48 +08:00

🧪 Eval Hub

This page is the evaluation hub for WFGY 5.0 Avatar.

Avatar needs Docs because people need to know how to start.
Avatar needs Research because deeper structure needs a lawful place to live.
Avatar also needs Eval because neither startup clarity nor theoretical richness is enough by itself.

A system can be:

  1. easy to start
  2. elegant to describe
  3. dense in theory
  4. strong in local demos

and still fail under pressure.

That is why this layer exists.

The Eval layer is where the branch asks harder questions like:

  1. does the branch survive Blackfan pressure
  2. does persona continuity remain visible under real tasks
  3. does the system stay honest about what is ready and what is still open
  4. does multilingual status remain bounded instead of overclaimed
  5. do return-path and behavior checks reflect real continuity instead of surface-only success

This hub is not here to replace the body.
It is here to make pressure visible.


Why this layer exists

The Docs layer answers questions like:

  1. how do I start
  2. how do I boot
  3. how do I tune
  4. how do I recover

The Research layer answers questions like:

  1. what is execution
  2. what is route law
  3. what is runtime carry
  4. why does structured imperfection matter
  5. what is hard control
  6. what counts as accountability

The Eval layer answers a different class of questions:

  1. what breaks under pressure
  2. what still holds under pressure
  3. what looks successful but is actually counterfeit
  4. what is ready at current branch baseline
  5. what still needs stronger verification later

That is why Eval needs its own hub.


🧭 How to use this hub

Use this hub in one of four ways.

1. I want stress and adversarial pressure

Start here when the main question is whether the branch survives harsh inspection instead of friendly reading.

  1. 🧨 Blackfan Testing

This is the right place to begin when your question is:

  1. where does the branch crack
  2. what happens under hostile evaluation
  3. how should current branch strength be interpreted without hype

2. I want behavior continuity inspection

Start here when the main question is whether active persona and behavior actually survive across turns, tasks, and returns.

  1. 🧭 Persona Behavior Checks

This is the right place to begin when your question is:

  1. did the persona stay alive
  2. did return-path recovery actually work
  3. did the output become generic after pressure
  4. did visible behavior stay lawful instead of merely recognizable
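The questions above can be made concrete as a small script. This is a minimal sketch, not the branch's actual harness: the marker tags, the transcript format, and the drift rule are all illustrative assumptions.

```python
# Hypothetical sketch: check whether persona markers survive across turns.
# Marker names and the transcript format are illustrative assumptions.

PERSONA_MARKERS = {"voice:first-person", "scope:bounded-claims", "tone:direct"}

def markers_present(turn_output: str, markers: set[str]) -> set[str]:
    """Return the subset of markers whose tag appears in the output."""
    return {m for m in markers if f"[{m}]" in turn_output}

def continuity_report(turns: list[str]) -> list[set[str]]:
    """Per-turn record of which persona markers survived."""
    return [markers_present(t, PERSONA_MARKERS) for t in turns]

def drifted(report: list[set[str]]) -> bool:
    """Drift = any later turn carries fewer markers than the first turn."""
    if not report:
        return False
    baseline = report[0]
    return any(not baseline <= turn for turn in report[1:])

turns = [
    "[voice:first-person] [scope:bounded-claims] [tone:direct] answer one",
    "[voice:first-person] [tone:direct] answer two",  # lost a marker
]
report = continuity_report(turns)
print(drifted(report))  # a lost marker after turn one counts as drift
```

The point of the sketch is the shape of the check, not the marker set: continuity is judged against the first turn, so a later output that merely "sounds right" while dropping markers still registers as drift.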

3. I want multilingual readiness signals

Start here when the main question is what the current branch is honestly claiming across language scope.

  1. 🌍 Multilingual Status

This is the right place to begin when your question is:

  1. what is already tested
  2. what is only partial
  3. what remains open
  4. how language support is being stated without bluffing
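"Bounded instead of overclaimed" can itself be mechanized. A minimal sketch, where the language codes, status labels, and support matrix are illustrative assumptions rather than the branch's actual readiness data:

```python
# Hypothetical sketch: keep multilingual claims bounded by recorded status.
# The status matrix below is illustrative, not the branch's real matrix.

STATUS = {
    "en": "tested",
    "zh": "partial",
    "ja": "open",
}

CLAIMABLE = {"tested"}  # only fully tested languages may be claimed

def claim(lang: str) -> str:
    """State support for a language without overclaiming it."""
    status = STATUS.get(lang, "open")  # unknown languages default to open
    if status in CLAIMABLE:
        return f"{lang}: supported (tested)"
    return f"{lang}: not claimed (status: {status})"

print(claim("en"))
print(claim("ja"))
```

The design choice worth noting is the default: a language absent from the matrix is reported as open, so the claim surface can only shrink relative to what was actually tested.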

4. I want the broader picture around Eval

Start here when you need to connect what Eval is seeing back to the deeper branch structure.

  1. 🔬 Research Hub
  2. 🗺️ Packed Master Structure Map
  3. 🧪 Blackfan Audit Baseline

This is the best route when your question is not only “did it pass,” but also “what exactly was being tested and why.”


🧱 What belongs in the Eval layer

The Eval layer is where branch pressure becomes explicit.

Typical Eval-layer questions include:

  1. what kinds of pressure should this branch survive right now
  2. what kinds of success do not deserve credit
  3. what kinds of drift are already detectable
  4. what counts as baseline-ready versus still-open
  5. how should visible behavior be checked across modes
  6. how should multilingual claims remain bounded
  7. how should hostile or skeptical inspection be handled

This layer is not where the whole theory is restated.
It is where the branch is asked to show that its current claims can survive contact with pressure.


🧠 Current eval surfaces

The current Eval layer is organized into three major surfaces.

1. Adversarial pressure surface

  1. 🧨 Blackfan Testing

This surface is about:

  1. hostile reading
  2. anti-hype pressure
  3. branch stress
  4. counterfeit-success detection
  5. bounded release honesty under attack

2. Behavior continuity surface

  1. 🧭 Persona Behavior Checks

This surface is about:

  1. persona continuity
  2. landing behavior
  3. return-path integrity
  4. drift after article, analysis, rewrite, search, or tool pressure
  5. whether recovery is real or only cosmetic

3. Multilingual readiness surface

  1. 🌍 Multilingual Status

This surface is about:

  1. what language claims are actually supported
  2. what remains partial
  3. how language support is being described honestly
  4. how multilingual scope stays bounded instead of mythical

🪜 Suggested eval paths

Path A: skeptical reader path

Use this path when the goal is to test whether the branch is only persuasive or actually pressure-bearing.

  1. 🧨 Blackfan Testing
  2. 🧪 Blackfan Audit Baseline
  3. 🗺️ Packed Master Structure Map

This route helps answer:

  1. what was stressed
  2. what kind of baseline pass is being claimed
  3. what remains bounded instead of inflated

Path B: runtime continuity path

Use this path when the concern is whether persona and carry survive real usage.

  1. 🧭 Persona Behavior Checks
  2. 🔄 Activation, Attenuation, and Reentry
  3. 🎛️ Runtime Posture Intensity Map
  4. 🔧 Persona Recovery Operations

This route helps answer:

  1. what drift happened
  2. whether return-path behavior stayed lawful
  3. whether recovery should receive credit

Path C: multilingual honesty path

Use this path when the concern is language scope and readiness posture.

  1. 🌍 Multilingual Status
  2. 🧮 Matrix Accountability and Numeric Binding
  3. 🧪 Blackfan Audit Baseline

This route helps answer:

  1. how support is being bounded
  2. whether language claims are being overstated
  3. how readiness stays honest

Path D: branch readiness path

Use this path when the concern is “is this branch publicly real enough right now.”

  1. 🧪 Blackfan Audit Baseline
  2. 🧨 Blackfan Testing
  3. 🧭 Persona Behavior Checks
  4. 🌍 Multilingual Status

This route helps answer:

  1. what is already solid
  2. what still needs stronger verification
  3. what is release-baseline reality versus future strengthening

🔍 Why Eval and Research are different

This is important.

The Research layer asks:

  1. what does this structure mean
  2. why is this operator necessary
  3. how do these layers relate
  4. why is this boundary lawful

The Eval layer asks:

  1. did the claimed behavior survive pressure
  2. did runtime collapse under use
  3. did route integrity actually hold
  4. did the branch receive credit it should not receive
  5. is the current branch being described honestly

So:

  1. Research explains structure
  2. Eval tests claims against pressure

Both matter.
They are not the same job.


🔍 Why Eval and Docs are different

The Docs layer helps people operate the current branch.

The Eval layer helps people judge the current branch.

For example:

  1. Docs explain how to recover; Eval checks whether recovery is actually real
  2. Docs explain how to tune; Eval shows whether tuning produced lawful improvement or just prettier outputs
  3. Docs explain how to start; Eval shows whether startup clarity survives real branch pressure

This separation is healthy.
It stops usage guidance from quietly turning into self-certification.


🌍 Why multilingual status belongs here

Language support is easy to overclaim.

A project can say:

  1. works in many languages
  2. supports multilingual use
  3. behaves well cross-lingually

while still having:

  1. patchy behavior
  2. uneven readiness
  3. language-specific drift
  4. unclear support boundaries

That is why multilingual status belongs in Eval rather than only in product copy.

It is part of branch honesty, not just capability branding.


🧪 What this hub does not claim

This hub does not claim:

  1. that all pressure surfaces are already complete
  2. that current Eval pages already cover every future branch risk
  3. that passing one Eval page means the whole system is universally solved
  4. that current multilingual status already equals final global support
  5. that current behavior checks already replace future replay and audit extensions
  6. that current baseline pass means no stronger verification is worth doing later

This hub is a bounded Eval center.

That is exactly what it should be.


🚀 Where to go next

For public product entry

Go to Avatar Home

For startup and commands

Go to Quickstart and ⌨️ Boot Commands

For reading order and tuning

Go to 📖 How to Read the Avatar Master File, 🍳 Parameter Tuning Cookbook, and 🔧 Persona Recovery Operations

For deep structural reading

Go to 🔬 Research Hub

For skeptical pressure

Go to 🧨 Blackfan Testing

For continuity inspection

Go to 🧭 Persona Behavior Checks

For language readiness

Go to 🌍 Multilingual Status

For audit posture

Go to 🧪 Blackfan Audit Baseline
