Merge pull request #116 from onestardao/PILOT_OFFER_ONE_PAGER.md

Add pilot collaboration entry and sample deliverable docs
This commit is contained in:
PSBigBig + MiniPS 2026-03-08 23:07:21 +08:00 committed by GitHub
commit 3b22016667
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 690 additions and 0 deletions

PILOT_OFFER_ONE_PAGER.md

@@ -0,0 +1,263 @@
# PILOT_OFFER_ONE_PAGER
Pilot collaboration entry for teams exploring WFGY in real AI workflows.
This page is a compact, buyer-facing summary of what a WFGY pilot can look like.
It is written for teams who already have a real system, a real failure pattern, or a real evaluation problem, and want to test whether WFGY is useful in practice.
For the broader collaboration entry, see [WORK_WITH_WFGY.md](./WORK_WITH_WFGY.md).
For a historical view of how WFGY became publicly legible, see [EVIDENCE_TIMELINE.md](./EVIDENCE_TIMELINE.md).
For a sample output shape, see [SAMPLE_DELIVERABLE.md](./SAMPLE_DELIVERABLE.md).
---
## What this page is
This page is a practical pilot overview.
Its job is simple:
help a serious team answer three questions quickly:
1. Is WFGY relevant to our situation
2. What would a small pilot actually look like
3. What would we likely get back at the end
This is not a pitch deck, not a customer logo page, and not a promise of enterprise deployment.
---
## Who this is for
WFGY pilots are best suited for teams that already have one of the following:
* a RAG system that keeps returning wrong answers even when infra looks normal
* an agent or multi-agent workflow with unstable behavior, drift, or brittle handoffs
* an evaluation workflow that can score outputs, but still cannot clearly explain failure structure
* a debugging process that is expensive, slow, and overly dependent on ad hoc intuition
* a research or platform team that wants a more structured way to classify failure modes
In short, this page is for teams with real questions, not for people looking for generic prompt advice.
---
## What WFGY is most useful for in a pilot
At the current stage, the strongest practical wedge for WFGY is structured diagnosis.
That usually means one or more of the following:
* classifying recurring failure modes in a RAG or agent pipeline
* separating retrieval, prompt assembly, orchestration, memory, and evaluation failures
* building a more stable debugging vocabulary across engineers, PMs, and researchers
* turning scattered symptoms into a smaller set of reproducible failure categories
* reducing guesswork before a team spends time on bigger architectural changes
This is especially useful when a team already knows that “something is wrong,” but cannot yet describe the failure in a way that leads to clean fixes.
---
## Pilot formats
A WFGY pilot will usually fit one of these formats.
### 1. Failure audit pilot
Best for teams with a live or recently failing RAG or agent workflow.
Typical goal:
map observed failures into a smaller set of structured categories, identify the likely layer where the problem actually lives, and suggest the smallest next debugging moves.
Typical inputs:
* failing examples
* run traces, logs, screenshots, or prompt chains
* brief architecture description
* known symptoms and current hypotheses
Typical outputs:
* structured failure classification
* likely root-cause layer analysis
* fix priority suggestions
* a clearer debugging route for the team
---
### 2. Triage workshop pilot
Best for teams that need fast alignment across internal stakeholders.
Typical goal:
use WFGY surfaces such as the Problem Map or Global Debug Card to create a shared language for triage, review, and prioritization.
Typical inputs:
* representative failure cases
* current internal workflow for debugging or review
* participating team roles
* constraints on time, tooling, or ownership
Typical outputs:
* a shared failure vocabulary
* a smaller triage decision surface
* candidate routing rules for common cases
* a cleaner handoff structure across team members
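The "candidate routing rules" output above can be made concrete as a small lookup table that maps a tagged failure category to an owner and a first diagnostic check. This is a hypothetical sketch only: the category names, owners, and checks below are illustrative assumptions, not part of WFGY itself.

```python
# Hypothetical routing rules for common failure cases.
# All category names, owners, and checks are illustrative assumptions.
ROUTING_RULES = {
    "retrieval_miss": {
        "owner": "search/index team",
        "first_check": "recall of expected passages",
    },
    "context_distortion": {
        "owner": "pipeline team",
        "first_check": "truncation and ordering of assembled context",
    },
    "answer_overclaim": {
        "owner": "prompt/eval team",
        "first_check": "grounding versus stated confidence",
    },
    "handoff_drift": {
        "owner": "orchestration team",
        "first_check": "state passed between agent steps",
    },
}

def route(category: str) -> dict:
    """Return the owner and first diagnostic check for a tagged failure.

    Unknown categories fall back to the triage lead, enforcing the habit
    of classifying before fixing.
    """
    return ROUTING_RULES.get(
        category,
        {"owner": "triage lead", "first_check": "classify before fixing"},
    )
```

Even a table this small gives a team a shared answer to "who looks at this first," which is the point of the triage workshop format.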
---
### 3. Design partner pilot
Best for teams exploring deeper protocol, tooling, or evaluation integration.
Typical goal:
test whether WFGY can serve as part of a reusable debugging, evaluation, or reasoning layer inside a broader product or research workflow.
Typical inputs:
* a clear use case
* target surface for integration or evaluation
* baseline workflow or benchmark
* practical constraints and success criteria
Typical outputs:
* pilot framing document
* integration hypotheses
* structured observations from the trial
* recommendation on whether deeper work is justified
---
## What a team usually needs to provide
A good pilot depends on concrete material.
The team does not need to provide everything at once, but a serious pilot usually needs:
* one clear system or use case
* several representative failures or stress cases
* enough context to understand where the system boundaries are
* the current debugging or evaluation workflow, even if it is messy
* one contact point who can answer follow-up questions
If the pilot is about a production system, confidentiality and scope should be discussed early.
---
## What WFGY usually provides
A WFGY pilot usually provides structure, not magic.
That structure may include:
* a clearer failure map
* a smaller set of meaningful categories
* sharper distinctions between surface symptoms and deeper causes
* a more reproducible debugging route
* a shared interpretive layer that makes future failures easier to discuss
Where relevant, WFGY may also provide draft artifacts such as:
* a case classification sheet
* a triage summary
* a debug routing proposal
* an evaluation framing note
* a recommended next-step sequence
For an example of the shape of outputs, see [SAMPLE_DELIVERABLE.md](./SAMPLE_DELIVERABLE.md).
---
## What this does not claim
A WFGY pilot does not automatically mean:
* full production integration
* guaranteed model quality improvement
* enterprise-grade support or SLA
* replacement of platform engineering, ML engineering, or security review
* one-step diagnosis of every failure in a complex system
WFGY is most useful when it helps a team see the failure structure more clearly.
That often improves decision quality, but it should not be described as a universal fix.
---
## Good fit and bad fit
### Good fit
A pilot is usually a good fit when:
* the team has real failure cases
* the problem is costly enough to matter
* the team wants sharper structure, not vague brainstorming
* the team is open to disciplined boundary-setting
* the team can provide enough evidence to reason from
### Bad fit
A pilot is usually a poor fit when:
* there is no concrete system yet
* the team only wants generic prompting advice
* the team wants guaranteed outcomes before sharing any evidence
* the problem is purely a legal, security, compliance, or infra ownership issue
* the team expects WFGY to replace core implementation work
---
## Suggested pilot flow
A small pilot can often be framed in four stages:
1. Scope
define the system, the problem surface, and the pilot question
2. Evidence intake
review examples, traces, and known symptoms
3. Structured analysis
map failures, isolate likely layers, and identify the most useful distinctions
4. Return package
provide a compact summary of findings, boundaries, and recommended next moves
This is intentionally small.
The purpose of a pilot is not to pretend the whole system is solved.
The purpose is to learn whether WFGY creates real clarity and practical leverage.
---
## Best current reading of WFGY pilot value
Today, the safest and strongest claim is this:
WFGY is most legible as a structured reasoning and debugging layer for AI systems, especially where teams need better failure classification, cleaner triage, and more reproducible diagnosis.
That is the right starting point for a pilot.
Broader claims should only be made if later evidence supports them.
---
## Next step
If your team is exploring a pilot, start here:
* [WORK_WITH_WFGY.md](./WORK_WITH_WFGY.md) for the broader collaboration entry
* [CASE_EVIDENCE.md](./CASE_EVIDENCE.md) for how public cases should be read
* [ADOPTERS.md](./ADOPTERS.md) for the shortest public proof summary
If needed, this page can later evolve into a more formal outward-facing pilot brief.
For now, its role is simpler:
to make the pilot path legible without overselling it.

SAMPLE_DELIVERABLE.md

@@ -0,0 +1,427 @@
# SAMPLE_DELIVERABLE
Sample structure of a compact WFGY pilot return package.
This page shows what a small WFGY deliverable may look like after a pilot, audit, or structured review.
It is not a promise that every engagement will produce the same sections in the same length.
It is a practical sample that makes the expected output shape easier to understand before collaboration begins.
For the pilot entry itself, see [PILOT_OFFER_ONE_PAGER.md](./PILOT_OFFER_ONE_PAGER.md).
For the broader collaboration entry, see [WORK_WITH_WFGY.md](./WORK_WITH_WFGY.md).
For historical context and public proof, see [EVIDENCE_TIMELINE.md](./EVIDENCE_TIMELINE.md).
---
## What this page is
This page is a sample return package.
Its purpose is to answer a simple but important question:
**what would a team actually receive after a small WFGY pilot**
The answer is not “a vague summary.”
The intended shape is a bounded, structured, decision-useful package that helps a team move from scattered symptoms toward clearer categories, clearer boundaries, and more disciplined next steps.
This page is not:
* a fixed legal scope
* a formal statement of work
* a guarantee of outcome
* a claim that every engagement will uncover the same level of clarity
It is a model of what “useful structure” can look like.
---
## Best way to read this sample
Read this page as a sample in three senses at once:
### 1. Structure sample
It shows the sections a compact WFGY return package is likely to include.
### 2. Decision sample
It shows the kind of judgments WFGY tries to make legible:
* what the likely failure layers are
* what is high confidence versus low confidence
* what next moves are worth doing first
### 3. Boundary sample
It shows that a good deliverable should not only say what seems likely.
It should also say what remains uncertain and what is out of scope.
---
# Sample WFGY Return Package
## 1. Engagement snapshot
**Project type**
RAG or agent workflow review
**Pilot type**
Failure audit pilot
**Review scope**
Small scoped review based on a limited set of representative failures, system notes, and current debugging assumptions
**Primary question**
Why does the system continue to produce wrong, unstable, or weakly grounded outputs even when the infrastructure appears mostly healthy
**Deliverable goal**
Convert scattered symptoms into a smaller set of structured categories, identify the most likely failure layers, and recommend a practical next-step sequence
**Overall reading**
The system does not appear to be facing one isolated issue.
The evidence suggests a layered debugging problem with multiple interacting surfaces.
---
## 2. Inputs reviewed
The following materials were reviewed in this sample scenario:
* representative failing examples
* selected logs, traces, screenshots, or prompt chains
* a short description of the current architecture
* the team's current explanations or debugging hypotheses
* key constraints on ownership, tooling, and deployment
### Boundary note
The pilot does not assume full access to every production component.
The goal is to review enough material to reach a disciplined structural reading, not to claim omniscience over the whole system.
---
## 3. System snapshot
The reviewed system is a retrieval-backed generation workflow with a multi-step prompt construction path, a document retrieval layer, a ranking layer, and a final answer-generation stage.
The team reports the following recurring pattern:
* some answers are fluent but incorrect
* similar questions may produce different retrieved evidence and different final outputs
* some failures appear before final generation
* debugging discussions often collapse multiple error types into one generic label
### Why this section exists
This section is intentionally short.
Its purpose is to establish a readable shared context before moving into classification and judgment.
---
## 4. Observed failure surface
Before any deeper interpretation, the visible failure surface in this sample case looks like this:
1. The system often produces answers that sound stable but are not reliably grounded in the retrieved material.
2. Similar inputs do not consistently lead to similar retrieval and answer behavior.
3. Evidence suggests that some failures emerge before final answer generation, especially in selection or context preparation.
4. The current debugging loop appears to focus heavily on the model output itself, while upstream layers may be contributing materially to the final result.
### Observational status
This section stays close to visible behavior.
It does not yet claim root cause.
---
## 5. Structured failure classification
This is one of the core sections in a WFGY-style return package.
### Primary category cluster
#### A. Retrieval-selection instability
Relevant material is not being surfaced consistently enough across similar requests.
**Confidence**
High
**Why this appears likely**
Repeated variation in retrieved context suggests the issue begins upstream of final generation in at least part of the case set.
#### B. Context assembly distortion
Useful material may exist, but the way it is combined, compressed, ordered, or weighted may reduce its practical usefulness before generation.
**Confidence**
Medium to high
**Why this appears likely**
Some failures show a gap between the presence of relevant source material and the quality of the final answer.
#### C. Final-answer overconfidence
The answer layer sometimes presents weakly supported outputs with stronger confidence than the evidence can justify.
**Confidence**
Medium
**Why this appears likely**
Observed outputs appear rhetorically stable even when grounding is partial or inconsistent.
### Secondary category cluster
#### D. Evaluation blind spots
The current review loop may detect bad outcomes, but does not yet separate retrieval, orchestration, and answer-layer failures reliably enough.
**Confidence**
High
#### E. Triage vocabulary weakness
Multiple distinct failure patterns may be grouped under one generic description, making debugging slower and less reproducible.
**Confidence**
High
### Why this section matters
This section converts raw symptoms into smaller, reusable buckets.
That matters because teams often lose time not only from technical issues, but from category confusion.
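A classification like the one above can also be kept machine-readable, so that confidence levels and clusters survive handoffs instead of living only in prose. The sketch below records the five sample categories (A through E) as data; the field names and enum are assumptions about how such a sheet might be structured, not a WFGY-defined schema. Category B's "medium to high" is recorded conservatively as medium.

```python
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class FailureCategory:
    code: str          # short label, e.g. "A"
    name: str
    cluster: str       # "primary" or "secondary"
    confidence: Confidence
    rationale: str

# The five sample categories from the sections above.
SHEET = [
    FailureCategory("A", "retrieval-selection instability", "primary",
                    Confidence.HIGH,
                    "retrieved context varies across similar requests"),
    FailureCategory("B", "context assembly distortion", "primary",
                    Confidence.MEDIUM,  # text says medium to high
                    "relevant sources present but final answers weak"),
    FailureCategory("C", "final-answer overconfidence", "primary",
                    Confidence.MEDIUM,
                    "stable tone despite partial grounding"),
    FailureCategory("D", "evaluation blind spots", "secondary",
                    Confidence.HIGH,
                    "review loop does not separate layers"),
    FailureCategory("E", "triage vocabulary weakness", "secondary",
                    Confidence.HIGH,
                    "distinct patterns share one generic label"),
]

# Categories safe to act on first.
high_confidence = [c.code for c in SHEET if c.confidence is Confidence.HIGH]
```

Filtering by confidence is what makes the sheet decision-useful: the high-confidence codes are the ones worth routing before the uncertain ones.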
---
## 6. Likely root-cause layers
This section moves from classification toward deeper reading.
### Highest-probability layers
#### 1. Retrieval and selection layer
**Priority**
Highest
**Confidence**
High
**Reading**
At least part of the observed failure surface likely begins before the model writes the final answer.
#### 2. Context construction layer
**Priority**
High
**Confidence**
Medium to high
**Reading**
The prompt may be receiving technically relevant material in a structurally degraded form.
#### 3. Review and evaluation layer
**Priority**
High
**Confidence**
High
**Reading**
The current internal debugging loop may not yet distinguish failure signatures by layer clearly enough to support fast iteration.
### Lower-confidence but relevant layers
#### 4. Memory or carryover behavior
**Priority**
Medium
**Confidence**
Low to medium
#### 5. Tool or handoff instability
**Priority**
Medium
**Confidence**
Low to medium
#### 6. Prompt framing side effects
**Priority**
Medium
**Confidence**
Low
### Interpretation rule
A strong deliverable should separate:
* what appears likely
* what remains possible
* what is still too weak to assert
That distinction is part of the value.
---
## 7. Working diagnosis
### Core reading
The current pattern does not look like a pure model-quality problem.
The stronger reading is that this is a layered systems problem in which retrieval quality, context assembly, and evaluation framing interact to produce unstable or weakly grounded final answers.
### Why this matters
If the team continues to read the issue only as “the model hallucinated,” it may keep applying fixes at the wrong layer.
The evidence in this sample case suggests that the more useful route is to separate the failure surface into upstream selection, context construction, and final-answer expression.
### Boundary
This is a working diagnosis, not a claim of full proof.
---
## 8. Recommended next moves
This section should be concrete, limited, and sequenced.
### Priority 1
Separate retrieval failure from generation failure using a smaller reviewed case set.
**Goal**
Stop treating all bad answers as one category.
**Why first**
This creates the cleanest structural gain for the least cost.
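When each reviewed case has a known gold evidence snippet, the Priority 1 split can be approximated mechanically: if the evidence never reached the retrieved context, the failure is upstream of generation. This is a coarse sketch under that assumption; the substring check and labels are illustrative, and real pipelines would use fuzzier matching.

```python
def classify_case(retrieved_chunks: list[str], gold_evidence: str,
                  answer_correct: bool) -> str:
    """Coarse split: did the failure happen before or after generation?

    Assumes each reviewed case has a known gold evidence snippet.
    A plain substring check stands in for real passage matching.
    """
    evidence_retrieved = any(gold_evidence in chunk for chunk in retrieved_chunks)
    if answer_correct:
        return "pass"
    if not evidence_retrieved:
        return "retrieval_failure"   # evidence never reached the prompt
    return "generation_failure"      # evidence present, answer still wrong
```

Running this over a small reviewed case set is enough to stop treating all bad answers as one category, which is the stated goal of this priority.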
### Priority 2
Inspect context assembly rules for compression, ranking, truncation, and ordering artifacts.
**Goal**
Check whether useful material is being technically retrieved but practically neutralized before generation.
**Why second**
This is one of the most likely places where “good inputs turn into weak answer conditions.”
### Priority 3
Add a lightweight layer tag to internal review.
**Goal**
Mark each failure as most likely retrieval, assembly, answer, tool, memory, or evaluation related before discussing fixes.
**Why third**
A small tagging habit often improves debugging clarity more than another round of vague brainstorming.
### Priority 4
Standardize a short internal vocabulary for repeated failure classes.
**Goal**
Reduce repeated ambiguity in triage conversations.
**Why fourth**
This makes future failures cheaper to discuss and faster to route.
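Priorities 3 and 4 together amount to a tagging habit plus a fixed vocabulary. A minimal sketch of that habit: tally layer tags across reviewed failures to see where debugging effort should go first. The six layer names come from Priority 3 above; rejecting unknown tags is one possible way to enforce the standardized vocabulary of Priority 4.

```python
from collections import Counter

# Layer vocabulary from Priority 3 above.
LAYERS = {"retrieval", "assembly", "answer", "tool", "memory", "evaluation"}

def tally_tags(tagged_failures: list[str]) -> list[tuple[str, int]]:
    """Count layer tags across reviewed failures, most common first.

    Rejects tags outside the agreed vocabulary so that triage
    conversations stay on the shared set of labels.
    """
    unknown = [t for t in tagged_failures if t not in LAYERS]
    if unknown:
        raise ValueError(f"untagged or misspelled layers: {unknown}")
    return Counter(tagged_failures).most_common()
```

The output of a tally like this is a one-line answer to "which layer is eating our time," which is cheaper to maintain than another round of open-ended debugging discussion.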
---
## 9. What remains uncertain
A good deliverable should say clearly what it does not yet know.
In this sample scenario, the reviewed material is sufficient for a structured preliminary reading, but not sufficient for strong claims about all long-run production behavior.
### Still uncertain
1. Whether ranking logic or chunking logic is the dominant upstream driver
2. Whether carryover or memory effects are meaningful or only incidental
3. Whether some observed failures are benchmark-specific rather than architecture-level
4. Whether the same pattern holds consistently across all major workload classes
### Why this section matters
Without an uncertainty section, teams often over-read a pilot and treat it as a full-system verdict.
That would be a mistake.
---
## 10. Boundaries and non-claims
A compact WFGY return package should clearly state what it does **not** establish.
This sample does not claim:
* that every major failure has been found
* that all root causes are proven
* that the system is near production readiness
* that architecture changes are unnecessary
* that a small pilot replaces engineering, security, or infrastructure work
* that every future failure will fit the same categories
The purpose of the package is narrower and more practical:
to improve structural clarity, reduce debugging ambiguity, and make the next round of decisions more disciplined.
---
## 11. Possible follow-on outputs
Depending on scope, a future engagement may extend into outputs such as:
* a cleaner internal failure taxonomy
* a triage worksheet for recurring incidents
* a review rubric for future runs
* a routing guide for common failure types
* a summary note for decision-makers
* a deeper design-partner or integration proposal
These are possible extensions.
They are not automatic promises.
For the pilot framing that may lead into these, see [PILOT_OFFER_ONE_PAGER.md](./PILOT_OFFER_ONE_PAGER.md).
---
## 12. Why this sample matters
Many teams do not need more generic advice.
They need a better way to move from messy evidence to smaller, more meaningful decisions.
That is the role of a WFGY deliverable at its best.
It helps a team move from:
**something is wrong**
toward:
**these are the likely failure layers, these are the boundaries, and these are the next moves worth trying**
That is a much better place to be.
---
## Related pages
* [PILOT_OFFER_ONE_PAGER.md](./PILOT_OFFER_ONE_PAGER.md)
* [WORK_WITH_WFGY.md](./WORK_WITH_WFGY.md)
* [CASE_EVIDENCE.md](./CASE_EVIDENCE.md)
* [ADOPTERS.md](./ADOPTERS.md)
* [EVIDENCE_TIMELINE.md](./EVIDENCE_TIMELINE.md)
---
Maintained as a sample structure, not a fixed contract.