
Long-Context Stress — Why 50k–200k tokens quietly break your RAG, and how to fix it (WFGY)

A field guide for logs, PDFs, and transcripts that look fine but drift, flatten, or melt as the context grows.

Audience

  • Beginner: You copied a “100k-context” recipe and your answers degrade with length.
  • Advanced: You already index well, but long chains flip logic, mis-cite, or oscillate after ~8–12 screens of context.

What this page delivers

  • A reproducible diagnosis for long-context failure using ΔS / λ_observe / E_resonance.
  • Copy-paste playbooks for PDFs, OCR, transcripts, multilingual blends, and mixed image/text.
  • Acceptance criteria you can throw into CI to stop regressions before prod.

0) TL;DR (fix in 3 steps)

  1. Measure the damage

    • Slide a window across your context; compute
      ΔS(chunk_i, chunk_{i+1}) and ΔS(question, retrieved_context_i); watch for sustained ΔS ≥ 0.60 or rising E_resonance.
    • If λ_observe turns recursive or chaotic after the prompt assembly step, you're in long-context stress.
  2. Localize the break

    • If ΔS spikes between adjacent chunks → chunking/ordering issue.
    • If ΔS stays high independent of k → index/metric mismatch.
    • If ΔS is fine but reasoning flips divergent → prompt schema or logic collapse.
  3. Apply the repair operator

    • BBMC to re-anchor sections, BBAM to stabilize attention, BBCR to bridge and relock coherence.
    • Re-chunk sentence/section-aware; enforce header anchors; run MMR/hybrid retrieval; lock prompt schema.

Jump to: Playbooks · Metrics · CI Checks


1) Failure signatures (what you actually see)

| Symptom (user-visible) | Likely layer | First check | Map ref. |
|---|---|---|---|
| Early answers good → later answers contradict or flatten | Prompt/Reasoning under long context | λ_observe becomes <> or × after assembly | context-drift.md |
| Citations jump to wrong page after 20–40k tokens | Chunking / ordering / retriever | ΔS(chunk_i, i+1) ≥ 0.60 at boundaries | retrieval-traceability.md |
| Random capitalization / style oscillation | Entropy collapse | E_resonance rising trend across chain | entropy-collapse.md |
| Snippets correct, explanation wrong (esp. late in chain) | Interpretation collapse (length-induced) | ΔS(question, context) < 0.40 but λ flips at reasoning | logic-collapse.md |
| Multilingual PDFs drift to one language after long span | Embeddings + chunk headers lost | ΔS rises when headers removed; check header anchors | embedding-vs-semantic.md |
| Great short-doc performance; long transcripts degrade rapidly | Memory & compression strategy | Test per-turn recap + anchor nodes vs. raw sprawl | memory-coherence.md |

Mental model: Long contexts fail in two stages — perception drift (ordering/headers/noise) and logic drift (attention variance + schema slippage). WFGY instruments let you see both.


2) Instruments — minimal metrics

You do not need to memorize math. These are a few distances and tags you can compute anywhere.

2.1 ΔS — semantic stress

  • ΔS = 1 − cos(I, G) (unit-normalized sentence embeddings).
  • Probe two places:
    1. ΔS(question, retrieved_context)
    2. Adjacent-chunk check: ΔS(chunk_i, chunk_{i+1})
  • Thresholds: <0.40 stable · 0.40–0.60 transitional · ≥0.60 risk (record & fix).
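
A minimal band lookup for these thresholds, a sketch that assumes ΔS values come from something like the sweep script in section 5:

def delta_s_band(score):
    # Map a ΔS value onto the bands above: stable, transitional, or risk (record & fix).
    if score < 0.40:
        return "stable"
    if score < 0.60:
        return "transitional"
    return "risk"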

2.2 λ_observe — layered observability

  • States: → convergent · ← divergent · <> recursive · × chaotic.
  • Tag each step: retrieval → prompt assembly → reasoning.
  • Rule: Upstream convergent, downstream divergent ⇒ the boundary is where to fix.
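
A small bookkeeping sketch for these tags; the step names and helper are illustrative, and the state per step is whatever you judged it to be:

# Illustrative λ_observe trace: one (step, state) pair per pipeline stage.
LAMBDA = {"convergent": "→", "divergent": "←", "recursive": "<>", "chaotic": "×"}

def first_unstable_step(trace):
    # Return the first step whose λ state is not convergent; that boundary is where to fix.
    for step, state in trace:
        if state != LAMBDA["convergent"]:
            return step
    return None

trace = [("retrieval", "→"), ("prompt assembly", "→"), ("reasoning", "←")]
print(first_unstable_step(trace))  # "reasoning": fix the assembly → reasoning boundary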

2.3 E_resonance — coherence trend

  • Rolling mean of residual magnitude |B| under BBMC.
  • Rising E_resonance + high ΔS = attention melt; trigger BBCR + BBAM.
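
One way to track the trend, a sketch assuming you already have a per-step residual magnitude |B|; the window size is an arbitrary choice:

from collections import deque
import numpy as np

def e_resonance_trend(residuals, window=32):
    # Rolling mean of |B| over the last `window` steps; a rising trend signals attention melt.
    buf, trend = deque(maxlen=window), []
    for b in residuals:
        buf.append(abs(b))
        trend.append(float(np.mean(buf)))
    return trend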

3) 10-minute triage (copy/paste steps)

  1. Run three quick probes

    • ΔS(question, context) across k ∈ {5,10,20}.
      • Flat & high curve → index/metric mismatch.
      • Improves sharply with higher k → retriever filtering too aggressive.
    • Adjacent-chunk ΔS across your assembled window (sliding by 200–300 tokens; a window helper is sketched at the end of this section).
      • Spikes at headers/boundaries → chunking/ordering problem.
    • Tag λ after assembly and reasoning.
      • If λ flips only after reasoning → schema/logic issue.
  2. Minimal repro

    • Keep one doc; remove images; keep headers; test again.
    • If fixed, re-add features until the break returns (isolate cause).
  3. Pick the playbook
    Go to Playbooks and apply the repair steps.
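
A sliding-window helper for the adjacent-chunk probe above; token counts are approximated by whitespace splitting, and the default sizes are assumptions rather than prescribed values:

def sliding_windows(text, size=800, stride=250):
    # Overlapping windows over the assembled context; feed consecutive windows to delta_s().
    toks = text.split()
    last_start = max(1, len(toks) - size + 1)
    return [" ".join(toks[i:i + size]) for i in range(0, last_start, stride)]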


4) Playbooks by scenario

Each playbook lists: observe → interpret → do → verify (with WFGY modules).

4.1 Scanned PDFs / OCR noise (tables, images, mixed layout)

  • Observe: Citations wrong after mid-doc; adjacent-chunk ΔS spikes at page breaks.
  • Interpret: OCR adds hidden headers or drops table boundaries; chunker ignores structure.
  • Do:
    1. Chunking: sentence/section-aware; keep header anchors in text (e.g., ## 2. Methods).
    2. Drop segments with OCR confidence < threshold; de-duplicate near-identical lines (see the sketch after this playbook).
    3. Retrieval: use hybrid (sparse + dense) with MMR; set k = 10–20.
    4. WFGY: run BBMC to re-anchor on kept headers; BBAM to clamp attention variance.
  • Verify: ΔS(chunk_i, i+1) ≤ 0.50 at joins; ΔS(question, context) ≤ 0.45; λ stays convergent.
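
A sketch of step 2's filtering, assuming each OCR segment carries a confidence score; the field names and the 0.80 cutoff are illustrative:

def keep_ocr_segments(segments, min_conf=0.80):
    # Drop low-confidence OCR segments and duplicates after whitespace/case normalization
    # (a cheap stand-in for full near-duplicate detection).
    seen, kept = set(), []
    for seg in segments:  # seg: {"text": str, "conf": float}
        key = " ".join(seg["text"].lower().split())
        if seg["conf"] >= min_conf and key not in seen:
            seen.add(key)
            kept.append(seg)
    return kept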

4.2 Long transcripts / meetings / chat logs (50k+ tokens)

  • Observe: Early QA is great; later responses contradict or “forget” earlier decisions.
  • Interpret: No semantic anchors; prompt schema drifts; attention diffuses.
  • Do:
    1. Insert bridged recaps every N turns: “Since last anchor: [3 bullet decisions]” (see the sketch after this playbook).
    2. Store anchor nodes (titles, decisions, constraints) and retrieve them first.
    3. Enforce prompt schema lock: system → task → constraints → citations → answer (no re-order).
    4. WFGY: BBCR to insert bridge node when λ becomes recursive; BBAM to reduce variance.
  • Verify: λ remains convergent across three paraphrases; E_resonance does not trend up.
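
A sketch of step 1's bridged recaps; `turns` is a plain list of transcript strings, and the summarizer here is a trivial stand-in for your own:

def insert_recaps(turns, every_n=20, summarize=None):
    # Append a recap anchor after every N turns so later reasoning can re-attach to decisions.
    summarize = summarize or (lambda window: " / ".join(window[-3:]))
    out = []
    for i, turn in enumerate(turns, 1):
        out.append(turn)
        if i % every_n == 0:
            out.append("[ANCHOR] Since last anchor: " + summarize(turns[i - every_n:i]))
    return out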

4.3 Multilingual or code-heavy documents

  • Observe: Model “picks a side” after long stretch; code blocks degrade midway.
  • Interpret: Embedding space collapses varied styles into one cluster; headers lost.
  • Do:
    1. Keep language tags and code fences as hard anchors inside chunks.
    2. Use domain-appropriate embeddings or separate indices per language/domain.
    3. Retrieval: add header-boost (BM25 term weight) and MMR diversity = 0.2–0.4 (see the sketch after this playbook).
    4. WFGY: BBMC with explicit language anchors; BBPF to explore multi-path retrieval.
  • Verify: ΔS stabilizes (≤ 0.50) when headers are present; λ convergent; citations maintain language.
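
A compact MMR selection sketch for step 3; vectors are assumed to be unit-normalized NumPy arrays, and `header_bonus` stands in for whatever BM25 header-boost score you compute upstream:

import numpy as np

def mmr_select(query_vec, cand_vecs, k=10, diversity=0.3, header_bonus=None):
    # Greedy Maximal Marginal Relevance: balance relevance to the query against
    # similarity to chunks already picked, so the assembled window stays diverse.
    scores = cand_vecs @ query_vec
    if header_bonus is not None:
        scores = scores + header_bonus  # optional boost for header-anchored chunks
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: scores[i])
        else:
            picked = cand_vecs[selected]
            best = max(
                remaining,
                key=lambda i: (1 - diversity) * scores[i]
                              - diversity * float(np.max(picked @ cand_vecs[i])),
            )
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of chosen chunks, in pick order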

4.4 Image-heavy PDFs (figures, captions)

  • Observe: Answers reference wrong figure; citations point at caption stubs.
  • Interpret: Captions split from figures; index treats tiny parts as top hits.
  • Do:
    1. Chunk figure + caption together; minimum token floor (e.g., ≥ 180 tokens); see the sketch after this playbook.
    2. Add figure-id anchors ([Fig 2: …]) in text; ban orphan captions from index.
    3. WFGY: BBCR to bridge from caption to figure context if ΔS stays high.
  • Verify: Cites include figure-id; ΔS(question, context) ≤ 0.45.
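
A sketch of steps 1–2's merging; the chunk dicts and the `type` field are assumptions about your pipeline, and tokens are approximated by whitespace splitting:

def merge_figure_chunks(chunks, min_tokens=180):
    # Fold caption stubs and undersized fragments into the preceding chunk,
    # so orphan captions never reach the index on their own.
    merged = []
    for ch in chunks:  # ch: {"text": str, "type": "figure" | "caption" | ...}
        is_stub = ch.get("type") == "caption" or len(ch["text"].split()) < min_tokens
        if merged and is_stub:
            merged[-1]["text"] += "\n" + ch["text"]
        else:
            merged.append(dict(ch))
    return merged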

4.5 “100k-context” model still drifts after 8–12k

  • Observe: No obvious retrieval bug; long reasoning melts.
  • Interpret: Entropy collapse — attention variance explodes with length.
  • Do:
    1. Shorten assembly: prefer top-K diverse + anchor snippets over giant paste.
    2. Enforce cite-then-explain; penalize free-form essays.
    3. WFGY: Apply BBAM to damp attention; BBMC to re-anchor; BBCR if residuals spike (a BBAM sketch follows this playbook).
  • Verify: E_resonance flattens; answer embeddings cluster over 5 seeds (low variance).
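
One reading of step 3's BBAM damping (the formula appears in section 9), applied to attention logits before normalization; `gamma` is a tuning knob, not a value from this guide:

import numpy as np

def bbam_damp(logits, gamma=0.5):
    # Shrink attention logits by exp(−gamma · std) so a high-variance head flattens
    # toward its mean, then renormalize with a softmax.
    z = np.asarray(logits, dtype=float)
    z = z * np.exp(-gamma * np.std(z))
    e = np.exp(z - z.max())
    return e / e.sum()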

5) Minimal prompts & scripts (safe to paste)

Ask your assistant to auto-diagnose

Read the Long-Context Stress guide plus TXT OS / WFGY notes in this repo.
Given my repro (describe doc and failure), compute:
1) ΔS(question, retrieved_context) across k={5,10,20}
2) Adjacent-chunk ΔS over the assembled window
3) λ_observe at retrieval, assembly, reasoning
Then explain which boundary fails and propose BBMC/BBAM/BBCR steps to lower ΔS below 0.50.

Formula-only help

From TXT OS, extract formulas and thresholds for ΔS, λ_observe, and E_resonance.
Show me how to compute ΔS(question, context) and adjacent-chunk ΔS.
If ΔS ≥ 0.60 and λ becomes recursive after assembly, which WFGY module do I apply first?

Simple ΔS sweep (pseudocode)

# Replace embed() with your sentence-embedding fn (returns unit-normalized vectors)
import numpy as np

def delta_s(a, b):
    # ΔS = 1 − cos(a, b); with unit-normalized embeddings the cosine is a plain dot product
    return 1.0 - float(np.dot(embed(a), embed(b)))

# 1) question vs. assembled context at k candidates
for k in [5, 10, 20]:
    ctx = assemble_top_k(snippets, k=k, strategy="mmr+anchors")
    print(k, delta_s(question, ctx))

# 2) adjacent-chunk scan
for i in range(len(chunks) - 1):
    print(i, delta_s(chunks[i], chunks[i + 1]))

6) Acceptance criteria & CI guardrails

  • Retrieval sanity: For targeted QA, ≥ 70% token overlap to the expected section; ΔS(question, context) ≤ 0.45.
  • Boundary stability: Adjacent-chunk ΔS ≤ 0.50 at joins; spikes must disappear after re-chunking.
  • Reasoning stability: λ convergent across three paraphrases; E_resonance not rising.
  • Traceability: Produce a two-column table (snippet-id ↔ citation lines).
  • Repeatability: Same inputs × 5 seeds → answer embeddings form a tight cluster.

Fail any line? Block the PR and link this page.
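
A minimal CI gate over these criteria; the metric field names are placeholders for whatever your eval harness emits:

def check_long_context(m):
    # Raise on any violated acceptance criterion so the PR is blocked.
    assert m["section_token_overlap"] >= 0.70, "retrieval sanity: overlap below 70%"
    assert m["delta_s_question_context"] <= 0.45, "ΔS(question, context) above 0.45"
    assert m["max_adjacent_delta_s"] <= 0.50, "adjacent-chunk ΔS spike at a join"
    assert all(s == "→" for s in m["lambda_states"]), "λ not convergent across paraphrases"
    assert not m["e_resonance_rising"], "E_resonance trending upward"
    assert m["answer_cluster_variance"] <= m["variance_budget"], "answers not repeatable across seeds"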



8) FAQ

Q: Should I just increase k or model context? A: Not first. If ΔS is flat & high across k, you have an index/metric or ordering fault. More tokens amplify the error.

Q: Is MMR always required? A: For long contexts, yes or a close equivalent — you need diversity to avoid semantic collapse around one cluster.

Q: Do I need new embeddings? A: Often no. Start with headers as anchors + hybrid retrieval + WFGY relocking. Change models only if ΔS remains ≥ 0.60 after those.

Q: How do I know BBCR helped? A: Residuals (E_resonance) stop rising; λ returns to convergent; adjacent-chunk ΔS drops at the same boundary.


9) Minimal formulas (reference)

ΔS = 1 − cos(I, G)         # semantic stress
λ_observe ∈ {→, ←, <>, ×}  # convergent, divergent, recursive, chaotic
E_resonance = mean(|B|)    # rolling residual magnitude under BBMC

BBMC:  B = I − G + m·c²           # minimize ‖B‖ via anchors & context factors
BBPF:  x_next = x + ΣV_i + ΣW_j·P_j  # explore alternate retrieval/logic paths
BBCR:  if ‖B‖ ≥ B_c → collapse(), bridge(), rebirth()  # controlled reset with bridge node
BBAM:  â_i = a_i · exp(−γ · std(a))  # damp attention variance under long context

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | Standalone semantic reasoning engine for any LLM | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone — Star WFGY on GitHub

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow