Update and Index Skew: Guardrails and Fix Patterns

A repair guide for pipelines where fresh content does not show up, shards disagree after a redeploy, or recall drops right after a routine job. Use this page to localize drift between ingestion, embedding, and index structures, then lock ordering and verify with ΔS, coverage, and λ.

When to use this page

  • New docs appear in the object store but not in retrieval
  • Some tenants or shards recall fine while others look stale
  • After a redeploy, recall falls or top-k order flips
  • The index reports healthy yet coverage of anchors is low
  • ANN rebuild completes but neighbor order looks random

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of the target section ≥ 0.70
  • λ remains convergent across three paraphrases and two seeds
  • E_resonance stays flat on long windows
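
The first two targets are plain thresholds, so they can be checked mechanically for every paraphrase and seed. A minimal sketch, assuming ΔS and coverage are already measured by your own tooling; the `Run` record and the helper names are hypothetical, and λ convergence and E_resonance are not modeled here.

```python
from dataclasses import dataclass

DELTA_S_MAX = 0.45   # ΔS target from the list above
COVERAGE_MIN = 0.70  # coverage target from the list above


@dataclass(frozen=True)
class Run:
    paraphrase: int   # index of the paraphrase (0, 1, 2)
    seed: int         # index of the seed (0, 1)
    delta_s: float    # measured ΔS(question, retrieved)
    coverage: float   # measured coverage of the target section


def failing_runs(runs):
    """Return the runs that miss either threshold."""
    return [r for r in runs
            if r.delta_s > DELTA_S_MAX or r.coverage < COVERAGE_MIN]


def passes_acceptance(runs, paraphrases=3, seeds=2):
    """True only if every paraphrase x seed pair was measured and passed."""
    measured = {(r.paraphrase, r.seed) for r in runs}
    expected = {(p, s) for p in range(paraphrases) for s in range(seeds)}
    return expected <= measured and not failing_runs(runs)
```

Requiring the full grid of measurements keeps a partial run from being reported as a pass.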

Fix in 60 seconds

  1. Read the watermarks. For each stage, write a simple count and the last processed id or time. Compare DOC_COUNT, EMB_COUNT, and IDX_COUNT. Any gap beyond normal pipeline lag indicates skew.

  2. Pin versions and abort on mismatch. Ingest should refuse rows when any of these fields differ from the contract or the store metadata: embed_model, embed_rev, dim, metric, normalize_l2, analyzer_rev, ann_rev, index_hash. See data-contracts.md.

  3. Rebuild the broken segment. Re-embed and re-index the affected shard or time window. Retrain ANN and PQ on the new vectors. Do not reuse old graphs.

  4. Clamp λ on the prompt side. Use citation-first prompting and a fixed header order to avoid prompt variance while you repair the store. See retrieval-traceability.md.

  5. Verify. Run three paraphrases and two seeds. Require coverage ≥ 0.70 and ΔS ≤ 0.45 on the gold anchors.
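
Step 2 can be sketched as a preflight check that runs before any row is written. This is an illustration only: `preflight` and `ContractMismatch` are hypothetical names, and both the row metadata and the contract are assumed to be plain dicts carrying the fields listed in step 2.

```python
PINNED_FIELDS = (
    "embed_model", "embed_rev", "dim", "metric",
    "normalize_l2", "analyzer_rev", "ann_rev", "index_hash",
)


class ContractMismatch(Exception):
    """Raised when a row disagrees with the pinned contract."""


def preflight(row_meta: dict, contract: dict) -> None:
    """Abort ingest if any pinned field differs from the contract."""
    diffs = {
        field: (row_meta.get(field), contract.get(field))
        for field in PINNED_FIELDS
        if row_meta.get(field) != contract.get(field)
    }
    if diffs:
        # Values are reported as (row value, contract value) pairs.
        raise ContractMismatch(f"refusing row, fields differ: {diffs}")
```

Raising instead of logging is deliberate in this sketch: a mixed-version row that reaches the index is harder to find later than a failed ingest job.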


Root causes checklist

  • Non-idempotent upserts: writes are not keyed by (doc_id, section_id, rev)
  • Background jobs race with live writers
  • Mixed embed_model or normalize_l2 across namespaces
  • ANN params not retrained after rebuild
  • Analyzer or tokenizer version differs across shards
  • TTL or retention silently dropped sections
  • Partial deploy cut over while the index was still training
  • Streaming path uses a different preprocessor than batch
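
The first cause in the list is the easiest to illustrate. Below is a minimal in-memory sketch of an upsert that is idempotent on (doc_id, section_id, rev); `SectionStore` is a hypothetical stand-in for your real store, not an API of any vector database.

```python
class SectionStore:
    """Toy store keyed by (doc_id, section_id, rev)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, row: dict) -> bool:
        """Write the row; return False when the call was a no-op replay."""
        key = (row["doc_id"], row["section_id"], row["rev"])
        if self._rows.get(key) == row:
            return False  # same key, same payload: replay changes nothing
        self._rows[key] = dict(row)  # copy so later caller edits do not leak in
        return True

    def __len__(self):
        return len(self._rows)
```

Replaying a batch after a crash then leaves the row count unchanged, which keeps the stage watermarks comparable.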

Minimal probes

Probe A — watermark audit
For each stage {ingest, embed, index}:
  read COUNT and LAST_TS
Expect ingest ≥ embed ≥ index with small gaps. Any large gap is skew.
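
A minimal sketch of Probe A, assuming each stage can report a row count and the timestamp of the last row it processed. The `Watermark` record and the `max_gap` and `max_lag` defaults are illustrative assumptions; pick values that match your own SLO.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Watermark:
    count: int         # rows processed by the stage
    last_ts: datetime  # timestamp of the last processed row


def audit_watermarks(ingest, embed, index,
                     max_gap=100, max_lag=timedelta(minutes=15)):
    """Compare adjacent stages and return human-readable findings."""
    stages = [("ingest", ingest), ("embed", embed), ("index", index)]
    findings = []
    for (up_name, up), (down_name, down) in zip(stages, stages[1:]):
        gap = up.count - down.count
        lag = up.last_ts - down.last_ts
        if gap < 0:
            # Downstream should never hold more rows than upstream.
            findings.append(f"{down_name} holds {-gap} more rows than {up_name}")
        elif gap > max_gap:
            findings.append(f"{down_name} trails {up_name} by {gap} rows")
        if lag > max_lag:
            findings.append(f"{down_name} trails {up_name} by {lag}")
    return findings
```

An empty list means the stages are aligned within the chosen tolerances.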

Probe B — version parity
Sample 1k rows per shard and tabulate:
  embed_model, embed_rev, dim, metric, normalize_l2, analyzer_rev, ann_rev
Any heterogeneity inside one collection is a fail.
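
Probe B amounts to counting distinct version signatures in the sample. A minimal sketch, assuming the sampled rows arrive as dicts with the fields above; the function names are hypothetical.

```python
from collections import Counter

PARITY_FIELDS = (
    "embed_model", "embed_rev", "dim", "metric",
    "normalize_l2", "analyzer_rev", "ann_rev",
)


def version_signatures(rows):
    """Count how many sampled rows share each combination of versions."""
    return Counter(tuple(row.get(f) for f in PARITY_FIELDS) for row in rows)


def parity_ok(rows):
    """A collection passes only when one signature covers every row."""
    return len(version_signatures(rows)) <= 1
```

When the probe fails, the counter shows which minority signature leaked in and roughly how many rows carry it.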

Probe C — recall delta
Run the same 50 gold queries before and after shard rebuild.
Require coverage gain ≥ 0.10 if the shard was failing.
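
Probe C can be sketched as a comparison of mean coverage over the same gold queries. This assumes per-query coverage scores are already available as dicts keyed by query id; the helper names are hypothetical.

```python
from statistics import mean


def coverage_gain(before: dict, after: dict) -> float:
    """Mean coverage after the rebuild minus mean coverage before it."""
    if before.keys() != after.keys():
        raise ValueError("before and after must cover the same gold queries")
    return mean(after.values()) - mean(before.values())


def rebuild_helped(before: dict, after: dict, min_gain: float = 0.10) -> bool:
    """Apply the probe's threshold for a shard that was failing."""
    return coverage_gain(before, after) >= min_gain
```

The key check guards against comparing two different query sets, which would make the delta meaningless.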

Probe D — ANN sanity
Toggle reranker on and off at k=20.
If reranker recovers most anchors while base k misses, retrain ANN or rebuild.
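
One way to turn Probe D into a yes or no signal is sketched below. It assumes the reranker draws from a wider candidate pool than the base top k, that results are lists of ids per query, and that "most" means more than half; all three are assumptions, as is the function name.

```python
def ann_looks_degraded(anchors, base_topk, reranked_topk, recover_ratio=0.5):
    """Flag the ANN index when reranking rescues most anchors the base k missed.

    anchors:       query id -> gold anchor id
    base_topk:     query id -> ids returned without the reranker (k=20)
    reranked_topk: query id -> ids returned with the reranker (k=20)
    """
    missed = [q for q, anchor in anchors.items()
              if anchor not in base_topk.get(q, [])]
    if not missed:
        return False  # base retrieval already finds every anchor
    recovered = [q for q in missed if anchors[q] in reranked_topk.get(q, [])]
    return len(recovered) / len(missed) > recover_ratio
```

A true result suggests the vectors are fine and the neighbor graph is not, which points at retraining or rebuilding the ANN structure.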

Contract fields to add

{
  "doc_id": "stable",
  "section_id": "stable",
  "rev": "v2025-08-28",
  "ingest_ts": "2025-08-28T10:42:00Z",
  "embed_model": "exact-id",
  "embed_rev": "hash-or-date",
  "dim": 768,
  "metric": "cosine",
  "normalize_l2": true,
  "analyzer_rev": "text-preproc-v3",
  "ann_index": "hnsw",
  "ann_rev": "hnsw_v5",
  "index_hash": "sha256:...",
  "partition": "tenant_a|shard_03",
  "write_path": "batch|stream",
  "tombstone": false
}
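
A minimal sketch of a presence and type check for these fields, useful at the ingest boundary. The field list mirrors the example above; `contract_errors` is a hypothetical helper and does not validate value formats such as the timestamp or hash.

```python
REQUIRED = {
    "doc_id": str, "section_id": str, "rev": str, "ingest_ts": str,
    "embed_model": str, "embed_rev": str, "dim": int, "metric": str,
    "normalize_l2": bool, "analyzer_rev": str, "ann_index": str,
    "ann_rev": str, "index_hash": str, "partition": str,
    "write_path": str, "tombstone": bool,
}


def contract_errors(record: dict) -> list:
    """Return one message per missing or wrongly typed field."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in record:
            errors.append(f"missing {field}")
        elif type(record[field]) is not expected:
            # Exact type match, so True is not accepted where an int is expected.
            errors.append(f"{field} should be {expected.__name__}")
    return errors
```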

Operational guardrails

  • Single writer per partition and idempotent upsert
  • Preflight that halts when store.metric != contract.metric or dim mismatches
  • Blue-green or shadow collection for any rebuild, with a union retriever and deterministic rerank during cutover
  • Scheduled drift sweep that compares watermarks and ΔS across partitions
  • Alerts on ΔS ≥ 0.60 or λ flip rate spikes on live traffic
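
The union retriever used during cutover can be sketched as a merge of hits from the old and new collections with a deterministic tie-break. This assumes both collections return (id, score) pairs on a comparable scale, meaning the same embedding model and metric; `union_retrieve` is a hypothetical name.

```python
def union_retrieve(blue_hits, green_hits, k=20):
    """Merge hits from both collections into one stable top-k list.

    Each input is an iterable of (doc_id, score) pairs, higher score is better.
    """
    best = {}
    for doc_id, score in list(blue_hits) + list(green_hits):
        # Keep the best score seen for each id across both collections.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Sort by score descending, then by id, so ties never reorder between calls.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

The id tie-break is what makes the order reproducible; without it, equal scores can flip top-k order from one request to the next.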

Verification checklist

  • Coverage ≥ 0.70 and ΔS ≤ 0.45 on a ten-question gold set
  • λ convergent across two seeds and three paraphrases
  • Top-k overlap across seeds ≥ 0.8 after the fix
  • Watermarks aligned for ingest, embed, and index within your SLO window
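
The overlap check can be sketched as follows. The page does not define the overlap metric, so this assumes it means the number of shared ids in the two top-k lists divided by k; the function names are hypothetical, and at least two seeds are required.

```python
from itertools import combinations


def topk_overlap(a, b, k):
    """Fraction of the top k ids that two result lists share."""
    return len(set(a[:k]) & set(b[:k])) / k


def min_overlap_across_seeds(results_by_seed, k):
    """Worst pairwise overlap over all seed pairs for one query."""
    pairs = combinations(results_by_seed.values(), 2)
    return min(topk_overlap(a, b, k) for a, b in pairs)
```

Taking the minimum over seed pairs means a single unstable seed is enough to fail the check.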

Copy paste prompt for the LLM step

TXT OS and the WFGY Problem Map are loaded.

My issue: updates not reflected or recall dropped after a job.
Traces:
- watermarks: ingest=..., embed=..., index=...
- versions: embed_model=..., embed_rev=..., metric=..., ann_rev=...
- ΔS(question,retrieved)=..., coverage=..., λ across 3 paraphrases

Tell me:
1) the failing layer and why,
2) the exact WFGY page to open next,
3) the minimal structural fix to remove skew and pass targets,
4) a short verification plan for coverage ≥ 0.70 and ΔS ≤ 0.45.
Use BBMC, BBCR, BBPF, BBAM when relevant.
