WFGY/ProblemMap/GlobalFixMap/RAG_VectorDB/update_and_index_skew.md

Update and Index Skew — Guardrails and Fix Pattern

Use this page when recall flips or citations drift after a data or model update.
Skew appears when writers and readers see different corpus revisions, or when the index was rebuilt with changed params without a cutover plan.


Open these first


Core acceptance

  • A single INDEX_HASH is identical across the writer, retriever, reranker, and LLM-side prompts.
  • ΔS(question, retrieved) ≤ 0.45 on 3 paraphrases and 2 seeds after the update.
  • Coverage ≥ 0.70 to the target section, stable across shards and regions.
  • λ remains convergent during the cutover window, no flip states at header reorder.

Symptoms → likely cause → open this


Fix in 60 seconds

  1. Pin the contract
    Compute INDEX_HASH = sha256(model_id + tokenizer_ver + chunk_schema + metric + dim + store_params + corpus_rev).
    Log it on writer, retriever, reranker, and in the LLM prompt header.

  2. Shadow read
    Run a gold set against current index and a rebuilt index in parallel.
    Alert if ΔS variance > 0.05 or coverage drops below 0.70.

  3. Freeze and rebuild
    Stop writes. Re-embed and rebuild offline with explicit dim, metric, and normalization.
    Verify tokenizer and casing are identical to the previous contract.

  4. Cutover with warmup
    Warm the new index. Switch read traffic via percentage ramp.
    Abort if λ flips or ΔS exceeds 0.60 on any guardrail probe.
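
Steps 1 and 4 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's reference implementation. The source formula concatenates the fields directly; the sketch swaps in canonical JSON serialization so that field boundaries stay unambiguous (plain concatenation would hash "ab" + "c" and "a" + "bc" identically). The names `index_hash`, `ramp_cutover`, the `probe` callback, and the string label "convergent" for λ are all hypothetical.

```python
import hashlib
import json
from typing import Callable, Iterable, Sequence

# The seven fields named in step 1 of the fix pattern.
CONTRACT_FIELDS = (
    "model_id", "tokenizer_ver", "chunk_schema",
    "metric", "dim", "store_params", "corpus_rev",
)


def index_hash(contract: dict) -> str:
    """Return a sha256 hex digest over the pinned contract fields.

    Assumption: canonical JSON (sorted keys, fixed separators) replaces the
    raw string concatenation shown in the text, to keep boundaries explicit.
    """
    missing = [f for f in CONTRACT_FIELDS if f not in contract]
    if missing:
        raise KeyError(f"contract is missing fields: {missing}")
    pinned = {f: contract[f] for f in CONTRACT_FIELDS}
    payload = json.dumps(pinned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ramp_cutover(
    steps: Sequence[int],
    probe: Callable[[int], Iterable[tuple[float, str]]],
    delta_s_abort: float = 0.60,
) -> tuple[str, int]:
    """Walk a percentage ramp and abort on the first failing guardrail probe.

    `probe(pct)` is a hypothetical callback that routes `pct` percent of read
    traffic to the new index and yields (delta_s, lambda_state) pairs.
    """
    if not steps:
        raise ValueError("ramp needs at least one step")
    for pct in steps:
        for delta_s, lam in probe(pct):
            if lam != "convergent" or delta_s > delta_s_abort:
                return ("aborted", pct)
    return ("completed", steps[-1])
```

A caller would log `index_hash(contract)` on the writer, retriever, and reranker, write it into the prompt header, and then drive `ramp_cutover([1, 10, 50, 100], probe)` with whatever probe runner the deployment already has.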


Minimal checks you must script

  • Contract echo
    Every query path must log INDEX_HASH, MODEL_ID, TOKENIZER_VER, CHUNK_SCHEMA_VER.

  • Shard parity probe
    Run the same 25 queries against each shard or region.
    Flag if Jaccard(top-k) < 0.6 against the reference shard.

  • Cache invalidation
    Clear reranker and query embedding caches when INDEX_HASH changes.

  • Reader staleness
    Reject queries if reader_index_hash != router_index_hash. Fail fast; do not serve stale results.
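
The shard parity probe and the reader staleness check above lend themselves to a short script. The following is a sketch under assumptions: top-k results are represented as lists of document ids keyed by query string, and the names `jaccard`, `shard_parity`, `StaleReaderError`, and `check_reader` are hypothetical helpers, not part of any library.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two id collections; two empty sets count as equal."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def shard_parity(reference: dict, shard: dict, threshold: float = 0.6) -> list:
    """Return the queries whose top-k on `shard` diverges from the reference.

    Both arguments map query -> list of top-k document ids. A query that the
    shard did not answer is treated as an empty result and gets flagged.
    """
    return [
        query
        for query, ref_ids in reference.items()
        if jaccard(ref_ids, shard.get(query, [])) < threshold
    ]


class StaleReaderError(RuntimeError):
    """Raised when a reader serves an index revision the router did not pin."""


def check_reader(reader_index_hash: str, router_index_hash: str) -> None:
    """Fail fast instead of serving results from a stale index."""
    if reader_index_hash != router_index_hash:
        raise StaleReaderError(
            f"reader has {reader_index_hash!r}, router expects {router_index_hash!r}"
        )
```

Note that Jaccard ignores rank order, so it catches membership drift in the top-k but not reordering within it; a rank-aware measure would be a separate choice.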


Common gotchas

  • Silent analyzer change in a search backend re-tokenizes text while vectors are unchanged.
  • HNSW or IVF params differ between shards, causing order instability at k=10 but not at k=3.
  • APM dashboards show healthy ingestion yet the retriever reads from a lagging replica.
  • Reranker model upgraded without re-baselining acceptance targets.
  • Partial re-embed of only new docs creates a semantic seam at time T.
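
The second gotcha, index parameters differing between shards, can be caught with a plain configuration diff before any query is run. This is a minimal sketch: `param_drift` is a hypothetical helper, and the parameter keys in the usage note are only illustrative of what a store might expose.

```python
def param_drift(reference: dict, shards: dict) -> dict:
    """Report, per shard, every index parameter that differs from the reference.

    `shards` maps shard name -> parameter dict. The result maps shard name ->
    {param: (reference_value, shard_value)} and omits shards with no drift.
    A parameter missing on either side shows up with None on that side.
    """
    drift = {}
    for name, params in shards.items():
        keys = reference.keys() | params.keys()
        diff = {
            key: (reference.get(key), params.get(key))
            for key in keys
            if reference.get(key) != params.get(key)
        }
        if diff:
            drift[name] = diff
    return drift
```

For example, comparing `{"M": 16, "ef_search": 64}` against a shard that reports `{"M": 16, "ef_search": 32}` would return that shard with `{"ef_search": (64, 32)}`.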

Verification

  • Gold set of 100 questions, 3 paraphrases each.
  • Require ΔS ≤ 0.45 and coverage ≥ 0.70 on both old and new indexes before cutover.
  • After cutover, repeat on two seeds. If λ remains convergent and ΔS does not spike, close the change.
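
The pre-cutover gate above can be expressed as a small check over the gold-set results. The sketch assumes ΔS and coverage have already been measured per probe by existing tooling, and it reads the requirement strictly: every probe on both indexes must meet the thresholds. If the intent is an aggregate (for example a mean), the `failing` function is the place to change. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    question_id: int
    paraphrase: int
    delta_s: float   # ΔS(question, retrieved), measured elsewhere
    coverage: float  # coverage of the target section, measured elsewhere


def failing(probes, delta_s_max: float = 0.45, coverage_min: float = 0.70):
    """Return the probes that miss either acceptance threshold."""
    return [
        p for p in probes
        if p.delta_s > delta_s_max or p.coverage < coverage_min
    ]


def ready_for_cutover(old, new, expected: int = 300):
    """Gate the cutover on both indexes passing the full gold set.

    `expected` defaults to 100 questions x 3 paraphrases. Returns a
    (ready, reasons) pair so the caller can log why a gate failed.
    """
    reasons = []
    for label, probes in (("old", old), ("new", new)):
        if len(probes) != expected:
            reasons.append(f"{label}: expected {expected} probes, got {len(probes)}")
        bad = failing(probes)
        if bad:
            reasons.append(f"{label}: {len(bad)} probes outside thresholds")
    return (not reasons, reasons)
```

Checking the probe count guards against a silently truncated gold run, which would otherwise pass the gate with fewer samples than intended.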

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” and the OS boots instantly |

Explore More

| Layer | Page | What it's for |
| --- | --- | --- |
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| Engine | WFGY 1.0 | Original PDF-based tension engine |
| Engine | WFGY 2.0 | Production tension kernel and math engine for RAG and agents |
| Engine | WFGY 3.0 | TXT-based Singularity tension engine, 131 S class set |
| Map | Problem Map 1.0 | Flagship 16-problem RAG failure checklist and fix map |
| Map | Problem Map 2.0 | RAG-focused recovery pipeline |
| Map | Problem Map 3.0 | Global Debug Card, image as a debug protocol layer |
| Map | Semantic Clinic | Symptom to family to exact fix |
| Map | Grandmas Clinic | Plain-language stories mapped to Problem Map 1.0 |
| Onboarding | Starter Village | Guided tour for newcomers |
| App | TXT OS | TXT semantic OS, fast boot |
| App | Blah Blah Blah | Abstract and paradox Q and A built on TXT OS |
| App | Blur Blur Blur | Text to image with semantic control |
| App | Blow Blow Blow | Reasoning game engine and memory demo |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.