Bypasses the broken CDX HTML extractor by processing pre-extracted text
from Common Crawl WET files. Filters records against 30 medical and CS
domains, chunks the content, and batch-injects it into the pi.ruv.io brain.
Includes: processor, filter/injector, Cloud Run Job config, and an
orchestrator for multi-segment processing.
Target: full corpus in 6 weeks at ~$200 total cost.
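
For reviewers, the filter -> chunk -> batch flow described above can be
sketched roughly as follows. This is an illustrative sketch only, not the
code in this commit: ALLOWED_DOMAINS, domain_allowed, chunk_text,
build_batches, and the chunk and batch sizes are all hypothetical, and the
injection call into pi.ruv.io is deliberately left out.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the real filter covers 30 medical + CS domains.
ALLOWED_DOMAINS = {"nih.gov", "arxiv.org"}


def domain_allowed(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def chunk_text(text, max_chars=1000):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        add = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + add > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
            add = len(word)
        current.append(word)
        length += add
    if current:
        chunks.append(" ".join(current))
    return chunks


def build_batches(records, batch_size=100, max_chars=1000):
    """Yield lists of {"url", "text"} chunks from (url, text) records.

    Records whose URL is outside the allowlist are skipped.
    """
    batch = []
    for url, text in records:
        if not domain_allowed(url):
            continue
        for chunk in chunk_text(text, max_chars):
            batch.append({"url": url, "text": chunk})
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch
```

Each yielded batch would then be handed to whatever injection client the
pipeline uses; matching on the host suffix (rather than a substring) keeps
look-alike hosts such as notnih.gov out of the allowlist.
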
Co-Authored-By: claude-flow <ruv@ruv.net>