Commit graph

1 commit

Author SHA1 Message Date
rUv
14ab7b0bdc feat: WET processing pipeline for full medical + CS corpus import (ADR-120)
Bypasses broken CDX HTML extractor by processing pre-extracted text
from Common Crawl WET files. Filters by 30 medical + CS domains,
chunks content, and batch-injects it into the pi.ruv.io brain.
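
The filter, chunk, and batch steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the allowlist entries, the function names (`domain_allowed`, `chunk_text`, `build_batches`), the chunk size, and the batch size are all assumptions, and the input is assumed to be (url, text) pairs already read from a WET file. The actual injection call to pi.ruv.io is left out because the commit message does not describe its API.

```python
from itertools import islice
from urllib.parse import urlparse

# Hypothetical allowlist. The commit mentions 30 medical + CS domains
# but does not list them, so these two are placeholders.
ALLOWED_DOMAINS = {"example-medical.org", "example-cs.edu"}


def domain_allowed(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def chunk_text(text, max_chars=1000):
    """Split text on whitespace into chunks of about max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def batched(items, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def build_batches(records, batch_size=50, max_chars=1000):
    """Filter (url, text) records by domain, chunk them, and group into batches."""
    chunks = (
        {"url": url, "text": chunk}
        for url, text in records
        if domain_allowed(url)
        for chunk in chunk_text(text, max_chars)
    )
    yield from batched(chunks, batch_size)
```

Each yielded batch is a list of `{"url", "text"}` dictionaries that a separate injector could send in one request. Keeping everything as generators means a large WET segment never has to be held in memory at once.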

Includes: processor, filter/injector, Cloud Run Job config,
orchestrator for multi-segment processing.

Target: full corpus in 6 weeks at ~$200 total cost.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-03-22 00:50:12 +00:00