docs(adr): revise ADR-115 with realistic costs and benchmark gates

Fixes per review feedback:

1. Crawl ID hygiene:
   - Changed all CC-MAIN-2026-10 references to CC-MAIN-2026-08 (latest public)
   - Acceptance test uses CC-MAIN-2026-06, 07, 08 (all publicly available)

2. Memorystore cost correction:
   - $8/mo → $5-200/mo with footnote
   - Notes ~$160/mo for 8 GiB Basic tier (actual Google pricing)
   - Offers disk-backed SQLite as $5-50/mo alternative

3. Cloud Run costs now usage-dependent:
   - Split by workload type: ingest ($20-50), retrieval ($100-200), backfill
   - Total estimates: $160-340/mo (disk cache), $230-480/mo (Memorystore)
   - Optimization options table with trade-offs

4. Tightened acceptance test:
   - Exact dataset: 1M pages × 3 crawls
   - Required measurements table: Recall@10, nDCG@10, storage, p95/p99, provenance
   - Pass criteria: all targets met simultaneously

5. Added mandatory exemplar retention rule (§9.0):
   - At least one raw exemplar per cluster
   - At least one provenance anchor per cluster
   - Preserve high-novelty outliers
   - Never merge without preserving lineage edges

6. Updated decision summary to engineering language:
   - Phase 1 scope explicitly limited to validated techniques
   - Research scope marked experimental pending benchmark gates
   - Acceptance gate with specific crawl IDs and metrics

Co-Authored-By: claude-flow <ruv@ruv.net>
Reuven 2026-03-17 00:02:00 -04:00
parent aff9287e68
commit 2e2c98679b


@ -27,11 +27,13 @@ Common Crawl represents the largest public web archive:
| Metric | Value | Source |
|--------|-------|--------|
| Monthly crawl pages | 2.1-2.3 billion | [Feb 2026 release](https://commoncrawl.org/blog/february-2026-crawl-archive-now-available) |
| Monthly crawl pages | 2.1-2.3 billion | [CC-MAIN-2026-08](https://commoncrawl.org/latest-crawl) |
| Monthly uncompressed size | 363-398 TiB | Common Crawl statistics |
| Total corpus (2008-present) | 300+ billion pages | Historical archives |
| Host-level graph edges | Billions | [Graph releases](https://commoncrawl.org/blog/host--and-domain-level-web-graphs-november-december-2025-and-january-2026) |
**Current latest crawl**: CC-MAIN-2026-08 (February 2026). All examples in this ADR use publicly available crawl IDs: CC-MAIN-2026-06, CC-MAIN-2026-07, CC-MAIN-2026-08.
The challenge: this scale makes naive storage prohibitively expensive (~$5,000+/month for embeddings alone).
### 2.2 The Opportunity
@ -180,7 +182,7 @@ Build a **phased compressed web memory service**, starting with conservative tec
| Component | Technology | Purpose | Cost |
|-----------|------------|---------|------|
| CDX Cache | Cloud Memorystore (Redis) | Cache Common Crawl CDX index queries | ~$8/mo |
| CDX Cache | Redis or disk-backed | Cache Common Crawl CDX index queries | $5-200/mo* |
| WARC Fetcher | reqwest + Range headers | Fetch only needed bytes from S3 | $0 (public bucket) |
| URL Deduplication | DashMap<hash, ()> | Skip previously seen URLs | ~2 GB RAM |
| Content Deduplication | SimHash/MinHash | Skip near-duplicate content | ~500 MB RAM |
@ -189,6 +191,8 @@ Build a **phased compressed web memory service**, starting with conservative tec
| Exemplar Store | GCS + Firestore | Raw exemplars per cluster | Storage |
| Scheduler | Cloud Scheduler | Periodic crawl ingestion | ~$0.50/mo |
*CDX cache cost depends on backend choice. [Google Memorystore pricing](https://cloud.google.com/memorystore/docs/redis/pricing) shows ~$160/mo for 8 GiB Basic tier in us-central1. A disk-backed SQLite cache or smaller Redis instance can reduce this to $5-50/mo.
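
As an illustration of the disk-backed option in the footnote, the following is a minimal sketch of a SQLite cache for CDX query results. The class name `CdxDiskCache`, the table layout, the key format, and the 24-hour TTL are all assumptions made for this example; the ADR does not specify them.

```python
import json
import sqlite3
import time
from typing import Callable, List, Optional


class CdxDiskCache:
    """Hypothetical disk-backed cache for CDX index query results."""

    def __init__(self, path: str = "cdx_cache.sqlite3", ttl_seconds: float = 86400.0):
        # ttl_seconds is an assumed default (24 h), not a value from the ADR.
        self.ttl_seconds = ttl_seconds
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cdx_cache ("
            " key TEXT PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " stored_at REAL NOT NULL)"
        )
        self.conn.commit()

    @staticmethod
    def make_key(crawl: str, url_pattern: str) -> str:
        # One cache entry per (crawl ID, URL pattern) pair.
        return f"{crawl}|{url_pattern}"

    def get(self, crawl: str, url_pattern: str, now: Optional[float] = None) -> Optional[List[dict]]:
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT payload, stored_at FROM cdx_cache WHERE key = ?",
            (self.make_key(crawl, url_pattern),),
        ).fetchone()
        if row is None or now - row[1] > self.ttl_seconds:
            return None  # miss, or entry older than the TTL
        return json.loads(row[0])

    def put(self, crawl: str, url_pattern: str, records: List[dict], now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO cdx_cache (key, payload, stored_at) VALUES (?, ?, ?)",
            (self.make_key(crawl, url_pattern), json.dumps(records), now),
        )
        self.conn.commit()

    def get_or_fetch(
        self, crawl: str, url_pattern: str, fetch: Callable[[str, str], List[dict]]
    ) -> List[dict]:
        # `fetch` stands in for whatever client actually queries the CDX index.
        cached = self.get(crawl, url_pattern)
        if cached is not None:
            return cached
        records = fetch(crawl, url_pattern)
        self.put(crawl, url_pattern, records)
        return records
```

A lookup that misses costs one index round trip plus a local write; subsequent lookups for the same crawl and pattern are served from disk until the entry expires.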
## 6. Compression Stack (Conservative Claims)
### 6.1 Validated Compression: PiQ3 Quantization
@ -324,15 +328,33 @@ HNSW provides:
### 8.1 Acceptance Test
Before claiming aggressive compression ratios:
Before claiming aggressive compression ratios, execute this benchmark:
1. Take 3 monthly Common Crawl slices (CC-MAIN-2026-08, 09, 10)
2. Embed full text (all-MiniLM-L6-v2)
3. Apply PiQ3 quantization
4. Apply semantic deduplication (SimHash)
5. Build HNSW index
6. **Measure**: Recall@10 vs uncompressed baseline on fixed benchmark
7. **Target**: ≥90% recall with ≥10x storage reduction
**Dataset**: Three publicly available monthly crawls:
- CC-MAIN-2026-06
- CC-MAIN-2026-07
- CC-MAIN-2026-08
**Procedure**:
1. Sample 1M pages per crawl (3M total)
2. Embed full text with all-MiniLM-L6-v2 (384-dim fp32)
3. Build fp32 baseline HNSW index
4. Apply PiQ3 quantization
5. Apply SimHash deduplication (drop pages whose estimated cosine similarity to an already-kept page exceeds 0.95)
6. Build compressed HNSW index
7. Generate 10K random query embeddings
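
The deduplication step in the procedure can be sketched as follows. This is a simplified token-level SimHash, not the ADR's implementation: the 64-bit fingerprint width, the BLAKE2b token hash, and the conversion from a cosine threshold to a Hamming-distance limit (expected differing bits ≈ bits × θ/π, which holds for random-hyperplane SimHash and is only an approximation here) are assumptions of this example.

```python
import hashlib
import math
import re
from typing import Dict, List, Tuple

BITS = 64  # assumed fingerprint width


def _token_hash(token: str) -> int:
    # Stable 64-bit hash of a token (Python's built-in hash() is salted per process).
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str) -> int:
    """Sum +1/-1 votes per bit across token hashes, then keep the sign of each sum."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = _token_hash(token)
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def max_hamming_for_cosine(cosine: float, bits: int = BITS) -> int:
    # Under the random-hyperplane model, the expected fraction of differing
    # bits equals angle / pi; treat that as the duplicate cutoff.
    return int(bits * math.acos(cosine) / math.pi)


def deduplicate(pages: Dict[str, str], min_cosine: float = 0.95) -> List[str]:
    """Return URLs to keep, dropping pages too close to an already-kept page."""
    limit = max_hamming_for_cosine(min_cosine)
    kept: List[Tuple[str, int]] = []
    for url, text in pages.items():
        fp = simhash(text)
        if all(hamming(fp, other) > limit for _, other in kept):
            kept.append((url, fp))
    return [url for url, _ in kept]
```

The pairwise comparison is quadratic in the number of kept pages, which is acceptable for a sketch but not for 3M pages; a production pass would typically bucket fingerprints by bit bands before comparing.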
**Required Measurements**:
| Metric | Measurement | Target |
|--------|-------------|--------|
| Recall@10 | Fraction of true top-10 neighbors present in compressed results | ≥ 0.90 |
| nDCG@10 | Ranking quality vs fp32 baseline | ≥ 0.85 |
| Storage (embeddings) | Compressed bytes / fp32 bytes | ≤ 0.10 (10x) |
| p95 latency | 95th percentile query time | < 30ms |
| p99 latency | 99th percentile query time | < 50ms |
| Provenance recovery | Fraction of results traceable to a source URL | ≥ 0.99 |
**Pass Criteria**: All targets met simultaneously.
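
The measurements and the all-or-nothing pass rule above could be computed along these lines. This is a sketch under stated assumptions: the fp32 baseline's top-10 is treated as ground truth, nDCG gains are derived from baseline rank (the ADR does not define graded relevance), percentiles use the nearest-rank method, and the metric key names are hypothetical.

```python
import math
from typing import Dict, Sequence


def recall_at_k(baseline_ids: Sequence[int], compressed_ids: Sequence[int], k: int = 10) -> float:
    """Fraction of the baseline top-k that also appears in the compressed top-k."""
    truth = set(baseline_ids[:k])
    if not truth:
        return 0.0
    hits = sum(1 for doc in compressed_ids[:k] if doc in truth)
    return hits / len(truth)


def ndcg_at_k(baseline_ids: Sequence[int], compressed_ids: Sequence[int], k: int = 10) -> float:
    """nDCG with gains taken from baseline rank: rank 0 gets gain k, rank k-1 gets 1."""
    gains = {doc: k - rank for rank, doc in enumerate(baseline_ids[:k])}
    dcg = sum(
        gains.get(doc, 0) / math.log2(pos + 2)
        for pos, doc in enumerate(compressed_ids[:k])
    )
    ideal = sum(
        (k - rank) / math.log2(rank + 2)
        for rank in range(min(k, len(baseline_ids)))
    )
    return dcg / ideal if ideal else 0.0


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes_gate(m: Dict[str, float]) -> bool:
    """True only when every target in the measurements table is met at once."""
    return (
        m["recall_at_10"] >= 0.90
        and m["ndcg_at_10"] >= 0.85
        and m["storage_ratio"] <= 0.10
        and m["p95_ms"] < 30
        and m["p99_ms"] < 50
        and m["provenance_recovery"] >= 0.99
    )
```

Per-query recall and nDCG would be averaged over the 10K queries before being passed to `passes_gate`; a single failing metric fails the whole run.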
### 8.2 Metrics to Track
@ -346,12 +368,22 @@ Before claiming aggressive compression ratios:
## 9. Failure Modes & Mitigations
### 9.0 Mandatory Exemplar Retention Rule
**Hard policy**: Any cluster compression pass must:
1. Retain at least one raw exemplar per cluster
2. Retain at least one provenance anchor (source URL + timestamp) per cluster
3. Preserve high-novelty outliers even when compression pressure is high
4. Never merge clusters without preserving lineage graph edges
This rule protects long-tail knowledge and auditability.
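
One way to enforce the four rules is a check that runs after every compression pass and blocks the pass if it reports anything. The `Cluster` shape and the `retention_violations` helper below are hypothetical, invented for this sketch; the ADR does not define a cluster data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Cluster:
    """Hypothetical cluster record used only for this illustration."""
    cluster_id: str
    exemplar_ids: List[str] = field(default_factory=list)
    provenance: List[Tuple[str, str]] = field(default_factory=list)  # (source URL, timestamp)
    outlier_ids: List[str] = field(default_factory=list)  # high-novelty pages
    merged_from: List[str] = field(default_factory=list)  # lineage edges to parent clusters


def retention_violations(before: List[Cluster], after: List[Cluster]) -> List[str]:
    """Compare cluster state before and after a pass; an empty list means compliant."""
    violations: List[str] = []

    # Rules 1 and 2: every surviving cluster keeps an exemplar and an anchor.
    for cluster in after:
        if not cluster.exemplar_ids:
            violations.append(f"{cluster.cluster_id}: no raw exemplar")
        if not cluster.provenance:
            violations.append(f"{cluster.cluster_id}: no provenance anchor")

    # Rule 3: high-novelty outliers present before must still exist somewhere.
    outliers_after = {o for cluster in after for o in cluster.outlier_ids}
    for cluster in before:
        for outlier in cluster.outlier_ids:
            if outlier not in outliers_after:
                violations.append(f"{cluster.cluster_id}: outlier {outlier} dropped")

    # Rule 4: a cluster that disappeared must be reachable via a lineage edge.
    surviving = {cluster.cluster_id for cluster in after}
    lineage = {parent for cluster in after for parent in cluster.merged_from}
    for cluster in before:
        if cluster.cluster_id not in surviving and cluster.cluster_id not in lineage:
            violations.append(f"{cluster.cluster_id}: merged without lineage edge")

    return violations
```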
### 9.1 Compression Destroys Edge Cases
**Risk**: Exotic compression preserves the average and kills rare-but-valuable content.
**Mitigation**:
- Retain raw exemplar pages per cluster
- Retain raw exemplar pages per cluster (see 9.0)
- Preserve long-tail pockets (high novelty score)
- Measure recall separately for common vs rare concepts
@ -392,7 +424,7 @@ Authorization: Bearer <token>
{
"query": "*.arxiv.org/abs/*",
"crawl": "CC-MAIN-2026-10",
"crawl": "CC-MAIN-2026-08",
"limit": 1000,
"filters": {"language": "en", "min_length": 1000}
}
@ -413,7 +445,7 @@ Authorization: Bearer <token>
{
"urls": ["https://arxiv.org/abs/2603.12345"],
"crawl": "CC-MAIN-2026-10",
"crawl": "CC-MAIN-2026-08",
"options": {"skip_duplicates": true, "compute_novelty": true}
}
@ -473,29 +505,51 @@ Response:
## 11. Cost Analysis
### 11.1 Conservative Estimate (Validated Compression)
[Cloud Run pricing](https://cloud.google.com/run/pricing) is request-based: $0.000024/vCPU-second and $0.0000025/GiB-second in us-central1, plus free tier credits. Actual costs depend heavily on usage pattern.
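
To show how usage pattern drives the bill, the sketch below turns the two per-second rates quoted above into a monthly figure. The instance shapes and duty cycles in the usage note are hypothetical, a 30-day month is assumed, and free tier credits are ignored.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 s, assuming a 30-day month

VCPU_RATE = 0.000024     # USD per vCPU-second (rate quoted above)
MEMORY_RATE = 0.0000025  # USD per GiB-second (rate quoted above)


def monthly_cloud_run_cost(vcpus: float, memory_gib: float, duty_cycle: float = 1.0) -> float:
    """Estimated monthly cost for one instance billed for `duty_cycle` of the month.

    duty_cycle=1.0 models an always-warm instance; 2/24 models two hours per day.
    Free tier credits and per-request fees are deliberately left out.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    billable_seconds = SECONDS_PER_MONTH * duty_cycle
    return billable_seconds * (vcpus * VCPU_RATE + memory_gib * MEMORY_RATE)
```

For example, an always-on instance with 2 vCPU and 4 GiB works out to about $150/month, while a 4 vCPU, 16 GiB job running two hours a day comes to about $29/month. These shapes are illustrative only and are not the sizing the ADR commits to.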
### 11.1 Cost by Workload Type
| Workload | Pattern | Estimated Monthly |
|----------|---------|-------------------|
| **Scheduled ingest jobs** | Bursty, 1-2 hrs/day | $20-50 |
| **Always-on retrieval** | Warm instance, continuous | $100-200 |
| **Backfill/benchmark** | Spike, one-time | $50-500 (varies) |
### 11.2 Conservative Estimate (Validated Compression)
| Component | Monthly Cost | Notes |
|-----------|--------------|-------|
| CDX cache (disk-backed) | $5-50 | SQLite on GCS or small Redis |
| CDX cache (Memorystore) | $80-200 | 4-16 GiB Basic tier |
| GCS storage (150 GB compressed) | $3 | Standard class |
| Firestore (metadata) | $10 | Document ops |
| Cloud Run (retrieval) | $100-200 | Duty-cycle dependent |
| Cloud Run (ingest jobs) | $20-50 | Bursty pattern |
| Cloud Scheduler (8 jobs) | $0.50 | |
| Egress | $20 | |
| **Total (disk cache)** | **$160-340/month** | |
| **Total (Memorystore)** | **$230-480/month** | |
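
The two totals can be checked by summing the low and high ends of each row. The snippet below uses only the figures from the table; the dictionary and function names are made up for the example.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (low, high) in USD per month

# Rows shared by both configurations, taken from the table above.
SHARED: Dict[str, Range] = {
    "GCS storage (150 GB compressed)": (3.0, 3.0),
    "Firestore (metadata)": (10.0, 10.0),
    "Cloud Run (retrieval)": (100.0, 200.0),
    "Cloud Run (ingest jobs)": (20.0, 50.0),
    "Cloud Scheduler (8 jobs)": (0.5, 0.5),
    "Egress": (20.0, 20.0),
}

# The CDX cache row differs by backend.
CACHE_OPTIONS: Dict[str, Range] = {
    "disk-backed": (5.0, 50.0),
    "Memorystore": (80.0, 200.0),
}


def total_range(cache_backend: str) -> Range:
    """Sum the low and high ends of every row for one cache backend."""
    cache_low, cache_high = CACHE_OPTIONS[cache_backend]
    low = cache_low + sum(lo for lo, _ in SHARED.values())
    high = cache_high + sum(hi for _, hi in SHARED.values())
    return low, high


print(total_range("disk-backed"))   # (158.5, 333.5)
print(total_range("Memorystore"))   # (233.5, 483.5)
```

The table rounds these sums to $160-340 and $230-480 respectively.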
### 11.3 Cost Optimization Options
| Option | Savings | Trade-off |
|--------|---------|-----------|
| Disk-backed CDX cache (SQLite) | -$150 | Slightly higher latency |
| Scale-to-zero retrieval | -$100 | Cold start latency |
| Regional egress only | -$15 | Limited to us-central1 |
| Committed use discounts | -20% | 1-3 year commitment |
### 11.4 Aggressive Estimate (If Research Compression Validates)
| Component | Monthly Cost |
|-----------|--------------|
| Cloud Memorystore (CDX cache) | $8 |
| GCS storage (150 GB compressed) | $3 |
| Firestore (metadata) | $10 |
| Cloud Run (4 vCPU, 16 GB RAM) | $100 |
| Cloud Scheduler (8 jobs) | $0.50 |
| Egress | $20 |
| **Total** | **~$150/month** |
### 11.2 Aggressive Estimate (If Research Compression Validates)
| Component | Monthly Cost |
|-----------|--------------|
| Cloud Memorystore (CDX cache) | $8 |
| CDX cache (disk-backed) | $5 |
| GCS storage (56 MB compressed) | $0.01 |
| Firestore (attractor metadata) | $5 |
| Cloud Run (2 vCPU, 8 GB RAM) | $50 |
| Cloud Run (scale-to-zero) | $30-80 |
| Cloud Scheduler (8 jobs) | $0.50 |
| Egress | $10 |
| **Total** | **~$75/month** |
| **Total** | **$50-100/month** |
## 12. Success Metrics
@ -506,7 +560,9 @@ Response:
| Compression ratio (vs naive embeddings) | ≥ 10x |
| Retrieval latency (p99) | < 50ms |
| Recall@10 | ≥ 0.90 |
| Monthly operating cost | < $200 |
| nDCG@10 | ≥ 0.85 |
| Provenance recovery | ≥ 0.99 |
| Monthly operating cost | < $350 (disk cache) |
### 12.2 Phase 3 Success (Aggressive)
@ -528,9 +584,10 @@ Response:
## 14. References
- [Common Crawl February 2026 Archive](https://commoncrawl.org/blog/february-2026-crawl-archive-now-available)
- [Common Crawl Latest Crawl](https://commoncrawl.org/latest-crawl)
- [Common Crawl Graph Statistics](https://commoncrawl.github.io/cc-crawl-statistics/)
- [Cloud Run Pricing](https://cloud.google.com/run/pricing)
- [Memorystore for Redis Pricing](https://cloud.google.com/memorystore/docs/redis/pricing)
- [ADR-096: Cloud Pipeline](./ADR-096-cloud-pipeline-realtime-optimization.md)
- [ADR-077: Midstream Platform](./ADR-077-midstream-ruvector-platform.md)
@ -538,10 +595,28 @@ Response:
## 15. Decision Summary
**What we're building**: A compressed web memory service for agents, not "the whole web in 56 MB."
**Decision**: Implement Common Crawl integration as a phased compressed web memory service.
**Conservative framing**: Turn the open web into a compact, queryable, time-aware semantic memory layer—with enough compression to move from expensive archive analytics to cheap always-on retrieval.
**Phase 1 scope**: Limited to validated compression techniques:
- PiQ3 quantization (10.7x, 96% recall validated)
- Near-duplicate reduction via SimHash
- Exemplar-preserving clustering
- HNSW-based retrieval
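
The 10.7x figure is consistent with replacing 32-bit floats by roughly 3 bits per dimension (32 / 3 ≈ 10.67). That bit width is an assumption inferred from the ratio, not something the ADR states, and the sketch below ignores any per-vector scale or codebook overhead that a real quantizer would add.

```python
DIMENSIONS = 384       # all-MiniLM-L6-v2 embedding width
FP32_BITS = 32
ASSUMED_PIQ3_BITS = 3  # assumption: inferred from the 10.7x ratio, not confirmed


def bytes_per_vector(bits_per_dim: int, dims: int = DIMENSIONS) -> float:
    return dims * bits_per_dim / 8


def compression_ratio(bits_per_dim: int) -> float:
    return FP32_BITS / bits_per_dim


def corpus_bytes(num_vectors: int, bits_per_dim: int) -> float:
    return num_vectors * bytes_per_vector(bits_per_dim)


fp32_size = corpus_bytes(3_000_000, FP32_BITS)               # 4,608,000,000 bytes
quantized_size = corpus_bytes(3_000_000, ASSUMED_PIQ3_BITS)  # 432,000,000 bytes
storage_ratio = quantized_size / fp32_size                   # 0.09375
```

Under that assumption, quantization alone takes the 3M-page benchmark from about 4.6 GB to about 432 MB of embeddings, a storage ratio of 0.09375, before any deduplication is applied.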
**Research scope**: More aggressive attractor and temporal compression stages remain experimental until benchmark gates for recall, fidelity, provenance, and cost are met.
**Acceptance gate**: A three-crawl benchmark (CC-MAIN-2026-06, 07, 08) must demonstrate:
- ≥10x storage reduction over naive embeddings
- Recall@10 ≥ 0.90
- p99 retrieval < 50ms on hot index
- All sources traceable to exemplars
**What this enables**: Not just cheaper storage. A new memory substrate where:
- Retrieval becomes structural, not just lexical or vector-based
- Summarization becomes state tracking
- Monitoring becomes topology watching
- Memory becomes a living graph of conceptual basins and transitions
**Conservative framing**: Turn the open web into a compact, queryable, time-aware semantic memory layer for agents.
**Exotic framing**: We're not compressing pages. We're compressing the web's evolving conceptual structure.
**Starting point**: Phase 1 with validated compression (10x minimum), then validate research hypotheses for exotic compression.