Mirror of https://github.com/ruvnet/RuVector.git (synced 2026-05-25 23:24:03 +00:00)
feat: ADR-119 historical crawl evolutionary comparison

Implement temporal knowledge evolution tracking across quarterly Common Crawl snapshots (2020-2026). Includes:
- ADR-119 with architecture, cost model, acceptance criteria
- Historical crawl import script (14 quarterly snapshots, 5 domains)
- Evolutionary analysis module (drift detection, concept birth, similarity)
- Initial analysis report on existing brain content (71 memories)

Cost: ~$7-15 one-time for full 2020-2026 import.

Co-Authored-By: claude-flow <ruv@ruv.net>
Parent: a81c13514c
Commit: 1ab5240956
4 changed files with 425 additions and 0 deletions

docs/adr/ADR-119-historical-crawl-evolutionary-comparison.md (normal file, 108 lines added)
@@ -0,0 +1,108 @@

# ADR-119: Historical Common Crawl Evolutionary Comparison

**Status**: Accepted
**Date**: 2026-03-22
**Author**: Claude (ruvnet)
**Related**: ADR-094 (Shared Web Memory), ADR-115 (Common Crawl Compression), ADR-118 (Cost-Effective Crawl)

## Context

The pi.ruv.io brain ingests current Common Crawl data (ADR-115 Phase 1), but medical knowledge evolves over time. Understanding HOW dermatology content changed across years enables:
- Detecting when new treatment protocols emerged
- Tracking consensus formation on diagnostic criteria
- Identifying knowledge fragmentation (narrative fractures)
- Measuring the pace of AI adoption in dermatology

Common Crawl maintains monthly crawl archives from 2008 to present, each with its own CDX index. By querying the same medical domains across multiple crawl snapshots, we can build temporal knowledge evolution graphs.
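
As a rough illustration of the per-crawl CDX lookup described above, the sketch below only builds the query URL for one domain in one crawl index; it performs no network access. It assumes the public Common Crawl index server at `index.commoncrawl.org` and its `url`, `output`, and `limit` query parameters. The function name `cdx_query_url` and the default limit of 200 are illustrative choices, not taken from the import script in this commit.

```python
from urllib.parse import urlencode

# Assumption: the public Common Crawl index server is the CDX endpoint in use.
CDX_BASE = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id: str, domain: str, limit: int = 200) -> str:
    """Build a CDX query URL for every captured page under `domain`.

    `crawl_id` is an index name such as "CC-MAIN-2024-10".
    `limit` caps the number of returned records (hypothetical default).
    """
    params = {
        "url": f"{domain}/*",  # wildcard: all paths under the domain
        "output": "json",      # one JSON record per line
        "limit": limit,
    }
    return f"{CDX_BASE}/{crawl_id}-index?{urlencode(params)}"


if __name__ == "__main__":
    # The same domain queried against two different crawl snapshots.
    for crawl in ("CC-MAIN-2020-16", "CC-MAIN-2024-10"):
        print(cdx_query_url(crawl, "aad.org"))
```

Keeping URL construction separate from fetching makes it easy to apply the same query to each snapshot and to test without touching the network.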

## Decision

Implement a historical crawl importer that queries the same domains across quarterly Common Crawl snapshots (2020-2026), computes embedding drift between temporal versions, and stores WebPageDelta chains in the brain for evolutionary analysis.

### Architecture

```
Quarterly Crawl Snapshots (24 crawls, 2020-2026)
           │
           ▼ CDX Query: same domains across each crawl
┌──────────────────────────────────────┐
│  For each crawl snapshot:            │
│  1. Query CDX for target domains     │
│  2. Range-GET page content           │
│  3. Extract text, embed (128-dim)    │
│  4. Compare to previous snapshot     │
│  5. Compute WebPageDelta             │
│  6. Store with crawl_timestamp       │
└──────────┬───────────────────────────┘
           │
           ▼
┌──────────────────────────────────────┐
│  Evolutionary Analysis:              │
│  • Embedding drift per URL over time │
│  • Concept birth detection           │
│  • Consensus formation tracking      │
│  • Narrative fracture via MinCut     │
│  • Lyapunov stability per domain     │
└──────────────────────────────────────┘
```
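
Steps 4 and 5 of the pipeline above (compare to the previous snapshot, compute a delta) can be sketched as follows. This is a minimal illustration under stated assumptions: drift is taken to be `1 - cosine similarity` between consecutive embeddings, and `PageDelta` is a hypothetical stand-in for `WebPageDelta`, whose actual fields are not shown in this ADR.

```python
import math
from dataclasses import dataclass


@dataclass
class PageDelta:
    """Hypothetical stand-in for WebPageDelta; field names are assumptions."""
    url: str
    from_crawl: str
    to_crawl: str
    embedding_drift: float


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def delta_chain(url, snapshots):
    """Build the chain of deltas between consecutive snapshots of one URL.

    `snapshots` is a chronologically ordered list of (crawl_id, embedding).
    Drift is assumed to be 1 - cosine similarity (0 = unchanged).
    """
    chain = []
    for (prev_id, prev_vec), (cur_id, cur_vec) in zip(snapshots, snapshots[1:]):
        drift = 1.0 - cosine_similarity(prev_vec, cur_vec)
        chain.append(PageDelta(url, prev_id, cur_id, drift))
    return chain


if __name__ == "__main__":
    # Toy 2-dim vectors stand in for the 128-dim embeddings.
    history = [
        ("CC-MAIN-2020-16", [1.0, 0.0]),
        ("CC-MAIN-2020-34", [1.0, 0.0]),
        ("CC-MAIN-2020-50", [0.0, 1.0]),
    ]
    for delta in delta_chain("https://example.org/page", history):
        print(delta)
```

A URL missing from a given crawl simply has no entry in `snapshots`, so the delta then spans the gap between the two nearest snapshots where it does appear.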

### Target Domains (Medical/Dermatology)

| Domain | Content |
|--------|---------|
| aad.org | American Academy of Dermatology — guidelines, patient info |
| dermnetnz.org | DermNet NZ — comprehensive dermatology reference |
| skincancer.org | Skin Cancer Foundation — screening, prevention |
| cancer.org | American Cancer Society — cancer statistics, guidelines |
| ncbi.nlm.nih.gov | PubMed/NCBI — research abstracts |
| who.int | WHO — global health guidance |
| melanoma.org | Melanoma Research Foundation |

### Crawl Schedule (Quarterly Sampling)

24 crawl indices from 2020 Q1 to 2026 Q1:
CC-MAIN-2020-16, CC-MAIN-2020-34, CC-MAIN-2020-50,
CC-MAIN-2021-10, CC-MAIN-2021-25, CC-MAIN-2021-43,
CC-MAIN-2022-05, CC-MAIN-2022-21, CC-MAIN-2022-40,
CC-MAIN-2023-06, CC-MAIN-2023-23, CC-MAIN-2023-40,
CC-MAIN-2024-10, CC-MAIN-2024-26, CC-MAIN-2024-42,
CC-MAIN-2025-05, CC-MAIN-2025-22, CC-MAIN-2025-40,
CC-MAIN-2026-06, CC-MAIN-2026-08
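
To place the indices above on a timeline, the sketch below converts a crawl ID into an approximate date and measures the spacing between snapshots. It assumes the `CC-MAIN-YYYY-WW` suffix is an ISO week number and uses the Monday of that week as a nominal date. The helper names are illustrative and not part of the import script.

```python
import datetime
import re

_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_week_start(crawl_id: str) -> datetime.date:
    """Return the Monday of the ISO week encoded in a crawl ID.

    Assumption: the trailing two digits are an ISO week number.
    """
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl id: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    return datetime.date.fromisocalendar(year, week, 1)  # Python 3.8+


def gaps_in_days(crawl_ids):
    """Days between consecutive crawls, after sorting chronologically."""
    starts = sorted(crawl_week_start(c) for c in crawl_ids)
    return [(later - earlier).days for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    sample = ["CC-MAIN-2024-10", "CC-MAIN-2024-26", "CC-MAIN-2024-42"]
    for crawl in sample:
        print(crawl, crawl_week_start(crawl).isoformat())
    print("gaps (days):", gaps_in_days(sample))
```

The gap list is a quick way to check how evenly a chosen set of indices samples the period before normalizing drift by elapsed time.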

### Cost

| Item | Cost |
|------|------|
| CDX queries (24 crawls x 7 domains) | $0 |
| Page extraction (~200 pages/crawl) | $0 (free CC egress) |
| Cloud Run compute | ~$5-10 one-time |
| Firestore storage | ~$2-5 |
| **Total** | **~$7-15 one-time** |

### Outputs

1. `GET /v1/web/evolution?url=X` — temporal delta history for a URL
2. `GET /v1/web/drift?topic=X&months=N` — drift score and trend
3. `GET /v1/web/concepts/births?since=2020` — newly emerged concepts
4. Brain memories tagged with `crawl_index` for temporal queries
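
A client for the three endpoints listed above might build its request URLs as sketched below. The paths and parameter names come from the list; the base URL `https://pi.ruv.io` is an assumption based on the host named in the Context section, and the function names are illustrative. Response formats are not specified in this ADR, so the sketch stops at URL construction.

```python
from urllib.parse import urlencode

# Assumption: the API is served from the brain host named in the ADR.
BASE_URL = "https://pi.ruv.io"


def evolution_url(page_url: str) -> str:
    """URL for the temporal delta history of one page."""
    return f"{BASE_URL}/v1/web/evolution?{urlencode({'url': page_url})}"


def drift_url(topic: str, months: int) -> str:
    """URL for the drift score and trend of a topic over `months` months."""
    return f"{BASE_URL}/v1/web/drift?{urlencode({'topic': topic, 'months': months})}"


def concept_births_url(since: int) -> str:
    """URL for concepts that first appeared since the given year."""
    return f"{BASE_URL}/v1/web/concepts/births?{urlencode({'since': since})}"


if __name__ == "__main__":
    print(evolution_url("https://www.aad.org/public"))
    print(drift_url("melanoma", 12))
    print(concept_births_url(2020))
```

Encoding the `url` parameter matters here because the page URL itself contains `:` and `/` characters.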

## Acceptance Criteria

1. Import >=100 pages across >=4 quarterly crawl snapshots
2. Compute WebPageDelta with embedding_drift for each URL across time
3. Store temporal chain in brain with crawl_timestamp metadata
4. Verify search returns time-ordered results for evolved content
5. Total cost <= $15

## Consequences

### Positive
- Brain gains historical context — not just current knowledge
- Drift detection shows which medical topics are evolving fastest
- DrAgnes can reference "how guidelines changed over time"
- Foundation for concept birth detection and narrative tracking

### Negative
- Historical CC CDX can be slow (older indices, less maintained)
- Some URLs may not appear in every crawl snapshot
- Content extraction quality varies across crawl periods

docs/research/DrAgnes/evolution-analysis.md (normal file, 63 lines added)
@@ -0,0 +1,63 @@

# Historical Crawl Evolutionary Analysis

**Date**: 2026-03-22
**Memories analyzed**: 71
**Embedding pairs computed**: 390

## Knowledge Distribution by Month

| Month | Memories | Topics |
|-------|----------|--------|
| 2026-03 | 71 | dragnes, competitive-analysis, dermatology, skin-cancer, common-crawl |

## Most Similar Content Pairs (Potential Temporal Versions)

| Similarity | Content A | Content B |
|-----------|-----------|----------|
| 1.000 | PubMed: Molecular Landscape of Natural C | PubMed: Molecular Landscape of Natural C |
| 1.000 | PubMed: Molecular Landscape of Natural C | PubMed: Molecular Landscape of Natural C |
| 1.000 | PubMed: Molecular Landscape of Natural C | PubMed: Molecular Landscape of Natural C |
| 1.000 | Fix audit items #15 and #16 | Fix audit items #15 and #16 |
| 1.000 | Swarm plan: spec:dax:055:A | Swarm plan: spec:dax:055:A |
| 1.000 | Swarm plan: adr:052 | Swarm plan: adr:052 |
| 0.938 | DrAgnes Swarm Architecture — ADR-032: Hi | DrAgnes Specialist Agent Implementation |
| 0.933 | DrAgnes Phase 1 Swarm Sprint Plan: Orche | DrAgnes Orchestrator Agent — Second-Opin |
| 0.929 | DrAgnes Swarm Architecture — ADR-032: Hi | DrAgnes Phase 1 Swarm Sprint Plan: Orche |
| 0.928 | DrAgnes Phase 1 Swarm Sprint Plan: Orche | DrAgnes Specialist Agent Implementation |
| 0.922 | AGI Self-Training Attempt — Honest Audit | DrAgnes Specialist Agent Implementation |
| 0.920 | DrAgnes Specialist Agent Implementation | DrAgnes Orchestrator Agent — Second-Opin |
| 0.916 | DrAgnes Swarm Phase 1 Implementation: 7 | DrAgnes Orchestrator Agent — Second-Opin |
| 0.915 | DrAgnes Swarm Architecture — ADR-032: Hi | DrAgnes Orchestrator Agent — Second-Opin |
| 0.901 | DrAgnes Swarm Phase 1 Implementation: 7 | DrAgnes Specialist Agent Implementation |
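
A ranking like the table above could be produced along the following lines. This is a sketch, not the analysis module from this commit: it assumes cosine similarity as the metric and takes the 0.7 cutoff from the Key Findings section. Memories are passed as a list of `(title, vector)` tuples rather than a dict, because several memories in the report share the same truncated title.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(memories, threshold=0.7):
    """Return (score, title_a, title_b) for every pair above `threshold`.

    `memories` is a list of (title, embedding) tuples. Results are sorted
    from most to least similar.
    """
    pairs = []
    for (title_a, vec_a), (title_b, vec_b) in combinations(memories, 2):
        score = cosine_similarity(vec_a, vec_b)
        if score > threshold:
            pairs.append((score, title_a, title_b))
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


if __name__ == "__main__":
    # Toy 2-dim vectors; the first two entries are exact duplicates.
    toy = [
        ("PubMed abstract", [1.0, 0.0]),
        ("PubMed abstract", [1.0, 0.0]),
        ("Swarm plan", [0.0, 1.0]),
        ("Sprint plan", [1.0, 1.0]),
    ]
    for score, a, b in similar_pairs(toy):
        print(f"| {score:.3f} | {a} | {b} |")
```

Pairs scoring exactly 1.000 are most likely duplicate imports of the same content rather than temporal versions, so a production pass would probably deduplicate before ranking.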

## Topic Clusters

| Tag | Count |
|-----|-------|
| dragnes | 17 |
| dermatology | 7 |
| medical | 7 |
| swarm | 6 |
| source:swarm-rvf | 6 |
| project:daxiom | 6 |
| pubmed | 6 |
| orchestrator | 5 |
| ham10000 | 5 |
| FDA | 4 |
| skin-cancer | 4 |
| implementation | 4 |
| 2026-03 | 4 |
| security | 4 |
| file:memory.cli-59065.rvf | 4 |
| namespace:decisions | 4 |
| adr | 4 |
| research | 4 |
| gap-fill | 4 |
| discovery | 4 |

## Key Findings

- Total medical knowledge memories: 71
- High-similarity pairs (>0.7): 390 (potential temporal versions or related content)
- Most common topic: dragnes (17 memories)
- Date range: 2026-03 to 2026-03