ADR-148: Brain Hypothesis Engine — Self-Improving Knowledge System with Gemini, DiskANN, and Auto-Experimentation
Status
Proposed
Date
2026-04-13
Context
The pi.ruv.io brain (10,300+ memories, 38M graph edges, LoRA epoch 41) stores and retrieves knowledge but cannot:
- Generate hypotheses from cross-domain connections
- Evaluate quality beyond embedding similarity (quality scores mostly 0.0)
- Filter noise from curated knowledge (random IEEE events alongside real patterns)
- Measure whether LoRA training actually improves search quality
The brain runs on Google Cloud Run (ruvbrain service, us-central1) backed by crates/mcp-brain-server/ (Rust/Axum). Current embedding: ruvllm::RlmEmbedder at 128-dim. Current index: flat HNSW.
Decision
Add four capabilities as additive layers — no changes to the running brain's core path. All new code is behind feature flags or in separate Cloud Run services.
Architecture: Four New Components
┌─────────────────────────────────────────────────────────┐
│  EXISTING (untouched)                                   │
│  mcp-brain-server: store, search, graph, drift, LoRA    │
│  Embedder: ruvllm::RlmEmbedder (128-dim)                │
│  Index: flat HNSW                                       │
└──────────────┬──────────────────────────────────────────┘
               │ (reads from, writes back to)
               v
┌─────────────────────────────────────────────────────────┐
│  NEW: Hypothesis Engine (separate Cloud Run service)    │
│                                                         │
│  1. HYPOTHESIS GENERATOR                                │
│     - Watches for new cross-domain graph edges          │
│     - Templates: "If X works in domain A,               │
│       then X should work in domain B"                   │
│     - Uses Gemini 2.5 Flash for hypothesis formulation  │
│       and experiment design                             │
│     - Stores hypotheses as "untested" memories          │
│                                                         │
│  2. QUALITY SCORER                                      │
│     - DiskANN index over all 10K+ memory embeddings     │
│     - PageRank via ruvector-solver ForwardPush          │
│     - Multi-signal: centrality + citations + verdicts   │
│       + contributor rep + temporal + surprise           │
│     - Updates quality field via brain API               │
│                                                         │
│  3. NOISE FILTER                                        │
│     - Ingestion gate: regex + embedding dedup           │
│     - Weekly cleanup: archive orphan low-quality        │
│     - Meta-mincut: ruvector-mincut on knowledge graph   │
│       to find noise partition                           │
│                                                         │
│  4. BENCHMARK SUITE                                     │
│     - 50 curated test queries with known-good answers   │
│     - Runs before/after each LoRA epoch                 │
│     - Tracks MRR, precision@5, cross-domain recall      │
│     - Auto-rollback if MRR drops > 5%                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
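The benchmark suite's metrics and rollback rule can be sketched as below. This is a minimal illustration rather than the planned implementation: the function names, the `u64` memory ids, and the reading of "drops > 5%" as a relative drop against the pre-epoch MRR are all assumptions.

```rust
/// Reciprocal rank of the first relevant result (0.0 if none is relevant).
fn reciprocal_rank(ranked: &[u64], relevant: &[u64]) -> f64 {
    ranked
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// Mean reciprocal rank over (ranked results, relevant ids) pairs.
fn mean_reciprocal_rank(runs: &[(Vec<u64>, Vec<u64>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs.iter().map(|(r, rel)| reciprocal_rank(r, rel)).sum();
    total / runs.len() as f64
}

/// Fraction of the top-k results that are relevant.
fn precision_at_k(ranked: &[u64], relevant: &[u64], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = ranked.iter().take(k).filter(|id| relevant.contains(id)).count();
    hits as f64 / k as f64
}

/// True when MRR fell by more than `max_drop` relative to the baseline.
/// ASSUMPTION: the 5% threshold is a relative drop, not absolute points.
fn should_rollback(mrr_before: f64, mrr_after: f64, max_drop: f64) -> bool {
    mrr_before > 0.0 && (mrr_before - mrr_after) / mrr_before > max_drop
}
```

Under this reading, an epoch that moves MRR from 0.80 to 0.75 (a 6.25% relative drop) would be rolled back, while 0.80 to 0.78 would not.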
Component Details
Gemini 2.5 Flash for Hypothesis Generation
Why Gemini, not local LLM:
- Hypothesis generation is infrequent (triggered by new cross-domain edges, ~10/day)
- Requires reasoning about domain transfer ("if mincut detects seizures, could it detect X?")
- Gemini 2.5 Flash: fast, cheap (~$0.15/1M input tokens), 1M context window
- Local RLM embedder stays for indexing (it's tuned to the corpus) — Gemini is for reasoning only
API integration:
// New module: crates/mcp-brain-server/src/hypothesis.rs
// Feature-gated: #[cfg(feature = "hypothesis")]
use google_generativeai::Client; // or raw REST via reqwest

// Returns a Result so the `?` on the Gemini call compiles; the boxed error
// type is a placeholder until the module defines its own.
async fn generate_hypothesis(
    gemini_client: &Client,
    edge: &CrossDomainEdge,
) -> Result<Hypothesis, Box<dyn std::error::Error + Send + Sync>> {
    let prompt = format!(
        "Given this cross-domain connection:\n\
         Domain A: {}\nDomain B: {}\nBridge concept: {}\n\n\
         Generate a testable hypothesis: if the pattern from domain A \
         works, what specific prediction does it make in domain B? \
         Include: hypothesis statement, test method, expected outcome, \
         null hypothesis, required data.",
        edge.domain_a, edge.domain_b, edge.bridge_concept
    );
    // Call Gemini 2.5 Flash
    let response = gemini_client.generate(&prompt).await?;
    // parse_hypothesis is assumed to return a Hypothesis directly
    Ok(parse_hypothesis(response))
}
Cost estimate: ~10 hypotheses/day × ~500 tokens each = ~5K tokens/day. At ~$0.15/1M input tokens that is ~$0.00075/day, i.e. under ~$0.001/day. Negligible.
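The arithmetic behind this estimate can be checked with a small helper. This is a sketch only: the function name is hypothetical, and it counts input tokens at the ~$0.15/1M rate quoted above. Output tokens are billed separately and are not modeled here.

```rust
/// Estimated daily input-token cost in USD.
/// ASSUMPTION: input tokens only; the per-million price is passed in by the caller.
fn daily_input_cost_usd(
    calls_per_day: u32,
    tokens_per_call: u32,
    usd_per_million_tokens: f64,
) -> f64 {
    let tokens = calls_per_day as f64 * tokens_per_call as f64;
    tokens * usd_per_million_tokens / 1_000_000.0
}
```

With 10 calls of 500 tokens at $0.15/1M this gives $0.00075/day. At the rate-limit ceiling of 10 calls/hour (240 calls/day) the same formula gives about $0.018/day.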
DiskANN for Scalable Quality Scoring
Why DiskANN, not current flat HNSW:
- Current HNSW is in-memory, fine for 10K memories
- At 100K+ memories (projected within months), memory pressure becomes real
- DiskANN stores the graph on SSD, loads only neighbors on demand
- Product Quantization (PQ) compresses vectors 4-8x for candidate filtering
- ruvector-diskann already implements Vamana graph + PQ (ADR-146)
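To make the PQ compression claim concrete, here is a minimal sketch of product-quantization encoding. It is not the ruvector-diskann API: the function names, the codebook layout, and the one-byte-per-subspace code size are assumptions for illustration. It also assumes a non-empty codebook list, a vector length divisible by the number of subspaces, and at most 256 centroids per subspace.

```rust
/// Squared Euclidean distance between two equal-length slices.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Encode `v` as one centroid index per subspace.
/// `codebooks[s]` holds the centroids for subspace `s`; each centroid has
/// length `v.len() / codebooks.len()`.
fn pq_encode(v: &[f32], codebooks: &[Vec<Vec<f32>>]) -> Vec<u8> {
    let sub_dim = v.len() / codebooks.len();
    v.chunks(sub_dim)
        .zip(codebooks)
        .map(|(chunk, centroids)| {
            let mut best = 0usize;
            let mut best_d = f32::INFINITY;
            for (i, c) in centroids.iter().enumerate() {
                let d = sq_dist(chunk, c);
                if d < best_d {
                    best = i;
                    best_d = d;
                }
            }
            best as u8
        })
        .collect()
}

/// Ratio of raw f32 storage (4 bytes per dimension) to PQ code storage
/// (one byte per subspace).
fn compression_ratio(dim: usize, num_subspaces: usize) -> f64 {
    (dim * 4) as f64 / num_subspaces as f64
}
```

For the 128-dim embeddings used here, 128 one-byte codes would give 4x and 64 codes would give 8x under this scheme. The actual subspace count used by ruvector-diskann is not stated in this ADR.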
Integration plan:
// New module: crates/mcp-brain-server/src/diskann_index.rs
// Feature-gated: #[cfg(feature = "diskann")]
use ruvector_diskann::{DiskAnnIndex, DiskAnnConfig};

pub struct HybridIndex {
    hnsw: HnswIndex,        // Existing, stays as primary for <50K
    diskann: DiskAnnIndex,  // New, activates at >50K memories
    threshold: usize,       // Switch point (default: 50_000)
}

impl HybridIndex {
    pub fn search(&self, query: &[f32], k: usize) -> Vec<(usize, f32)> {
        if self.hnsw.len() < self.threshold {
            self.hnsw.search(query, k)
        } else {
            self.diskann.search(query, k)
        }
    }
}
Benchmark plan: Run both HNSW and DiskANN on the current 10K corpus, measure:
- Recall@10 (should be >95% for both)
- Query latency (HNSW: ~1ms, DiskANN: ~5-10ms expected)
- Memory usage (HNSW: ~50MB, DiskANN: ~5MB + SSD)
- Index build time
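Recall@10 for this comparison can be computed against exact (brute-force) neighbours. The helper below is a minimal sketch; its name and the use of `usize` ids (matching the `search` return type above) are assumptions.

```rust
use std::collections::HashSet;

/// Fraction of the exact top-k neighbours that the approximate index also
/// returned in its own top-k.
fn recall_at_k(approx: &[usize], exact: &[usize], k: usize) -> f64 {
    let truth: HashSet<usize> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0;
    }
    let found: HashSet<usize> = approx.iter().take(k).copied().collect();
    truth.intersection(&found).count() as f64 / truth.len() as f64
}
```

Averaging this value over a fixed query set for each index gives the number to compare against the >95% target.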
Quality Scorer with ForwardPush PageRank
// crates/mcp-brain-server/src/quality.rs
// Returns a Result so the `?` on the solver call compiles; the boxed error
// type is a placeholder.
pub fn compute_quality_scores(
    brain: &Brain,
) -> Result<Vec<(MemoryId, f64)>, Box<dyn std::error::Error>> {
    // 1. Build CSR graph from memory edges
    let graph = brain.graph_to_csr();
    // 2. Run ForwardPush PageRank (sublinear, O(1/epsilon))
    let pr = ForwardPushSolver::new(0.85, 0.001);
    let pagerank = pr.solve(&graph)?;
    // Largest inbound-edge count, used to normalize citations into [0, 1]
    // (at least 1 to avoid dividing by zero on an edgeless graph)
    let max_citations = brain
        .memories()
        .map(|m| m.inbound_edge_count)
        .max()
        .unwrap_or(1)
        .max(1) as f64;
    // 3. Compute multi-signal quality
    let scores = brain.memories().map(|m| {
        let centrality = pagerank[m.id];
        let citations = m.inbound_edge_count as f64 / max_citations;
        let verdict = match m.verdict {
            Confirmed => 1.0,
            Refuted => -0.5,
            Untested => 0.0,
        };
        let surprise = 1.0 - m.max_similarity_to_existing;
        let temporal = recency_weight(m.created_at);
        let bridge = if m.crosses_domains { 0.3 } else { 0.0 };
        let quality = 0.25 * centrality
            + 0.20 * citations
            + 0.20 * verdict
            + 0.15 * surprise
            + 0.10 * temporal
            + 0.10 * bridge;
        (m.id, quality.clamp(0.0, 1.0))
    }).collect();
    Ok(scores)
}
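The weighting can be exercised in isolation with a standalone version of the formula. The `Signals` struct and `quality` function below are hypothetical stand-ins for the brain's real types, and every signal is assumed to be pre-normalized to [0, 1].

```rust
#[derive(Clone, Copy)]
enum Verdict {
    Confirmed,
    Refuted,
    Untested,
}

/// Hypothetical container for the per-memory signals used by the scorer.
struct Signals {
    centrality: f64,
    citations: f64,
    verdict: Verdict,
    surprise: f64,
    temporal: f64,
    crosses_domains: bool,
}

/// Weighted sum with the same weights as the scorer, clamped to [0, 1].
fn quality(s: &Signals) -> f64 {
    let verdict = match s.verdict {
        Verdict::Confirmed => 1.0,
        Verdict::Refuted => -0.5,
        Verdict::Untested => 0.0,
    };
    let bridge = if s.crosses_domains { 0.3 } else { 0.0 };
    let q = 0.25 * s.centrality
        + 0.20 * s.citations
        + 0.20 * verdict
        + 0.15 * s.surprise
        + 0.10 * s.temporal
        + 0.10 * bridge;
    q.clamp(0.0, 1.0)
}
```

Two properties follow from the weights. Because the bridge bonus is 0.3 at weight 0.10, the maximum reachable score is 0.93 rather than 1.0. A refuted memory with no other signal clamps to 0.0.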
Safety Constraints (don't break the running system)
- All new code is feature-gated. The existing mcp-brain-server binary is unchanged unless --features hypothesis,diskann,benchmark is explicitly enabled.
- Hypothesis engine runs as a SEPARATE Cloud Run service. It calls the brain's API; it doesn't modify the brain's process. If it crashes, the brain keeps running.
- DiskANN is a fallback, not a replacement. HNSW stays as primary for <50K memories. DiskANN only activates when memory count exceeds the threshold. Both can be queried in parallel for benchmark comparison.
- Quality scores are written to a NEW field (quality_v2). The existing quality field is untouched until v2 scores are validated.
- Noise filtering is archive-only. Memories are archived (moved to cold storage), never deleted. Full rollback possible.
- Benchmark auto-rollback. If LoRA epoch N+1 degrades MRR by >5%, the epoch is discarded and the EWC checkpoint is restored automatically.
- Gemini API key stored in gcloud secrets. Already available as GEMINI_API_KEY. Rate-limited to 10 calls/hour to avoid cost surprises.
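One way to enforce the 10 calls/hour cap is a sliding-window limiter in front of the Gemini client. The sketch below is an assumption about how that could look, not the planned implementation. The `RateLimiter` type is hypothetical, and the current time is passed in so the logic can be tested without sleeping.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window limiter: at most `max_calls` within any `window`.
struct RateLimiter {
    max_calls: usize,
    window: Duration,
    calls: VecDeque<Instant>,
}

impl RateLimiter {
    fn new(max_calls: usize, window: Duration) -> Self {
        Self { max_calls, window, calls: VecDeque::new() }
    }

    /// Returns true and records the call if it fits in the window ending at `now`.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Drop calls that have aged out of the window.
        while let Some(&oldest) = self.calls.front() {
            if now.duration_since(oldest) >= self.window {
                self.calls.pop_front();
            } else {
                break;
            }
        }
        if self.calls.len() < self.max_calls {
            self.calls.push_back(now);
            true
        } else {
            false
        }
    }
}
```

A caller would construct it as `RateLimiter::new(10, Duration::from_secs(3600))` and skip or defer hypothesis generation whenever `try_acquire(Instant::now())` returns false.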
Implementation Phases
| Phase | What | Risk | Timeline |
|---|---|---|---|
| P0: ADR + Branch | This document + feature branch | None | Done |
| P1: Benchmark suite | 50 test queries, MRR tracking | None (read-only) | 3 days |
| P2: Quality scorer | PageRank + multi-signal scoring | Low (writes to new field) | 1 week |
| P3: Noise filter | Ingestion gate + weekly cleanup | Low (archive-only) | 3 days |
| P4: DiskANN integration | Hybrid index behind feature flag | Low (fallback only) | 1 week |
| P5: Hypothesis engine | Gemini integration + auto-test | Medium (new service) | 2 weeks |
Total: ~5 weeks, phased. P1-P3 can run in parallel.
Consequences
Positive
- Brain evolves from "smart database" to "scientific reasoner"
- Quality scores become meaningful (currently mostly 0.0)
- Noise filtering reduces graph pollution
- LoRA training becomes measurable and rollback-safe
- DiskANN prepares for 100K+ memory scale
- Gemini hypothesis generation is the first step toward autonomous discovery
Negative
- New dependency: Google Gemini API (adds cost, ~$0.03/day estimated)
- DiskANN adds complexity to the index path
- Hypothesis engine needs curation — false hypotheses could pollute if not filtered
- More Cloud Run services to monitor
Risks
- Gemini may generate low-quality hypotheses → mitigated by verdict system (untested until confirmed)
- DiskANN recall may be lower than HNSW at small corpus → mitigated by hybrid approach with threshold
- Quality scoring may be gamed by circular citations → mitigated by PageRank damping
References
- ADR-146: DiskANN Vamana Implementation
- ADR-131: Consciousness Metrics Crate
- ADR-048: Sublinear Graph Attention
- Subramanya et al., "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search" (NeurIPS 2019)
- Google Gemini API: https://ai.google.dev/gemini-api
- ForwardPush PPR: Andersen, Chung, Lang 2006