mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-23 12:55:26 +00:00

rUv 38d93a6e8d feat: Add comprehensive dataset discovery framework for RuVector (#104)

* feat: Add comprehensive dataset discovery framework for RuVector

This commit introduces a powerful dataset discovery framework with
integrations for three high-impact public data sources:

## Core Framework (examples/data/framework/)
- DataIngester: Streaming ingestion with batching and deduplication
- CoherenceEngine: Min-cut based coherence signal computation
- DiscoveryEngine: Pattern detection for emerging structures

## OpenAlex Integration (examples/data/openalex/)
- Research frontier radar: Detect emerging fields via boundary motion
- Cross-domain bridge detection: Find connector subgraphs
- Topic graph construction from citation networks
- Full API client with cursor-based pagination

## Climate Integration (examples/data/climate/)
- NOAA GHCN and NASA Earthdata clients
- Sensor network graph construction
- Regime shift detection using min-cut coherence breaks
- Time series vectorization for similarity search
- Seasonal decomposition analysis

## SEC EDGAR Integration (examples/data/edgar/)
- XBRL financial statement parsing
- Peer network construction
- Coherence watch: Detect fundamental vs narrative divergence
- Filing analysis with sentiment and risk extraction
- Cross-company contagion detection

Each integration leverages RuVector's unique capabilities:
- Vector memory for semantic similarity
- Graph structures for relationship modeling
- Dynamic min-cut for coherence signal computation
- Time series embeddings for pattern matching

Discovery thesis: Detect emerging patterns before they have names,
find non-obvious cross-domain bridges, and map causality chains.

* feat: Add working discovery examples for climate and financial data

- Fix borrow checker issues in coherence analysis modules
- Create standalone workspace for data examples
- Add regime_detector.rs for climate network coherence analysis
- Add coherence_watch.rs for SEC EDGAR narrative-fundamental divergence
- Add frontier_radar.rs template for OpenAlex research discovery
- Update Cargo.toml dependencies for example executability
- Add rand dev-dependency for demo data generation

Examples successfully detect:
- Climate regime shifts via min-cut coherence analysis
- Cross-regional teleconnection patterns
- Fundamental vs narrative divergence in SEC filings
- Sector fragmentation signals in financial data

* feat: Add working discovery examples for climate and financial data

- Add RuVector-native discovery engine with Stoer-Wagner min-cut
- Implement cross-domain pattern detection (climate ↔ finance)
- Add cosine similarity for vector-based semantic matching
- Create cross_domain_discovery example demonstrating:
  - 42% cross-domain edge connectivity
  - Bridge formation detection with 0.73-0.76 confidence
  - Climate and finance correlation hypothesis generation

* perf: Add optimized discovery engine with SIMD and parallel processing

Performance improvements:
- 8.84x speedup for vector insertion via parallel batching
- 2.91x SIMD speedup for cosine similarity (chunked + AVX2)
- Incremental graph updates with adjacency caching
- Early termination in Stoer-Wagner min-cut

Statistical analysis features:
- P-value computation for pattern significance
- Effect size (Cohen's d) calculation
- 95% confidence intervals
- Granger-style temporal causality detection

Benchmark results (248 vectors, 3 domains):
- Cross-domain edges: 34.9% of total graph
- Domain coherence: Climate 0.74, Finance 0.94, Research 0.97
- Detected climate-finance temporal correlations

* feat: Add discovery hunter and comprehensive README tutorial

New features:
- Discovery hunter example with multi-phase pattern detection
- Climate extremes, financial stress, and research data generation
- Cross-domain hypothesis generation
- Anomaly injection testing

Documentation:
- Detailed README with step-by-step tutorial
- API reference for OptimizedConfig and patterns
- Performance benchmarks and best practices
- Troubleshooting guide

* feat: Complete discovery framework with all features

HNSW Indexing (754 lines):
- O(log n) approximate nearest neighbor search
- Configurable M, ef_construction parameters
- Cosine, Euclidean, Manhattan distance metrics
- Batch insertion support

API Clients (888 lines):
- OpenAlex: academic works, authors, topics
- NOAA: climate observations
- SEC EDGAR: company filings
- Rate limiting and retry logic

Persistence (638 lines):
- Save/load engine state and patterns
- Gzip compression (3-10x size reduction)
- Incremental pattern appending

CLI Tool (1,109 lines):
- discover, benchmark, analyze, export commands
- Colored terminal output
- JSON and human-readable formats

Streaming (570 lines):
- Async stream processing
- Sliding and tumbling windows
- Real-time pattern detection
- Backpressure handling

Tests (30 unit tests):
- Stoer-Wagner min-cut verification
- SIMD cosine similarity accuracy
- Statistical significance
- Granger causality
- Cross-domain patterns

Benchmarks:
- CLI: 176 vectors/sec @ 2000 vectors
- SIMD: 6.82M ops/sec (2.06x speedup)
- Vector insertion: 1.61x speedup
- Total: 44.74ms for 248 vectors

* feat: Add visualization, export, forecasting, and real data discovery

Visualization (555 lines):
- ASCII graph rendering with box-drawing characters
- Domain-based ANSI coloring (Climate=blue, Finance=green, Research=yellow)
- Coherence timeline sparklines
- Pattern summary dashboard
- Domain connectivity matrix

Export (650 lines):
- GraphML export for Gephi/Cytoscape
- DOT export for Graphviz
- CSV export for patterns and coherence history
- Filtered export by domain, weight, time range
- Batch export with README generation

Forecasting (525 lines):
- Holt's double exponential smoothing for trend
- CUSUM-based regime change detection (70.67% accuracy)
- Cross-domain correlation forecasting (r=1.000)
- Prediction intervals (95% CI)
- Anomaly probability scoring

Real Data Discovery:
- Fetched 80 actual papers from OpenAlex API
- Topics: climate risk, stranded assets, carbon pricing, physical risk, transition risk
- Built coherence graph: 592 nodes, 1049 edges
- Average min-cut: 185.76 (well-connected research cluster)

* feat: Add medical, real-time, and knowledge graph data sources

New API Clients:
- PubMed E-utilities for medical literature search (NCBI)
- ClinicalTrials.gov v2 API for clinical study data
- FDA OpenFDA for drug adverse events and recalls
- Wikipedia article search and extraction
- Wikidata SPARQL queries for structured knowledge

Real-time Features:
- RSS/Atom feed parsing with deduplication
- News aggregator with multiple source support
- WebSocket and REST polling infrastructure
- Event streaming with configurable windows

Examples:
- medical_discovery: PubMed + ClinicalTrials + FDA integration
- multi_domain_discovery: Climate-health-finance triangulation
- wiki_discovery: Wikipedia/Wikidata knowledge graph
- realtime_feeds: News feed aggregation demo

Tested across 70+ unit tests with all domains integrated.

* feat: Add economic, patent, and ArXiv data source clients

New API Clients:
- FredClient: Federal Reserve economic indicators (GDP, CPI, unemployment)
- WorldBankClient: Global development indicators and climate data
- AlphaVantageClient: Stock market daily prices
- ArxivClient: Scientific preprint search with category and date filters
- UsptoPatentClient: USPTO patent search by keyword, assignee, CPC class
- EpoClient: Placeholder for European patent search

New Domain:
- Domain::Economic for economic/financial indicator data

Updated Exports:
- Domain colors and shapes for Economic in visualization and export

Examples:
- economic_discovery: FRED + World Bank integration demo
- arxiv_discovery: AI/ML/Climate paper search demo
- patent_discovery: Climate tech and AI patent search demo

All 85 tests passing. APIs tested with live endpoints.

* feat: Add Semantic Scholar, bioRxiv/medRxiv, and CrossRef research clients

New Research API Clients:
- SemanticScholarClient: Citation graph analysis, paper search, author lookup
  - Methods: search_papers, get_citations, get_references, search_by_field
  - Builds citation networks for graph analysis

- BiorxivClient: Life sciences preprints
  - Methods: search_recent, search_by_category (neuroscience, genomics, etc.)
  - Automatic conversion to Domain::Research

- MedrxivClient: Medical preprints
  - Methods: search_covid, search_clinical, search_by_date_range
  - Automatic conversion to Domain::Medical

- CrossRefClient: DOI metadata and scholarly communication
  - Methods: search_works, get_work, search_by_funder, get_citations
  - Polite pool support for better rate limits

All clients include:
- Rate limiting respecting API guidelines
- Retry logic with exponential backoff
- SemanticVector conversion with rich metadata
- Comprehensive unit tests

Examples:
- biorxiv_discovery: Fetch neuroscience and clinical research
- crossref_demo: Search publications, funders, datasets

Total: 104 tests passing, ~2,500 new lines of code

* feat: Add MCP server with STDIO/SSE transport and optimized discovery

MCP Server Implementation (mcp_server.rs):
- JSON-RPC 2.0 protocol with MCP 2024-11-05 compliance
- Dual transport: STDIO for CLI, SSE for HTTP streaming
- 22 discovery tools exposing all data sources:
  - Research: OpenAlex, ArXiv, Semantic Scholar, CrossRef, bioRxiv, medRxiv
  - Medical: PubMed, ClinicalTrials.gov, FDA
  - Economic: FRED, World Bank
  - Climate: NOAA
  - Knowledge: Wikipedia, Wikidata SPARQL
  - Discovery: Multi-source, coherence analysis, pattern detection
- Resources: discovery://patterns, discovery://graph, discovery://history
- Pre-built prompts: cross_domain_discovery, citation_analysis, trend_detection

Binary Entry Point (bin/mcp_discovery.rs):
- CLI arguments with clap
- Configurable discovery parameters
- STDIO/SSE mode selection

Optimized Discovery Runner:
- Parallel data fetching with tokio::join!
- SIMD-accelerated vector operations (1.1M comparisons/sec)
- 6-phase discovery pipeline with benchmarking
- Statistical significance testing (p-values)
- Cross-domain correlation analysis
- CSV export and hypothesis report generation

Performance Results:
- 180 vectors from 3 sources in 7.5s
- 686 edges computed in 8ms
- SIMD throughput: 1,122,216 comparisons/sec

All 106 tests passing.

* feat: Add space, genomics, and physics data source clients

Add exotic data source integrations:
- Space clients: NASA (APOD, NEO, Mars, DONKI), Exoplanet Archive, SpaceX API, TNS Astronomy
- Genomics clients: NCBI (genes, proteins, SNPs), UniProt, Ensembl, GWAS Catalog
- Physics clients: USGS Earthquakes, CERN Open Data, Argo Ocean, Materials Project

New domains: Space, Genomics, Physics, Seismic, Ocean

All 106 tests passing, SIMD benchmark: 208k comparisons/sec

* chore: Update export/visualization and output files

* docs: Add API client inventory and reference documentation

* fix: Update API clients for 2025 endpoint changes

- ArXiv: Switch from HTTP to HTTPS (export.arxiv.org)
- USPTO: Migrate to PatentSearch API v2 (search.patentsview.org)
  - Legacy API (api.patentsview.org) discontinued May 2025
  - Updated query format from POST to GET
  - Note: May require API authentication
- FRED: Require API key (mandatory as of 2025)
  - Added error handling for missing API key
  - Added response error field parsing

All tests passing, ArXiv discovery confirmed working

* feat: Implement comprehensive 2025 API client library (11,810 lines)

Add 7 new API client modules implementing 35+ data sources:

Academic APIs (1,328 lines):
- OpenAlexClient, CoreClient, EricClient, UnpaywallClient

Finance APIs (1,517 lines):
- FinnhubClient, TwelveDataClient, CoinGeckoClient, EcbClient, BlsClient

Geospatial APIs (1,250 lines):
- NominatimClient, OverpassClient, GeonamesClient, OpenElevationClient

News & Social APIs (1,606 lines):
- HackerNewsClient, GuardianClient, NewsDataClient, RedditClient

Government APIs (2,354 lines):
- CensusClient, DataGovClient, EuOpenDataClient, UkGovClient
- WorldBankGovClient, UNDataClient

AI/ML APIs (2,035 lines):
- HuggingFaceClient, OllamaClient, ReplicateClient
- TogetherAiClient, PapersWithCodeClient

Transportation APIs (1,720 lines):
- GtfsClient, MobilityDatabaseClient
- OpenRouteServiceClient, OpenChargeMapClient

All clients include:
- Async/await with tokio and reqwest
- Mock data fallback for testing without API keys
- Rate limiting with configurable delays
- SemanticVector conversion for RuVector integration
- Comprehensive unit tests (252 total tests passing)
- Full error handling with FrameworkError

* docs: Add API client documentation for new implementations

Add documentation for:
- Geospatial clients (Nominatim, Overpass, Geonames, OpenElevation)
- ML clients (HuggingFace, Ollama, Replicate, Together, PapersWithCode)
- News clients (HackerNews, Guardian, NewsData, Reddit)
- Finance clients implementation notes

* feat: Implement dynamic min-cut tracking system (SODA 2026)

Based on El-Hayek, Henzinger, Li (SODA 2026) subpolynomial dynamic min-cut algorithm.

Core Components (2,626 lines):
- dynamic_mincut.rs (1,579 lines): EulerTourTree, DynamicCutWatcher, LocalMinCutProcedure
- cut_aware_hnsw.rs (1,047 lines): CutAwareHNSW, CoherenceZones, CutGatedSearch

Key Features:
- O(log n) connectivity queries via Euler-tour trees
- n^{o(1)} update time when λ ≤ 2^{(log n)^{3/4}} (vs O(n³) Stoer-Wagner)
- Cut-gated HNSW search that respects coherence boundaries
- Real-time cut monitoring with threshold-based deep evaluation
- Thread-safe structures with Arc<RwLock>

Performance (benchmarked):
- 75x speedup over periodic recomputation
- O(1) min-cut queries vs O(n³) recompute
- ~25µs per edge update

Tests & Benchmarks:
- 36+ unit tests across both modules
- 5 benchmark suites comparing periodic vs dynamic
- Integration with existing OptimizedDiscoveryEngine

This enables real-time coherence tracking in RuVector, transforming
min-cut from an expensive periodic computation to a maintained invariant.

---------

Co-authored-by: Claude <noreply@anthropic.com>

2026-01-04 14:36:41 -05:00


RuVector Data Framework - API Clients Comprehensive Inventory

Overview

Complete analysis of 12 client modules providing access to 30+ data sources across 10 domains.

Total Clients Analyzed: 30
Total Public Methods: 150+
Domain Coverage: News, Social, Research, Economic, Patent, Space, Genomics, Physics, Medical, Knowledge Graph
Data Format: All convert to SemanticVector or DataRecord with embeddings

News API Client

Endpoint: https://newsapi.org/v2
Authentication: Required (API key)
Rate Limit: 100ms delay (configurable)

Methods (4):

new(api_key: String) - Initialize client
search_articles(query, from_date, to_date, language) - Search news articles
get_top_headlines(category, country) - Get top headlines by category/country
get_sources(category, language, country) - List available news sources

Rate Limiting:

const DEFAULT_RATE_LIMIT_DELAY_MS: u64 = 100;
rate_limit_delay: Duration

Data Transformation:

NewsArticle -> SemanticVector {
    id: format!("NEWS:{}", hash(url)),
    embedding: embed_text(title + description + content),
    domain: Domain::News,
    metadata: {title, author, source, url, published_at, description}
}
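
The `id` field above joins a `NEWS:` prefix with a hash of the article URL. The sketch below shows one way to produce such an id. The inventory does not say which hash function `hash(url)` uses, so the standard library's `DefaultHasher`, the hex formatting, and the helper name `news_id` are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: build a `NEWS:<hash>` identifier from an article URL.
/// `DefaultHasher::new()` starts from fixed keys, so the same URL yields the
/// same id on every call (the algorithm may change between Rust releases).
fn news_id(url: &str) -> String {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    // 16 hex digits = the full 64-bit hash, zero-padded.
    format!("NEWS:{:016x}", hasher.finish())
}
```

Hashing the URL rather than the title keeps ids stable when a publisher edits a headline, which helps deduplication across repeated fetches.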

Error Handling:

Retry on TOO_MANY_REQUESTS (max 3 retries)
Linear backoff: RETRY_DELAY_MS * retries
Network error wrapping via FrameworkError::Network

Reddit Client

Endpoint: https://oauth.reddit.com
Authentication: Required (client_id, client_secret)
Rate Limit: 1000ms delay (Reddit: 60 req/min)

Methods (5):

new(client_id, client_secret) - OAuth authentication
search_posts(query, subreddit, limit) - Search posts in subreddit
get_hot_posts(subreddit, limit) - Get hot posts
get_top_posts(subreddit, time_filter, limit) - Get top posts (hour/day/week/month/year/all)
get_post_comments(post_id, limit) - Get post comments
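
The `time_filter` argument of `get_top_posts` accepts one of six values. The inventory does not state its Rust type, so the enum below is a hypothetical typed wrapper, sketched to show how the allowed values could be enforced at compile time instead of passing free-form strings.

```rust
/// Hypothetical typed form of the `time_filter` argument.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimeFilter {
    Hour,
    Day,
    Week,
    Month,
    Year,
    All,
}

impl TimeFilter {
    /// Value to place in the request query string.
    fn as_str(self) -> &'static str {
        match self {
            TimeFilter::Hour => "hour",
            TimeFilter::Day => "day",
            TimeFilter::Week => "week",
            TimeFilter::Month => "month",
            TimeFilter::Year => "year",
            TimeFilter::All => "all",
        }
    }

    /// Parse user input; anything outside the six listed values is rejected.
    fn parse(s: &str) -> Option<Self> {
        match s {
            "hour" => Some(TimeFilter::Hour),
            "day" => Some(TimeFilter::Day),
            "week" => Some(TimeFilter::Week),
            "month" => Some(TimeFilter::Month),
            "year" => Some(TimeFilter::Year),
            "all" => Some(TimeFilter::All),
            _ => None,
        }
    }
}
```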

Rate Limiting:

const REDDIT_RATE_LIMIT_MS: u64 = 1000; // 60 req/min

Data Transformation:

RedditPost -> SemanticVector {
    id: format!("REDDIT:{}", post_id),
    embedding: embed_text(title + selftext),
    domain: Domain::Social,
    metadata: {subreddit, author, score, num_comments, created_utc, url}
}

GitHub Client

Endpoint: https://api.github.com
Authentication: Optional (higher rate limits with token)
Rate Limit: 1000ms delay (5000/hour with token, 60/hour without)

Methods (4):

new(token: Option<String>) - Initialize with optional token
search_repositories(query, sort, limit) - Search repos
get_repository_issues(owner, repo, state) - Get issues (open/closed/all)
search_code(query, language, limit) - Search code

Rate Limiting:

const GITHUB_RATE_LIMIT_MS: u64 = 1000;
rate_limit_delay: Duration

HackerNews Client

Endpoint: https://hacker-news.firebaseio.com/v0
Authentication: Not required
Rate Limit: 100ms delay

Methods (4):

new() - Initialize client
get_top_stories(limit) - Get top stories
get_new_stories(limit) - Get newest stories
get_best_stories(limit) - Get best stories

Data Transformation:

HnStory -> SemanticVector {
    id: format!("HN:{}", story_id),
    embedding: embed_text(title + text),
    domain: Domain::News,
    metadata: {title, url, score, descendants (comments), by (author)}
}

2. economic_clients.rs - Economic & Financial Data

World Bank Client

Endpoint: https://api.worldbank.org/v2
Authentication: Not required
Rate Limit: 250ms delay

Methods (3):

new() - Initialize client
get_indicator_data(indicator, country, start_year, end_year) - Get economic indicators
search_indicators(query) - Search available indicators

Common Indicators:

NY.GDP.MKTP.CD - GDP (current US$)
SP.POP.TOTL - Population
NY.GDP.PCAP.CD - GDP per capita
FP.CPI.TOTL.ZG - Inflation rate

Data Transformation:

WorldBankIndicator -> SemanticVector {
    id: format!("WB:{}:{}:{}", country, indicator, date),
    embedding: embed_text(indicator_name + country),
    domain: Domain::Economic,
    metadata: {indicator, country, value, date, country_name, indicator_name}
}
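
The composite id above encodes country, indicator, and date in one string. A small sketch of building and splitting such an id follows. Both helper names are hypothetical, and the parser assumes none of the three fields contains a colon, which holds for the indicator codes listed above.

```rust
/// Hypothetical helper mirroring the `WB:{country}:{indicator}:{date}` layout.
fn world_bank_id(country: &str, indicator: &str, date: &str) -> String {
    format!("WB:{}:{}:{}", country, indicator, date)
}

/// Split an id back into (country, indicator, date).
/// Indicator codes such as `NY.GDP.MKTP.CD` contain dots but no colons,
/// so splitting on ':' is unambiguous for them.
fn parse_world_bank_id(id: &str) -> Option<(&str, &str, &str)> {
    let mut parts = id.split(':');
    match (parts.next(), parts.next(), parts.next(), parts.next(), parts.next()) {
        // Exactly four segments, the first being the "WB" prefix.
        (Some("WB"), Some(country), Some(indicator), Some(date), None) => {
            Some((country, indicator, date))
        }
        _ => None,
    }
}
```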

FRED Client (Federal Reserve Economic Data)

Endpoint: https://api.stlouisfed.org/fred
Authentication: Required (API key from research.stlouisfed.org)
Rate Limit: 200ms delay

Methods (3):

new(api_key) - Initialize with FRED API key
get_series(series_id, start_date, end_date) - Get time series data
search_series(query) - Search available series

Popular Series:

GDP - Gross Domestic Product
UNRATE - Unemployment Rate
CPIAUCSL - Consumer Price Index
DFF - Federal Funds Rate

Alpha Vantage Client

Endpoint: https://www.alphavantage.co/query
Authentication: Required (free tier: 5 req/min, 500/day)
Rate Limit: 12000ms delay (5 req/min)

Methods (4):

new(api_key) - Initialize client
get_stock_price(symbol) - Real-time stock price
get_time_series_daily(symbol, days) - Historical daily prices
get_forex_rate(from_currency, to_currency) - FX rates

IMF Client (International Monetary Fund)

Endpoint: https://www.imf.org/external/datamapper/api/v1
Authentication: Not required
Rate Limit: 500ms delay

Methods (2):

new() - Initialize client
get_indicator(indicator_code, countries) - Get IMF indicators

3. patent_clients.rs - Patent Data

USPTO Client (US Patent Office)

Endpoint: https://developer.uspto.gov/ibd-api/v1
Authentication: Not required
Rate Limit: 500ms delay

Methods (3):

new() - Initialize client
search_patents(query, start_date, end_date) - Search patents
get_patent(patent_number) - Get specific patent

EPO Client (European Patent Office)

Endpoint: https://ops.epo.org/3.2/rest-services
Authentication: Required (OAuth2)
Rate Limit: 1000ms delay

Methods (3):

new(consumer_key, consumer_secret) - OAuth2 authentication
search_patents(query) - Search European patents
get_patent_details(patent_number) - Get patent details

Google Patents Client

Endpoint: https://patents.google.com
Authentication: Not required
Rate Limit: 1000ms delay (conservative)

Methods (2):

new() - Initialize client
search_patents(query, max_results) - Search patents

4. arxiv_client.rs - Research Papers

ArXiv Client

Endpoint: https://export.arxiv.org/api/query
Authentication: Not required
Rate Limit: 3000ms delay (max 1 req/3sec per ArXiv guidelines)

Methods (4):

new() - Initialize client
search(query, max_results) - Search papers by query
search_by_category(category, max_results) - Search by category (cs.AI, physics.gen-ph, etc.)
get_paper(arxiv_id) - Get specific paper by ID

Categories Supported:

cs.AI - Artificial Intelligence
cs.LG - Machine Learning
physics.gen-ph - General Physics
math.CO - Combinatorics
q-bio.GN - Genomics
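
As a rough illustration of what `search_by_category` has to send, the sketch below builds an arXiv API query URL for one of the categories listed above. `search_query`, `start`, and `max_results` are arXiv API query parameters and `cat:` is its category field prefix. The helper name is hypothetical, and the HTTPS base URL follows the endpoint change recorded in the commit notes.

```rust
/// Hypothetical helper: build a category search URL for the arXiv API.
/// Category codes such as `cs.AI` or `q-bio.GN` contain only letters, dots
/// and hyphens, so no percent-encoding is needed for them.
fn arxiv_category_url(category: &str, max_results: usize) -> String {
    format!(
        "https://export.arxiv.org/api/query?search_query=cat:{}&start=0&max_results={}",
        category, max_results
    )
}
```

A free-text `search(query, ...)` would need to percent-encode the query before interpolating it, which this sketch deliberately leaves out.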

Data Transformation:

ArxivEntry -> SemanticVector {
    id: format!("ARXIV:{}", arxiv_id),
    embedding: embed_text(title + summary),
    domain: Domain::Research,
    metadata: {arxiv_id, title, summary, authors, published, updated, category, pdf_url}
}
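
The `embed_text` calls in these transformations rely on an embedder that this inventory's Embedding Pattern section describes as deterministic and hash-based. The internals of `SimpleEmbedder` are not shown in the inventory, so the following is only a sketch of that general technique (token hashing into a fixed number of buckets, then L2 normalisation). The type name `HashEmbedder` and the choice of FNV-1a are assumptions.

```rust
/// Illustrative stand-in for a deterministic hash-based embedder.
struct HashEmbedder {
    dimension: usize,
}

impl HashEmbedder {
    fn new(dimension: usize) -> Self {
        assert!(dimension > 0, "dimension must be positive");
        Self { dimension }
    }

    /// FNV-1a over the token bytes: seedless, so output never varies by run.
    fn fnv1a(token: &str) -> u64 {
        let mut hash: u64 = 0xcbf29ce484222325;
        for byte in token.bytes() {
            hash ^= byte as u64;
            hash = hash.wrapping_mul(0x100000001b3);
        }
        hash
    }

    /// Count lowercased tokens per hash bucket, then scale to unit length.
    fn embed_text(&self, text: &str) -> Vec<f32> {
        let mut v = vec![0.0f32; self.dimension];
        for token in text.split_whitespace() {
            let idx = (Self::fnv1a(&token.to_lowercase()) % self.dimension as u64) as usize;
            v[idx] += 1.0;
        }
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in &mut v {
                *x /= norm;
            }
        }
        v
    }
}
```

Unit-length output means cosine similarity reduces to a dot product. Vectors of this kind capture token overlap only, not meaning, so they suit demos and tests better than semantic search.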

5. semantic_scholar.rs - Academic Papers

Semantic Scholar Client

Endpoint: https://api.semanticscholar.org/graph/v1
Authentication: Optional (API key for higher limits)
Rate Limit:

Without key: 1000ms (100 req/5min)
With key: 100ms (1000 req/5min)

Methods (6):

new(api_key: Option<String>) - Initialize client
search_papers(query, limit) - Search papers
get_paper(paper_id) - Get paper by S2 ID or DOI
get_paper_citations(paper_id, limit) - Get citing papers
get_paper_references(paper_id, limit) - Get referenced papers
search_authors(query, limit) - Search authors

Data Transformation:

S2Paper -> SemanticVector {
    id: format!("S2:{}", paper_id),
    embedding: embed_text(title + abstract),
    domain: Domain::Research,
    metadata: {
        paper_id, title, abstract, authors, year,
        citation_count, reference_count, fields_of_study,
        venue, doi, arxiv_id, pubmed_id
    }
}

6. biorxiv_client.rs - Biomedical Preprints

bioRxiv Client

Endpoint: https://api.biorxiv.org/details/biorxiv
Authentication: Not required
Rate Limit: 500ms delay

Methods (4):

new() - Initialize client
search_preprints(query, days_back) - Search preprints
get_preprint(doi) - Get preprint by DOI
get_recent(days, limit) - Get recent preprints

medRxiv Client

Endpoint: https://api.biorxiv.org/details/medrxiv
Authentication: Not required
Rate Limit: 500ms delay

Methods (4):

Same as bioRxiv but for medical preprints

Data Transformation:

BiorxivPreprint -> SemanticVector {
    id: format!("BIORXIV:{}", doi),
    embedding: embed_text(title + abstract),
    domain: Domain::Research,
    metadata: {doi, title, authors, date, category, version, abstract}
}

7. crossref_client.rs - DOI Registry

CrossRef Client

Endpoint: https://api.crossref.org/works
Authentication: Not required (polite pool with email recommended)
Rate Limit: 200ms delay

Methods (5):

new(mailto: Option<String>) - Initialize with optional email
search_works(query, limit) - Search scholarly works
get_work(doi) - Get work by DOI
get_journal_articles(issn, limit) - Get articles from journal
search_by_type(work_type, query, limit) - Search by type (journal-article, book-chapter, etc.)

Work Types:

journal-article
book-chapter
proceedings-article
posted-content
dataset

8. space_clients.rs - Space & Astronomy

NASA APOD Client (Astronomy Picture of the Day)

Endpoint: https://api.nasa.gov/planetary/apod
Authentication: API key (DEMO_KEY for testing)
Rate Limit: 1000ms delay

Methods (3):

new(api_key: Option<String>) - Use DEMO_KEY if none provided
get_today() - Get today's APOD
get_date(date) - Get APOD for specific date

SpaceX Launch Client

Endpoint: https://api.spacexdata.com/v4
Authentication: Not required
Rate Limit: 500ms delay

Methods (4):

new() - Initialize client
get_latest_launch() - Get most recent launch
get_upcoming_launches(limit) - Get upcoming launches
get_past_launches(limit) - Get historical launches

SIMBAD Astronomical Database Client

Endpoint: https://simbad.cds.unistra.fr/simbad/sim-tap
Authentication: Not required
Rate Limit: 1000ms delay

Methods (3):

new() - Initialize client
search_objects(query) - Search astronomical objects
query_region(ra, dec, radius) - Search by sky coordinates

9. genomics_clients.rs - Genomics & Proteomics

NCBI Gene Client

Endpoint: https://eutils.ncbi.nlm.nih.gov/entrez/eutils
Authentication: Optional (API key for higher rate limits)
Rate Limit:

Without key: 334ms (~3 req/sec)
With key: 100ms (10 req/sec)
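
The two delays above depend only on whether an API key was supplied. A minimal sketch of that selection follows. The function name is hypothetical, and treating an empty string as "no key" is an assumption rather than documented behaviour.

```rust
use std::time::Duration;

/// Pick the per-request delay from the presence of an API key:
/// 100 ms with a key (10 req/sec), 334 ms without (~3 req/sec).
fn ncbi_rate_limit_delay(api_key: Option<&str>) -> Duration {
    match api_key {
        Some(key) if !key.is_empty() => Duration::from_millis(100),
        _ => Duration::from_millis(334),
    }
}
```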

Methods (4):

new(api_key: Option<String>) - Initialize client
search_genes(query, organism, max_results) - Search genes
get_gene(gene_id) - Get gene details by ID
get_gene_summary(gene_id) - Get gene summary

Ensembl Client

Endpoint: https://rest.ensembl.org
Authentication: Not required
Rate Limit: 200ms delay (15 req/sec limit)

Methods (5):

new() - Initialize client
search_genes(query, species) - Search genes in species
get_sequence(gene_id) - Get gene sequence
get_homology(gene_id) - Get homologous genes across species
get_variants(gene_id) - Get genetic variants

UniProt Client

Endpoint: https://rest.uniprot.org
Authentication: Not required
Rate Limit: 200ms delay

Methods (4):

new() - Initialize client
search_proteins(query, limit) - Search proteins
get_protein(accession) - Get protein by accession
get_protein_features(accession) - Get protein features

PDB Client (Protein Data Bank)

Endpoint: https://search.rcsb.org/rcsbsearch/v2/query
Authentication: Not required
Rate Limit: 500ms delay

Methods (3):

new() - Initialize client
search_structures(query, limit) - Search protein structures
get_structure(pdb_id) - Get structure by PDB ID

10. physics_clients.rs - Physics & Earth Science

USGS Earthquake Client

Endpoint: https://earthquake.usgs.gov/fdsnws/event/1
Authentication: Not required
Rate Limit: 200ms delay (~5 req/sec)

Methods (5):

new() - Initialize client
get_recent(min_magnitude, days) - Recent earthquakes
search_by_region(lat, lon, radius_km, days) - Regional search
get_significant(days) - Significant earthquakes (mag ≥6.0 or sig ≥600)
get_by_magnitude_range(min, max, days) - Magnitude range

Data Transformation:

UsgsEarthquake -> SemanticVector {
    id: format!("USGS:{}", earthquake_id),
    embedding: embed_text("Magnitude {mag} earthquake at {place}"),
    domain: Domain::Seismic,
    metadata: {
        magnitude, place, latitude, longitude, depth_km,
        tsunami, significance, status, alert
    }
}
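
Two small rules from this client can be written down directly: the text template that feeds the embedder, and the significance threshold quoted for `get_significant`. The helper names below are hypothetical, and the default number formatting for `{mag}` is an assumption (the real client may round differently).

```rust
/// Text passed to the embedder, following the
/// "Magnitude {mag} earthquake at {place}" template.
fn earthquake_text(magnitude: f64, place: &str) -> String {
    format!("Magnitude {} earthquake at {}", magnitude, place)
}

/// Significance rule quoted for `get_significant`: mag >= 6.0 or sig >= 600.
fn is_significant(magnitude: f64, significance: u32) -> bool {
    magnitude >= 6.0 || significance >= 600
}
```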

CERN Open Data Client

Endpoint: https://opendata.cern.ch/api/records
Authentication: Not required
Rate Limit: 500ms delay

Methods (4):

new() - Initialize client
search_datasets(query) - Search LHC datasets
get_dataset(recid) - Get dataset by record ID
search_by_experiment(experiment) - Search by experiment (CMS, ATLAS, LHCb, ALICE)

Data Transformation:

CernRecord -> SemanticVector {
    id: format!("CERN:{}", recid),
    embedding: embed_text(title + description + experiment),
    domain: Domain::Physics,
    metadata: {
        recid, title, experiment, collision_energy,
        collision_type, data_type
    }
}

Argo Ocean Data Client

Endpoint: https://data-argo.ifremer.fr
Authentication: Not required
Rate Limit: 300ms delay (~3 req/sec)

Methods (5):

new() - Initialize client
get_recent_profiles(days) - Recent ocean profiles
search_by_region(lat, lon, radius_km) - Regional ocean data
get_temperature_profiles() - Temperature-focused profiles
create_sample_profiles(count) - Generate sample data for testing

Materials Project Client

Endpoint: https://api.materialsproject.org
Authentication: Required (API key from materialsproject.org)
Rate Limit: 1000ms delay (1 req/sec for free tier)

Methods (4):

new(api_key) - Initialize with API key
search_materials(formula) - Search by chemical formula (Si, Fe2O3, LiFePO4)
get_material(material_id) - Get material by MP ID (mp-149)
search_by_property(property, min, max) - Search by property range (band_gap, density)

11. wiki_clients.rs - Knowledge Graphs

Wikipedia Client

Endpoint: https://{lang}.wikipedia.org/w/api.php
Authentication: Not required
Rate Limit: 100ms delay

Methods (5):

new(language) - Initialize for language (en, de, fr, etc.)
search(query, limit) - Search articles (max 500)
get_article(title) - Get article by title
get_categories(title) - Get article categories
get_links(title) - Get outgoing links
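
The endpoint for this client is a template with a `{lang}` placeholder, and `search` caps results at 500. Both can be sketched in a few lines. The helper names are hypothetical, and the sketch assumes the language code is already a valid Wikipedia subdomain.

```rust
/// Fill the `{lang}` placeholder in the endpoint template.
fn wikipedia_endpoint(language: &str) -> String {
    format!("https://{}.wikipedia.org/w/api.php", language)
}

/// Cap a requested result count at the documented maximum of 500.
fn clamp_search_limit(limit: usize) -> usize {
    limit.min(500)
}
```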

Data Transformation:

WikiPage -> DataRecord {
    id: format!("wikipedia_{}_{}", language, pageid),
    source: "wikipedia",
    record_type: "article",
    embedding: embed_text(title + extract),
    relationships: [
        {target: category, rel_type: "in_category", weight: 1.0},
        {target: linked_page, rel_type: "links_to", weight: 0.5}
    ]
}
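
The `relationships` list above gives category membership twice the weight of an outgoing link. The sketch below builds that list for one page. The `Relationship` struct mirrors the three fields shown in the transformation, but its exact definition in the framework is not given here, so treat the types and the function name as assumptions.

```rust
/// Assumed shape of one relationship entry (fields taken from the listing).
#[derive(Debug, Clone, PartialEq)]
struct Relationship {
    target: String,
    rel_type: String,
    weight: f32,
}

/// Categories become "in_category" edges with weight 1.0,
/// outgoing links become "links_to" edges with weight 0.5.
fn page_relationships(categories: &[&str], links: &[&str]) -> Vec<Relationship> {
    let category_edges = categories.iter().map(|c| Relationship {
        target: c.to_string(),
        rel_type: "in_category".to_string(),
        weight: 1.0,
    });
    let link_edges = links.iter().map(|l| Relationship {
        target: l.to_string(),
        rel_type: "links_to".to_string(),
        weight: 0.5,
    });
    category_edges.chain(link_edges).collect()
}
```

Keeping the weights on the edges lets a downstream graph builder treat shared categories as stronger evidence of relatedness than a single hyperlink.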

Wikidata Client

Endpoint: https://www.wikidata.org/w/api.php
SPARQL Endpoint: https://query.wikidata.org/sparql
Authentication: Not required
Rate Limit: 100ms delay

Methods (7):

new() - Initialize client
search_entities(query) - Search Wikidata entities
get_entity(qid) - Get entity by Q-identifier (Q42 = Douglas Adams)
sparql_query(query) - Execute SPARQL query
query_climate_entities() - Predefined climate change query
query_pharmaceutical_companies() - Pharma companies query
query_disease_outbreaks() - Disease outbreaks query

Predefined SPARQL Queries (5):

CLIMATE_CHANGE - Climate change entities
PHARMACEUTICAL_COMPANIES - Pharma companies with founding dates, employees
DISEASE_OUTBREAKS - Epidemic events with locations, casualties
RESEARCH_INSTITUTIONS - Research institutes by country
NOBEL_LAUREATES - Nobel Prize winners by field and year

12. medical_clients.rs - Medical & Health Data

PubMed Client

Endpoint: https://eutils.ncbi.nlm.nih.gov/entrez/eutils
Authentication: Optional (NCBI API key)
Rate Limit:

Without key: 334ms (~3 req/sec)
With key: 100ms (10 req/sec)

Methods (4):

new(api_key: Option<String>) - Initialize client
search_articles(query, max_results) - Search medical literature
search_pmids(query, max_results) - Get PMIDs only
fetch_abstracts(pmids) - Fetch full abstracts (batches of 200)
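
`fetch_abstracts` is described as working in batches of 200. A minimal sketch of that batching step is shown below, producing one comma-separated id string per request. The constant and function names are hypothetical.

```rust
/// Batch size quoted for `fetch_abstracts`.
const PUBMED_BATCH_SIZE: usize = 200;

/// Split a PMID list into comma-joined id strings, one per request.
fn pmid_batches(pmids: &[String]) -> Vec<String> {
    pmids
        .chunks(PUBMED_BATCH_SIZE)
        .map(|chunk| chunk.join(","))
        .collect()
}
```

With 450 PMIDs this yields three requests (200, 200, and 50 ids), each of which would still pass through the rate limiter.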

Data Transformation:

PubmedArticle -> SemanticVector {
    id: format!("PMID:{}", pmid),
    embedding: embed_text(title + abstract),
    domain: Domain::Medical,
    metadata: {pmid, title, abstract, authors, publication_date},
    embedding_dimension: 384 // Higher for medical text
}

ClinicalTrials.gov Client

Endpoint: https://clinicaltrials.gov/api/v2
Authentication: Not required
Rate Limit: 100ms delay

Methods (2):

new() - Initialize client
search_trials(condition, status) - Search trials by condition and status
- Status: RECRUITING, COMPLETED, ACTIVE_NOT_RECRUITING, etc.

Data Transformation:

ClinicalStudy -> SemanticVector {
    id: format!("NCT:{}", nct_id),
    embedding: embed_text(title + summary + conditions),
    domain: Domain::Medical,
    metadata: {nct_id, title, summary, conditions, status}
}

FDA OpenFDA Client

Endpoint: https://api.fda.gov
Authentication: Not required
Rate Limit: 250ms delay (~4 req/sec)

Methods (3):

new() - Initialize client
search_drug_events(drug_name) - Search adverse drug events
search_recalls(reason) - Search device recalls

Data Transformation:

FdaDrugEvent -> SemanticVector {
    id: format!("FDA_EVENT:{}", safety_report_id),
    embedding: embed_text("Drug: {drugs} Reactions: {reactions}"),
    domain: Domain::Medical,
    metadata: {report_id, drugs, reactions, serious}
}

FdaRecall -> SemanticVector {
    id: format!("FDA_RECALL:{}", recall_number),
    embedding: embed_text("Product: {product} Reason: {reason}"),
    domain: Domain::Medical,
    metadata: {recall_number, reason, product, classification}
}

Common Patterns Across All Clients

1. Error Handling Pattern

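use std::time::Duration;

use reqwest::StatusCode;
use tokio::time::sleep;

// Method on a client struct that holds `client: reqwest::Client`.
// `Result` is the framework's alias with `FrameworkError` as the error type.
async fn fetch_with_retry(&self, url: &str) -> Result<reqwest::Response> {
    let mut retries = 0;
    loop {
        match self.client.get(url).send().await {
            Ok(response) => {
                // HTTP 429: wait, then retry while attempts remain.
                if response.status() == StatusCode::TOO_MANY_REQUESTS
                   && retries < MAX_RETRIES {
                    retries += 1;
                    sleep(Duration::from_millis(RETRY_DELAY_MS * retries as u64)).await;
                    continue;
                }
                // Any other status (or a 429 after the last retry) goes to the caller.
                return Ok(response);
            }
            // Transport error: same delay schedule, then loop again.
            Err(_) if retries < MAX_RETRIES => {
                retries += 1;
                sleep(Duration::from_millis(RETRY_DELAY_MS * retries as u64)).await;
            }
            // Retries exhausted: wrap the last error.
            Err(e) => return Err(FrameworkError::Network(e)),
        }
    }
}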

Constants:

MAX_RETRIES: u32 = 3
RETRY_DELAY_MS: u64 = 1000
Backoff: linear, RETRY_DELAY_MS * retries (1s, 2s, 3s), not exponential
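The delay in `fetch_with_retry` grows linearly with the attempt count. A minimal sketch of that schedule, next to a true exponential one for comparison; the function names are hypothetical and the exponential variant is not part of the framework:

```rust
use std::time::Duration;

const MAX_RETRIES: u32 = 3;
const RETRY_DELAY_MS: u64 = 1000;

/// Delay before retry number `attempt` (1-based), as in `fetch_with_retry`:
/// RETRY_DELAY_MS * attempt, i.e. 1 s, 2 s, 3 s.
fn linear_backoff(attempt: u32) -> Duration {
    Duration::from_millis(RETRY_DELAY_MS * attempt as u64)
}

/// Hypothetical exponential variant: RETRY_DELAY_MS * 2^(attempt - 1),
/// i.e. 1 s, 2 s, 4 s. The shift is capped to avoid overflow.
fn exponential_backoff(attempt: u32) -> Duration {
    let factor = 1u64 << attempt.saturating_sub(1).min(16);
    Duration::from_millis(RETRY_DELAY_MS * factor)
}

/// Worst-case total time spent sleeping across all retries (linear schedule).
fn total_linear_wait() -> Duration {
    (1..=MAX_RETRIES).map(linear_backoff).sum()
}
```

With three retries the two schedules differ only on the last attempt, so the distinction matters mainly if MAX_RETRIES is raised.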

2. Rate Limiting Pattern

// Before each API call
sleep(self.rate_limit_delay).await;
let response = self.fetch_with_retry(&url).await?;
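The pattern above sleeps for the full delay before every call. One possible refinement, sketched here as an assumption rather than as what the clients do, is to wait only for whatever remains of the interval since the previous request. `remaining_wait` and `RateGate` are hypothetical names, and blocking `std::thread::sleep` is used so the sketch runs without an async runtime:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// How much longer to wait so that two requests are at least `min_interval` apart.
fn remaining_wait(last: Option<Instant>, now: Instant, min_interval: Duration) -> Duration {
    match last {
        Some(prev) => min_interval.saturating_sub(now.saturating_duration_since(prev)),
        // No earlier request: nothing to wait for.
        None => Duration::ZERO,
    }
}

/// Tracks the start time of the previous request for one client.
struct RateGate {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateGate {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_request: None }
    }

    /// Block until the next request is allowed, then record its start time.
    fn wait(&mut self) {
        let pause = remaining_wait(self.last_request, Instant::now(), self.min_interval);
        if !pause.is_zero() {
            thread::sleep(pause);
        }
        self.last_request = Some(Instant::now());
    }
}
```

In an async client the same calculation would feed an async sleep instead; the benefit is that time already spent parsing a response counts toward the delay.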

Rate Limit Table:

Client	Delay (ms)	Req/Sec	Notes
News API	100	~10	Configurable
Reddit	1000	1	60 req/min limit
GitHub	1000	1	5000/hr with token
HackerNews	100	~10	No auth required
World Bank	250	4	No auth required
FRED	200	5	API key required
Alpha Vantage	12000	0.08	5 req/min limit
IMF	500	2	No auth required
USPTO	500	2	No auth required
EPO	1000	1	OAuth2 required
Google Patents	1000	1	Conservative
ArXiv	3000	0.33	1 req/3sec guideline
Semantic Scholar (no key)	1000	1	100 req/5min
Semantic Scholar (with key)	100	10	1000 req/5min
bioRxiv/medRxiv	500	2	No auth required
CrossRef	200	5	Polite pool with email
NASA APOD	1000	1	DEMO_KEY available
SpaceX	500	2	No auth required
SIMBAD	1000	1	TAP service
NCBI Gene (no key)	334	3	NCBI guidelines
NCBI Gene (with key)	100	10	API key required
Ensembl	200	5	15 req/sec limit
UniProt	200	5	No auth required
PDB	500	2	No auth required
USGS	200	5	Real-time seismic
CERN	500	2	Open data portal
Argo	300	3	Ocean float data
Materials Project	1000	1	1 req/sec free tier
Wikipedia	100	~10	No auth required
Wikidata	100	~10	SPARQL available
PubMed (no key)	334	3	NCBI guidelines
PubMed (with key)	100	10	API key required
ClinicalTrials	100	~10	No auth required
FDA OpenFDA	250	4	No auth required
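The Req/Sec column follows from the delay, and a delay can be derived from a published quota in the same way. A small sketch of both conversions, with hypothetical helper names:

```rust
/// Maximum sustained request rate implied by a fixed per-request delay.
fn requests_per_second(delay_ms: u64) -> f64 {
    1000.0 / delay_ms as f64
}

/// Smallest whole-millisecond delay that stays within `max_requests`
/// per `window_secs`. Panics if `max_requests` is zero.
fn delay_ms_for_quota(max_requests: u64, window_secs: u64) -> u64 {
    let window_ms = window_secs * 1000;
    // Round up so the quota is never exceeded.
    (window_ms + max_requests - 1) / max_requests
}
```

For example, 60 requests per minute gives 1000 ms, 5 requests per minute gives 12000 ms, and 3 requests per second rounds up to 334 ms.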

3. Embedding Pattern

// SimpleEmbedder - deterministic hash-based embeddings
embedder: Arc<SimpleEmbedder> = Arc::new(SimpleEmbedder::new(dimension));

// Dimensions by domain:
// - 256: Most clients (news, social, research)
// - 384: Medical/scientific (PubMed, ClinicalTrials, FDA)
// - Configurable per client based on text complexity
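To illustrate what "deterministic hash-based" can mean, here is a minimal feature-hashing sketch. It is not the actual `SimpleEmbedder` implementation: `HashEmbedder`, the FNV-1a hash, the whitespace tokenizer, and the sign trick are all assumptions chosen for the example.

```rust
/// Hypothetical stand-in for a deterministic hash-based embedder.
struct HashEmbedder {
    dimension: usize,
}

impl HashEmbedder {
    fn new(dimension: usize) -> Self {
        assert!(dimension > 0, "dimension must be positive");
        Self { dimension }
    }

    /// FNV-1a (64-bit), written out so results do not depend on std's hasher.
    fn fnv1a(bytes: &[u8]) -> u64 {
        let mut hash: u64 = 0xcbf29ce484222325;
        for &b in bytes {
            hash ^= b as u64;
            hash = hash.wrapping_mul(0x100000001b3);
        }
        hash
    }

    /// Hash each lowercased whitespace token into a bucket, then L2-normalize.
    fn embed_text(&self, text: &str) -> Vec<f32> {
        let mut v = vec![0.0f32; self.dimension];
        for token in text.split_whitespace() {
            let h = Self::fnv1a(token.to_lowercase().as_bytes());
            let bucket = (h % self.dimension as u64) as usize;
            // Use the top hash bit as a sign so collisions partly cancel out.
            let sign = if (h >> 63) & 1 == 0 { 1.0 } else { -1.0 };
            v[bucket] += sign;
        }
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in &mut v {
                *x /= norm;
            }
        }
        v
    }
}
```

An embedder of this kind captures token overlap only, not meaning, so a larger dimension mainly reduces bucket collisions on longer texts.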

4. Metadata Pattern

let mut metadata = HashMap::new();
metadata.insert("source".to_string(), "client_name".to_string());
metadata.insert("id".to_string(), record_id);
// Domain-specific fields

Common Metadata Fields:

source - Client identifier
title - Record title
url - Source URL
timestamp - Publication/update date
Domain-specific fields (authors, categories, scores, etc.)

Summary Statistics

By Domain Coverage

News & Social: 4 clients (News API, Reddit, GitHub, HackerNews)
Economic: 4 clients (World Bank, FRED, Alpha Vantage, IMF)
Patents: 3 clients (USPTO, EPO, Google Patents)
Research: 4 clients (ArXiv, Semantic Scholar, bioRxiv, CrossRef)
Space: 3 clients (NASA APOD, SpaceX, SIMBAD)
Genomics: 4 clients (NCBI Gene, Ensembl, UniProt, PDB)
Physics: 4 clients (USGS, CERN, Argo, Materials Project)
Knowledge: 2 clients (Wikipedia, Wikidata)
Medical: 3 clients (PubMed, ClinicalTrials, FDA)

By Authentication Requirements

No Auth Required: 17 clients (57%)
Optional Auth: 5 clients (17%) - improved rate limits
Required Auth: 8 clients (27%)

By Method Count

Total Public Methods: 150+
Average per client: ~5 methods
Range: 2-7 methods per client

By Rate Limit Strictness

Very Strict (>1000ms): 2 clients - ArXiv (3000ms), Alpha Vantage (12000ms)
Strict (500-1000ms): 11 clients
Moderate (200-500ms): 11 clients
Permissive (<200ms): 6 clients

By Embedding Dimensions

256 dimensions: 26 clients (87%)
384 dimensions: 4 clients (13%) - medical/scientific domains

Data Flow Architecture

API Source → Client → Response Parser → SemanticVector/DataRecord
                                              ↓
                                       Embedding (SimpleEmbedder)
                                              ↓
                                       Domain Classification
                                              ↓
                                       Metadata Extraction
                                              ↓
                                       RuVector Storage
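The stages in the diagram can be sketched as a single conversion function. The field names follow the Data Transformation listings; `ParsedArticle`, the closure-based embedder parameter, and the "pubmed" source value are assumptions made for the example, and the real framework types may differ:

```rust
use std::collections::HashMap;

/// Only the variant used in this sketch; other domains omitted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Domain {
    Medical,
}

#[derive(Debug, Clone)]
struct SemanticVector {
    id: String,
    embedding: Vec<f32>,
    domain: Domain,
    metadata: HashMap<String, String>,
}

/// Hypothetical parsed record, standing in for a client-specific response type.
struct ParsedArticle {
    pmid: String,
    title: String,
    abstract_text: String,
}

/// Parser output -> embedding -> domain tag -> metadata, in the order of the diagram.
fn to_semantic_vector(
    article: &ParsedArticle,
    embed: impl Fn(&str) -> Vec<f32>,
) -> SemanticVector {
    let text = format!("{} {}", article.title, article.abstract_text);
    let mut metadata = HashMap::new();
    metadata.insert("source".to_string(), "pubmed".to_string());
    metadata.insert("pmid".to_string(), article.pmid.clone());
    metadata.insert("title".to_string(), article.title.clone());
    SemanticVector {
        id: format!("PMID:{}", article.pmid),
        embedding: embed(&text),
        domain: Domain::Medical,
        metadata,
    }
}
```

Passing the embedder in as a closure keeps the conversion testable without a real embedding model.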

Usage Recommendations

1. Rate Limit Compliance

Always use provided rate limit delays
Consider API key registration for higher limits
Batch requests when possible (e.g., PubMed: 200 PMIDs/request)
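Batching as recommended above amounts to chunking the ID list before issuing requests. A minimal sketch, assuming IDs are sent as one comma-separated parameter value per request; `batch_ids` and `id_param` are hypothetical helpers, not framework functions:

```rust
/// Split IDs into request-sized batches (200 per request for PubMed abstracts).
fn batch_ids(ids: &[String], batch_size: usize) -> Vec<Vec<String>> {
    assert!(batch_size > 0, "batch_size must be positive");
    ids.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Build the comma-separated ID list for one batch.
fn id_param(batch: &[String]) -> String {
    batch.join(",")
}
```

With 450 PMIDs and a batch size of 200 this yields three requests (200, 200, 50) instead of 450, so the per-request delay is paid three times.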

2. Error Handling

All clients implement retry logic with linearly increasing backoff (RETRY_DELAY_MS * retries)
Handle FrameworkError::Network for connectivity issues
Check for empty results (some APIs return 404 for no matches)
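One way to act on these points is to classify the HTTP status before parsing, so that a 404 becomes an empty result instead of an error. This is a sketch with hypothetical names (`FetchOutcome`, `classify_status`), not the framework's own error handling:

```rust
#[derive(Debug, PartialEq)]
enum FetchOutcome {
    /// 2xx: the body should be parsed.
    Parse,
    /// 404: treat as "no matches" rather than a failure.
    Empty,
    /// 429: rate limited, worth retrying after a delay.
    Retry,
    /// Anything else: report the status to the caller.
    Fail(u16),
}

fn classify_status(status: u16) -> FetchOutcome {
    match status {
        200..=299 => FetchOutcome::Parse,
        404 => FetchOutcome::Empty,
        429 => FetchOutcome::Retry,
        other => FetchOutcome::Fail(other),
    }
}
```

A caller would return an empty vector for `Empty` and reserve `FrameworkError` for `Fail` and for transport errors.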

3. Authentication

Store API keys in environment variables
Use optional auth when available for better rate limits
OAuth2 clients (Reddit, EPO) require credential management
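Reading an optional key from the environment and choosing the matching delay might look like the sketch below. The helper names are hypothetical; the 334 ms and 100 ms values are the ones listed for the NCBI-backed clients in the rate limit table:

```rust
use std::env;
use std::time::Duration;

/// Read an optional API key; unset or blank variables count as "no key".
fn api_key_from_env(var_name: &str) -> Option<String> {
    env::var(var_name)
        .ok()
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
}

/// Delay for NCBI-backed clients: 334 ms without a key, 100 ms with one.
fn ncbi_delay(api_key: Option<&str>) -> Duration {
    match api_key {
        Some(_) => Duration::from_millis(100),
        None => Duration::from_millis(334),
    }
}
```

The resulting `Option<String>` can be passed directly to constructors that accept an optional key, so the same binary works with or without credentials.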

4. Performance Optimization

Use parallel requests for independent queries
Leverage batch endpoints (PubMed abstracts, etc.)
Cache results when appropriate
Consider semantic search with embeddings vs. full-text search

5. Domain-Specific Considerations

Medical: Higher embedding dimensions (384) for richer semantics
Research: Check multiple sources (ArXiv + Semantic Scholar + CrossRef)
Economic: Time-series data requires date range management
Genomics: Species-specific searches (Ensembl supports 100+ species)
Physics: Geographic searches use Haversine distance calculations
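For reference, the Haversine great-circle distance mentioned for geographic searches can be computed as follows. This is a standalone sketch using a mean Earth radius of 6371 km; `haversine_km` and `within_radius` are hypothetical names, and the clients' own implementation may differ in detail:

```rust
/// Mean Earth radius in kilometres.
const EARTH_RADIUS_KM: f64 = 6371.0;

/// Great-circle distance between two points given in decimal degrees.
fn haversine_km(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    let (phi1, phi2) = (lat1.to_radians(), lat2.to_radians());
    let d_phi = (lat2 - lat1).to_radians();
    let d_lambda = (lon2 - lon1).to_radians();
    // a = sin^2(d_phi / 2) + cos(phi1) * cos(phi2) * sin^2(d_lambda / 2)
    let a = (d_phi / 2.0).sin().powi(2)
        + phi1.cos() * phi2.cos() * (d_lambda / 2.0).sin().powi(2);
    // Clamp guards against rounding pushing `a` slightly outside [0, 1].
    let c = 2.0 * a.clamp(0.0, 1.0).sqrt().asin();
    EARTH_RADIUS_KM * c
}

/// True if the event lies within `radius_km` of the search centre.
fn within_radius(center: (f64, f64), event: (f64, f64), radius_km: f64) -> bool {
    haversine_km(center.0, center.1, event.0, event.1) <= radius_km
}
```

One degree of longitude along the equator comes out at roughly 111.19 km, which is a convenient sanity check for any implementation.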

Integration Example

use ruvector_data_framework::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Initialize multiple clients; optional API keys come from the environment
    // (the variable names here are illustrative)
    let arxiv = ArxivClient::new()?;
    let s2 = SemanticScholarClient::new(std::env::var("SEMANTIC_SCHOLAR_API_KEY").ok())?;
    let pubmed = PubMedClient::new(std::env::var("NCBI_API_KEY").ok())?;

    // Parallel search across domains
    let query = "machine learning healthcare";

    // join! polls all three searches concurrently and waits for all of them
    let (arxiv_results, s2_results, pubmed_results) = tokio::join!(
        arxiv.search(query, 50),
        s2.search_papers(query, 50),
        pubmed.search_articles(query, 50)
    );

    // Combine vectors; the first failed search aborts via `?`
    let mut all_vectors = Vec::new();
    all_vectors.extend(arxiv_results?);
    all_vectors.extend(s2_results?);
    all_vectors.extend(pubmed_results?);

    // Store in RuVector for semantic search
    // ... vector storage code ...

    Ok(())
}

Future Enhancements

Dynamic Rate Limiting: Adjust based on response headers
Circuit Breakers: Fail-fast on repeated errors
Response Caching: Redis/disk cache for repeated queries
Streaming APIs: Support for SSE/WebSocket endpoints
Advanced Embeddings: Integration with transformer models
Relationship Graphs: Enhanced Wikipedia/Wikidata graph traversal
Multi-language Support: Expand beyond English for international sources
Specialized Domains: Climate, energy, agriculture data sources

Last Updated: 2026-01-04
Total Clients: 30
Total Methods: 150+
API Coverage: 10 domains across research, economic, medical, and scientific data
