# Self-Learning Postgres DB - Vector Database for AI Agents

A distributed vector database that **truly learns**. Store embeddings, query with semantic search, and let the index improve itself through TRM (Tiny Recursive Models), SONA (Self-Optimizing Neural Architecture), and Graph Neural Networks.

[![Apify Actor](https://img.shields.io/badge/Apify-Actor-blue)](https://apify.com/ruv/self-learning-postgres-db)
[![PostgreSQL 17](https://img.shields.io/badge/PostgreSQL-17.7-blue)](https://www.postgresql.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Version](https://img.shields.io/badge/version-2.1-green)](https://github.com/ruvnet/ruvector)

## Key AI Features

| Feature | Description |
|---------|-------------|
| **TRM** | 7M parameter recursive reasoning (83% on GSM8K) |
| **SONA** | 3-tier learning (Instant/Background/Deep) |
| **EWC++** | Anti-forgetting protection (λ=2000) |
| **GNN** | Graph Neural Network index optimization |
| **Trajectory Tracking** | Learn from query patterns |

---

## Features

**30+ Operations** for complete vector database management:

- **Semantic Search** - Find documents by meaning, not just keywords
- **Batch Operations** - Insert and search thousands of documents efficiently
- **Hybrid Search** - Combine vector similarity with keyword matching
- **RAG Support** - Built-in Retrieval-Augmented Generation queries
- **Self-Learning** - GNN training for index optimization
- **Clustering** - K-means document clustering
- **Deduplication** - Find and remove duplicate content
- **Export/Import** - JSON and CSV data migration

**Zero Setup Required:**
- Embedded PostgreSQL with ruvector extension
- Local AI embeddings (no OpenAI API key needed)
- Automatic table and index creation

---

## Quick Start (30 Seconds)

### Full Demo

```json
{
  "action": "full_workflow",
  "query": "How does machine learning work?",
  "documents": [
    {"content": "Machine learning is AI that learns patterns from data.", "metadata": {"category": "AI"}},
    {"content": "PostgreSQL is a powerful relational database.", "metadata": {"category": "Database"}},
    {"content": "Neural networks consist of layers of nodes.", "metadata": {"category": "AI"}},
    {"content": "Vector databases store embeddings for similarity search.", "metadata": {"category": "Database"}}
  ]
}
```

**Result:** Documents ranked by semantic relevance to your query.

---

## All 38 Actions

### Document Operations
| Action | Description |
|--------|-------------|
| `insert` | Add documents with auto-generated embeddings |
| `batch_insert` | Efficiently insert large document sets |
| `get` | Retrieve single document by ID |
| `list` | List documents with filtering |
| `update` | Modify existing document content/metadata |
| `delete` | Remove documents by ID, IDs, or filter |
| `upsert` | Insert or update (smart merge) |

### Search Operations
| Action | Description |
|--------|-------------|
| `search` | Semantic similarity search |
| `batch_search` | Multiple queries in one call |
| `hybrid_search` | Vector + BM25 keyword combined |
| `multi_query_search` | Aggregate results from multiple queries |
| `mmr_search` | Maximal Marginal Relevance (diverse results) |
| `graph_search` | Graph-based similarity traversal |
| `range_search` | All results within distance threshold |
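
To make the `mmr_search` idea concrete, here is a minimal pure-Python sketch of Maximal Marginal Relevance. It is not the Actor's implementation; the function names and the `lambda_mult` trade-off weight are hypothetical and only illustrate how relevance to the query is balanced against redundancy with results already picked.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k=2, lambda_mult=0.5):
    """Return indices of doc_vecs selected by Maximal Marginal Relevance.

    lambda_mult (hypothetical name): 1.0 ranks by relevance only,
    lower values penalize similarity to already selected documents.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "embeddings": documents 0 and 1 are near-duplicates, document 2 differs.
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
query = [1.0, 0.2]
diverse = mmr(query, docs, top_k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, top_k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the two near-duplicates are returned together; with `0.5` the second slot goes to the dissimilar document instead.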

### Table Operations
| Action | Description |
|--------|-------------|
| `create_table` | Create new vector collection |
| `drop_table` | Delete collection |
| `list_tables` | Show all vector tables |
| `table_stats` | Collection statistics and metrics |
| `create_index` | Add HNSW or IVFFlat index |
| `reindex` | Rebuild indexes |

### Self-Learning / GNN / SONA
| Action | Description |
|--------|-------------|
| `train_gnn` | Train Graph Neural Network on data |
| `optimize_index` | Auto-tune HNSW parameters |
| `analyze_patterns` | Analyze data distribution |
| `sona_learn` | Trigger TRM/SONA background learning cycle |
| `sona_status` | Check SONA learning status and capabilities |

### Clustering & Deduplication
| Action | Description |
|--------|-------------|
| `cluster` | K-means document clustering |
| `find_duplicates` | Detect similar document pairs |
| `deduplicate` | Remove duplicate documents |

### Data Operations
| Action | Description |
|--------|-------------|
| `export` | Export to JSON or CSV |
| `import` | Import from JSON data |

### AI / RAG
| Action | Description |
|--------|-------------|
| `rag_query` | Build RAG context from search results |
| `summarize` | Document statistics and previews |

### Utility
| Action | Description |
|--------|-------------|
| `ping` | Test database connection |
| `version` | Get version and feature info |
| `embedding_models` | List available models |
| `generate_embedding` | Create embeddings without storing |
| `similarity` | Compare similarity of two texts |

---

## Use Cases

### 1. AI Agent Memory

```json
{
  "action": "insert",
  "tableName": "agent_memory",
  "documents": [
    {"content": "User prefers dark mode", "metadata": {"user_id": "123", "type": "preference"}},
    {"content": "User asked about Python tutorials", "metadata": {"user_id": "123", "type": "history"}}
  ]
}
```

Retrieve memories:
```json
{
  "action": "search",
  "tableName": "agent_memory",
  "query": "What does this user like?",
  "filter": "metadata->>'user_id' = '123'"
}
```
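
Because `filter` is passed as SQL, values that come from users should be escaped before they are interpolated. The helper below is a hypothetical sketch (not part of the Actor) that builds the same kind of condition as above and doubles single quotes; whether the Actor applies its own sanitization is not documented here, so treat this as a precaution rather than a guarantee.

```python
def metadata_equals(key, value):
    """Build a `metadata->>'key' = 'value'` condition for the `filter` parameter.

    Hypothetical helper: keys are restricted to letters, digits, and
    underscores, and single quotes in the value are doubled (SQL escaping).
    """
    if not key or not key.replace("_", "").isalnum():
        raise ValueError(f"unsupported metadata key: {key!r}")
    escaped = str(value).replace("'", "''")
    return f"metadata->>'{key}' = '{escaped}'"


run_input = {
    "action": "search",
    "tableName": "agent_memory",
    "query": "What does this user like?",
    "filter": metadata_equals("user_id", "123"),
}
```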

### 2. RAG Pipeline

```json
{
  "action": "rag_query",
  "query": "How do I return a product?",
  "topK": 5,
  "ragMaxTokens": 2000
}
```

Returns context ready to feed to your LLM.
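
Conceptually, building RAG context means concatenating the top results until a token budget such as `ragMaxTokens` is exhausted. The sketch below illustrates that idea only; the `content` field on each result and the rough 4-characters-per-token estimate are assumptions, and the Actor may count tokens differently.

```python
def build_context(results, max_tokens=2000, chars_per_token=4):
    """Join ranked result snippets until an approximate token budget is reached.

    Assumptions: each result is a dict with a "content" string, and one token
    is roughly `chars_per_token` characters.
    """
    budget = max_tokens * chars_per_token
    parts, used = [], 0
    for rank, doc in enumerate(results, start=1):
        snippet = f"[{rank}] {doc['content']}"
        if used + len(snippet) > budget:
            break  # the next snippet would exceed the budget
        parts.append(snippet)
        used += len(snippet) + 1  # +1 for the newline separator
    return "\n".join(parts)


results = [{"content": "a" * 10}, {"content": "b" * 10}, {"content": "c" * 10}]
small = build_context(results, max_tokens=7)   # budget of 28 characters
larger = build_context(results, max_tokens=8)  # budget of 32 characters
```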

### 3. Batch Document Processing

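```json
{
  "action": "batch_insert",
  "batchSize": 100,
  "documents": [
    {"content": "Machine learning is AI that learns patterns from data.", "metadata": {"category": "AI"}},
    {"content": "PostgreSQL is a powerful relational database.", "metadata": {"category": "Database"}}
  ]
}
```

The `documents` array can hold thousands of entries; only two are shown here.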
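
If a document set is too large to send as a single run input, one option is to split it client-side into several `batch_insert` inputs. This is a sketch under that assumption; the `chunk` helper is hypothetical, and the Actor's own `batchSize` handling may make it unnecessary.

```python
def chunk(documents, batch_size=100):
    """Yield consecutive slices of `documents` with at most `batch_size` items."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


docs = [{"content": f"doc {i}"} for i in range(250)]

# One run input per slice; each can be passed to the Actor as a separate run.
inputs = [
    {"action": "batch_insert", "batchSize": 100, "documents": batch}
    for batch in chunk(docs, batch_size=100)
]
```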

### 4. Find & Remove Duplicates

```json
{
  "action": "find_duplicates",
  "similarityThreshold": 0.95
}
```

Then remove them:
```json
{
  "action": "deduplicate",
  "similarityThreshold": 0.95
}
```
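
As an illustration of what a `similarityThreshold` of 0.95 means, the sketch below flags every pair of vectors whose cosine similarity reaches the threshold. It assumes the threshold is a cosine similarity and uses a brute-force pairwise comparison for clarity; the Actor's actual detection strategy is not described in this README.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_pairs(vectors, threshold=0.95):
    """Return (i, j) index pairs, i < j, whose similarity is >= threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine(a, b) >= threshold
    ]


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
pairs = find_duplicate_pairs(vectors, threshold=0.95)
```

Here only the first two vectors are close enough to count as duplicates.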

### 5. Document Clustering

```json
{
  "action": "cluster",
  "numClusters": 10,
  "clusteringAlgorithm": "kmeans"
}
```

### 6. Index Optimization

```json
{
  "action": "optimize_index",
  "enableLearning": true
}
```

### 7. SONA Self-Learning

Check learning status:
```json
{
  "action": "sona_status"
}
```

Trigger learning cycle:
```json
{
  "action": "sona_learn",
  "ewcLambda": 2000,
  "patternThreshold": 0.7
}
```
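
For orientation, `ewcLambda` presumably plays the role of the penalty strength λ in elastic weight consolidation. The standard EWC objective is shown below as background; how this Actor's EWC++ variant estimates the importance weights is not documented here, so read it as the textbook form rather than the exact loss used.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta)
  + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_i^{*}\right)^2
```

Here θ* are the parameters learned so far, F_i is the (diagonal) Fisher information estimating how important parameter i was for earlier patterns, and a larger λ makes the model more reluctant to move away from θ*.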

---

## Parameters Reference

### Core Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `action` | string | `search` | Operation to perform |
| `connectionString` | string | embedded | PostgreSQL URL for persistence |
| `tableName` | string | `documents` | Table/collection name |

### Search Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | string | - | Natural language search query |
| `queryVector` | array | - | Pre-computed embedding vector |
| `topK` | integer | 10 | Number of results |
| `distanceMetric` | string | `cosine` | cosine, l2, inner_product, manhattan |
| `filter` | string | - | SQL WHERE condition without the `WHERE` keyword, e.g. `metadata->>'user_id' = '123'` |
| `minScore` | number | 0 | Minimum similarity score (0-1) |
| `maxDistance` | number | - | Maximum distance threshold |
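
The four `distanceMetric` options correspond to standard vector measures. The sketch below gives their textbook definitions in plain Python for reference; how the Actor converts a distance into the 0-1 score used by `minScore` is not specified here.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def inner_product(a, b):
    """Dot product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for same direction, 1 for orthogonal vectors."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - inner_product(a, b) / norm
```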

### Embedding Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `embeddingModel` | string | `all-MiniLM-L6-v2` | AI embedding model |
| `generateEmbeddings` | boolean | true | Auto-generate embeddings |
| `dimensions` | integer | 384 | Vector dimensions |

### Index Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `indexType` | string | `hnsw` | hnsw, ivfflat, none |
| `hnswM` | integer | 16 | HNSW max connections |
| `hnswEfConstruction` | integer | 64 | HNSW build quality |
| `hnswEfSearch` | integer | 100 | HNSW search quality |
| `ivfLists` | integer | 100 | IVFFlat partitions |

### GNN Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enableLearning` | boolean | false | Enable self-learning |
| `learningRate` | number | 0.01 | GNN learning rate |
| `gnnLayers` | integer | 2 | GNN layer count |
| `trainEpochs` | integer | 10 | Training epochs |

### SONA / TRM Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sonaEnabled` | boolean | true | Enable TRM/SONA self-learning |
| `ewcLambda` | number | 2000 | EWC++ anti-forgetting strength |
| `patternThreshold` | number | 0.7 | Pattern recognition confidence |
| `maxTrajectories` | integer | 100 | Max trajectory steps to track |
| `sonaLearningTiers` | array | ["instant", "background"] | Learning tiers to enable |

### Clustering Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `numClusters` | integer | 10 | K-means clusters |
| `similarityThreshold` | number | 0.95 | Duplicate detection threshold |

---

## Embedding Models

| Model | Dimensions | Speed | Quality | Best For |
|-------|------------|-------|---------|----------|
| `all-MiniLM-L6-v2` | 384 | Fast | Good | Prototyping |
| `bge-small-en-v1.5` | 384 | Fast | Excellent | Production |
| `bge-base-en-v1.5` | 768 | Medium | Better | High accuracy |
| `nomic-embed-text-v1` | 768 | Medium | Good | Long documents (8K) |
| `gte-small` | 384 | Fast | Good | General use |
| `e5-small-v2` | 384 | Fast | Good | English text |

---

## Persistent Storage

### Hybrid Persistence Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    Actor Run                            │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────┐ │
│  │ Key-Value    │───▶│ Embedded     │───▶│ Key-Value │ │
│  │ Store (load) │    │ PostgreSQL   │    │ (save)    │ │
│  └──────────────┘    └──────────────┘    └───────────┘ │
│       START              WORK               END         │
└─────────────────────────────────────────────────────────┘
```

**Flow:**
1. **On Start** → Load documents from Key-Value Store into embedded PostgreSQL
2. **During Run** → Full vector search capabilities (HNSW, cosine, etc.)
3. **On End** → Export documents back to Key-Value Store
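
The load, work, save cycle above can be sketched with an in-memory stand-in. `FakeKeyValueStore` and `run_actor` are hypothetical names used only for illustration: the fake store mimics a Key-Value Store with JSON values, and a plain list stands in for the embedded PostgreSQL table.

```python
import json


class FakeKeyValueStore:
    """In-memory stand-in for a Key-Value Store (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def get_value(self, key):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

    def set_value(self, key, value):
        self._data[key] = json.dumps(value)


def run_actor(store, new_documents, key="documents"):
    """Simulate one run: load state, do the work, save state."""
    table = store.get_value(key) or []   # 1. On start: load previous documents
    table.extend(new_documents)          # 2. During run: work on the local table
    store.set_value(key, table)          # 3. On end: export back to the store
    return len(table)


store = FakeKeyValueStore()
first = run_actor(store, [{"content": "a"}])
second = run_actor(store, [{"content": "b"}])
```

The second run sees the document written by the first one, which is the point of the save step at the end of each run.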

### Storage Options Comparison

| Feature | External PostgreSQL | Apify Key-Value Store |
|---------|---------------------|----------------------|
| Setup required | Yes | No |
| Cost | Separate billing | Included in Apify |
| Max size | Unlimited | ~9GB per store |
| Cold start | Fast | Slower (load data) |
| Best for | Large/production | Small-medium datasets |

### External PostgreSQL

To persist data across runs in an external database, set `connectionString`:

```json
{
  "connectionString": "postgresql://user:password@host:5432/database",
  "action": "search",
  "query": "Your query"
}
```

**Supported:**
- PostgreSQL 14+ with ruvector extension
- PostgreSQL with pgvector (compatibility mode)
- Supabase, Neon, AWS RDS, etc.

---

## API Integration

### Python
```python
from apify_client import ApifyClient

client = ApifyClient("your-api-token")
run = client.actor("ruv/self-learning-postgres-db").call(run_input={
    "action": "search",
    "query": "machine learning basics",
    "topK": 5
})
results = client.dataset(run["defaultDatasetId"]).list_items().items
```

### JavaScript
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your-api-token' });
const run = await client.actor('ruv/self-learning-postgres-db').call({
    action: 'search',
    query: 'machine learning basics',
    topK: 5
});
// listItems() resolves to a paginated object; the records are in `items`
const { items: results } = await client.dataset(run.defaultDatasetId).listItems();
```

### cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/ruv~self-learning-postgres-db/runs" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "search",
    "query": "machine learning",
    "topK": 10
  }'
```

---

## Performance

Built on PostgreSQL 17.7 with AVX-512 SIMD acceleration:

| Dataset Size | Search Time | Accuracy |
|--------------|-------------|----------|
| 10,000 docs | ~0.3ms | 99%+ |
| 100,000 docs | ~0.5ms | 99%+ |
| 1,000,000 docs | ~1.2ms | 98%+ |

---

## Pricing (Apify Pay-per-event)

### Core Events
| Event | Price | Description |
|-------|-------|-------------|
| Actor Start | $0.001 | Per GB memory used |
| Document Insert | $0.001 | Per document stored |
| Vector Search | $0.001 | Per search query |
| Result | $0.0005 | Per result returned |

### Advanced Operations
| Event | Price | Description |
|-------|-------|-------------|
| Batch Operation | $0.002 | Per batch insert/search |
| RAG Query | $0.002 | Per RAG context build |
| GNN Training | $0.01 | Per training session |
| Clustering | $0.005 | Per cluster operation |
| Deduplication | $0.003 | Per dedupe run |
| Data Export | $0.002 | Per export |
| Data Import | $0.002 | Per import |
| Table Operation | $0.001 | Create/drop table |
| Index Operation | $0.002 | Create/optimize index |
| Similarity Check | $0.001 | Per comparison |
| Embedding Generation | $0.001 | Per embedding |

**Volume Discounts:**
- Bronze: 14% off results
- Silver: 26% off results
- Gold: 40% off results
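
As a worked example of the arithmetic, the sketch below estimates the cost of a search workload from the Vector Search and Result prices listed above. It assumes the volume discount applies only to the Result event (as the "off results" wording suggests) and leaves out the Actor Start charge; the function and constant names are hypothetical.

```python
from decimal import Decimal

SEARCH_PRICE = Decimal("0.001")   # per search query
RESULT_PRICE = Decimal("0.0005")  # per result returned

# Assumption: discounts reduce the Result price only.
RESULT_DISCOUNT = {
    "none": Decimal("0"),
    "bronze": Decimal("0.14"),
    "silver": Decimal("0.26"),
    "gold": Decimal("0.40"),
}


def search_cost(searches, results_per_search, tier="none"):
    """Estimated USD cost of `searches` queries returning a fixed number of results."""
    result_price = RESULT_PRICE * (1 - RESULT_DISCOUNT[tier])
    return searches * SEARCH_PRICE + searches * results_per_search * result_price


base = search_cost(1000, 10)          # 1000 * 0.001 + 10000 * 0.0005
gold = search_cost(1000, 10, "gold")  # 1000 * 0.001 + 10000 * 0.0003
```

Under these assumptions, 1,000 searches returning 10 results each come to $6.00 without a discount and $4.00 on the Gold tier.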

---

## Development

### Local Testing

```bash
# Start ruvector-postgres
docker run -d --name ruvector-pg -e POSTGRES_PASSWORD=secret -p 5432:5432 ruvnet/ruvector-postgres:latest

# Run tests
DATABASE_URL="postgresql://postgres:secret@localhost:5432/postgres" npm test
```

### Deployment

```bash
# Set your API token in root .env
echo "APIFY_API_TOKEN=your_token" >> ../../../.env

# Deploy
npm run deploy
```

---

## Links

- [GitHub Repository](https://github.com/ruvnet/ruvector)
- [Apify Store](https://apify.com/ruv/self-learning-postgres-db)
- [Docker Image](https://hub.docker.com/r/ruvnet/ruvector-postgres)
- [RuVector Documentation](https://github.com/ruvnet/ruvector/tree/main/crates/ruvector-postgres)

---

## Support

- [Open an Issue](https://github.com/ruvnet/ruvector/issues)
- [Apify Community](https://discord.gg/apify)

---

**Built with RuVector** - High-performance vector search with TRM/SONA self-learning for the AI era.