RAG Knowledge Base Builder
Crawl, Chunk, Embed & Index for AI
Build production-ready knowledge bases for RAG (Retrieval Augmented Generation) systems in minutes.
Transform any website into a searchable, AI-ready knowledge base. This Apify actor crawls content, intelligently chunks it, generates embeddings, and exports to popular vector databases. Perfect for AI chatbots, semantic search engines, and LLM-powered assistants.
🌟 Features
Multi-Source Content Ingestion
- URLs List: Crawl from a list of starting URLs
- XML Sitemaps: Process entire sites via sitemap.xml
- Actor Datasets: Use output from other Apify actors
- File Upload: Process uploaded documents (coming soon)
4 Intelligent Chunking Strategies
- Fixed Size - Consistent token-based chunks with configurable overlap
- Semantic - Topic-aware chunking that preserves context
- Paragraph-based - Natural paragraph boundaries
- Sentence-based - Sentence-level granularity for fine-grained search
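As a rough illustration of the Fixed Size strategy, the sketch below slides a window of chunk_size tokens over the text and steps forward by chunk_size minus overlap. It is not the actor's implementation: the function name is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 128) -> List[str]:
    """Split text into windows of `chunk_size` tokens that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    # Assumption: whitespace-separated words approximate model tokens.
    tokens = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(fixed_size_chunks(sample, chunk_size=4, overlap=2))
# ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

Each chunk repeats the last two tokens of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.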
Embedding Model Support
- OpenAI text-embedding-3-small (1536 dimensions)
- OpenAI text-embedding-3-large (3072 dimensions)
- Cohere Embed English v3 (1024 dimensions)
- Local MiniLM (384 dimensions)
- Simulated embeddings (for testing without API keys)
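The dimensions above matter downstream because a vector index is created for one fixed dimension. The sketch below records them in a lookup table and shows one plausible way to produce simulated embeddings: a deterministic, unit-length vector derived from a hash of the text. How the actor actually generates simulated embeddings is not documented here, so treat the function and the "local-minilm" key as hypothetical.

```python
import hashlib
import math
from typing import List

# Dimensions as listed above; the dictionary keys are illustrative labels.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "cohere-embed-english-v3": 1024,
    "local-minilm": 384,  # hypothetical identifier
}


def simulated_embedding(text: str, dimensions: int) -> List[float]:
    """Return a deterministic unit-length pseudo-embedding derived from SHA-256."""
    values: List[float] = []
    counter = 0
    while len(values) < dimensions:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Map each byte (0..255) into the range [-1, 1].
        values.extend(b / 127.5 - 1.0 for b in digest)
        counter += 1
    values = values[:dimensions]
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]


vec = simulated_embedding("hello world", EMBEDDING_DIMENSIONS["local-minilm"])
print(len(vec))  # 384
```

Vectors like these carry no semantic meaning; they are only useful for exercising the pipeline and the export formats end to end.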
Vector Database Export Formats
- Pinecone - Ready-to-upload format
- Weaviate - Class-based schema format
- Qdrant - Point-based format
- Chroma - Collection-ready format
- AgentDB - Optimized for ruv.io AgentDB
- JSONL - Generic format for any vector DB
Advanced Features
- Smart Deduplication - Removes duplicate and highly similar chunks
- Rich Metadata Preservation - Source URLs, titles, timestamps, custom fields
- Configurable Crawl Depth - Control link following (1-5 levels)
- CSS Selector Control - Include/exclude specific page elements
- Custom Metadata - Add your own fields to all chunks
- Chunking Statistics - Track tokens, chunk counts, and performance
🚀 Quick Start
Example 1: Build Documentation Knowledge Base
{
"urls": [
{ "url": "https://docs.python.org/3/tutorial/index.html" }
],
"crawlDepth": 3,
"chunkStrategy": "semantic",
"chunkSize": 512,
"chunkOverlap": 128,
"embeddingModel": "text-embedding-3-small",
"outputFormat": "pinecone",
"deduplication": true
}
Result: Creates a searchable Python documentation knowledge base with ~500 semantic chunks, ready for RAG.
Example 2: Product Catalog for E-commerce Chatbot
{
"urls": [
{ "url": "https://example-store.com/products" }
],
"crawlDepth": 2,
"chunkStrategy": "paragraph",
"chunkSize": 384,
"embeddingModel": "text-embedding-3-small",
"outputFormat": "weaviate",
"includeSelectors": [".product-description", ".product-specs"],
"metadata": {
"source_type": "product_catalog",
"company": "Example Store"
}
}
Result: Indexed product descriptions and specs, optimized for customer support chatbots.
Example 3: Support Articles Knowledge Base
{
"urls": [
{ "url": "https://help.example.com" }
],
"crawlDepth": 2,
"chunkStrategy": "fixed_size",
"chunkSize": 768,
"chunkOverlap": 192,
"embeddingModel": "cohere-embed-english-v3",
"outputFormat": "qdrant",
"excludeSelectors": ["nav", "footer", ".related-articles"],
"deduplication": true,
"similarityThreshold": 0.90
}
Result: Clean support article chunks with aggressive deduplication, perfect for customer service AI.
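To make the effect of similarityThreshold concrete, here is a minimal sketch of threshold-based deduplication. It assumes the threshold is a cosine-similarity cutoff above which a chunk counts as a duplicate of one already kept; the actor's real algorithm is not described in this README, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(chunks: List[Dict], threshold: float = 0.90) -> List[Dict]:
    """Keep a chunk only if it is below `threshold` similarity to every kept chunk."""
    kept: List[Dict] = []
    for chunk in chunks:
        if all(cosine_similarity(chunk["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(chunk)
    return kept


chunks = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.96, 0.28]},  # cosine similarity to "a" is about 0.96
    {"id": "c", "embedding": [0.0, 1.0]},
]
print([c["id"] for c in deduplicate(chunks, threshold=0.90)])  # ['a', 'c']
print([c["id"] for c in deduplicate(chunks, threshold=0.99)])  # ['a', 'b', 'c']
```

Under this reading, a lower threshold removes more near-duplicates, which is why 0.90 is the more aggressive setting. The pairwise comparison is quadratic in the number of chunks, so it is only practical as an illustration or for small datasets.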
🔌 Apify MCP Integration
Integrate this actor with Claude Code using Apify MCP:
# Add to your Claude Code MCP servers
claude mcp add rag-builder -- npx -y @apify/actors-mcp-server --actors "ruv/rag-knowledge-builder"
Then use in Claude Code:
// Run via MCP
mcp__apify__run_actor({
actorId: "ruv/rag-knowledge-builder",
input: {
urls: [{ url: "https://docs.example.com" }],
chunkStrategy: "semantic",
outputFormat: "agentdb"
}
})
📚 Tutorials
Tutorial 1: Build Knowledge Base from Documentation Site
Goal: Create a searchable knowledge base from your product documentation.
Steps:
- Configure input with your docs URL
- Set crawlDepth: 3 to get all pages
- Use chunkStrategy: "semantic" for topic coherence
- Set chunkSize: 512 for balanced context
- Choose embedding model (OpenAI recommended)
- Select output format matching your vector DB
- Run actor and download dataset
Best Practices:
- Use excludeSelectors to remove navigation, footers
- Enable deduplication to avoid redundancy
- Add metadata to track documentation version
Tutorial 2: Create Product Catalog RAG from E-commerce Scrape
Goal: Build a product knowledge base for chatbot recommendations.
Steps:
- First, scrape product pages with Apify Web Scraper
- Use scraped dataset as input: sourceType: "actor_dataset"
- Set includeSelectors: [".product-description", ".specs"]
- Use chunkStrategy: "paragraph" for natural product info chunks
- Add custom metadata: { category, brand, price_range }
- Export to vector DB format
- Import into your RAG system
Integration:
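// Chain actors: Scraper → RAG Builder
import { ApifyClient } from 'apify-client';
const apifyClient = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });
// scrapingInput is your Web Scraper input object (start URLs, page function, ...)
const scraperRun = await apifyClient.actor('apify/web-scraper').call(scrapingInput);
// Feed the scraper's default dataset into the RAG builder
const ragRun = await apifyClient.actor('ruv/rag-knowledge-builder').call({
  sourceType: 'actor_dataset',
  actorDatasetId: scraperRun.defaultDatasetId,
  chunkStrategy: 'paragraph'
});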
Tutorial 3: Index Support Articles for Chatbot
Goal: Power customer support chatbot with knowledge base.
Steps:
- Configure with support site URL
- Set crawlDepth: 2 for main articles only
- Use excludeSelectors to remove FAQs, contact forms
- Set chunkSize: 768 for detailed answers
- Enable deduplication with high threshold (0.95)
- Export to Pinecone/Weaviate/Qdrant
- Connect to LangChain/LlamaIndex RAG pipeline
RAG Pipeline:
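# Use chunks in LangChain
from apify_client import ApifyClient
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
# Load chunks from the Apify dataset produced by the run
client = ApifyClient('YOUR_APIFY_TOKEN')
chunks = client.dataset('YOUR_DATASET_ID').list_items().items
# Create vector store from the chunk texts and their metadata
vectorstore = Pinecone.from_texts(
    [chunk['text'] for chunk in chunks],
    OpenAIEmbeddings(),
    metadatas=[chunk['metadata'] for chunk in chunks],
    index_name="support-kb"
)
# Query for relevant context
user_question = "How do I reset my password?"
docs = vectorstore.similarity_search(user_question, k=5)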
Tutorial 4: Combine Multiple Sources into Unified Knowledge Base
Goal: Merge documentation, blog posts, and support articles.
Steps:
- Run actor separately for each source:
  - Docs: crawlDepth: 3, chunkStrategy: semantic
  - Blog: crawlDepth: 2, chunkStrategy: paragraph
  - Support: crawlDepth: 1, chunkStrategy: fixed_size
- Add different metadata to each run
- Combine datasets using Apify's dataset merge
- Use unified dataset for comprehensive RAG
Advanced: Use different chunk sizes per source type for optimal retrieval.
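If you prefer to combine the per-source results yourself after downloading them, the merge can be sketched in a few lines. This is an assumption-laden illustration rather than Apify's dataset merge: merge_sources is a hypothetical helper, it works on lists of already downloaded items, and it tags each chunk with a source_type metadata field so results can be filtered by origin later.

```python
from typing import Any, Dict, Iterable, List


def merge_sources(sources: Dict[str, Iterable[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merge chunk lists from several runs, tagging each chunk with its source type."""
    merged: List[Dict[str, Any]] = []
    seen_ids = set()
    for source_type, items in sources.items():
        for item in items:
            if item["id"] in seen_ids:
                continue  # skip chunks that were already merged from another run
            seen_ids.add(item["id"])
            tagged = dict(item)  # shallow copy so the input items stay untouched
            tagged["metadata"] = {**item.get("metadata", {}), "source_type": source_type}
            merged.append(tagged)
    return merged


docs = [{"id": "d1", "text": "Install guide", "metadata": {"title": "Install"}}]
blog = [{"id": "b1", "text": "Release notes", "metadata": {}}]
support = [{"id": "d1", "text": "Install guide", "metadata": {}}]  # same id as docs

unified = merge_sources({"docs": docs, "blog": blog, "support": support})
print([(c["id"], c["metadata"]["source_type"]) for c in unified])
# [('d1', 'docs'), ('b1', 'blog')]
```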
🔧 Configuration Reference
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| sourceType | enum | urls | Content source: urls, sitemap, actor_dataset, file_upload |
| urls | array | [] | List of start URLs to crawl |
| crawlDepth | integer | 1 | Maximum link depth (1-5) |
| maxPages | integer | 100 | Maximum pages to crawl (0 = unlimited) |
| chunkStrategy | enum | fixed_size | Chunking strategy: fixed_size, semantic, paragraph, sentence |
| chunkSize | integer | 512 | Target chunk size in tokens (256-2048) |
| chunkOverlap | integer | 128 | Token overlap between chunks (0-256) |
| embeddingModel | enum | text-embedding-3-small | Embedding model to use |
| outputFormat | enum | jsonl | Export format: jsonl, pinecone, weaviate, qdrant, chroma, agentdb |
| includeMetadata | boolean | true | Include rich metadata in chunks |
| deduplication | boolean | true | Remove duplicate chunks |
| excludeSelectors | array | [nav, header, footer] | CSS selectors to exclude |
| includeSelectors | array | [] | CSS selectors to include (empty = all) |
| metadata | object | {} | Custom metadata for all chunks |
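Checking an input object against the documented ranges before starting a run can save a failed crawl. The sketch below encodes the enums, ranges, and defaults from the table; validate_input is a hypothetical helper, and the final overlap-versus-size check is an extra assumption that the table does not state.

```python
from typing import Any, Dict, List

SOURCE_TYPES = {"urls", "sitemap", "actor_dataset", "file_upload"}
CHUNK_STRATEGIES = {"fixed_size", "semantic", "paragraph", "sentence"}
OUTPUT_FORMATS = {"jsonl", "pinecone", "weaviate", "qdrant", "chroma", "agentdb"}


def validate_input(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an actor input; empty means it looks valid."""
    errors: List[str] = []

    def check_enum(key: str, allowed: set, default: str) -> None:
        value = config.get(key, default)
        if value not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}, got {value!r}")

    def check_range(key: str, low: int, high: int, default: int) -> None:
        value = config.get(key, default)
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{key} must be an integer in [{low}, {high}], got {value!r}")

    check_enum("sourceType", SOURCE_TYPES, "urls")
    check_enum("chunkStrategy", CHUNK_STRATEGIES, "fixed_size")
    check_enum("outputFormat", OUTPUT_FORMATS, "jsonl")
    check_range("crawlDepth", 1, 5, 1)
    check_range("chunkSize", 256, 2048, 512)
    check_range("chunkOverlap", 0, 256, 128)

    # Assumption (not stated in the table): overlap should be smaller than the chunk.
    if not errors and config.get("chunkOverlap", 128) >= config.get("chunkSize", 512):
        errors.append("chunkOverlap should be smaller than chunkSize")
    return errors


print(validate_input({"crawlDepth": 9, "outputFormat": "csv"}))
```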
Chunking Strategies Comparison
| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed Size | General purpose | Consistent size, predictable costs | May split mid-sentence |
| Semantic | Documentation, articles | Preserves topic coherence | Variable chunk sizes |
| Paragraph | Natural text | Natural boundaries | Size variability |
| Sentence | Q&A, definitions | High precision | Many small chunks |
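The trade-off in the Paragraph row (natural boundaries, variable size) is easy to see in a small sketch. The function below is a hypothetical illustration, not the actor's code: it splits on blank lines, packs whole paragraphs into a chunk until the target size would be exceeded, and approximates tokens with whitespace-separated words.

```python
import re
from typing import List


def paragraph_chunks(text: str, chunk_size: int = 384) -> List[str]:
    """Group whole paragraphs into chunks of at most roughly `chunk_size` tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # assumption: words approximate tokens
        if current and current_len + n > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)  # a paragraph is never split, even if it is oversized
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "one two three\n\nfour five\n\nsix seven eight nine"
print(paragraph_chunks(text, chunk_size=5))
# ['one two three\n\nfour five', 'six seven eight nine']
```

Because paragraphs are kept whole, a single long paragraph yields a chunk larger than the target, which is the size variability noted in the table.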
Recommended Configurations
Documentation Sites:
{
"chunkStrategy": "semantic",
"chunkSize": 512,
"chunkOverlap": 128,
"crawlDepth": 3
}
Product Catalogs:
{
"chunkStrategy": "paragraph",
"chunkSize": 384,
"chunkOverlap": 64,
"crawlDepth": 2
}
Blog Posts:
{
"chunkStrategy": "semantic",
"chunkSize": 768,
"chunkOverlap": 192,
"crawlDepth": 1
}
📊 Output Format Examples
JSONL (Generic)
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Python is a high-level programming language...",
"embedding": [0.123, -0.456, 0.789, ...],
"tokenCount": 512,
"metadata": {
"source": "https://docs.python.org/3/tutorial/intro.html",
"title": "Introduction to Python",
"chunkIndex": 0,
"crawledAt": "2025-12-13T10:30:00Z"
}
}
Pinecone Format
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"values": [0.123, -0.456, 0.789, ...],
"metadata": {
"text": "Python is a high-level programming language...",
"source": "https://docs.python.org/3/tutorial/intro.html",
"title": "Introduction to Python",
"tokenCount": 512
}
}
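Comparing the two examples above, the Pinecone record is the JSONL record with embedding renamed to values and with text and tokenCount moved into metadata. A minimal conversion sketch, assuming exactly the field names shown in these examples (jsonl_to_pinecone is a hypothetical helper, not part of the actor):

```python
from typing import Any, Dict


def jsonl_to_pinecone(record: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a generic JSONL chunk record into the Pinecone layout shown above."""
    metadata: Dict[str, Any] = {
        "text": record["text"],
        "tokenCount": record["tokenCount"],
    }
    # Copy over the metadata fields that the Pinecone example keeps.
    for key in ("source", "title"):
        if key in record.get("metadata", {}):
            metadata[key] = record["metadata"][key]
    return {"id": record["id"], "values": record["embedding"], "metadata": metadata}


record = {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "text": "Python is a high-level programming language...",
    "embedding": [0.123, -0.456, 0.789],
    "tokenCount": 512,
    "metadata": {
        "source": "https://docs.python.org/3/tutorial/intro.html",
        "title": "Introduction to Python",
        "chunkIndex": 0,
    },
}
print(jsonl_to_pinecone(record))
```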
AgentDB Format (ruv.io)
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"vector": [0.123, -0.456, 0.789, ...],
"text": "Python is a high-level programming language...",
"metadata": {
"tokenCount": 512,
"source": "https://docs.python.org/3/tutorial/intro.html",
"title": "Introduction to Python",
"indexed_at": "2025-12-13T10:30:00Z",
"source": "apify-rag-builder"
}
}
🎯 RAG Implementation Guide
Step 1: Build Knowledge Base
Run this actor with your content sources.
Step 2: Load into Vector Database
# Pinecone example (expects a run with outputFormat: "pinecone")
from apify_client import ApifyClient
from pinecone import Pinecone
client = ApifyClient('YOUR_APIFY_TOKEN')
dataset = client.dataset('YOUR_DATASET_ID').list_items().items
# Connect with the Pinecone client (v3+ API)
pc = Pinecone(api_key='YOUR_PINECONE_KEY')
index = pc.Index('knowledge-base')
# Upload chunks as (id, values, metadata) tuples
index.upsert(vectors=[
    (item['id'], item['values'], item['metadata'])
    for item in dataset
])
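For larger datasets, sending every vector in a single upsert call can run into request size limits, so uploads are usually split into batches. The helper below is a generic sketch; the name batched and the default batch size of 100 are assumptions, so check your vector database's documented limits.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


vectors = [(f"id-{i}", [0.0, 1.0], {"tokenCount": 10}) for i in range(250)]
print([len(batch) for batch in batched(vectors, batch_size=100)])
# [100, 100, 50]

# Usage with the upload step (assumes `index` is your Pinecone index handle):
# for batch in batched(vectors, batch_size=100):
#     index.upsert(vectors=batch)
```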
Step 3: Implement RAG Query
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
# Create retriever from a vector store built from your chunks
# (for example, the vectorstore from the LangChain Integration example)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Create QA chain; "stuff" places all retrieved chunks into one prompt
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=retriever
)
# Query knowledge base
answer = qa.run("How do I install Python packages?")
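If you are not using a framework, the retrieval step boils down to pasting the retrieved chunks into the prompt ahead of the question. The sketch below shows that idea under stated assumptions: build_prompt is a hypothetical helper, the chunks follow the generic JSONL record shape (text, tokenCount, metadata.source), and the 2000-token context budget is an arbitrary placeholder.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict], max_context_tokens: int = 2000) -> str:
    """Concatenate retrieved chunks (most relevant first) into a single prompt."""
    context_parts: List[str] = []
    used = 0
    for chunk in chunks:
        if used + chunk["tokenCount"] > max_context_tokens:
            break  # stop before the context budget is exceeded
        context_parts.append(f"[{chunk['metadata']['source']}]\n{chunk['text']}")
        used += chunk["tokenCount"]
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    {"text": "Use pip install <package>.", "tokenCount": 1500,
     "metadata": {"source": "https://docs.python.org/3/installing/index.html"}},
    {"text": "Virtual environments isolate packages.", "tokenCount": 1000,
     "metadata": {"source": "https://docs.python.org/3/tutorial/venv.html"}},
]
print(build_prompt("How do I install Python packages?", retrieved))
```

With these sample values only the first chunk fits the budget, so the second is left out of the prompt.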
🔍 SEO Keywords
This actor is optimized for:
- RAG (Retrieval Augmented Generation) - Build knowledge bases for RAG systems
- Knowledge Base - Create searchable knowledge bases from web content
- Embeddings - Generate vector embeddings for semantic search
- Vector Database - Export to Pinecone, Weaviate, Qdrant, Chroma
- Semantic Search - Enable semantic search with embeddings
- LLM - Augment large language models with custom knowledge
- Chatbot - Power AI chatbots with accurate, up-to-date information
- AI Assistant - Build intelligent assistants with domain-specific knowledge
- Text Chunking - Intelligent text chunking for optimal retrieval
- Document Indexing - Index documents for AI-powered search
- OpenAI Embeddings - Generate OpenAI text-embedding-3 vectors
- Cohere Embeddings - Use Cohere Embed models for embeddings
- Web Crawling for AI - Crawl and process web content for AI applications
- Documentation Indexing - Index documentation sites for AI assistants
- Knowledge Graph - Build structured knowledge for AI systems
🌐 Use Cases
- Customer Support Chatbots - Index support articles, FAQs, product docs
- Documentation Assistants - Create AI assistants for technical documentation
- E-commerce Recommendations - Build product knowledge bases for shopping assistants
- Research Tools - Index academic papers, research articles
- Internal Knowledge Management - Organize company wikis, policies, procedures
- Content Discovery - Enable semantic search across blog posts, articles
- Legal Document Search - Index contracts, case law, regulations
- Educational Platforms - Build Q&A systems for course materials
- Healthcare Information - Index medical knowledge bases (ensure compliance)
- News & Media - Create searchable news archives with semantic search
🔗 Integration Examples
LangChain Integration
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from apify_client import ApifyClient
# Load chunks from Apify
client = ApifyClient('YOUR_TOKEN')
chunks = client.dataset('DATASET_ID').list_items().items
# Create vector store (texts are re-embedded with OpenAIEmbeddings)
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_texts(
    [chunk['text'] for chunk in chunks],
    embeddings,
    metadatas=[chunk['metadata'] for chunk in chunks],
    index_name='knowledge-base'  # name of an existing Pinecone index
)
LlamaIndex Integration
from llama_index.core import VectorStoreIndex, Document
from apify_client import ApifyClient
# Load chunks
client = ApifyClient('YOUR_TOKEN')
chunks = client.dataset('DATASET_ID').list_items().items
# Create documents
documents = [
Document(text=chunk['text'], metadata=chunk['metadata'])
for chunk in chunks
]
# Build index
index = VectorStoreIndex.from_documents(documents)
Haystack Integration
from haystack.document_stores import PineconeDocumentStore
from apify_client import ApifyClient
# Initialize document store
document_store = PineconeDocumentStore(
api_key='YOUR_PINECONE_KEY',
index='knowledge-base'
)
# Load and write chunks
client = ApifyClient('YOUR_TOKEN')
chunks = client.dataset('DATASET_ID').list_items().items
document_store.write_documents([
{"content": chunk['text'], "meta": chunk['metadata']}
for chunk in chunks
])
🛠️ Advanced Configuration
Custom Chunking Logic
For specialized chunking needs, fork this actor and modify the ChunkingEngine class in src/main.js.
Embedding Model Customization
Add your own embedding model by extending the EmbeddingEngine class:
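async customEmbedding(text) {
  // Your custom embedding logic
  const response = await fetch('YOUR_EMBEDDING_API', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status}`);
  }
  // response.json() returns a Promise, so await it before reading the field
  const data = await response.json();
  return data.embedding;
}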
Output Format Extensions
Add new vector database formats by extending the OutputFormatter class:
customDbFormat(chunk, embedding) {
return {
// Your custom format
};
}
📈 Performance Tips
- Optimize Chunk Size: 512 tokens balances context and retrieval accuracy
- Use Overlap: 128-256 token overlap prevents context loss at boundaries
- Enable Deduplication: Reduces storage costs and improves search quality
- Exclude Navigation: Remove menus, footers with excludeSelectors
- Limit Crawl Depth: Deep crawls may include low-quality pages
- Batch Processing: Process large sites in chunks with maxPages
- Monitor Token Usage: Track tokenCount to estimate embedding costs
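Estimating embedding cost from tokenCount is simple arithmetic: sum the token counts and multiply by the provider's price per token. The sketch below assumes chunk records with a top-level tokenCount (as in the JSONL format); the function name and the price used in the example are placeholders, so substitute your provider's current rate.

```python
from typing import Dict, List, Tuple


def estimate_embedding_cost(chunks: List[Dict], price_per_million_tokens: float) -> Tuple[int, float]:
    """Return (total tokens, estimated cost) for embedding all chunks once."""
    total_tokens = sum(chunk["tokenCount"] for chunk in chunks)
    cost = total_tokens / 1_000_000 * price_per_million_tokens
    return total_tokens, cost


chunks = [{"tokenCount": 512}, {"tokenCount": 488}]
# 0.02 is a placeholder price per million tokens, not a quoted rate.
total, cost = estimate_embedding_cost(chunks, price_per_million_tokens=0.02)
print(total, f"{cost:.6f}")  # 1000 0.000020
```

Chunk overlap increases the total, since overlapping tokens are embedded once per chunk they appear in.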
💡 Best Practices
Content Preparation
- Clean HTML with excludeSelectors before chunking
- Use includeSelectors for focused content extraction
- Add custom metadata to improve search filtering
Chunking Strategy Selection
- Documentation: Semantic chunking preserves topic coherence
- Q&A: Sentence chunking for precise answers
- Mixed Content: Fixed-size chunking for consistency
Embedding Selection
- OpenAI text-embedding-3-small: Best cost/performance balance
- OpenAI text-embedding-3-large: Highest accuracy, higher cost
- Cohere: Good alternative to OpenAI
- Local: Privacy-focused, no API costs
Vector Database Selection
- Pinecone: Managed, scalable, easy setup
- Weaviate: Open-source, GraphQL API
- Qdrant: High performance, Rust-based
- Chroma: Lightweight, Python-first
- AgentDB: Optimized for ruv.io ecosystem
🆘 Troubleshooting
Issue: No chunks created
- Solution: Check includeSelectors and excludeSelectors configuration
Issue: Chunks too small/large
- Solution: Adjust chunkSize parameter (recommended: 384-768)
Issue: Too many duplicates
- Solution: Make sure deduplication is enabled, then lower similarityThreshold so that more near-duplicate chunks are removed
Issue: Missing embeddings
- Solution: Verify API keys for OpenAI/Cohere models
Issue: Crawl stops early
- Solution: Increase maxPages or check crawlDepth setting
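When diagnosing chunk-size problems, a quick summary of tokenCount across the dataset shows whether chunks cluster inside the recommended 384-768 range. A small sketch, assuming records with a top-level tokenCount as in the JSONL format (chunk_size_report is a hypothetical helper):

```python
from statistics import mean
from typing import Dict, List


def chunk_size_report(chunks: List[Dict], low: int = 384, high: int = 768) -> Dict[str, float]:
    """Summarize token counts and count chunks outside the [low, high] range."""
    counts = [chunk["tokenCount"] for chunk in chunks]
    if not counts:
        return {"count": 0}
    return {
        "count": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "outside_recommended": sum(1 for n in counts if not low <= n <= high),
    }


report = chunk_size_report([{"tokenCount": 100}, {"tokenCount": 500}, {"tokenCount": 900}])
print(report)
# {'count': 3, 'min': 100, 'max': 900, 'mean': 500, 'outside_recommended': 2}
```

A report with count 0 points back to the selector configuration, while a large outside_recommended value suggests adjusting chunkSize or switching chunking strategy.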
📞 Support & Resources
- Actor Source: GitHub - ruvnet/ruvector
- Documentation: ruv.io RAG Guide
- Vector Database: AgentDB by ruv.io
- Issues: GitHub Issues
- Community: Discord Community
📜 License
Apache 2.0 - See LICENSE file for details
👨‍💻 Author
rUv
- Website: https://ruv.io
- Email: info@ruv.io
- GitHub: @ruvnet
Built with ❤️ by rUv | Powered by Apify | Part of the ruvector ecosystem
🚀 Get Started Now
- Run this actor on Apify - Sign up for free tier
- Configure your knowledge sources - Add URLs, sitemap, or datasets
- Choose chunking strategy - Select from 4 intelligent methods
- Generate embeddings - Use OpenAI, Cohere, or local models
- Export to vector DB - Pinecone, Weaviate, Qdrant, Chroma, AgentDB
- Build your RAG system - Integrate with LangChain, LlamaIndex, or custom
🎯 From web content to production RAG in minutes!