---
title: "Web Crawler Connector"
description: "Crawl and sync websites automatically with scheduled recrawling and robots.txt compliance"
icon: "globe"
---

Connect websites to automatically crawl and sync web pages into your Supermemory knowledge base. The web crawler respects robots.txt rules, includes SSRF protection, and automatically recrawls sites on a schedule.

<Warning>
The web crawler connector requires a **Scale Plan** or **Enterprise Plan**.
</Warning>

## Quick Setup

### 1. Create Web Crawler Connection

<Tabs>
<Tab title="TypeScript">
```typescript
import Supermemory from 'supermemory';

const client = new Supermemory({
  apiKey: process.env.SUPERMEMORY_API_KEY!
});

const connection = await client.connections.create('web-crawler', {
  redirectUrl: 'https://yourapp.com/callback',
  containerTags: ['user-123', 'website-sync'],
  documentLimit: 5000,
  metadata: {
    startUrl: 'https://docs.example.com'
  }
});

// Web crawler doesn't require OAuth - connection is ready immediately
console.log('Connection ID:', connection.id);
console.log('Connection created:', connection.createdAt);
// Note: connection.authLink is undefined for web-crawler
```
</Tab>
<Tab title="Python">
```python
from supermemory import Supermemory
import os

client = Supermemory(api_key=os.environ.get("SUPERMEMORY_API_KEY"))

connection = client.connections.create(
    'web-crawler',
    redirect_url='https://yourapp.com/callback',
    container_tags=['user-123', 'website-sync'],
    document_limit=5000,
    metadata={
        'startUrl': 'https://docs.example.com'
    }
)

# Web crawler doesn't require OAuth - connection is ready immediately
print(f'Connection ID: {connection.id}')
print(f'Connection created: {connection.created_at}')
# Note: connection.auth_link is None for web-crawler
```
</Tab>
<Tab title="cURL">
```bash
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "redirectUrl": "https://yourapp.com/callback",
    "containerTags": ["user-123", "website-sync"],
    "documentLimit": 5000,
    "metadata": {
      "startUrl": "https://docs.example.com"
    }
  }'

# Response: {
#   "id": "conn_wc123",
#   "redirectsTo": "https://yourapp.com/callback",
#   "authLink": null,
#   "expiresIn": null
# }
```
</Tab>
</Tabs>

### 2. Connection Established

Unlike other connectors, the web crawler doesn't require OAuth authentication. The connection is established immediately upon creation, and crawling begins automatically.
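
Because no auth link is returned, code that supports several providers can simply skip the redirect step for `web-crawler`. A minimal sketch, reusing the TypeScript client from step 1 (the `provider` value is hardcoded here for illustration):

```typescript
// Hypothetical provider value - could also be an OAuth provider like 'notion'
const provider = 'web-crawler';

const conn = await client.connections.create(provider, {
  redirectUrl: 'https://yourapp.com/callback',
  containerTags: ['user-123'],
  metadata: { startUrl: 'https://docs.example.com' }
});

if (conn.authLink) {
  // OAuth providers return a link the user must visit to grant access
  console.log('Send the user to:', conn.authLink);
} else {
  // web-crawler: no user action needed - crawling starts right away
  console.log(`Connection ${conn.id} is active`);
}
```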
|
|
### 3. Monitor Sync Progress

<Tabs>
<Tab title="TypeScript">
```typescript
// Check connection details
const connection = await client.connections.getByTags('web-crawler', {
  containerTags: ['user-123', 'website-sync']
});

console.log('Start URL:', connection.metadata?.startUrl);
console.log('Connection created:', connection.createdAt);

// List synced web pages
const documents = await client.connections.listDocuments('web-crawler', {
  containerTags: ['user-123', 'website-sync']
});

console.log(`Synced ${documents.length} web pages`);
```
</Tab>
<Tab title="Python">
```python
# Check connection details
connection = client.connections.get_by_tags(
    'web-crawler',
    container_tags=['user-123', 'website-sync']
)

print(f'Start URL: {connection.metadata.get("startUrl")}')
print(f'Connection created: {connection.created_at}')

# List synced web pages
documents = client.connections.list_documents(
    'web-crawler',
    container_tags=['user-123', 'website-sync']
)

print(f'Synced {len(documents)} web pages')
```
</Tab>
<Tab title="cURL">
```bash
# Get connection details by provider and tags
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/connection" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123", "website-sync"]}'

# Response includes connection details:
# {
#   "id": "conn_wc123",
#   "provider": "web-crawler",
#   "createdAt": "2024-01-15T10:00:00Z",
#   "documentLimit": 5000,
#   "metadata": {"startUrl": "https://docs.example.com", ...}
# }

# List synced documents
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/documents" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123", "website-sync"]}'

# Response: Array of document objects
# [
#   {"title": "Home Page", "type": "webpage", "status": "done", "url": "https://docs.example.com"},
#   {"title": "Getting Started", "type": "webpage", "status": "done", "url": "https://docs.example.com/getting-started"}
# ]
```
</Tab>
</Tabs>

## Supported Content Types

### Web Pages

- **HTML content** extracted and converted to markdown
- **Same-domain crawling** only (respects hostname boundaries; see the sketch below for covering multiple hostnames)
- **Robots.txt compliance** - respects disallow rules
- **Content filtering** - only HTML pages (skips non-HTML content)
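
Because crawling stays within a single hostname, a site split across subdomains needs one connection per start URL. A sketch of one reasonable pattern (the subdomain list is illustrative):

```typescript
// Each hostname gets its own web-crawler connection
const startUrls = [
  'https://docs.example.com',
  'https://blog.example.com'
];

for (const startUrl of startUrls) {
  const conn = await client.connections.create('web-crawler', {
    containerTags: ['user-123', 'website-sync'],
    metadata: { startUrl }
  });
  console.log(`Crawling ${startUrl} via connection ${conn.id}`);
}
```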
|
|
### URL Requirements

The web crawler only processes valid public URLs:

- Must be a public URL (not localhost, private IPs, or internal domains)
- Must be accessible from the internet
- Must return HTML content (non-HTML files are skipped)

## Sync Mechanism

The web crawler uses **scheduled recrawling** rather than real-time webhooks:

- **Initial Crawl**: Begins immediately after connection creation
- **Scheduled Recrawling**: Automatically recrawls sites that haven't been synced in 7+ days
- **No Real-time Updates**: Unlike other connectors, the web crawler doesn't support webhook-based real-time sync (see the polling sketch below)

<Note>
The recrawl schedule is automatically assigned when the connection is created. Sites are recrawled periodically to keep content up to date, but updates are not instantaneous.
</Note>
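
Since there are no webhooks, an application that needs to react to newly crawled pages can poll the documents list on its own schedule. A minimal sketch using the `listDocuments` call shown earlier (the polling interval and `seen` set are illustrative, not part of the API; the `url` field follows the document objects shown in the cURL example above):

```typescript
// Track which document URLs have already been processed
const seen = new Set<string>();

async function checkForNewPages() {
  const documents = await client.connections.listDocuments('web-crawler', {
    containerTags: ['user-123', 'website-sync']
  });

  for (const doc of documents) {
    if (!seen.has(doc.url)) {
      seen.add(doc.url);
      console.log('New page synced:', doc.url);
    }
  }
}

// Poll daily - recrawls happen on a 7+ day cadence, so frequent polling gains little
setInterval(checkForNewPages, 24 * 60 * 60 * 1000);
```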

## Connection Management

### List All Connections

<Tabs>
<Tab title="TypeScript">
```typescript
// List all web crawler connections
const connections = await client.connections.list({
  containerTags: ['user-123']
});

const webCrawlerConnections = connections.filter(
  conn => conn.provider === 'web-crawler'
);

webCrawlerConnections.forEach(conn => {
  console.log(`Start URL: ${conn.metadata?.startUrl}`);
  console.log(`Connection ID: ${conn.id}`);
  console.log(`Created: ${conn.createdAt}`);
});
```
</Tab>
<Tab title="Python">
```python
# List all web crawler connections
connections = client.connections.list(container_tags=['user-123'])

web_crawler_connections = [
    conn for conn in connections if conn.provider == 'web-crawler'
]

for conn in web_crawler_connections:
    print(f'Start URL: {conn.metadata.get("startUrl")}')
    print(f'Connection ID: {conn.id}')
    print(f'Created: {conn.created_at}')
```
</Tab>
<Tab title="cURL">
```bash
# List all connections
curl -X POST "https://api.supermemory.ai/v3/connections/list" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'

# Response: [
#   {
#     "id": "conn_wc123",
#     "provider": "web-crawler",
#     "createdAt": "2024-01-15T10:30:00.000Z",
#     "documentLimit": 5000,
#     "metadata": {"startUrl": "https://docs.example.com", ...}
#   }
# ]
```
</Tab>
</Tabs>

### Delete Connection

Remove a web crawler connection when no longer needed:

<Tabs>
<Tab title="TypeScript">
```typescript
// Delete by connection ID
const result = await client.connections.delete('connection_id_123');
console.log('Deleted connection:', result.id);

// Delete by provider and container tags
const providerResult = await client.connections.deleteByProvider('web-crawler', {
  containerTags: ['user-123']
});
console.log('Deleted web crawler connection for user');
```
</Tab>
<Tab title="Python">
```python
# Delete by connection ID
result = client.connections.delete('connection_id_123')
print(f'Deleted connection: {result.id}')

# Delete by provider and container tags
provider_result = client.connections.delete_by_provider(
    'web-crawler',
    container_tags=['user-123']
)
print('Deleted web crawler connection for user')
```
</Tab>
<Tab title="cURL">
```bash
# Delete by connection ID
curl -X DELETE "https://api.supermemory.ai/v3/connections/connection_id_123" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY"

# Delete by provider and container tags
curl -X DELETE "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'
```
</Tab>
</Tabs>

<Note>
Deleting a connection will:

- Stop all future crawls of the website
- Keep existing synced documents in Supermemory (they won't be deleted)
- Remove the connection configuration
</Note>

## Advanced Configuration

### Content Filtering

Control which web pages get synced using the settings API:

<Tabs>
<Tab title="TypeScript">
```typescript
// Configure intelligent filtering for web content
await client.settings.update({
  shouldLLMFilter: true,
  includeItems: {
    urlPatterns: ['*docs*', '*documentation*', '*guide*'],
    titlePatterns: ['*Getting Started*', '*API Reference*', '*Tutorial*']
  },
  excludeItems: {
    urlPatterns: ['*admin*', '*private*', '*test*'],
    titlePatterns: ['*Draft*', '*Archive*', '*Old*']
  },
  filterPrompt: "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
});
```
</Tab>
<Tab title="Python">
```python
# Configure intelligent filtering for web content
client.settings.update(
    should_llm_filter=True,
    include_items={
        'urlPatterns': ['*docs*', '*documentation*', '*guide*'],
        'titlePatterns': ['*Getting Started*', '*API Reference*', '*Tutorial*']
    },
    exclude_items={
        'urlPatterns': ['*admin*', '*private*', '*test*'],
        'titlePatterns': ['*Draft*', '*Archive*', '*Old*']
    },
    filter_prompt="Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
)
```
</Tab>
<Tab title="cURL">
```bash
# Configure intelligent filtering for web content
curl -X PATCH "https://api.supermemory.ai/v3/settings" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "shouldLLMFilter": true,
    "includeItems": {
      "urlPatterns": ["*docs*", "*documentation*", "*guide*"],
      "titlePatterns": ["*Getting Started*", "*API Reference*", "*Tutorial*"]
    },
    "excludeItems": {
      "urlPatterns": ["*admin*", "*private*", "*test*"],
      "titlePatterns": ["*Draft*", "*Archive*", "*Old*"]
    },
    "filterPrompt": "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
  }'
```
</Tab>
</Tabs>

## Security & Compliance

### SSRF Protection

Built-in protection against Server-Side Request Forgery (SSRF) attacks:

- Blocks private IP addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
- Blocks localhost and internal domains
- Blocks cloud metadata endpoints
- Only allows public, internet-accessible URLs
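
These checks can be mirrored client-side to fail fast before creating a connection. A rough sketch (illustrative only; the server performs its own authoritative validation, and this check doesn't cover DNS resolution or IPv6):

```typescript
// Rough client-side guard against obviously non-public start URLs
function isLikelyPublicUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a valid URL at all
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;

  const host = url.hostname;
  // localhost and unqualified internal hostnames
  if (host === 'localhost' || !host.includes('.')) return false;
  // RFC 1918 private ranges, loopback, and link-local/metadata addresses
  if (/^(10\.|192\.168\.|172\.(1[6-9]|2\d|3[01])\.|127\.|169\.254\.)/.test(host)) return false;

  return true;
}

isLikelyPublicUrl('https://docs.example.com'); // true
isLikelyPublicUrl('http://169.254.169.254');   // false (cloud metadata endpoint)
```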
|
|
### URL Validation

All URLs are validated before crawling:

- Must be valid HTTP/HTTPS URLs
- Must be publicly accessible
- Must return HTML content
- Response size limited to 10 MB

<Warning>
**Important Limitations:**

- Requires Scale Plan or Enterprise Plan
- Only crawls same-domain URLs
- Scheduled recrawling means updates are not real-time
- Large websites may take significant time to crawl initially
- Robots.txt restrictions may prevent crawling some pages
- URLs must be publicly accessible (no authentication required)
</Warning>