feat (docs): web crawler connector (#593)

Mahesh Sanikommu 2025-11-24 17:12:25 -08:00 committed by Mahesh Sanikommmu
parent 9bb4c204f0
commit 28bdcc1858
5 changed files with 437 additions and 13 deletions

@ -1,14 +1,14 @@
---
title: "Connectors Overview"
description: "Integrate Google Drive, Notion, and OneDrive to automatically sync documents into your knowledge base"
description: "Integrate Google Drive, Notion, OneDrive, and Web Crawler to automatically sync documents into your knowledge base"
sidebarTitle: "Overview"
---
Connect external platforms to automatically sync documents into Supermemory. Supported connectors include Google Drive, Notion, and OneDrive with real-time synchronization and intelligent content processing.
Connect external platforms to automatically sync documents into Supermemory. Supported connectors include Google Drive, Notion, OneDrive, and Web Crawler, with real-time or scheduled synchronization and intelligent content processing.
## Supported Connectors
<CardGroup cols={3}>
<CardGroup cols={2}>
<Card title="Google Drive" icon="google-drive" href="/connectors/google-drive">
**Google Docs, Slides, Sheets**
@ -26,6 +26,12 @@ Connect external platforms to automatically sync documents into Supermemory. Sup
Scheduled sync every 4 hours. Supports personal and business accounts with file versioning.
</Card>
<Card title="Web Crawler" icon="globe" href="/connectors/web-crawler">
**Web Pages, Documentation**
Crawl websites automatically with robots.txt compliance. Scheduled recrawling keeps content up to date.
</Card>
</CardGroup>
## Quick Start
@ -181,10 +187,10 @@ curl -X POST "https://api.supermemory.ai/v3/documents/list" \
### Authentication Flow
1. **Create Connection**: Call `/v3/connections/{provider}` to get OAuth URL
2. **User Authorization**: Redirect user to complete OAuth flow
1. **Create Connection**: Call `/v3/connections/{provider}` to get OAuth URL (or direct connection for web-crawler)
2. **User Authorization**: Redirect user to complete OAuth flow (not required for web-crawler)
3. **Automatic Setup**: Connection established, sync begins immediately
4. **Continuous Sync**: Real-time updates via webhooks + scheduled sync every 4 hours
4. **Continuous Sync**: Real-time updates via webhooks + scheduled sync every 4 hours (or scheduled recrawling for web-crawler)
### Document Processing Pipeline
@ -206,6 +212,7 @@ graph TD
| **Google Drive** | ✅ Webhooks (7-day expiry) | ✅ Every 4 hours | ✅ On-demand |
| **Notion** | ✅ Webhooks | ✅ Every 4 hours | ✅ On-demand |
| **OneDrive** | ✅ Webhooks (30-day expiry) | ✅ Every 4 hours | ✅ On-demand |
| **Web Crawler** | ❌ Not supported | ✅ Scheduled recrawling (7+ days) | ✅ On-demand |
## Connection Management

@ -0,0 +1,396 @@
---
title: "Web Crawler Connector"
description: "Crawl and sync websites automatically with scheduled recrawling and robots.txt compliance"
icon: "globe"
---
Connect websites to automatically crawl and sync web pages into your Supermemory knowledge base. The web crawler respects robots.txt rules, includes SSRF protection, and automatically recrawls sites on a schedule.
<Warning>
The web crawler connector requires a **Scale Plan** or **Enterprise Plan**.
</Warning>
## Quick Setup
### 1. Create Web Crawler Connection
<Tabs>
<Tab title="TypeScript">
```typescript
import Supermemory from 'supermemory';
const client = new Supermemory({
  apiKey: process.env.SUPERMEMORY_API_KEY!
});
const connection = await client.connections.create('web-crawler', {
  redirectUrl: 'https://yourapp.com/callback',
  containerTags: ['user-123', 'website-sync'],
  documentLimit: 5000,
  metadata: {
    startUrl: 'https://docs.example.com'
  }
});
// Web crawler doesn't require OAuth - connection is ready immediately
console.log('Connection ID:', connection.id);
console.log('Connection created:', connection.createdAt);
// Note: connection.authLink is null for web-crawler
```
</Tab>
<Tab title="Python">
```python
from supermemory import Supermemory
import os
client = Supermemory(api_key=os.environ.get("SUPERMEMORY_API_KEY"))
connection = client.connections.create(
    'web-crawler',
    redirect_url='https://yourapp.com/callback',
    container_tags=['user-123', 'website-sync'],
    document_limit=5000,
    metadata={
        'startUrl': 'https://docs.example.com'
    }
)
# Web crawler doesn't require OAuth - connection is ready immediately
print(f'Connection ID: {connection.id}')
print(f'Connection created: {connection.created_at}')
# Note: connection.auth_link is None for web-crawler
```
</Tab>
<Tab title="cURL">
```bash
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "redirectUrl": "https://yourapp.com/callback",
    "containerTags": ["user-123", "website-sync"],
    "documentLimit": 5000,
    "metadata": {
      "startUrl": "https://docs.example.com"
    }
  }'
# Response: {
#   "id": "conn_wc123",
#   "redirectsTo": "https://yourapp.com/callback",
#   "authLink": null,
#   "expiresIn": null
# }
```
</Tab>
</Tabs>
### 2. Connection Established
Unlike other connectors, the web crawler doesn't require OAuth authentication. The connection is established immediately upon creation, and crawling begins automatically.
### 3. Monitor Sync Progress
<Tabs>
<Tab title="TypeScript">
```typescript
// Check connection details
const connection = await client.connections.getByTags('web-crawler', {
  containerTags: ['user-123', 'website-sync']
});
console.log('Start URL:', connection.metadata?.startUrl);
console.log('Connection created:', connection.createdAt);
// List synced web pages
const documents = await client.connections.listDocuments('web-crawler', {
  containerTags: ['user-123', 'website-sync']
});
console.log(`Synced ${documents.length} web pages`);
```
</Tab>
<Tab title="Python">
```python
# Check connection details
connection = client.connections.get_by_tags(
    'web-crawler',
    container_tags=['user-123', 'website-sync']
)
print(f'Start URL: {connection.metadata.get("startUrl")}')
print(f'Connection created: {connection.created_at}')
# List synced web pages
documents = client.connections.list_documents(
    'web-crawler',
    container_tags=['user-123', 'website-sync']
)
print(f'Synced {len(documents)} web pages')
```
</Tab>
<Tab title="cURL">
```bash
# Get connection details by provider and tags
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/connection" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123", "website-sync"]}'
# Response includes connection details:
# {
#   "id": "conn_wc123",
#   "provider": "web-crawler",
#   "createdAt": "2024-01-15T10:00:00Z",
#   "documentLimit": 5000,
#   "metadata": {"startUrl": "https://docs.example.com", ...}
# }
# List synced documents
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/documents" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123", "website-sync"]}'
# Response: Array of document objects
# [
#   {"title": "Home Page", "type": "webpage", "status": "done", "url": "https://docs.example.com"},
#   {"title": "Getting Started", "type": "webpage", "status": "done", "url": "https://docs.example.com/getting-started"}
# ]
```
</Tab>
</Tabs>
## Supported Content Types
### Web Pages
- **HTML content** extracted and converted to markdown
- **Same-domain crawling** only (respects hostname boundaries)
- **Robots.txt compliance** - respects disallow rules
- **Content filtering** - only HTML pages (skips non-HTML content)
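The same-host and robots.txt rules above can be illustrated with Python's standard library. This is a minimal sketch, not Supermemory's actual crawler: the `is_crawlable` helper, the `example-crawler` user agent, and the sample robots.txt content are all assumptions made for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawlable(start_url: str, candidate_url: str, robots: RobotFileParser,
                 user_agent: str = "example-crawler") -> bool:
    """Return True if candidate_url shares the start URL's hostname
    and robots.txt does not disallow it for the given user agent."""
    same_host = urlparse(candidate_url).hostname == urlparse(start_url).hostname
    return same_host and robots.can_fetch(user_agent, candidate_url)


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())  # parse the rules without a network call
```

With these sample rules, a page under `/admin/` or on a different hostname would be skipped, while other pages on the start URL's host would be eligible.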
### URL Requirements
The web crawler only processes valid public URLs:
- Must be a public URL (not localhost, private IPs, or internal domains)
- Must be accessible from the internet
- Must return HTML content (non-HTML files are skipped)
## Sync Mechanism
The web crawler uses **scheduled recrawling** rather than real-time webhooks:
- **Initial Crawl**: Begins immediately after connection creation
- **Scheduled Recrawling**: Automatically recrawls sites that haven't been synced in 7+ days
- **No Real-time Updates**: Unlike other connectors, the web crawler doesn't support webhook-based real-time sync
<Note>
The recrawl schedule is automatically assigned when the connection is created. Sites are recrawled periodically to keep content up to date, but updates are not instantaneous.
</Note>
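As a rough illustration of the "7+ days" rule, the check below decides whether a site is due for a recrawl. It is a sketch under assumptions: the `is_due_for_recrawl` helper and the idea of a stored last-sync timestamp are hypothetical, and the actual scheduler may work differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RECRAWL_INTERVAL = timedelta(days=7)  # taken from the "7+ days" rule above


def is_due_for_recrawl(last_synced_at: Optional[datetime], now: datetime) -> bool:
    """Return True if the site was never synced or was last synced 7+ days ago."""
    if last_synced_at is None:
        return True
    return now - last_synced_at >= RECRAWL_INTERVAL


now = datetime(2025, 11, 24, tzinfo=timezone.utc)
print(is_due_for_recrawl(now - timedelta(days=8), now))  # True
```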
## Connection Management
### List All Connections
<Tabs>
<Tab title="TypeScript">
```typescript
// List all web crawler connections
const connections = await client.connections.list({
  containerTags: ['user-123']
});
const webCrawlerConnections = connections.filter(
  conn => conn.provider === 'web-crawler'
);
webCrawlerConnections.forEach(conn => {
  console.log(`Start URL: ${conn.metadata?.startUrl}`);
  console.log(`Connection ID: ${conn.id}`);
  console.log(`Created: ${conn.createdAt}`);
});
```
</Tab>
<Tab title="Python">
```python
# List all web crawler connections
connections = client.connections.list(container_tags=['user-123'])
web_crawler_connections = [
    conn for conn in connections if conn.provider == 'web-crawler'
]
for conn in web_crawler_connections:
    print(f'Start URL: {conn.metadata.get("startUrl")}')
    print(f'Connection ID: {conn.id}')
    print(f'Created: {conn.created_at}')
```
</Tab>
<Tab title="cURL">
```bash
# List all connections
curl -X POST "https://api.supermemory.ai/v3/connections/list" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'
# Response: [
#   {
#     "id": "conn_wc123",
#     "provider": "web-crawler",
#     "createdAt": "2024-01-15T10:30:00.000Z",
#     "documentLimit": 5000,
#     "metadata": {"startUrl": "https://docs.example.com", ...}
#   }
# ]
```
</Tab>
</Tabs>
### Delete Connection
Remove a web crawler connection when no longer needed:
<Tabs>
<Tab title="TypeScript">
```typescript
// Delete by connection ID
const result = await client.connections.delete('connection_id_123');
console.log('Deleted connection:', result.id);
// Delete by provider and container tags
const providerResult = await client.connections.deleteByProvider('web-crawler', {
  containerTags: ['user-123']
});
console.log('Deleted web crawler connection for user');
```
</Tab>
<Tab title="Python">
```python
# Delete by connection ID
result = client.connections.delete('connection_id_123')
print(f'Deleted connection: {result.id}')
# Delete by provider and container tags
provider_result = client.connections.delete_by_provider(
    'web-crawler',
    container_tags=['user-123']
)
print('Deleted web crawler connection for user')
```
</Tab>
<Tab title="cURL">
```bash
# Delete by connection ID
curl -X DELETE "https://api.supermemory.ai/v3/connections/connection_id_123" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY"
# Delete by provider and container tags
curl -X DELETE "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'
```
</Tab>
</Tabs>
<Note>
Deleting a connection will:
- Stop all future crawls of the website
- Keep existing synced documents in Supermemory (they won't be deleted)
- Remove the connection configuration
</Note>
## Advanced Configuration
### Content Filtering
Control which web pages get synced using the settings API:
<Tabs>
<Tab title="TypeScript">
```typescript
// Configure intelligent filtering for web content
await client.settings.update({
  shouldLLMFilter: true,
  includeItems: {
    urlPatterns: ['*docs*', '*documentation*', '*guide*'],
    titlePatterns: ['*Getting Started*', '*API Reference*', '*Tutorial*']
  },
  excludeItems: {
    urlPatterns: ['*admin*', '*private*', '*test*'],
    titlePatterns: ['*Draft*', '*Archive*', '*Old*']
  },
  filterPrompt: "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
});
```
</Tab>
<Tab title="Python">
```python
# Configure intelligent filtering for web content
client.settings.update(
    should_llm_filter=True,
    include_items={
        'urlPatterns': ['*docs*', '*documentation*', '*guide*'],
        'titlePatterns': ['*Getting Started*', '*API Reference*', '*Tutorial*']
    },
    exclude_items={
        'urlPatterns': ['*admin*', '*private*', '*test*'],
        'titlePatterns': ['*Draft*', '*Archive*', '*Old*']
    },
    filter_prompt="Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
)
```
</Tab>
<Tab title="cURL">
```bash
# Configure intelligent filtering for web content
curl -X PATCH "https://api.supermemory.ai/v3/settings" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "shouldLLMFilter": true,
    "includeItems": {
      "urlPatterns": ["*docs*", "*documentation*", "*guide*"],
      "titlePatterns": ["*Getting Started*", "*API Reference*", "*Tutorial*"]
    },
    "excludeItems": {
      "urlPatterns": ["*admin*", "*private*", "*test*"],
      "titlePatterns": ["*Draft*", "*Archive*", "*Old*"]
    },
    "filterPrompt": "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
  }'
```
</Tab>
</Tabs>
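To get a feel for how the include and exclude patterns might interact, the sketch below evaluates them locally. It rests on assumptions the docs do not confirm: that patterns are shell-style globs, that matching is case-sensitive, and that exclusions take precedence over inclusions. The `passes_url_filter` helper is hypothetical, and the LLM-based filtering step is not modeled.

```python
from fnmatch import fnmatchcase
from typing import List


def passes_url_filter(url: str, include_patterns: List[str],
                      exclude_patterns: List[str]) -> bool:
    """Assumed semantics: any exclude match rejects the URL; otherwise the URL
    must match at least one include pattern (or no include patterns are set)."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    if not include_patterns:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_patterns)


INCLUDE = ['*docs*', '*documentation*', '*guide*']
EXCLUDE = ['*admin*', '*private*', '*test*']

print(passes_url_filter('https://docs.example.com/getting-started', INCLUDE, EXCLUDE))  # True
print(passes_url_filter('https://docs.example.com/admin/users', INCLUDE, EXCLUDE))      # False
```

Under this glob interpretation, a broad pattern such as `*test*` would also match a URL containing `latest`, so narrower patterns may be worth considering.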
## Security & Compliance
### SSRF Protection
Built-in protection against Server-Side Request Forgery (SSRF) attacks:
- Blocks private IP addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
- Blocks localhost and internal domains
- Blocks cloud metadata endpoints
- Only allows public, internet-accessible URLs
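A client-side pre-check along the same lines could look like the sketch below, which rejects obviously non-public URLs before a connection is created. It is not the crawler's actual implementation: the `is_public_url` helper and the `BLOCKED_HOSTS` list are assumptions, and hostnames are not resolved here, so a complete SSRF defense would also validate every resolved address.

```python
import ipaddress
from urllib.parse import urlparse

# Hostnames treated as internal in this sketch (an assumption, not the real blocklist).
BLOCKED_HOSTS = {"localhost", "metadata.google.internal"}


def is_public_url(url: str) -> bool:
    """Return False for non-HTTP(S) URLs, known internal hostnames,
    and IP literals that are not globally routable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host in BLOCKED_HOSTS or host.endswith((".local", ".internal")):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; DNS resolution checks are out of scope for this sketch.
        return True
    # is_global is False for private, loopback, and link-local ranges,
    # which covers 10.x, 172.16-31.x, 192.168.x, and 169.254.169.254.
    return ip.is_global
```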
### URL Validation
All URLs are validated before crawling:
- Must be valid HTTP/HTTPS URLs
- Must be publicly accessible
- Must return HTML content
- Response size limited to 10 MB
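The content-type and size rules above can be expressed as a small predicate. This is a sketch under assumptions: the `is_acceptable_response` helper is hypothetical, only `text/html` is treated as HTML, and the limit is interpreted as 10 × 1024 × 1024 bytes, which the docs do not specify.

```python
# Assumed interpretation of the documented limit: 10 * 1024 * 1024 bytes.
MAX_RESPONSE_BYTES = 10 * 1024 * 1024


def is_acceptable_response(content_type: str, body_size: int) -> bool:
    """Return True for HTML responses that fit within the size limit."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html" and body_size <= MAX_RESPONSE_BYTES


print(is_acceptable_response("text/html; charset=utf-8", 52_000))  # True
print(is_acceptable_response("application/pdf", 52_000))           # False
```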
<Warning>
**Important Limitations:**
- Requires Scale Plan or Enterprise Plan
- Only crawls same-domain URLs
- Scheduled recrawling means updates are not real-time
- Large websites may take significant time to crawl initially
- Robots.txt restrictions may prevent crawling some pages
- URLs must be publicly accessible (no authentication required)
</Warning>

@ -138,6 +138,7 @@
"connectors/notion",
"connectors/google-drive",
"connectors/onedrive",
"connectors/web-crawler",
"connectors/troubleshooting"
]
},

@ -13,9 +13,15 @@ const client = new Supermemory({
apiKey: process.env['SUPERMEMORY_API_KEY'], // This is the default and can be omitted
});
// For OAuth providers (notion, google-drive, onedrive)
const connection = await client.connections.create('notion');
console.debug(connection.authLink);
// For web-crawler (no OAuth required)
const webCrawlerConnection = await client.connections.create('web-crawler', {
  metadata: { startUrl: 'https://docs.example.com' }
});
console.debug(webCrawlerConnection.id); // authLink will be null
```
```python Python
@ -57,12 +63,14 @@ curl --request POST \
### Parameters
- `provider`: The provider to connect to. Currently supported providers are `notion`, `google-drive`, `one-drive`
- `provider`: The provider to connect to. Currently supported providers are `notion`, `google-drive`, `onedrive`, `web-crawler`
- `redirectUrl`: The URL to redirect to after the connection is created (your app URL)
- Note: For `web-crawler`, this is optional as no OAuth flow is required
- `containerTags`: Optional. For partitioning users, organizations, etc. in your app.
- Example: `["user_123", "project_alpha"]`
- `metadata`: Optional. Any metadata you want to associate with the connection.
- This metadata is added to every document synced from this connection.
- For `web-crawler`, the metadata must include `startUrl`: `{"startUrl": "https://example.com"}`
- `documentLimit`: Optional. The maximum number of documents to sync from this connection.
- Default: 10,000
- This can be used to limit costs and sync a set number of documents for a specific user.
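The parameter rules above can be checked before sending a request. The sketch below builds the JSON body using the field names listed above; the `build_connection_payload` helper itself is hypothetical and is not part of the SDK.

```python
from typing import Any, Dict, List, Optional


def build_connection_payload(
    provider: str,
    metadata: Optional[Dict[str, Any]] = None,
    redirect_url: Optional[str] = None,
    container_tags: Optional[List[str]] = None,
    document_limit: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the request body, enforcing that web-crawler has metadata.startUrl."""
    metadata = dict(metadata or {})
    if provider == "web-crawler" and not metadata.get("startUrl"):
        raise ValueError("web-crawler connections need metadata['startUrl']")
    payload: Dict[str, Any] = {}
    # Only include the optional fields that were actually provided.
    if metadata:
        payload["metadata"] = metadata
    if redirect_url is not None:
        payload["redirectUrl"] = redirect_url
    if container_tags is not None:
        payload["containerTags"] = container_tags
    if document_limit is not None:
        payload["documentLimit"] = document_limit
    return payload
```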
@ -80,6 +88,10 @@ supermemory sends a response with the following schema:
}
```
You can use the `authLink` to redirect the user to the provider's login page.
For most providers (notion, google-drive, onedrive), you can use the `authLink` to redirect the user to the provider's login page.
<Note>
**Web Crawler Exception:** For the `web-crawler` provider, `authLink` and `expiresIn` will be `null` since no OAuth flow is required. The connection is established immediately upon creation.
</Note>
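Application code can branch on `authLink` to handle both cases. The following is a minimal sketch that operates on the parsed JSON response; the `next_step` helper and its return values are illustrative assumptions.

```python
from typing import Any, Dict, Tuple


def next_step(response: Dict[str, Any]) -> Tuple[str, str]:
    """Decide what to do after creating a connection.

    Returns ("redirect", auth_link) for OAuth providers, or
    ("ready", connection_id) when authLink is null (web-crawler).
    """
    auth_link = response.get("authLink")
    if auth_link:
        return ("redirect", auth_link)
    return ("ready", response["id"])


web_crawler_response = {
    "id": "conn_wc123",
    "redirectsTo": "https://yourapp.com/callback",
    "authLink": None,
    "expiresIn": None,
}
print(next_step(web_crawler_response))  # ('ready', 'conn_wc123')
```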
Next up, managing connections.

@ -1,26 +1,34 @@
---
title: 'Connectors Overview'
sidebarTitle: 'Overview'
description: 'Sync external connections like Google Drive, Notion, OneDrive with supermemory'
description: 'Sync external connections like Google Drive, Notion, OneDrive, and Web Crawler with supermemory'
---
supermemory can sync external connections like Google Drive, Notion, OneDrive with more coming soon.
supermemory can sync external connections like Google Drive, Notion, OneDrive, and Web Crawler.
### The Flow
For OAuth-based connectors (Notion, Google Drive, OneDrive):
1. Make a `POST` request to `/v3/connections/{provider}`
2. supermemory will return an `authLink` which you can redirect the user to
3. The user will be redirected to the provider's login page
4. User is redirected back to your app's `redirectUrl`
For Web Crawler:
1. Make a `POST` request to `/v3/connections/web-crawler` with `startUrl` in metadata
2. Connection is established immediately (no OAuth required)
3. Crawling begins automatically
![Connectors Flow](/images/connectors-flow.png)
## Sync frequency
supermemory syncs documents:
- **A document is modified or created (Webhook recieved)**
- **A document is modified or created (Webhook received)**
- Note that not all providers currently support webhook-based (instant) sync
- `Google-Drive` and `Notion` documents are synced instantaneously
- Every **four hours**
- `Web-Crawler` uses scheduled recrawling instead of webhooks
- Every **four hours** (for OAuth-based connectors)
- **Scheduled recrawling** (for Web Crawler: sites are recrawled if they have not been synced in 7+ days)
- On **Manual Sync** (API call)
- You can call `/v3/connections/{provider}/sync` to sync documents manually