feat (docs): web crawler connector (#593)

2026-05-19 07:42:43 +00:00 · 2025-11-24 17:12:25 -08:00
commit 28bdcc1858
parent 9bb4c204f0
5 changed files with 437 additions and 13 deletions
--- a/apps/docs/connectors/overview.mdx
+++ b/apps/docs/connectors/overview.mdx
@ -1,14 +1,14 @@
 ---
 title: "Connectors Overview"
-description: "Integrate Google Drive, Notion, and OneDrive to automatically sync documents into your knowledge base"
+description: "Integrate Google Drive, Notion, OneDrive, and Web Crawler to automatically sync documents into your knowledge base"
 sidebarTitle: "Overview"
 ---

-Connect external platforms to automatically sync documents into Supermemory. Supported connectors include Google Drive, Notion, and OneDrive with real-time synchronization and intelligent content processing.
+Connect external platforms to automatically sync documents into Supermemory. Supported connectors include Google Drive, Notion, OneDrive, and Web Crawler, with real-time or scheduled synchronization and intelligent content processing.

 ## Supported Connectors

-<CardGroup cols={3}>
+<CardGroup cols={2}>
  <Card title="Google Drive" icon="google-drive" href="/connectors/google-drive">
    **Google Docs, Slides, Sheets**

@ -26,6 +26,12 @@ Connect external platforms to automatically sync documents into Supermemory. Sup

    Scheduled sync every 4 hours. Supports personal and business accounts with file versioning.
  </Card>
+
+  <Card title="Web Crawler" icon="globe" href="/connectors/web-crawler">
+    **Web Pages, Documentation**
+
+    Crawl websites automatically with robots.txt compliance. Scheduled recrawling keeps content up to date.
+  </Card>
 </CardGroup>

 ## Quick Start
@ -181,10 +187,10 @@ curl -X POST "https://api.supermemory.ai/v3/documents/list" \

 ### Authentication Flow

-1. **Create Connection**: Call `/v3/connections/{provider}` to get OAuth URL
-2. **User Authorization**: Redirect user to complete OAuth flow
+1. **Create Connection**: Call `/v3/connections/{provider}` to get OAuth URL (or direct connection for web-crawler)
+2. **User Authorization**: Redirect user to complete OAuth flow (not required for web-crawler)
 3. **Automatic Setup**: Connection established, sync begins immediately
-4. **Continuous Sync**: Real-time updates via webhooks + scheduled sync every 4 hours
+4. **Continuous Sync**: Real-time updates via webhooks + scheduled sync every 4 hours (or scheduled recrawling for web-crawler)

 ### Document Processing Pipeline

@ -206,6 +212,7 @@ graph TD
 | **Google Drive** | ✅ Webhooks (7-day expiry) | ✅ Every 4 hours | ✅ On-demand |
 | **Notion** | ✅ Webhooks | ✅ Every 4 hours | ✅ On-demand |
 | **OneDrive** | ✅ Webhooks (30-day expiry) | ✅ Every 4 hours | ✅ On-demand |
+| **Web Crawler** | ❌ Not supported | ✅ Scheduled recrawling (7+ days) | ✅ On-demand |


 ## Connection Management
--- a/apps/docs/connectors/web-crawler.mdx
+++ b/apps/docs/connectors/web-crawler.mdx
@ -0,0 +1,396 @@
+---
+title: "Web Crawler Connector"
+description: "Crawl and sync websites automatically with scheduled recrawling and robots.txt compliance"
+icon: "globe"
+---
+
+Connect websites to automatically crawl and sync web pages into your Supermemory knowledge base. The web crawler respects robots.txt rules, includes SSRF protection, and automatically recrawls sites on a schedule.
+
+<Warning>
+The web crawler connector requires a **Scale Plan** or **Enterprise Plan**.
+</Warning>
+
+## Quick Setup
+
+### 1. Create Web Crawler Connection
+
+<Tabs>
+  <Tab title="TypeScript">
+    ```typescript
+    import Supermemory from 'supermemory';
+
+    const client = new Supermemory({
+      apiKey: process.env.SUPERMEMORY_API_KEY!
+    });
+
+    const connection = await client.connections.create('web-crawler', {
+      redirectUrl: 'https://yourapp.com/callback',
+      containerTags: ['user-123', 'website-sync'],
+      documentLimit: 5000,
+      metadata: {
+        startUrl: 'https://docs.example.com'
+      }
+    });
+
+    // Web crawler doesn't require OAuth - connection is ready immediately
+    console.log('Connection ID:', connection.id);
+    console.log('Connection created:', connection.createdAt);
+    // Note: connection.authLink is null for web-crawler
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    from supermemory import Supermemory
+    import os
+
+    client = Supermemory(api_key=os.environ.get("SUPERMEMORY_API_KEY"))
+
+    connection = client.connections.create(
+        'web-crawler',
+        redirect_url='https://yourapp.com/callback',
+        container_tags=['user-123', 'website-sync'],
+        document_limit=5000,
+        metadata={
+            'startUrl': 'https://docs.example.com'
+        }
+    )
+
+    # Web crawler doesn't require OAuth - connection is ready immediately
+    print(f'Connection ID: {connection.id}')
+    print(f'Connection created: {connection.created_at}')
+    # Note: connection.auth_link is None for web-crawler
+    ```
+  </Tab>
+  <Tab title="cURL">
+    ```bash
+    curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d '{
+        "redirectUrl": "https://yourapp.com/callback",
+        "containerTags": ["user-123", "website-sync"],
+        "documentLimit": 5000,
+        "metadata": {
+          "startUrl": "https://docs.example.com"
+        }
+      }'
+
+    # Response: {
+    #   "id": "conn_wc123",
+    #   "redirectsTo": "https://yourapp.com/callback",
+    #   "authLink": null,
+    #   "expiresIn": null
+    # }
+    ```
+  </Tab>
+</Tabs>
+
+### 2. Connection Established
+
+Unlike other connectors, the web crawler doesn't require OAuth authentication. The connection is established immediately upon creation, and crawling begins automatically.
+
+### 3. Monitor Sync Progress
+
+<Tabs>
+  <Tab title="TypeScript">
+    ```typescript
+    // Check connection details
+    const connection = await client.connections.getByTags('web-crawler', {
+      containerTags: ['user-123', 'website-sync']
+    });
+
+    console.log('Start URL:', connection.metadata?.startUrl);
+    console.log('Connection created:', connection.createdAt);
+
+    // List synced web pages
+    const documents = await client.connections.listDocuments('web-crawler', {
+      containerTags: ['user-123', 'website-sync']
+    });
+
+    console.log(`Synced ${documents.length} web pages`);
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    # Check connection details
+    connection = client.connections.get_by_tags(
+        'web-crawler',
+        container_tags=['user-123', 'website-sync']
+    )
+
+    print(f'Start URL: {connection.metadata.get("startUrl")}')
+    print(f'Connection created: {connection.created_at}')
+
+    # List synced web pages
+    documents = client.connections.list_documents(
+        'web-crawler',
+        container_tags=['user-123', 'website-sync']
+    )
+
+    print(f'Synced {len(documents)} web pages')
+    ```
+  </Tab>
+  <Tab title="cURL">
+    ```bash
+    # Get connection details by provider and tags
+    curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/connection" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d '{"containerTags": ["user-123", "website-sync"]}'
+
+    # Response includes connection details:
+    # {
+    #   "id": "conn_wc123",
+    #   "provider": "web-crawler",
+    #   "createdAt": "2024-01-15T10:00:00Z",
+    #   "documentLimit": 5000,
+    #   "metadata": {"startUrl": "https://docs.example.com", ...}
+    # }
+
+    # List synced documents
+    curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/documents" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d '{"containerTags": ["user-123", "website-sync"]}'
+
+    # Response: Array of document objects
+    # [
+    #   {"title": "Home Page", "type": "webpage", "status": "done", "url": "https://docs.example.com"},
+    #   {"title": "Getting Started", "type": "webpage", "status": "done", "url": "https://docs.example.com/getting-started"}
+    # ]
+    ```
+  </Tab>
+</Tabs>
+
+## Supported Content Types
+
+### Web Pages
+- **HTML content** extracted and converted to markdown
+- **Same-domain crawling** only (respects hostname boundaries)
+- **Robots.txt compliance** - respects disallow rules
+- **Content filtering** - only HTML pages (skips non-HTML content)
+
+### URL Requirements
+
+The web crawler only processes valid public URLs:
+- Must be a public URL (not localhost, private IPs, or internal domains)
+- Must be accessible from the internet
+- Must return HTML content (non-HTML files are skipped)
+
+## Sync Mechanism
+
+The web crawler uses **scheduled recrawling** rather than real-time webhooks:
+
+- **Initial Crawl**: Begins immediately after connection creation
+- **Scheduled Recrawling**: Automatically recrawls sites that haven't been synced in 7+ days
+- **No Real-time Updates**: Unlike other connectors, the web crawler doesn't support webhook-based real-time sync
+
+<Note>
+The recrawl schedule is automatically assigned when the connection is created. Sites are recrawled periodically to keep content up to date, but updates are not instantaneous.
+</Note>
+
+## Connection Management
+
+### List All Connections
+
+<Tabs>
+  <Tab title="TypeScript">
+    ```typescript
+    // List all web crawler connections
+    const connections = await client.connections.list({
+      containerTags: ['user-123']
+    });
+
+    const webCrawlerConnections = connections.filter(
+      conn => conn.provider === 'web-crawler'
+    );
+
+    webCrawlerConnections.forEach(conn => {
+      console.log(`Start URL: ${conn.metadata?.startUrl}`);
+      console.log(`Connection ID: ${conn.id}`);
+      console.log(`Created: ${conn.createdAt}`);
+    });
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    # List all web crawler connections
+    connections = client.connections.list(container_tags=['user-123'])
+
+    web_crawler_connections = [
+        conn for conn in connections if conn.provider == 'web-crawler'
+    ]
+
+    for conn in web_crawler_connections:
+        print(f'Start URL: {conn.metadata.get("startUrl")}')
+        print(f'Connection ID: {conn.id}')
+        print(f'Created: {conn.created_at}')
+    ```
+  </Tab>
+  <Tab title="cURL">
+    ```bash
+    # List all connections
+    curl -X POST "https://api.supermemory.ai/v3/connections/list" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d '{"containerTags": ["user-123"]}'
+
+    # Response: [
+    #   {
+    #     "id": "conn_wc123",
+    #     "provider": "web-crawler",
+    #     "createdAt": "2024-01-15T10:30:00.000Z",
+    #     "documentLimit": 5000,
+    #     "metadata": {"startUrl": "https://docs.example.com", ...}
+    #   }
+    # ]
+    ```
+  </Tab>
+</Tabs>
+
+### Delete Connection
+
+Remove a web crawler connection when no longer needed:
+
+<Tabs>
+  <Tab title="TypeScript">
+    ```typescript
+    // Delete by connection ID
+    const result = await client.connections.delete('connection_id_123');
+    console.log('Deleted connection:', result.id);
+
+    // Delete by provider and container tags
+    const providerResult = await client.connections.deleteByProvider('web-crawler', {
+      containerTags: ['user-123']
+    });
+    console.log('Deleted web crawler connection for user');
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    # Delete by connection ID
+    result = client.connections.delete('connection_id_123')
+    print(f'Deleted connection: {result.id}')
+
+    # Delete by provider and container tags
+    provider_result = client.connections.delete_by_provider(
+        'web-crawler',
+        container_tags=['user-123']
+    )
+    print('Deleted web crawler connection for user')
+    ```
+  </Tab>
+  <Tab title="cURL">
+    ```bash
+    # Delete by connection ID
+    curl -X DELETE "https://api.supermemory.ai/v3/connections/connection_id_123" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY"
+
+    # Delete by provider and container tags
+    curl -X DELETE "https://api.supermemory.ai/v3/connections/web-crawler" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d '{"containerTags": ["user-123"]}'
+    ```
+  </Tab>
+</Tabs>
+
+<Note>
+Deleting a connection will:
+- Stop all future crawls from the website
+- Keep existing synced documents in Supermemory (they won't be deleted)
+- Remove the connection configuration
+</Note>
+
+## Advanced Configuration
+
+### Content Filtering
+
+Control which web pages get synced using the settings API:
+
+<Tabs>
+  <Tab title="TypeScript">
+    ```typescript
+    // Configure intelligent filtering for web content
+    await client.settings.update({
+      shouldLLMFilter: true,
+      includeItems: {
+        urlPatterns: ['*docs*', '*documentation*', '*guide*'],
+        titlePatterns: ['*Getting Started*', '*API Reference*', '*Tutorial*']
+      },
+      excludeItems: {
+        urlPatterns: ['*admin*', '*private*', '*test*'],
+        titlePatterns: ['*Draft*', '*Archive*', '*Old*']
+      },
+      filterPrompt: "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
+    });
+    ```
+  </Tab>
+  <Tab title="Python">
+    ```python
+    # Configure intelligent filtering for web content
+    client.settings.update(
+        should_llm_filter=True,
+        include_items={
+            'urlPatterns': ['*docs*', '*documentation*', '*guide*'],
+            'titlePatterns': ['*Getting Started*', '*API Reference*', '*Tutorial*']
+        },
+        exclude_items={
+            'urlPatterns': ['*admin*', '*private*', '*test*'],
+            'titlePatterns': ['*Draft*', '*Archive*', '*Old*']
+        },
+        filter_prompt="Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
+    )
+    ```
+  </Tab>
+  <Tab title="cURL">
+    ```bash
+    # Configure intelligent filtering for web content
+    curl -X PATCH "https://api.supermemory.ai/v3/settings" \
+      -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d '{
+        "shouldLLMFilter": true,
+        "includeItems": {
+          "urlPatterns": ["*docs*", "*documentation*", "*guide*"],
+          "titlePatterns": ["*Getting Started*", "*API Reference*", "*Tutorial*"]
+        },
+        "excludeItems": {
+          "urlPatterns": ["*admin*", "*private*", "*test*"],
+          "titlePatterns": ["*Draft*", "*Archive*", "*Old*"]
+        },
+        "filterPrompt": "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
+      }'
+    ```
+  </Tab>
+</Tabs>
+
+## Security & Compliance
+
+### SSRF Protection
+
+Built-in protection against Server-Side Request Forgery (SSRF) attacks:
+- Blocks private IP addresses (10.x.x.x, 192.168.x.x, 172.16-31.x.x)
+- Blocks localhost and internal domains
+- Blocks cloud metadata endpoints
+- Only allows public, internet-accessible URLs
+
+### URL Validation
+
+All URLs are validated before crawling:
+- Must be valid HTTP/HTTPS URLs
+- Must be publicly accessible
+- Must return HTML content
+- Response size limited to 10MB
+
+
+<Warning>
+**Important Limitations:**
+- Requires Scale Plan or Enterprise Plan
+- Only crawls same-domain URLs
+- Scheduled recrawling means updates are not real-time
+- Large websites may take significant time to crawl initially
+- Robots.txt restrictions may prevent crawling some pages
+- URLs must be publicly accessible (no authentication required)
+</Warning>
+
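The URL rules documented in `web-crawler.mdx` above (HTTP/HTTPS only, no localhost, private IPs, or cloud metadata endpoints, same-domain crawling, robots.txt disallow rules) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Supermemory's implementation: the function names, the `INTERNAL_SUFFIXES` list, and the choice to skip DNS resolution are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumed suffixes for "internal domains"; the docs do not enumerate them.
INTERNAL_SUFFIXES = (".local", ".internal", ".localhost")


def is_public_http_url(url: str) -> bool:
    """Return True if the URL is HTTP(S) and its host is not obviously private."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real crawler would also resolve DNS and
        # re-check the resolved addresses; that step is omitted here.
        return True
    # is_global is False for private, loopback, and link-local ranges,
    # which covers 10.x, 172.16-31.x, 192.168.x, and 169.254.169.254.
    return ip.is_global


def should_crawl(url: str, start_url: str, robots_txt: str) -> bool:
    """Combine the public-URL, same-host, and robots.txt checks."""
    if not is_public_http_url(url):
        return False
    if urlparse(url).hostname != urlparse(start_url).hostname:
        return False  # same-domain crawling only
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots.can_fetch("*", url)


ROBOTS = "User-agent: *\nDisallow: /admin"
print(should_crawl("https://docs.example.com/guide", "https://docs.example.com", ROBOTS))
print(should_crawl("https://docs.example.com/admin/users", "https://docs.example.com", ROBOTS))
```

Checking only the literal host keeps the sketch offline and deterministic. A production check would also need to validate the addresses a hostname resolves to, since a public-looking name can point at a private IP.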
--- a/apps/docs/docs.json
+++ b/apps/docs/docs.json
@ -138,6 +138,7 @@
 											"connectors/notion",
 											"connectors/google-drive",
 											"connectors/onedrive",
+											"connectors/web-crawler",
 											"connectors/troubleshooting"
 										]
 									},
--- a/apps/docs/memory-api/connectors/creating-connection.mdx
+++ b/apps/docs/memory-api/connectors/creating-connection.mdx
@ -13,9 +13,15 @@ const client = new Supermemory({
  apiKey: process.env['SUPERMEMORY_API_KEY'], // This is the default and can be omitted
 });

+// For OAuth providers (notion, google-drive, onedrive)
 const connection = await client.connections.create('notion');
-
 console.debug(connection.authLink);
+
+// For web-crawler (no OAuth required)
+const webCrawlerConnection = await client.connections.create('web-crawler', {
+  metadata: { startUrl: 'https://docs.example.com' }
+});
+console.debug(webCrawlerConnection.id); // authLink will be null
 ```

 ```python Python
@ -57,12 +63,14 @@ curl --request POST \

 ### Parameters

- `provider`: The provider to connect to. Currently supported providers are `notion`, `google-drive`, `one-drive`
+- `provider`: The provider to connect to. Currently supported providers are `notion`, `google-drive`, `onedrive`, `web-crawler`
 - `redirectUrl`: The URL to redirect to after the connection is created (your app URL)
+    - Note: For `web-crawler`, this is optional as no OAuth flow is required
 - `containerTags`: Optional. For partitioning users, organizations, etc. in your app.
    - Example: `["user_123", "project_alpha"]`
 - `metadata`: Optional. Any metadata you want to associate with the connection.
    - This metadata is added to every document synced from this connection.
+    - For `web-crawler`, the metadata must include `startUrl`: `{"startUrl": "https://example.com"}`
 - `documentLimit`: Optional. The maximum number of documents to sync from this connection.
    - Default: 10,000
    - This can be used to limit costs and sync a set number of documents for a specific user.
@ -80,6 +88,10 @@ supermemory sends a response with the following schema:
 }
 ```

-You can use the `authLink` to redirect the user to the provider's login page.
+For most providers (notion, google-drive, onedrive), you can use the `authLink` to redirect the user to the provider's login page.
+
+<Note>
+**Web Crawler Exception:** For the `web-crawler` provider, `authLink` and `expiresIn` are `null` because no OAuth flow is required. The connection is established immediately upon creation.
+</Note>

 Next up, managing connections. 
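The parameter rules in `creating-connection.mdx` above (a required `startUrl` in `metadata` for `web-crawler`, and a `null` `authLink` in its response) can be checked on the client before any request is sent. The sketch below is only an illustration under that reading of the docs: `build_connection_request` and `needs_user_redirect` are hypothetical helpers, not part of the Supermemory SDK, and no network call is made.

```python
from typing import Any, Dict, List, Optional, Tuple

SUPPORTED_PROVIDERS = {"notion", "google-drive", "onedrive", "web-crawler"}


def build_connection_request(
    provider: str,
    redirect_url: Optional[str] = None,
    container_tags: Optional[List[str]] = None,
    metadata: Optional[Dict[str, Any]] = None,
    document_limit: Optional[int] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return the (path, JSON body) pair for POST /v3/connections/{provider}."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider}")
    if provider == "web-crawler" and not (metadata or {}).get("startUrl"):
        raise ValueError("web-crawler requires metadata.startUrl")

    # Only send the fields the caller actually provided.
    body: Dict[str, Any] = {}
    if redirect_url is not None:
        body["redirectUrl"] = redirect_url
    if container_tags is not None:
        body["containerTags"] = container_tags
    if metadata is not None:
        body["metadata"] = metadata
    if document_limit is not None:
        body["documentLimit"] = document_limit
    return f"/v3/connections/{provider}", body


def needs_user_redirect(response: Dict[str, Any]) -> bool:
    """True when the response carries an authLink the user must visit."""
    return response.get("authLink") is not None


path, body = build_connection_request(
    "web-crawler", metadata={"startUrl": "https://docs.example.com"}
)
print(path, body)
```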
--- a/apps/docs/memory-api/connectors/overview.mdx
+++ b/apps/docs/memory-api/connectors/overview.mdx
@ -1,26 +1,34 @@
 ---
 title: 'Connectors Overview'
 sidebarTitle: 'Overview'
-description: 'Sync external connections like Google Drive, Notion, OneDrive with supermemory'
+description: 'Sync external connections like Google Drive, Notion, OneDrive, and Web Crawler with supermemory'
 ---

-supermemory can sync external connections like Google Drive, Notion, OneDrive with more coming soon.
+supermemory can sync external connections like Google Drive, Notion, OneDrive, and Web Crawler.

 ### The Flow

+For OAuth-based connectors (Notion, Google Drive, OneDrive):
 1. Make a `POST` request to `/v3/connections/{provider}`
 2. supermemory will return an `authLink` which you can redirect the user to
 3. The user will be redirected to the provider's login page
 4. User is redirected back to your app's `redirectUrl`

+For Web Crawler:
+1. Make a `POST` request to `/v3/connections/web-crawler` with `startUrl` in metadata
+2. Connection is established immediately (no OAuth required)
+3. Crawling begins automatically
+
 ![Connectors Flow](/images/connectors-flow.png)

 ## Sync frequency

 supermemory syncs documents:
- **A document is modified or created (Webhook recieved)**
+- **A document is modified or created (Webhook received)**
    - Note that not all providers are synced via webhook (Instant sync right now)
    - `Google-Drive` and `Notion` documents are synced instantaneously
- Every **four hours**
+    - `Web-Crawler` uses scheduled recrawling instead of webhooks
+- Every **four hours** (for OAuth-based connectors)
+- **Scheduled recrawling** (for Web Crawler: sites are recrawled if not synced in 7+ days)
 - On **Manual Sync** (API call)
    - You can call `/v3/connections/{provider}/sync` to sync documents manually
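The two scheduled intervals listed under "Sync frequency" (four hours for OAuth-based connectors, 7+ days for Web Crawler) amount to a per-provider threshold on the time since the last sync. The following is a hedged sketch of that rule only: `SCHEDULED_INTERVALS` and `is_scheduled_sync_due` are illustrative names, and how supermemory's scheduler actually tracks sync times is not described in the docs.

```python
from datetime import datetime, timedelta, timezone

# Intervals taken from the docs; the mapping itself is an assumption.
SCHEDULED_INTERVALS = {
    "notion": timedelta(hours=4),
    "google-drive": timedelta(hours=4),
    "onedrive": timedelta(hours=4),
    "web-crawler": timedelta(days=7),
}


def is_scheduled_sync_due(provider: str, last_synced_at: datetime, now: datetime) -> bool:
    """Return True once the provider's scheduled interval has elapsed."""
    return now - last_synced_at >= SCHEDULED_INTERVALS[provider]


now = datetime(2025, 11, 24, 12, 0, tzinfo=timezone.utc)
five_hours_ago = now - timedelta(hours=5)
print(is_scheduled_sync_due("notion", five_hours_ago, now))
print(is_scheduled_sync_due("web-crawler", five_hours_ago, now))
```

Webhook-triggered and manual syncs are independent of this check, so a connection can sync earlier than its scheduled interval.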