From a0aa29eeb0472bc688595e4718e7acf0c33b198f Mon Sep 17 00:00:00 2001
From: "MSI\\ModSetter"
Date: Mon, 21 Jul 2025 12:14:11 -0700
Subject: [PATCH] chore: updated docs for docling

---
 README.md | 11 ++++++++++-
 .../[search_space_id]/documents/upload/page.tsx | 13 ++++++++++++-
 surfsense_web/content/docs/docker-installation.mdx | 12 +++---------
 surfsense_web/content/docs/index.mdx | 12 +++++++++++-
 surfsense_web/content/docs/manual-installation.mdx | 5 +++--
 5 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index c37d1b8..9a2b6ea 100644
--- a/README.md
+++ b/README.md
@@ -72,28 +72,36 @@ Open source and easy to deploy locally.

 ## 📄 **Supported File Extensions**

-> **Note**: File format support depends on your ETL service configuration. LlamaCloud supports 50+ formats, while Unstructured supports 34+ core formats.
+> **Note**: File format support depends on your ETL service configuration. LlamaCloud supports 50+ formats, Unstructured supports 34+ core formats, and Docling (core formats, local processing, privacy-focused, no API key).

 ### Documents & Text
 **LlamaCloud**: `.pdf`, `.doc`, `.docx`, `.docm`, `.dot`, `.dotm`, `.rtf`, `.txt`, `.xml`, `.epub`, `.odt`, `.wpd`, `.pages`, `.key`, `.numbers`, `.602`, `.abw`, `.cgm`, `.cwk`, `.hwp`, `.lwp`, `.mw`, `.mcw`, `.pbd`, `.sda`, `.sdd`, `.sdp`, `.sdw`, `.sgl`, `.sti`, `.sxi`, `.sxw`, `.stw`, `.sxg`, `.uof`, `.uop`, `.uot`, `.vor`, `.wps`, `.zabw`

 **Unstructured**: `.doc`, `.docx`, `.odt`, `.rtf`, `.pdf`, `.xml`, `.txt`, `.md`, `.markdown`, `.rst`, `.html`, `.org`, `.epub`

+**Docling**: `.pdf`, `.docx`, `.html`, `.htm`, `.xhtml`, `.adoc`, `.asciidoc`
+
 ### Presentations
 **LlamaCloud**: `.ppt`, `.pptx`, `.pptm`, `.pot`, `.potm`, `.potx`, `.odp`, `.key`

 **Unstructured**: `.ppt`, `.pptx`

+**Docling**: `.pptx`
+
 ### Spreadsheets & Data
 **LlamaCloud**: `.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlw`, `.csv`, `.tsv`, `.ods`, `.fods`, `.numbers`, `.dbf`, `.123`, `.dif`, `.sylk`, `.slk`, `.prn`, `.et`, `.uos1`, `.uos2`, `.wk1`, `.wk2`, `.wk3`, `.wk4`, `.wks`, `.wq1`, `.wq2`, `.wb1`, `.wb2`, `.wb3`, `.qpw`, `.xlr`, `.eth`

 **Unstructured**: `.xls`, `.xlsx`, `.csv`, `.tsv`

+**Docling**: `.xlsx`, `.csv`
+
 ### Images
 **LlamaCloud**: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.svg`, `.tiff`, `.webp`, `.html`, `.htm`, `.web`

 **Unstructured**: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.heic`

+**Docling**: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.tif`, `.webp`
+
 ### Audio & Video *(Always Supported)*
 `.mp3`, `.mpga`, `.m4a`, `.wav`, `.mp4`, `.mpeg`, `.webm`

@@ -142,6 +150,7 @@ Before installation, make sure to complete the [prerequisite setup steps](https:
 - **File Processing ETL Service** (choose one):
   - Unstructured.io API key (supports 34+ formats)
   - LlamaIndex API key (enhanced parsing, supports 50+ formats)
+  - Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
 - Other required API keys

 ## Screenshots
diff --git a/surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx b/surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx
index 8516f68..302c149 100644
--- a/surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx
+++ b/surfsense_web/app/dashboard/[search_space_id]/documents/upload/page.tsx
@@ -106,8 +106,19 @@ export default function FileUploader() {
       };
     } else if (etlService === 'DOCLING') {
       return {
-        // Docling supported file types (currently only PDF)
+        // Docling supported file types
         'application/pdf': ['.pdf'],
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
+        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
+        'application/vnd.openxmlformats-officedocument.presentationml.presentation': ['.pptx'],
+        'text/asciidoc': ['.adoc', '.asciidoc'],
+        'text/html': ['.html', '.htm', '.xhtml'],
+        'text/csv': ['.csv'],
+        'image/png': ['.png'],
+        'image/jpeg': ['.jpg', '.jpeg'],
+        'image/tiff': ['.tiff', '.tif'],
+        'image/bmp': ['.bmp'],
+        'image/webp': ['.webp'],
         // Audio files (always supported)
         ...audioFileTypes,
       };
diff --git a/surfsense_web/content/docs/docker-installation.mdx b/surfsense_web/content/docs/docker-installation.mdx
index 0a5ba8a..da75ecd 100644
--- a/surfsense_web/content/docs/docker-installation.mdx
+++ b/surfsense_web/content/docs/docker-installation.mdx
@@ -4,14 +4,7 @@ description: Setting up SurfSense using Docker
 full: true
 ---

-## Known Limitations
-⚠️ **Important Note:** Currently, the following features have limited functionality when running in Docker:
-
-- **Ollama integration:** Local Ollama models do not work when running SurfSense in Docker. Please use other LLM providers like OpenAI or Gemini instead.
-- **Web crawler functionality:** The web crawler feature currently doesn't work properly within the Docker environment.
-
-We're actively working to resolve these limitations in future releases.


 # Docker Installation

@@ -28,6 +21,7 @@ Before you begin, ensure you have:
 - **File Processing ETL Service** (choose one):
   - Unstructured.io API key (Supports 34+ formats)
   - LlamaIndex API key (enhanced parsing, supports 50+ formats)
+  - Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
 - Other required API keys

 ## Installation Steps
@@ -97,7 +91,7 @@ Before you begin, ensure you have:
 | STT_SERVICE_API_KEY | API key for the Speech-to-Text service |
 | STT_SERVICE_API_BASE | (Optional) Custom API base URL for the Speech-to-Text service |
 | FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
-| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats) or `LLAMACLOUD` (supports 50+ formats including legacy document types) |
+| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
 | UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
 | LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |

@@ -152,7 +146,7 @@ For more details, see the [Uvicorn documentation](https://www.uvicorn.org/#comma
 | ------------------------------- | ---------------------------------------------------------- |
 | NEXT_PUBLIC_FASTAPI_BACKEND_URL | URL of the backend service (e.g., `http://localhost:8000`) |
 | NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE | Same value as set in backend AUTH_TYPE i.e `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
-| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED` or `LLAMACLOUD` - affects supported file formats in upload interface |
+| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED`, `LLAMACLOUD`, or `DOCLING` - affects supported file formats in upload interface |

 2. **Build and Start Containers**

diff --git a/surfsense_web/content/docs/index.mdx b/surfsense_web/content/docs/index.mdx
index 0b76904..658b603 100644
--- a/surfsense_web/content/docs/index.mdx
+++ b/surfsense_web/content/docs/index.mdx
@@ -67,7 +67,7 @@ To set up Google OAuth:

 ## File Upload's

-SurfSense supports two ETL (Extract, Transform, Load) services for converting files to LLM-friendly formats:
+SurfSense supports three ETL (Extract, Transform, Load) services for converting files to LLM-friendly formats:

 ### Option 1: Unstructured

@@ -85,6 +85,16 @@ Files are converted using [LlamaIndex](https://www.llamaindex.ai/) which offers
 2. Sign up for a LlamaCloud account to access their parsing services
 3. LlamaCloud provides enhanced parsing capabilities for complex documents

+### Option 3: Docling (Recommended for Privacy)
+
+Files are processed locally using [Docling](https://github.com/DS4SD/docling) - IBM's open-source document parsing library.
+
+1. **No API key required** - all processing happens locally
+2. **Privacy-focused** - documents never leave your system
+3. **Supported formats**: PDF, Office documents (Word, Excel, PowerPoint), images (PNG, JPEG, TIFF, BMP, WebP), HTML, CSV, AsciiDoc
+4. **Enhanced features**: Advanced table detection, image extraction, and structured document parsing
+5. **GPU acceleration** support for faster processing (when available)
+
 **Note**: You only need to set up one of these services.

 ---
diff --git a/surfsense_web/content/docs/manual-installation.mdx b/surfsense_web/content/docs/manual-installation.mdx
index 1f58783..6275b98 100644
--- a/surfsense_web/content/docs/manual-installation.mdx
+++ b/surfsense_web/content/docs/manual-installation.mdx
@@ -16,6 +16,7 @@ Before beginning the manual installation, ensure you have completed all the [pre
 - **File Processing ETL Service** (choose one):
   - Unstructured.io API key (Supports 34+ formats)
   - LlamaIndex API key (enhanced parsing, supports 50+ formats)
+  - Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
 - Other required API keys

 ## Backend Setup
@@ -67,7 +68,7 @@ Edit the `.env` file and set the following variables:
 | STT_SERVICE_API_KEY | API key for the Speech-to-Text service |
 | STT_SERVICE_API_BASE | (Optional) Custom API base URL for the Speech-to-Text service |
 | FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
-| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats) or `LLAMACLOUD` (supports 50+ formats including legacy document types) |
+| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
 | UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
 | LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |

@@ -198,7 +199,7 @@ Edit the `.env` file and set:
 | ------------------------------- | ------------------------------------------- |
 | NEXT_PUBLIC_FASTAPI_BACKEND_URL | Backend URL (e.g., `http://localhost:8000`) |
 | NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE | Same value as set in backend AUTH_TYPE i.e `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
-| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED` or `LLAMACLOUD` - affects supported file formats in upload interface |
+| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED`, `LLAMACLOUD`, or `DOCLING` - affects supported file formats in upload interface |

 ### 2. Install Dependencies