Merge pull request #216 from MODSetter/dev

chore: updated docs for docling
Rohan Verma 2025-07-22 00:46:01 +05:30 committed by GitHub
commit 82b402cc31
GPG key ID: B5690EEEBB952194
5 changed files with 39 additions and 14 deletions

@@ -72,28 +72,36 @@ Open source and easy to deploy locally.
 ## 📄 **Supported File Extensions**
-> **Note**: File format support depends on your ETL service configuration. LlamaCloud supports 50+ formats, while Unstructured supports 34+ core formats.
+> **Note**: File format support depends on your ETL service configuration. LlamaCloud supports 50+ formats, Unstructured supports 34+ core formats, and Docling (core formats, local processing, privacy-focused, no API key).
 ### Documents & Text
 **LlamaCloud**: `.pdf`, `.doc`, `.docx`, `.docm`, `.dot`, `.dotm`, `.rtf`, `.txt`, `.xml`, `.epub`, `.odt`, `.wpd`, `.pages`, `.key`, `.numbers`, `.602`, `.abw`, `.cgm`, `.cwk`, `.hwp`, `.lwp`, `.mw`, `.mcw`, `.pbd`, `.sda`, `.sdd`, `.sdp`, `.sdw`, `.sgl`, `.sti`, `.sxi`, `.sxw`, `.stw`, `.sxg`, `.uof`, `.uop`, `.uot`, `.vor`, `.wps`, `.zabw`
 **Unstructured**: `.doc`, `.docx`, `.odt`, `.rtf`, `.pdf`, `.xml`, `.txt`, `.md`, `.markdown`, `.rst`, `.html`, `.org`, `.epub`
+**Docling**: `.pdf`, `.docx`, `.html`, `.htm`, `.xhtml`, `.adoc`, `.asciidoc`
 ### Presentations
 **LlamaCloud**: `.ppt`, `.pptx`, `.pptm`, `.pot`, `.potm`, `.potx`, `.odp`, `.key`
 **Unstructured**: `.ppt`, `.pptx`
+**Docling**: `.pptx`
 ### Spreadsheets & Data
 **LlamaCloud**: `.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlw`, `.csv`, `.tsv`, `.ods`, `.fods`, `.numbers`, `.dbf`, `.123`, `.dif`, `.sylk`, `.slk`, `.prn`, `.et`, `.uos1`, `.uos2`, `.wk1`, `.wk2`, `.wk3`, `.wk4`, `.wks`, `.wq1`, `.wq2`, `.wb1`, `.wb2`, `.wb3`, `.qpw`, `.xlr`, `.eth`
 **Unstructured**: `.xls`, `.xlsx`, `.csv`, `.tsv`
+**Docling**: `.xlsx`, `.csv`
 ### Images
 **LlamaCloud**: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.svg`, `.tiff`, `.webp`, `.html`, `.htm`, `.web`
 **Unstructured**: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.heic`
+**Docling**: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.tif`, `.webp`
 ### Audio & Video *(Always Supported)*
 `.mp3`, `.mpga`, `.m4a`, `.wav`, `.mp4`, `.mpeg`, `.webm`
@@ -142,6 +150,7 @@ Before installation, make sure to complete the [prerequisite setup steps](https:
 - **File Processing ETL Service** (choose one):
 - Unstructured.io API key (supports 34+ formats)
 - LlamaIndex API key (enhanced parsing, supports 50+ formats)
+- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
 - Other required API keys
 ## Screenshots

@@ -106,8 +106,19 @@ export default function FileUploader() {
 };
 } else if (etlService === 'DOCLING') {
 return {
-// Docling supported file types (currently only PDF)
+// Docling supported file types
 'application/pdf': ['.pdf'],
+'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
+'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
+'application/vnd.openxmlformats-officedocument.presentationml.presentation': ['.pptx'],
+'text/asciidoc': ['.adoc', '.asciidoc'],
+'text/html': ['.html', '.htm', '.xhtml'],
+'text/csv': ['.csv'],
+'image/png': ['.png'],
+'image/jpeg': ['.jpg', '.jpeg'],
+'image/tiff': ['.tiff', '.tif'],
+'image/bmp': ['.bmp'],
+'image/webp': ['.webp'],
 // Audio files (always supported)
 ...audioFileTypes,
 };
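
The accept map in this hunk pairs MIME types with file extensions. As a minimal sketch of how such a map could be used to pre-check a file name before upload, the helper below tests an extension against the Docling entries. `AcceptMap`, `doclingAccept`, and `isAcceptedFile` are hypothetical names chosen for illustration and are not part of the SurfSense code; the audio entries spread in through `audioFileTypes` are left out because their contents are not shown in the diff.

```typescript
// Hypothetical illustration; not taken from the SurfSense repository.
type AcceptMap = Record<string, string[]>;

// Docling entries copied from the hunk above (audio types omitted).
const doclingAccept: AcceptMap = {
  'application/pdf': ['.pdf'],
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
  'application/vnd.openxmlformats-officedocument.presentationml.presentation': ['.pptx'],
  'text/asciidoc': ['.adoc', '.asciidoc'],
  'text/html': ['.html', '.htm', '.xhtml'],
  'text/csv': ['.csv'],
  'image/png': ['.png'],
  'image/jpeg': ['.jpg', '.jpeg'],
  'image/tiff': ['.tiff', '.tif'],
  'image/bmp': ['.bmp'],
  'image/webp': ['.webp'],
};

// Returns true when the file name ends in an extension listed in the map.
// The comparison is case-insensitive, which is an assumption of this sketch.
function isAcceptedFile(fileName: string, accept: AcceptMap): boolean {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return false; // no extension at all
  const ext = fileName.slice(dot).toLowerCase();
  for (const mime of Object.keys(accept)) {
    if (accept[mime].indexOf(ext) !== -1) return true;
  }
  return false;
}
```

With this map, `isAcceptedFile('slides.pptx', doclingAccept)` passes while `isAcceptedFile('notes.md', doclingAccept)` does not, which matches the Docling extension lists in the README hunk.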

@@ -4,14 +4,7 @@ description: Setting up SurfSense using Docker
 full: true
 ---
-## Known Limitations
-⚠️ **Important Note:** Currently, the following features have limited functionality when running in Docker:
-- **Ollama integration:** Local Ollama models do not work when running SurfSense in Docker. Please use other LLM providers like OpenAI or Gemini instead.
-- **Web crawler functionality:** The web crawler feature currently doesn't work properly within the Docker environment.
-We're actively working to resolve these limitations in future releases.
 # Docker Installation
@@ -28,6 +21,7 @@ Before you begin, ensure you have:
 - **File Processing ETL Service** (choose one):
 - Unstructured.io API key (Supports 34+ formats)
 - LlamaIndex API key (enhanced parsing, supports 50+ formats)
+- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
 - Other required API keys
 ## Installation Steps
@@ -97,7 +91,7 @@ Before you begin, ensure you have:
 | STT_SERVICE_API_KEY | API key for the Speech-to-Text service |
 | STT_SERVICE_API_BASE | (Optional) Custom API base URL for the Speech-to-Text service |
 | FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
-| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats) or `LLAMACLOUD` (supports 50+ formats including legacy document types) |
+| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
 | UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
 | LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |
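
The rows above imply a simple rule: `UNSTRUCTURED` and `LLAMACLOUD` each need their own API key, while `DOCLING` needs none. The backend code is not part of this diff, so the following is only a sketch of that rule in TypeScript; `requiredKey` and `missingEtlKey` are hypothetical names, and the environment is passed in as a plain object rather than read from a real process.

```typescript
// Hypothetical illustration of the rule described in the table above.
type EtlService = 'UNSTRUCTURED' | 'LLAMACLOUD' | 'DOCLING';

// Variable each service needs; DOCLING runs locally and needs none.
const requiredKey: Record<EtlService, string | null> = {
  UNSTRUCTURED: 'UNSTRUCTURED_API_KEY',
  LLAMACLOUD: 'LLAMA_CLOUD_API_KEY',
  DOCLING: null,
};

// Returns the name of the missing variable, or null when nothing is missing.
function missingEtlKey(
  service: EtlService,
  env: Record<string, string | undefined>,
): string | null {
  const key = requiredKey[service];
  if (key === null) return null;
  const value = env[key];
  // Treat unset and whitespace-only values as missing.
  return value !== undefined && value.trim() !== '' ? null : key;
}
```

A startup check could call `missingEtlKey` once and fail early with the returned variable name, instead of failing later on the first upload.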
@@ -152,7 +146,7 @@ For more details, see the [Uvicorn documentation](https://www.uvicorn.org/#comma
 | ------------------------------- | ---------------------------------------------------------- |
 | NEXT_PUBLIC_FASTAPI_BACKEND_URL | URL of the backend service (e.g., `http://localhost:8000`) |
 | NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE | Same value as set in backend AUTH_TYPE i.e `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
-| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED` or `LLAMACLOUD` - affects supported file formats in upload interface |
+| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED`, `LLAMACLOUD`, or `DOCLING` - affects supported file formats in upload interface |
 2. **Build and Start Containers**

@@ -67,7 +67,7 @@ To set up Google OAuth:
 ## File Upload's
-SurfSense supports two ETL (Extract, Transform, Load) services for converting files to LLM-friendly formats:
+SurfSense supports three ETL (Extract, Transform, Load) services for converting files to LLM-friendly formats:
 ### Option 1: Unstructured
@@ -85,6 +85,16 @@ Files are converted using [LlamaIndex](https://www.llamaindex.ai/) which offers
 2. Sign up for a LlamaCloud account to access their parsing services
 3. LlamaCloud provides enhanced parsing capabilities for complex documents
+### Option 3: Docling (Recommended for Privacy)
+Files are processed locally using [Docling](https://github.com/DS4SD/docling) - IBM's open-source document parsing library.
+1. **No API key required** - all processing happens locally
+2. **Privacy-focused** - documents never leave your system
+3. **Supported formats**: PDF, Office documents (Word, Excel, PowerPoint), images (PNG, JPEG, TIFF, BMP, WebP), HTML, CSV, AsciiDoc
+4. **Enhanced features**: Advanced table detection, image extraction, and structured document parsing
+5. **GPU acceleration** support for faster processing (when available)
 **Note**: You only need to set up one of these services.
 ---

@@ -16,6 +16,7 @@ Before beginning the manual installation, ensure you have completed all the [pre
 - **File Processing ETL Service** (choose one):
 - Unstructured.io API key (Supports 34+ formats)
 - LlamaIndex API key (enhanced parsing, supports 50+ formats)
+- Docling (local processing, no API key required, supports PDF, Office docs, images, HTML, CSV)
 - Other required API keys
 ## Backend Setup
@@ -67,7 +68,7 @@ Edit the `.env` file and set the following variables:
 | STT_SERVICE_API_KEY | API key for the Speech-to-Text service |
 | STT_SERVICE_API_BASE | (Optional) Custom API base URL for the Speech-to-Text service |
 | FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
-| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats) or `LLAMACLOUD` (supports 50+ formats including legacy document types) |
+| ETL_SERVICE | Document parsing service: `UNSTRUCTURED` (supports 34+ formats), `LLAMACLOUD` (supports 50+ formats including legacy document types), or `DOCLING` (local processing, supports PDF, Office docs, images, HTML, CSV) |
 | UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing (required if ETL_SERVICE=UNSTRUCTURED) |
 | LLAMA_CLOUD_API_KEY | API key for LlamaCloud service for document parsing (required if ETL_SERVICE=LLAMACLOUD) |
@@ -198,7 +199,7 @@ Edit the `.env` file and set:
 | ------------------------------- | ------------------------------------------- |
 | NEXT_PUBLIC_FASTAPI_BACKEND_URL | Backend URL (e.g., `http://localhost:8000`) |
 | NEXT_PUBLIC_FASTAPI_BACKEND_AUTH_TYPE | Same value as set in backend AUTH_TYPE i.e `GOOGLE` for OAuth with Google, `LOCAL` for email/password authentication |
-| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED` or `LLAMACLOUD` - affects supported file formats in upload interface |
+| NEXT_PUBLIC_ETL_SERVICE | Document parsing service (should match backend ETL_SERVICE): `UNSTRUCTURED`, `LLAMACLOUD`, or `DOCLING` - affects supported file formats in upload interface |
 ### 2. Install Dependencies
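
Because `NEXT_PUBLIC_ETL_SERVICE` now has three valid values and the uploader compares it with exact strings such as `'DOCLING'`, a typo in `.env` would silently fall through to another branch. One way to guard against that, sketched here under the assumption that the value arrives as a raw string, is a small parser that rejects anything outside the documented set. `ETL_SERVICES` and `parseEtlService` are hypothetical names, not part of the SurfSense code.

```typescript
// Hypothetical illustration; the three values come from the table above.
type EtlService = 'UNSTRUCTURED' | 'LLAMACLOUD' | 'DOCLING';

const ETL_SERVICES: EtlService[] = ['UNSTRUCTURED', 'LLAMACLOUD', 'DOCLING'];

// Returns the validated service name, or throws on a missing or unknown value.
// Matching is case-sensitive to mirror the exact comparison in the uploader;
// only surrounding whitespace is ignored.
function parseEtlService(raw: string | undefined): EtlService {
  const value = (raw === undefined ? '' : raw).trim();
  for (const service of ETL_SERVICES) {
    if (service === value) return service;
  }
  throw new Error('Unsupported NEXT_PUBLIC_ETL_SERVICE: "' + value + '"');
}
```

In a Next.js component this might be called as `parseEtlService(process.env.NEXT_PUBLIC_ETL_SERVICE)`, so that a misconfigured frontend fails loudly rather than offering the wrong file types.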