- Deleted the document_processing module and its associated docling_service.
- Updated imports in documents_routes.py and background_tasks.py to reflect the new service structure.
- Ensured compatibility with the task logging system by adjusting type hints for log entries.
- Integrated Docling ETL service with new task logging system
- Maintained consistent logging pattern across all ETL services
- Added progress and success/failure logging for Docling processing
- Added DOCLING as a third ETL_SERVICE option (alongside UNSTRUCTURED/LLAMACLOUD)
- Implemented add_received_file_document_using_docling function
- Added Docling processing logic in documents_routes.py
- Enhanced chunking with configurable overlap support
- Added comprehensive document processing service
- Supports both CPU and GPU processing with user selection
Addresses #161 - Add Docling Support as an ETL_SERVICE
Follows same pattern as LlamaCloud integration (PR #123)
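
For reference, chunking with configurable overlap can be sketched as below. This is a minimal illustration, not the code from this change: the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters.

    Hypothetical helper for illustration; the real service may chunk by tokens
    or by document structure instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same input.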
- Previously, the whole message (with all annotations included) was streamed for every chunk, producing extremely large payloads.
- Fixed to stream only the new chunk.
- Updated the ANSWER part to be streamed as message content (following Vercel's Stream Protocol).
- Fixed yield typo
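
The streaming fix above (send only the new chunk rather than the accumulated message) might look roughly like this. The function names are hypothetical, and the `0:<JSON string>\n` text-part framing is an assumption about the stream protocol, not something confirmed by this change.

```python
import json
from typing import Iterable, Iterator


def format_text_part(delta: str) -> str:
    # Assumed framing for a text part: 0:<JSON-encoded string> plus a newline.
    return f"0:{json.dumps(delta)}\n"


def stream_answer(chunks: Iterable[str]) -> Iterator[str]:
    """Yield one frame per new chunk instead of re-sending everything so far."""
    for delta in chunks:
        if delta:  # skip empty deltas so no empty frames are emitted
            yield format_text_part(delta)
```

With this shape, the total bytes sent grow linearly with the answer length, whereas re-sending the full message on every chunk grows quadratically.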
- Integrated TaskLoggingService to log the start, progress, success, and failure of podcast generation tasks.
- Updated user ID handling to ensure it is consistently converted to a string across various tasks.
- Modified frontend success message to direct users to the logs tab for status updates on podcast generation.
- Added TaskLoggingService to log the start, progress, success, and failure of indexing tasks for Slack, Notion, GitHub, Linear, and Discord connectors.
- Updated frontend to reflect changes in indexing status messages.
- Updated import paths for LLM, connector, query, and streaming services to reflect their new location in the 'services' module.
- Removed obsolete utility service files that have been migrated.
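
The start/progress/success/failure logging pattern described in the entries above can be sketched as follows. `InMemoryTaskLogger` and `run_logged` are hypothetical stand-ins written for illustration; they are not the actual `TaskLoggingService` API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InMemoryTaskLogger:
    """Hypothetical stand-in for a task logging service; keeps entries in a list."""
    entries: list = field(default_factory=list)

    def log(self, task: str, status: str, message: str, user_id) -> None:
        # The user ID is always stored as a string, mirroring the consistency fix.
        self.entries.append(
            {"task": task, "status": status, "message": message, "user_id": str(user_id)}
        )


def run_logged(logger: InMemoryTaskLogger, task: str, user_id,
               work: Callable[[Callable[[str], None]], object]):
    """Run `work`, logging start, progress (via callback), and success or failure."""
    logger.log(task, "started", "task started", user_id)
    try:
        result = work(lambda msg: logger.log(task, "progress", msg, user_id))
    except Exception as exc:
        logger.log(task, "failed", str(exc), user_id)
        raise  # let the caller see the original error
    logger.log(task, "success", "task finished", user_id)
    return result
```

Routing every task through one wrapper like this is one way to keep the logging pattern identical across podcast generation and the connector indexing tasks.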
Fix: Robust Slack rate limiting, error handling & GitHub org repos
This update delivers comprehensive improvements to Slack connector stability and enhances the GitHub connector.
**Slack Connector (`slack_history.py`, `connectors_indexing_tasks.py`):**
- I've implemented proactive delays (1.2s for `conversations.history`, 3s for `conversations.list` pagination) and `Retry-After` header handling for 429 rate limit errors across `conversations.list`, `conversations.history`, and `users.info` API calls.
- I now gracefully handle `not_in_channel` errors when fetching conversation history by logging a warning and skipping the channel.
- I've refactored channel info fetching: `get_all_channels` now returns richer channel data (including `is_member`, `is_private`).
- I've removed direct calls to `conversations.info` from `connectors_indexing_tasks.py`, using the richer data from `get_all_channels` instead, to prevent associated rate limits.
- I corrected a `SyntaxError` (non-printable character) in `slack_history.py`.
- I've enhanced logging for rate limit actions, delays, and errors.
- I've updated unit tests in `test_slack_history.py` to cover all new logic.
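
As an illustration of the rate-limit handling described above, here is a minimal sketch combining a proactive delay with a `Retry-After` wait. `RateLimitedError` and `call_with_rate_limit` are hypothetical names; in the real connector the retry value would be read from the 429 response headers, and the delay may be applied at a different point in the request loop.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error carrying the Retry-After value (seconds) of a 429 response."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_rate_limit(call: Callable[[], T], delay: float = 1.2,
                         max_retries: int = 3,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Pause `delay` seconds before each attempt; on a 429, wait Retry-After and retry."""
    for attempt in range(max_retries + 1):
        sleep(delay)  # proactive delay to stay under the rate limit
        try:
            return call()
        except RateLimitedError as err:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(err.retry_after)  # honour the server-provided wait
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, which is one way unit tests could cover the delay and retry paths.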
**GitHub Connector (`github_connector.py`):**
- I've modified `get_user_repositories` to fetch all repositories accessible to the authenticated user (owned, collaborated, organization) by changing the API call parameter from `type='owner'` to `type='all'`.
- I've included unit tests in `test_github_connector.py` for this change.