feat: Add Speech-to-Text support.

- Supports audio & video files.
- Useful for YouTube videos that don't have transcripts.
DESKTOP-RTLN3BA\$punk 2025-05-13 21:13:53 -07:00
parent 57987ecc76
commit a8080d2dc7
8 changed files with 172 additions and 73 deletions
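In practice, the new path boils down to a single LiteLLM transcription call. A minimal standalone sketch of the idea (the model string matches the new `STT_SERVICE` default from this commit; the file path is a placeholder):

```python
# Minimal sketch: transcribe a local audio file with LiteLLM's async API,
# mirroring what this commit wires into the file-upload pipeline.
import asyncio
from litellm import atranscription

async def main():
    with open("lecture.mp3", "rb") as audio_file:  # placeholder path
        response = await atranscription(model="openai/whisper-1", file=audio_file)
    print(response.get("text", ""))

asyncio.run(main())
```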


@@ -27,28 +27,27 @@ https://github.com/user-attachments/assets/bf64a6ca-934b-47ac-9e1b-edac5fe972ec
## Key Features
### 1. Latest
### 💡 **Idea**:
Have your own highly customizable private NotebookLM and Perplexity integrated with external sources.
### 📁 **Multiple File Format Uploading Support**
Save content from your own personal files *(documents, images, videos; supports **34 file extensions**)* to your own personal knowledge base.
### 🔍 **Powerful Search**
Quickly research or find anything in your saved content.
### 💬 **Chat with your Saved Content**
Interact in natural language and get cited answers.
### 📄 **Cited Answers**
Get cited answers, just like Perplexity.
### 🔔 **Privacy & Local LLM Support**
Works flawlessly with Ollama local LLMs.
### 🏠 **Self Hostable**
Open source and easy to deploy locally.
### 🎙️ Podcasts
- Blazingly fast podcast generation agent (creates a 3-minute podcast in under 20 seconds).
- Convert your chat conversations into engaging audio content.
- Support for multiple TTS providers (OpenAI, Azure, Google Vertex AI).
### 📊 **Advanced RAG Techniques**
- Supports 150+ LLMs.
- Supports 6000+ embedding models.
- Supports all major rerankers (Pinecone, Cohere, Flashrank, etc.).
@@ -56,7 +55,7 @@ Open source and easy to deploy locally.
- Utilizes Hybrid Search (semantic + full-text search combined with Reciprocal Rank Fusion).
- RAG as a Service API backend.
### **External Sources**
- Search engines (Tavily, LinkUp)
- Slack
- Linear
@@ -65,7 +64,39 @@ Open source and easy to deploy locally.
- GitHub
- and more to come...
### 📄 **Supported File Extensions**

#### Document
`.doc`, `.docx`, `.odt`, `.rtf`, `.pdf`, `.xml`

#### Text & Markup
`.txt`, `.md`, `.markdown`, `.rst`, `.html`, `.org`

#### Spreadsheets & Tables
`.xls`, `.xlsx`, `.csv`, `.tsv`

#### Audio & Video
`.mp3`, `.mpga`, `.m4a`, `.wav`, `.mp4`, `.mpeg`, `.webm`

#### Images
`.jpg`, `.jpeg`, `.png`, `.bmp`, `.tiff`, `.heic`

#### Email & eBooks
`.eml`, `.msg`, `.epub`

#### PowerPoint Presentations & Other
`.ppt`, `.pptx`, `.p7s`

### 🔖 Cross Browser Extension
- The SurfSense extension can be used to save any webpage you like.
- Its main use case is saving webpages protected behind authentication.
@@ -209,16 +240,8 @@ Before installation, make sure to complete the [prerequisite setup steps](https:
## Future Work
- Add More Connectors.
- Patch minor bugs.
- Document Chat **[REIMPLEMENT]**
- Document Podcasts


@@ -18,6 +18,9 @@ LONG_CONTEXT_LLM="gemini/gemini-2.0-flash"
# LiteLLM TTS Provider: https://docs.litellm.ai/docs/text_to_speech#supported-providers
TTS_SERVICE="openai/tts-1"
# LiteLLM STT Provider: https://docs.litellm.ai/docs/audio_transcription#supported-providers
STT_SERVICE="openai/whisper-1"

# Chosen LiteLLM Providers Keys
OPENAI_API_KEY="sk-proj-iA"
GEMINI_API_KEY="AIzaSyB6-1641124124124124124124124124124"
@@ -35,3 +38,5 @@ LANGSMITH_PROJECT="surfsense"
FAST_LLM_API_BASE=""
STRATEGIC_LLM_API_BASE=""
LONG_CONTEXT_LLM_API_BASE=""
TTS_SERVICE_API_BASE=""
STT_SERVICE_API_BASE=""


@@ -135,7 +135,16 @@ async def create_merged_podcast_audio(state: State, config: RunnableConfig) -> D
            filename = f"{temp_dir}/{session_id}_{index}.mp3"

            try:
                # Generate speech using LiteLLM, via the custom API base if one is configured
                if app_config.TTS_SERVICE_API_BASE:
                    response = await aspeech(
                        model=app_config.TTS_SERVICE,
                        api_base=app_config.TTS_SERVICE_API_BASE,
                        voice=voice,
                        input=dialog,
                        max_retries=2,
                        timeout=600,
                    )
                else:
                    response = await aspeech(
                        model=app_config.TTS_SERVICE,
                        voice=voice,
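As a side note, if litellm's `aspeech` treats `api_base=None` the same as omitting the argument (an assumption worth verifying against the LiteLLM docs), the duplicated call above could collapse into a single invocation:

```python
# Minimal sketch, assuming aspeech() accepts api_base=None as "use the
# provider default"; if it does not, keep the explicit if/else from the diff.
response = await aspeech(
    model=app_config.TTS_SERVICE,
    api_base=app_config.TTS_SERVICE_API_BASE or None,
    voice=voice,
    input=dialog,
    max_retries=2,
    timeout=600,
)
```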


@@ -6,7 +6,7 @@ from chonkie import AutoEmbeddings, CodeChunker, RecursiveChunker
from dotenv import load_dotenv
from langchain_community.chat_models import ChatLiteLLM
from rerankers import Reranker
from litellm import speech

# Get the base directory of the project
BASE_DIR = Path(__file__).resolve().parent.parent.parent
@@ -97,6 +97,12 @@ class Config:
    # LiteLLM TTS Configuration
    TTS_SERVICE = os.getenv("TTS_SERVICE")
    TTS_SERVICE_API_BASE = os.getenv("TTS_SERVICE_API_BASE")

    # LiteLLM STT Configuration
    STT_SERVICE = os.getenv("STT_SERVICE")
    STT_SERVICE_API_BASE = os.getenv("STT_SERVICE_API_BASE")

    # Validation Checks
    # Check embedding dimension


@@ -1,3 +1,4 @@
from litellm import atranscription
from fastapi import APIRouter, Depends, BackgroundTasks, UploadFile, Form, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.future import select
@@ -7,6 +8,7 @@ from app.schemas import DocumentsCreate, DocumentUpdate, DocumentRead
from app.users import current_active_user
from app.utils.check_ownership import check_ownership
from app.tasks.background_tasks import add_received_markdown_file_document, add_extension_received_document, add_received_file_document, add_crawled_url_document, add_youtube_video_document
from app.config import config as app_config

# Force asyncio to use standard event loop before unstructured imports
import asyncio
try:
@@ -17,9 +19,9 @@ import os
os.environ["UNSTRUCTURED_HAS_PATCHED_LOOP"] = "1"

router = APIRouter()

@router.post("/documents/")
async def create_documents(
    request: DocumentsCreate,
@@ -69,6 +71,7 @@ async def create_documents(
            detail=f"Failed to process documents: {str(e)}"
        )

@router.post("/documents/fileupload")
async def create_documents(
    files: list[UploadFile],
@@ -151,6 +154,42 @@ async def process_file_in_background(
                markdown_content,
                search_space_id
            )
        # Check if the file is an audio or video file
        elif filename.lower().endswith(('.mp3', '.mp4', '.mpeg', '.mpga', '.m4a', '.wav', '.webm')):
            # Open the audio file for transcription
            with open(file_path, "rb") as audio_file:
                # Use LiteLLM for audio transcription, via the custom API base if one is configured
                if app_config.STT_SERVICE_API_BASE:
                    transcription_response = await atranscription(
                        model=app_config.STT_SERVICE,
                        file=audio_file,
                        api_base=app_config.STT_SERVICE_API_BASE
                    )
                else:
                    transcription_response = await atranscription(
                        model=app_config.STT_SERVICE,
                        file=audio_file
                    )

                # Extract the transcribed text
                transcribed_text = transcription_response.get("text", "")

                # Add metadata about the transcription
                transcribed_text = f"# Transcription of {filename}\n\n{transcribed_text}"

            # Clean up the temp file (best-effort)
            try:
                os.unlink(file_path)
            except:
                pass

            # Process the transcription as a markdown document
            await add_received_markdown_file_document(
                session,
                filename,
                transcribed_text,
                search_space_id
            )
        else:
            # Use synchronous unstructured API to avoid event loop issues
            from langchain_unstructured import UnstructuredLoader
@@ -186,6 +225,7 @@ async def process_file_in_background(
        import logging
        logging.error(f"Error processing file in background: {str(e)}")

@router.get("/documents/", response_model=List[DocumentRead])
async def read_documents(
    skip: int = 0,
@@ -195,7 +235,8 @@ async def read_documents(
    user: User = Depends(current_active_user)
):
    try:
        query = select(Document).join(SearchSpace).filter(
            SearchSpace.user_id == user.id)

        # Filter by search_space_id if provided
        if search_space_id is not None:
@@ -226,6 +267,7 @@ async def read_documents(
            detail=f"Failed to fetch documents: {str(e)}"
        )

@router.get("/documents/{document_id}", response_model=DocumentRead)
async def read_document(
    document_id: int,
@@ -262,6 +304,7 @@ async def read_document(
            detail=f"Failed to fetch document: {str(e)}"
        )

@router.put("/documents/{document_id}", response_model=DocumentRead)
async def update_document(
    document_id: int,
@@ -309,6 +352,7 @@ async def update_document(
            detail=f"Failed to update document: {str(e)}"
        )

@router.delete("/documents/{document_id}", response_model=dict)
async def delete_document(
    document_id: int,
@@ -357,6 +401,7 @@ async def process_extension_document_with_new_session(
        import logging
        logging.error(f"Error processing extension document: {str(e)}")

async def process_crawled_url_with_new_session(
    url: str,
    search_space_id: int
@@ -371,6 +416,7 @@ async def process_crawled_url_with_new_session(
        import logging
        logging.error(f"Error processing crawled URL: {str(e)}")

async def process_file_in_background_with_new_session(
    file_path: str,
    filename: str,
@@ -382,6 +428,7 @@ async def process_file_in_background_with_new_session(
    async with async_session_maker() as session:
        await process_file_in_background(file_path, filename, search_space_id, session)

async def process_youtube_video_with_new_session(
    url: str,
    search_space_id: int
@@ -395,4 +442,3 @@ async def process_youtube_video_with_new_session(
    except Exception as e:
        import logging
        logging.error(f"Error processing YouTube video: {str(e)}")


@@ -53,7 +53,7 @@ export default function FileUploader() {
    'text/html': ['.html'],
    'image/jpeg': ['.jpeg', '.jpg'],
    'image/png': ['.png'],
    'text/markdown': ['.md', '.markdown'],
    'application/vnd.ms-outlook': ['.msg'],
    'application/vnd.oasis.opendocument.text': ['.odt'],
    'text/x-org': ['.org'],
@@ -69,6 +69,10 @@ export default function FileUploader() {
    'application/vnd.ms-excel': ['.xls'],
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
    'application/xml': ['.xml'],
    'audio/mpeg': ['.mp3', '.mpeg', '.mpga'],
    'audio/mp4': ['.mp4', '.m4a'],
    'audio/wav': ['.wav'],
    'audio/webm': ['.webm'],
  }

  const supportedExtensions = Array.from(new Set(Object.values(acceptedFileTypes).flat())).sort()


@@ -94,6 +94,7 @@ Before you begin, ensure you have:
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service for document parsing |
| FIRECRAWL_API_KEY | API key for Firecrawl service for web crawling |
| TTS_SERVICE | Text-to-Speech API provider for Podcasts (e.g., `openai/tts-1`, `azure/neural`, `vertex_ai/`). See [supported providers](https://docs.litellm.ai/docs/text_to_speech#supported-providers) |
| STT_SERVICE | Speech-to-Text API provider for transcribing uploaded audio and video files (e.g., `openai/whisper-1`). See [supported providers](https://docs.litellm.ai/docs/audio_transcription#supported-providers) |

Include API keys for the LLM providers you're using. For example:

@@ -114,6 +115,8 @@ Include API keys for the LLM providers you're using. For example:
| FAST_LLM_API_BASE | Custom API base URL for the fast LLM |
| STRATEGIC_LLM_API_BASE | Custom API base URL for the strategic LLM |
| LONG_CONTEXT_LLM_API_BASE | Custom API base URL for the long context LLM |
| TTS_SERVICE_API_BASE | Custom API base URL for the Text-to-Speech (TTS) service |
| STT_SERVICE_API_BASE | Custom API base URL for the Speech-to-Text (STT) service |

For other LLM providers, refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/providers).


@@ -65,6 +65,7 @@ Edit the `.env` file and set the following variables:
| UNSTRUCTURED_API_KEY | API key for Unstructured.io service |
| FIRECRAWL_API_KEY | API key for Firecrawl service (if using crawler) |
| TTS_SERVICE | Text-to-Speech API provider for Podcasts (e.g., `openai/tts-1`, `azure/neural`, `vertex_ai/`). See [supported providers](https://docs.litellm.ai/docs/text_to_speech#supported-providers) |
| STT_SERVICE | Speech-to-Text API provider for transcribing uploaded audio and video files (e.g., `openai/whisper-1`). See [supported providers](https://docs.litellm.ai/docs/audio_transcription#supported-providers) |

**Important**: Since LLM calls are routed through LiteLLM, include API keys for the LLM providers you're using:

@@ -86,6 +87,8 @@ Edit the `.env` file and set the following variables:
| FAST_LLM_API_BASE | Custom API base URL for the fast LLM |
| STRATEGIC_LLM_API_BASE | Custom API base URL for the strategic LLM |
| LONG_CONTEXT_LLM_API_BASE | Custom API base URL for the long context LLM |
| TTS_SERVICE_API_BASE | Custom API base URL for the Text-to-Speech (TTS) service |
| STT_SERVICE_API_BASE | Custom API base URL for the Speech-to-Text (STT) service |

### 2. Install Dependencies