mirror of
https://github.com/lfnovo/open-notebook.git
synced 2026-04-29 03:50:04 +00:00
* feat: content-type aware chunking and unified embedding
  - Add chunking.py with HTML, Markdown, and plain text detection
  - Add embedding.py with mean pooling for large content
  - Create dedicated commands: embed_note, embed_insight, embed_source
  - Use fire-and-forget pattern for embedding via submit_command()
  - Refactor rebuild_embeddings_command to delegate to individual commands
  - Remove legacy commands and needs_embedding() methods
  - Reduce chunk size to 1500 chars for Ollama compatibility
  - Update CLAUDE.md documentation for new architecture
  Fixes #350, #142
* fix: address code review issues
  - Note.save() now returns command_id for tracking embedding jobs
  - Add length check after generate_embeddings() to fail fast on mismatch
  - Add numpy as explicit dependency (was transitive)
  - Remove hardcoded chunk sizes from docstrings
* docs: address code review comments
  - Rename "SYNC PATH" to "DOMAIN MODEL PATH" in embedding router
  - Add test_chunking.py and test_embedding.py to Testing Strategy
  - Clarify auto-embedding behavior for each domain model
* fix: clean thinking tags from prompt graph output
  Adds clean_thinking_content() to prompt.py to handle extended thinking models that return <think>...</think> tags. This fixes empty titles when saving notes from chat.
* chore: remove local docker-compose from git
* fix(frontend): handle null parent_id in search results
  Add defensive check for null parent_id in search results to prevent "Cannot read properties of null (reading 'split')" error. This can happen with orphaned records in the database.
* fix: cascade delete embeddings and insights when source is deleted
  When deleting a Source, now also deletes associated:
  - source_embedding records
  - source_insight records
  This prevents orphaned records that cause null parent_id errors in vector search results.
* fix: add cleanup for orphan embedding/insight records in migration 10
  Deletes source_embedding and source_insight records where the linked source no longer exists (source.id = NONE).
* chore: bump esperanto to 2.16
  Increases ctx_num for Ollama models to accommodate larger notebook context windows. See: https://github.com/lfnovo/esperanto/pull/69
242 lines
13 KiB
Markdown
# Open Notebook Core Backend

The `open_notebook` module is the heart of the system: a multi-layer backend orchestrating AI-powered research workflows. It bridges domain models, asynchronous database operations, LangGraph-based content processing, and multi-provider AI model management.

## Purpose

Encapsulates the entire backend architecture:
1. **Data layer**: SurrealDB persistence with async CRUD and migrations
2. **Domain layer**: Research models (Notebook, Source, Note, etc.) with embedded relationships
3. **Workflow layer**: LangGraph state machines for content ingestion, chat, and transformations
4. **AI provisioning**: Multi-provider model management with smart fallback logic
5. **Support services**: Context building, tokenization, and utility functions

All components communicate through async/await patterns and use Pydantic for validation.

## Architecture Overview

```
┌────────────────────────────────────────────────────────────┐
│                     API / Streamlit UI                     │
└──────────────────────┬─────────────────────────────────────┘
                       │
    ┌──────────────────┴───────────────────┐
    │                                      │
┌───▼───────────────────────┐   ┌──────────▼─────────────────┐
│ Graphs (LangGraph)        │   │ Domain Models (Data)       │
│ - source.py (ingestion)   │   │ - Notebook, Source, Note   │
│ - chat.py                 │   │ - ChatSession, Asset       │
│ - ask.py (search)         │   │ - SourceInsight, Embedding │
│ - transformation.py       │   │ - Transformation, Settings │
└───┬───────────────────────┘   │ - EpisodeProfile, Podcast  │
    │                           └──────────┬─────────────────┘
    │                                      │
    └──────────────────┬───────────────────┘
                       │
    ┌──────────────────┴───────────────────┐
    │                                      │
┌───▼───────────────────────┐   ┌──────────▼─────────────────┐
│ AI Module (Models)        │   │ Utils (Helpers)            │
│ - ModelManager            │   │ - ContextBuilder           │
│ - DefaultModels           │   │ - TokenUtils               │
│ - provision_langchain     │   │ - TextUtils                │
│ - Multi-provider AI       │   │ - VersionUtils             │
└───┬───────────────────────┘   └──────────┬─────────────────┘
    │                                      │
    └──────────────────┬───────────────────┘
                       │
        ┌──────────────▼────────────────┐
        │ Database (SurrealDB)          │
        │ - repository.py (CRUD ops)    │
        │ - async_migrate.py (schema)   │
        │ - Configuration               │
        └───────────────────────────────┘
```

## Component Catalog

### Core Layers

**See dedicated CLAUDE.md files for detailed patterns and usage:**

- **`database/`**: Async repository pattern (repo_query, repo_create, repo_upsert), connection management, and automatic schema migrations on API startup. See `database/CLAUDE.md`.

- **`domain/`**: Core data models using Pydantic with SurrealDB persistence. Two base classes: `ObjectModel` (mutable records with auto-increment IDs and embedding) and `RecordModel` (singleton configuration). Includes search functions (text_search, vector_search). See `domain/CLAUDE.md`.

- **`graphs/`**: LangGraph state machines for async workflows. Content ingestion (source.py), conversational agents (chat.py), search synthesis (ask.py), and transformations. Uses provision_langchain_model() for smart model selection with token-aware fallback. See `graphs/CLAUDE.md`.

- **`ai/`**: Centralized AI model lifecycle via Esperanto library. ModelManager factory with intelligent fallback (large context detection, type-specific defaults, config override). Supports 8+ providers (OpenAI, Anthropic, Google, Groq, Ollama, Mistral, DeepSeek, xAI). See `ai/CLAUDE.md`.

- **`utils/`**: Cross-cutting utilities: ContextBuilder (flexible context assembly from sources/notes/insights with token budgeting), TextUtils (truncation, cleaning), TokenUtils (GPT token counting), VersionUtils (schema compatibility). See `utils/CLAUDE.md`.

- **`podcasts/`**: Podcast generation models: SpeakerProfile (TTS voice config), EpisodeProfile (generation settings), PodcastEpisode (job tracking via surreal-commands). See `podcasts/CLAUDE.md`.

### Configuration & Exceptions

- **`config.py`**: Paths for data folder, uploads, LangGraph checkpoints, and tiktoken cache. Auto-creates directories.
- **`exceptions.py`**: Hierarchy of OpenNotebookError subclasses for database, file, network, authentication, and rate-limit failures.

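A minimal sketch of what such an exception hierarchy might look like. Only `OpenNotebookError` is named in this document; every subclass name and the `classify()` helper below are hypothetical placeholders, not necessarily what `exceptions.py` defines.

```python
class OpenNotebookError(Exception):
    """Base class: catching this handles every application-level failure."""


# Hypothetical subclass names, one per failure category listed above.
class DatabaseError(OpenNotebookError):
    pass


class FileError(OpenNotebookError):
    pass


class NetworkError(OpenNotebookError):
    pass


class AuthError(OpenNotebookError):
    pass


class RateLimitError(OpenNotebookError):
    pass


def classify(exc: Exception) -> str:
    """Map an exception to a coarse label (illustrative helper, not project API)."""
    if isinstance(exc, RateLimitError):
        return "retry_later"
    if isinstance(exc, OpenNotebookError):
        return "application_error"
    return "unexpected"
```

A shared base class lets callers catch one type at the boundary while still branching on specific categories where it matters.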
## Data Flow: Content Ingestion

```
User uploads file/URL
                 │
                 ▼
┌─────────────────────────────────────┐
│ source.py (LangGraph state machine) │
├─────────────────────────────────────┤
│ 1. content_process()                │
│    - extract_content() from file/URL│
│    - Use ContentSettings defaults   │
│    - speech_to_text model from DB   │
│                                     │
│ 2. save_source()                    │
│    - Update Source with full_text   │
│    - Preserve title if empty        │
│                                     │
│ 3. trigger_transformations()        │
│    - Parallel fan-out to each TXN   │
└────────────────┬────────────────────┘
                 │
                 ▼
 ┌───────────────────────────────┐
 │ transformation.py (parallel)  │
 │ - Apply prompt to source text │
 │ - Generate insights           │
 │ - Auto-embed results          │
 └───────────────┬───────────────┘
                 │
                 ▼
       ┌────────────────────┐
       │ Database Storage   │
       │ - Source.full_text │
       │ - SourceInsight    │
       │ - Embeddings       │
       │ - (async job)      │
       └────────────────────┘
```

**Fire-and-forget embeddings**: `Source.vectorize()` returns a `command_id` without awaiting the result; embedding runs asynchronously in the surreal-commands job system.

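The parallel fan-out in step 3 can be sketched with plain `asyncio`. The real graph is built on LangGraph (the dependency list names `Send`); this sketch swaps in `asyncio.gather` to show the same shape. `apply_transformation` and its return format are hypothetical stand-ins for an LLM-backed transformation run.

```python
import asyncio


async def apply_transformation(name: str, text: str) -> dict:
    """Stand-in for one transformation run (an LLM call in the real graph)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"transformation": name, "insight": f"{name}: {len(text)} chars"}


async def trigger_transformations(text: str, names: list[str]) -> list[dict]:
    """Fan out one coroutine per transformation and wait for all of them."""
    tasks = [apply_transformation(name, text) for name in names]
    # gather() returns results in the same order as `names`.
    return await asyncio.gather(*tasks)


insights = asyncio.run(trigger_transformations("hello world", ["summary", "topics"]))
```

Each transformation is independent of the others, which is what makes the fan-out safe to run concurrently.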
## Data Flow: Chat & Search

```
User message in chat
           │
           ▼
┌──────────────────────────┐
│ ContextBuilder           │
│ - Select sources/notes   │
│ - Token budget limiting  │
│ - Priority weighting     │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────────────┐
│ chat.py or ask.py (LangGraph)    │
│ - Load context from above        │
│ - provision_langchain_model()    │
│   * Auto-upgrade for large text  │
│   * Apply model_id override      │
│ - Call LLM with context          │
│ - Store message in SqliteSaver   │
└──────────┬───────────────────────┘
           │
           ▼
    ┌──────────────┐
    │ LLM Response │
    │ (persisted)  │
    └──────────────┘
```

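As a rough illustration of the ContextBuilder step (token budget limiting plus priority weighting), here is a greedy selection sketch. `ContextItem`, `estimate_tokens`, and `build_context` are hypothetical names, and the 4-characters-per-token estimate is an assumption standing in for the tiktoken-based counting the project uses.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    item_id: str
    text: str
    priority: int  # higher means more important


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_context(items: list[ContextItem], max_tokens: int) -> list[ContextItem]:
    """Greedily keep the highest-priority items that still fit the budget."""
    selected: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= max_tokens:
            selected.append(item)
            used += cost
    return selected
```

Items that do not fit are skipped rather than truncated here; a real builder might truncate or summarize instead.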
## Key Patterns Across Layers

### Async/Await Everywhere
All database operations, model provisioning, and graph execution are async. Mix with sync code only via `asyncio.run()` or LangGraph's async bridges (see graphs/CLAUDE.md for workarounds).

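One common way to bridge sync and async code is sketched below, assuming a worker-thread approach similar in spirit to the ThreadPoolExecutor workaround mentioned under Quirks. `fetch_title` and `run_sync` are hypothetical names for illustration, not functions from this codebase.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_title(source_id: str) -> str:
    """Hypothetical async call standing in for a database read."""
    await asyncio.sleep(0)
    return f"title for {source_id}"


def run_sync(coro):
    """Run a coroutine from synchronous code, even if a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is running in this thread: run the coroutine on a fresh loop
    # in a worker thread and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

The second branch blocks the calling thread, and with it the already-running loop, until the coroutine completes, which is part of why this kind of bridge is fragile.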
### Type-Driven Dispatch
Model types (language, embedding, speech_to_text, text_to_speech) drive factory logic in ModelManager. Domain model IDs encode their type: `notebook:uuid`, `source:uuid`, `note:uuid`.

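A sketch of how an ID prefix can drive class selection. The registry mechanism (`__init_subclass__`) and the `resolve()` helper are assumptions for illustration; the actual lookup in `domain/base.py` may work differently.

```python
from typing import Optional


class ObjectModel:
    """Minimal stand-in for the base class; the registry is an assumption."""

    table_name = ""
    _registry: dict = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Subclasses register themselves when their module is imported.
        ObjectModel._registry[cls.table_name] = cls

    def __init__(self, record_id: str):
        self.id = record_id


class Notebook(ObjectModel):
    table_name = "notebook"


class Source(ObjectModel):
    table_name = "source"


def resolve(record_id: str) -> Optional[ObjectModel]:
    """Pick the model class from the `table:` prefix of a record ID."""
    table, _, _ = record_id.partition(":")
    cls = ObjectModel._registry.get(table)
    return cls(record_id) if cls else None
```

In this sketch an unknown prefix yields `None` instead of raising, which mirrors the "fails silently if subclass not imported" behavior noted under Quirks.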
### Smart Fallback Logic
`provision_langchain_model()` auto-detects large contexts (105K+ tokens) and upgrades to dedicated large_context_model. Falls back to default_chat_model if specific type not found.

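The selection order can be sketched as a pure function. This is an illustration under assumptions, not the body of `provision_langchain_model()`: the `choose_model` name, the `defaults` dict layout, and whether the 105,000 boundary is inclusive are all assumed here.

```python
from typing import Optional

LARGE_CONTEXT_THRESHOLD = 105_000  # tokens, per the description above


def choose_model(
    token_count: int,
    model_id: Optional[str],
    defaults: dict,
    model_type: str = "chat",
) -> str:
    """Return a model name: large-context upgrade, then override, then defaults."""
    if token_count > LARGE_CONTEXT_THRESHOLD:
        return defaults["large_context_model"]
    if model_id:
        return model_id
    # Fall back to the chat default when no type-specific default is configured.
    return defaults.get(f"default_{model_type}_model") or defaults["default_chat_model"]
```

Keeping the decision separate from model construction makes the precedence easy to test without touching any provider.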
### Fire-and-Forget Jobs
Time-consuming operations (embedding, podcast generation) return command_id immediately. Caller polls surreal-commands for status; no blocking.

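The submit-then-poll shape can be sketched with an in-memory registry of `asyncio` tasks. This does not use the surreal-commands API; `submit_job`, `job_status`, and the `command:` ID format are hypothetical stand-ins chosen for the example.

```python
import asyncio
import uuid

_jobs: dict = {}  # command_id -> asyncio.Task; in-memory stand-in for a job queue


async def _embed(text: str) -> int:
    await asyncio.sleep(0.01)  # pretend to call an embedding model
    return len(text)


def submit_job(text: str) -> str:
    """Schedule the work and return a command_id without waiting for it."""
    command_id = f"command:{uuid.uuid4().hex}"
    # Keeping the task in the registry also prevents it from being garbage collected.
    _jobs[command_id] = asyncio.get_running_loop().create_task(_embed(text))
    return command_id


def job_status(command_id: str) -> str:
    return "completed" if _jobs[command_id].done() else "running"


async def main() -> tuple:
    command_id = submit_job("some note text")
    first = job_status(command_id)  # checked right after submit
    while job_status(command_id) != "completed":
        await asyncio.sleep(0.005)  # the caller polls instead of blocking
    return first, job_status(command_id)
```

The caller gets an identifier back immediately and decides for itself when, or whether, to check on the result.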
### Fire-and-Forget Embedding
Domain models submit embedding commands after save via `submit_command()` (non-blocking). Note.save() submits `embed_note`, Source.add_insight() submits `embed_insight`, Source.vectorize() submits `embed_source`. Search functions (text_search, vector_search) use embeddings for semantic matching.

### Relationship Management
SurrealDB graph edges link entities: Notebook→Source (has), Source→Note (artifact), Note→Source (refers_to). See `relate()` in domain/base.py.

## Integration Points

**API startup** (`api/main.py`):
- AsyncMigrationManager.run_migration_up() on lifespan startup
- Ensures schema is current before handling requests

**Streamlit UI** (`pages/stream_app/`):
- Calls domain models directly to fetch/create notebooks, sources, notes
- Invokes graphs (chat, source, ask) via async wrapper
- Relies on API for migrations (deprecated check in UI)

**Background Jobs** (`surreal_commands`):
- Source.vectorize() submits async embedding job
- PodcastEpisode.get_job_status() polls job queue
- Decouples long-running operations from request flow

## Important Quirks & Gotchas

1. **Token counting is a rough estimate**: Uses the cl100k_base encoding, so counts may differ 5-10% from the actual model's tokenizer
2. **Large context threshold hard-coded**: 105,000-token threshold for the large_context_model upgrade (not configurable)
3. **Async loop gymnastics in graphs**: ThreadPoolExecutor workaround for LangGraph sync nodes calling async functions (fragile)
4. **DefaultModels always fresh**: get_instance() bypasses the singleton cache to pick up live config changes
5. **Polymorphic model.get()**: Resolves the subclass from the ID prefix; fails silently if the subclass is not imported
6. **RecordID string inconsistency**: repo_update() accepts both the "table:id" string format and a full RecordID
7. **Snapshot profiles**: Podcast profiles are stored as dicts, so config updates don't affect past episodes
8. **No connection pooling**: Each repo_* call creates a new connection (adequate for HTTP but inefficient for bulk operations)
9. **Circular import guard**: utils imports domain; domain must not import utils (doing so breaks at import time)
10. **SqliteSaver shared location**: The LangGraph checkpoint path comes from the LANGGRAPH_CHECKPOINT_FILE env var; all graphs use the same file

## How to Add a New Feature

**New data model**:
1. Create class inheriting from `ObjectModel` with `table_name` ClassVar
2. Define Pydantic fields and validators
3. Override `save()` to submit embedding command if searchable (use `submit_command("embed_*", id)`)
4. Add custom methods for domain logic (get_X, add_to_Y)
5. Register in domain/__init__.py exports

**New workflow**:
1. Create state machine in graphs/WORKFLOW.py using StateGraph
2. Import domain models and provision_langchain_model()
3. Define nodes as async functions taking State, returning dict
4. Compile with graph.compile()
5. Invoke from API endpoint or Streamlit page

**New AI model type**:
1. Add type string to Model class
2. Add AIFactory.create_* method in Esperanto
3. Handle in ModelManager.get_model()
4. Add DefaultModels field + getter

## Key Dependencies

- **surrealdb**: AsyncSurreal client, RecordID type
- **pydantic**: Validation, field_validator
- **langgraph**: StateGraph, Send, SqliteSaver, async/sync bridging
- **langchain_core**: Messages, OutputParser, RunnableConfig
- **esperanto**: Multi-provider AI model abstraction (OpenAI, Anthropic, Google, Groq, Ollama, etc.)
- **content-core**: File/URL content extraction
- **ai_prompter**: Jinja2 template rendering for prompts
- **surreal_commands**: Async job queue for embeddings, podcast generation
- **loguru**: Structured logging throughout
- **tiktoken**: GPT token encoding for context window estimation

## Codebase Statistics

- **Modules**: 6 core layers + support services
- **Async operations**: Database, AI provisioning, graph execution, embedding, job tracking
- **Supported AI providers**: 8+ (OpenAI, Anthropic, Google, Groq, Ollama, Mistral, DeepSeek, xAI, OpenRouter)
- **Domain models**: Notebook, Source, Note, SourceInsight, SourceEmbedding, ChatSession, Asset, Transformation, ContentSettings, EpisodeProfile, SpeakerProfile, PodcastEpisode
- **Graph workflows**: 6 (source, chat, source_chat, ask, transformation, prompt)