open-notebook/open_notebook/utils
unendless314 6aabacfca6
feat: use token-based sizing for embedding chunking (#749)
* feat: make chunk sizing token-based with 512-token default

* fix: defer embedding debug token metrics

* chore: lower default chunk size to 400 tokens and document rationale

The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 while still being a large
improvement over the old character-based default for most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
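The env-var override described above can be sketched as follows. This is a minimal sketch, assuming the documented `OPEN_NOTEBOOK_CHUNK_SIZE` variable and 400-token default; `resolve_chunk_size` is a hypothetical helper name, not necessarily what chunking.py exposes:

```python
import os

DEFAULT_CHUNK_SIZE = 400  # tokens; ~20% headroom below a 512-token context window

def resolve_chunk_size(env=None):
    """Hypothetical sketch: read OPEN_NOTEBOOK_CHUNK_SIZE, fall back to 400 tokens."""
    env = os.environ if env is None else env
    raw = env.get("OPEN_NOTEBOOK_CHUNK_SIZE")
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CHUNK_SIZE  # ignore malformed values rather than crash
    return value if value > 0 else DEFAULT_CHUNK_SIZE
```

Rejecting non-positive and malformed values keeps a bad environment from silently producing zero-token chunks.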

* chore: move chunking changelog entry under 1.8.5

Target release is 1.8.5, so the Changed section moves out of Unreleased.

---------

Co-authored-by: Luis Novo <lfnovo@gmail.com>
2026-04-19 13:49:09 -03:00
__init__.py feat: credential-based API key management (#477) (#540) 2026-02-10 08:30:22 -03:00
chunking.py feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
CLAUDE.md feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
context_builder.py feat: content-type aware chunking and unified embedding (#444) 2026-01-21 23:49:08 -03:00
embedding.py feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
encryption.py feat: credential-based API key management (#477) (#540) 2026-02-10 08:30:22 -03:00
error_classifier.py fix: embedding batch sizing and 413 error classification (1.7.4) 2026-02-18 11:39:47 -03:00
graph_utils.py fix: use sync get_state() for SqliteSaver compatibility (#519) 2026-01-31 19:25:11 -03:00
README.md Version 1 (#160) 2025-10-18 12:46:22 -03:00
text_utils.py fix: handle structured content format in LLM response parsing 2026-02-08 22:29:45 +01:00
token_utils.py fix: narrow exception to (ImportError, OSError) and include error in log 2026-03-10 19:45:14 -05:00
version_utils.py feat: use standard HTTP_PROXY/HTTPS_PROXY environment variables (#499) 2026-01-29 23:31:02 -03:00

ContextBuilder

A flexible, generic ContextBuilder class for the Open Notebook project. It accepts arbitrary parameters and builds context from sources, notebooks, insights, and notes.

Features

  • Flexible Parameters: Accepts any parameters via **kwargs for future extensibility
  • Priority-based Management: Automatic prioritization and sorting of context items
  • Token Counting: Built-in token counting and truncation to fit limits
  • Deduplication: Automatic removal of duplicate items based on ID
  • Type-based Grouping: Separates sources, notes, and insights in output
  • Async Support: Fully async for database operations

Basic Usage

from open_notebook.utils.context_builder import ContextBuilder, ContextConfig

# Simple notebook context
builder = ContextBuilder(notebook_id="notebook:123")
context = await builder.build()

# Single source with insights
builder = ContextBuilder(
    source_id="source:456",
    include_insights=True,
    max_tokens=2000
)
context = await builder.build()

Convenience Functions

from open_notebook.utils.context_builder import (
    build_notebook_context,
    build_source_context,
    build_mixed_context
)

# Build notebook context
context = await build_notebook_context(
    notebook_id="notebook:123",
    max_tokens=5000
)

# Build single source context
context = await build_source_context(
    source_id="source:456",
    include_insights=True
)

# Build mixed context
context = await build_mixed_context(
    source_ids=["source:1", "source:2"],
    note_ids=["note:1", "note:2"],
    max_tokens=3000
)

Advanced Configuration

from open_notebook.utils.context_builder import ContextConfig

# Custom configuration
config = ContextConfig(
    sources={
        "source:doc1": "insights",
        "source:doc2": "full content", 
        "source:doc3": "not in"  # Exclude
    },
    notes={
        "note:summary": "full content",
        "note:draft": "not in"  # Exclude
    },
    include_insights=True,
    max_tokens=3000,
    priority_weights={
        "source": 120,  # Highest priority
        "note": 80,     # Lowest priority
        "insight": 100  # Medium priority
    }
)

builder = ContextBuilder(
    notebook_id="notebook:project",
    context_config=config
)
context = await builder.build()

Programmatic Item Management

from open_notebook.utils.context_builder import ContextItem

builder = ContextBuilder()

# Add custom items
item = ContextItem(
    id="source:important",
    type="source",
    content={"title": "Key Document", "summary": "..."},
    priority=150  # Very high priority
)
builder.add_item(item)

# Apply management operations
builder.remove_duplicates()
builder.prioritize()
builder.truncate_to_fit(1000)

context = builder._format_response()
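The management operations above behave roughly like the following standalone sketch. The function names mirror the methods, but these are simplified illustrations of the documented semantics (ID-based deduplication, priority-first ordering), not the actual ContextBuilder implementation:

```python
def remove_duplicates(items):
    """Keep the first occurrence of each item id, preserving insertion order."""
    seen, unique = set(), []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique

def prioritize(items):
    """Sort items so higher-priority entries come first."""
    return sorted(items, key=lambda item: item["priority"], reverse=True)
```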

Flexible Parameters

The ContextBuilder accepts any parameters via **kwargs, making it extensible for future features:

builder = ContextBuilder(
    notebook_id="notebook:123",
    include_insights=True,
    max_tokens=2000,
    
    # Custom parameters for future extensions
    user_id="user:456",
    custom_filter="advanced",
    experimental_feature=True
)

# Access custom parameters
user_id = builder.params.get('user_id')

Output Format

The ContextBuilder returns a structured response:

{
    "sources": [...],           # List of source contexts
    "notes": [...],             # List of note contexts  
    "insights": [...],          # List of insight contexts
    "total_tokens": 1234,       # Total token count
    "total_items": 10,          # Total number of items
    "notebook_id": "notebook:123",  # If provided
    "metadata": {
        "source_count": 5,
        "note_count": 3,
        "insight_count": 2,
        "config": {
            "include_insights": true,
            "include_notes": true,
            "max_tokens": 2000
        }
    }
}

Architecture

The ContextBuilder follows these design principles:

  1. Separation of Concerns: Context building, item management, and formatting are separate
  2. Extensibility: Uses **kwargs and flexible configuration for future features
  3. Performance: Token-aware truncation and efficient deduplication
  4. Type Safety: Proper type hints and data classes for structure
  5. Error Handling: Graceful handling of missing items and database errors
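The token-aware truncation mentioned in principle 3 can be illustrated with a greedy sketch: visit items in priority order and keep each one that still fits the budget. This shows the assumed semantics only; the real `truncate_to_fit` in context_builder.py may differ:

```python
def truncate_to_fit(items, max_tokens, count_tokens):
    """Greedy sketch: keep highest-priority items that fit within max_tokens."""
    kept, used = [], 0
    # Visit higher-priority items first so low-priority items are dropped first.
    for item in sorted(items, key=lambda i: i["priority"], reverse=True):
        cost = count_tokens(item["content"])
        if used + cost <= max_tokens:
            kept.append(item)
            used += cost
    return kept
```

Note that a greedy pass can skip a large mid-priority item and still keep a smaller low-priority one, which maximizes budget use at the cost of strict priority cutoffs.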

Integration

The ContextBuilder integrates seamlessly with the existing Open Notebook architecture:

  • Uses existing domain models (Source, Notebook, Note)
  • Leverages the repository pattern for database access
  • Follows the same async patterns as other services
  • Integrates with the token counting utilities

Error Handling

The ContextBuilder handles errors gracefully:

  • Missing notebooks/sources/notes are logged but don't stop execution
  • Database errors are wrapped in DatabaseOperationError
  • Invalid parameters raise InvalidInputError
  • All errors include detailed context information
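A typical caller-side pattern for the behavior described above might look like this. The exception classes are those named in this section, but they are redefined locally here as stand-ins since their import path is not documented; `build_context_safely` is a hypothetical wrapper:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for the project's real exception classes named above.
class DatabaseOperationError(Exception): ...
class InvalidInputError(Exception): ...

async def build_context_safely(builder, fallback=None):
    """Build context, degrading gracefully on infrastructure errors."""
    try:
        return await builder.build()
    except InvalidInputError:
        # The caller supplied bad parameters; surface this instead of hiding it.
        raise
    except DatabaseOperationError as exc:
        logger.error("Context build failed: %s", exc)
        return fallback
```

Re-raising InvalidInputError keeps programming errors visible, while database failures fall back to a caller-supplied default.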