open-notebook/open_notebook/utils
unendless314 6aabacfca6
feat: use token-based sizing for embedding chunking (#749)
* feat: make chunk sizing token-based with 512-token default

* fix: defer embedding debug token metrics

* chore: lower default chunk size to 400 tokens and document rationale

The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 while still being a large
improvement over the old character-based default for most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
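The env-var override described above can be sketched as follows. This is a minimal sketch, assuming the documented `OPEN_NOTEBOOK_CHUNK_SIZE` variable and 400-token default; `resolve_chunk_size` is a hypothetical helper name, not necessarily what chunking.py exposes:

```python
import os

DEFAULT_CHUNK_SIZE = 400  # tokens; ~20% headroom below a 512-token context window

def resolve_chunk_size(env=None):
    """Hypothetical sketch: read OPEN_NOTEBOOK_CHUNK_SIZE, fall back to 400 tokens."""
    env = os.environ if env is None else env
    raw = env.get("OPEN_NOTEBOOK_CHUNK_SIZE")
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CHUNK_SIZE  # ignore malformed values rather than crash
    return value if value > 0 else DEFAULT_CHUNK_SIZE
```

Rejecting non-positive and malformed values keeps a bad environment from silently producing zero-token chunks.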

* chore: move chunking changelog entry under 1.8.5

Target release is 1.8.5, so the Changed section moves out of Unreleased.

---------

Co-authored-by: Luis Novo <lfnovo@gmail.com>
2026-04-19 13:49:09 -03:00
__init__.py feat: credential-based API key management (#477) (#540) 2026-02-10 08:30:22 -03:00
chunking.py feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
CLAUDE.md feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
context_builder.py feat: content-type aware chunking and unified embedding (#444) 2026-01-21 23:49:08 -03:00
embedding.py feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
encryption.py feat: credential-based API key management (#477) (#540) 2026-02-10 08:30:22 -03:00
error_classifier.py fix: embedding batch sizing and 413 error classification (1.7.4) 2026-02-18 11:39:47 -03:00
graph_utils.py fix: use sync get_state() for SqliteSaver compatibility (#519) 2026-01-31 19:25:11 -03:00
README.md Version 1 (#160) 2025-10-18 12:46:22 -03:00
text_utils.py fix: handle structured content format in LLM response parsing 2026-02-08 22:29:45 +01:00
token_utils.py fix: narrow exception to (ImportError, OSError) and include error in log 2026-03-10 19:45:14 -05:00
version_utils.py feat: use standard HTTP_PROXY/HTTPS_PROXY environment variables (#499) 2026-01-29 23:31:02 -03:00

ContextBuilder

A flexible, generic ContextBuilder class for the Open Notebook project. It accepts arbitrary parameters and builds context from sources, notebooks, insights, and notes.

Features

  • Flexible Parameters: Accepts any parameters via **kwargs for future extensibility
  • Priority-based Management: Automatic prioritization and sorting of context items
  • Token Counting: Built-in token counting and truncation to fit limits
  • Deduplication: Automatic removal of duplicate items based on ID
  • Type-based Grouping: Separates sources, notes, and insights in output
  • Async Support: Fully async for database operations

Basic Usage

from open_notebook.utils.context_builder import ContextBuilder, ContextConfig

# Simple notebook context
builder = ContextBuilder(notebook_id="notebook:123")
context = await builder.build()

# Single source with insights
builder = ContextBuilder(
    source_id="source:456",
    include_insights=True,
    max_tokens=2000
)
context = await builder.build()

Convenience Functions

from open_notebook.utils.context_builder import (
    build_notebook_context,
    build_source_context,
    build_mixed_context
)

# Build notebook context
context = await build_notebook_context(
    notebook_id="notebook:123",
    max_tokens=5000
)

# Build single source context
context = await build_source_context(
    source_id="source:456",
    include_insights=True
)

# Build mixed context
context = await build_mixed_context(
    source_ids=["source:1", "source:2"],
    note_ids=["note:1", "note:2"],
    max_tokens=3000
)

Advanced Configuration

from open_notebook.utils.context_builder import ContextConfig

# Custom configuration
config = ContextConfig(
    sources={
        "source:doc1": "insights",
        "source:doc2": "full content", 
        "source:doc3": "not in"  # Exclude
    },
    notes={
        "note:summary": "full content",
        "note:draft": "not in"  # Exclude
    },
    include_insights=True,
    max_tokens=3000,
    priority_weights={
        "source": 120,  # Highest priority
        "note": 80,     # Lowest priority
        "insight": 100  # Medium priority
    }
)

builder = ContextBuilder(
    notebook_id="notebook:project",
    context_config=config
)
context = await builder.build()

Programmatic Item Management

from open_notebook.utils.context_builder import ContextItem

builder = ContextBuilder()

# Add custom items
item = ContextItem(
    id="source:important",
    type="source",
    content={"title": "Key Document", "summary": "..."},
    priority=150  # Very high priority
)
builder.add_item(item)

# Apply management operations
builder.remove_duplicates()
builder.prioritize()
builder.truncate_to_fit(1000)

context = builder._format_response()
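The management operations above behave roughly like the following standalone sketch. The function names mirror the methods, but these are simplified illustrations of the documented semantics (ID-based deduplication, priority-first ordering), not the actual ContextBuilder implementation:

```python
def remove_duplicates(items):
    """Keep the first occurrence of each item id, preserving insertion order."""
    seen, unique = set(), []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique

def prioritize(items):
    """Sort items so higher-priority entries come first."""
    return sorted(items, key=lambda item: item["priority"], reverse=True)
```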

Flexible Parameters

The ContextBuilder accepts any parameters via **kwargs, making it extensible for future features:

builder = ContextBuilder(
    notebook_id="notebook:123",
    include_insights=True,
    max_tokens=2000,
    
    # Custom parameters for future extensions
    user_id="user:456",
    custom_filter="advanced",
    experimental_feature=True
)

# Access custom parameters
user_id = builder.params.get('user_id')

Output Format

The ContextBuilder returns a structured response:

{
    "sources": [...],           # List of source contexts
    "notes": [...],             # List of note contexts  
    "insights": [...],          # List of insight contexts
    "total_tokens": 1234,       # Total token count
    "total_items": 10,          # Total number of items
    "notebook_id": "notebook:123",  # If provided
    "metadata": {
        "source_count": 5,
        "note_count": 3,
        "insight_count": 2,
        "config": {
            "include_insights": true,
            "include_notes": true,
            "max_tokens": 2000
        }
    }
}

Architecture

The ContextBuilder follows these design principles:

  1. Separation of Concerns: Context building, item management, and formatting are separate
  2. Extensibility: Uses **kwargs and flexible configuration for future features
  3. Performance: Token-aware truncation and efficient deduplication
  4. Type Safety: Proper type hints and data classes for structure
  5. Error Handling: Graceful handling of missing items and database errors
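The token-aware truncation mentioned in principle 3 can be illustrated with a greedy sketch: visit items in priority order and keep each one that still fits the budget. This shows the assumed semantics only; the real `truncate_to_fit` in context_builder.py may differ:

```python
def truncate_to_fit(items, max_tokens, count_tokens):
    """Greedy sketch: keep highest-priority items that fit within max_tokens."""
    kept, used = [], 0
    # Visit higher-priority items first so low-priority items are dropped first.
    for item in sorted(items, key=lambda i: i["priority"], reverse=True):
        cost = count_tokens(item["content"])
        if used + cost <= max_tokens:
            kept.append(item)
            used += cost
    return kept
```

Note that a greedy pass can skip a large mid-priority item and still keep a smaller low-priority one, which maximizes budget use at the cost of strict priority cutoffs.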

Integration

The ContextBuilder integrates seamlessly with the existing Open Notebook architecture:

  • Uses existing domain models (Source, Notebook, Note)
  • Leverages the repository pattern for database access
  • Follows the same async patterns as other services
  • Integrates with the token counting utilities

Error Handling

The ContextBuilder handles errors gracefully:

  • Missing notebooks/sources/notes are logged but don't stop execution
  • Database errors are wrapped in DatabaseOperationError
  • Invalid parameters raise InvalidInputError
  • All errors include detailed context information
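A typical caller-side pattern for the behavior described above might look like this. The exception classes are those named in this section, but they are redefined locally here as stand-ins since their import path is not documented; `build_context_safely` is a hypothetical wrapper:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for the project's real exception classes named above.
class DatabaseOperationError(Exception): ...
class InvalidInputError(Exception): ...

async def build_context_safely(builder, fallback=None):
    """Build context, degrading gracefully on infrastructure errors."""
    try:
        return await builder.build()
    except InvalidInputError:
        # The caller supplied bad parameters; surface this instead of hiding it.
        raise
    except DatabaseOperationError as exc:
        logger.error("Context build failed: %s", exc)
        return fallback
```

Re-raising InvalidInputError keeps programming errors visible, while database failures fall back to a caller-supplied default.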