# Utils Module

Utility functions and helpers for context building, text processing, tokenization, and versioning.

## Purpose

Provides helpers for cross-cutting concerns: building LLM context from sources/insights, text utilities (truncation, cleaning), token counting, and version management.

## Architecture Overview

**Four core utilities**:
1. **context_builder.py**: Flexible context assembly from sources, notes, insights with token budgeting
2. **text_utils.py**: Text truncation, whitespace cleaning, formatting helpers
3. **token_utils.py**: Token counting for LLM context windows (wrapper around encoding library)
4. **version_utils.py**: Version parsing, comparison, and schema compatibility checks

Each utility is stateless and can be imported independently.

## Component Catalog

### context_builder.py
- **ContextItem**: Dataclass for individual context piece (id, type, content, priority, token_count)
- **ContextConfig**: Configuration for context building (sources/notes/insights selection, max tokens, priority weights)
- **ContextBuilder**: Main class assembling context
  - `add_source()`: Include source by ID with inclusion level
  - `add_note()`: Include note by ID
  - `add_insight()`: Include insight by ID
  - `build()`: Assemble context respecting token budget and priorities
  - Uses vector_search to fetch source/insight content from SurrealDB
  - Returns list of ContextItem objects sorted by priority

**Key behavior**:
- Token counting is automatic (calculated in ContextItem.__post_init__)
- Max token enforcement via priority weighting (higher priority items included first)
- Type-specific fetching: sources → Source.full_text, notes → Note.content, insights → SourceInsight.content
- Raises DatabaseOperationError if source/note fetch fails
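
The budgeting behavior above can be sketched as follows. This is a minimal illustration, not the module's code: `estimate_tokens` and `select_within_budget` are hypothetical names, content is simplified to a plain string, the word-based token estimate stands in for `token_utils.token_count`, and the sketch assumes an item that would push the total past the budget is left out.

```python
from dataclasses import dataclass, field
from typing import List


def estimate_tokens(text: str) -> int:
    """Hypothetical stand-in for token_utils.token_count: one token per word."""
    return len(text.split())


@dataclass
class ContextItem:
    id: str
    type: str
    content: str  # simplified to a string for this sketch
    priority: float = 1.0
    token_count: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Token count is derived automatically when the item is created.
        self.token_count = estimate_tokens(self.content)


def select_within_budget(items: List[ContextItem], max_tokens: int) -> List[ContextItem]:
    """Take items in descending priority order until the budget would be exceeded."""
    selected: List[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        if used + item.token_count > max_tokens:
            break  # hard stop: items are not truncated or prorated to fit
        selected.append(item)
        used += item.token_count
    return selected


items = [
    ContextItem("note:1", "note", "a b c", priority=0.8),
    ContextItem("source:1", "source", "a b c d", priority=1.0),
    ContextItem("insight:1", "insight", "a b", priority=1.2),
]
picked = select_within_budget(items, max_tokens=6)
```

With a budget of 6, the insight (2 tokens) and source (4 tokens) are kept and the note is dropped.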

### text_utils.py
- **truncate_text(text, max_chars, suffix="...")**: Truncates a string to max_chars and appends the suffix (an ellipsis by default)
- **clean_text(text)**: Removes extra whitespace, normalizes newlines
- **extract_sentences(text, max_count)**: Splits text into sentences up to limit
- **normalize_whitespace(text)**: Collapses runs of spaces or newlines into a single one
- **format_for_llm(text)**: Combines cleaning + normalization for LLM consumption

**Key behavior**: All functions are pure (no side effects); safe for high-volume processing
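
As a rough sketch of how two of these helpers might look (an assumption about the implementation, not a copy of it): here the suffix counts toward `max_chars`, and newline runs collapse to a single newline rather than to a space.

```python
import re


def truncate_text(text: str, max_chars: int, suffix: str = "...") -> str:
    """Return text unchanged if it is short enough, else cut it and append suffix."""
    if len(text) <= max_chars:
        return text
    # Assumption: the suffix counts toward max_chars.
    keep = max(max_chars - len(suffix), 0)
    return text[:keep].rstrip() + suffix


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and runs of newlines to one newline."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()
```

Both functions return new strings and keep no state, which matches the pure-function behavior described above.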

### token_utils.py
- **token_count(text)**: Returns estimated token count for string (via encoding library)
- **remaining_tokens(max_tokens, used)**: Returns remaining tokens in budget
- **fits_in_context(text, max_tokens)**: Boolean check if text fits token budget

**Key behavior**: Uses fixed encoding (cl100k_base for GPT models); may differ slightly from actual model tokenization
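
A minimal sketch of these three functions, assuming `tiktoken` is installed. The fallback heuristic (about four characters per token) and the clamping of `remaining_tokens` at zero are illustrative assumptions, not documented behavior of the module.

```python
def _load_encoder():
    """Try to load tiktoken's cl100k_base encoding; return None if unavailable."""
    try:
        import tiktoken
        return tiktoken.get_encoding("cl100k_base")
    except Exception:
        # Library missing or encoding files unreachable: use the heuristic below.
        return None


_ENCODER = _load_encoder()


def token_count(text: str) -> int:
    if not text:
        return 0
    if _ENCODER is not None:
        # disallowed_special=() encodes special-token strings as ordinary text.
        return len(_ENCODER.encode(text, disallowed_special=()))
    # Rough fallback (assumption): about four characters per token.
    return max(1, len(text) // 4)


def remaining_tokens(max_tokens: int, used: int) -> int:
    # Assumption: clamp at zero so an overspent budget never reports negative room.
    return max(max_tokens - used, 0)


def fits_in_context(text: str, max_tokens: int) -> bool:
    return token_count(text) <= max_tokens
```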

### version_utils.py
- **parse_version(version_string)**: Parses "1.2.3" format; returns Version namedtuple
- **compare_versions(v1, v2)**: Returns -1 (v1 < v2), 0 (equal), 1 (v1 > v2)
- **is_compatible(current, required)**: Checks if current version meets requirement (e.g., current >= required)
- **schema_version_check()**: Validates database schema version on startup

**Key behavior**: Assumes semantic versioning (MAJOR.MINOR.PATCH); non-standard formats raise ValueError
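
The parsing and comparison rules can be illustrated with a short sketch. It assumes `compare_versions` and `is_compatible` accept version strings (the actual signatures may take parsed `Version` values instead) and handles only plain numeric MAJOR.MINOR.PATCH versions.

```python
from collections import namedtuple

Version = namedtuple("Version", ["major", "minor", "patch"])


def parse_version(version_string: str) -> Version:
    parts = version_string.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"Not a MAJOR.MINOR.PATCH version: {version_string!r}")
    return Version(*(int(p) for p in parts))


def compare_versions(v1: str, v2: str) -> int:
    a, b = parse_version(v1), parse_version(v2)
    # Tuples compare element by element, so 1.10.0 sorts after 1.9.9.
    return (a > b) - (a < b)


def is_compatible(current: str, required: str) -> bool:
    return compare_versions(current, required) >= 0
```

Comparing parsed integers rather than raw strings matters: a string comparison would rank "1.9.9" above "1.10.0".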

## Common Patterns

- **Dataclass-driven config**: ContextConfig used by ContextBuilder (immutable after init)
- **Token budgeting**: ContextBuilder respects max_tokens constraint; prioritizes high-priority items
- **Error handling resilience**: token_count() returns an estimate rather than an exact count; context_builder reports failed source/note fetches as DatabaseOperationError
- **Pure text functions**: text_utils functions are stateless utilities (no class needed)
- **Lazy evaluation**: ContextBuilder doesn't fetch items until build() called
- **Type hints throughout**: Functions are annotated with typing hints (Optional, List, Dict) for clarity

## Key Dependencies

- `open_notebook.domain.notebook`: Source, Note, SourceInsight models; vector_search function
- `open_notebook.exceptions`: DatabaseOperationError, NotFoundError
- `tiktoken` (via token_utils.py): Token encoding for GPT models
- `loguru`: Logging in context_builder (debug-level)

## Important Quirks & Gotchas

- **Token count estimation**: Uses cl100k_base encoding; may differ 5-10% from actual model tokens
- **Priority weights default**: If not specified, ContextConfig uses default weights (source=1, note=0.8, insight=1.2)
- **Vector search required**: ContextBuilder assumes vector_search is available on Notebook model; fails if not
- **Source.full_text vs content**: Uses full_text field (may include extracted text + metadata)
- **Raw content dict**: ContextItem.content stores the raw dict returned by the type-specific fetch; the caller must extract the text (e.g., dict["content"])
- **Circular import risk**: context_builder imports from domain.notebook; avoid domain importing utils
- **Max tokens hard limit**: ContextBuilder stops adding items once max_tokens is exceeded; items are not truncated or prorated to fit
- **No caching**: Every build() call re-fetches from database (use cache layer if needed)
- **Whitespace normalization lossy**: clean_text() may change intended formatting (code blocks, poetry, etc.)

## How to Extend

1. **Add new context source type**: Create fetch method in ContextBuilder; update ContextConfig.sources dict
2. **Add text preprocessing**: Add new function to text_utils (e.g., remove_urls, extract_keywords)
3. **Change tokenization**: Replace tiktoken with alternative library in token_utils; update all calls
4. **Add context filtering**: Extend ContextConfig with filter_by_date, filter_by_topic fields
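5. **Implement caching**: Cache the awaited result of ContextBuilder.build() rather than decorating it with functools.lru_cache, which would cache the coroutine object instead of its result; also be aware that cached items are mutable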
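
For item 5, one possible approach for an awaited `build()` is a small result cache keyed by whatever identifies a build. `AsyncResultCache` and the key shape below are hypothetical and not part of the module; `fake_build` stands in for `builder.build`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Hashable


class AsyncResultCache:
    """Cache awaited results by key (hypothetical helper, not part of the module)."""

    def __init__(self) -> None:
        self._results: Dict[Hashable, Any] = {}

    async def get_or_build(self, key: Hashable, build: Callable[[], Awaitable[Any]]) -> Any:
        if key not in self._results:
            # Store the awaited result, not the coroutine object.
            self._results[key] = await build()
        return self._results[key]

    def invalidate(self, key: Hashable) -> None:
        self._results.pop(key, None)


async def demo() -> int:
    calls = 0

    async def fake_build() -> list:
        # Stand-in for `await builder.build()`; counts how often it runs.
        nonlocal calls
        calls += 1
        return ["item"]

    cache = AsyncResultCache()
    key = ("notebook:1", 2000)  # e.g. notebook id plus max_tokens (assumed key shape)
    await cache.get_or_build(key, fake_build)
    await cache.get_or_build(key, fake_build)
    return calls
```

The cached list is shared between callers, so mutating it affects later reads; call `invalidate()` when the underlying sources or notes change.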

## Usage Example

```python
from open_notebook.utils.context_builder import ContextBuilder, ContextConfig

config = ContextConfig(
    sources={"source:123": "full", "source:456": "summary"},
    max_tokens=2000,
)
# Assumes `notebook` is an existing Notebook instance and that this runs inside an async function
builder = ContextBuilder(notebook, config)
context_items = await builder.build()

# context_items is List[ContextItem] sorted by priority
for item in context_items:
    print(f"{item.type}:{item.id} ({item.token_count} tokens)")
```