Utils Module

Utility functions and helpers for context building, text processing, tokenization, and versioning.

Purpose

Provides cross-cutting helpers: building LLM context from sources/insights, text utilities (truncation, cleaning), token counting, and version management.

Architecture Overview

Four core utilities:

  1. context_builder.py: Flexible context assembly from sources, notes, insights with token budgeting
  2. text_utils.py: Text truncation, whitespace cleaning, formatting helpers
  3. token_utils.py: Token counting for LLM context windows (wrapper around encoding library)
  4. version_utils.py: Version parsing, comparison, and schema compatibility checks

Each utility is stateless and can be imported independently.

Component Catalog

context_builder.py

  • ContextItem: Dataclass for individual context piece (id, type, content, priority, token_count)
  • ContextConfig: Configuration for context building (sources/notes/insights selection, max tokens, priority weights)
  • ContextBuilder: Main class assembling context
    • add_source(): Include source by ID with inclusion level
    • add_note(): Include note by ID
    • add_insight(): Include insight by ID
    • build(): Assemble context respecting token budget and priorities
    • Uses vector_search to fetch source/insight content from SurrealDB
    • Returns list of ContextItem objects sorted by priority

Key behavior:

  • Token counting is automatic (calculated in ContextItem.__post_init__)
  • Max token enforcement via priority weighting (higher priority items included first)
  • Type-specific fetching: sources → Source.full_text, notes → Note.content, insights → SourceInsight.content
  • Raises DatabaseOperationError if source/note fetch fails
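
A minimal sketch of the automatic token counting, assuming the field names from the catalog above and that token_count() is importable from token_utils (illustrative, not verbatim from the module):

from dataclasses import dataclass
from typing import Any

from open_notebook.utils.token_utils import token_count as count_tokens

@dataclass
class ContextItem:
    id: str
    type: str             # "source" | "note" | "insight"
    content: Any          # raw fetched record; see quirks below
    priority: float = 1.0
    token_count: int = 0

    def __post_init__(self) -> None:
        # Computed automatically when the item is created
        if not self.token_count:
            self.token_count = count_tokens(str(self.content))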

text_utils.py

  • truncate_text(text, max_chars, suffix="..."): Truncates string, adds ellipsis
  • clean_text(text): Removes extra whitespace, normalizes newlines
  • extract_sentences(text, max_count): Splits text into sentences up to limit
  • normalize_whitespace(text): Collapses multiple spaces/newlines into one
  • format_for_llm(text): Combines cleaning + normalization for LLM consumption

Key behavior: All functions are pure (no side effects); safe for high-volume processing
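
Because the functions are pure, they compose cleanly. A hedged example (signatures per the list above):

from open_notebook.utils.text_utils import clean_text, truncate_text

raw = "Line one.\n\n\n   Line   two,  with   stray   spaces."
snippet = truncate_text(clean_text(raw), 40)
print(snippet)  # normalized text cut at 40 chars, ellipsis appended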

token_utils.py

  • token_count(text): Returns estimated token count for string (via encoding library)
  • remaining_tokens(max_tokens, used): Returns remaining tokens in budget
  • fits_in_context(text, max_tokens): Boolean check if text fits token budget

Key behavior: Uses fixed encoding (cl100k_base for GPT models); may differ slightly from actual model tokenization
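
A plausible shape for the wrapper, assuming the tiktoken dependency listed under Key Dependencies (the real module may cache or lazy-load the encoding differently):

import tiktoken

_ENCODING = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4-era encoding

def token_count(text: str) -> int:
    # Estimate only; other models' tokenizers can differ by 5-10%
    return len(_ENCODING.encode(text))

def fits_in_context(text: str, max_tokens: int) -> bool:
    return token_count(text) <= max_tokens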

version_utils.py

  • parse_version(version_string): Parses "1.2.3" format; returns Version namedtuple
  • compare_versions(v1, v2): Returns -1 (v1 < v2), 0 (equal), 1 (v1 > v2)
  • is_compatible(current, required): Checks if current version meets requirement (e.g., current >= required)
  • schema_version_check(): Validates database schema version on startup

Key behavior: Assumes semantic versioning (MAJOR.MINOR.PATCH); non-standard formats raise ValueError
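
An illustrative sketch of the version helpers (the Version namedtuple shape is assumed from the description above):

from collections import namedtuple

Version = namedtuple("Version", ["major", "minor", "patch"])

def parse_version(version_string: str) -> Version:
    parts = version_string.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        # Non-standard formats are rejected, per the key behavior note
        raise ValueError(f"invalid semantic version: {version_string!r}")
    return Version(*(int(p) for p in parts))

def compare_versions(v1: str, v2: str) -> int:
    a, b = parse_version(v1), parse_version(v2)
    return (a > b) - (a < b)  # -1, 0, or 1 via tuple comparison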

Common Patterns

  • Dataclass-driven config: ContextConfig used by ContextBuilder (immutable after init)
  • Token budgeting: ContextBuilder respects the max_tokens constraint, admitting high-priority items first (see the sketch after this list)
  • Error handling resilience: token_count() returns an estimate rather than failing; context_builder wraps fetch failures in DatabaseOperationError instead of leaking raw database errors
  • Pure text functions: text_utils functions are stateless utilities (no class needed)
  • Lazy evaluation: ContextBuilder doesn't fetch items until build() called
  • Type hints throughout: All functions use Optional, List, Dict for clarity
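
A hypothetical illustration of the budgeting pattern and its hard stop (names are ours; the real build() logic may differ):

def apply_budget(items, max_tokens):
    # items are assumed pre-sorted by priority, highest first
    selected, used = [], 0
    for item in items:
        if used + item.token_count > max_tokens:
            break  # hard limit: the overflowing item is dropped, not trimmed
        selected.append(item)
        used += item.token_count
    return selected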

Key Dependencies

  • open_notebook.domain.notebook: Source, Note, SourceInsight models; vector_search function
  • open_notebook.exceptions: DatabaseOperationError, NotFoundError
  • tiktoken (via token_utils.py): Token encoding for GPT models
  • loguru: Logging in context_builder (debug-level)

Important Quirks & Gotchas

  • Token count estimation: Uses cl100k_base encoding; may differ 5-10% from actual model tokens
  • Priority weights default: If not specified, ContextConfig uses default weights (source=1, note=0.8, insight=1.2)
  • Vector search required: ContextBuilder assumes vector_search is available on Notebook model; fails if not
  • Source.full_text vs content: Uses full_text field (may include extracted text + metadata)
  • Type-specific fetch logic: ContextItem.content stores the raw dict; callers must unwrap it themselves (see the snippet after this list)
  • Circular import risk: context_builder imports from domain.notebook; avoid domain importing utils
  • Max tokens hard limit: ContextBuilder stops adding items once max_tokens exceeded (not prorated)
  • No caching: Every build() call re-fetches from the database (add a cache layer if needed)
  • Whitespace normalization lossy: clean_text() may change intended formatting (code blocks, poetry, etc.)
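
A hypothetical one-liner for the content-unwrapping quirk noted above:

text = item.content["content"] if isinstance(item.content, dict) else item.content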

How to Extend

  1. Add new context source type: Create fetch method in ContextBuilder; update ContextConfig.sources dict
  2. Add text preprocessing: Add new function to text_utils (e.g., remove_urls, extract_keywords)
  3. Change tokenization: Replace tiktoken with alternative library in token_utils; update all calls
  4. Add context filtering: Extend ContextConfig with filter_by_date, filter_by_topic fields
  5. Implement caching: Memoize ContextBuilder.build() results; note that functools.lru_cache requires hashable arguments and does not await coroutines, so an async-aware cache is safer (see the sketch below)
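
A minimal async-safe memoization sketch (the cache_key construction is left to the caller and is purely illustrative):

_build_cache: dict = {}

async def cached_build(builder, cache_key):
    # Re-fetching is the default (see quirks); cache explicitly when needed
    if cache_key not in _build_cache:
        _build_cache[cache_key] = await builder.build()
    return _build_cache[cache_key]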

Usage Example

from open_notebook.utils.context_builder import ContextBuilder, ContextConfig

config = ContextConfig(
    sources={"source:123": "full", "source:456": "summary"},
    max_tokens=2000,
)
builder = ContextBuilder(notebook, config)  # notebook: a loaded Notebook instance
context_items = await builder.build()  # run inside an async function

# context_items is List[ContextItem] sorted by priority
for item in context_items:
    print(f"{item.type}:{item.id} ({item.token_count} tokens)")