open-notebook/open_notebook
unendless314 6aabacfca6
feat: use token-based sizing for embedding chunking (#749)
* feat: make chunk sizing token-based with 512-token default

* fix: defer embedding debug token metrics

* chore: lower default chunk size to 400 tokens and document rationale

The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 (112 tokens, about 22%) while
still being a large improvement over the old character-based default for
most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
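
A minimal sketch of the sizing logic described above, not the project's actual implementation. The helper names (`get_chunk_size`, `headroom`, `chunk_by_tokens`) are hypothetical, and the whitespace word count is only a stand-in for the o200k_base token measurement. The real code path uses RecursiveCharacterTextSplitter according to this commit message. Only the `OPEN_NOTEBOOK_CHUNK_SIZE` variable name, the 400-token default, and the 512-token window come from the text.

```python
import os
from typing import Callable, List, Mapping

DEFAULT_CHUNK_SIZE = 400  # tokens; the default this commit settles on
EMBEDDER_CONTEXT = 512    # tokens; BERT-family window cited in the message


def get_chunk_size(env: Mapping[str, str] = os.environ) -> int:
    """Read OPEN_NOTEBOOK_CHUNK_SIZE, falling back to the default (hypothetical helper)."""
    raw = env.get("OPEN_NOTEBOOK_CHUNK_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CHUNK_SIZE
    return value if value > 0 else DEFAULT_CHUNK_SIZE


def headroom(chunk_size: int, context: int = EMBEDDER_CONTEXT) -> float:
    """Fraction of the embedder's context window left unused by one chunk."""
    return 1 - chunk_size / context


def chunk_by_tokens(
    text: str,
    chunk_size: int,
    # Assumption: whitespace words stand in for a real tokenizer's count.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack words into chunks whose token count stays within chunk_size."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > chunk_size:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the default, `headroom(400)` evaluates to 0.21875, which is the margin meant to absorb tokenizer mismatch, splitter overshoot, and special tokens. A user with a larger-context embedder would set `OPEN_NOTEBOOK_CHUNK_SIZE` to a higher value, and `get_chunk_size` would return that instead.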

* chore: move chunking changelog entry under 1.8.5

Target release is 1.8.5 — moving the Changed section out of Unreleased.

---------

Co-authored-by: Luis Novo <lfnovo@gmail.com>
2026-04-19 13:49:09 -03:00
ai fix: map base_url to endpoint for Azure credentials (#741) 2026-04-09 13:22:00 -03:00
database fix: prevent SurrealDB injection via order_by and unparameterized queries 2026-04-07 07:58:54 -03:00
domain Merge pull request #753 from lfnovo/fix/graceful-credential-decryption-errors 2026-04-14 14:37:19 -03:00
graphs fix: persist source asset, preserve custom titles, cascade-delete credential models 2026-04-06 07:38:37 -03:00
podcasts feat(podcasts): model registry integration, credential passthrough & new features (#632) 2026-02-27 11:06:47 -03:00
utils feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
__init__.py refactor: move environment variables loading to application entry point (#283) 2025-12-01 14:59:50 -03:00
CLAUDE.md chore: bump version to 1.8.1 2026-03-10 20:20:16 -05:00
config.py fix: handle tiktoken network errors in offline environments (issue #264) 2026-03-10 19:45:14 -05:00
exceptions.py refactor database module and migrations 2024-10-30 16:33:07 -03:00