open-notebook

mirror of https://github.com/lfnovo/open-notebook.git synced 2026-04-28 11:30:00 +00:00

unendless314 6aabacfca6 feat: use token-based sizing for embedding chunking (#749)	2026-04-19 13:49:09 -03:00

* feat: make chunk sizing token-based with 512-token default
* fix: defer embedding debug token metrics
* chore: lower default chunk size to 400 tokens and document rationale

  The previous 512-token default matched exactly the context window of BERT-family embedders like mxbai-embed-large, leaving no margin for:
  - tokenizer mismatch between our o200k_base measurement and the embedder's own WordPiece tokenizer
  - occasional splitter overshoot (RecursiveCharacterTextSplitter can emit chunks slightly above chunk_size when separators are sparse)
  - special tokens ([CLS], [SEP]) that consume context-window budget

  400 tokens keeps ~20% headroom below 512 while still being a large improvement over the old character-based default for most content. Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE via env var.

  Also adds a CHANGELOG entry for the full PR behavior change.
* chore: move chunking changelog entry under 1.8.5

  Target release is 1.8.5, so the Changed section moves out of Unreleased.

Co-authored-by: Luis Novo <lfnovo@gmail.com>
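
The headroom arithmetic in the commit message can be checked with a minimal sketch. The helper names `resolve_chunk_size` and `headroom` are hypothetical and not taken from the repository. The fallback to the default on an invalid value is also an assumption. Only the `OPEN_NOTEBOOK_CHUNK_SIZE` variable, the 400-token default, and the 512-token context window come from the commit message.

```python
import os

# Values quoted in the commit message.
EMBEDDER_CONTEXT_TOKENS = 512  # context window of BERT-family embedders
DEFAULT_CHUNK_SIZE = 400       # new default chunk size, in tokens


def resolve_chunk_size(env=None):
    """Hypothetical helper: read OPEN_NOTEBOOK_CHUNK_SIZE, else use the default.

    Falling back to the default on a missing, non-numeric, or non-positive
    value is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("OPEN_NOTEBOOK_CHUNK_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CHUNK_SIZE
    return value if value > 0 else DEFAULT_CHUNK_SIZE


def headroom(chunk_size, context_tokens=EMBEDDER_CONTEXT_TOKENS):
    """Fraction of the embedder's context window left unused by one chunk."""
    return 1 - chunk_size / context_tokens


if __name__ == "__main__":
    size = resolve_chunk_size()
    print(f"chunk size: {size} tokens, headroom: {headroom(size):.1%}")
```

With the 400-token default, the unused share of a 512-token window is 1 - 400/512, about 21.9%, which is the "~20% headroom" the commit message refers to.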
ai	fix: map base_url to endpoint for Azure credentials (#741)	2026-04-09 13:22:00 -03:00
database	fix: prevent SurrealDB injection via order_by and unparameterized queries	2026-04-07 07:58:54 -03:00
domain	Merge pull request #753 from lfnovo/fix/graceful-credential-decryption-errors	2026-04-14 14:37:19 -03:00
graphs	fix: persist source asset, preserve custom titles, cascade-delete credential models	2026-04-06 07:38:37 -03:00
podcasts	feat(podcasts): model registry integration, credential passthrough & new features (#632)	2026-02-27 11:06:47 -03:00
utils	feat: use token-based sizing for embedding chunking (#749)	2026-04-19 13:49:09 -03:00
__init__.py	refactor: move environment variables loading to application entry point (#283)	2025-12-01 14:59:50 -03:00
CLAUDE.md	chore: bump version to 1.8.1	2026-03-10 20:20:16 -05:00
config.py	fix: handle tiktoken network errors in offline environments (issue #264)	2026-03-10 19:45:14 -05:00
exceptions.py	refactor database module and migrations	2024-10-30 16:33:07 -03:00