open-notebook/tests
unendless314 6aabacfca6
feat: use token-based sizing for embedding chunking (#749)
* feat: make chunk sizing token-based with 512-token default

* fix: defer embedding debug token metrics

* chore: lower default chunk size to 400 tokens and document rationale

The previous 512-token default matched exactly the context window of
BERT-family embedders like mxbai-embed-large, leaving no margin for:
- tokenizer mismatch between our o200k_base measurement and the
  embedder's own WordPiece tokenizer
- occasional splitter overshoot (RecursiveCharacterTextSplitter can
  emit chunks slightly above chunk_size when separators are sparse)
- special tokens ([CLS], [SEP]) that consume context-window budget

400 tokens keeps ~20% headroom below 512 while still being a large
improvement over the old character-based default for most content.
Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE
via env var. Also adds a CHANGELOG entry for the full PR behavior
change.
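
As a rough illustration of the sizing rationale above, the sketch below resolves a token-based chunk size from `OPEN_NOTEBOOK_CHUNK_SIZE` and computes the headroom against a 512-token context window. The helper names `resolve_chunk_size` and `headroom` are hypothetical and not taken from the project; the actual implementation may parse and validate the variable differently.

```python
import os

DEFAULT_CHUNK_SIZE = 400   # default chunk size in tokens, per the commit message
BERT_CONTEXT_WINDOW = 512  # context window cited for BERT-family embedders


def resolve_chunk_size(env=None):
    """Hypothetical helper: read the chunk size (in tokens) from the environment.

    Falls back to the default when the variable is unset, non-numeric,
    or not a positive integer.
    """
    env = os.environ if env is None else env
    raw = env.get("OPEN_NOTEBOOK_CHUNK_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CHUNK_SIZE
    return value if value > 0 else DEFAULT_CHUNK_SIZE


def headroom(chunk_size, context_window=BERT_CONTEXT_WINDOW):
    """Fraction of the context window left unused by a full-size chunk."""
    return (context_window - chunk_size) / context_window


if __name__ == "__main__":
    size = resolve_chunk_size()
    print(f"chunk size: {size} tokens, headroom: {headroom(size):.1%}")
```

With the default of 400 tokens, `headroom(400)` evaluates to 0.21875, which is the "~20%" margin mentioned above; at the previous default of 512 the margin is zero.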

* chore: move chunking changelog entry under 1.8.5

Target release is 1.8.5 — moving the Changed section out of Unreleased.

---------

Co-authored-by: Luis Novo <lfnovo@gmail.com>
2026-04-19 13:49:09 -03:00
conftest.py feat: credential-based API key management (#477) (#540) 2026-02-10 08:30:22 -03:00
README.md Initial commit with all features 2024-10-21 14:56:10 -03:00
test_chunking.py feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
test_credentials_api.py refactor: move tests from test_bug_fixes.py to proper test modules 2026-04-06 07:45:49 -03:00
test_domain.py fix: handle empty/whitespace source content without retry loop (#576) 2026-02-14 18:09:07 -03:00
test_embedding.py feat: use token-based sizing for embedding chunking (#749) 2026-04-19 13:49:09 -03:00
test_graphs.py refactor: move tests from test_bug_fixes.py to proper test modules 2026-04-06 07:45:49 -03:00
test_models_api.py Feat/localization tests docker (#371) 2026-01-15 13:51:05 -03:00
test_notes_api.py feat: expose embed command_id in note API responses (#545) 2026-02-14 18:11:23 -03:00
test_podcast_path.py fix: extract build_episode_output_dir helper and test production code 2026-03-11 17:05:42 -05:00
test_sources_api.py fix: prevent RCE via SSTI, path traversal file write, and LFI file read 2026-04-09 11:58:16 -03:00
test_url_validation.py feat: credential-based API key management (#477) (#540) 2026-02-10 08:30:22 -03:00
test_utils.py fix: handle tiktoken network errors in offline environments (issue #264) 2026-03-10 19:45:14 -05:00