mirror of
https://github.com/lfnovo/open-notebook.git
synced 2026-04-28 03:19:59 +00:00
* feat: make chunk sizing token-based with 512-token default
* fix: defer embedding debug token metrics
* chore: lower default chunk size to 400 tokens and document rationale

  The previous 512-token default matched exactly the context window of BERT-family embedders like mxbai-embed-large, leaving no margin for:

  - tokenizer mismatch between our o200k_base measurement and the embedder's own WordPiece tokenizer
  - occasional splitter overshoot (RecursiveCharacterTextSplitter can emit chunks slightly above chunk_size when separators are sparse)
  - special tokens ([CLS], [SEP]) that consume context-window budget

  400 tokens keeps ~20% headroom below 512 while still being a large improvement over the old character-based default for most content. Users with larger-context embedders can raise OPEN_NOTEBOOK_CHUNK_SIZE via env var.

  Also adds a CHANGELOG entry for the full PR behavior change.
* chore: move chunking changelog entry under 1.8.5

  Target release is 1.8.5; moving the Changed section out of Unreleased.

---------

Co-authored-by: Luis Novo <lfnovo@gmail.com>

| | | |
|---|---|---|
| .. | ||
| conftest.py | ||
| README.md | ||
| test_chunking.py | ||
| test_credentials_api.py | ||
| test_domain.py | ||
| test_embedding.py | ||
| test_graphs.py | ||
| test_models_api.py | ||
| test_notes_api.py | ||
| test_podcast_path.py | ||
| test_sources_api.py | ||
| test_url_validation.py | ||
| test_utils.py | ||
Coming Soon
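
The headroom arithmetic in the commit message above can be checked with a minimal sketch. The constants 400 and 512 and the `OPEN_NOTEBOOK_CHUNK_SIZE` variable name come from the commit message; the helper names `resolve_chunk_size` and `headroom` are hypothetical and are not taken from the repository, whose actual parsing and fallback behavior may differ.

```python
import os

# Values taken from the commit message.
DEFAULT_CHUNK_SIZE = 400     # default chunk size, in tokens
BERT_CONTEXT_WINDOW = 512    # context window of BERT-family embedders


def resolve_chunk_size(env=None):
    """Hypothetical helper: read OPEN_NOTEBOOK_CHUNK_SIZE, else use the default.

    Falling back to the default on invalid or non-positive values is an
    assumption made for this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("OPEN_NOTEBOOK_CHUNK_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_CHUNK_SIZE
    return value if value > 0 else DEFAULT_CHUNK_SIZE


def headroom(chunk_size, context_window=BERT_CONTEXT_WINDOW):
    """Fraction of the context window left unused by a chunk of chunk_size tokens."""
    return (context_window - chunk_size) / context_window
```

With these definitions, `headroom(400)` evaluates to `0.21875`, which is the "~20%" margin quoted in the commit message, while `headroom(512)` is `0.0`, the no-margin case the previous default produced.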