mirror of
https://github.com/lfnovo/open-notebook.git
synced 2026-04-29 12:00:00 +00:00
* feat(i18n): complete 100% internationalization and fix Next.js 15 compatibility
* feat(i18n): complete 100% internationalization coverage
* chore(test): finalize component tests and project cleanup
* test(logic): add unit tests for useModalManager hook
* fix(test): resolve timeout in AppSidebar tests by mocking TooltipProvider
* feat(i18n): comprehensive i18n audit, fixes for hardcoded strings, and complete zh-TW support
* fix(i18n): resolve TypeScript warnings and improve translation hook stability
- Remove unused useTranslation import from ConnectionGuard
- Add ref-based checking state to prevent dependency cycles
- Fix useTranslation hook to return empty string for undefined translations
- Add comment for backward compatibility on ExtractedReference interface
- Ensure .replace() string methods work safely with nested translation keys
* feat(i18n): complete internationalization implementation with Docker deployment
- Add LanguageLoadingOverlay component for smooth language transitions
- Update all translation files (en-US, zh-CN, zh-TW) with improved terminology
- Optimize Docker configuration for better performance
- Update version check and config handling for i18n support
- Fix route handling for language-specific content
- Add comprehensive task documentation
* fix(i18n): resolve localization errors, duplicates, and type issues
* chore(i18n): finalize 100% internationalization coverage
* chore(test): supplement i18n test cases and clean up redundant files
* fix(test): resolve lint type errors and finalize delivery documents
* feat(i18n): finalize full internationalization and zh-TW localization
* fix(frontend): add missing devDependency and fix build tsconfig
* feat(ui): enhance sidebar hover effects with better visual feedback
* fix(frontend): resolve accessibility, i18n, and lint issues
- fix: add missing id, name, autocomplete attributes to dialog inputs
- fix: add aria labels and DialogDescription for accessibility
- fix: resolve uncontrolled component warning in SettingsForm
- fix: correct duplicate 'Traditional Chinese' label in zh-TW locale
- feat: add i18n support for podcast template names
- chore: fix lint errors in Dialogs
* fix: address all 21 PR feedback items from cubic-dev-ai bot
Configuration:
- Remove ignoreDuringBuilds flags from next.config.ts
Testing:
- Fix AppSidebar.test.tsx regex pattern and add missing assertion
Logic:
- Fix ConnectionGuard.tsx re-entry prevention logic
Internationalization (I18n) - Translations:
- Add missing keys: notebooks.archived, common.note/insight, accessibility keys
- Add specific keys: sources.allSourcesDescShort, transformations.selectModel
- Add singular/plural keys: podcasts.usedByCount_one/other, common.note/notes
- Add common.created/updated with {time} placeholder
Internationalization (I18n) - Usage:
- SourcesPage: use allSourcesDescShort instead of string splitting
- TransformationPlayground: use navigation.transformation and selectModel
- CommandPalette: use dedicated keys instead of string concatenation
- GeneratePodcastDialog: fix zh-TW date locale handling
- NotebookHeader: correctly interpolate {time} placeholder
- TransformationCard: use common.description instead of undefined key
- ChatPanel/SpeakerProfilesPanel: implement proper pluralization
- SystemInfo: correctly interpolate {version} placeholder
- LanguageLoadingOverlay: use t.common.loading instead of hardcoded string
- MessageActions: use specific error key cannotSaveNoteNoNotebook
Other:
- Fix SessionManager.tsx exhaustive-deps warning
* fix: remove duplicate locale keys and add missing zh-CN translations
- en-US: remove duplicate loading key (line 59) and addNew key (sources)
- zh-CN: remove duplicate common keys (loading, note, insight, newSource, newNotebook, newPodcast)
- zh-CN: remove duplicate accessibility.searchNotebooks key
- zh-CN: remove duplicate sources.addNew key
- zh-CN: remove duplicate navigation.transformation key
- zh-CN: add missing usedByCount_one and usedByCount_other keys in podcasts
- zh-TW: remove duplicate common keys (loading, note, insight, newSource, newNotebook, newPodcast)
- zh-TW: remove duplicate accessibility.searchNotebooks key
- zh-TW: remove duplicate sources.addNew key
* docs: remove info.md
* fix: remove duplicate notebook keys and unused ts-expect-error
- zh-CN: remove duplicate notebooks keys (archived, archive, unarchive, deleteNotebook, deleteNotebookDesc)
- zh-TW: remove duplicate notebooks keys (archived, archive, unarchive, deleteNotebook, deleteNotebookDesc)
- GeneratePodcastDialog: remove unused @ts-expect-error directive
* fix(a11y): fix unassociated labels in search page
- Replace <Label> with role='group' + aria-labelledby for search type section
- Replace <Label> with role='group' + aria-labelledby for search in section
- Follows WAI-ARIA best practices for labeling form field groups
* fix(a11y): fix unassociated labels across multiple components
- search/page.tsx: use role='group' + aria-labelledby for search type and search in sections
- RebuildEmbeddings.tsx: use role='group' + aria-labelledby for include checkboxes
- TransformationPlayground.tsx: replace Label with span for non-form output label
* chore: revert to npm stack and ensure i18n compatibility
* chore: polish zh-TW translations for better idiomatic usage
* fix: resolve linter errors (ruff import sort, mypy config duplicate)
* style: apply ruff formatting
* fix: finalize upstream compliance (Dockerfile.single, i18n hooks, docker-compose)
* style: polish strings, fix timeout cleanup, and improve test mocks
* fix: use relative imports in test setup to resolve IDE path errors
* perf(docker): optimize build speed by removing apt-get upgrade and build tools
- Remove apt-get upgrade from both builder and runtime stages (saves 10-15 min each)
- Remove gcc/g++/make/git from builder (uv downloads pre-built wheels)
- Add --no-install-recommends to minimize package footprint
- Keep npm mirror (npmmirror.com) for faster frontend deps
- Add npm registry config for reliable China network access
Also includes:
- fix(a11y): add missing labels and aria attributes to form fields
- fix(i18n): add 2s safety timeout to LanguageLoadingOverlay
- fix(i18n): add robustness checks to use-translation proxy
Build time reduced from 2+ hours to ~34 minutes (~70% improvement)
* fix(a11y): resolve 16 form field accessibility warnings in notebook and podcast pages
* fix(a11y): resolve 4 button and 1 select field accessibility warnings in models page
* fix(a11y): resolve redundant attributes and residual warnings in transformations and podcast forms
* fix(i18n): deep fix for language switch hang using proxy protection and safer access
* fix(a11y): add name attributes to ModelSelector, TransformationPlayground, and SourceDetailContent
* fix: add missing Label import to SourceDetailContent
* fix(i18n): use native react-i18next in LanguageLoadingOverlay to prevent hang during language switch
* fix(i18n): rewrite use-translation Proxy with strict depth limit and expanded blocked props to prevent language switch hang
* fix: add type assertion to fix TypeScript comparison error
* fix(i18n): disable useSuspense to prevent thread hang during language resource loading
* fix(i18n): add infinite loop detection circuit breaker to useTranslation hook
* fix(i18n): update traditional chinese label to native script in en-US
* feat: add new localization strings for notebook and note management
* fix: resolve config priority, docker build deps, and ui glitches
* refactor: improve ui details and test coverage based on feedback
* refactor: improve ui details (version check/lang toggle) and test coverage
* fix: polish language matching and test cleanup
* fix(test): update mocks to resolve timeouts and proxy errors
* fix(frontend): restore tsconfig.json structure and enable IDE support for tests
* fix: address PR review findings and resolve CI OIDC failure
* fix: merge exception headers in custom handler
* fix: comprehensive PR review remediations and async performance fixes
* refactor: address all PR #371 review feedback
- Docker: consolidate SURREAL_URL to docker.env, add single-container override
- Security: restore apt-get upgrade in Dockerfile and Dockerfile.single
- Create centralized getDateLocale helper (lib/utils/date-locale.ts)
- Refactor 7 files to use getDateLocale helper
- Revert config/route.ts to origin/main version
- Move test files to co-located pattern (3 files)
- Remove local useTranslation mock from ConfirmDialog.test.tsx
- Simplify use-version-check to single useEffect pattern
- Fix test import paths after moving to co-located pattern
* fix: add jest-dom types for test files
* fix: address remaining review issues
- Add apt-get upgrade -y to Dockerfile.single backend-builder stage
- Refactor ChatColumn.test.tsx: use 'as unknown as ReturnType<typeof hook>' instead of 'as any'
- Use toBeInTheDocument() assertions instead of toBeDefined()
156 lines
5.1 KiB
Python
"""
|
|
Text utilities for Open Notebook.
|
|
Extracted from main utils to avoid circular imports.
|
|
"""
|
|
|
|
import re
|
|
import unicodedata
|
|
from typing import Tuple
|
|
|
|
from langchain_text_splitters import RecursiveCharacterTextSplitter
|
|
|
|
from .token_utils import token_count
|
|
|
|
# Patterns for matching thinking content in AI responses
|
|
# Standard pattern: <think>...</think>
|
|
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)
|
|
# Pattern for malformed output: content</think> (missing opening tag)
|
|
THINK_PATTERN_NO_OPEN = re.compile(r"^(.*?)</think>", re.DOTALL)
|
|
|
|
|
|
def split_text(txt: str, chunk_size: int = 500):
    """
    Split the input text into chunks.

    Args:
        txt (str): The input text to be split.
        chunk_size (int): The size of each chunk. Default is 500.

    Returns:
        list: A list of text chunks.
    """
    overlap = int(chunk_size * 0.15)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=token_count,
        separators=[
            "\n\n",
            "\n",
            ".",
            ",",
            " ",
            "\u200b",  # Zero-width space
            "\uff0c",  # Fullwidth comma
            "\u3001",  # Ideographic comma
            "\uff0e",  # Fullwidth full stop
            "\u3002",  # Ideographic full stop
            "",
        ],
    )
    return text_splitter.split_text(txt)

def remove_non_ascii(text: str) -> str:
    """Remove non-ASCII characters from text."""
    return re.sub(r"[^\x00-\x7F]+", "", text)

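Note that this filter deletes non-ASCII characters outright rather than transliterating them, so accented letters are lost. A standalone illustration (the `strip_non_ascii` name here is just for the demo, not part of the module):

```python
import re


def strip_non_ascii(text: str) -> str:
    # Same regex as remove_non_ascii above: drop every non-ASCII run
    return re.sub(r"[^\x00-\x7F]+", "", text)


print(strip_non_ascii("café déjà vu"))  # -> "caf dj vu"
```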
def remove_non_printable(text: str) -> str:
    """Remove non-printable characters from text."""
    # Replace any special Unicode whitespace characters with a regular space
    text = re.sub(r"[\u2000-\u200B\u202F\u205F\u3000]", " ", text)

    # Replace unusual line terminators with a single newline
    text = re.sub(r"[\u2028\u2029\r]", "\n", text)

    # Remove control characters, except newlines and tabs
    text = "".join(
        char for char in text if unicodedata.category(char)[0] != "C" or char in "\n\t"
    )

    # Replace non-breaking spaces with regular spaces
    text = text.replace("\xa0", " ").strip()

    # Keep letters (including accented ones), numbers, spaces, newlines, tabs, and basic punctuation
    return re.sub(r"[^\w\s.,!?\-\n\t]", "", text, flags=re.UNICODE)

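The control-character step above keys off Unicode general categories: anything in category "C" (control, format, surrogate, private use) is dropped unless it is a newline or tab. A minimal standalone check of just that step (`keep_printable` is an illustrative name, not a module function):

```python
import unicodedata


def keep_printable(text: str) -> str:
    # Drop category "C*" characters (e.g. BEL \x07 is Cc, zero-width
    # space \u200b is Cf), but keep newline and tab
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )


print(keep_printable("ok\x07\u200b\tdone"))  # -> "ok\tdone"
```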
def parse_thinking_content(content: str) -> Tuple[str, str]:
    """
    Parse message content to extract thinking content from <think> tags.

    Handles both well-formed tags and malformed output where the opening
    <think> tag is missing but </think> is present.

    Args:
        content (str): The original message content

    Returns:
        Tuple[str, str]: (thinking_content, cleaned_content)
            - thinking_content: Content from within <think> tags
            - cleaned_content: Original content with <think> blocks removed

    Example:
        >>> content = "<think>Let me analyze this</think>Here's my answer"
        >>> thinking, cleaned = parse_thinking_content(content)
        >>> print(thinking)
        "Let me analyze this"
        >>> print(cleaned)
        "Here's my answer"
    """
    # Input validation
    if not isinstance(content, str):
        return "", str(content) if content is not None else ""

    # Limit processing for very large content (100KB limit)
    if len(content) > 100000:
        return "", content

    # Find all well-formed thinking blocks
    thinking_matches = THINK_PATTERN.findall(content)

    if thinking_matches:
        # Join all thinking content with double newlines
        thinking_content = "\n\n".join(match.strip() for match in thinking_matches)

        # Remove all <think>...</think> blocks from the original content
        cleaned_content = THINK_PATTERN.sub("", content)

        # Clean up extra whitespace
        cleaned_content = re.sub(r"\n\s*\n\s*\n", "\n\n", cleaned_content).strip()

        return thinking_content, cleaned_content

    # Handle malformed output: content</think> (missing opening tag)
    # Some models like Nemotron output thinking without the opening <think> tag
    malformed_match = THINK_PATTERN_NO_OPEN.match(content)
    if malformed_match:
        thinking_content = malformed_match.group(1).strip()
        # Remove the thinking content and </think> tag
        cleaned_content = content[malformed_match.end():].strip()
        return thinking_content, cleaned_content

    return "", content

def clean_thinking_content(content: str) -> str:
    """
    Remove thinking content from AI responses, returning only the cleaned content.

    This is a convenience function for cases where you only need the cleaned
    content and don't need access to the thinking process.

    Args:
        content (str): The original message content with potential <think> tags

    Returns:
        str: Content with <think> blocks removed and whitespace cleaned

    Example:
        >>> content = "<think>Let me think...</think>Here's the answer"
        >>> clean_thinking_content(content)
        "Here's the answer"
    """
    _, cleaned_content = parse_thinking_content(content)
    return cleaned_content
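The two regex paths can be exercised with a self-contained sketch that re-declares the patterns so it runs without the module; `demo_parse` is an illustrative reproduction of the logic above, not the module's API:

```python
import re

# Same patterns as THINK_PATTERN / THINK_PATTERN_NO_OPEN above
THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
THINK_NO_OPEN = re.compile(r"^(.*?)</think>", re.DOTALL)


def demo_parse(content: str):
    # Well-formed path: one or more <think>...</think> blocks
    matches = THINK.findall(content)
    if matches:
        thinking = "\n\n".join(m.strip() for m in matches)
        cleaned = re.sub(r"\n\s*\n\s*\n", "\n\n", THINK.sub("", content)).strip()
        return thinking, cleaned
    # Malformed path: missing opening tag, e.g. "reasoning</think>answer"
    m = THINK_NO_OPEN.match(content)
    if m:
        return m.group(1).strip(), content[m.end():].strip()
    return "", content


print(demo_parse("<think>reasoning</think>Answer"))  # -> ("reasoning", "Answer")
print(demo_parse("stray reasoning</think>Answer"))   # -> ("stray reasoning", "Answer")
```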