zed/crates/fuzzy at extension-cli - vrr/zed

vrr/zed

mirror of https://github.com/zed-industries/zed.git synced 2026-05-23 21:05:08 +00:00

History

David Alecrim 90fcf8539e fuzzy: Fix crash with Unicode chars whose lowercase expands to multiple codepoints (#52989 ) Self-Review Checklist: - [x] I've reviewed my own diff for quality, security, and reliability - [x] Unsafe blocks (if any) have justifying comments - [x] The content is consistent with the [UI/UX checklist](https://github.com/zed-industries/zed/blob/main/CONTRIBUTING.md#uiux-checklist) - [x] Tests cover the new/changed behavior - [x] Performance impact has been considered and is acceptable Closes #52973 ## Problem The file picker crashes with `highlight index N is not a valid UTF-8 boundary` when file paths contain Unicode characters whose lowercase form expands to multiple codepoints. Turkish `İ` (U+0130) is the trigger here: Rust's `char::to_lowercase()` turns it into `i` + combining dot above (two codepoints). That expansion breaks the fuzzy matcher in two ways: 1. The `j_regular` index mapping mixes the expanded lowercase index space with the original character index space, so highlight positions land on invalid byte boundaries. 2. The scoring matrices are allocated with the expanded length but indexed with the original length as stride, so rows alias each other and corrupt stored values. Users with Turkish locale filenames were hitting this on v0.229.0 and v0.230.0 stable. ## Fix I went with simple 1:1 case mapping: a `simple_lowercase` helper in `char_bag.rs` that takes only the first codepoint from `to_lowercase()` and drops any trailing combining characters. For `İ` this gives `i`, which is what anyone would actually type in a search query. The same function is used in the matcher, the char bag pre-filter, and both query-lowercasing call sites (`paths.rs` and `strings.rs`). This gets rid of the `extra_lowercase_chars` BTreeMap, the `j_regular` adjustment, and the matrix sizing discrepancy. The matcher now works with a flat character array where `lowercase_candidate_chars.len() == candidate_chars.len()`, so there's no expanded-vs-original index space to get wrong. I also fixed `CharBag::insert`, which used `to_ascii_lowercase()` and silently ignored non-ASCII characters. A file like `aİbİcdef.txt` wouldn't show up when searching `ai` because `İ` was never registered as `i` in the bag. It now goes through `simple_lowercase` too. The alternative was keeping full case folding and fixing the index tracking with a `Vec<usize>` mapping expanded positions back to originals. That would work but keeps the dual-index-space complexity that caused these bugs, plus adds a per-candidate allocation for the mapping vector. ## Prior art fzf uses Go's `unicode.To(unicode.LowerCase, r)`, which is simple case mapping -- always one rune in, one rune out. `İ` maps to `i`, no expansion. VS Code's `String.toLowerCase()` does produce the expanded form, but the scorer compares UTF-16 code units independently and sidesteps the problem in practice. Neither tool maintains a mapping between expanded and original index spaces. ## Trade-off Searching for the combining dot above (U+0307) won't match `İ` in a path anymore. Nobody types combining characters in a file picker, and fzf doesn't support it either. ## Screenshot <img width="1282" height="458" alt="Screenshot 2026-04-02 at 09 56 34" src="https://github.com/user-attachments/assets/720d327a-4855-4d4d-989e-cbd1c0657f97" /> Release Notes: - Fixed a crash and improved matching and highlighting in the file picker for paths with non-ASCII characters (e.g., Turkish İ, ß, ﬁ). --------- Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>		2026-04-03 11:00:13 +00:00
..
src	fuzzy: Fix crash with Unicode chars whose lowercase expands to multiple codepoints (#52989 )	2026-04-03 11:00:13 +00:00
Cargo.toml	Remove workspace-hack (#40216 )	2025-10-17 18:58:14 +00:00
LICENSE-GPL