zed/crates/worktree
Lee ByeongJun f7ec53137d
worktree: Fix binary files misdetected as UTF-16 (#50890)
Closes #50785

Opening a .wav file (PCM 16-bit) caused Zed to freeze because the binary
detection heuristic in `analyze_byte_content` misidentified it as
UTF-16LE text. The heuristic determines UTF-16 encoding solely by
checking whether null bytes are skewed toward even or odd positions. PCM
16-bit audio with small sample values produces bytes like `[sample,
0x00]`, creating an alternating null pattern at odd positions that is
indistinguishable from BOM-less UTF-16LE by position alone.
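To make the ambiguity concrete, here is a minimal sketch of a position-only check. It is not Zed's actual `analyze_byte_content` logic: the helper names and the 4:1 skew ratio are assumptions chosen for illustration.

```rust
/// Hypothetical helper (not Zed's code): returns true when null bytes sit
/// mostly at odd offsets, which is how BOM-less UTF-16LE ASCII text looks.
fn null_skew_suggests_utf16le(bytes: &[u8]) -> bool {
    let mut even_nulls = 0usize;
    let mut odd_nulls = 0usize;
    for (i, &b) in bytes.iter().enumerate() {
        if b == 0 {
            if i % 2 == 0 {
                even_nulls += 1;
            } else {
                odd_nulls += 1;
            }
        }
    }
    // Assumed rule: at least one null, and odd-position nulls dominate 4:1.
    odd_nulls > 0 && odd_nulls >= even_nulls * 4
}

/// Encode 16-bit PCM samples as little-endian bytes, the layout used in a
/// WAV data chunk.
fn pcm16_le(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}

/// Encode a string as BOM-less UTF-16LE bytes.
fn utf16_le(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(|u| u.to_le_bytes()).collect()
}
```

Under these assumptions, quiet audio such as `pcm16_le(&[1, 2, 3, 5, 8])` and real text such as `utf16_le("hello")` both pass the check, so position alone cannot separate them.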

### Why not just add more binary headers?

The initial approach
(32d8bd7009)
was to add audio format signatures (RIFF, OGG, FLAC, MP3) to the known
binary headers. While this solved the reported `.wav` case, any binary
format containing small 16-bit values (audio, images, or arbitrary data)
would still be misclassified. Adding headers is an endless game that
cannot cover unknown or uncommon formats.

### Changes

* Adds `is_plausible_utf16_text` as a secondary validation: when the
null byte skew suggests UTF-16, decode the bytes and count code units
that fall in C0/C1 control character ranges (U+0000–U+001F,
U+007F–U+009F, excluding common whitespace) or form unpaired surrogates.
Real UTF-16 text has near-zero such characters. I've set the threshold
at 2% — note that this is an empirically derived value, not based on any
formal standard.
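The secondary validation described above could look roughly like the sketch below. Only the function name and the 2% threshold come from this description. The signature, the `little_endian` parameter, the choice of tab, LF, and CR as "common whitespace", the inclusive comparison, and the handling of empty input are all assumptions, so the real implementation may differ.

```rust
/// Sketch only: the name matches the commit, the body is an assumption.
/// Returns true when at most 2% of the decoded units are control characters
/// (other than tab, LF, CR) or unpaired surrogates.
fn is_plausible_utf16_text(bytes: &[u8], little_endian: bool) -> bool {
    // Group the bytes into 16-bit code units; a trailing odd byte is ignored.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| {
            let pair = [pair[0], pair[1]];
            if little_endian {
                u16::from_le_bytes(pair)
            } else {
                u16::from_be_bytes(pair)
            }
        })
        .collect();
    if units.is_empty() {
        return false; // assumed: nothing to validate means "not text"
    }

    let mut total = 0usize;
    let mut suspicious = 0usize;
    for decoded in std::char::decode_utf16(units.iter().copied()) {
        total += 1;
        match decoded {
            // decode_utf16 yields Err for an unpaired surrogate.
            Err(_) => suspicious += 1,
            Ok(c) => {
                let cp = c as u32;
                let is_common_whitespace = matches!(c, '\t' | '\n' | '\r');
                // C0 controls (U+0000..U+001F) plus DEL and C1 (U+007F..U+009F).
                let is_control = cp <= 0x1F || (0x7F..=0x9F).contains(&cp);
                if is_control && !is_common_whitespace {
                    suspicious += 1;
                }
            }
        }
    }
    // suspicious / total <= 2%, written without floating point.
    suspicious * 100 <= total * 2
}
```

With this sketch, UTF-16LE text such as "Hello, world!" contains no suspicious units and is accepted, while PCM samples with values 1 through 20 decode almost entirely to C0 control characters and are rejected.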

**Before fix**

<img width="1147" height="807" alt="Screenshot 2026-03-06 9:00:07 PM"
src="https://github.com/user-attachments/assets/2e6e47f9-f5e7-4cab-9d41-cc3dd20f9142"
/>

**After fix**
<img width="1280" height="783" alt="Screenshot 2026-03-06 1:17:43 AM"
src="https://github.com/user-attachments/assets/3fecea75-f061-4757-9972-220a34380d67"
/>


Before you mark this PR as ready for review, make sure that you have:
- [X] Added a solid test coverage and/or screenshots from doing manual
testing
- [ ] Done a self-review taking into account security and performance
aspects
- [ ] Aligned any UI changes with the [UI
checklist](https://github.com/zed-industries/zed/blob/main/CONTRIBUTING.md#uiux-checklist)

Release Notes:

- Fixed binary files (e.g. WAV) being misdetected as UTF-16 text,
causing Zed to freeze.
2026-03-17 02:51:44 +00:00
src worktree: Fix binary files misdetected as UTF-16 (#50890) 2026-03-17 02:51:44 +00:00
tests/integration project: Always allocate WorktreeIDs on the remote client (#47936) 2026-01-29 15:31:13 +00:00
Cargo.toml Remove unreferenced dev dependencies (#51093) 2026-03-09 13:22:12 +01:00
LICENSE-GPL Rename 'project_core' crate to 'worktree', make it just about worktrees (#9189) 2024-03-11 11:35:27 -07:00