Commit graph

9194 commits

Author SHA1 Message Date
Jeff Bolz
3fbadb06dc
vulkan: fuse SSM_CONV + BIAS + SILU (#22653) 2026-05-17 10:25:50 +02:00
Rares Vernica
1a68ec9378
server : honor --embd-normalize CLI arg (#23125)
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
2026-05-17 09:39:04 +03:00
ddh0
a16cce81d3
ngram : reduce noisy logs (#23185)
* ngram : reduce noisy logs

* ngram : reduce noisy logs
2026-05-17 09:38:17 +03:00
Judd
4f13cb7424
webui: support video files as input (#22830) 2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen
b64739ea39
server: (router) alloc tmp buffer on heap (#23159) 2026-05-16 23:42:16 +02:00
Pascal
64b38b561b
server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137) 2026-05-16 21:21:06 +02:00
Winston Ma
6049906133
vulkan: removed duplicate #include <memory> in headers (#23144) 2026-05-16 19:57:35 +02:00
Aleksander Grygier
0253fb21f5
ui: Add request timeout for MCP tool calls (#23138)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* feat: Add request timeout for MCP tool calls in llama-ui

* feat: MCP Settings tab with max timeout setting
2026-05-16 15:20:27 +02:00
Georgi Gerganov
3a92bc99db sync : ggml 2026-05-16 16:11:29 +03:00
Georgi Gerganov
e6c37a1adc ggml : bump version to 0.12.0 (ggml/1494) 2026-05-16 16:11:29 +03:00
CrispStrobe
560445bf34 metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).

The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.

Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
2026-05-16 16:11:29 +03:00
Steve Lhomme
2eb3e6b242 ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480)
That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code
2026-05-16 16:11:29 +03:00
Holger Voormann
25b1bc9c2f
ui: Correct links in tools/ui/README.md [no ci] (#23139)
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.

See 59778f0196.
2026-05-16 14:42:38 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
18675b6bbc
vendor : update cpp-httplib to 0.45.0 (#23103) 2026-05-16 15:25:21 +03:00
Aman Gupta
255582687b
llama + spec: MTP Support (#22673)
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (#7)

* MTP: clean-up (#9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: 8c05923630

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (#11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (#13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-16 20:06:23 +08:00
kubawoo
b81c2cdd74
ui: Fix handling of MCP resource template parameters (#23117)
* Fix handling of MCP resource template parameters

* Fix formatting for uri-template.test.ts

---------

Co-authored-by: kuba <kuba@laptop.local.net>
2026-05-16 13:25:41 +02:00
viggy
1428004808
webui : [ChatFormActionAdd][a11y] fix accessibility issues in add menu trigger and items (#22736)
* fix tab order on attach button, and dont focus on disabled mennu item

* add a11y tests
2026-05-16 12:00:46 +02:00
Pascal
366c5e2a3b
ui: untrack settings sync in props effect to prevent reactive loop (#23127)
Some checks are pending
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / python type-check (push) Waiting to run
2026-05-16 11:25:34 +02:00
Aleksander Grygier
1d9f99aa75
fix: Add build step using build workflow to publish workflow (#23134) 2026-05-16 11:22:59 +02:00
ynankani
42928bc14d
model : NvFP4 quantized LM head support (#23046)
* NvFP4 quantized LM head support

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review commnets

Signed-off-by: ynankani <ynankani@nvidia.com>

* Add assert for NvFp4 lm head and tied embeddings

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review commnets

Signed-off-by: ynankani <ynankani@nvidia.com>

* Create output_s tensor only when LM head NvFp4

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-16 11:09:27 +02:00
Aleksander Grygier
59778f0196
ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming (#23064)
* webui: Move static build output from `tools/server/public` to `build/ui` directory

* refactor: Move to `tools/ui`

* refactor: rename CMake variables and preprocessor defines

- Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated)
- Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated)
- Backward compat: old vars auto-forward to new ones with DEPRECATION warning
- Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc.
- Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET
- Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines
- Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED

* refactor: rename CLI flags (--webui -> --ui) with backward compat

- Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases)
- Add --ui-config (old --webui-config kept as deprecated alias)
- Add --ui-config-file (old --webui-config-file kept as deprecated alias)
- Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated)
- Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY
- C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields
- Backward compat: old fields synced to new ones in g_params_to_internals

* refactor: update C++ server internals with backward compat

- Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta)
- Rename params.webui usage -> params.ui (both synced, old still works)
- JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys
- Server routes use params.ui_mcp_proxy || params.webui_mcp_proxy
- Preprocessor guards use #if defined(LLAMA_BUILD_UI) || defined(LLAMA_BUILD_WEBUI)

* refactor: rename CI/CD workflows, artifacts, and build script

- Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build
- Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT
- Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks
- Update server.yml: job/artifact refs webui-build -> ui-build
- Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT
- Update server-self-hosted.yml: webui-build -> ui-build
- Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION
- Rename webui-download.cmake -> ui-download.cmake (internal refs updated)
- Update labeler.yml: server/webui -> server/ui path label

* docs: update CODEOWNERS and server README docs

- Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/
- Update server README.md: CLI tables show --ui flags with deprecated --webui aliases
- Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/

* fix: Small fixes for UI build

* fix: CMake.txt syntax

* chore: Formatting

* fix: `.editorconfig` for llama-ui

* chore: Formatting

* refactor: Use `APP_NAME` in Error route

* refactor: Cleanup

* refactor: Single migration service

* make llama-ui a linkable target

* fix: UI Build output

* fix: Missing change

* fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI

* refactor: UI workflows cleanup

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-16 02:02:40 +02:00
Sigbjørn Skjæret
49d1701bd2
ci : fix release symlinks (#23119) 2026-05-16 01:09:28 +02:00
Omer Ozarslan
1348f67c58
webui: Use lowercase hash for HF checksum check (#23107) 2026-05-15 19:38:16 +02:00
Pascal
cfabeb1bad
tests: add BF16 non-contig coverage for MUL_MAT permutations (#22689)
The MUL_MAT test loop iterates over base_types[] to generate non-contig
permutation cases (3 standard permutations across n in {1, 8, 16}).
BF16 is absent from base_types[], so these 9 cases were never generated
for BF16 even though every other type covered by base_types[] tests them.

Add the missing 9 cases explicitly: BF16 x F32, m=16, k=256, bs=[2,3],
permutations {0,2,1,3}, {0,1,3,2}, {0,3,2,1}, with n in {1, 8, 16}.

Suggested-by: @jeffbolznv
2026-05-15 19:35:05 +02:00
Julien Chaumond
6831fe470c
docs: document usage object in server timings response (#23110)
* docs: document `usage` object in server timings response

Co-Authored-By: julien-agent <Agents+cyolo@huggingface.co>

* Apply suggestion from @julien-c

---------

Co-authored-by: julien-agent <Agents+cyolo@huggingface.co>
2026-05-15 19:33:12 +02:00
Xuan-Son Nguyen
72e60f500d
mtmd: add chunks and fix preproc for qwen3a (#23073)
* mtmd: add chunks and fix preproc for qwen3a

* add attn_mask

* limit mtmd_chunk size (avoid blow up memory)

* correct audio tokens

* re-order the set_input case

* remove attn_mask
2026-05-15 19:32:47 +02:00
Pascal
8be1786707
webui: fix theme from --webui-config-file not applied on first load (fresh localStorage) (#22902) 2026-05-15 19:25:38 +02:00
Sigbjørn Skjæret
18d1717d62
convert : fix Qwen3 ASR conversion (#23081)
* fix qwen3asr

* fix qwen3asr
2026-05-15 18:38:39 +02:00
Piotr Wilkin (ilintar)
cc7200bf12
Refactor: convert_hf_to_gguf.py (#17114)
* move conversion code to a dedicated conversion directory and split the files akin to the src/models architecture

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-15 15:18:12 +02:00
Sigbjørn Skjæret
769cc93a43
ci : fix transform of top . entry in release archive (#23080)
* fix transform of top . entry in release archive

* simplify
2026-05-15 13:13:16 +02:00
Georgi Gerganov
d5dc2e0a02
llama-eval : add AIME 2026 dataset support (#23058)
Add Aime2026Dataset class loading from MathArena/aime_2026 on
HuggingFace. 30 problems (two sets of 15), single config/split.

Usage: --dataset aime2026

Assisted-by: llama.cpp:local pi
2026-05-15 13:58:30 +03:00
Aman Gupta
ac33f032ac
reasoning-budget: clone should do a deep-copy (#23095) 2026-05-15 11:59:07 +02:00
Pascal
d528444580
webui: preserve partial response on streaming error (#23090) 2026-05-15 11:18:11 +02:00
Sid Shaytay
91e84fed64
Support for Codex CLI by skipping unsupported Responses tools (#23041)
Some checks are pending
Python Type-Check / python type-check (push) Waiting to run
* Support for Codex CLI by skipping unsupported Responses tools

* Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection

* Revert gpt-oss apply_patch special handling
2026-05-15 09:03:24 +02:00
KITAITI Makoto
7155a49771
readme : update bindings (#23063) 2026-05-15 08:41:24 +03:00
Pranav Dhinakar
5c0e946837
ggml-hexagon: cpy: add contiguous fast-path in reshape copy (#23076) 2026-05-14 16:55:54 -07:00
Johannes Gäßler
3e037f313c
HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (#22880)
Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.

I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128.
2026-05-14 22:58:58 +02:00
Zack Li
d81e63dcfd
CI : support IOT device (IQ9) (#22987)
* update test scripts

* align CI behavior between linux and android

* remove automatically cancel in 15min

* enable cancel-in-progress

* fix ty check issue

* update and fix pylint issue

* update runner such that we are not restricted by the 15min limit rule

* fix flake8 lint issue

* update runner according to review feedback

* code update according to review feedback

* switch from llama-cli to llama-completion binary with -no-cnv flag
2026-05-14 13:58:34 -07:00
Reese Levine
834a243664
ggml-webgpu: Enable NVIDIA self-hosted CI (#22976)
* Enabel nvidia ci for webgpu

* Address precision issues

* fix placement

* Relax more set_rows and div

* Try relaxing all f16

* formatting and naming

* Add comment explaining max_nmse_err logic

Added comment referencing pull request for clarification.
2026-05-14 09:41:32 -07:00
Zheyuan Chen
5ec717d125
ggml-webgpu: makes the flash attn vec path subgroup-aware (#23040)
* ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size.

* ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap
2026-05-14 09:31:36 -07:00
Aleksander Grygier
0c3e4fccca
fix: Propagate version tag to WebUI asset download in self-hosted CI (#23051)
* fix: Propagate version tag to WebUI asset download in self-hosted CI

* refactor: Apply suggestions from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: Skip npm build when Node.js is not installed

Avoid 'no such file or directory' errors on CI runners that lack
Node.js. Check if npm is available via find_program before attempting
npm install + npm run build. Falls back to HF Bucket download.

* fix: Use + separator for ASSETS list to fix Windows build

Replace fragile \; escaping with a + separator when passing the
WebUI asset list via -DASSETS to the download script. On Windows,
the \; escaping was not reliably preserved through the CMake build
system, causing all asset filenames to be concatenated into one
(e.g., 'index.html;bundle.js;bundle.css;loading.html' as a single
file), which broke the HF Bucket download and subsequent xxd.cmake
step.

+ is safe because it is not special in cmd.exe (unlike | which is a
pipe operator), not special in CMake's -D argument parser, and not
a valid Windows filename character. CMakeLists.txt joins assets
with + and webui-download.cmake splits them back via regex.

* fix: Validate HF_WEBUI_VERSION environment variable with regex

Add input validation for the HF_WEBUI_VERSION env var to prevent
CMake list separator or path-traversal issues in stamp filenames
and download URLs. Rejects non-conforming characters early.

* fix: Remove 'latest' fallback for HF_WEBUI_VERSION

When needs.determine-tag.outputs.tag_name is empty, let CMake's
default resolution handle it (empty -> git-based version lookup)
instead of falling back to 'latest'. This ensures the sentinel
stamp file is consistent with CMake's resolution logic.

* fix: Demote checksum verification failure to warning instead of hard gate

* fix: End line character

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-14 17:57:20 +02:00
Aman Gupta
97b658cee8
contributing: new contributors should not submit trivial fixes (#23045) 2026-05-14 23:55:24 +08:00
Aleksander Grygier
253ba110bc
webui: Move static build output from repo code to HF Bucket (#22937)
* ci: add workflow to publish webui to Hugging Face bucket

* ci: add webui release job to release workflow

* ci: test webui release job

* chore: Return to default minification strategy for build output files

* ci: extract webui build into separate workflow and job

* chore: Ignore webui static output + clean up references

* chore: Delete legacy webui static output

* chore: Ignore webui build static output

* fix: Workflow

* fix: Versioning naming

* chore: Update package name

* test: Test CI fix

* refactor: Naming

* server: implement webui build strategy with HF Bucket support

* chore: Remove test workflow

* chore: Use WebUI build workflow call in other workflows

* server: HF Buckets fallback for WebUI build

* refactor: App name variable

* refactor: Naming

* fix: Retrieve loading.html

* fix: workflow syntax

* fix: Rewrite malformed release.yml

* fix: Req param

* test: Re-add missing Playwright installation for CI tests

* refactor: Logic & security improvements

* refactor: Retrieve publishing jobs and DRY the workflows

* fix: Test workflow syntax

* fix: Upstream Release Tag for test workflow

* chore: Remove test workflow

* ci: Run WebUI jobs on `ubuntu-24.04-arm`

* refactor: Post-CR cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: CI cleanup

* refactor: Cleanup

* test: Test workflow

* refactor: use LLAMA_BUILD_NUMBER instead of LLAMA_BUILD_TAG for HF Bucket webui downloads

* server: add fallback mechanism for HF Bucket webui downloads from latest directory

* fix: Incorrect argument order in file(SHA256) calls for checksum verification

* refactor: Use cmake script for handling the HF Bucket download on build time

* feat: support local npm build for WebUI assets

* refactor: add `HF_ENABLED` flag to control WebUI build/download provisioning

* refactor: Cleanup

* chore: Remove test workflow

* fix: remove s390x from release workflow

* fix: add webui-build dependency to ubuntu-22-rocm and windows-hip

* Revert "fix: remove s390x from release workflow"

This reverts commit debcfffa9bc1e3112eae41f2d29741b682e4eb19.

* fix: Release workflow file

* fix: Proper release tag used for HF Bucket upload

* fix: Remove duplicate steps in release workflow

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-14 13:21:41 +02:00
Georgi Gerganov
67b2b7f2f2
logs : reduce (#23021)
Some checks failed
Python Type-Check / python type-check (push) Waiting to run
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Update Operations Documentation / update-ops-docs (push) Has been cancelled
* logs : reduce

* args : fix envs

* server : fix build

* common : print verbosity level at start

* server : clean-up logs

* server : print prompt processing timings + sampling params

* minor : whitespaces
2026-05-14 13:05:52 +03:00
alex-spacemit
81b0d882ae
ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863) 2026-05-14 17:39:30 +08:00
Neo Zhang
0f45f1a35c
docker : revert stable version of intel compute-runtime (#22968) 2026-05-14 11:30:40 +02:00
Kabir Potdar
42532afff4
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110)
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.

Closes #21919.

* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-05-14 11:03:40 +02:00
Ruben Ortlam
dbe7901ca6
vulkan: fix matmul integer pipeline selection (#23005)
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
2026-05-14 10:36:54 +02:00
Aleksander Grygier
320a6a44a5
fix: Autoscroll detection (#23026) 2026-05-14 08:09:29 +02:00
Katostrofik
9ed6e19b9d
SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597)
* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

* SYCL: address review feedback - remove try/catch, check device types, deduplicate

- Remove try/catch from malloc/free/memcpy helpers, check backend and
  device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
- Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
  and declare in common.hpp to eliminate code duplication
- Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
- Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
  host-staged path for iGPU-to-dGPU transfers
- Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
  in CMakeLists.txt (co-authored with @arthw)

* SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

- Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
  Zero code is wrapped in #ifdef so the build works on systems without
  the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
  loader library and headers are checked before enabling.

- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
  whether Level Zero or SYCL memory APIs are used. Only one API style is
  used per session, no mixing. If Level Zero is enabled but the devices
  don't support the Level Zero backend, it auto-disables with a warning.

- Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
  is not called anywhere in the backend) and used try/catch for flow control.

- Update SYCL.md with documentation for both new parameters.

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.

* SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

- The Level Zero backend check was iterating all SYCL devices
  including CPU. The OpenCL CPU device caused Level Zero to be
  disabled for the GPUs, defeating the fix on multi-GPU systems.
  Added is_gpu() filter so only GPU devices are checked.

- sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
  were still calling sycl::malloc/sycl::free directly, bypassing the
  Level Zero path. Routed through ggml_sycl_malloc_device/free_device
  for consistency with the other device memory call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: address arthw review feedback on Level Zero memory API structure

- Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
  only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
- Switch both helpers to use g_ggml_sycl_enable_level_zero global
  instead of per-call queue backend checks
- Remove #ifdef wrapper from global definition; always declare at 0,
  add #else branch in init block so it stays 0 when L0 not compiled in
- Update init loop comment to explain GPU-only device check
- CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* SYCL: preserve Level Zero allocation path during early malloc

* ci: fix Level Zero package conflict in Intel Docker build

* ci: find Level Zero loader in oneAPI package step

* ci: allow Windows SYCL package without Level Zero DLL

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-14 13:39:14 +08:00