Commit graph

9224 commits

Author SHA1 Message Date
Intel AI Get-to Market Customer Success and Solutions
439f1b193d
sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (#22153)
* sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Use async mem ops for correctness when SYCL graphs are explicitly on.

Signed-off-by: Tao, Chun <chun.tao@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Tao, Chun <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-19 09:44:02 +03:00
Radoslav Gerganov
c3e9ade6dd
rpc : keep last_graph_uid in the device context (#23273)
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.

closes: #23242
2026-05-19 09:42:36 +03:00
Pranav Dhinakar
9a532ae4ba
hexagon: add support for TRI op (#22822)
* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context

* addressed PR review comments for TRI op

* hexagon: clang format

* hex-unary: remove merge conflict markers

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-ggml: fix editor config errors

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-18 14:04:57 -07:00
Pranav Dhinakar
b7340443d4
ggml-hexagon: add PAD op HVX kernel (#23078)
* ggml-hexagon: add PAD op HVX kernel

Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-pad: fix editorconfig checks and macro alignment

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-18 13:39:36 -07:00
SamareshSingh
5cbaa5e69e
docker : add OCI image labels for version and build date (#21653)
* docker: add OCI image labels to all published images

* docker: propagate OCI labels as manifest and index annotations

* docker: drop hardcoded org URL and revert accidental intel version bump

The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit.

* docker: add skip_s390x workflow_dispatch input for fast test runs

Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
2026-05-18 22:14:45 +02:00
Adrien Gallouët
45b455e66f
common : remove hf cache migration (#23266)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-18 17:11:47 +02:00
Aleksander Grygier
3a9c1b854d
ui: Update KaTeX package and clean up logs from sass warnings (#23275)
* ui: migrate katex imports to @use to resolve SCSS deprecation warnings

* ci: Use `ubuntu-slim` for CI (UI) workflow
2026-05-18 16:26:01 +02:00
Aleksander Grygier
b9a2170fce
feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270) 2026-05-18 16:17:21 +02:00
Aleksander Grygier
1ff0fc1384
ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236)
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars

* refactor: skip MCP proxy probe when no server requires it

* refactor: suppress expected disconnect errors during MCP client shutdown

* refactor: Deduplicate requests

* refactor: deduplicate model fetching across ROUTER and MODEL modes

* refactor: Clean up models logic

* chore: Add `.env.example` file

* refactor: replace client-side CORS proxy probe with server status flag

* refactor: Post-review fixes

* test: add vitest client setup with API fetch mocks
2026-05-18 16:09:40 +02:00
Aleksander Grygier
a135ec0baa
ui: Centralize monospace font styles in app.css (#23272)
Some checks failed
Python Type-Check / python type-check (push) Has been cancelled
2026-05-18 15:10:14 +02:00
Martin Andersson
232f466583
webui: fix Tailwind v4 utility classes missing when built via cmake (#23253) 2026-05-18 14:08:02 +02:00
Andrei
49c21f97cd
llama: initialize pre-norm embedding mask flag (#23256) 2026-05-18 14:20:49 +03:00
Sigbjørn Skjæret
77e38d68f2
add myself to conversion (#23261) 2026-05-18 12:42:56 +02:00
Martin Klacer
053e01dff6
ci : added kleidiai-server to server-self-hosted workflow (#22435)
* kleidiai: added kleidiai-server to server-self-hosted workflow

 * Added KleidiAI-enabled Arm64 Linux llama-server CI/integration test
   workflow into the server-self-hosted.yml configuration file

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I032e33c525b7e26bc5d53719f638bee610cec1ee

* Added self-hosted executor for KleidiAI server workflow

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* Update .github/workflows/server-self-hosted.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 11:14:57 +02:00
Georgi Gerganov
c3f95c1f06
scripts : allow wc2wt with an existing branch (#23189) 2026-05-18 08:57:28 +03:00
Intel AI Get-to Market Customer Success and Solutions
0caf2a1d48
sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions
5511965b19
sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-18 08:11:51 +03:00
Neo Zhang
e98bcfec28
sycl : fix error when use -mg 1 error (#23140) 2026-05-18 08:11:19 +03:00
Incarnas
1867a0c692
update bid to match each layers MTP source (#23237)
* update bid to match each layers MTP source

* Update conversion/qwen.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 12:37:12 +08:00
Sigbjørn Skjæret
dd7cad7197
cmake : do not check for bin install dir (#23234) 2026-05-18 02:33:14 +02:00
Gabe Goodhart
726704a160
feat: Support d_conv=15 for ssm-conv.cu (#23017)
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2026-05-17 23:05:11 +02:00
Aldehir Rojas
87589042ca
cmake : fix LLAMA_BUILD_UI logic (#23190) 2026-05-17 14:42:26 -04:00
Sigbjørn Skjæret
e0de4c2419
cmake : do not install conversion script (#23204) 2026-05-17 18:07:21 +02:00
Oliver Simons
84c678242a
CUDA: Continue directly including cuda/iterator (#23102)
Cont of #22936, forgot to update one site
2026-05-17 18:00:10 +02:00
Aman Gupta
3e12fbdea5
llama: avoid copying logits during prompt decode in MTP (#23198)
* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm
2026-05-17 23:30:25 +08:00
Aldehir Rojas
39cf5d6191
common : delegate assistant continuation to underlying template handlers (#23089)
* common : delegate assistant continuation to template handler

* server : implement echo parameter to exclude assistant prefill in the response

* server : fix tests for prefill

* server : use existing llama template

* cont : clean up
2026-05-17 13:36:05 +02:00
Jan Ekström
a6d6183dbc
ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009)
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI

For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).

This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.

* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers

This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
2026-05-17 13:12:11 +02:00
Pascal
fcae601e44
vulkan: add cpy bf16 -> f32 pipelines (#22677) 2026-05-17 11:31:20 +02:00
Jeff Bolz
7ba22c6a09
vulkan: Support unaligned tensors for ROPE (#22637) 2026-05-17 11:30:16 +02:00
Aldehir Rojas
f4cc787b9f
common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments
2026-05-17 03:44:34 -05:00
Jeff Bolz
3fbadb06dc
vulkan: fuse SSM_CONV + BIAS + SILU (#22653) 2026-05-17 10:25:50 +02:00
Rares Vernica
1a68ec9378
server : honor --embd-normalize CLI arg (#23125)
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
2026-05-17 09:39:04 +03:00
ddh0
a16cce81d3
ngram : reduce noisy logs (#23185)
* ngram : reduce noisy logs

* ngram : reduce noisy logs
2026-05-17 09:38:17 +03:00
Judd
4f13cb7424
webui: support video files as input (#22830) 2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen
b64739ea39
server: (router) alloc tmp buffer on heap (#23159) 2026-05-16 23:42:16 +02:00
Pascal
64b38b561b
server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137) 2026-05-16 21:21:06 +02:00
Winston Ma
6049906133
vulkan: removed duplicate #include <memory> in headers (#23144) 2026-05-16 19:57:35 +02:00
Aleksander Grygier
0253fb21f5
ui: Add request timeout for MCP tool calls (#23138)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* feat: Add request timeout for MCP tool calls in llama-ui

* feat: MCP Settings tab with max timeout setting
2026-05-16 15:20:27 +02:00
Georgi Gerganov
3a92bc99db sync : ggml 2026-05-16 16:11:29 +03:00
Georgi Gerganov
e6c37a1adc ggml : bump version to 0.12.0 (ggml/1494) 2026-05-16 16:11:29 +03:00
CrispStrobe
560445bf34 metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).

The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.

Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
2026-05-16 16:11:29 +03:00
Steve Lhomme
2eb3e6b242 ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480)
That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code
2026-05-16 16:11:29 +03:00
Holger Voormann
25b1bc9c2f
ui: Correct links in tools/ui/README.md [no ci] (#23139)
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.

See 59778f0196.
2026-05-16 14:42:38 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
18675b6bbc
vendor : update cpp-httplib to 0.45.0 (#23103) 2026-05-16 15:25:21 +03:00
Aman Gupta
255582687b
llama + spec: MTP Support (#22673)
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (#7)

* MTP: clean-up (#9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: 8c05923630

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (#11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (#13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-16 20:06:23 +08:00
kubawoo
b81c2cdd74
ui: Fix handling of MCP resource template parameters (#23117)
* Fix handling of MCP resource template parameters

* Fix formatting for uri-template.test.ts

---------

Co-authored-by: kuba <kuba@laptop.local.net>
2026-05-16 13:25:41 +02:00
viggy
1428004808
webui : [ChatFormActionAdd][a11y] fix accessibility issues in add menu trigger and items (#22736)
* fix tab order on attach button, and dont focus on disabled mennu item

* add a11y tests
2026-05-16 12:00:46 +02:00
Pascal
366c5e2a3b
ui: untrack settings sync in props effect to prevent reactive loop (#23127)
Some checks are pending
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / python type-check (push) Waiting to run
2026-05-16 11:25:34 +02:00
Aleksander Grygier
1d9f99aa75
fix: Add build step using build workflow to publish workflow (#23134) 2026-05-16 11:22:59 +02:00
ynankani
42928bc14d
model : NvFP4 quantized LM head support (#23046)
* NvFP4 quantized LM head support

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review commnets

Signed-off-by: ynankani <ynankani@nvidia.com>

* Add assert for NvFp4 lm head and tied embeddings

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review commnets

Signed-off-by: ynankani <ynankani@nvidia.com>

* Create output_s tensor only when LM head NvFp4

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-16 11:09:27 +02:00