Commit graph

9230 commits

Author SHA1 Message Date
Georgi Gerganov
3c81c8deea
server : print graphs reused in slot timings (#23279)
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().

Assisted-by: llama.cpp:local pi

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-19 09:46:58 +03:00
Georgi Gerganov
cd963fee6a
save-load-state : refactor tests and improve readability (#23196)
* save-load-state : refactor into separate phase functions

- Split monolithic main() into 4 self-contained phase functions, each
  managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)

Assisted-by: llama.cpp:local pi

* save-load-state : use params.out_file instead of separate state_file

- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt

Assisted-by: llama.cpp:local pi

* save-load-state : use smart pointers for ctx and smpl

- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)

Assisted-by: llama.cpp:local pi

* save-load-state : add local llama_batch_ptr RAII wrapper

- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls

Assisted-by: llama.cpp:local pi

* save-load-state : replace printf/fprintf with logging macros

- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG

Assisted-by: llama.cpp:local pi

* save-load-state : refactor tests to check results inline

Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().

Assisted-by: llama.cpp:local pi

* save-load-state : improve test output readability

Add phase labels, remove redundant run prefixes, and show
PASS after each test.

Assisted-by: llama.cpp:local pi

* pi : add rule about git signing

* save-load-state : simplify llama_batch_ptr

Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.

Assisted-by: llama.cpp:local pi

* save-load-state : extract generate_tokens helper

Factor out the repeated token generation loop into a shared
helper function used by all phases.

Assisted-by: llama.cpp:local pi

* save-load-state : update comments to use test terminology

Replace "Phase" with "Test" and list each test's steps
as bullet points.

Assisted-by: llama.cpp:local pi

* save-load-state : rename test functions

Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.

Assisted-by: llama.cpp:local pi

* pi : add rule to never git push without confirmation

Assisted-by: llama.cpp:local pi

* common : add model_only option to common_init_from_params

Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.

Use in save-load-state to initialize only the model,
with each test creating its own context.

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-19 09:46:34 +03:00
Georgi Gerganov
d2e179a477
llama-eval : add per-task summary stats (#23151)
* llama-eval : add per-problem summary table to HTML reports

- Add chunk_idx and problem_idx to TaskState and saved case dicts
- Group completed cases by problem_idx in dump_html()
- Render per-problem summary table before individual task table
  - Columns: Problem (zero-padded), Runs, Correct (n/r),
    Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max)
  - Sorted by problem index, monospace font, right-aligned numbers
  - Colspan headers for grouped stats, auto width
- Simulator: add /v1/models endpoint, timings in response,
  template-aware question matching, --dataset arg (aime/aime2025)

Assisted-by: llama.cpp:local pi

* llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally

- Wrap Detailed and Summary tables in switchable tabs (Detailed active by default)
- Remove summary-section wrapper, use tab labels instead
- Apply monospace font to all tables and the top bar

Assisted-by: llama.cpp:local pi

* llama-eval : redesign top bar as CSS grid label/value pairs

- Replace flat span list with 4-column grid layout (2 pairs per row)
- Labels in muted color (#888), values in dark (#222)
- Bold dataset name and model name
- Removed media query, always uses 4 columns

Assisted-by: llama.cpp:local pi

* llama-eval : use realistic token counts and throughput in simulator

- comp_tokens: [30, 80] → [10000, 60000]
- tps_gen: derived → uniform [90.0, 110.0]
- t_gen_ms: now computed from tokens/tps

Assisted-by: llama.cpp:local pi

* llama-eval : color Answer column green/red based on correctness

Use the same .correct/.incorrect CSS classes on the Answer column
to make correct answers green and incorrect answers red.

Assisted-by: llama.cpp:local pi

* llama-eval : fix pyright errors from max(..., key=len) type inference

Use key=lambda x: len(x) instead of key=len so the type checker
infers the return type as str instead of Sized, fixing:
  - unresolved-attribute: Object of type Sized has no attribute lower
  - not-subscriptable: Cannot subscript object of type Sized

Assisted-by: llama.cpp:local pi
2026-05-19 09:46:05 +03:00
Reese Levine
c85a242ed0
ggml-webgpu : extend GDN for K>1 (#23299) 2026-05-19 09:45:41 +03:00
Neo Zhang
aabee047d8
[SCYL] add chapter for performance reference in SYCL.md (#23315)
* add chapter for performance reference

* rm unsupported GPU
2026-05-19 09:44:51 +03:00
Sigbjørn Skjæret
f1c1c5c057
convert : filter lora tensor names (#23077) 2026-05-19 09:44:25 +03:00
Intel AI Get-to Market Customer Success and Solutions
439f1b193d
sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (#22153)
* sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Use async mem ops for correctness when SYCL graphs are explicitly on.

Signed-off-by: Tao, Chun <chun.tao@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Tao, Chun <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-19 09:44:02 +03:00
Radoslav Gerganov
c3e9ade6dd
rpc : keep last_graph_uid in the device context (#23273)
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.

closes: #23242
2026-05-19 09:42:36 +03:00
Pranav Dhinakar
9a532ae4ba
hexagon: add support for TRI op (#22822)
* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context

* addressed PR review comments for TRI op

* hexagon: clang format

* hex-unary: remove merge conflict markers

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-ggml: fix editor config errors

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-18 14:04:57 -07:00
Pranav Dhinakar
b7340443d4
ggml-hexagon: add PAD op HVX kernel (#23078)
* ggml-hexagon: add PAD op HVX kernel

Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-pad: fix editorconfig checks and macro alignment

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-18 13:39:36 -07:00
SamareshSingh
5cbaa5e69e
docker : add OCI image labels for version and build date (#21653)
* docker: add OCI image labels to all published images

* docker: propagate OCI labels as manifest and index annotations

* docker: drop hardcoded org URL and revert accidental intel version bump

The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit.

* docker: add skip_s390x workflow_dispatch input for fast test runs

Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
2026-05-18 22:14:45 +02:00
Adrien Gallouët
45b455e66f
common : remove hf cache migration (#23266)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-18 17:11:47 +02:00
Aleksander Grygier
3a9c1b854d
ui: Update KaTeX package and clean up logs from sass warnings (#23275)
* ui: migrate katex imports to @use to resolve SCSS deprecation warnings

* ci: Use `ubuntu-slim` for CI (UI) workflow
2026-05-18 16:26:01 +02:00
Aleksander Grygier
b9a2170fce
feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270) 2026-05-18 16:17:21 +02:00
Aleksander Grygier
1ff0fc1384
ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236)
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars

* refactor: skip MCP proxy probe when no server requires it

* refactor: suppress expected disconnect errors during MCP client shutdown

* refactor: Deduplicate requests

* refactor: deduplicate model fetching across ROUTER and MODEL modes

* refactor: Clean up models logic

* chore: Add `.env.example` file

* refactor: replace client-side CORS proxy probe with server status flag

* refactor: Post-review fixes

* test: add vitest client setup with API fetch mocks
2026-05-18 16:09:40 +02:00
Aleksander Grygier
a135ec0baa
ui: Centralize monospace font styles in app.css (#23272)
Some checks failed
Python Type-Check / python type-check (push) Has been cancelled
2026-05-18 15:10:14 +02:00
Martin Andersson
232f466583
webui: fix Tailwind v4 utility classes missing when built via cmake (#23253) 2026-05-18 14:08:02 +02:00
Andrei
49c21f97cd
llama: initialize pre-norm embedding mask flag (#23256) 2026-05-18 14:20:49 +03:00
Sigbjørn Skjæret
77e38d68f2
add myself to conversion (#23261) 2026-05-18 12:42:56 +02:00
Martin Klacer
053e01dff6
ci : added kleidiai-server to server-self-hosted workflow (#22435)
* kleidiai: added kleidiai-server to server-self-hosted workflow

 * Added KleidiAI-enabled Arm64 Linux llama-server CI/integration test
   workflow into the server-self-hosted.yml configuration file

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I032e33c525b7e26bc5d53719f638bee610cec1ee

* Added self-hosted executor for KleidiAI server workflow

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* Update .github/workflows/server-self-hosted.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 11:14:57 +02:00
Georgi Gerganov
c3f95c1f06
scripts : allow wc2wt with an existing branch (#23189) 2026-05-18 08:57:28 +03:00
Intel AI Get-to Market Customer Success and Solutions
0caf2a1d48
sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions
5511965b19
sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-18 08:11:51 +03:00
Neo Zhang
e98bcfec28
sycl : fix error when use -mg 1 error (#23140) 2026-05-18 08:11:19 +03:00
Incarnas
1867a0c692
update bid to match each layers MTP source (#23237)
* update bid to match each layers MTP source

* Update conversion/qwen.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 12:37:12 +08:00
Sigbjørn Skjæret
dd7cad7197
cmake : do not check for bin install dir (#23234) 2026-05-18 02:33:14 +02:00
Gabe Goodhart
726704a160
feat: Support d_conv=15 for ssm-conv.cu (#23017)
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2026-05-17 23:05:11 +02:00
Aldehir Rojas
87589042ca
cmake : fix LLAMA_BUILD_UI logic (#23190) 2026-05-17 14:42:26 -04:00
Sigbjørn Skjæret
e0de4c2419
cmake : do not install conversion script (#23204) 2026-05-17 18:07:21 +02:00
Oliver Simons
84c678242a
CUDA: Continue directly including cuda/iterator (#23102)
Cont of #22936, forgot to update one site
2026-05-17 18:00:10 +02:00
Aman Gupta
3e12fbdea5
llama: avoid copying logits during prompt decode in MTP (#23198)
* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm
2026-05-17 23:30:25 +08:00
Aldehir Rojas
39cf5d6191
common : delegate assistant continuation to underlying template handlers (#23089)
* common : delegate assistant continuation to template handler

* server : implement echo parameter to exclude assistant prefill in the response

* server : fix tests for prefill

* server : use existing llama template

* cont : clean up
2026-05-17 13:36:05 +02:00
Jan Ekström
a6d6183dbc
ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009)
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI

For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).

This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.

* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers

This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
2026-05-17 13:12:11 +02:00
Pascal
fcae601e44
vulkan: add cpy bf16 -> f32 pipelines (#22677) 2026-05-17 11:31:20 +02:00
Jeff Bolz
7ba22c6a09
vulkan: Support unaligned tensors for ROPE (#22637) 2026-05-17 11:30:16 +02:00
Aldehir Rojas
f4cc787b9f
common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments
2026-05-17 03:44:34 -05:00
Jeff Bolz
3fbadb06dc
vulkan: fuse SSM_CONV + BIAS + SILU (#22653) 2026-05-17 10:25:50 +02:00
Rares Vernica
1a68ec9378
server : honor --embd-normalize CLI arg (#23125)
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
2026-05-17 09:39:04 +03:00
ddh0
a16cce81d3
ngram : reduce noisy logs (#23185)
* ngram : reduce noisy logs

* ngram : reduce noisy logs
2026-05-17 09:38:17 +03:00
Judd
4f13cb7424
webui: support video files as input (#22830) 2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen
b64739ea39
server: (router) alloc tmp buffer on heap (#23159) 2026-05-16 23:42:16 +02:00
Pascal
64b38b561b
server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137) 2026-05-16 21:21:06 +02:00
Winston Ma
6049906133
vulkan: removed duplicate #include <memory> in headers (#23144) 2026-05-16 19:57:35 +02:00
Aleksander Grygier
0253fb21f5
ui: Add request timeout for MCP tool calls (#23138)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* feat: Add request timeout for MCP tool calls in llama-ui

* feat: MCP Settings tab with max timeout setting
2026-05-16 15:20:27 +02:00
Georgi Gerganov
3a92bc99db sync : ggml 2026-05-16 16:11:29 +03:00
Georgi Gerganov
e6c37a1adc ggml : bump version to 0.12.0 (ggml/1494) 2026-05-16 16:11:29 +03:00
CrispStrobe
560445bf34 metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).

The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.

Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
2026-05-16 16:11:29 +03:00
Steve Lhomme
2eb3e6b242 ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480)
That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code
2026-05-16 16:11:29 +03:00
Holger Voormann
25b1bc9c2f
ui: Correct links in tools/ui/README.md [no ci] (#23139)
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.

See 59778f0196.
2026-05-16 14:42:38 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
18675b6bbc
vendor : update cpp-httplib to 0.45.0 (#23103) 2026-05-16 15:25:21 +03:00