Commit graph

1296 commits

Concedo
9203b6a051 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	.github/workflows/server-sanitize.yml
#	.github/workflows/server-self-hosted.yml
#	.github/workflows/server.yml
#	.github/workflows/ui-build.yml
#	.github/workflows/ui-ci.yml
#	.github/workflows/ui-publish.yml
#	.gitignore
#	CMakeLists.txt
#	CODEOWNERS
#	scripts/ui-download.cmake
#	scripts/xxd.cmake
#	tests/test-backend-ops.cpp
#	tests/test-reasoning-budget.cpp
#	tools/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2026-05-16 22:56:33 +08:00
Aleksander Grygier
59778f0196
ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming (#23064)
* webui: Move static build output from `tools/server/public` to `build/ui` directory

* refactor: Move to `tools/ui`

* refactor: rename CMake variables and preprocessor defines

- Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated)
- Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated)
- Backward compat: old vars auto-forward to new ones with DEPRECATION warning
- Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc.
- Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET
- Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines
- Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED

* refactor: rename CLI flags (--webui -> --ui) with backward compat

- Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases)
- Add --ui-config (old --webui-config kept as deprecated alias)
- Add --ui-config-file (old --webui-config-file kept as deprecated alias)
- Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated)
- Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY
- C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields
- Backward compat: old fields synced to new ones in g_params_to_internals

* refactor: update C++ server internals with backward compat

- Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta)
- Rename params.webui usage -> params.ui (both synced, old still works)
- JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys
- Server routes use params.ui_mcp_proxy || params.webui_mcp_proxy
- Preprocessor guards use #if defined(LLAMA_BUILD_UI) || defined(LLAMA_BUILD_WEBUI)

* refactor: rename CI/CD workflows, artifacts, and build script

- Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build
- Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT
- Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks
- Update server.yml: job/artifact refs webui-build -> ui-build
- Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT
- Update server-self-hosted.yml: webui-build -> ui-build
- Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION
- Rename webui-download.cmake -> ui-download.cmake (internal refs updated)
- Update labeler.yml: server/webui -> server/ui path label

* docs: update CODEOWNERS and server README docs

- Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/
- Update server README.md: CLI tables show --ui flags with deprecated --webui aliases
- Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/

* fix: Small fixes for UI build

* fix: CMake.txt syntax

* chore: Formatting

* fix: `.editorconfig` for llama-ui

* chore: Formatting

* refactor: Use `APP_NAME` in Error route

* refactor: Cleanup

* refactor: Single migration service

* make llama-ui a linkable target

* fix: UI Build output

* fix: Missing change

* fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI

* refactor: UI workflows cleanup

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-16 02:02:40 +02:00
Aman Gupta
ac33f032ac
reasoning-budget: clone should do a deep-copy (#23095) 2026-05-15 11:59:07 +02:00
Concedo
da2cc90723 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-and-test-snapdragon.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	.github/workflows/server-self-hosted.yml
#	.github/workflows/server-webui.yml
#	.github/workflows/server.yml
#	.gitignore
#	CMakeLists.txt
#	CONTRIBUTING.md
#	README.md
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-hexagon/htp/cpy-ops.c
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	grammars/README.md
#	scripts/snapdragon/qdc/run_qdc_jobs.py
#	scripts/snapdragon/qdc/tests/run_backend_ops_posix.py
#	scripts/snapdragon/qdc/tests/run_bench_tests_posix.py
#	scripts/snapdragon/qdc/tests/utils.py
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/server/CMakeLists.txt
#	tools/server/README.md
#	tools/server/webui/src/lib/components/app/server/ServerLoadingSplash.svelte
#	tools/server/webui/src/routes/(chat)/chat/[id]/+page.svelte
#	ty.toml
2026-05-15 17:09:48 +08:00
Aleksander Grygier
253ba110bc
webui: Move static build output from repo code to HF Bucket (#22937)
* ci: add workflow to publish webui to Hugging Face bucket

* ci: add webui release job to release workflow

* ci: test webui release job

* chore: Return to default minification strategy for build output files

* ci: extract webui build into separate workflow and job

* chore: Ignore webui static output + clean up references

* chore: Delete legacy webui static output

* chore: Ignore webui build static output

* fix: Workflow

* fix: Versioning naming

* chore: Update package name

* test: Test CI fix

* refactor: Naming

* server: implement webui build strategy with HF Bucket support

* chore: Remove test workflow

* chore: Use WebUI build workflow call in other workflows

* server: HF Buckets fallback for WebUI build

* refactor: App name variable

* refactor: Naming

* fix: Retrieve loading.html

* fix: workflow syntax

* fix: Rewrite malformed release.yml

* fix: Req param

* test: Re-add missing Playwright installation for CI tests

* refactor: Logic & security improvements

* refactor: Retrieve publishing jobs and DRY the workflows

* fix: Test workflow syntax

* fix: Upstream Release Tag for test workflow

* chore: Remove test workflow

* ci: Run WebUI jobs on `ubuntu-24.04-arm`

* refactor: Post-CR cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: CI cleanup

* refactor: Cleanup

* test: Test workflow

* refactor: use LLAMA_BUILD_NUMBER instead of LLAMA_BUILD_TAG for HF Bucket webui downloads

* server: add fallback mechanism for HF Bucket webui downloads from latest directory

* fix: Incorrect argument order in file(SHA256) calls for checksum verification

* refactor: Use cmake script for handling the HF Bucket download on build time

* feat: support local npm build for WebUI assets

* refactor: add `HF_ENABLED` flag to control WebUI build/download provisioning

* refactor: Cleanup

* chore: Remove test workflow

* fix: remove s390x from release workflow

* fix: add webui-build dependency to ubuntu-22-rocm and windows-hip

* Revert "fix: remove s390x from release workflow"

This reverts commit debcfffa9bc1e3112eae41f2d29741b682e4eb19.

* fix: Release workflow file

* fix: Proper release tag used for HF Bucket upload

* fix: Remove duplicate steps in release workflow

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-14 13:21:41 +02:00
Concedo
cc82c3164e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.github/workflows/build-cross.yml
#	.github/workflows/build-sycl.yml
#	.github/workflows/build.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/release.yml
#	cmake/riscv64-spacemit-linux-gnu-gcc.cmake
#	docs/backend/OPENVINO.md
#	docs/backend/SYCL.md
#	docs/build-riscv64-spacemit.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	embd_res/ggml-vocab-qwen35.gguf
#	embd_res/ggml-vocab-qwen35.gguf.inp
#	embd_res/ggml-vocab-qwen35.gguf.out
#	examples/model-conversion/Makefile
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/common.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
#	ggml/src/ggml-zendnn/CMakeLists.txt
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	scripts/snapdragon/adb/run-completion.sh
#	tests/CMakeLists.txt
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/mtmd/clip-impl.h
#	tools/mtmd/clip.cpp
#	tools/mtmd/clip.h
#	tools/server/README.md
2026-05-14 19:04:04 +08:00
Georgi Gerganov
67b2b7f2f2
logs : reduce (#23021)
* logs : reduce

* args : fix envs

* server : fix build

* common : print verbosity level at start

* server : clean-up logs

* server : print prompt processing timings + sampling params

* minor : whitespaces
2026-05-14 13:05:52 +03:00
Xuan-Son Nguyen
e75cd5efb5
download: do not exit() on error (#23008) 2026-05-13 15:14:58 +02:00
Georgi Gerganov
634275fbbb
spec : update CLI arguments for better consistency (#22964)
* spec : update CLI arguments for better consistency

* cont : fix CLI arg message
2026-05-13 09:15:39 +03:00
Xuan-Son Nguyen
7bfe120c21
mtmd, server, common: expose modalities to /v1/models (#22952)
* mtmd, server, common: expose modalities to /v1/models

* fix build

* rename to mtmd_caps
2026-05-12 19:08:07 +02:00
Concedo
f7923b261f need to fix cuda compile. Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/python-type-check.yml
#	examples/speculative-simple/README.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/src/ggml-cuda/im2col.cu
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	tests/test-backend-ops.cpp
#	tools/cli/README.md
#	tools/mtmd/CMakeLists.txt
#	tools/server/README.md
2026-05-12 20:47:07 +08:00
Georgi Gerganov
68e7ea3eab
spec : parallel drafting support (#22838)
* spec : refactor

* spec : drop support for incompatible vocabs

* spec : update common_speculative_init()

* cont : pass seq_id

* cont : dedup ctx_seq_rm_type

* server : sketch the ctx_dft decode loop

* server : draft prompt cache and checkpoints

* server : improve ctx names

* server, spec : transition to unified spec context

* cont : sync main and drft contexts

* cont : async drft eval when possible

* cont : handle non-ckpt models

* cont : pass correct n_past for drafting

* cont : process images through the draft context

* spec : handle draft running out of context

* server : fix mtmd draft processing

* server : fix URL for draft model

* server : add comment

* server : clean-up + dry

* speculative-simple : update

* spec : fix n_past type

* server : fix slot ctx_drft ptr

* tools : update readme

* naming : improve consistency

* spec : refactor for multi-sequence speculative context

* cont : prepare params

* cont : prepare params

* spec : support parallel drafts

* server : support parallel drafting

* llama : reuse device buffers when possible

* server, spec : clean-up

* cont : clean-up

* cont : minor

* spec : reset `drafting` flag at the end

* spec : introduce `common_speculative_process()`

* spec : allow for multiple spec types (chain of speculators)

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-05-11 19:09:43 +03:00
Concedo
2771e16fbc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.devops/nix/package.nix
#	.gitignore
#	docs/backend/SYCL.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-cuda/ggml-cuda.cu
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/fattn-common.hpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/im2col.cpp
#	ggml/src/ggml-sycl/im2col.hpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/quants.hpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	ggml/src/ggml-virtgpu/ggml-backend-device.cpp
#	scripts/sync-ggml.last
#	scripts/sync_vendor.py
#	tests/test-backend-ops.cpp
2026-05-11 16:18:28 +08:00
Concedo
9b0b36b5ef Merge commit '66001722aa' into concedo_experimental
# Conflicts:
#	README.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	examples/sycl/start-svr.sh
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-sycl/gated_delta_net.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/pad.cpp
#	ggml/src/ggml-sycl/ssm_conv.cpp
#	tests/test-backend-ops.cpp
#	tests/test-reasoning-budget.cpp
#	tools/server/README.md
#	tools/server/webui/src/lib/constants/settings-config.ts
2026-05-11 15:40:10 +08:00
Tim Neumann
2e97c5f96f
backend sampling: support returning post-sampling probs (#22622)
* server: Never return 0.0 post-sampling probabilities

* backend sampling: support returning post-sampling probs
2026-05-10 19:12:02 +02:00
Aldehir Rojas
49956041ee
common : do not wrap raw strings in schema parser for tagged parsers (#22827) 2026-05-08 15:33:17 -05:00
Aldehir Rojas
f9cd456ea5
common : revert reasoning budget +inf logit bias (#22740) 2026-05-08 17:46:43 +02:00
Concedo
eb30b29d69 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/gguf-publish.yml
#	CODEOWNERS
#	examples/sycl/test.sh
#	pyproject.toml
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
2026-05-08 14:48:57 +08:00
Aldehir Rojas
093be624cc
common/chat : preserve media markers for typed-content templates (#22634) 2026-05-07 12:50:56 -05:00
fl0rianr
a0101225bc
common: do not fit to unknown device memory (#22614)
* common: do not fit to unknown device memory

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: preserve host fallback for non-GPU fit devices

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: keep unknown GPU fit memory at zero

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
2026-05-06 17:03:45 +02:00
Concedo
9e9497f0cc Merge remote-tracking branch 'origin/upstream' into concedo_experimental
# Conflicts:
#	examples/save-load-state/save-load-state.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/sync-ggml.last
#	scripts/sync_vendor.py
#	src/llama-graph.cpp
#	tests/test-backend-ops.cpp
#	tests/test-state-restore-fragmented.cpp
2026-05-06 21:20:06 +08:00
Concedo
7240da764a Merge commit '935a340292' into concedo_experimental
# Conflicts:
#	examples/diffusion/CMakeLists.txt
#	scripts/server-test-function-call.py
#	src/llama-model.cpp
#	src/models/gemma4.cpp
#	tests/test-chat.cpp
#	tests/test-reasoning-budget.cpp
#	tools/server/README.md
2026-05-06 21:02:25 +08:00
Adrien Gallouët
2635ac76e8
common : fix missing-noreturn warnings when compiling with clang 21 (#22702)
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3719 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3726 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3733 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3740 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3747 |         [](common_params & /*params*/, int /*value*/) {
          |         ^

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 13:16:25 +03:00
Adrien Gallouët
bf76ac77be
common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Georgi Gerganov
d6e7b033a4
llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Shakhnazar Sailaukan
d8794eecd5
examples: refactor diffusion generation (#22590)
* examples: refactor diffusion generation

* renamed enum values
2026-05-04 20:19:30 +08:00
Piotr Wilkin (ilintar)
a4701c98f7
common/autoparser: fixes for newline handling / forced tool calls (#22654)
* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead
2026-05-04 13:18:11 +02:00
Evan Huus
c84e6d6db5
server: Add a simple get_datetime server tool (#22649) 2026-05-04 12:19:41 +02:00
Concedo
2905c6254f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.pi/gg/SYSTEM.md
#	docs/speculative.md
#	ggml/src/ggml-virtgpu/virtgpu-shm.cpp
#	ggml/src/ggml-virtgpu/virtgpu.cpp
#	ggml/src/ggml-virtgpu/virtgpu.h
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-05-04 15:36:13 +08:00
Georgi Gerganov
846262d787
docs : update speculative decoding parameters after refactor (#22397) (#22539)
* docs : update speculative decoding parameters after refactor (#22397)

Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:

- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands

Assisted-by: llama.cpp:local pi

* pi : add rule to use gh CLI for GitHub resources

Assisted-by: llama.cpp:local pi

* docs : run llama-gen-docs

* arg : fix typo
2026-05-04 08:52:07 +03:00
Aldehir Rojas
e48034dfc9
common : determine generation prompt using longest common prefix (#22657) 2026-05-04 00:18:23 +02:00
Concedo
7c70187e26 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/ISSUE_TEMPLATE/019-bug-misc.yml
#	.github/ISSUE_TEMPLATE/020-enhancement.yml
#	.github/ISSUE_TEMPLATE/030-research.yml
#	.github/ISSUE_TEMPLATE/040-refactor.yml
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-ops.h
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/hvx-copy.h
#	ggml/src/ggml-hexagon/htp/hvx-exp.h
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-virtgpu/ggml-backend.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
Adrien Gallouët
beb42fffa4
common : check for null getpwuid in hf-cache (#22550)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 21:32:41 +02:00
Concedo
61478cbf4a Merge commit 'c20c44514a' into concedo_experimental
# Conflicts:
#	.github/workflows/python-type-check.yml
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/htp_iface.idl
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	scripts/jinja/jinja-tester.py
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/sync_vendor.py
#	tests/test-backend-ops.cpp
2026-05-01 00:07:46 +08:00
Ben Guidarelli
c20c44514a
spec: fix argument typo (#22552) 2026-04-30 17:32:32 +03:00
Concedo
37073bc13d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/mmq.cuh
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tests/test-log.cpp
2026-04-30 17:37:52 +08:00
Georgi Gerganov
80afa33aad
spec : fix draft model checkpoints (#22521)
* spec : fix draft model checkpoints

* cont : clean-up

* cont : gate the ngram-mod reset warning behind verbose flag
2026-04-30 08:32:18 +03:00
Aldehir Rojas
d77599234e
common : do not pass prompt tokens to reasoning budget sampler (#22488) 2026-04-29 14:10:58 -05:00
Concedo
45f8ff49bb Merge commit '52e5f0a5c1' into concedo_experimental
# Conflicts:
#	examples/gen-docs/gen-docs.cpp
#	examples/lookup/lookup-create.cpp
#	examples/lookup/lookup-stats.cpp
#	examples/lookup/lookup.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/binary.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/rms_norm_mul.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/ssm_scan.wgsl
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tests/test-reasoning-budget.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/rpc/rpc-server.cpp
#	tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
#	tools/server/webui/src/lib/components/app/chat/ChatSidebar/ChatSidebar.svelte
#	tools/server/webui/src/routes/(chat)/+page.svelte
2026-04-29 22:27:36 +08:00
Georgi Gerganov
683c5acb90
spec : discard last drafted token with low prob (#22506)
2026-04-29 17:00:00 +03:00
Masato Nakasaka
7b95ea5d11
common: Intentionally leak logger instance to fix hanging on Windows (#22273)
* Changed to leak logger singleton to prevent hanging on Windows

* Fix comment

* Stopped using static vector

Using a static std::vector causes g_col to be released before the logger thread exits; the logger thread then touches freed memory and crashes

* Change so all logs are output before exit

* Added debug logging

* added more logging

* Added logging

* Explicitly free logger to avoid hanging on Win

* Reverted to leak logger instance again

* Removed debug log and fixed comment

* Fixed comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-29 10:58:43 +03:00
Jillis ter Hove
52e5f0a5c1
common : re-arm reasoning budget after DONE on new <think> (#22323)
DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response.

Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
2026-04-28 19:15:36 +02:00
Concedo
70be589894 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	examples/debug/debug.cpp
#	examples/eval-callback/eval-callback.cpp
#	ggml/src/ggml-cpu/amx/mmq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
Georgi Gerganov
14e733e36f
spec : refactor params (#22397)
* spec : refactor params

* cont : fix

* cont : rename "sparam" to "sampling"

* cont : add spec params category

* cont : add info about removed arguments

* cont : skip param length check for spec params

* cont : adapt server tests
2026-04-28 09:07:33 +03:00
rankaiyx
42401c72b8
Fix type casting for unaccounted memory calculation (#22424)
2026-04-27 14:31:13 +02:00
Georgi Gerganov
e940b3d468
download : prefer q8_0 when q4_k not available (#22428) 2026-04-27 14:30:29 +02:00
Max Krasnyansky
5594d13224
common: fix missing exports in llama-common (#22340)
* common: refactor common/debug to move abort_on_nan into base_callback_data

Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.

* cont : cleanup

* common : use pimpl in debug.h to reduce header dependencies

Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.

This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-27 08:06:39 +03:00
Concedo
095cfd6354 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	tests/test-chat-auto-parser.cpp
#	tests/test-chat.cpp
2026-04-26 15:57:35 +08:00
Piotr Wilkin (ilintar)
dcad77cc3b
chat: fix handling of space in reasoning markers (#22353)
* chat: fix handling of space in reasoning markers

* fix tests

* whitespace
2026-04-25 21:24:13 +02:00
Georgi Gerganov
98dc1418ea
spec : fix vocab compat checks (#22358) 2026-04-25 20:11:35 +03:00