Concedo
9203b6a051
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# .github/workflows/server-sanitize.yml
# .github/workflows/server-self-hosted.yml
# .github/workflows/server.yml
# .github/workflows/ui-build.yml
# .github/workflows/ui-ci.yml
# .github/workflows/ui-publish.yml
# .gitignore
# CMakeLists.txt
# CODEOWNERS
# scripts/ui-download.cmake
# scripts/xxd.cmake
# tests/test-backend-ops.cpp
# tests/test-reasoning-budget.cpp
# tools/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/server/README.md
2026-05-16 22:56:33 +08:00
Aleksander Grygier
59778f0196
ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming ( #23064 )
...
* webui: Move static build output from `tools/server/public` to `build/ui` directory
* refactor: Move to `tools/ui`
* refactor: rename CMake variables and preprocessor defines
- Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated)
- Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated)
- Backward compat: old vars auto-forward to new ones with DEPRECATION warning
- Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc.
- Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET
- Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines
- Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED
* refactor: rename CLI flags (--webui -> --ui) with backward compat
- Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases)
- Add --ui-config (old --webui-config kept as deprecated alias)
- Add --ui-config-file (old --webui-config-file kept as deprecated alias)
- Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated)
- Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY
- C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields
- Backward compat: old fields synced to new ones in g_params_to_internals
* refactor: update C++ server internals with backward compat
- Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta)
- Rename params.webui usage -> params.ui (both synced, old still works)
- JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys
- Server routes use params.ui_mcp_proxy || params.webui_mcp_proxy
- Preprocessor guards use #if defined(LLAMA_BUILD_UI) || defined(LLAMA_BUILD_WEBUI)
* refactor: rename CI/CD workflows, artifacts, and build script
- Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build
- Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT
- Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks
- Update server.yml: job/artifact refs webui-build -> ui-build
- Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT
- Update server-self-hosted.yml: webui-build -> ui-build
- Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION
- Rename webui-download.cmake -> ui-download.cmake (internal refs updated)
- Update labeler.yml: server/webui -> server/ui path label
* docs: update CODEOWNERS and server README docs
- Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/
- Update server README.md: CLI tables show --ui flags with deprecated --webui aliases
- Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/
* fix: Small fixes for UI build
* fix: CMake.txt syntax
* chore: Formatting
* fix: `.editorconfig` for llama-ui
* chore: Formatting
* refactor: Use `APP_NAME` in Error route
* refactor: Cleanup
* refactor: Single migration service
* make llama-ui a linkable target
* fix: UI Build output
* fix: Missing change
* fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI
* refactor: UI workflows cleanup
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-16 02:02:40 +02:00
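The backward-compat field syncing described in the commit above (old `webui` fields kept alongside new `ui` fields, with old values forwarded and a deprecation warning) can be sketched as follows. All names here are illustrative, not the actual llama.cpp identifiers.

```cpp
#include <cstdio>

// Hypothetical mirror of the deprecated-field forwarding: both the old and
// the new field exist, and whichever the user set is propagated to the other
// so old flags and old call sites keep working.
struct server_params {
    bool webui     = true;  // deprecated field, kept for backward compat
    bool ui        = true;  // new canonical field
    bool webui_set = false; // whether the deprecated flag was passed
};

// Sync the deprecated field into the new one, then keep both views consistent.
void sync_deprecated_fields(server_params & p) {
    if (p.webui_set) {
        std::fprintf(stderr, "warning: --webui is deprecated, use --ui\n");
        p.ui = p.webui; // forward the old value to the new field
    }
    p.webui = p.ui; // old readers still see the effective value
}
```

This is a sketch of the pattern only; the real implementation syncs several fields (config JSON, MCP proxy flag) in `g_params_to_internals` per the commit message.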
Aman Gupta
ac33f032ac
reasoning-budget: clone should do a deep-copy ( #23095 )
2026-05-15 11:59:07 +02:00
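The bug class fixed above (a `clone` that copied shared state shallowly, so two clones mutated the same data) can be illustrated with a minimal sketch; the type names are hypothetical, not the actual llama.cpp sampler types.

```cpp
#include <memory>
#include <vector>

// Hypothetical state owned by a sampler.
struct budget_state {
    std::vector<int> counts;
};

struct budget_sampler {
    std::unique_ptr<budget_state> state = std::make_unique<budget_state>();

    // Deep copy: the clone gets its own state object, so advancing one
    // sampler's budget can no longer affect the other. A shallow clone
    // would have shared the pointer instead.
    budget_sampler clone() const {
        budget_sampler out;
        *out.state = *state; // copy the pointed-to data, not the pointer
        return out;
    }
};
```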
Concedo
da2cc90723
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-and-test-snapdragon.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# .github/workflows/server-self-hosted.yml
# .github/workflows/server-webui.yml
# .github/workflows/server.yml
# .gitignore
# CMakeLists.txt
# CONTRIBUTING.md
# README.md
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# grammars/README.md
# scripts/snapdragon/qdc/run_qdc_jobs.py
# scripts/snapdragon/qdc/tests/run_backend_ops_posix.py
# scripts/snapdragon/qdc/tests/run_bench_tests_posix.py
# scripts/snapdragon/qdc/tests/utils.py
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
# tools/server/webui/src/lib/components/app/server/ServerLoadingSplash.svelte
# tools/server/webui/src/routes/(chat)/chat/[id]/+page.svelte
# ty.toml
2026-05-15 17:09:48 +08:00
Aleksander Grygier
253ba110bc
webui: Move static build output from repo code to HF Bucket ( #22937 )
...
* ci: add workflow to publish webui to Hugging Face bucket
* ci: add webui release job to release workflow
* ci: test webui release job
* chore: Return to default minification strategy for build output files
* ci: extract webui build into separate workflow and job
* chore: Ignore webui static output + clean up references
* chore: Delete legacy webui static output
* chore: Ignore webui build static output
* fix: Workflow
* fix: Versioning naming
* chore: Update package name
* test: Test CI fix
* refactor: Naming
* server: implement webui build strategy with HF Bucket support
* chore: Remove test workflow
* chore: Use WebUI build workflow call in other workflows
* server: HF Buckets fallback for WebUI build
* refactor: App name variable
* refactor: Naming
* fix: Retrieve loading.html
* fix: workflow syntax
* fix: Rewrite malformed release.yml
* fix: Req param
* test: Re-add missing Playwright installation for CI tests
* refactor: Logic & security improvements
* refactor: Retrieve publishing jobs and DRY the workflows
* fix: Test workflow syntax
* fix: Upstream Release Tag for test workflow
* chore: Remove test workflow
* ci: Run WebUI jobs on `ubuntu-24.04-arm`
* refactor: Post-CR cleanup
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: CI cleanup
* refactor: Cleanup
* test: Test workflow
* refactor: use LLAMA_BUILD_NUMBER instead of LLAMA_BUILD_TAG for HF Bucket webui downloads
* server: add fallback mechanism for HF Bucket webui downloads from latest directory
* fix: Incorrect argument order in file(SHA256) calls for checksum verification
* refactor: Use cmake script for handling the HF Bucket download on build time
* feat: support local npm build for WebUI assets
* refactor: add `HF_ENABLED` flag to control WebUI build/download provisioning
* refactor: Cleanup
* chore: Remove test workflow
* fix: remove s390x from release workflow
* fix: add webui-build dependency to ubuntu-22-rocm and windows-hip
* Revert "fix: remove s390x from release workflow"
This reverts commit debcfffa9bc1e3112eae41f2d29741b682e4eb19.
* fix: Release workflow file
* fix: Proper release tag used for HF Bucket upload
* fix: Remove duplicate steps in release workflow
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-14 13:21:41 +02:00
Concedo
cc82c3164e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/workflows/build-cross.yml
# .github/workflows/build-sycl.yml
# .github/workflows/build.yml
# .github/workflows/editorconfig.yml
# .github/workflows/release.yml
# cmake/riscv64-spacemit-linux-gnu-gcc.cmake
# docs/backend/OPENVINO.md
# docs/backend/SYCL.md
# docs/build-riscv64-spacemit.md
# docs/ops.md
# docs/ops/WebGPU.csv
# embd_res/ggml-vocab-qwen35.gguf
# embd_res/ggml-vocab-qwen35.gguf.inp
# embd_res/ggml-vocab-qwen35.gguf.out
# examples/model-conversion/Makefile
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# ggml/src/ggml-zendnn/CMakeLists.txt
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/snapdragon/adb/run-completion.sh
# tests/CMakeLists.txt
# tools/cli/README.md
# tools/completion/README.md
# tools/mtmd/clip-impl.h
# tools/mtmd/clip.cpp
# tools/mtmd/clip.h
# tools/server/README.md
2026-05-14 19:04:04 +08:00
Georgi Gerganov
67b2b7f2f2
logs : reduce ( #23021 )
...
* logs : reduce
* args : fix envs
* server : fix build
* common : print verbosity level at start
* server : clean-up logs
* server : print prompt processing timings + sampling params
* minor : whitespaces
2026-05-14 13:05:52 +03:00
Xuan-Son Nguyen
e75cd5efb5
download: do not exit() on error ( #23008 )
2026-05-13 15:14:58 +02:00
Georgi Gerganov
634275fbbb
spec : update CLI arguments for better consistency ( #22964 )
...
* spec : update CLI arguments for better consistency
* cont : fix CLI arg message
2026-05-13 09:15:39 +03:00
Xuan-Son Nguyen
7bfe120c21
mtmd, server, common: expose modalities to /v1/models ( #22952 )
...
* mtmd, server, common: expose modalities to /v1/models
* fix build
* rename to mtmd_caps
2026-05-12 19:08:07 +02:00
Concedo
f7923b261f
need to fix cuda compile. Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# examples/speculative-simple/README.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-cuda/im2col.cu
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# tests/test-backend-ops.cpp
# tools/cli/README.md
# tools/mtmd/CMakeLists.txt
# tools/server/README.md
2026-05-12 20:47:07 +08:00
Georgi Gerganov
68e7ea3eab
spec : parallel drafting support ( #22838 )
...
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images through the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the already user provided spec types
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-05-11 19:09:43 +03:00
Concedo
2771e16fbc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .devops/nix/package.nix
# .gitignore
# docs/backend/SYCL.md
# docs/ops.md
# docs/ops/SYCL.csv
# ggml/CMakeLists.txt
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-cuda/ggml-cuda.cu
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/fattn-common.hpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/im2col.cpp
# ggml/src/ggml-sycl/im2col.hpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/quants.hpp
# ggml/src/ggml-sycl/vecdotq.hpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# scripts/sync-ggml.last
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
2026-05-11 16:18:28 +08:00
Concedo
9b0b36b5ef
Merge commit '66001722aa' into concedo_experimental
...
# Conflicts:
# README.md
# docs/ops.md
# docs/ops/SYCL.csv
# examples/sycl/start-svr.sh
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-sycl/gated_delta_net.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/pad.cpp
# ggml/src/ggml-sycl/ssm_conv.cpp
# tests/test-backend-ops.cpp
# tests/test-reasoning-budget.cpp
# tools/server/README.md
# tools/server/webui/src/lib/constants/settings-config.ts
2026-05-11 15:40:10 +08:00
Tim Neumann
2e97c5f96f
backend sampling: support returning post-sampling probs ( #22622 )
...
* server: Never return 0.0 post-sampling probabilities
* backend sampling: support returning post-sampling probs
2026-05-10 19:12:02 +02:00
Aldehir Rojas
49956041ee
common : do not wrap raw strings in schema parser for tagged parsers ( #22827 )
2026-05-08 15:33:17 -05:00
Aldehir Rojas
f9cd456ea5
common : revert reasoning budget +inf logit bias ( #22740 )
2026-05-08 17:46:43 +02:00
Concedo
eb30b29d69
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/gguf-publish.yml
# CODEOWNERS
# examples/sycl/test.sh
# pyproject.toml
# tools/mtmd/CMakeLists.txt
# tools/mtmd/README.md
2026-05-08 14:48:57 +08:00
Aldehir Rojas
093be624cc
common/chat : preserve media markers for typed-content templates ( #22634 )
2026-05-07 12:50:56 -05:00
fl0rianr
a0101225bc
common: do not fit to unknown device memory ( #22614 )
...
* common: do not fit to unknown device memory
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: preserve host fallback for non-GPU fit devices
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: keep unknown GPU fit memory at zero
Signed-off-by: Florian Reinle <f.reinle@otec.de>
---------
Signed-off-by: Florian Reinle <f.reinle@otec.de>
2026-05-06 17:03:45 +02:00
Concedo
9e9497f0cc
Merge remote-tracking branch 'origin/upstream' into concedo_experimental
...
# Conflicts:
# examples/save-load-state/save-load-state.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl
# ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl
# ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl
# ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl
# ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# scripts/sync-ggml.last
# scripts/sync_vendor.py
# src/llama-graph.cpp
# tests/test-backend-ops.cpp
# tests/test-state-restore-fragmented.cpp
2026-05-06 21:20:06 +08:00
Concedo
7240da764a
Merge commit '935a340292' into concedo_experimental
...
# Conflicts:
# examples/diffusion/CMakeLists.txt
# scripts/server-test-function-call.py
# src/llama-model.cpp
# src/models/gemma4.cpp
# tests/test-chat.cpp
# tests/test-reasoning-budget.cpp
# tools/server/README.md
2026-05-06 21:02:25 +08:00
Adrien Gallouët
2635ac76e8
common : fix missing-noreturn warnings when compiling with clang 21 ( #22702 )
...
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3719 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3726 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3733 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3740 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3747 | [](common_params & /*params*/, int /*value*/) {
| ^
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 13:16:25 +03:00
Adrien Gallouët
bf76ac77be
common : only load backends when required ( #22290 )
...
* common : only load backends when required
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama : call ggml_backend_load_all() directly from llama_backend_init()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add ggml_backend_load_all() where llama_backend_init() is not used
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Georgi Gerganov
d6e7b033a4
llama : add option to save memory in device buffers ( #22679 )
...
* llama : add option to save memory in device buffers
* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Shakhnazar Sailaukan
d8794eecd5
examples: refactor diffusion generation ( #22590 )
...
* examples: refactor diffusion generation
* renamed enum values
2026-05-04 20:19:30 +08:00
Piotr Wilkin (ilintar)
a4701c98f7
common/autoparser: fixes for newline handling / forced tool calls ( #22654 )
...
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
2026-05-04 13:18:11 +02:00
Evan Huus
c84e6d6db5
server: Add a simple get_datetime server tool ( #22649 )
2026-05-04 12:19:41 +02:00
Concedo
2905c6254f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .pi/gg/SYSTEM.md
# docs/speculative.md
# ggml/src/ggml-virtgpu/virtgpu-shm.cpp
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-05-04 15:36:13 +08:00
Georgi Gerganov
846262d787
docs : update speculative decoding parameters after refactor ( #22397 ) ( #22539 )
...
* docs : update speculative decoding parameters after refactor (#22397 )
Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397 :
- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands
Assisted-by: llama.cpp:local pi
* pi : add rule to use gh CLI for GitHub resources
Assisted-by: llama.cpp:local pi
* docs : run llama-gen-docs
* arg : fix typo
2026-05-04 08:52:07 +03:00
Aldehir Rojas
e48034dfc9
common : determine generation prompt using longest common prefix ( #22657 )
2026-05-04 00:18:23 +02:00
Concedo
7c70187e26
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-copy.h
# ggml/src/ggml-hexagon/htp/hvx-exp.h
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
Adrien Gallouët
beb42fffa4
common : check for null getpwuid in hf-cache ( #22550 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 21:32:41 +02:00
Concedo
61478cbf4a
Merge commit 'c20c44514a' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# examples/speculative/speculative.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# scripts/jinja/jinja-tester.py
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
2026-05-01 00:07:46 +08:00
Ben Guidarelli
c20c44514a
spec: fix argument typo ( #22552 )
2026-04-30 17:32:32 +03:00
Concedo
37073bc13d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/mmq.cuh
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-log.cpp
2026-04-30 17:37:52 +08:00
Georgi Gerganov
80afa33aad
spec : fix draft model checkpoints ( #22521 )
...
* spec : fix draft model checkpoints
* cont : clean-up
* cont : gate the ngram-mod reset warning behind verbose flag
2026-04-30 08:32:18 +03:00
Aldehir Rojas
d77599234e
common : do not pass prompt tokens to reasoning budget sampler ( #22488 )
2026-04-29 14:10:58 -05:00
Concedo
45f8ff49bb
Merge commit '52e5f0a5c1' into concedo_experimental
...
# Conflicts:
# examples/gen-docs/gen-docs.cpp
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/binary.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/rms_norm_mul.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/ssm_scan.wgsl
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-reasoning-budget.cpp
# tools/llama-bench/llama-bench.cpp
# tools/rpc/rpc-server.cpp
# tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
# tools/server/webui/src/lib/components/app/chat/ChatSidebar/ChatSidebar.svelte
# tools/server/webui/src/routes/(chat)/+page.svelte
2026-04-29 22:27:36 +08:00
Georgi Gerganov
683c5acb90
spec : discard last drafted token with low prob ( #22506 )
2026-04-29 17:00:00 +03:00
Masato Nakasaka
7b95ea5d11
common: Intentionally leak logger instance to fix hanging on Windows ( #22273 )
...
* Changed to leak logger singleton to prevent hanging on Windows
* Fix comment
* Stopped using static vector
Using std::vector will cause g_col to be released before the logger thread exits, causing the logger thread to touch freed memory causing a crash
* Change so all logs are output before exit
* Added debug logging
* added more logging
* Added logging
* Explicitly free logger to avoid hanging on Win
* Reverted to leak logger instance again
* Removed debug log and fixed comment
* Fixed comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-29 10:58:43 +03:00
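The "intentionally leak the singleton" pattern the commit above settles on can be sketched as below (hypothetical names). The idea is that on Windows, running a logger's destructor during static deinitialization can hang or touch freed memory while a background thread is still flushing, so the instance is heap-allocated and never freed.

```cpp
#include <cstdio>

// Stand-in for a logger that, in the real code, owns a worker thread.
struct logger {
    void log(const char * msg) {
        std::fprintf(stderr, "%s\n", msg);
    }
};

logger & get_logger() {
    // Deliberately leaked: the OS reclaims the memory at process exit, and
    // we avoid running the destructor (and joining the worker thread)
    // during shutdown, which is where the Windows hang occurred.
    static logger * inst = new logger();
    return *inst;
}
```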
Jillis ter Hove
52e5f0a5c1
common : re-arm reasoning budget after DONE on new <think> ( #22323 )
...
The DONE state absorbed all tokens, including a new start tag, so any think blocks after the first ran unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF, which interleaves multiple <think> blocks per response.
Fixed by advancing start_matcher in the DONE branch and re-arming to COUNTING with a fresh budget on match. Adds a regression test (test-reasoning-budget: test 6).
2026-04-28 19:15:36 +02:00
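The state-machine fix described in the commit above can be sketched as follows. This is a deliberately simplified model (the real code matches the start tag incrementally via a matcher and also handles the closing tag); all names are hypothetical.

```cpp
#include <string>

enum class budget_state { IDLE, COUNTING, DONE };

// Minimal reasoning-budget tracker: COUNTING spends the budget, DONE means
// the budget is exhausted for the current <think> block.
struct reasoning_budget {
    budget_state state  = budget_state::IDLE;
    int          budget = 0;
    int          used   = 0;

    void on_token(const std::string & tok, int budget_per_block) {
        switch (state) {
            case budget_state::IDLE:
                if (tok == "<think>") {
                    state  = budget_state::COUNTING;
                    budget = budget_per_block;
                    used   = 0;
                }
                break;
            case budget_state::COUNTING:
                if (++used >= budget) {
                    state = budget_state::DONE;
                }
                break;
            case budget_state::DONE:
                // The fix: DONE no longer absorbs a new start tag. A fresh
                // <think> re-arms COUNTING with a fresh budget, so later
                // think blocks are budgeted too.
                if (tok == "<think>") {
                    state  = budget_state::COUNTING;
                    budget = budget_per_block;
                    used   = 0;
                }
                break;
        }
    }
};
```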
Concedo
70be589894
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# examples/debug/debug.cpp
# examples/eval-callback/eval-callback.cpp
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
Georgi Gerganov
14e733e36f
spec : refactor params ( #22397 )
...
* spec : refactor params
* cont : fix
* cont : rename "sparam" to "sampling"
* cont : add spec params category
* cont : add info about removed arguments
* cont : skip param length check for spec params
* cont : adapt server tests
2026-04-28 09:07:33 +03:00
rankaiyx
42401c72b8
Fix type casting for unaccounted memory calculation ( #22424 )
2026-04-27 14:31:13 +02:00
Georgi Gerganov
e940b3d468
download : prefer q8_0 when q4_k not available ( #22428 )
2026-04-27 14:30:29 +02:00
Max Krasnyansky
5594d13224
common: fix missing exports in llama-common ( #22340 )
...
* common: refactor common/debug to move abort_on_nan into base_callback_data
Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.
* cont : cleanup
* common : use pimpl in debug.h to reduce header dependencies
Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.
This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-27 08:06:39 +03:00
Concedo
095cfd6354
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# tests/test-chat-auto-parser.cpp
# tests/test-chat.cpp
2026-04-26 15:57:35 +08:00
Piotr Wilkin (ilintar)
dcad77cc3b
chat: fix handling of space in reasoning markers ( #22353 )
...
* chat: fix handling of space in reasoning markers
* fix tests
* whitespace
2026-04-25 21:24:13 +02:00
Georgi Gerganov
98dc1418ea
spec : fix vocab compat checks ( #22358 )
2026-04-25 20:11:35 +03:00