Concedo
0755f27372
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/openvino.Dockerfile
# .github/workflows/build-self-hosted.yml
# .github/workflows/build.yml
# common/chat.cpp
# docs/backend/OPENVINO.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-openvino/ggml-decoder.cpp
# ggml/src/ggml-openvino/ggml-openvino-extra.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-openvino/ggml-quants.cpp
# ggml/src/ggml-openvino/openvino/op/rope.cpp
# ggml/src/ggml-openvino/openvino/op_table.cpp
# ggml/src/ggml-openvino/openvino/op_table.h
# ggml/src/ggml-openvino/openvino/translate_session.cpp
# ggml/src/ggml-openvino/openvino/utils.cpp
# ggml/src/ggml-openvino/openvino/utils.h
# ggml/src/ggml-openvino/utils.cpp
# ggml/src/ggml-openvino/utils.h
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/mtmd/CMakeLists.txt
# tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
Ethan Turner
750579ff14
common: Refactoring sampler parameters ( #20429 ) ( #22233 )
...
This change moves the reasoning_budget_message parameter from the
common params into the sampling parameters. It also removes the
reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.
Issue: https://github.com/ggml-org/llama.cpp/issues/20429
Original PR: https://github.com/ggml-org/llama.cpp/pull/20297
2026-04-22 10:40:19 +02:00
Georgi Gerganov
84652b80cf
arg : add --spec-default ( #22223 )
2026-04-21 19:52:02 +03:00
Concedo
19a12bb080
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# common/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/sync-ggml.last
# tools/cli/cli.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device ( #22171 )
...
* fit-params : add option to output estimated memory per device
* cont : minor
* cont : refactor
* cont : move fit params implementation to libcommon
* cont : header
* cont : headers
* cont : codeowners
2026-04-21 09:54:36 +03:00
Concedo
cd6788007e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cross.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# examples/llama.android/lib/src/main/cpp/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/test-chat.cpp
# tests/test-mtmd-c-api.c
# tools/server/README.md
2026-04-20 20:19:11 +08:00
Georgi Gerganov
de71b5f81c
server : refactor "use checkpoint" logic ( #22114 )
2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots ( #21741 )
2026-04-20 08:30:24 +03:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common ( #21936 )
...
* cmake : allow libcommon to be shared
* cmake : rename libcommon to libllama-common
* cont : set -fPIC for httplib
* cont : export all symbols
* cont : fix build_info exports
* libs : add libllama-common-base
* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Concedo
4c860ae4ae
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/download.cpp
# docs/backend/OPENVINO.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/argsort-ops.c
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-hexagon/htp/cumsum-ops.c
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/repeat-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/ssm-conv.c
# ggml/src/ggml-hexagon/htp/sum-rows-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# models/templates/google-gemma-4-31B-it-interleaved.jinja
# models/templates/google-gemma-4-31B-it.jinja
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
2026-04-11 11:19:32 +08:00
Concedo
a165a73120
Merge commit 'd6f3030047' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/amx/amx.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend-buffer.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# pyproject.toml
# requirements/requirements-convert_legacy_llama.txt
# requirements/requirements-tool_bench.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-llama-archs.cpp
# tests/test-tokenizer-0.py
# tests/test-tokenizer-random.py
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-11 11:10:55 +08:00
Adrien Gallouët
05b3caaa48
common : add callback interface for download progress ( #21735 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-10 22:17:00 +02:00
Johannes Gäßler
0893f50f2d
common: mark --split-mode tensor as experimental ( #21684 )
2026-04-10 12:27:27 +02:00
Johannes Gäßler
d6f3030047
ggml: backend-agnostic tensor parallelism (experimental) ( #19378 )
...
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8 )
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9 )
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7 )
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11 )
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12 )
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16 )
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17 )
* meta : formatting, naming, indentation (#18 )
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-09 16:42:19 +02:00
Concedo
2e4f94822e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-self-hosted.yml
# .github/workflows/docker.yml
# ci/run.sh
# docs/build.md
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# src/llama-vocab.cpp
# tests/test-chat.cpp
# tests/test-jinja.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-04-04 14:27:23 +08:00
Yes You Can Have Your Own
50e0ad08fb
server: save and clear idle slots on new task (--clear-idle) ( #20993 )
...
* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)
* server: move idle slot KV clearing to slot release
The save "cost" is now paid by the finishing request.
* server: add --kv-clear-idle flag, enable by default
* server: skip clearing last idle slot, clear on launch
* server: test --no-kv-clear-idle flag
* server: simplify on-release clearing loop
* server: remove on-release KV clearing, keep launch-only
* cont : clean-up
* tests: update log strings after --clear-idle rename
* tests: use debug tags instead of log message matching
* test: fix Windows CI by dropping temp log file unlink
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-03 19:02:27 +02:00
Concedo
34ad53e950
merged support for gemma4. the e2b, e4b and 26b work, the 31b does not
2026-04-03 11:07:46 +08:00
Ruben Ortlam
5803c8d115
tests: allow exporting graph ops from HF file without downloading weights ( #21182 )
...
* tests: allow exporting graph ops from HF file without downloading weights
* use unique_ptr for llama_context in HF metadata case
* fix missing non-required tensors falling back to type f32
* use unique pointers where possible
* use no_alloc instead of fixing f32 fallback
* fix missing space
2026-04-02 18:19:20 +02:00
Concedo
42ad89cd86
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/cann.Dockerfile
# .devops/cpu.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/nix/package.nix
# .github/workflows/build-android.yml
# .github/workflows/build-cann.yml
# .github/workflows/build-msys.yml
# .github/workflows/docker.yml
# .github/workflows/editorconfig.yml
# .github/workflows/gguf-publish.yml
# .github/workflows/python-lint.yml
# .github/workflows/release.yml
# CMakeLists.txt
# docs/backend/CANN.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-rpc/ggml-rpc.cpp
# scripts/sync_vendor.py
# tests/test-chat-auto-parser.cpp
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-reasoning-budget.cpp
# tools/cli/cli.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2026-03-30 20:45:38 +08:00
Concedo
aac220f7e3
Merge commit '0fac87b157' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/hip-quality-check.yml
# docs/multimodal.md
# scripts/hip/gcn-cdna-vgpr-check.py
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tests/test-llama-archs.cpp
# tools/imatrix/imatrix.cpp
# tools/mtmd/CMakeLists.txt
2026-03-29 01:14:33 +08:00
Adrien Gallouët
5c1a7b8355
server : add custom socket options to disable SO_REUSEPORT ( #21056 )
...
* server : add custom socket options to disable SO_REUSEPORT
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --reuse-port
$ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
$ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update tools/server/README.md (llama-gen-docs)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix windows
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 01:12:43 +01:00
AN Long
48cda24c11
server: remove the verbose_prompt parameter ( #21059 )
...
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf375b2c8ba641c661f3667df70b9797f4.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes
2026-03-27 13:36:13 +02:00
Xuan-Son Nguyen
20197b6fe3
server: add built-in tools backend support ( #20898 )
...
* wip: server_tools
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* change arg to --tools all
* add readme mention
* llama-gen-docs
2026-03-27 10:07:11 +01:00
Concedo
633222d2e3
fix tool builds
2026-03-26 15:15:58 +08:00
Adrien Gallouët
056b50c319
common : fix verbosity setup ( #20989 )
...
The verbosity threshold was set at the end of common_params_parse_ex(),
after doing many things (like downloading files...).
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:41:01 +01:00
Concedo
c00fe0af5a
Merge commit '9f102a1407' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/pull_request_template.md
# CODEOWNERS
# README.md
# common/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/hex-dma.c
# ggml/src/ggml-hexagon/htp/hex-dma.h
# ggml/src/ggml-hexagon/htp/hex-dump.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/ssm-conv.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# scripts/snapdragon/adb/run-bench.sh
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
# tools/llama-bench/llama-bench.cpp
2026-03-25 23:45:41 +08:00
Adrien Gallouët
8c7957ca33
common : add standard Hugging Face cache support ( #20775 )
...
* common : add standard Hugging Face cache support
- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check with the quant tag
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve error handling and report API errors
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Restore common_cached_model_info and align mmproj filtering
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Prefer main when getting cached ref
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use cached files when HF API fails
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use final_path..
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check all inputs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 07:30:33 +01:00
Concedo
ef854f002e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# AGENTS.md
# CONTRIBUTING.md
# examples/model-conversion/scripts/embedding/run-original-model.py
# examples/model-conversion/scripts/utils/compare_tokens.py
# examples/pydantic_models_to_grammar.py
# ggml/src/ggml-rpc/ggml-rpc.cpp
# pyrightconfig.json
# scripts/compare-llama-bench.py
# scripts/jinja/jinja-tester.py
# scripts/server-bench.py
# tests/test-grammar-integration.cpp
# tests/test-grammar-parser.cpp
# tests/test-llama-grammar.cpp
# tests/test-tokenizer-random.py
# tools/cli/README.md
# tools/completion/README.md
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2026-03-22 23:39:13 +08:00
ddh0
3306dbaef7
misc : prefer ggml-org models in docs and examples ( #20827 )
...
* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.
* remove accidentally committed file
2026-03-21 22:00:26 +01:00
Concedo
6054bacadd
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/ai-issues.yml
# CONTRIBUTING.md
# docs/autoparser.md
# docs/ops.md
# docs/ops/Metal.csv
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hex-dma.h
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hip/CMakeLists.txt
# models/templates/Apriel-1.6-15b-Thinker-fixed.jinja
# models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
# models/templates/deepseek-ai-DeepSeek-V3.1.jinja
# models/templates/llama-cpp-deepseek-r1.jinja
# models/templates/meetkai-functionary-medium-v3.1.jinja
# scripts/fetch_server_test_models.py
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/adb/run-tool.sh
# tests/test-chat-auto-parser.cpp
# tests/test-chat-peg-parser.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/server/README.md
2026-03-21 12:06:01 +08:00
Concedo
98f099aecc
Merge commit 'c1258830b2' into concedo_experimental
...
# Conflicts:
# docs/docker.md
# docs/ops.md
# docs/ops/WebGPU.csv
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
2026-03-21 12:00:52 +08:00
Piotr Wilkin (ilintar)
5e54d51b19
common/parser: add proper reasoning tag prefill reading ( #20424 )
...
* Implement proper prefill extraction
* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp
* Update tools/server/server-task.cpp
* refactor: move grammars to variant, remove grammar_external, handle exception internally
* Make code less C++y
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
ddh0
922b90e567
common : add LLAMA_ARG_SPEC_TYPE ( #20744 )
2026-03-19 16:16:55 +01:00
Concedo
48f914e374
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ci/run.sh
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/arch/riscv/repack.cpp
# ggml/src/ggml-cpu/arch/x86/repack.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-exp.h
# ggml/src/ggml-hexagon/htp/hvx-sigmoid.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync-ggml.last
# tests/test-backend-sampler.cpp
# tests/test-chat.cpp
# tests/test-jinja.cpp
# tools/cli/cli.cpp
2026-03-19 02:23:06 +08:00
Piotr Wilkin (ilintar)
d2ecd2d1cf
common/parser: add --skip-chat-parsing to force a pure content parser. ( #20289 )
...
* Add `--force-pure-content` to force a pure content parser.
* Update common/arg.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Change parameter name [no ci]
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Concedo
b1c500ae2b
Merge commit '2948e6049a' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CONTRIBUTING.md
# docs/backend/VirtGPU/development.md
# docs/ops.md
# docs/ops/WebGPU.csv
# embd_res/templates/GigaChat3-10B-A1.8B.jinja
# embd_res/templates/GigaChat3.1-10B-A1.8B.jinja
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-grammar-integration.cpp
# tests/test-quantize-fns.cpp
2026-03-15 11:21:24 +08:00
Concedo
67c9798d0b
Merge commit '3ca19b0e9f' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# common/CMakeLists.txt
# common/chat-peg-parser.cpp
# docs/backend/SYCL.md
# docs/ops.md
# docs/ops/SYCL.csv
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/rope.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/compare-llama-bench.py
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tools/cli/cli.cpp
2026-03-15 11:11:31 +08:00
Concedo
1802b09e6f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/build.md
# docs/ops.md
# docs/ops/CPU.csv
# ggml/src/ggml-cpu/kleidiai/kernels.cpp
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-cpu/repack.h
# src/llama-quant.cpp
# tests/test-json-schema-to-grammar.cpp
2026-03-14 17:56:16 +08:00
Concedo
04915d99ee
Merge commit '451ef08432' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/ops.md
# docs/ops/Vulkan.csv
# src/llama-model-loader.cpp
# src/llama-model.cpp
# src/llama.cpp
# tests/CMakeLists.txt
# tests/peg-parser/test-basic.cpp
# tests/peg-parser/test-json-parser.cpp
# tests/peg-parser/test-python-dict-parser.cpp
# tests/peg-parser/test-unicode.cpp
# tests/test-chat-auto-parser.cpp
# tests/test-chat-peg-parser.cpp
# tests/test-chat.cpp
# tools/CMakeLists.txt
2026-03-13 23:33:37 +08:00
Ruben Ortlam
128142fe7d
test-backend-ops: allow loading tests from file and parsing model operators into file ( #19896 )
...
* tests: allow loading test-backend-ops tests from json
* add error threshold based on op
* add error when file cannot be read
* add graph operator json extraction tool
* add nb parameter for non-contiguous input tensors
* fix view check
* only use view if non-contiguous/permuted, use C++ random instead of rand()
* replace internal API calls with public llama_graph_reserve call
* reduce test description length
* fix nb[0] not getting set for view
* add name to tests
* fix inplace error
* use text file instead of json
* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/
* fix missing declaration
* use pragma once
* fix indent
* fix Windows build
2026-03-12 13:26:00 +01:00
Daniel Bevenius
6de1bc631d
common : update completion executables list [no ci] ( #19934 )
...
This commit updates the bash completion executables list, adding missing
executables and removing some that no longer exist.
2026-03-12 12:12:01 +01:00
Piotr Wilkin (ilintar)
acb7c79069
common/parser: handle reasoning budget ( #20297 )
...
* v1
* Finished!
* Handle cli
* Reasoning sampler
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Less explosive terminology :)
* Add utf-8 case and tests
* common : migrate reasoning budget sampler to common
* cont : clean up
* cont : expose state and allow passing as initial state
* cont : remove unused imports
* cont : update state machine doc string
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Concedo
6adcd0b5db
Merge commit '34df42f7be' into concedo_experimental
...
# Conflicts:
# README.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-arith.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-inverse.h
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# tests/test-backend-ops.cpp
# tools/cli/cli.cpp
# tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
2026-03-10 22:20:04 +08:00
Concedo
746664fde6
Merge commit '2cd20b72ed' into concedo_experimental
...
# Conflicts:
# CONTRIBUTING.md
# docs/backend/CANN.md
# docs/backend/SYCL.md
# docs/backend/snapdragon/README.md
# docs/backend/snapdragon/windows.md
# docs/build.md
# docs/multimodal/MobileVLM.md
# docs/ops.md
# docs/ops/WebGPU.csv
# examples/debug/README.md
# examples/llama.vim
# examples/model-conversion/README.md
# examples/sycl/README.md
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-cpu/arch/x86/repack.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp-drv.cpp
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-copy.h
# ggml/src/ggml-hexagon/htp/hvx-inverse.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/worker-pool.c
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cpy.cl
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/quants.hpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/pr2wt.sh
# scripts/server-bench.py
# scripts/snapdragon/windows/run-cli.ps1
# tests/test-alloc.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/completion/README.md
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/README.md
# tools/perplexity/README.md
# tools/server/public_simplechat/readme.md
# tools/server/tests/README.md
2026-03-10 22:11:08 +08:00
Sigbjørn Skjæret
ec947d2b16
common : fix incorrect uses of stoul ( #20313 )
2026-03-10 11:40:26 +01:00
Johannes Gäßler
a976ff081b
llama: end-to-end tests ( #19802 )
...
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
2026-03-08 12:30:21 +01:00
Piotr Wilkin (ilintar)
f5ddcd1696
Checkpoint every n tokens: squash ( #20087 )
2026-03-06 11:39:26 +01:00
Aleksander Grygier
f6235a41ef
webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts ( #18655 )
2026-03-06 10:00:39 +01:00
Marcel Petrick
92f7da00b4
chore : correct typos [no ci] ( #20041 )
...
* fix(docs): correct typos found during code review
Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.
Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
* Update docs/backend/CANN.md
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"
This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.
* Update tests/test-backend-ops.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00