Commit graph

1029 commits

Author SHA1 Message Date
Concedo
8d173f50c2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/backend/SYCL.md
#	docs/backend/snapdragon/CMakeUserPresets.json
#	docs/backend/snapdragon/README.md
#	docs/backend/snapdragon/developer.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	embd_res/templates/upstage-Solar-Open-100B.jinja
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
#	tests/test-chat.cpp
2026-01-30 15:32:59 +08:00
Aldehir Rojas
7b7ae857f6
chat : add parsing for solar-open-100b (#18540)
* chat : add parsing for solar-open-100b

* add comments to rules

* cont : make assistant start optional

* cont : remove assistant start prefix altogether

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-01-29 16:06:15 +01:00
Concedo
7e755014b2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/winget.yml
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/arg.cpp
#	docs/ops/SYCL.csv
#	examples/lookup/lookup-create.cpp
#	examples/lookup/lookup-stats.cpp
#	examples/lookup/lookup.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	tests/test-chat-template.cpp
2026-01-29 23:05:05 +08:00
Concedo
46cd17c17e Merge commit '88d23ad515' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	docs/build.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zendnn/CMakeLists.txt
#	tests/test-chat-template.cpp
2026-01-29 22:25:56 +08:00
Sigbjørn Skjæret
b45ef2702c
jinja : do not pass empty tools and add some none filters (#19176)
2026-01-29 14:06:54 +01:00
Georgi Gerganov
eed25bc6b0
arg : add -kvu to llama-batched-bench (#19172) 2026-01-29 08:50:47 +02:00
Sascha Rogmann
72d3b1898a
spec : add self-speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
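For context on "draftless" speculation: the idea, as reflected in the new common/ngram-map.*, is to draft tokens from the model's own context instead of a separate draft model. A minimal sketch of that scheme follows; the helper name and signature are hypothetical and do not match the actual ngram-map API.

```cpp
// Draftless (ngram-based) speculation sketch: find an earlier occurrence
// of the current n-gram suffix in the context and propose the tokens that
// followed it as the draft. The target model then verifies the draft in a
// single batch. Hypothetical names; not the real common/ngram-map API.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

using llama_token = int32_t;

static std::vector<llama_token> draft_from_history(
        const std::vector<llama_token> & ctx, // tokens seen so far
        size_t n,                             // n-gram length to match
        size_t n_draft) {                     // max draft tokens to propose
    std::vector<llama_token> draft;
    if (ctx.size() < n + 1) {
        return draft;
    }
    const llama_token * suffix = ctx.data() + ctx.size() - n; // last n tokens
    // scan backwards so the most recent earlier match wins
    for (size_t i = ctx.size() - n; i-- > 0; ) {
        if (std::equal(suffix, suffix + n, ctx.data() + i)) {
            // propose the continuation that followed the matched n-gram
            for (size_t j = i + n; j < ctx.size() && draft.size() < n_draft; ++j) {
                draft.push_back(ctx[j]);
            }
            break;
        }
    }
    return draft;
}

int main() {
    // suffix {1, 2} also occurs at the start; draft its continuation 3, 4, 1
    const std::vector<llama_token> ctx = {1, 2, 3, 4, 1, 5, 6, 1, 2};
    for (llama_token t : draft_from_history(ctx, 2, 3)) {
        printf("draft token: %d\n", t);
    }
    return 0;
}
```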
Sigbjørn Skjæret
60368e1d73
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
* undefined is treated as iterable (string/array) by filters

`tojson` is not a supported `undefined` filter

* add tests

* add sequence and iterable tests

keep it DRY and fix some types
2026-01-28 14:40:29 +01:00
Georgi Gerganov
631cbfcc7a
cuda : fix "V is K view" check for non-unified KV cache (#19145) 2026-01-28 09:15:27 +02:00
Georgi Gerganov
c5c64f72ac
llama : disable Direct IO by default (#19109)
* llama : disable Direct IO by default

* cont : override mmap if supported
2026-01-28 09:11:13 +02:00
Sigbjørn Skjæret
2b4cbd2834
jinja : implement mixed type object keys (#18955)
* implement mixed type object keys

* add tests

* refactor

* minor fixes

* massive refactor

* add more tests

* forgotten tuples

* fix array/object is_hashable

* correct (albeit broken) jinja responses

verified with transformers

* improved hashing and equality

* refactor hash function

* more exhaustive test case

* clean up

* cont

* cont (2)

* missing cstring

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-01-27 19:50:42 +01:00
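A sketch of what mixed-type object keys entail, assuming keys may be strings, integers, or booleans: the hash mixes in the type index so keys of different types stay distinct. Whether the real engine follows Python in treating true and 1 as the same key is defined in common/jinja, not here.

```cpp
// One way to hash mixed-type object keys; the actual common/jinja
// implementation may differ in both key types and hashing strategy.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <unordered_map>
#include <variant>

using jinja_key = std::variant<std::string, int64_t, bool>;

struct jinja_key_hash {
    size_t operator()(const jinja_key & k) const {
        const size_t h = std::visit([](const auto & v) {
            return std::hash<std::decay_t<decltype(v)>>{}(v);
        }, k);
        // mix in the alternative index so distinct key types rarely collide
        return h ^ (k.index() + 0x9e3779b9 + (h << 6) + (h >> 2));
    }
};

// an object whose keys may be of mixed type
using jinja_object = std::unordered_map<jinja_key, std::string, jinja_key_hash>;

int main() {
    jinja_object obj;
    obj[std::string("a")] = "string key";
    obj[int64_t(1)]       = "integer key";
    obj[true]             = "boolean key";
    printf("%zu entries\n", obj.size()); // 3: each key type is distinct here
    return 0;
}
```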
Concedo
f6ece6fd37 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/check-vendor.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/gguf-publish.yml
#	.github/workflows/labeler.yml
#	.github/workflows/pre-tokenizer-hashes.yml
#	.github/workflows/python-check-requirements.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/python-type-check.yml
#	.github/workflows/server.yml
#	.github/workflows/update-ops-docs.yml
#	README.md
#	docs/build.md
#	examples/model-conversion/scripts/utils/perplexity-gen.sh
#	examples/model-conversion/scripts/utils/perplexity-run-simple.sh
#	examples/model-conversion/scripts/utils/perplexity-run.sh
#	examples/model-conversion/scripts/utils/quantize.sh
#	examples/model-conversion/scripts/utils/run-embedding-server.sh
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-opencl/kernels/mul_mv_q6_k_f32.cl
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/compare-llama-bench.py
#	tests/test-backend-ops.cpp
#	tests/test-gguf.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-01-27 23:06:13 +08:00
Daniel Bevenius
fc3cdf32ce
common : clarify HTTPS build options in error message (#19103)
* common : clarify HTTPS build options in error message

This commit updates the HTTPS error message to provide clearer
instructions for users who encounter the "HTTPS is not supported" error.

The motivation for this is that it might not be clear to users that only
one of these options is needed to enable HTTPS support.
The LLAMA_OPENSSL option is also added to the message to cover all
possible build configurations.

* clarify that OpenSSL is the default for HTTPS support
2026-01-27 06:16:00 +01:00
Daniel Bevenius
16639ba217
common : use two decimal places for float arg help messages (#19048)
* common : use two decimal places for float arg help messages

This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.

The motivation for this change is that having only one decimal place
means that values generated using --help or llama-gen-docs will not
display the correct defaults.

For example, the value of top-p in tools/server/README.md is currently
`0.9`, but the default value is actually `0.95`. Running llama-gen-docs
does not update this value because it uses the output from the help
message, which shows only one decimal place, so the values look
unchanged.

* docs : run llama-gen-docs to update docs
2026-01-25 07:31:42 +01:00
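The rounding problem is easy to reproduce; a minimal illustration, assuming printf-style formatting similar in spirit to what arg.cpp does:

```cpp
// With one decimal place the printed default misreports the real value.
#include <cstdio>

int main() {
    const float top_p = 0.95f;
    printf("top-p (one decimal):  %.1f\n", top_p); // prints 0.9
    printf("top-p (two decimals): %.2f\n", top_p); // prints 0.95
    return 0;
}
```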
Johannes Gäßler
e9fd8dcab4
llama-fit-params: keep explicit --ctx-size 0 (#19070) 2026-01-24 22:13:08 +01:00
Xuan-Son Nguyen
bb02f74c61
chat: fix language input for translategemma (#19052)
* chat: fix language input for translategemma

* Update common/chat.cpp

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-01-24 17:58:45 +01:00
Xuan-Son Nguyen
b5b8fa1c8b
chat : fix translategemma crash on common_chat_format_example (#19019) 2026-01-23 12:03:42 +01:00
Concedo
e8e7c357c9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-cache.yml
#	.github/workflows/build-cmake-pkg.yml
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	.github/workflows/check-vendor.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/copilot-setup-steps.yml
#	.github/workflows/docker.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/gguf-publish.yml
#	.github/workflows/labeler.yml
#	.github/workflows/pre-tokenizer-hashes.yml
#	.github/workflows/python-check-requirements.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/python-type-check.yml
#	.github/workflows/release.yml
#	.github/workflows/server-webui.yml
#	.github/workflows/server.yml
#	.github/workflows/update-ops-docs.yml
#	.github/workflows/winget.yml
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	requirements/requirements-tool_bench.txt
#	src/CMakeLists.txt
#	src/llama-quant.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/server/README.md
2026-01-23 14:27:04 +08:00
Xuan-Son Nguyen
51fa458a92
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
2026-01-22 21:30:06 +01:00
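A sketch of a request body using the new field, built with nlohmann::json as vendored by the repo; field placement follows the commit title, and the exact accepted schema is whatever the added docs specify.

```cpp
// OpenAI-compatible request carrying a previous assistant turn's
// reasoning back to the server alongside its content.
#include <cstdio>
#include <nlohmann/json.hpp> // vendored in the repo

using json = nlohmann::ordered_json;

int main() {
    const json body = {
        {"messages", json::array({
            {{"role", "user"},      {"content", "What is 2 + 2?"}},
            {{"role", "assistant"}, {"content", "4"},
             {"reasoning_content", "Add 2 and 2 to get 4."}}, // preserved input
            {{"role", "user"},      {"content", "And times 3?"}},
        })},
    };
    printf("%s\n", body.dump(2).c_str());
    return 0;
}
```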
Piotr Wilkin (ilintar)
c301172f66
jinja: support none|string (#18995)
* jinja: support none|string

* Update common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add as_string()

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-21 19:24:37 +01:00
Adrien Gallouët
14be5a39b1
common : improve error message when HTTPS is missing but required (#18987)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-21 17:58:38 +01:00
Concedo
4984c9bc16 Merge commit '12a4a47e6a' into concedo_experimental
# Conflicts:
#	ci/run.sh
#	examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
#	examples/model-conversion/scripts/causal/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	tests/CMakeLists.txt
#	tests/test-chat-parser.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
2026-01-21 21:00:44 +08:00
Adrien Gallouët
1c7cf94b22
common, server : use the same User-Agent by default (#18957)
This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-20 18:28:43 +01:00
Xuan-Son Nguyen
2c1f199653
cli : fix reasoning responses in CLI (#18961)
* cli : fix reasoning responses in CLI

* fix build

* fix build (2)
2026-01-20 18:23:25 +01:00
Xuan-Son Nguyen
6df686bee6
server : refactor oai_parser_opt, move it to server_chat_params (#18937)
* server_chat_params

* move chat format into CLI

* use meta whenever possible

* clean up, no more chatml fallback
2026-01-19 23:28:01 +01:00
Sigbjørn Skjæret
959ecf7f23
jinja : fix undefined keys and attributes and int/float as bool (#18924)
* fix undefined keys and attributes

* add falsy tests

* as_bool for integers and floats

* more falsy/truthy tests

* --typo
2026-01-19 20:29:43 +01:00
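In miniature, the truthiness rule this lands on: integers and floats are falsy exactly when zero, mirroring Python. The helpers below are hypothetical; the engine's value class covers far more types.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

static bool as_bool(int64_t v) { return v != 0; }
static bool as_bool(double  v) { return v != 0.0; }
static bool as_bool(const std::string & v) { return !v.empty(); }

int main() {
    printf("0    -> %d\n", as_bool(int64_t(0)));     // 0 (falsy)
    printf("0.0  -> %d\n", as_bool(0.0));            // 0 (falsy)
    printf("42   -> %d\n", as_bool(int64_t(42)));    // 1 (truthy)
    printf("\"\" -> %d\n", as_bool(std::string())); // 0 (falsy)
    return 0;
}
```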
Concedo
7f618454ff Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	CODEOWNERS
#	docs/backend/OPENCL.md
#	docs/ops.md
#	docs/ops/CANN.csv
#	docs/ops/WebGPU.csv
#	ggml/src/ggml-blas/CMakeLists.txt
#	ggml/src/ggml-opencl/kernels/mul_mv_q6_k.cl
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/cpy.tmpl.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/set_rows.wgsl
#	tests/test-backend-ops.cpp
2026-01-18 23:24:29 +08:00
Sigbjørn Skjæret
bbcdac0189
jinja : fix object item order (and properly implement dictsort) (#18904)
* fix object item order

* as_ordered_object

* copy whole object
2026-01-18 03:40:06 +01:00
Sigbjørn Skjæret
d03c45c9c5
jinja : attribute support for join, map and sort (#18883)
* support negative array index and default value

* attribute support (int and str) for join, map and sort

* add tests

* update CODEOWNERS

* improve fixme sorting comment
2026-01-18 02:53:01 +01:00
Sigbjørn Skjæret
10c98cbdf6
jinja : add missing tojson filter for bool (#18900)
* add missing tojson for bool

* add more literal tests
2026-01-18 01:05:09 +01:00
Sigbjørn Skjæret
420960ab92
jinja : fix lexing of float literals with sign (#18901)
* fix lexing of float literals with sign

* add test

* consume_numeric
2026-01-18 00:57:51 +01:00
Xuan-Son Nguyen
f55b033ae6
jinja: correct member access rule (#18905) 2026-01-18 00:48:55 +01:00
Concedo
8855a7f52b Merge commit 'c945aaaef2' into concedo_experimental
# Conflicts:
#	.devops/cann.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	README.md
#	common/CMakeLists.txt
#	common/chat.cpp
#	docs/function-calling.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja
#	scripts/sync_vendor.py
#	tests/CMakeLists.txt
#	tests/peg-parser/tests.h
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat-template.cpp
#	tests/test-chat.cpp
#	tests/testing.h
#	tools/llama-bench/llama-bench.cpp
2026-01-17 10:24:03 +08:00
Concedo
d15bd212c5 cleanup 2026-01-17 00:57:33 +08:00
Concedo
0d43bdc46d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/batched/batched.cpp
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	src/llama-context.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-01-17 00:41:28 +08:00
Concedo
22af5f1250 Merge commit '2a13180100' into concedo_experimental
# Conflicts:
#	.devops/cann.Dockerfile
#	.devops/cpu.Dockerfile
#	.devops/cuda-new.Dockerfile
#	.devops/cuda.Dockerfile
#	.devops/intel.Dockerfile
#	.devops/llama-cli-cann.Dockerfile
#	.devops/musa.Dockerfile
#	.devops/nix/package.nix
#	.devops/rocm.Dockerfile
#	.devops/s390x.Dockerfile
#	.devops/vulkan.Dockerfile
#	.github/workflows/build-cmake-pkg.yml
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	.github/workflows/copilot-setup-steps.yml
#	.github/workflows/release.yml
#	.github/workflows/server-webui.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	README.md
#	build-xcframework.sh
#	ci/run.sh
#	cmake/common.cmake
#	common/CMakeLists.txt
#	docs/backend/hexagon/CMakeUserPresets.json
#	docs/backend/hexagon/README.md
#	docs/build-riscv64-spacemit.md
#	docs/build.md
#	examples/debug/debug.cpp
#	examples/eval-callback/CMakeLists.txt
#	examples/eval-callback/eval-callback.cpp
#	examples/llama.android/lib/build.gradle.kts
#	examples/sycl/build.sh
#	examples/sycl/win-build-sycl.bat
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/binary-ops.c
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/get-rows-ops.c
#	ggml/src/ggml-hexagon/htp/hex-dma.c
#	ggml/src/ggml-hexagon/htp/hex-dma.h
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-msg.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/set-rows-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-hexagon/htp/worker-pool.c
#	scripts/debug-test.sh
#	scripts/serve-static.js
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-mtmd.sh
#	scripts/snapdragon/adb/run-tool.sh
#	scripts/tool_bench.py
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/mtmd/clip.cpp
2026-01-16 21:52:01 +08:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine (#18462)
* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support binded functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inference

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null -- > none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* implement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
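To make the lexer/runtime split named throughout this log concrete, a toy renderer that handles only `{{ var }}` substitution, with the parse step collapsed into the token stream. Everything about the real engine in common/jinja (statements, filters, tests, macros, whitespace control) is out of scope here.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct token { bool is_var; std::string text; };

// lexer: split the template into literal text and `{{ ... }}` variables
static std::vector<token> lex(const std::string & src) {
    std::vector<token> toks;
    size_t pos = 0;
    while (pos < src.size()) {
        const size_t open = src.find("{{", pos);
        if (open == std::string::npos) {
            toks.push_back({false, src.substr(pos)});
            break;
        }
        if (open > pos) {
            toks.push_back({false, src.substr(pos, open - pos)});
        }
        const size_t close = src.find("}}", open);
        if (close == std::string::npos) {
            break; // malformed template; the real lexer raises an error here
        }
        std::string name = src.substr(open + 2, close - open - 2);
        name.erase(0, name.find_first_not_of(' ')); // ltrim
        name.erase(name.find_last_not_of(' ') + 1); // rtrim
        toks.push_back({true, name});
        pos = close + 2;
    }
    return toks;
}

// runtime: walk the tokens with an environment of values
static std::string run(const std::vector<token> & toks,
                       const std::map<std::string, std::string> & env) {
    std::string out;
    for (const auto & t : toks) {
        out += t.is_var ? env.at(t.text) : t.text; // at() throws if undefined
    }
    return out;
}

int main() {
    printf("%s\n", run(lex("Hello, {{ name }}!"), {{"name", "world"}}).c_str());
    return 0;
}
```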
Concedo
af7811dbe1 Merge commit '3e4bb29666' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	cmake/common.cmake
#	examples/eval-callback/CMakeLists.txt
#	examples/model-conversion/scripts/causal/modelcard.template
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-metal/CMakeLists.txt
#	src/CMakeLists.txt
#	tests/CMakeLists.txt
#	tests/test-arg-parser.cpp
2026-01-16 17:55:22 +08:00
ddh0
13f1e4a9ca
llama : add adaptive-p sampler (#17927)
* initial commit for branch

* simplify constants

* add params to `struct common_params_sampling`, add reference to PR

* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`

* add args, rename `queue_size` -> `window_size`

* improved comments

* minor

* remove old unused code from algorithm

* minor

* add power law case to `common_sampler_init`, add sampler name mappings

* clarify behaviour when `window_size = 0`

* add missing enums

* remove `target_range` param, make `target == 1` no-op, cleanup code

* oops, straggler

* add missing parameters in `server-task.cpp`

* copy from author

ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069

* remove old debug log, style nit

* fix compiler warning, add commented-out logging per token

* re-write + change parameters + simplify

* oops forgot args.cpp

* fix leftover `window_size`

* add missing values to `common_params_sampling::print()`

* with logging

* does this fix it?

* no, but does this?

* update default decay

* optimize

* fix bad merge

my git skills are lacking

* silence `missing initializer for member`

* update default decay to 0.9

* fix logging

* format (double)

* add power law to the new `samplers` vector

* log sampler init values

* improve logging messages in llama_sampler_power_law

* remove extraneous logging

* simplify target computation

last commit with debug logging!

* remove debug logging, explicitly clamp params at init

* add `use_power_law` flag + logic, minor cleanup

* update `power-law` -> `adaptive-p`

* fix cold start EMA

- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`

this fixes a "cold start" problem with the moving average

* update `SHARPNESS` constant to `10.0f`

* minor style fixes

no functional changes

* minor style fixes cont.

* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)

* separate into `apply` + `accept` functions

* `pending_token_idx`: switch from `llama_token` to `int32`

functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now

* don't transform logits <= -1e9f

* fix masking in backend top-p, min-p

* address review comments

* typo in comments `RND` -> `RNG`

* add docs

* add recommended values in completion docs

* address PR feedback

* remove trailing whitespace (for CI `editorconfig`)

* add to adaptive-p to `common_sampler_types_from_chars`
2026-01-15 19:16:29 +02:00
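The cold-start fix above amounts to seeding the EMA's running sums with their steady-state values so the average starts exactly at `target` instead of drifting in from zero. The initial values below are the ones quoted in the commit; the update rule is an assumption.

```cpp
#include <cstdio>

struct adaptive_p_ema {
    float decay;
    float weighted_sum;
    float total_weight;

    adaptive_p_ema(float target, float decay) : decay(decay),
        weighted_sum(target / (1.0f - decay)),   // per the commit message
        total_weight(1.0f   / (1.0f - decay)) {} // per the commit message

    void accept(float observed) { // assumed exponential moving average update
        weighted_sum = decay * weighted_sum + observed;
        total_weight = decay * total_weight + 1.0f;
    }

    float avg() const { return weighted_sum / total_weight; }
};

int main() {
    adaptive_p_ema ema(/*target=*/0.5f, /*decay=*/0.9f);
    printf("avg before any sample: %.3f\n", ema.avg()); // 0.500, no cold start
    ema.accept(0.8f);
    printf("avg after one sample:  %.3f\n", ema.avg()); // pulled toward 0.8
    return 0;
}
```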
Georgi Gerganov
39173bcacb
context : reserve new scheduler when graph topology changes (#18547)
* context : reserve new scheduler when graph topology changes

* cont : fix

* cont : fix reserve

* cont : reserve only when changes occur + timing

* context : add comments

* llama : reserve on sampler changes

* common : allow null common_sampler

* server : task declares needs (embd, logits, sampling)

* server : do not init sampler if not needed

* llama : fix need_reserve when unsetting a sampler

* server : consolidate slot reset/clear logic
2026-01-15 16:39:17 +02:00
Piotr Wilkin (ilintar)
d98b548120
Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)
* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality

* Move to common

* Remove unneeded header

* Unlink from common

* chore: update webui build output

* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.

* Revert change to webapp

* Post-merge adjust

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply code review changes

* Remove changes to server-context

* Remove mtmd.h include

* Remove utility functions from header

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Rename functions

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-01-14 20:29:35 +01:00
Adrien Gallouët
516a4ca9b5
refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
Junwon Hwang
60591f01d4
model : add EXAONE MoE (#18543)
* Add EXAONE MoE implementations

Co-authored-by: Junwon Hwang <nuclear1221@gmail.com>

* Address PR feedback

* Address PR feedback

* [WIP] Add MTP for EXAONE-MoE

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>
2026-01-13 23:28:38 +01:00
Concedo
7d2c1c4f46 note: clip_is_mrope was moved to mtmd_decode_use_mrope upstream and no longer syncs since https://github.com/ggml-org/llama.cpp/pull/18793
Merge commit 'c1e79e610f' into concedo_experimental

# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	CONTRIBUTING.md
#	MIT_LICENSE_GGML_SDCPP_LLAMACPP_ONLY.md
#	README.md
#	SECURITY.md
#	ci/run.sh
#	common/CMakeLists.txt
#	common/arg.cpp
#	docs/ops.md
#	docs/ops/BLAS.csv
#	docs/ops/zDNN.csv
#	docs/preset.md
#	examples/batched/batched.cpp
#	examples/debug/debug.cpp
#	ggml/src/ggml-blas/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	licenses/LICENSE-curl
#	licenses/LICENSE-httplib
#	scripts/pr2wt.sh
#	scripts/sync_vendor.py
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
#	vendor/cpp-httplib/LICENSE
2026-01-13 23:31:14 +08:00
Concedo
0dc18c668c Merge commit 'a61c8bc3bf' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/pr2wt.sh
#	src/llama-model.cpp
#	tools/CMakeLists.txt
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/clip.cpp
#	tools/mtmd/clip.h
2026-01-13 23:06:50 +08:00
Radoslav Gerganov
bcf7546160
server : add arg for disabling prompt caching (#18776)
* server : add arg for disabling prompt caching

Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.

* address review comments

* address review comments
2026-01-12 19:21:34 +02:00
Daniel Bevenius
4150da9a95
examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. The
README.md specifies this flag as required, but it was not available as a
command-line option for the batched example.

The motivation for this is that specifying the flag as the README
instructs leads to an error about the flag not being recognized, and
without the option the example fails with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
2026-01-12 13:47:58 +01:00
Xuan-Son Nguyen
23f82f2420
preset: allow named remote preset (#18728)
* preset: allow named remote preset

* nits: fix docs

* cont docs
2026-01-10 15:12:29 +01:00
Adrien Gallouët
ea23c15990
common : add --license to display embedded licenses (#18696)
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.

This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
2026-01-10 09:46:24 +01:00
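A minimal sketch of the mechanism, assuming the simplest possible shape: the license text baked into the binary as a string constant and printed on --license. How the real build generates the constant is not shown in the log.

```cpp
#include <cstdio>
#include <cstring>

static const char * EMBEDDED_LICENSES =
    "llama.cpp - MIT License\n"
    "...\n"; // the real text would be generated at build time

int main(int argc, char ** argv) {
    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], "--license") == 0) {
            printf("%s", EMBEDDED_LICENSES); // self-contained: no LICENSE file
            return 0;
        }
    }
    return 0;
}
```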
Xuan-Son Nguyen
8ece3836b4
common: support remote preset (#18520)
* arg: support remote preset

* proof reading

* allow one HF repo to point to multiple HF repos

* docs: mention about multiple GGUF use case

* correct clean_file_name

* download: also return HTTP status code

* fix case with cache file used

* fix --offline option
2026-01-08 22:35:40 +01:00