Commit graph

744 commits

Author SHA1 Message Date
Concedo
103d60ed2c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/common.cpp
#	examples/batched-bench/batched-bench.cpp
#	examples/batched/batched.cpp
#	examples/export-lora/export-lora.cpp
#	examples/gritlm/gritlm.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Xuan-Son Nguyen
42eb248f46
common : remove json.hpp from common.cpp (#12697)
* common : remove json.hpp from common.cpp

* fix comment
2025-04-02 09:58:34 +02:00
Xuan-Son Nguyen
267c1399f1
common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2
2025-04-01 23:44:05 +02:00
Concedo
396875e1c4 update api docs and lite 2025-03-29 15:39:25 +08:00
Benson Wong
5d01670266
server : include speculative decoding stats when timings_per_token is enabled (#12603)
* Include speculative decoding stats when timings_per_token is true

New fields added to the `timings` object:

  - draft_n           : number of draft tokens generated
  - draft_accepted_n  : number of draft tokens accepted
  - draft_accept_ratio: ratio of accepted/generated

* Remove redundant draft_accept_ratio var

* add draft acceptance rate to server console output
2025-03-28 10:05:44 +02:00
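The `timings` fields listed in the commit above (#12603) can be combined on the client side. The following is a minimal sketch, assuming the `timings` object has already been parsed from JSON into a Python dict; the helper name `draft_accept_ratio` is hypothetical and not part of the server. It derives the acceptance ratio from the two counters, which is useful if a response carries only `draft_n` and `draft_accepted_n`.

```python
from typing import Optional


def draft_accept_ratio(timings: dict) -> Optional[float]:
    """Return accepted/generated draft tokens, or None if nothing was drafted.

    Hypothetical client-side helper; the field names come from the commit
    message, everything else is an assumption made for illustration.
    """
    draft_n = timings.get("draft_n", 0)
    accepted = timings.get("draft_accepted_n", 0)
    if draft_n <= 0:
        # Avoid dividing by zero when speculative decoding produced no drafts.
        return None
    return accepted / draft_n


example = {"draft_n": 8, "draft_accepted_n": 6}
ratio = draft_accept_ratio(example)
```

Returning `None` rather than `0.0` keeps "no draft tokens were generated" distinguishable from "every draft token was rejected".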
Piotr
2099a9d5db
server : Support listening on a unix socket (#12613)
* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

---------

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-03-27 23:41:04 +01:00
Concedo
ea358369cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/README.md
#	ci/run.sh
#	docs/backend/CUDA-FEDORA.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Marius Gerdes
77f9c6bbe5
server : Add verbose output to OAI compatible chat endpoint. (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
2025-03-23 19:30:26 +01:00
Concedo
7030ebf401 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
Woof Dog
e04643063b
webui : Prevent rerendering on textarea input (#12299)
* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-20 15:57:43 +01:00
Concedo
0c90d2ebcf Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	docs/backend/SYCL.md
#	examples/main/README.md
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Georgi Gerganov
810e0af3f5
server : fix warmup draft cache type (#12446)
ggml-ci
2025-03-18 12:05:42 +02:00
Concedo
67851e5415 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/run/run.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
2025-03-15 19:54:19 +08:00
Victor
add2a3aa5a
server: fix "--grammar-file" parameter (#12285) 2025-03-14 11:21:17 +01:00
Concedo
0db4ae6237 traded my ink for a pen 2025-03-14 11:58:15 +08:00
Georgi Gerganov
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context

ggml-ci

* graph : don't mutate the KV cache during defrag

ggml-ci

* context : reduce virtuals + remove test function

ggml-ci

* context : move interface implementation to source file + factory

ggml-ci

* graph : move KV cache build functions to llama_context impl

ggml-ci

* graph : remove model reference from build_pooling

ggml-ci

* graph : remove llama_model reference

ggml-ci

* kv_cache : provide rope factors

ggml-ci

* graph : rework inputs to use only unique_ptr, remove attn input abstraction

ggml-ci

* context : remove llama_context_i abstraction

ggml-ci

* context : clean-up

ggml-ci

* graph : clean-up

ggml-ci

* llama : remove redundant keywords (struct, enum)

ggml-ci

* model : adapt gemma3

ggml-ci

* graph : restore same attention ops as on master

ggml-ci

* llama : remove TODO + fix indent

ggml-ci
2025-03-13 12:35:44 +02:00
Ishaan Gandhi
2048b5913d
server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338)
* Fix DOS index bug

* Remove new APIs

* remove extra line

* Remove from API

* Add extra newline

* Update examples/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-13 11:10:05 +01:00
Concedo
77debb1b1b gemma3 vision works, but is using more tokens than expected - may need resizing 2025-03-13 00:31:16 +08:00
Olivier Chafik
be421fc429
tool-call: ensure there's always a non-empty tool call id (#12292) 2025-03-10 09:45:29 +00:00
Olivier Chafik
2b3a25c212
sampler: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291)
* Fix typo in lazy grammar handling (fixes trigger tokens)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-10 09:44:42 +00:00
Concedo
6b7c3ae1d3 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	AUTHORS
#	README.md
#	ci/run.sh
#	docs/build.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-metal/CMakeLists.txt
#	scripts/sync-ggml.last
2025-03-10 10:32:41 +08:00
Georgi Gerganov
7ab364390f
server : infill gen ends on new line (#12254) 2025-03-07 20:54:30 +02:00
Sigbjørn Skjæret
8fad3c7a7c
server : Log original chat template parsing error (#12233) 2025-03-07 11:15:33 +01:00
Concedo
ec43d2b147 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	common/common.cpp
#	examples/embedding/embedding.cpp
#	examples/json_schema_to_grammar.py
#	examples/llama.android/llama/src/main/cpp/llama-android.cpp
#	examples/llama.swiftui/README.md
#	examples/llama.swiftui/llama.swiftui.xcodeproj/project.pbxproj
#	examples/lookahead/lookahead.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	requirements.txt
#	requirements/requirements-all.txt
#	scripts/fetch_server_test_models.py
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
2025-03-06 18:54:58 +08:00
Olivier Chafik
669912d9a5
tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-05 13:05:13 +00:00
Clauszy
06a92a193a
server : fix cache reuse logic (#12161)
The first kv shift offsets the positions of all tokens after head_c.
When using llama_kv_cache_seq_rm next, using head_c will remove the valid tokens because their positions have already been offset.
2025-03-05 09:25:45 +02:00
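The problem described in the commit above (#12161) can be reproduced with a toy model. This is only a sketch under stated assumptions: the cache is a plain dict mapping a token label to its position, and `seq_rm` / `seq_add` are hypothetical stand-ins that mimic range removal and range shifting, not the llama.cpp functions themselves. It compares shifting every token after `head_c` with shifting only the matched chunk, and then applies a second removal that still uses the original `head_c` coordinate.

```python
def seq_rm(cache, p0, p1):
    """Toy removal: drop every token whose position lies in [p0, p1)."""
    return {tok: pos for tok, pos in cache.items() if not (p0 <= pos < p1)}


def seq_add(cache, p0, p1, delta):
    """Toy shift: move tokens with position in [p0, p1) by delta.

    p1=None means "no upper bound" (an assumption of this sketch).
    """
    return {
        tok: (pos + delta if pos >= p0 and (p1 is None or pos < p1) else pos)
        for tok, pos in cache.items()
    }


# Ten tokens "a".."j" at positions 0..9.
cache = {tok: pos for pos, tok in enumerate("abcdefghij")}

# First reuse: move the chunk "ef" (positions 4..5) down to position 2.
head_p, head_c, n_match = 2, 4, 2
step1 = seq_rm(cache, head_p, head_c)
shift_all = seq_add(step1, head_c, None, head_p - head_c)
shift_chunk = seq_add(step1, head_c, head_c + n_match, head_p - head_c)

# Second reuse: the next chunk "ij" was found at head_c = 8 in the
# original layout and should move to head_p = 4.
head_p, head_c = 4, 8
after_all = seq_rm(shift_all, head_p, head_c)      # "ij" already moved into [4, 8)
after_chunk = seq_rm(shift_chunk, head_p, head_c)  # "ij" still at 8..9
```

In the shift-everything variant the second removal deletes the tokens that were meant to be reused, because their positions were already offset by the first shift. Limiting the shift to the matched chunk is shown here as one possible way to keep the coordinates consistent; the commit message does not state which approach the actual fix takes.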
Concedo
0cddbe1f0b Merge branch 'upstream' into concedo_experimental 2025-03-05 00:22:06 +08:00
Concedo
6b7d2349a7 Rewrite history to fix bad vulkan shader commits without increasing repo size
added dpe colab (+8 squashed commit)

Squashed commit:

[b8362da4] updated lite

[ed6c037d] move nsigma into the regular sampler stack

[ac5f61c6] relative filepath fixed

[05fe96ab] export template

[ed0a5a3e] nix_example.md: refactor (#1401)

* nix_example.md: add override example

* nix_example.md: drop graphics example, already basic nixos knowledge

* nix_example.md: format

* nix_example.md: Vulkan is disabled on macOS

Disabled in: 1ccd253acc

* nix_examples.md: nixpkgs.config.cuda{Arches -> Capabilities}

Fixes: https://github.com/LostRuins/koboldcpp/issues/1367

[675c62f7] AutoGuess: Phi 4 (mini) (#1402)

[4bf56982] phrasing

[b8c0df04] Add Rep Pen to Top N Sigma sampler chain (#1397)

- place after nsigma and before xtc (+3 squashed commit)

Squashed commit:

[87c52b97] disable VMM from HIP

[ee8906f3] edit description

[e85c0e69] Remove Unnecessary Rep Counting (#1394)

* stop counting reps

* fix range-based initializer

* strike that - reverse it
2025-03-05 00:02:20 +08:00
Olivier Chafik
1a24c4621f
server: fix deadly typo in response_format.json_schema.schema handling (#12168) 2025-03-04 08:24:07 +02:00
Xuan-Son Nguyen
7b69003af7
webui : add ?m=... and ?q=... params (#12148)
* webui : add ?m=... and ?q=... params

* also clear prefilledMessage variable

* better approach

* fix comment

* test: bump timeout on GITHUB_ACTION
2025-03-03 11:42:45 +01:00
Vivian
2cc4a5e44a
webui : minor typo fixes (#12116)
* fix typos and improve menu text clarity

* rename variable trimedValue to trimmedValue

* add updated index.html.gz

* rebuild

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-01 11:15:09 +01:00
Olivier Chafik
d7cfe1ffe0
docs: add docs/function-calling.md to lighten server/README.md's plight (#12069) 2025-02-25 18:52:56 +00:00
rhjdvsgsgks
401af80b54
server: handle echo=false on /v1/completions (#12060) 2025-02-25 12:52:52 +01:00
Olivier Chafik
0b52745649
server: support add_generation_prompt query param (#12062) 2025-02-25 10:40:22 +00:00
Concedo
159c47f0e6 Merge commit '335eb04a91' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CONTRIBUTING.md
#	Makefile
#	docs/build.md
#	examples/llama.swiftui/llama.swiftui/UI/ContentView.swift
#	examples/run/run.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
2025-02-24 11:55:14 +08:00
Georgi Gerganov
cf756d6e0a
server : disable Nagle's algorithm (#12020) 2025-02-22 11:46:31 +01:00
momonga
c392e5094d
server (webui): Fix Premature Submission During IME Conversion (#11971)
* fix skip ime composing

* fix npm rebuild

* fix warn

---------

Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-20 19:43:22 +01:00
Concedo
6d7ef10671 Merge branch 'upstream' into concedo_experimental
Re-enable qwen2vl GPU for vulkan https://github.com/ggml-org/llama.cpp/pull/11902

# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.gitignore
#	CONTRIBUTING.md
#	Makefile
#	common/CMakeLists.txt
#	common/arg.cpp
#	common/common.cpp
#	examples/main/main.cpp
#	examples/run/run.cpp
#	examples/server/tests/README.md
#	ggml/src/ggml-cuda/mma.cuh
#	scripts/get_chat_template.py
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tests/test-chat.cpp
2025-02-20 23:17:20 +08:00
Georgi Gerganov
abd4d0bc4f
speculative : update default params (#11954)
* speculative : update default params

* speculative : do not discard the last drafted token
2025-02-19 13:29:42 +02:00
igardev
b58934c183
server : (webui) Enable communication with parent html (if webui is in iframe) (#11940)
* Webui: Enable communication with parent html (if webui is in iframe):
- Listens for "setText" command from parent with "text" and "context" fields. "text" is set in inputMsg, "context" is used as hidden context on the following requests to the llama.cpp server
- On pressing the Escape button, sends command "escapePressed" to the parent

Example handling from the parent html side:
- Send command "setText" from parent html to webui in iframe:
const iframe = document.getElementById('askAiIframe');
if (iframe) {
	iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, '*');
}

- Listen for Escape key from webui on parent html:
// Listen for escape key event in the iframe
window.addEventListener('keydown', (event) => {
	if (event.key === 'Escape') {
		// Process case when Escape is pressed inside webui
	}
});

* Move the extraContext from storage to app.context.

* Fix formatting.

* add Message.extra

* format + build

* MessageExtraContext

* build

* fix display

* rm console.log

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-18 23:01:44 +01:00
Olivier Chafik
63e489c025
tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900)
* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type

* addressed clang-tidy lints in [test-]chat.*

* rm minja deps from util & common & move it to common/minja/

* add name & tool_call_id to common_chat_msg

* add common_chat_tool

* added json <-> tools, msgs conversions to chat.h

* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)

* fix deepseek r1 slow test (no longer <think> opening w/ new template)

* allow empty tools w/ auto + grammar

* fix & test server grammar & json_schema params w/ & w/o --jinja
2025-02-18 18:03:23 +00:00
Xuan-Son Nguyen
63ac128563
server : add TEI API format for /rerank endpoint (#11942)
* server : add TEI API format for /rerank endpoint

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix

* also gitignore examples/server/*.gz.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-18 14:21:41 +01:00
xiaobing318
09aaf4f1f5
docs : Fix duplicated file extension in test command (#11935)
This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension. The original command was:
./tests.sh unit/test_chat_completion.py.py -v -x
It has been corrected to:
./tests.sh unit/test_chat_completion.py -v -x
This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.
2025-02-18 10:12:49 +01:00
Antoine Viallon
c4d29baf32
server : fix divide-by-zero in metrics reporting (#11915) 2025-02-17 11:25:12 +01:00
Xuan-Son Nguyen
0f2bbe6564
server : bump httplib to 0.19.0 (#11908) 2025-02-16 17:11:22 +00:00
Concedo
f144b1f345 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/llama-cpp.srpm.spec
#	.devops/nix/package.nix
#	.devops/rocm.Dockerfile
#	.github/ISSUE_TEMPLATE/020-enhancement.yml
#	.github/ISSUE_TEMPLATE/030-research.yml
#	.github/ISSUE_TEMPLATE/040-refactor.yml
#	.github/ISSUE_TEMPLATE/config.yml
#	.github/pull_request_template.md
#	.github/workflows/bench.yml.disabled
#	.github/workflows/build.yml
#	.github/workflows/labeler.yml
#	CONTRIBUTING.md
#	Makefile
#	README.md
#	SECURITY.md
#	ci/README.md
#	common/CMakeLists.txt
#	docs/android.md
#	docs/backend/SYCL.md
#	docs/build.md
#	docs/cuda-fedora.md
#	docs/development/HOWTO-add-model.md
#	docs/docker.md
#	docs/install.md
#	docs/llguidance.md
#	examples/cvector-generator/README.md
#	examples/imatrix/README.md
#	examples/imatrix/imatrix.cpp
#	examples/llama.android/llama/src/main/cpp/CMakeLists.txt
#	examples/llama.swiftui/README.md
#	examples/llama.vim
#	examples/lookahead/README.md
#	examples/lookup/README.md
#	examples/main/README.md
#	examples/passkey/README.md
#	examples/pydantic_models_to_grammar_examples.py
#	examples/retrieval/README.md
#	examples/server/CMakeLists.txt
#	examples/server/README.md
#	examples/simple-cmake-pkg/README.md
#	examples/speculative/README.md
#	flake.nix
#	grammars/README.md
#	pyproject.toml
#	scripts/check-requirements.sh
2025-02-16 02:08:39 +08:00
Georgi Gerganov
68ff663a04
repo : update links to new url (#11886)
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
2025-02-15 16:40:57 +02:00
Concedo
754fef5204 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/cuda.Dockerfile
#	.devops/musa.Dockerfile
#	.github/workflows/build.yml
#	README.md
#	docs/docker.md
#	examples/imatrix/imatrix.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/main/README.md
#	examples/perplexity/perplexity.cpp
#	examples/server/README.md
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-cuda/CMakeLists.txt
#	models/templates/deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja
#	models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
#	scripts/get_chat_template.py
#	scripts/sync-ggml.last
#	tests/test-chat.cpp
#	tests/test-gguf.cpp
#	tests/test-sampling.cpp
2025-02-15 00:49:46 +08:00
Concedo
39fad991cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/main/README.md
#	examples/run/run.cpp
2025-02-14 11:34:29 +08:00
Reza Rahemtola
c1f958c038
server : (docs) Update wrong tool calling example (#11809)
Call updated to match the tool used in the output just below, following the example in https://github.com/ggerganov/llama.cpp/pull/9639
2025-02-13 17:22:44 +01:00