koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-08 01:41:37 +00:00

Author	SHA1	Message	Date
Concedo	c00fe0af5a	Merge commit '`9f102a1407`' into concedo_experimental # Conflicts: # .devops/intel.Dockerfile # .github/ISSUE_TEMPLATE/010-bug-compilation.yml # .github/ISSUE_TEMPLATE/011-bug-results.yml # .github/pull_request_template.md # CODEOWNERS # README.md # common/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/binary-ops.c # ggml/src/ggml-hexagon/htp/hex-dma.c # ggml/src/ggml-hexagon/htp/hex-dma.h # ggml/src/ggml-hexagon/htp/hex-dump.h # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/ssm-conv.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # scripts/snapdragon/adb/run-bench.sh # scripts/sync_vendor.py # tests/test-backend-ops.cpp # tools/llama-bench/llama-bench.cpp	2026-03-25 23:45:41 +08:00
Adrien Gallouët	8c7957ca33	common : add standard Hugging Face cache support (#20775 ) * common : add standard Hugging Face cache support - Use HF API to find all files - Migrate all manifests to hugging face cache at startup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check with the quant tag Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Improve error handling and report API errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Restore common_cached_model_info and align mmproj filtering Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Prefer main when getting cached ref Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use cached files when HF API fails Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use final_path.. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check all inputs Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 07:30:33 +01:00
Concedo	ef854f002e	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/python-type-check.yml # AGENTS.md # CONTRIBUTING.md # examples/model-conversion/scripts/embedding/run-original-model.py # examples/model-conversion/scripts/utils/compare_tokens.py # examples/pydantic_models_to_grammar.py # ggml/src/ggml-rpc/ggml-rpc.cpp # pyrightconfig.json # scripts/compare-llama-bench.py # scripts/jinja/jinja-tester.py # scripts/server-bench.py # tests/test-grammar-integration.cpp # tests/test-grammar-parser.cpp # tests/test-llama-grammar.cpp # tests/test-tokenizer-random.py # tools/cli/README.md # tools/completion/README.md # tools/llama-bench/llama-bench.cpp # tools/server/README.md	2026-03-22 23:39:13 +08:00
Sigbjørn Skjæret	29b28a9824	ci : switch from pyright to ty (#20826 ) * type fixes * switch to ty * tweak rules * tweak more rules * more tweaks * final tweak * use common import-not-found rule	2026-03-21 08:54:34 +01:00
Concedo	6054bacadd	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/ai-issues.yml # CONTRIBUTING.md # docs/autoparser.md # docs/ops.md # docs/ops/Metal.csv # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/hex-dma.h # ggml/src/ggml-hexagon/htp/hex-utils.h # ggml/src/ggml-hexagon/htp/htp-ctx.h # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/htp_iface.idl # ggml/src/ggml-hexagon/htp/hvx-base.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hip/CMakeLists.txt # models/templates/Apriel-1.6-15b-Thinker-fixed.jinja # models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja # models/templates/deepseek-ai-DeepSeek-V3.1.jinja # models/templates/llama-cpp-deepseek-r1.jinja # models/templates/meetkai-functionary-medium-v3.1.jinja # scripts/fetch_server_test_models.py # scripts/snapdragon/adb/run-cli.sh # scripts/snapdragon/adb/run-completion.sh # scripts/snapdragon/adb/run-mtmd.sh # scripts/snapdragon/adb/run-tool.sh # tests/test-chat-auto-parser.cpp # tests/test-chat-peg-parser.cpp # tests/test-chat.cpp # tools/cli/cli.cpp # tools/server/README.md	2026-03-21 12:06:01 +08:00
Ryan Goulden	26c9ce1288	server: Add cached_tokens info to oaicompat responses (#19361 ) * tests : fix fetch_server_test_models.py * server: to_json_oaicompat cached_tokens Adds OpenAI and Anthropic compatible information about the number of cached prompt tokens used in a response.	2026-03-19 19:09:33 +01:00
Piotr Wilkin (ilintar)	5e54d51b19	common/parser: add proper reasoning tag prefill reading (#20424 ) * Implement proper prefill extraction * Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp * Update tools/server/server-task.cpp * refactor: move grammars to variant, remove grammar_external, handle exception internally * Make code less C++y Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-19 16:58:21 +01:00
Concedo	9084527b36	Merge commit '`67a2209fab`' into concedo_experimental # Conflicts: # .github/workflows/build-cache.yml # .github/workflows/build-cross.yml # .github/workflows/build-self-hosted.yml # .github/workflows/build.yml # .github/workflows/python-lint.yml # .github/workflows/release.yml # .github/workflows/server-self-hosted.yml # .github/workflows/server-webui.yml # .github/workflows/server.yml # CODEOWNERS # ggml/src/ggml-sycl/gated_delta_net.cpp # scripts/sync_vendor.py # tools/cli/cli.cpp	2026-03-17 11:11:25 +08:00
Georgi Gerganov	88915cb55c	server : fix wait in test_cancel_requests() test (#20601 ) * server : fix wait in test_cancel_requests() test * codeowners : add team for server tests	2026-03-15 20:54:37 +02:00
Concedo	f3d2f58fa8	note: smartcache is broken for rnn currently	2026-03-15 11:31:47 +08:00
Concedo	b1c500ae2b	Merge commit '`2948e6049a`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CONTRIBUTING.md # docs/backend/VirtGPU/development.md # docs/ops.md # docs/ops/WebGPU.csv # embd_res/templates/GigaChat3-10B-A1.8B.jinja # embd_res/templates/GigaChat3.1-10B-A1.8B.jinja # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # scripts/sync_vendor.py # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tests/test-chat.cpp # tests/test-grammar-integration.cpp # tests/test-quantize-fns.cpp	2026-03-15 11:21:24 +08:00
Concedo	ff3f8533d3	Merge commit '`c96f608d98`' into concedo_experimental # Conflicts: # CONTRIBUTING.md # docs/ops.md # docs/ops/Vulkan.csv # models/templates/LFM2-8B-A1B.jinja # tests/peg-parser/test-python-dict-parser.cpp # tests/peg-parser/test-unicode.cpp # tests/test-chat-peg-parser.cpp # tests/test-chat.cpp # tools/llama-bench/llama-bench.cpp	2026-03-14 17:14:34 +08:00
ZeroV0LT	f17b3be63f	llama : fix pooling assertion crash in chunked GDN detection path (#20468 ) * llama : fix pooling assertion crash in chunked GDN detection path The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b). Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations. Regression introduced by #20340 (`d28961d`). Same class of bug as #12517, fixed by #12545. * server : add mean pooling tests to embedding test suite Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by #20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path. --------- Co-authored-by: Domenico Crupi <domenico@zerovolt.it>	2026-03-13 20:53:42 +02:00
Concedo	d2c911884d	Merge commit '`213c4a0b81`' into concedo_experimental # Conflicts: # CODEOWNERS # common/CMakeLists.txt # common/chat-peg-parser.cpp # common/chat.cpp # docs/backend/SYCL.md # docs/development/parsing.md # docs/ops.md # docs/ops/SYCL.csv # embd_res/templates/Apriel-1.6-15b-Thinker-fixed.jinja # embd_res/templates/Bielik-11B-v3.0-Instruct.jinja # embd_res/templates/GLM-4.7-Flash.jinja # embd_res/templates/LFM2-8B-A1B.jinja # embd_res/templates/StepFun3.5-Flash.jinja # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/CMakeLists.txt # ggml/src/ggml-sycl/backend.hpp # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/convert.hpp # ggml/src/ggml-sycl/count-equal.cpp # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/presets.hpp # ggml/src/ggml-sycl/softmax.cpp # ggml/src/ggml-sycl/vecdotq.hpp # models/templates/Apertus-8B-Instruct.jinja # models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja # models/templates/Qwen-QwQ-32B.jinja # models/templates/Qwen3-Coder.jinja # models/templates/deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja # models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja # models/templates/deepseek-ai-DeepSeek-V3.1.jinja # models/templates/fireworks-ai-llama-3-firefunction-v2.jinja # models/templates/moonshotai-Kimi-K2.jinja # models/templates/unsloth-Apriel-1.5.jinja # tests/CMakeLists.txt # tests/peg-parser/test-basic.cpp # tests/peg-parser/tests.h # tests/test-backend-ops.cpp # tests/test-chat-peg-parser.cpp # tests/test-chat-template.cpp # tests/test-chat.cpp # tests/test-json-schema-to-grammar.cpp # tests/test-peg-parser.cpp # tools/CMakeLists.txt # tools/cli/cli.cpp	2026-03-13 21:35:56 +08:00
Piotr Wilkin (ilintar)	0e810413bb	tests : use `reasoning` instead of `reasoning_budget` in server tests (#20432 )	2026-03-12 13:41:01 +01:00
Concedo	6adcd0b5db	Merge commit '`34df42f7be`' into concedo_experimental # Conflicts: # README.md # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/act-ops.c # ggml/src/ggml-hexagon/htp/binary-ops.c # ggml/src/ggml-hexagon/htp/cpy-ops.c # ggml/src/ggml-hexagon/htp/get-rows-ops.c # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/hvx-arith.h # ggml/src/ggml-hexagon/htp/hvx-base.h # ggml/src/ggml-hexagon/htp/hvx-inverse.h # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/set-rows-ops.c # ggml/src/ggml-hexagon/htp/softmax-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # tests/test-backend-ops.cpp # tools/cli/cli.cpp # tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte	2026-03-10 22:20:04 +08:00
Evan Huus	23fbfcb1ad	server: Parse port numbers from MCP server URLs in CORS proxy (#20208 ) * Parse port numbers from MCP server URLs * Pass scheme to http proxy for determining whether to use SSL * Fix download on non-standard port and re-add port to logging * add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-09 17:47:54 +01:00
Piotr Wilkin (ilintar)	566059a26b	Autoparser - complete refactoring of parser architecture (#18675 ) * Autoparser - full single commit squish * Final pre-merge changes: minor fixes, Kimi 2.5 model parser	2026-03-06 21:01:00 +01:00
Tom Vaucourt	e68f2fb894	server : preserve anthropic thinking blocks in conversion (#20120 ) * server : preserve anthropic thinking blocks in conversion (#20090) * server : add tests for anthropic thinking block conversion --------- Co-authored-by: root <root@llamacpp.home>	2026-03-06 17:41:12 +01:00
Marcel Petrick	92f7da00b4	chore : correct typos [no ci] (#20041 ) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 08:50:21 +01:00
Concedo	4e358265a3	Merge commit '`8387ffb28d`' into concedo_experimental # Conflicts: # docs/backend/VirtGPU.md # docs/backend/ZenDNN.md # ggml/src/ggml-cpu/amx/amx.cpp # ggml/src/ggml-cpu/amx/mmq.cpp # ggml/src/ggml-sycl/add-id.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h # ggml/src/ggml-virtgpu/backend/backend-dispatched.h # ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h # ggml/src/ggml-virtgpu/backend/backend.cpp # ggml/src/ggml-virtgpu/backend/shared/api_remoting.h # ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h # ggml/src/ggml-virtgpu/backend/shared/apir_backend.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h # ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp # ggml/src/ggml-virtgpu/ggml-backend-device.cpp # ggml/src/ggml-virtgpu/ggml-backend-reg.cpp # ggml/src/ggml-virtgpu/ggml-backend.cpp # ggml/src/ggml-virtgpu/ggml-remoting.h # ggml/src/ggml-virtgpu/include/apir_hw.h # ggml/src/ggml-virtgpu/regenerate_remoting.py # ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-impl.h # ggml/src/ggml-virtgpu/virtgpu-forward.gen.h # ggml/src/ggml-virtgpu/virtgpu.cpp # ggml/src/ggml-virtgpu/virtgpu.h # ggml/src/ggml-zendnn/CMakeLists.txt # ggml/src/ggml-zendnn/ggml-zendnn.cpp # src/CMakeLists.txt # tests/CMakeLists.txt # tests/test-tokenizer-0.sh # tools/cli/README.md # tools/completion/README.md # tools/imatrix/imatrix.cpp # tools/server/README.md	2026-02-28 12:45:16 +08:00
Pascal	2e7e638523	server : support multiple model aliases via comma-separated --alias (#19926 ) * server : support multiple model aliases via comma-separated --alias * server : update --alias description and regenerate docs * server : multiple model aliases and tags - address review feedback from ngxson - --alias accepts comma-separated values (std::set, no duplicates) - --tags for informational metadata (not used for routing) - aliases resolve transparently in router via get_meta/has_model - /v1/models exposes aliases and tags fields * regenerate docs * nits * server : use first alias as model_name for backward compat address review feedback from ngxson * server : add single-model test for aliases and tags	2026-02-27 07:05:23 +01:00
Concedo	46cd17c17e	Merge commit '`88d23ad515`' into concedo_experimental # Conflicts: # CODEOWNERS # docs/build.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-zendnn/CMakeLists.txt # tests/test-chat-template.cpp	2026-01-29 22:25:56 +08:00
Georgi Gerganov	b931f81b5a	server : adjust spec tests to generate up to 16 tokens (#19093 )	2026-01-28 09:11:40 +02:00
Concedo	e8e7c357c9	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build-cache.yml # .github/workflows/build-cmake-pkg.yml # .github/workflows/build-linux-cross.yml # .github/workflows/build.yml # .github/workflows/check-vendor.yml # .github/workflows/close-issue.yml # .github/workflows/copilot-setup-steps.yml # .github/workflows/docker.yml # .github/workflows/editorconfig.yml # .github/workflows/gguf-publish.yml # .github/workflows/labeler.yml # .github/workflows/pre-tokenizer-hashes.yml # .github/workflows/python-check-requirements.yml # .github/workflows/python-lint.yml # .github/workflows/python-type-check.yml # .github/workflows/release.yml # .github/workflows/server-webui.yml # .github/workflows/server.yml # .github/workflows/update-ops-docs.yml # .github/workflows/winget.yml # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-zdnn/ggml-zdnn.cpp # requirements/requirements-tool_bench.txt # src/CMakeLists.txt # src/llama-quant.cpp # tests/test-backend-ops.cpp # tests/test-chat.cpp # tools/cli/cli.cpp # tools/server/README.md	2026-01-23 14:27:04 +08:00
손희준	fbbf3ad190	server: /v1/responses (partial) (#18486 ) * from previous PR * Make instruction(system) as first message * Convert [input_message] (text/image/file) * Rename convert_responses_to_chatcmpl(body) -> response_body * Initial tool call support * Erase instructions field from chatcmpl body * Feed reasoning texts to chat template * Use std::vector instead of opaque json array * Make output_item.added events consistent * Move `server_task_result_cmpl_partial::update` from header to source * Match ID of output_item.added and .done events * Add function_call only if there is no "fc_" prefix * Add function call output at non-streaming API * Test if ID is persistent * Add doc * Fix style - use trailing comma * Rewrite state management * catch up with upstream/master * Fix style - "type" is the first item of SSE data * Explicitly check "instructions" from response_body * Make lambdas static * Check if reasoning content exists * Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final * Reject `input_file` since it is not supported by chatcmpl * Add "fc_" prefix to non-straming function call id as coderabbit pointed out --------- Co-authored-by: openingnow <>	2026-01-21 17:47:23 +01:00
Concedo	0d43bdc46d	Merge branch 'upstream' into concedo_experimental # Conflicts: # examples/batched/batched.cpp # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # src/llama-context.cpp # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-01-17 00:41:28 +08:00
Xuan-Son Nguyen	a04c2b06a3	server: improve slots scheduling for n_cmpl (#18789 ) * server : make sure children tasks are scheduled to launch with parent * fix * add comment pointing to this PR * fix * clean up * more debug messages * add pop_deferred_task with specific ID version * improve the logic * simple approach * no double move * correct return type of launch_slots_with_parent_task	2026-01-15 17:10:28 +01:00
Concedo	7d2c1c4f46	note: clip_is_mrope was moved to mtmd_decode_use_mrope upstream and no longer syncs since https://github.com/ggml-org/llama.cpp/pull/18793 Merge commit '`c1e79e610f`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/release.yml # CMakeLists.txt # CONTRIBUTING.md # MIT_LICENSE_GGML_SDCPP_LLAMACPP_ONLY.md # README.md # SECURITY.md # ci/run.sh # common/CMakeLists.txt # common/arg.cpp # docs/ops.md # docs/ops/BLAS.csv # docs/ops/zDNN.csv # docs/preset.md # examples/batched/batched.cpp # examples/debug/debug.cpp # ggml/src/ggml-blas/CMakeLists.txt # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # licenses/LICENSE-curl # licenses/LICENSE-httplib # scripts/pr2wt.sh # scripts/sync_vendor.py # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tools/cli/README.md # tools/completion/README.md # tools/llama-bench/llama-bench.cpp # tools/server/README.md # vendor/cpp-httplib/LICENSE	2026-01-13 23:31:14 +08:00
Concedo	0dc18c668c	Merge commit '`a61c8bc3bf`' into concedo_experimental # Conflicts: # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # scripts/pr2wt.sh # src/llama-model.cpp # tools/CMakeLists.txt # tools/mtmd/CMakeLists.txt # tools/mtmd/clip.cpp # tools/mtmd/clip.h	2026-01-13 23:06:50 +08:00
Georgi Gerganov	f307926482	server : adjust unified KV cache tests (#18716 )	2026-01-10 17:51:56 +02:00
Georgi Gerganov	f5f8812f7c	server : use different seeds for child completions (#18700 ) * server : use different seeds for child completions * cont : handle default seed * cont : note	2026-01-09 09:33:50 +02:00
Concedo	956ab99934	Merge commit '`56d2fed2b3`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # README.md # examples/CMakeLists.txt # examples/debug/CMakeLists.txt # examples/model-conversion/scripts/causal/compare-logits.py # examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py # examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh # examples/model-conversion/scripts/causal/run-converted-model.sh # examples/model-conversion/scripts/causal/run-org-model.py # examples/model-conversion/scripts/embedding/run-converted-model.sh # examples/model-conversion/scripts/embedding/run-original-model.py # examples/model-conversion/scripts/utils/common.py # examples/model-conversion/scripts/utils/semantic_check.py # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/htp-ctx.h # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/hvx-utils.c # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/htp/softmax-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # scripts/snapdragon/adb/run-bench.sh # tests/test-arg-parser.cpp # tools/CMakeLists.txt	2026-01-09 00:30:53 +08:00
R	3d26a09dc7	server : add thinking content blocks to Anthropic Messages API (#18551 ) * server : add thinking content blocks to Anthropic Messages API Add support for returning reasoning/thinking content in Anthropic API responses when using models with --reasoning-format deepseek and the thinking parameter enabled. - Non-streaming: adds thinking block before text in content array - Streaming: emits thinking_delta events with correct block indices - Partial streaming: tracks reasoning state across chunks via anthropic_has_reasoning member variable Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model. * server : fix Anthropic API streaming for thinking content blocks Add signature field and fix duplicate content_block_start events in Anthropic Messages API streaming responses for reasoning models. * server: refactor Anthropic streaming state to avoid raw pointer Replace raw pointer to task_result_state with direct field copies: - Copy state fields in update() before processing chunk - Use local copies in to_json_anthropic() instead of dereferencing - Pre-compute state updates for next chunk in update() This makes the data flow clearer and avoids unsafe pointer patterns.	2026-01-06 16:17:13 +01:00
Concedo	6cc71db85a	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/backend/SYCL.md # examples/model-conversion/Makefile # examples/model-conversion/scripts/causal/run-org-model.py # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cuda/CMakeLists.txt	2025-12-25 00:06:27 +08:00
Concedo	d1983959d2	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/release.yml # AGENTS.md # common/CMakeLists.txt # docs/development/parsing.md # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-vulkan/ggml-vulkan.cpp # tests/test-arg-parser.cpp # tests/test-backend-ops.cpp # tests/test-grammar-llguidance.cpp # tests/test-tokenizer-0.cpp # tests/test-tokenizer-1-bpe.cpp # tests/test-tokenizer-1-spm.cpp # tools/batched-bench/batched-bench.cpp # tools/cli/cli.cpp # tools/llama-bench/llama-bench.cpp # tools/server/README.md	2025-12-24 23:42:28 +08:00
Xuan-Son Nguyen	5ee4e43f26	server: return_progress to also report 0% processing state (#18305 )	2025-12-23 21:49:05 +01:00
Xuan-Son Nguyen	ddcb75dd8a	server: add auto-sleep after N seconds of idle (#18228 ) * implement sleeping at queue level * implement server-context suspend * add test * add docs * optimization: add fast path * make sure to free llama_init * nits * fix use-after-free * allow /models to be accessed during sleeping, fix use-after-free * don't allow accessing /models during sleep, it is not thread-safe * fix data race on accessing props and model_meta * small clean up * trailing whitespace * rm outdated comments	2025-12-21 02:24:42 +01:00
Concedo	050a5b1f52	Merge commit '`4aced7a631`' into concedo_experimental # Conflicts: # .devops/cann.Dockerfile # .devops/cpu.Dockerfile # .devops/cuda.Dockerfile # .devops/intel.Dockerfile # .devops/musa.Dockerfile # .devops/rocm.Dockerfile # .devops/tools.sh # .devops/vulkan.Dockerfile # .github/workflows/build.yml # .github/workflows/release.yml # .gitignore # docs/ops.md # docs/ops/SYCL.csv # examples/batched/batched.cpp # examples/eval-callback/eval-callback.cpp # examples/gen-docs/gen-docs.cpp # examples/lookahead/lookahead.cpp # examples/lookup/lookup-create.cpp # examples/lookup/lookup-stats.cpp # examples/lookup/lookup.cpp # examples/model-conversion/scripts/causal/compare-logits.py # examples/model-conversion/scripts/causal/run-org-model.py # examples/model-conversion/scripts/utils/check-nmse.py # examples/parallel/parallel.cpp # examples/retrieval/retrieval.cpp # examples/save-load-state/save-load-state.cpp # examples/speculative-simple/speculative-simple.cpp # examples/speculative/speculative.cpp # examples/training/finetune.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/repack.cpp # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/dequantize.hpp # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/mmvq.cpp # ggml/src/ggml-sycl/pad.cpp # ggml/src/ggml-sycl/ssm_conv.cpp # ggml/src/ggml-sycl/vecdotq.hpp # pyrightconfig.json # scripts/sync-ggml.last # tests/test-arg-parser.cpp # tests/test-backend-ops.cpp # tools/cvector-generator/cvector-generator.cpp # tools/imatrix/imatrix.cpp # tools/mtmd/CMakeLists.txt # tools/mtmd/clip.cpp # tools/perplexity/perplexity.cpp # tools/server/README.md	2025-12-16 23:14:12 +08:00
Georgi Gerganov	254098a279	common : refactor common_sampler + grammar logic changes (#17937 ) * common : refactor common_sampler + grammar logic changes * tests : increase max_tokens to get needed response * batched : fix uninitialized samplers	2025-12-14 10:11:13 +02:00
Concedo	17c0c8d55d	Merge branch 'upstream' into concedo_experimental # Conflicts: # README.md # docs/backend/zDNN.md # docs/build.md # docs/ops.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # src/llama-quant.cpp # tests/test-backend-ops.cpp # tools/llama-bench/llama-bench.cpp # tools/server/README.md	2025-12-07 16:48:38 +08:00
Xuan-Son Nguyen	c42712b056	server: support multiple generations from one prompt (OAI "n" option) (#17775 ) * backend support * server: support multiple generations from one prompt (OAI "n" option) * fix invalid batch * format oai * clean up * disable ctx shift * add test * update comments * fix style * add n_cmpl to docs [no ci] * allowing using both n_cmpl and n	2025-12-06 15:54:38 +01:00
Concedo	addf1d8785	Merge commit '`190c4838bd`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/release.yml # .gitignore # CMakeLists.txt # CODEOWNERS # common/CMakeLists.txt # ggml/CMakeLists.txt # ggml/src/ggml-webgpu/CMakeLists.txt # ggml/src/ggml-webgpu/ggml-webgpu.cpp # tests/.gitignore # tests/CMakeLists.txt # tests/test-backend-ops.cpp	2025-12-04 16:50:15 +08:00
Pascal	e7c2cf1356	server: add router multi-model tests (#17704 ) (#17722 ) * llama-server: add router multi-model tests (#17704) Add 4 test cases for model router: - test_router_unload_model: explicit model unloading - test_router_models_max_evicts_lru: LRU eviction with --models-max - test_router_no_models_autoload: --no-models-autoload flag behavior - test_router_api_key_required: API key authentication Tests use async model loading with polling and graceful skip when insufficient models available for eviction testing. utils.py changes: - Add models_max, models_dir, no_models_autoload attributes to ServerProcess - Handle JSONDecodeError for non-JSON error responses (fallback to text) * llama-server: update test models to new HF repos * add offline * llama-server: fix router LRU eviction test and add preloading Fix eviction test: load 2 models first, verify state, then load 3rd to trigger eviction. Previous logic loaded all 3 at once, causing first model to be evicted before verification could occur. Add module fixture to preload models via ServerPreset.load_all() and mark test presets as offline to use cached models * llama-server: fix split model download on Windows --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-12-03 15:10:37 +01:00
Concedo	03cec02a3d	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/release.yml # .github/workflows/winget.yml # CODEOWNERS # README.md # ci/run.sh # docs/build.md # docs/ops.md # docs/ops/Vulkan.csv # ggml/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # scripts/sync_vendor.py # src/CMakeLists.txt # tests/test-json-schema-to-grammar.cpp # tests/test-quantize-stats.cpp # tools/server/CMakeLists.txt # tools/server/README.md	2025-12-03 18:56:31 +08:00
Xuan-Son Nguyen	13628d8bdb	server: add --media-path for local media files (#17697 ) * server: add --media-path for local media files * remove unused fn	2025-12-02 22:49:20 +01:00
Chad Voegele	c4357dcc35	Server: Change Invalid Schema from Server Error (500) to User Error (400) (#17572 ) * Make invalid schema a user error (400) * Move invalid_argument exception handler to ex_wrapper * Fix test * Simplify test back to original pattern	2025-12-02 17:33:50 +01:00
Xuan-Son Nguyen	5d6bd842ea	server: remove default "gpt-3.5-turbo" model name (#17668 ) * server: remove default "gpt-3.5-turbo" model name * do not reflect back model name from request * fix test	2025-12-02 11:38:57 +01:00
Xuan-Son Nguyen	ec18edfcba	server: introduce API for serving / loading / unloading multiple models (#17470 ) * server: add model management and proxy * fix compile error * does this fix windows? * fix windows build * use subprocess.h, better logging * add test * fix windows * feat: Model/Router server architecture WIP * more stable * fix unsafe pointer * also allow terminate loading model * add is_active() * refactor: Architecture improvements * tmp apply upstream fix * address most problems * address thread safety issue * address review comment * add docs (first version) * address review comment * feat: Improved UX for model information, modality interactions etc * chore: update webui build output * refactor: Use only the message data `model` property for displaying model used info * chore: update webui build output * add --models-dir param * feat: New Model Selection UX WIP * chore: update webui build output * feat: Add auto-mic setting * feat: Attachments UX improvements * implement LRU * remove default model path * better --models-dir * add env for args * address review comments * fix compile * refactor: Chat Form Submit component * ad endpoint docs * Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2 Co-authored-by: Aleksander <aleksander.grygier@gmail.com> * feat: Add copy to clipboard to model name in model info dialog * feat: Model unavailable UI state for model selector * feat: Chat Form Actions UI logic improvements * feat: Auto-select model from last assistant response * chore: update webui build output * expose args and exit_code in API * add note * support extra_args on loading model * allow reusing args if auto_load * typo docs * oai-compat /models endpoint * cleaner * address review comments * feat: Use `model` property for displaying the `repo/model-name` naming format * refactor: Attachments data * chore: update webui build output * refactor: Enum imports * feat: Improve Model Selector 
responsiveness * chore: update webui build output * refactor: Cleanup * refactor: Cleanup * refactor: Formatters * chore: update webui build output * refactor: Copy To Clipboard Icon component * chore: update webui build output * refactor: Cleanup * chore: update webui build output * refactor: UI badges * chore: update webui build output * refactor: Cleanup * refactor: Cleanup * chore: update webui build output * add --models-allow-extra-args for security * nits * add stdin_file * fix merge * fix: Retrieve lost setting after resolving merge conflict * refactor: DatabaseStore -> DatabaseService * refactor: Database, Conversations & Chat services + stores architecture improvements (WIP) * refactor: Remove redundant settings * refactor: Multi-model business logic WIP * chore: update webui build output * feat: Switching models logic for ChatForm or when regenerating messges + modality detection logic * chore: update webui build output * fix: Add `untrack` inside chat processing info data logic to prevent infinite effect * fix: Regenerate * feat: Remove redundant settigns + rearrange * fix: Audio attachments * refactor: Icons * chore: update webui build output * feat: Model management and selection features WIP * chore: update webui build output * refactor: Improve server properties management * refactor: Icons * chore: update webui build output * feat: Improve model loading/unloading status updates * chore: update webui build output * refactor: Improve API header management via utility functions * remove support for extra args * set hf_repo/docker_repo as model alias when posible * refactor: Remove ConversationsService * refactor: Chat requests abort handling * refactor: Server store * tmp webui build * refactor: Model modality handling * chore: update webui build output * refactor: Processing state reactivity * fix: UI * refactor: Services/Stores syntax + logic improvements Refactors components to access stores directly instead of using exported getter functions. 
This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction. Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`. * refactor: Architecture cleanup * feat: Improve statistic badges * feat: Condition available models based on modality + better model loading strategy & UX * docs: Architecture documentation * feat: Update logic for PDF as Image * add TODO for http client * refactor: Enhance model info and attachment handling * chore: update webui build output * refactor: Components naming * chore: update webui build output * refactor: Cleanup * refactor: DRY `getAttachmentDisplayItems` function + fix UI * chore: update webui build output * fix: Modality detection improvement for text-based PDF attachments * refactor: Cleanup * docs: Add info comment * refactor: Cleanup * re * refactor: Cleanup * refactor: Cleanup * feat: Attachment logic & UI improvements * refactor: Constants * feat: Improve UI sidebar background color * chore: update webui build output * refactor: Utils imports + move types to `app.d.ts` * test: Fix Storybook mocks * chore: update webui build output * test: Update Chat Form UI tests * refactor: Tooltip Provider from core layout * refactor: Tests to separate location * decouple server_models from server_routes * test: Move demo test to tests/server * refactor: Remove redundant method * chore: update webui build output * also route anthropic endpoints * fix duplicated arg * fix invalid ptr to shutdown_handler * server : minor * rm unused fn * add ?autoload=true\|false query param * refactor: Remove redundant code * docs: Update README documentations + architecture & data flow diagrams * fix: Disable autoload on calling server props for the model * chore: update webui build output * fix ubuntu build * fix: Model status reactivity * fix: Modality 
detection for MODEL mode * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-01 19:41:04 +01:00
Concedo	0ccb298087	Merge commit '`ddf9f94389`' into concedo_experimental # Conflicts: # examples/model-conversion/scripts/causal/run-converted-model.sh # examples/model-conversion/scripts/causal/run-org-model.py # src/CMakeLists.txt # src/llama-quant.cpp # tools/server/README.md	2025-11-28 23:27:50 +08:00

119 commits