Commit graph

285 commits

Concedo
eda4a312cb Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/common.hpp
#	tests/test-backend-ops.cpp
#	tools/server/README.md
2025-11-28 13:22:02 +08:00
Xuan-Son Nguyen
e509411cf1
server: enable jinja by default, update docs (#17524)
* server: enable jinja by default, update docs

* fix tests
2025-11-27 01:02:50 +01:00
Concedo
724763fdec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	common/common.cpp
#	examples/batched/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/arch-fallback.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/server/CMakeLists.txt
2025-11-25 16:38:07 +08:00
Aaron Teo
877566d512
llama: introduce support for model-embedded sampling parameters (#17120)
2025-11-25 09:56:07 +08:00
LostRuins Concedo
5125c0b879 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/set_rows.cl
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
2025-11-11 17:10:11 +08:00
Georgi Gerganov
f914544b16
batched-bench : add "separate text gen" mode (#17103) 2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen
aa3b7a90b4
arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-08 21:54:14 +01:00
LostRuins Concedo
d6a2ad8455 still not really working right 2025-11-09 01:57:48 +08:00
LostRuins Concedo
dfb0966ed2 not working 2025-11-08 10:49:10 +08:00
LostRuins Concedo
7061cd1cc9 Merge commit 'e4a71599e5' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	tools/mtmd/clip.cpp
2025-11-08 10:28:49 +08:00
Xuan-Son Nguyen
5c9a18e674
common: move download functions to download.(cpp|h) (#17059)
* common: move download functions to download.(cpp|h)

* rm unused includes

* minor cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-07 11:23:34 +01:00
Xuan-Son Nguyen
070ff4d535
mtmd: add --image-min/max-tokens (#16921) 2025-11-03 11:11:18 +01:00
Sigbjørn Skjæret
961660b8c3
common : allow --system-prompt-file for diffusion-cli (#16903) 2025-11-01 11:01:42 +01:00
Concedo
2b00e55356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl
#	ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl
#	requirements/requirements-convert_legacy_llama.txt
#	tests/test-backend-ops.cpp
#	tests/test-rope.cpp
#	tools/server/README.md
2025-10-31 10:52:57 +08:00
Shagun Bera
835e918d84
common: fix typo in cli help text (#16864) 2025-10-30 17:47:31 +02:00
Concedo
16cbe9f24e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	docs/ops.md
#	docs/ops/SYCL.csv
#	examples/embedding/README.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/norm.hpp
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	src/llama-batch.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tools/llama-bench/README.md
2025-10-30 13:44:46 +08:00
Sam Malayek
1c1409e131
embedding: add raw option for --embd-output-format (#16541)
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
2025-10-28 12:51:41 +02:00
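For illustration, a minimal sketch of what the "raw" space-separated output described above amounts to (not the actual llama-embedding code; the helper name is hypothetical):

    #include <cstdio>
    #include <vector>

    // Print one embedding vector as plain space-separated floats,
    // with no JSON wrapping and no "embedding N:" prefix.
    static void print_embd_raw(const std::vector<float> & embd) {
        for (size_t i = 0; i < embd.size(); ++i) {
            std::printf("%s%.6f", i == 0 ? "" : " ", embd[i]);
        }
        std::printf("\n");
    }
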
Concedo
3712c6e6cd Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	requirements/requirements-convert_hf_to_gguf.txt
#	tools/imatrix/CMakeLists.txt
#	tools/run/CMakeLists.txt
2025-10-24 18:12:16 +08:00
Xuan-Son Nguyen
d0660f237a
mtmd-cli : allow using --jinja (#16718)
* mtmd-cli : allow using --jinja

* support -sys

* implement chat_history

* fix clear memory

* rm -sys support, added TODO
2025-10-23 15:00:49 +02:00
Concedo
85556118b5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/presets.hpp
2025-10-18 10:56:55 +08:00
takasurazeem
6f5d924637
common : Update the docs on -t --threads (#16236)
* Update the docs on -t --threads

* Revert "Update the docs on -t --threads"

This reverts commit eba97345e2c88d8ca510abec87d00bf6b9b0e0c2.

* docs: clarify -t/--threads parameter uses CPU threads and defaults to all available cores

* Update arg.cpp
2025-10-16 08:11:33 +03:00
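As a point of reference for the "defaults to all available cores" wording, this is the usual way such a default is derived in C++ (a sketch, not necessarily the exact llama.cpp logic):

    #include <algorithm>
    #include <thread>

    // hardware_concurrency() may return 0 when the core count is unknown,
    // so fall back to a single thread in that case.
    unsigned int n_threads_default = std::max(1u, std::thread::hardware_concurrency());
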
Concedo
7e7da2583e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
2025-10-12 16:42:51 +08:00
Georgi Gerganov
4b2dae383d
common : update presets (#16504)
* presets : add --embd-gemma-default and remove old embedding presets

* presets : add gpt-oss presets

* presets : add vision presets

* cont : remove reasoning overrides [no ci]

* cont : fix batch size for embedding gemma [no ci]
2025-10-12 09:29:13 +03:00
Concedo
6d8f8cd65b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/CMakeLists.txt
2025-10-11 10:01:43 +08:00
Georgi Gerganov
d00cbea63c
server : host-memory prompt caching (#16391)
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-09 18:54:51 +03:00
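For context on the prompt-similarity step mentioned above, one plausible way to score how much of an incoming prompt matches a cached one is by shared token prefix; this is an illustrative assumption, not necessarily the metric the server uses:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Fraction of the incoming prompt's tokens covered by the longest
    // common prefix with a cached prompt (illustrative only).
    static float prefix_similarity(const std::vector<int> & cached,
                                   const std::vector<int> & incoming) {
        const size_t n = std::min(cached.size(), incoming.size());
        size_t common = 0;
        while (common < n && cached[common] == incoming[common]) {
            ++common;
        }
        return incoming.empty() ? 0.0f : float(common) / float(incoming.size());
    }
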
Concedo
5b6ba02167 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/requirements.txt
#	examples/model-conversion/scripts/embedding/convert-model.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.h
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/softmax.hpp
#	requirements/requirements-all.txt
#	tests/test-chat-parser.cpp
#	tools/server/README.md
2025-10-09 23:46:56 +08:00
Pascal
12bbc3fa50
refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2025-10-08 23:18:41 +03:00
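To make the reasoning/content split concrete, here is a deliberately simplified, non-streaming sketch; the parser described above additionally handles partial <think> tags and multiple reasoning segments incrementally during streaming:

    #include <string>

    struct parsed_msg {
        std::string reasoning_content; // text inside <think>...</think>
        std::string content;           // everything outside the tags
    };

    static parsed_msg split_reasoning(const std::string & text) {
        parsed_msg out;
        const std::string open_tag  = "<think>";
        const std::string close_tag = "</think>";
        const size_t b = text.find(open_tag);
        const size_t e = text.find(close_tag);
        if (b != std::string::npos && e != std::string::npos && e > b) {
            out.reasoning_content = text.substr(b + open_tag.size(), e - b - open_tag.size());
            out.content           = text.substr(0, b) + text.substr(e + close_tag.size());
        } else {
            out.content = text;
        }
        return out;
    }
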
Concedo
b6f6338bba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	CODEOWNERS
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-webgpu/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/rpc/README.md
#	tools/server/README.md
2025-10-09 01:33:27 +08:00
Georgi Gerganov
ef4c5b87ea
presets : fix pooling param for embedding models (#16455) 2025-10-07 10:32:32 +03:00
Gadflyii
3df2244df4
llama : add --no-host to disable host buffers (#16310)
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-10-06 19:55:53 +02:00
Concedo
c83dde8a34 not working commit, need to fix vulkan shaders gen 2025-10-05 11:32:50 +08:00
Concedo
1d728bbc89 Merge commit '128d522c04' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-alloc.cpp
#	tests/test-chat.cpp
2025-10-04 23:51:22 +08:00
Radoslav Gerganov
898acba681
rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-04 12:49:16 +03:00
ddh0
f6dcda3900
server : context checkpointing for hybrid and recurrent models (#16382)
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-03 21:34:51 +03:00
Concedo
1731a3212c Merge commit 'ded67b9444' into concedo_experimental
# Conflicts:
#	.devops/rocm.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.github/workflows/release.yml
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/arg.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/get_rows.cl
#	ggml/src/ggml-opencl/kernels/pad.cl
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
#	tools/run/run.cpp
2025-10-03 16:15:27 +08:00
Adrien Gallouët
4201deae9c
common: introduce http.h for httplib-based client (#16373)
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-10-01 20:22:18 +03:00
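For readers unfamiliar with cpp-httplib, the client pattern being wrapped looks roughly like this (illustrative only; the actual common/http.h interface is not shown in the commit message):

    #include "httplib.h"
    #include <string>

    // Fetch a path from a host and return the body on HTTP 200.
    // An https:// host requires a TLS-enabled httplib build.
    static bool http_fetch(const std::string & host, const std::string & path, std::string & body) {
        httplib::Client cli(host);
        httplib::Headers headers = { { "User-Agent", "llama-cpp" } };
        auto res = cli.Get(path, headers);
        if (!res || res->status != 200) {
            return false;
        }
        body = res->body;
        return true;
    }
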
Adrien Gallouët
bf6f3b3a19
common : disable progress bar without a tty (#16352)
* common : disable progress bar without a tty

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing headers

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-30 20:52:41 +03:00
Adrien Gallouët
364a7a6d4a
common : remove common_has_curl() (#16351)
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-30 17:39:44 +03:00
Concedo
20c802a198 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CODEOWNERS
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2025-09-30 22:28:53 +08:00
Concedo
2201ddb759 fix tool builds 2025-09-30 16:29:11 +08:00
Adrien Gallouët
3c62aed89f
common : simplify etag tracking by removing json (#16342)
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-30 10:36:33 +03:00
Concedo
b120e107f9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.clang-tidy
#	.devops/musa.Dockerfile
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.gitignore
#	CODEOWNERS
#	CONTRIBUTING.md
#	README.md
#	build-xcframework.sh
#	ci/README-MUSA.md
#	ci/run.sh
#	common/CMakeLists.txt
#	docs/docker.md
#	examples/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/scripts/causal/compare-logits.py
#	examples/model-conversion/scripts/causal/run-org-model.py
#	examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/check-nmse.py
#	examples/model-conversion/scripts/utils/inspect-org-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/CMakeLists.txt
#	ggml/include/ggml-zdnn.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/set_rows.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-tokenizers-repo.sh
#	tools/perplexity/perplexity.cpp
#	tools/server/tests/README.md
2025-09-27 17:09:14 +08:00
Adrien Gallouët
b995a10760
common : use cpp-httplib as a cURL alternative for downloads (#16185)
* vendor : update httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : use cpp-httplib as a cURL alternative for downloads

The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : Bump to Windows 10

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-26 14:12:19 +03:00
Concedo
efe546390b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CODEOWNERS
#	CONTRIBUTING.md
#	README.md
#	ci/run.sh
#	examples/embedding/README.md
#	tests/test-backend-ops.cpp
2025-09-22 21:25:29 +08:00
Adrien Gallouët
37a23c17bd
common : enable --offline mode without curl support (#16137)
* common : use the json parser

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : enable --offline mode without CURL support

This change refactors the download logic to properly support offline mode
even when the project is built without CURL.

Without this commit, using `--offline` would give the following error:

    error: built without CURL, cannot download model from the internet

even if all the files are already cached.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-22 15:13:51 +03:00
Haiyue Wang
d05affbab7
common : remove unused local variables (#16140)
These two local variables 'arg' and 'arg_prefix' have been overridden (shadowed) by:

  1. for (const auto & arg : opt.args)

  2. for (int i = 1; i < argc; i++) {
        const std::string arg_prefix = "--";

        std::string arg = argv[i];
2025-09-22 11:48:42 +03:00
Concedo
0dc6b9f418 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/amx/amx.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/set_rows.wgsl
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
2025-09-21 11:38:47 +08:00
Concedo
3e72aaff5b Merge commit '8f8f2274ee' into concedo_experimental
# Conflicts:
#	.devops/rocm.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	examples/simple/simple.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/kernels/tsembd.cl
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/binbcast.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	src/llama-model.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/cvector-generator/CMakeLists.txt
#	tools/export-lora/CMakeLists.txt
#	tools/gguf-split/CMakeLists.txt
#	tools/imatrix/CMakeLists.txt
#	tools/llama-bench/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/main/CMakeLists.txt
#	tools/main/README.md
#	tools/mtmd/CMakeLists.txt
#	tools/perplexity/CMakeLists.txt
#	tools/perplexity/perplexity.cpp
#	tools/quantize/CMakeLists.txt
#	tools/rpc/rpc-server.cpp
#	tools/run/CMakeLists.txt
#	tools/run/run.cpp
#	tools/tokenize/CMakeLists.txt
#	tools/tts/CMakeLists.txt
2025-09-21 08:58:23 +08:00
Eric Curtin
4ca088b036
Add resumable downloads for llama-server model loading (#15963)
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-09-18 16:22:50 +01:00
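A hedged sketch of the resume flow described above (the function name and details are assumptions, not the actual common_download_file_single code): if a .downloadInProgress file exists and the server has been confirmed to accept Range requests, ask only for the remaining bytes and append to the partial file.

    #include "httplib.h"
    #include <cstdio>
    #include <filesystem>
    #include <string>

    static bool download_resumable(httplib::Client & cli, const std::string & path,
                                   const std::string & out_file, bool server_supports_ranges) {
        const std::string part = out_file + ".downloadInProgress";

        // Resume only when a partial file exists and the Accept-Ranges
        // check (done beforehand) confirmed the server supports it.
        size_t have = 0;
        if (server_supports_ranges && std::filesystem::exists(part)) {
            have = std::filesystem::file_size(part);
        }

        httplib::Headers headers;
        if (have > 0) {
            headers.emplace("Range", "bytes=" + std::to_string(have) + "-");
        }

        // Append when resuming, truncate when starting fresh.
        std::FILE * f = std::fopen(part.c_str(), have > 0 ? "ab" : "wb");
        if (!f) {
            return false;
        }
        auto res = cli.Get(path, headers, [&](const char * data, size_t len) {
            return std::fwrite(data, 1, len, f) == len;
        });
        std::fclose(f);

        // 206 = partial content (resumed), 200 = full body (no resume).
        if (!res || (res->status != 206 && res->status != 200)) {
            return false;
        }
        std::filesystem::rename(part, out_file);
        return true;
    }
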
jacekpoplawski
8ff206097c
llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
2025-09-16 16:17:08 +02:00