Commit graph

285 commits

Concedo
eda4a312cb Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/common.hpp
#	tests/test-backend-ops.cpp
#	tools/server/README.md
2025-11-28 13:22:02 +08:00
Xuan-Son Nguyen
e509411cf1
server: enable jinja by default, update docs (#17524)
* server: enable jinja by default, update docs

* fix tests
2025-11-27 01:02:50 +01:00
Concedo
724763fdec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	common/common.cpp
#	examples/batched/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/arch-fallback.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/server/CMakeLists.txt
2025-11-25 16:38:07 +08:00
Aaron Teo
877566d512
llama: introduce support for model-embedded sampling parameters (#17120)
2025-11-25 09:56:07 +08:00
LostRuins Concedo
5125c0b879 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/set_rows.cl
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
2025-11-11 17:10:11 +08:00
Georgi Gerganov
f914544b16
batched-bench : add "separate text gen" mode (#17103) 2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen
aa3b7a90b4
arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-08 21:54:14 +01:00
LostRuins Concedo
d6a2ad8455 still not really working right 2025-11-09 01:57:48 +08:00
LostRuins Concedo
dfb0966ed2 not working 2025-11-08 10:49:10 +08:00
LostRuins Concedo
7061cd1cc9 Merge commit 'e4a71599e5' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	tools/mtmd/clip.cpp
2025-11-08 10:28:49 +08:00
Xuan-Son Nguyen
5c9a18e674
common: move download functions to download.(cpp|h) (#17059)
* common: move download functions to download.(cpp|h)

* rm unused includes

* minor cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-07 11:23:34 +01:00
Xuan-Son Nguyen
070ff4d535
mtmd: add --image-min/max-tokens (#16921) 2025-11-03 11:11:18 +01:00
Sigbjørn Skjæret
961660b8c3
common : allow --system-prompt-file for diffusion-cli (#16903) 2025-11-01 11:01:42 +01:00
Concedo
2b00e55356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl
#	ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl
#	requirements/requirements-convert_legacy_llama.txt
#	tests/test-backend-ops.cpp
#	tests/test-rope.cpp
#	tools/server/README.md
2025-10-31 10:52:57 +08:00
Shagun Bera
835e918d84
common: fix typo in cli help text (#16864) 2025-10-30 17:47:31 +02:00
Concedo
16cbe9f24e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	docs/ops.md
#	docs/ops/SYCL.csv
#	examples/embedding/README.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/norm.hpp
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	src/llama-batch.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tools/llama-bench/README.md
2025-10-30 13:44:46 +08:00
Sam Malayek
1c1409e131
embedding: add raw option for --embd-output-format (#16541)
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
2025-10-28 12:51:41 +02:00
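For illustration, a minimal sketch of what the "raw" space-separated output described above amounts to (not the actual llama-embedding code; the helper name is hypothetical):

    #include <cstdio>
    #include <vector>

    // Print one embedding vector as plain space-separated floats,
    // with no JSON wrapping and no "embedding N:" prefix.
    static void print_embd_raw(const std::vector<float> & embd) {
        for (size_t i = 0; i < embd.size(); ++i) {
            std::printf("%s%.6f", i == 0 ? "" : " ", embd[i]);
        }
        std::printf("\n");
    }
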
Concedo
3712c6e6cd Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	requirements/requirements-convert_hf_to_gguf.txt
#	tools/imatrix/CMakeLists.txt
#	tools/run/CMakeLists.txt
2025-10-24 18:12:16 +08:00
Xuan-Son Nguyen
d0660f237a
mtmd-cli : allow using --jinja (#16718)
* mtmd-cli : allow using --jinja

* support -sys

* implement chat_history

* fix clear memory

* rm -sys support, added TODO
2025-10-23 15:00:49 +02:00
Concedo
85556118b5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/presets.hpp
2025-10-18 10:56:55 +08:00
takasurazeem
6f5d924637
common : Update the docs on -t --threads (#16236)
* Update the docs on -t --threads

* Revert "Update the docs on -t --threads"

This reverts commit eba97345e2c88d8ca510abec87d00bf6b9b0e0c2.

* docs: clarify -t/--threads parameter uses CPU threads and defaults to all available cores

* Update arg.cpp
2025-10-16 08:11:33 +03:00
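As a point of reference for the "defaults to all available cores" wording, this is the usual way such a default is derived in C++ (a sketch, not necessarily the exact llama.cpp logic):

    #include <algorithm>
    #include <thread>

    // hardware_concurrency() may return 0 when the core count is unknown,
    // so fall back to a single thread in that case.
    unsigned int n_threads_default = std::max(1u, std::thread::hardware_concurrency());
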
Concedo
7e7da2583e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
2025-10-12 16:42:51 +08:00
Georgi Gerganov
4b2dae383d
common : update presets (#16504)
* presets : add --embd-gemma-default and remove old embedding presets

* presets : add gpt-oss presets

* presets : add vision presets

* cont : remove reasoning overrides [no ci]

* cont : fix batch size for embedding gemma [no ci]
2025-10-12 09:29:13 +03:00
Concedo
6d8f8cd65b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/CMakeLists.txt
2025-10-11 10:01:43 +08:00
Georgi Gerganov
d00cbea63c
server : host-memory prompt caching (#16391)
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-09 18:54:51 +03:00
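For context on the prompt-similarity step mentioned above, one plausible way to score how much of an incoming prompt matches a cached one is by shared token prefix; this is an illustrative assumption, not necessarily the metric the server uses:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Fraction of the incoming prompt's tokens covered by the longest
    // common prefix with a cached prompt (illustrative only).
    static float prefix_similarity(const std::vector<int> & cached,
                                   const std::vector<int> & incoming) {
        const size_t n = std::min(cached.size(), incoming.size());
        size_t common = 0;
        while (common < n && cached[common] == incoming[common]) {
            ++common;
        }
        return incoming.empty() ? 0.0f : float(common) / float(incoming.size());
    }
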
Concedo
5b6ba02167 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/requirements.txt
#	examples/model-conversion/scripts/embedding/convert-model.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.h
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/softmax.hpp
#	requirements/requirements-all.txt
#	tests/test-chat-parser.cpp
#	tools/server/README.md
2025-10-09 23:46:56 +08:00
Pascal
12bbc3fa50
refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2025-10-08 23:18:41 +03:00
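To make the reasoning/content split concrete, here is a deliberately simplified, non-streaming sketch; the parser described above additionally handles partial <think> tags and multiple reasoning segments incrementally during streaming:

    #include <string>

    struct parsed_msg {
        std::string reasoning_content; // text inside <think>...</think>
        std::string content;           // everything outside the tags
    };

    static parsed_msg split_reasoning(const std::string & text) {
        parsed_msg out;
        const std::string open_tag  = "<think>";
        const std::string close_tag = "</think>";
        const size_t b = text.find(open_tag);
        const size_t e = text.find(close_tag);
        if (b != std::string::npos && e != std::string::npos && e > b) {
            out.reasoning_content = text.substr(b + open_tag.size(), e - b - open_tag.size());
            out.content           = text.substr(0, b) + text.substr(e + close_tag.size());
        } else {
            out.content = text;
        }
        return out;
    }
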
Concedo
b6f6338bba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	CODEOWNERS
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-webgpu/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/rpc/README.md
#	tools/server/README.md
2025-10-09 01:33:27 +08:00
Georgi Gerganov
ef4c5b87ea
presets : fix pooling param for embedding models (#16455) 2025-10-07 10:32:32 +03:00
Gadflyii
3df2244df4
llama : add --no-host to disable host buffers (#16310)
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-10-06 19:55:53 +02:00
Concedo
c83dde8a34 not working commit, need to fix vulkan shaders gen 2025-10-05 11:32:50 +08:00
Concedo
1d728bbc89 Merge commit '128d522c04' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-alloc.cpp
#	tests/test-chat.cpp
2025-10-04 23:51:22 +08:00
Radoslav Gerganov
898acba681
rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-04 12:49:16 +03:00
ddh0
f6dcda3900
server : context checkpointing for hybrid and recurrent models (#16382)
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-03 21:34:51 +03:00
Concedo
1731a3212c Merge commit 'ded67b9444' into concedo_experimental
# Conflicts:
#	.devops/rocm.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.github/workflows/release.yml
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/arg.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/get_rows.cl
#	ggml/src/ggml-opencl/kernels/pad.cl
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
#	tools/run/run.cpp
2025-10-03 16:15:27 +08:00
Adrien Gallouët
4201deae9c
common: introduce http.h for httplib-based client (#16373)
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-10-01 20:22:18 +03:00
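For readers unfamiliar with cpp-httplib, the client pattern being wrapped looks roughly like this (illustrative only; the actual common/http.h interface is not shown in the commit message):

    #include "httplib.h"
    #include <string>

    // Fetch a path from a host and return the body on HTTP 200.
    // An https:// host requires a TLS-enabled httplib build.
    static bool http_fetch(const std::string & host, const std::string & path, std::string & body) {
        httplib::Client cli(host);
        httplib::Headers headers = { { "User-Agent", "llama-cpp" } };
        auto res = cli.Get(path, headers);
        if (!res || res->status != 200) {
            return false;
        }
        body = res->body;
        return true;
    }
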
Adrien Gallouët
bf6f3b3a19
common : disable progress bar without a tty (#16352)
* common : disable progress bar without a tty

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing headers

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-30 20:52:41 +03:00
Adrien Gallouët
364a7a6d4a
common : remove common_has_curl() (#16351)
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-30 17:39:44 +03:00
Concedo
20c802a198 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CODEOWNERS
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2025-09-30 22:28:53 +08:00
Concedo
2201ddb759 fix tool builds 2025-09-30 16:29:11 +08:00
Adrien Gallouët
3c62aed89f
common : simplify etag tracking by removing json (#16342)
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-30 10:36:33 +03:00
Concedo
b120e107f9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.clang-tidy
#	.devops/musa.Dockerfile
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.gitignore
#	CODEOWNERS
#	CONTRIBUTING.md
#	README.md
#	build-xcframework.sh
#	ci/README-MUSA.md
#	ci/run.sh
#	common/CMakeLists.txt
#	docs/docker.md
#	examples/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/scripts/causal/compare-logits.py
#	examples/model-conversion/scripts/causal/run-org-model.py
#	examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/check-nmse.py
#	examples/model-conversion/scripts/utils/inspect-org-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/CMakeLists.txt
#	ggml/include/ggml-zdnn.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/set_rows.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-tokenizers-repo.sh
#	tools/perplexity/perplexity.cpp
#	tools/server/tests/README.md
2025-09-27 17:09:14 +08:00
Adrien Gallouët
b995a10760
common : use cpp-httplib as a cURL alternative for downloads (#16185)
* vendor : update httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : use cpp-httplib as a cURL alternative for downloads

The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : Bump to Windows 10

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-26 14:12:19 +03:00
Concedo
efe546390b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CODEOWNERS
#	CONTRIBUTING.md
#	README.md
#	ci/run.sh
#	examples/embedding/README.md
#	tests/test-backend-ops.cpp
2025-09-22 21:25:29 +08:00
Adrien Gallouët
37a23c17bd
common : enable --offline mode without curl support (#16137)
* common : use the json parser

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : enable --offline mode without CURL support

This change refactors the download logic to properly support offline mode
even when the project is built without CURL.

Without this commit, using `--offline` would give the following error:

    error: built without CURL, cannot download model from the internet

even if all the files are already cached.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-22 15:13:51 +03:00
Haiyue Wang
d05affbab7
common : remove unused local variables (#16140)
These two local variables 'arg' and 'arg_prefix' have been overridden (shadowed) by:

  1. for (const auto & arg : opt.args)

  2. for (int i = 1; i < argc; i++) {
        const std::string arg_prefix = "--";

        std::string arg = argv[i];
2025-09-22 11:48:42 +03:00
Concedo
0dc6b9f418 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/amx/amx.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/set_rows.wgsl
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
2025-09-21 11:38:47 +08:00
Concedo
3e72aaff5b Merge commit '8f8f2274ee' into concedo_experimental
# Conflicts:
#	.devops/rocm.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	examples/simple/simple.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/kernels/tsembd.cl
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/binbcast.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	src/llama-model.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/cvector-generator/CMakeLists.txt
#	tools/export-lora/CMakeLists.txt
#	tools/gguf-split/CMakeLists.txt
#	tools/imatrix/CMakeLists.txt
#	tools/llama-bench/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/main/CMakeLists.txt
#	tools/main/README.md
#	tools/mtmd/CMakeLists.txt
#	tools/perplexity/CMakeLists.txt
#	tools/perplexity/perplexity.cpp
#	tools/quantize/CMakeLists.txt
#	tools/rpc/rpc-server.cpp
#	tools/run/CMakeLists.txt
#	tools/run/run.cpp
#	tools/tokenize/CMakeLists.txt
#	tools/tts/CMakeLists.txt
2025-09-21 08:58:23 +08:00
Eric Curtin
4ca088b036
Add resumable downloads for llama-server model loading (#15963)
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-09-18 16:22:50 +01:00
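A hedged sketch of the resume flow described above (the function name and details are assumptions, not the actual common_download_file_single code): if a .downloadInProgress file exists and the server has been confirmed to accept Range requests, ask only for the remaining bytes and append to the partial file.

    #include "httplib.h"
    #include <cstdio>
    #include <filesystem>
    #include <string>

    static bool download_resumable(httplib::Client & cli, const std::string & path,
                                   const std::string & out_file, bool server_supports_ranges) {
        const std::string part = out_file + ".downloadInProgress";

        // Resume only when a partial file exists and the Accept-Ranges
        // check (done beforehand) confirmed the server supports it.
        size_t have = 0;
        if (server_supports_ranges && std::filesystem::exists(part)) {
            have = std::filesystem::file_size(part);
        }

        httplib::Headers headers;
        if (have > 0) {
            headers.emplace("Range", "bytes=" + std::to_string(have) + "-");
        }

        // Append when resuming, truncate when starting fresh.
        std::FILE * f = std::fopen(part.c_str(), have > 0 ? "ab" : "wb");
        if (!f) {
            return false;
        }
        auto res = cli.Get(path, headers, [&](const char * data, size_t len) {
            return std::fwrite(data, 1, len, f) == len;
        });
        std::fclose(f);

        // 206 = partial content (resumed), 200 = full body (no resume).
        if (!res || (res->status != 206 && res->status != 200)) {
            return false;
        }
        std::filesystem::rename(part, out_file);
        return true;
    }
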
jacekpoplawski
8ff206097c
llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
2025-09-16 16:17:08 +02:00