Commit graph

884 commits

Concedo
724763fdec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	common/common.cpp
#	examples/batched/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/arch-fallback.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/server/CMakeLists.txt
2025-11-25 16:38:07 +08:00
Aaron Teo
877566d512
llama: introduce support for model-embedded sampling parameters (#17120)
2025-11-25 09:56:07 +08:00
Concedo
5248838a05 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/cann.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	.gitignore
#	README.md
#	common/CMakeLists.txt
#	docs/ops.md
#	docs/ops/Vulkan.csv
#	examples/eval-callback/eval-callback.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/arch/x86/repack.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	scripts/sync-ggml.last
#	src/llama-grammar.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/server/CMakeLists.txt
2025-11-22 18:26:13 +08:00
Georgi Gerganov
196f5083ef
common : more accurate sampling timing (#17382)
* common : more accurate sampling timing

* eval-callback : minor fixes

* cont : add time_meas impl

* cont : fix log msg [no ci]

* cont : fix multiple definitions of time_meas

* llama-cli : exclude chat template init from time measurement

* cont : print percentage of unaccounted time

* cont : do not reset timings
2025-11-20 13:40:10 +02:00
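The commit above mentions adding a `time_meas` implementation for sampling timing. As a rough illustration, an RAII helper in that spirit can accumulate elapsed time into a counter only for the scope being measured (a minimal sketch; names and fields are illustrative, not the exact upstream code):

```cpp
#include <chrono>
#include <cstdint>

struct time_meas {
    // accumulate elapsed microseconds into t_acc_us on destruction;
    // `enabled` lets the measurement be skipped at runtime
    time_meas(int64_t & t_acc_us, bool enabled = true)
        : t_acc(t_acc_us), t_start(std::chrono::steady_clock::now()), enabled(enabled) {}

    ~time_meas() {
        if (enabled) {
            const auto t_end = std::chrono::steady_clock::now();
            t_acc += std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count();
        }
    }

    int64_t & t_acc;
    const std::chrono::steady_clock::time_point t_start;
    const bool enabled;
};

// usage: scope the timer tightly around the sampling call so only that
// region is counted, e.g.
//   int64_t t_sampling_us = 0;
//   { time_meas tm(t_sampling_us); /* sample here */ }
```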
Xuan-Son Nguyen
10e9780154
chat: fix int overflow, prevent size calculation in float/double (#17357)
* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-18 19:11:53 +01:00
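The class of bug named in this commit title is worth a sketch: a size computed in 32-bit ints can overflow before it is widened, and doing the arithmetic in float/double loses precision for large values. A hedged illustration (not the actual chat.cpp code):

```cpp
#include <cstddef>
#include <cstdint>

size_t buf_size_bad(int n_tokens, int n_embd) {
    // n_tokens * n_embd is evaluated as int * int and may overflow
    // before the result is widened by the size_t multiplication
    return n_tokens * n_embd * sizeof(float);
}

size_t buf_size_good(int64_t n_tokens, int64_t n_embd) {
    // widen first and stay in integer arithmetic throughout
    return (size_t) (n_tokens * n_embd) * sizeof(float);
}
```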
hksdpc255
1920345c3b
common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932)
* Add files via upload

* fix unit test

* fix crashes for --reasoning-format=none

* Patch buggy official MiniMax-M2 chat template

* add upstream minja fix: https://github.com/ochafik/minja/pull/7

* Fix <think> token not generated

* add test copied from https://github.com/ggml-org/llama.cpp/pull/16946

* cleanup

* Hopefully fix the compilation error on CI

* Delete chat template patching since it’s fixed by upstream Minja

* Remove unneeded MiniMax-M2 template patch

https://github.com/ochafik/minja/pull/7#issuecomment-3480356100

* Add proper handling of optional parameters with test
merged tests from: https://github.com/ggml-org/llama.cpp/pull/16946/commits/23d4bb75c485c12ac89f81c424dc03c87a640e8c

* Fix making all tool parameters optional

* Move xml tool parser to separate file

* cleanup & add tests for GLM4.5

* add streaming tests & enhancement & cleanups

Add streaming test for both GLM 4.5 and minimax-m2.
Cleanup for preserved_tokens.
Cleanup for grammar rule name.
Enhance the parser's stability.

* cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo

* apply suggestions from reviewers

* fix a misuse for data.grammar_lazy

* fix grammar when a tool has no arguments

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* Try fix Jinja template for GLM

* Try fix GLM-4.6.jinja

* Update common/chat-parser-xml-toolcall.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking

* Improve Kimi-K2 chat template

* Fix unit test

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-18 18:54:15 +01:00
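One core difficulty in the streaming tool-call parsing this PR adds is that a chunk can end in the middle of a tag like `<tool_call>`, so the parser must hold back a possible tag prefix instead of emitting it as content. A minimal sketch of that one sub-problem (illustrative only; the real parser lives in common/chat-parser-xml-toolcall.cpp):

```cpp
#include <algorithm>
#include <string>

// return the length of the longest suffix of `text` that is a
// proper prefix of `tag` (0 if the chunk cannot end mid-tag)
static size_t partial_tag_suffix(const std::string & text, const std::string & tag) {
    const size_t max_len = std::min(text.size(), tag.size() - 1);
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, tag, 0, len) == 0) {
            return len;
        }
    }
    return 0;
}

// usage: emit the chunk minus the held-back suffix, then prepend the
// suffix to the next chunk before parsing it, e.g.
//   size_t held = partial_tag_suffix(chunk, "<tool_call>");
//   emit(chunk.substr(0, chunk.size() - held));
```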
LostRuins Concedo
3fe0e39b62 Merge commit '4dca015b7e' into concedo_experimental
# Conflicts:
#	.github/copilot-instructions.md
#	README.md
#	docs/ops.md
#	docs/ops/CPU.csv
#	docs/ops/CUDA.csv
#	docs/ops/Vulkan.csv
#	ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-11-16 18:33:58 +08:00
Xuan-Son Nguyen
9b17d74ab7
mtmd: add mtmd_log_set (#17268) 2025-11-14 15:56:19 +01:00
LostRuins Concedo
26e9090088 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/cann.Dockerfile
#	.devops/cpu.Dockerfile
#	.devops/cuda.Dockerfile
#	.devops/intel.Dockerfile
#	.devops/musa.Dockerfile
#	.devops/nix/package.nix
#	.devops/rocm.Dockerfile
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	build-xcframework.sh
#	ci/run.sh
#	common/CMakeLists.txt
#	common/download.cpp
#	docs/backend/CANN.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.h
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/sync_vendor.py
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-rope.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/rpc/CMakeLists.txt
#	tools/server/CMakeLists.txt
2025-11-13 15:45:50 +08:00
Adrien Gallouët
52cf111b31
cmake : cleanup (#17199) 2025-11-12 14:48:30 +02:00
Adrien Gallouët
78010a0d52
cmake : move OpenSSL linking to vendor/cpp-httplib (#17177)
* cmake : move OpenSSL linking to vendor/cpp-httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* bring back httplib 0.27.0

* add -DLLAMA_HTTPLIB

* update cmake config for visionos

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-11-12 12:32:50 +01:00
Xuan-Son Nguyen
1d45b4228f
vendor: split httplib to cpp/h files (#17150)
* vendor: split httplib to cpp/h files

* move defines

* include httplib if curl is not used

* add TODO

* fix build ios

* fix build visionos instead
2025-11-11 13:32:58 +01:00
LostRuins Concedo
5125c0b879 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/set_rows.cl
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
2025-11-11 17:10:11 +08:00
Georgi Gerganov
f914544b16
batched-bench : add "separate text gen" mode (#17103) 2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen
aa3b7a90b4
arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-08 21:54:14 +01:00
LostRuins Concedo
d6a2ad8455 still not really working right 2025-11-09 01:57:48 +08:00
LostRuins Concedo
dfb0966ed2 not working 2025-11-08 10:49:10 +08:00
LostRuins Concedo
fdcb281a3a Merge commit '2f966b8ed8' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	docs/docker.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-thread-safety.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/mtmd/clip.cpp
2025-11-08 10:34:17 +08:00
LostRuins Concedo
7061cd1cc9 Merge commit 'e4a71599e5' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	tools/mtmd/clip.cpp
2025-11-08 10:28:49 +08:00
Xuan-Son Nguyen
5c9a18e674
common: move download functions to download.(cpp|h) (#17059)
* common: move download functions to download.(cpp|h)

* rm unused includes

* minor cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-07 11:23:34 +01:00
Georgi Gerganov
13b339bcd9
server : do not default to multiple slots with speculative decoding (#17017)
* server : do not default to multiple slots with speculative decoding

* cont : fix
2025-11-05 14:32:55 +02:00
Xuan-Son Nguyen
070ff4d535
mtmd: add --image-min/max-tokens (#16921) 2025-11-03 11:11:18 +01:00
Aldehir Rojas
87c9efc3b2
common : move gpt-oss reasoning processing to init params (#16937) 2025-11-02 16:56:28 +02:00
Sigbjørn Skjæret
961660b8c3
common : allow --system-prompt-file for diffusion-cli (#16903) 2025-11-01 11:01:42 +01:00
Concedo
2b00e55356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl
#	ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl
#	requirements/requirements-convert_legacy_llama.txt
#	tests/test-backend-ops.cpp
#	tests/test-rope.cpp
#	tools/server/README.md
2025-10-31 10:52:57 +08:00
Shagun Bera
835e918d84
common: fix typo in cli help text (#16864) 2025-10-30 17:47:31 +02:00
Concedo
16cbe9f24e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	docs/ops.md
#	docs/ops/SYCL.csv
#	examples/embedding/README.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/norm.hpp
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	src/llama-batch.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tools/llama-bench/README.md
2025-10-30 13:44:46 +08:00
Sam Malayek
1c1409e131
embedding: add raw option for --embd-output-format (#16541)
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
2025-10-28 12:51:41 +02:00
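The `raw` format this commit adds amounts to one embedding per line as space-separated floats, with no JSON wrapper. A self-contained sketch of that output shape (the upstream change routes this through LOG rather than printf; names here are illustrative):

```cpp
#include <cstdio>
#include <vector>

static void print_raw_embeddings(const std::vector<std::vector<float>> & embds) {
    for (const auto & embd : embds) {
        for (size_t i = 0; i < embd.size(); ++i) {
            printf(i == 0 ? "%.6f" : " %.6f", embd[i]);
        }
        printf("\n"); // one vector per line, easy to pipe into downstream tools
    }
}
```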
Aldehir Rojas
280d97be96
grammar : support array references in json schema (#16792)
* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-28 09:37:52 +01:00
Yuri Khrustalev
c053e18a66
chat: Add LFM2 tool handling (#16763)
* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev
2025-10-27 23:54:01 +01:00
Concedo
3712c6e6cd Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	requirements/requirements-convert_hf_to_gguf.txt
#	tools/imatrix/CMakeLists.txt
#	tools/run/CMakeLists.txt
2025-10-24 18:12:16 +08:00
Xuan-Son Nguyen
d0660f237a
mtmd-cli : allow using --jinja (#16718)
* mtmd-cli : allow using --jinja

* support -sys

* implement chat_history

* fix clear memory

* rm -sys support, added TODO
2025-10-23 15:00:49 +02:00
Concedo
f47a0690ac Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/ops.md
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tools/rpc/rpc-server.cpp
2025-10-18 11:10:37 +08:00
Concedo
85556118b5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/presets.hpp
2025-10-18 10:56:55 +08:00
Olivier Chafik
79967ec596
grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626) 2025-10-17 08:59:31 +03:00
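The overflow this title refers to arises because JSON-schema `minimum`/`maximum` bounds can exceed the 32-bit range. A hedged sketch of why the conversion arithmetic must be done in 64 bits (illustrative, not the actual conversion code):

```cpp
#include <cstdint>

// width of an inclusive integer range from a schema; with 32-bit ints both
// the bounds themselves and the subtraction could overflow. Assumes
// min_value <= max_value (the full 64-bit range wraps to 0 here).
uint64_t range_width(int64_t min_value, int64_t max_value) {
    return (uint64_t) max_value - (uint64_t) min_value + 1;
}
```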
takasurazeem
6f5d924637
common : Update the docs on -t --threads (#16236)
* Update the docs on -t --threads

* Revert "Update the docs on -t --threads"

This reverts commit eba97345e2c88d8ca510abec87d00bf6b9b0e0c2.

* docs: clarify -t/--threads parameter uses CPU threads and defaults to all available cores

* Update arg.cpp
2025-10-16 08:11:33 +03:00
Concedo
1ff97f8a00 Merge commit '5016b72862' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/binbcast.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-chat-parser.cpp
#	tests/test-json-partial.cpp
2025-10-16 12:05:21 +08:00
Aldehir Rojas
2c301e91ab
common : handle unicode during partial json parsing (#16526)
* common : handle unicode during partial json parsing

* common : set missing `ensure_ascii = true` during json dump
2025-10-12 16:18:47 +03:00
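The second bullet of this commit concerns JSON serialization of non-ASCII text. With nlohmann::json (the library vendored by llama.cpp), `dump()` takes an `ensure_ascii` flag that escapes all non-ASCII characters as \uXXXX, so byte-oriented consumers never see partial UTF-8 sequences. A small sketch:

```cpp
#include <nlohmann/json.hpp>
#include <cstdio>

int main() {
    nlohmann::json j = { {"content", "caf\u00e9 \u4e16\u754c"} };
    // dump(indent, indent_char, ensure_ascii): true escapes non-ASCII
    printf("%s\n", j.dump(-1, ' ', true).c_str());
    // prints: {"content":"caf\u00e9 \u4e16\u754c"}
}
```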
Concedo
7e7da2583e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
2025-10-12 16:42:51 +08:00
Georgi Gerganov
4b2dae383d
common : update presets (#16504)
* presets : add --embd-gemma-default and remove old embedding presets

* presets : add gpt-oss presets

* presets : add vision presets

* cont : remove reasoning overrides [no ci]

* cont : fix batch size for embedding gemma [no ci]
2025-10-12 09:29:13 +03:00
Concedo
6d8f8cd65b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/CMakeLists.txt
2025-10-11 10:01:43 +08:00
Georgi Gerganov
d00cbea63c
server : host-memory prompt caching (#16391)
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-09 18:54:51 +03:00
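The idea behind host-memory prompt caching, as the bullets above describe it, is to keep recently used prompts (tokens plus saved state) in host RAM under a size budget (the `-cram` argument) and, for a new request, reuse the entry sharing the longest common token prefix. A heavily hedged sketch of that matching logic; all names are illustrative, not the server's actual structures:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct prompt_cache_entry {
    std::vector<int32_t> tokens;  // prompt tokens
    std::vector<uint8_t> state;   // serialized KV-cache data
};

struct prompt_cache {
    std::deque<prompt_cache_entry> entries; // front = oldest
    size_t limit_bytes = 0;                 // eviction budget (enforcement omitted)

    static size_t common_prefix(const std::vector<int32_t> & a, const std::vector<int32_t> & b) {
        size_t n = 0;
        while (n < a.size() && n < b.size() && a[n] == b[n]) n++;
        return n;
    }

    // pick the cached prompt sharing the longest token prefix with the request
    const prompt_cache_entry * best_match(const std::vector<int32_t> & tokens) const {
        const prompt_cache_entry * best = nullptr;
        size_t best_len = 0;
        for (const auto & e : entries) {
            const size_t len = common_prefix(e.tokens, tokens);
            if (len > best_len) { best_len = len; best = &e; }
        }
        return best;
    }
};
```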
Concedo
5b6ba02167 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/requirements.txt
#	examples/model-conversion/scripts/embedding/convert-model.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.h
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/softmax.hpp
#	requirements/requirements-all.txt
#	tests/test-chat-parser.cpp
#	tools/server/README.md
2025-10-09 23:46:56 +08:00
Pascal
12bbc3fa50
refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2025-10-08 23:18:41 +03:00
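The streaming-aware split this PR describes boils down to routing text between `<think>` and `</think>` into reasoning_content and everything else into content, with the in/out state carried across chunks. A minimal sketch of that core loop (partial-tag handling at chunk boundaries is omitted for brevity; names are illustrative, not the actual chat-parser API):

```cpp
#include <string>

struct reasoning_splitter {
    bool in_think = false;
    std::string reasoning, content;

    // feed one streamed chunk; may flip in/out of the think block several times
    void feed(std::string text) {
        for (;;) {
            const std::string tag = in_think ? "</think>" : "<think>";
            const size_t pos = text.find(tag);
            if (pos == std::string::npos) {
                (in_think ? reasoning : content) += text;
                return;
            }
            (in_think ? reasoning : content) += text.substr(0, pos);
            text = text.substr(pos + tag.size());
            in_think = !in_think;
        }
    }
};
```

Note how processing continues after `</think>` closes instead of stopping, which is exactly the fix described above for post-thinking content ending up in reasoning_content.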
Concedo
b6f6338bba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	CODEOWNERS
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-webgpu/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/rpc/README.md
#	tools/server/README.md
2025-10-09 01:33:27 +08:00
Georgi Gerganov
ef4c5b87ea
presets : fix pooling param for embedding models (#16455) 2025-10-07 10:32:32 +03:00
Gadflyii
3df2244df4
llama : add --no-host to disable host buffers (#16310)
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-10-06 19:55:53 +02:00
Concedo
c83dde8a34 not working commit, need to fix vulkan shaders gen 2025-10-05 11:32:50 +08:00
Concedo
1d728bbc89 Merge commit '128d522c04' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-alloc.cpp
#	tests/test-chat.cpp
2025-10-04 23:51:22 +08:00
Radoslav Gerganov
898acba681
rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-04 12:49:16 +03:00
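The protocol change this PR describes means commands that were implicitly about "the device" now carry an explicit device identifier, so one rpc-server endpoint can expose several devices. A hedged sketch of what that looks like at the message level (struct layout is illustrative, not the actual wire format):

```cpp
#include <cstdint>

// request to allocate a backend buffer on a specific device of the endpoint
struct rpc_msg_alloc_buffer_req {
    uint32_t device; // index of the device on this rpc-server endpoint
    uint64_t size;   // requested buffer size in bytes
};
```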