Commit graph

145 commits

Author SHA1 Message Date
Concedo
4f2fcaa2ef Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/run.sh
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/repack.cpp
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/concat.cpp
#	ggml/src/ggml-sycl/conv.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/gla.cpp
#	ggml/src/ggml-sycl/im2col.cpp
#	ggml/src/ggml-sycl/mmq.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	tests/test-backend-ops.cpp
2025-06-21 00:32:22 +08:00
Concedo
c16d672ce4 Merge commit '9230dbe2c7' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	src/llama-graph.cpp
#	tools/server/README.md
2025-06-21 00:01:29 +08:00
Sigbjørn Skjæret
88fc854b4b
llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
Georgi Gerganov
4c9fdfbe15
ubatch : new splitting logic (#14217)
ggml-ci
2025-06-20 10:14:14 +03:00
Concedo
b925bbfc6d add simple api example 2025-06-19 23:05:28 +08:00
aa956
d67341dc18
server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
2025-06-19 16:01:03 +03:00
bashayer hijji
fffcce535e
llama-bench : add --no-warmup flag (#14224) (#14270)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

- Add no_warmup boolean field to cmd_params struct

- Add --no-warmup command-line argument parsing

- Add help text documentation for the new flag

- Wrap existing warmup logic in conditional check

- Maintain full backward compatibility (warmup enabled by default)

Addresses #14224
2025-06-19 12:24:12 +02:00
Concedo
5f0a7a84ae Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	scripts/sync-ggml.last
2025-06-18 21:22:51 +08:00
Xuan-Son Nguyen
413977de32
mtmd : refactor llava-uhd preprocessing logic (#14247)
* mtmd : refactor llava-uhd preprocessing logic

* fix editorconfig
2025-06-18 10:43:57 +02:00
Concedo
4356a00f4a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	docs/function-calling.md
#	examples/gritlm/gritlm.cpp
#	ggml/CMakeLists.txt
#	ggml/cmake/common.cmake
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	requirements/requirements-compare-llama-bench.txt
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Georgi Gerganov
89fea80d29
server : fix incorrect usage of llama_get_embeddings() (#14225)
* server : fix incorrect usage of llama_get_embeddings()

ggml-ci

* cont : fix the fix

ggml-ci
2025-06-16 22:33:27 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
Eric Curtin
cd355eda7d
server : When listening on a unix domain socket don't print http:// and port (#14180)
Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-06-15 23:36:22 +02:00
Concedo
5f9e96e82d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	CMakeLists.txt
#	README.md
#	common/CMakeLists.txt
#	docs/multimodal.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	src/llama-context.cpp
2025-06-14 09:05:45 +08:00
Concedo
69e4a32ca2 Merge commit 'd4e0d95cf5' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	common/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
2025-06-14 01:58:53 +08:00
Concedo
4204f111f7 Merge commit '8f47e25f56' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-linux-cross.yml
#	docs/backend/CANN.md
#	examples/batched.swift/Sources/main.swift
#	examples/embedding/embedding.cpp
#	examples/gritlm/gritlm.cpp
#	examples/llama.android/llama/src/main/cpp/llama-android.cpp
#	examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/retrieval/retrieval.cpp
#	examples/save-load-state/save-load-state.cpp
#	examples/simple-chat/simple-chat.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	tools/batched-bench/batched-bench.cpp
#	tools/cvector-generator/cvector-generator.cpp
#	tools/imatrix/imatrix.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
#	tools/run/run.cpp
2025-06-13 22:05:03 +08:00
Georgi Gerganov
ffad043973
server : fix SWA condition for full context reprocess (#14163)
ggml-ci
2025-06-13 11:18:25 +03:00
Georgi Gerganov
7d516443dd
server : re-enable SWA speculative decoding (#14131)
ggml-ci
2025-06-12 11:51:38 +03:00
Aman
7781e5fe99
webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
* webui: Wrap long numbers instead of infinite horizontal scroll

* Use tailwind class

* update index.html.gz
2025-06-11 16:42:25 +02:00
Taylor
2baf07727f
server : pass default --keep argument (#14120) 2025-06-11 13:43:43 +03:00
Juk Armstrong
3a12db23b6
Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104) 2025-06-10 16:48:07 +01:00
Concedo
8386546e08 Switched VS2019 for revert cu12.1 build, hopefully solves dll issues
try change order (+3 squashed commit)

Squashed commit:

[457f02507] try newer jimver

[64af28862] windows pyinstaller shim. the final loader will be moved into the packed directory later.

[0272ecf2d] try alternative way of getting cuda toolkit 12.4 since jimver wont work, also fix rocm
try again (+3 squashed commit)

Squashed commit:

[133e81633] try without pwsh

[4d99cefba] try without pwsh

[bdfa91e7d] try alternative way of getting cuda toolkit 12.4, also fix rocm
2025-06-10 23:08:02 +08:00
R0CKSTAR
dc0623fddb
webui: fix sidebar being covered by main content (#14082)
* webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* webui: update index.html.gz

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-09 12:01:17 +02:00
Georgi Gerganov
87d34b381d
server : fix LRU check (#14079)
ggml-ci
2025-06-09 12:57:58 +03:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
2025-06-06 14:11:15 +03:00
Concedo
bc89b465a8 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	.github/workflows/server.yml
#	README.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
2025-06-05 11:03:34 +08:00
Georgi Gerganov
3637576288
server : disable speculative decoding for SWA models (#13970)
* server : use swa-full fo draft context

ggml-ci

* server : disable speculative decoding for SWA models
2025-06-02 21:34:40 +03:00
Olivier Chafik
c9bbc77931
server: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Xuan-Son Nguyen
bfd322796c
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-02 16:29:28 +02:00
Concedo
6ce85c54d6 not working correctly 2025-06-02 22:12:10 +08:00
Max Krasnyansky
053b1539c0
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 15:39:19 -07:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
igardev
c7e0a2054b
webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation (#13845)
* llama : auto-batch

ggml-ci

* context : simplify if branching
2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen
51fa76f172
mtmd : drop _shared from libmtmd name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir
2025-05-31 10:14:29 +02:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
Concedo
b08dca65ed Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/CMakeLists.txt
#	common/arg.cpp
#	common/chat.cpp
#	examples/parallel/README.md
#	examples/parallel/parallel.cpp
#	ggml/cmake/common.cmake
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	models/ggml-vocab-bert-bge.gguf.inp
#	models/ggml-vocab-bert-bge.gguf.out
#	models/ggml-vocab-command-r.gguf.inp
#	models/ggml-vocab-command-r.gguf.out
#	models/ggml-vocab-deepseek-coder.gguf.inp
#	models/ggml-vocab-deepseek-coder.gguf.out
#	models/ggml-vocab-deepseek-llm.gguf.inp
#	models/ggml-vocab-deepseek-llm.gguf.out
#	models/ggml-vocab-falcon.gguf.inp
#	models/ggml-vocab-falcon.gguf.out
#	models/ggml-vocab-gpt-2.gguf.inp
#	models/ggml-vocab-gpt-2.gguf.out
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	models/ggml-vocab-llama-spm.gguf.inp
#	models/ggml-vocab-llama-spm.gguf.out
#	models/ggml-vocab-mpt.gguf.inp
#	models/ggml-vocab-mpt.gguf.out
#	models/ggml-vocab-phi-3.gguf.inp
#	models/ggml-vocab-phi-3.gguf.out
#	models/ggml-vocab-qwen2.gguf.inp
#	models/ggml-vocab-qwen2.gguf.out
#	models/ggml-vocab-refact.gguf.inp
#	models/ggml-vocab-refact.gguf.out
#	models/ggml-vocab-starcoder.gguf.inp
#	models/ggml-vocab-starcoder.gguf.out
#	requirements/requirements-gguf_editor_gui.txt
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/run/run.cpp
#	tools/server/CMakeLists.txt
2025-05-31 13:04:21 +08:00
Concedo
c987abf9f5 Merge commit '763d06edb7' into concedo_experimental
# Conflicts:
#	.github/workflows/build-linux-cross.yml
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/clip.cpp
#	tools/mtmd/mtmd.cpp
#	tools/server/CMakeLists.txt
2025-05-31 12:44:18 +08:00
Concedo
0c108f6054 Merge commit '34b7c0439e' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tools/mtmd/clip.cpp
2025-05-31 12:27:45 +08:00
Georgi Gerganov
53f925074d
sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen
10961339b2
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
Đinh Trọng Huy
e0e3aa231d
llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 19:01:58 +02:00
Sky
c962ae3382
server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode (#13853)
[fix]: remove 'image_url'/'input_audio' effectlly for 'llama_params' in multimodal-model-mode
2025-05-28 16:33:54 +02:00
Concedo
8c701d7ded Merge commit '72b090da2c' into concedo_experimental
# Conflicts:
#	docs/backend/CANN.md
#	docs/function-calling.md
#	examples/embedding/embedding.cpp
#	examples/retrieval/retrieval.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/Doxyfile
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/concat.cpp
#	ggml/src/ggml-sycl/conv.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/gla.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/outprod.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	scripts/compare-commits.sh
#	tests/test-chat.cpp
#	tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7 Merge commit 'e121edc432' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	common/CMakeLists.txt
#	docs/function-calling.md
#	ggml/src/ggml-sycl/binbcast.cpp
#	models/templates/README.md
#	scripts/tool_bench.py
#	src/llama-kv-cache.cpp
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/mtmd/clip.h
#	tools/rpc/rpc-server.cpp
#	tools/server/README.md
2025-05-28 00:20:45 +08:00
Xuan-Son Nguyen
bc583e3c63
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* more strict validate of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
2025-05-27 14:06:10 +02:00
Olivier Chafik
03f582ae8f
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
Olivier Chafik
d74e94c1b3
server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik
f13847cfb5
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
2025-05-26 14:03:54 +03:00