Concedo
0d72c794fa
Merge commit 'c8ade30036' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/im2col_f16.cl
# ggml/src/ggml-opencl/kernels/im2col_f32.cl
# ggml/src/ggml-sycl/im2col.cpp
# tools/mtmd/clip.cpp
2025-07-25 19:42:45 +08:00
Molly Sophia
adef81781a
server : allow setting --reverse-prompt arg ( #14799 )
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 09:24:22 +08:00
Concedo
4abea4b5c9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kernels.cpp
# ggml/src/ggml-cpu/kleidiai/kernels.h
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# tests/test-backend-ops.cpp
# tools/server/README.md
2025-07-21 23:37:42 +08:00
IsaacDynamo
b4efd77f8a
server : add parse_special option to /tokenize endpoint ( #14783 )
2025-07-21 10:24:51 +03:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
6ffd4e9c44
server : pre-calculate EOG logit biases ( #14721 )
...
ggml-ci
2025-07-16 14:04:12 +03:00
Georgi Gerganov
538cc77f7f
server : fix handling of the ignore_eos flag ( #14710 )
...
ggml-ci
2025-07-16 12:13:57 +03:00
Johannes Gäßler
5cae766541
scripts: synthetic prompt mode for server-bench.py ( #14695 )
2025-07-16 09:33:28 +02:00
Concedo
ce7aa0d5c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-all.txt
2025-07-15 23:59:53 +08:00
Johannes Gäßler
494c5899cb
scripts: benchmark for HTTP server throughput ( #14668 )
...
* scripts: benchmark for HTTP server throughput
* fix server connection reset
2025-07-14 13:14:30 +02:00
Concedo
8cebec5128
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakePresets.json
# README.md
# common/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/run/CMakeLists.txt
2025-07-13 23:39:41 +08:00
Douglas Hanley
0c1df14b5f
server : fix pooled embedding output ( #14645 )
2025-07-12 13:21:02 +03:00
Concedo
b8c1fc7c9e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/development/HOWTO-add-model.md
# ggml/src/ggml-sycl/rope.cpp
# tests/test-backend-ops.cpp
2025-07-09 19:25:28 +08:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix ( #14544 )
...
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
2025-07-08 11:47:33 +03:00
Concedo
a17c79b1a9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/eval-callback/eval-callback.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/gelu.cl
# tests/test-backend-ops.cpp
2025-07-07 17:46:58 +08:00
Sigbjørn Skjæret
ddef99522d
server : fix assistant prefilling when content is an array ( #14360 )
2025-07-05 09:17:14 +02:00
Concedo
cdda9d16e0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# build-xcframework.sh
# ci/run.sh
# examples/Miku.sh
# examples/chat-13B.sh
# examples/chat-persistent.sh
# examples/chat-vicuna.sh
# examples/chat.sh
# examples/jeopardy/jeopardy.sh
# examples/reason-act.sh
# examples/server-llama2-13B.sh
# examples/sycl/build.sh
# examples/sycl/run-llama2.sh
# examples/sycl/run-llama3.sh
# examples/ts-type-to-grammar.sh
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/apple/validate-apps.sh
# scripts/apple/validate-ios.sh
# scripts/apple/validate-macos.sh
# scripts/apple/validate-tvos.sh
# scripts/apple/validate-visionos.sh
# scripts/check-requirements.sh
# scripts/ci-run.sh
# scripts/compare-commits.sh
# scripts/debug-test.sh
# scripts/gen-authors.sh
# scripts/get-hellaswag.sh
# scripts/get-pg.sh
# scripts/get-wikitext-103.sh
# scripts/get-wikitext-2.sh
# scripts/get-winogrande.sh
# scripts/hf.sh
# scripts/qnt-all.sh
# scripts/run-all-perf.sh
# scripts/run-all-ppl.sh
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.sh
# scripts/tool_bench.sh
# tests/test-backend-ops.cpp
# tests/test-lora-conversion-inference.sh
# tests/test-tokenizer-0.sh
# tools/server/README.md
2025-06-30 20:38:44 +08:00
Vedran Miletić
e9b6350e61
scripts : make the shell scripts cross-platform ( #14341 )
2025-06-30 10:17:18 +02:00
matteo
caf5681fcb
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client ( #13196 )
...
* initial commit for handling extra template kwargs
* enable_thinking and assistant prefill cannot be enabled at the same time
* can set chat_template_kwargs in command line
* added doc
* fixed formatting
* add support for extra context in generic template init
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix merge conflict
* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
* normalize environment variable name
* simplify code
* prefill cannot be used with thinking models
* compatibility with the new reasoning-budget parameter
* fix prefill for non thinking models
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
2025-06-29 20:02:53 +02:00
Renat
83790b0e7e
server : fix appearance of the chats list context menu for Safari ( #14322 )
2025-06-29 19:29:57 +02:00
Concedo
ace537d44e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# examples/simple-chat/simple-chat.cpp
# src/llama-quant.cpp
# tools/run/run.cpp
# tools/server/README.md
2025-06-24 23:06:16 +08:00
Nigel Bosch
1b809cee22
server : move no API key doc to /health ( #14352 )
2025-06-24 10:59:11 +02:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Concedo
4f2fcaa2ef
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ci/run.sh
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/im2col.cpp
# ggml/src/ggml-sycl/mmq.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# tests/test-backend-ops.cpp
2025-06-21 00:32:22 +08:00
Concedo
c16d672ce4
Merge commit '9230dbe2c7' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cpu/CMakeLists.txt
# src/llama-graph.cpp
# tools/server/README.md
2025-06-21 00:01:29 +08:00
Sigbjørn Skjæret
88fc854b4b
llama : improve sep token handling ( #14272 )
2025-06-20 14:04:09 +02:00
Georgi Gerganov
4c9fdfbe15
ubatch : new splitting logic ( #14217 )
...
ggml-ci
2025-06-20 10:14:14 +03:00
aa956
d67341dc18
server : add server parameters for draft model cache type ( #13782 )
...
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
2025-06-19 16:01:03 +03:00
Concedo
4356a00f4a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ci/run.sh
# docs/function-calling.md
# examples/gritlm/gritlm.cpp
# ggml/CMakeLists.txt
# ggml/cmake/common.cmake
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# requirements/requirements-compare-llama-bench.txt
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Georgi Gerganov
89fea80d29
server : fix incorrect usage of llama_get_embeddings() ( #14225 )
...
* server : fix incorrect usage of llama_get_embeddings()
ggml-ci
* cont : fix the fix
ggml-ci
2025-06-16 22:33:27 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
Eric Curtin
cd355eda7d
server : When listening on a unix domain socket don't print http:// and port ( #14180 )
...
Instead show something like this:
main: server is listening on file.sock - starting the main loop
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-06-15 23:36:22 +02:00
Concedo
5f9e96e82d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# CMakeLists.txt
# README.md
# common/CMakeLists.txt
# docs/multimodal.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# src/llama-context.cpp
2025-06-14 09:05:45 +08:00
Concedo
69e4a32ca2
Merge commit 'd4e0d95cf5' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# common/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# scripts/sync-ggml.last
# tests/CMakeLists.txt
2025-06-14 01:58:53 +08:00
Concedo
4204f111f7
Merge commit '8f47e25f56' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-linux-cross.yml
# docs/backend/CANN.md
# examples/batched.swift/Sources/main.swift
# examples/embedding/embedding.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/simple-chat/simple-chat.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# tools/batched-bench/batched-bench.cpp
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
# tools/run/run.cpp
2025-06-13 22:05:03 +08:00
Georgi Gerganov
ffad043973
server : fix SWA condition for full context reprocess ( #14163 )
...
ggml-ci
2025-06-13 11:18:25 +03:00
Georgi Gerganov
7d516443dd
server : re-enable SWA speculative decoding ( #14131 )
...
ggml-ci
2025-06-12 11:51:38 +03:00
Aman
7781e5fe99
webui: Wrap long numbers instead of infinite horizontal scroll ( #14062 )
...
* webui: Wrap long numbers instead of infinite horizontal scroll
* Use tailwind class
* update index.html.gz
2025-06-11 16:42:25 +02:00
Taylor
2baf07727f
server : pass default --keep argument ( #14120 )
2025-06-11 13:43:43 +03:00
Juk Armstrong
3a12db23b6
Fixed spec timings to: accepted/tested instead of accepted/drafted ( #14104 )
2025-06-10 16:48:07 +01:00
R0CKSTAR
dc0623fddb
webui: fix sidebar being covered by main content ( #14082 )
...
* webui: fix sidebar being covered by main content
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* webui: update index.html.gz
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-09 12:01:17 +02:00
Georgi Gerganov
87d34b381d
server : fix LRU check ( #14079 )
...
ggml-ci
2025-06-09 12:57:58 +03:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
Concedo
bc89b465a8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# .github/workflows/server.yml
# README.md
# docs/build.md
# docs/install.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-06-05 11:03:34 +08:00
Georgi Gerganov
3637576288
server : disable speculative decoding for SWA models ( #13970 )
...
* server : use swa-full for draft context
ggml-ci
* server : disable speculative decoding for SWA models
2025-06-02 21:34:40 +03:00
Olivier Chafik
c9bbc77931
server: update deepseek reasoning format (pass reasoning_content as diffs) ( #13933 )
...
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Concedo
6ce85c54d6
not working correctly
2025-06-02 22:12:10 +08:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
igardev
c7e0a2054b
webui : Replace alert and confirm with custom modals. ( #13711 )
...
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.
* use Modal Provider to simplify the use of confirm and alert modals.
* Increase the z index of the modal dialogs.
* Update index.html.gz
* also add showPrompt
* rebuild
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation ( #13845 )
...
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00