Concedo
8c701d7ded
Merge commit ' 72b090da2c' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/function-calling.md
# examples/embedding/embedding.cpp
# examples/retrieval/retrieval.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/Doxyfile
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# scripts/compare-commits.sh
# tests/test-chat.cpp
# tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7
Merge commit ' e121edc432' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# common/CMakeLists.txt
# docs/function-calling.md
# ggml/src/ggml-sycl/binbcast.cpp
# models/templates/README.md
# scripts/tool_bench.py
# src/llama-kv-cache.cpp
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/mtmd/clip.h
# tools/rpc/rpc-server.cpp
# tools/server/README.md
2025-05-28 00:20:45 +08:00
Olivier Chafik
03f582ae8f
server: fix streaming crashes ( #13786 )
...
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
Olivier Chafik
d74e94c1b3
server: fix format of streamed tool call deltas (diff name, fix id location) (#13800 )
...
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik
f13847cfb5
server: fix regression on streamed non-chat completion w/ stops ( #13785 )
...
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
...
ggml-ci
2025-05-26 14:03:54 +03:00
Concedo
89a3742ded
skip unquantizable clip layers
2025-05-26 16:02:49 +08:00
Olivier Chafik
e121edc432
server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771 )
...
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Xuan-Son Nguyen
2f099b510f
webui : bump max upload file size to 500MB ( #13779 )
2025-05-25 18:02:18 +01:00
Percy Piper
c508256db2
rpc : Fix build on OpenBSD ( #13541 )
2025-05-25 15:35:53 +03:00
Xuan-Son Nguyen
40aaa8a403
mtmd : add support for Qwen2-Audio and SeaLLM-Audio ( #13760 )
...
* mtmd : add Qwen2-Audio support
* small clean up
* update discussion link
* clarify mtmd_get_output_embd
* clarification in multimodal.md
* fix ultravox bug
* ggml_cont
2025-05-25 14:06:32 +02:00
Olivier Chafik
d785f9c1fd
server: fix/test add_generation_prompt ( #13770 )
...
Co-authored-by: ochafik <ochafik@google.com>
2025-05-25 10:45:49 +01:00
Olivier Chafik
f5cd27b71d
server: streaming of tool calls and thoughts when --jinja is on (#12379 )
...
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
Concedo
55cc9acec5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# README.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/clip.cpp
# tools/mtmd/clip.h
2025-05-24 12:10:36 +08:00
Xuan-Son Nguyen
9ecf3e66a3
server : support audio input ( #13714 )
...
* server : support audio input
* add audio support on webui
2025-05-23 11:03:47 +02:00
Concedo
22ef97d7d3
Merge commit ' ab86335760' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/retrieval/retrieval.cpp
# examples/simple-chat/simple-chat.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# requirements/requirements-convert_lora_to_gguf.txt
# tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Georgi Gerganov
8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less ( #13713 )
...
ggml-ci
2025-05-22 22:21:07 +03:00
Xuan-Son Nguyen
797990c4bc
mtmd : add ultravox audio input ( #13623 )
...
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
Georgi Gerganov
cc74d5be99
server : pad small embedding batches ( #13692 )
...
ggml-ci
2025-05-22 16:33:39 +03:00
Georgi Gerganov
5fbfe384d4
server : improve error reporting ( #13680 )
2025-05-21 19:46:56 +03:00
Concedo
da7fd4aa57
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# README.md
# ci/README.md
# docs/docker.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65
swa full used unless ctx shift and fast forward disabled
2025-05-21 22:47:45 +08:00
Robin Davidsson
0d5c742161
server : Add the endpoints /api/tags and /api/chat ( #13659 )
...
* Add the endpoints /api/tags and /api/chat
Add the endpoints /api/tags and /api/chat, and improved the model metadata response
* Remove trailing whitespaces
* Removed code that is not needed for copilot to work.
2025-05-21 15:15:27 +02:00
Dorin-Andrei Geman
42158ae2e8
server : fix first message identification ( #13634 )
...
* server : fix first message identification
When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626 ) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
* server : Fix checks for first role message for stream=True
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
---------
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-21 15:07:57 +02:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Concedo
c0edde61c5
hey what do you know, it worked
2025-05-21 18:50:05 +08:00
Concedo
d04b4eeb04
merge not working
2025-05-21 18:06:41 +08:00
l3utterfly
b7a17463ec
mtmd-helper : bug fix to token batching in mtmd ( #13650 )
...
* Update mtmd-helper.cpp
* Update tools/mtmd/mtmd-helper.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-20 18:55:30 +02:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Nicolò Scipione
f7c9429c85
sycl : Overcoming workaround for mmap() allocation on Windows ( #13482 )
...
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
2025-05-20 08:54:43 +08:00
Xuan-Son Nguyen
92ecdcc06a
mtmd : add vision support for llama 4 ( #13282 )
...
* wip llama 4 conversion
* rm redundant __init__
* fix conversion
* fix conversion
* test impl
* try this
* reshape patch_embeddings_0
* fix view
* rm ffn_post_norm
* cgraph ok
* f32 for pos embd
* add image marker tokens
* Llama4UnfoldConvolution
* correct pixel shuffle
* fix merge conflicts
* correct
* add debug_graph
* logits matched, but it still preceives the image incorrectly
* fix style
* add image_grid_pinpoints
* handle llama 4 preprocessing
* rm load_image_size
* rm unused line
* fix
* small fix 2
* add test & docs
* fix llava-1.6 test
* test: add notion of huge models
* add comment
* add warn about degraded quality
2025-05-19 13:04:14 +02:00
Concedo
59300dbdf5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/actions/windows-setup-curl/action.yml
# .github/workflows/build-linux-cross.yml
# README.md
# common/CMakeLists.txt
# examples/parallel/README.md
# examples/parallel/parallel.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-vulkan/CMakeLists.txt
# tools/server/README.md
2025-05-18 23:27:53 +08:00
Isaac McFadyen
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
2025-05-17 23:59:48 +02:00
Xuan-Son Nguyen
6aa892ec2a
server : do not return error out of context (with ctx shift disabled) ( #13577 )
2025-05-16 21:50:00 +02:00
Xuan-Son Nguyen
aea9f8b4e7
webui : improve accessibility for visually impaired people ( #13551 )
...
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
2025-05-16 21:49:01 +02:00
Concedo
e5d26a2356
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/CMakeLists.txt
# docs/backend/SYCL.md
# ggml/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# ggml/src/gguf.cpp
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-05-16 15:30:31 +08:00
Concedo
6cafc0e73e
Merge commit ' 71bdbdb587' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cpu/CMakeLists.txt
# tools/batched-bench/batched-bench.cpp
# tools/mtmd/clip.h
2025-05-16 15:25:15 +08:00
Concedo
12e6928ec2
i'm gonna regret this, aren't i?
2025-05-15 23:59:55 +08:00
Concedo
7a76e237b8
fixed clip quantize again
2025-05-15 23:22:12 +08:00
Diego Devesa
6c8b91500e
llama-bench : fix -ot with dl backends ( #13563 )
2025-05-15 15:46:55 +02:00
Xuan-Son Nguyen
3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )
...
* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages
2025-05-15 14:24:50 +02:00
Piotr Wilkin (ilintar)
c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )
2025-05-15 08:40:58 +02:00
Georgi Gerganov
b2838049cc
bench : handle decode errors ( #13548 )
...
ggml-ci
2025-05-15 05:57:02 +03:00
Olivier Chafik
aa48e373f2
server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Olivier Chafik
3198405e98
common: add partial regex support (#12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
Georgi Gerganov
053174436f
server : passthrough the /models endpoint during loading ( #13535 )
...
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
Xuan-Son Nguyen
360a9c98e1
server : fix cache_tokens bug with no cache_prompt ( #13533 )
2025-05-14 13:35:07 +02:00
Xuan-Son Nguyen
bb1681fbd5
webui : use fflate for more deterministic gzip compress ( #13525 )
...
* webui : use pako for more deterministic gzip compress
* simpler code
* use fflate instead of pako
2025-05-14 10:26:12 +02:00
Luca Stefani
d486dd3e8e
webui: Allow pasting file from clipboard ( #13526 )
...
* server: Allow pasting file from clipboard
* server: Prevent default action on file paste
* update build
* format then build combined
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00
Ed Addario
e5c834f718
quantize : improve tensor-type pattern matching ( #13033 )
2025-05-13 19:12:31 +02:00