Commit graph

9934 commits

Author SHA1 Message Date
Concedo
4eaf05dfeb handle oai without v1 prefix 2025-10-16 02:16:49 +08:00
Concedo
dfeccea3a1 added shitty fractional scaling support for GNOME. but really just use KDE 2025-10-15 22:28:04 +08:00
Concedo
5207b8d4be more sd path fallbacks 2025-10-15 15:22:06 +08:00
Concedo
610ba18971 sdcpp precision fix 2025-10-15 11:08:35 +08:00
Concedo
a8c023d906 quick test rocm 2025-10-14 17:19:16 +08:00
Concedo
833a778b18 try fix cu11 fa again 2025-10-13 16:36:59 +08:00
Concedo
3a42c6b523 apply fix from https://github.com/ggml-org/llama.cpp/pull/16558 2025-10-13 15:31:26 +08:00
Concedo
c6884a1462 Revert "revert https://github.com/ggml-org/llama.cpp/pull/15953 for now as it breaks kokoro"
This reverts commit 20678ddca1.
2025-10-13 15:25:53 +08:00
Concedo
ca8f36195f try fix cu11 fa 2025-10-13 14:50:47 +08:00
Concedo
8b787866c6 fixed a typo 2025-10-13 11:14:38 +08:00
Concedo
59aa1529dc add embeddings vulkan to makefile 2025-10-13 11:05:45 +08:00
Concedo
20678ddca1 revert https://github.com/ggml-org/llama.cpp/pull/15953 for now as it breaks kokoro 2025-10-13 10:36:51 +08:00
Concedo
121e2fefc8 updated lite 2025-10-12 20:52:16 +08:00
Concedo
54db35cd7a fix t5 scale as well 2025-10-12 20:35:46 +08:00
Concedo
e0ba01c65e fix cuda builds 2025-10-12 20:09:16 +08:00
Concedo
1a360b8458 sdcpp: optimize the handling of the FeedForward precision fix (+1 squashed commits)
Squashed commits:

[621ff6392] sdcpp: optimize the handling of the FeedForward precision fix (+1 squashed commits)

Squashed commits:

[05b16906c] sdcpp: optimize the handling of the FeedForward precision fix
2025-10-12 17:49:38 +08:00
Concedo
9503547ca1 Merge remote-tracking branch 'lcpp/gg/cacheless-embd' into concedo_experimental 2025-10-12 16:47:48 +08:00
Concedo
7e7da2583e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
2025-10-12 16:42:51 +08:00
Concedo
76d5fcbe49 fix the issue that occurs when using CUDA with k-quants weights 2025-10-12 16:18:03 +08:00
Georgi Gerganov
d4d465bce4
graph : support cacheless embeddings with FA and iSWA 2025-10-12 10:35:38 +03:00
Georgi Gerganov
4b2dae383d
common : update presets (#16504)
* presets : add --embd-gemma-default and remove old embedding presets

* presets : add gpt-oss presets

* presets : add vision presets

* cont : remove reasoning overrides [no ci]

* cont : fix batch size for embedding gemma [no ci]
2025-10-12 09:29:13 +03:00
sirus20x6
41aac5c69b
ggml : Fix FP16 ELU positive branch (#16519)
Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
2025-10-12 08:25:37 +03:00
Daniel Bevenius
a2fba89a42
hparams : add check for layer index in is_recurrent (#16511)
* hparams : add check for layer index in is_recurrent

This commit adds a check in the is_recurrent method to ensure that the
provided layer index is within the valid range.

The motivation for this change is to prevent potential out-of-bounds
accesses, and to be consistent with other methods in the class that
perform similar checks, like is_swa.
2025-10-12 07:19:06 +02:00
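The check described in the commit above can be sketched in a few lines; the struct and member names here are illustrative stand-ins, not the actual llama.cpp hparams fields:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the hparams class; field names are assumptions.
struct hparams_sketch {
    uint32_t n_layer = 4;
    std::vector<bool> recurrent_layer = {false, true, false, true};

    bool is_recurrent(uint32_t il) const {
        // the added check: reject out-of-range layer indices up front
        assert(il < n_layer && "layer index out of range");
        return recurrent_layer[il];
    }
};
```

Without the assertion, a bad layer index would silently read past the end of the per-layer array.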
sirus20x6
20cc625edc
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (#16518)
The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion.

Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
2025-10-12 08:15:00 +03:00
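The class of bug fixed above (an FMA reading from the wrong unrolled row) is easiest to see in scalar form; this is a plain-C++ illustration of the failure mode, not the SVE intrinsics code itself:

```cpp
#include <cstddef>

// Scalar illustration of an unrolled dot product: two rows of x against a
// shared vector y. Each row must accumulate into its own sum; the bug was
// the moral equivalent of writing x0[i] in the second line below.
void vec_dot_f32_unroll2(size_t n, float * s,
                         const float * x0, const float * x1, const float * y) {
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum0 += x0[i] * y[i];
        sum1 += x1[i] * y[i];
    }
    s[0] = sum0;
    s[1] = sum1;
}
```

Distinct variable names per unrolled row, as the commit's rename does, make this kind of copy-paste slip much harder to miss.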
Concedo
a0ed446e61 handle numbers outside int32 range with wrapping 2025-10-12 12:46:45 +08:00
Wagner Bruna
9f9494cf3f
sd: add 'default' to the list of supported samplers (#1788) 2025-10-12 12:35:56 +08:00
Concedo
65c2129f65 https://github.com/leejet/stable-diffusion.cpp/pull/877/commits/47c0f8e4bd6916442d04b0a4412554cf3a043e8d 2025-10-12 10:01:29 +08:00
Johannes Gäßler
11f0af5504
CUDA: faster tile FA, add oob checks, more HSs (#16492) 2025-10-11 20:54:32 +02:00
Concedo
720fc30832 Merge branch 'upstream' into concedo_experimental 2025-10-11 23:19:38 +08:00
Concedo
e92f9fd422 cursed hack for RNN models 2025-10-11 23:14:55 +08:00
Georgi Gerganov
a3cb04744f
metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)
2025-10-11 16:54:10 +03:00
Pascal
4a8fbe0a5e
feat: render user content as markdown option (#16358)
* feat: render user content as markdown option
- Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options
- Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle
- Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise
- chore: update webui build output

* chore: update webui build output
2025-10-11 15:50:49 +02:00
Yann Follet
31d0ff1869
server / ranking : add sorting and management of top_n (#16403)
* server / ranking : add sorting and management of top_n

* Make it backward compatible: if no top_n is given, return
all results

here is a script to run some tests

```script

URL=${1:-http://127.0.0.1:8181}

curl "$URL/v1/rerank" -H "Content-Type: application/json" \
 -d '{ "model": "M", "query": "What is the recipe to make bread ?",
 "return_text" : true,
 "texts" : true,
 "top_n": 6,
 "documents": [
 "voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel",
 "it is a bear",
 "bread recipe : floor, water, yest, salt",
 "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
 "here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
 "recipe to make cookies : floor, eggs, water, chocolat",
 "here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
 "il fait tres beau aujourd hui",
 "je n ai pas faim, je ne veux pas manger",
 "je suis a paris"
 ] }' | jq
```

* use resize() instead for(...)

* simplify top_n init since no need to return error

result to test :

./tests.sh unit/test_rerank.py -v -x
==================================================== test session starts =====================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 8 items

unit/test_rerank.py::test_rerank PASSED                                                                                [ 12%]
unit/test_rerank.py::test_rerank_tei_format PASSED                                                                     [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED                                                        [ 37%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED                                                              [ 50%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED                                                               [ 62%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED                                                        [ 75%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED                            [ 87%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED               [100%]

===================================================== 8 passed in 4.31s ======================================================

* add rerank top_n unit test

here is the result :

./tests.sh unit/test_rerank.py -v -x
=================================================================== test session starts ===================================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 16 items

unit/test_rerank.py::test_rerank PASSED                                                                                                             [  6%]
unit/test_rerank.py::test_rerank_tei_format PASSED                                                                                                  [ 12%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED                                                                                     [ 18%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED                                                                                           [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED                                                                                            [ 31%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED                                                                                     [ 37%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED                                                         [ 43%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED                                            [ 50%]
unit/test_rerank.py::test_rerank_top_n[None-4] PASSED                                                                                               [ 56%]
unit/test_rerank.py::test_rerank_top_n[2-2] PASSED                                                                                                  [ 62%]
unit/test_rerank.py::test_rerank_top_n[4-4] PASSED                                                                                                  [ 68%]
unit/test_rerank.py::test_rerank_top_n[99-4] PASSED                                                                                                 [ 75%]
unit/test_rerank.py::test_rerank_tei_top_n[None-4] PASSED                                                                                           [ 81%]
unit/test_rerank.py::test_rerank_tei_top_n[2-2] PASSED                                                                                              [ 87%]
unit/test_rerank.py::test_rerank_tei_top_n[4-4] PASSED                                                                                              [ 93%]
unit/test_rerank.py::test_rerank_tei_top_n[99-4] PASSED                                                                                             [100%]

=================================================================== 16 passed in 8.84s ===================================================================

* editor config check fix
2025-10-11 16:39:04 +03:00
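The top_n handling described above (sort by score, then truncate with resize() rather than a manual loop) can be sketched as follows; the types and names are assumptions for illustration, not the server's actual ones:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: rank (score, doc_index) pairs and keep at most top_n of them.
std::vector<std::pair<float, int>> rerank_top_n(
        std::vector<std::pair<float, int>> results, size_t top_n) {
    std::sort(results.begin(), results.end(),
              [](const auto & a, const auto & b) { return a.first > b.first; });
    if (top_n < results.size()) {
        results.resize(top_n); // resize() instead of a copy loop
    }
    return results;
}
```

An over-large top_n (like the 99 in the unit tests) simply falls through the size check and returns everything, which matches the test_rerank_top_n[99-4] cases above.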
Diego Devesa
97870e6497
cuda : avoid initializing unused devices (#16510) 2025-10-11 13:02:26 +02:00
amirai21
477a66b035
convert : correctly handle LLaMA tokenizer for Jamba (#16470)
* fix: convert_hf_to_gguf - change Jamba non-sentencepiece mode (tokenizer.json) vocab construction

* fix: convert_hf_to_gguf - jamba non-sentencepiece tokenizer to use _set_vocab_llama_hf func

* fix: convert_hf_to_gguf - removed get_vocab_base_pre from jamba
2025-10-11 10:33:41 +02:00
Concedo
0cc0ea4cf9 reset prompt template idx 2025-10-11 12:30:07 +08:00
Concedo
5cea2fe944 don't enforce dims 2025-10-11 11:34:47 +08:00
Concedo
80f88eb703 wip qwen image edit. not working yet 2025-10-11 11:24:17 +08:00
Concedo
6d8f8cd65b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/CMakeLists.txt
2025-10-11 10:01:43 +08:00
Georgi Gerganov
e60f01d941
server : fix division by zero when reporting stats (#16501)
2025-10-10 22:15:05 +03:00
Georgi Gerganov
81086cd6a3
vocab : mark EOT token for Granite models (#16499)
* vocab : mark EOT token for Granite models

* sampling : fallback to EOS when EOT is not found
2025-10-10 17:17:31 +03:00
Radoslav Gerganov
68ee98ae18
server : return HTTP 400 if prompt exceeds context length (#16486)
In streaming mode, when the prompt exceeds the context length, the server
returns an HTTP 200 status code with a JSON error in the body. This is very
confusing and inconsistent with all other inference engines, which return
an HTTP 4xx error in this case.

This patch fixes this problem and makes the server return HTTP 400 in
such cases.
2025-10-10 16:11:07 +02:00
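The decision the patch moves in front of the stream is trivial to state; this sketch is purely illustrative of the status-code choice, not the server's actual code path:

```cpp
// Decide the HTTP status up front: reject over-length prompts with 400
// instead of starting a 200 stream that only carries a JSON error body.
int prompt_status_code(int n_prompt_tokens, int n_ctx) {
    return n_prompt_tokens > n_ctx ? 400 : 200;
}
```

The key point is that the check happens before any response headers are sent, since the status line cannot be changed once streaming has begun.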
Radoslav Gerganov
cdb6da468c
server : log requests to /v1/completions (#16495) 2025-10-10 13:22:27 +03:00
Concedo
bc09f34f66 only accept qwen image pruned models matching 40 or 41 layers 2025-10-10 16:26:32 +08:00
Wagner Bruna
bc762fe9b4
add support for Qwen Image Pruning (#1779)
From leejet/stable-diffusion.cpp#874 .
2025-10-10 16:22:47 +08:00
Prajwal B Mehendarkar
6d69ab3f26
cmake : Dont define XOPENSOURCE on AIX (#16481) 2025-10-10 11:15:46 +03:00
Wagner Bruna
bece22f996
fix encoding VAE tiling for Qwen Image (#1785) 2025-10-10 10:07:50 +08:00
Pascal
1faa13a118
webui: updated the chat service to only include max_tokens in the req… (#16489)
* webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel

* chore: update webui build output
2025-10-09 22:54:57 +02:00
duduta
1deee0f8d4
cpu : optimize the ggml NORM operation (#15953)
* ggml-cpu: optimize norm operation to use intrinsics or Accelerate

rename function

add endif macro comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* implement s390x SIMD suggested by @taronaeo

* add TODO comment

* tidy up spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2025-10-09 21:11:15 +02:00
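For reference, the scalar form of the NORM operation the commit above vectorizes is mean-centering followed by scaling with the inverse standard deviation; this generic sketch is not the ggml-cpu code, just the computation it performs:

```cpp
#include <cmath>
#include <cstddef>

// Scalar normalization: y[i] = (x[i] - mean) / sqrt(var + eps).
void norm_f32(size_t n, float * y, const float * x, float eps) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i) mean += x[i];
    mean /= (float) n;

    float var = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float d = x[i] - mean;
        var += d * d;
    }
    var /= (float) n;

    const float scale = 1.0f / std::sqrt(var + eps);
    for (size_t i = 0; i < n; ++i) y[i] = (x[i] - mean) * scale;
}
```

The three loops over the row are what intrinsics (or Accelerate on macOS) can each speed up independently.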
Georgi Gerganov
d00cbea63c
server : host-memory prompt caching (#16391)
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-09 18:54:51 +03:00
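Host-memory prompt caching, as the commit series above describes it, boils down to keeping past prompts (with their saved state) and reusing the entry that shares the longest token prefix with the new request. A toy sketch, with all names invented here rather than taken from the server code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy cache entry: the tokenized prompt this state was computed for.
// (A real entry would also carry the saved KV-cache / state blob.)
struct prompt_cache_entry {
    std::vector<int32_t> tokens;
};

// Length of the common prefix between two token sequences.
static size_t common_prefix(const std::vector<int32_t> & a,
                            const std::vector<int32_t> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return i;
}

// Pick the cached entry whose prompt shares the longest prefix with the
// new one; returns -1 when nothing matches at all.
int best_cache_entry(const std::vector<prompt_cache_entry> & cache,
                     const std::vector<int32_t> & prompt) {
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < cache.size(); ++i) {
        const size_t len = common_prefix(cache[i].tokens, prompt);
        if (len > best_len) {
            best_len = len;
            best = (int) i;
        }
    }
    return best;
}
```

Only the tokens after the shared prefix then need to be re-evaluated, which is where the memory-for-compute trade (bounded by the -cram limit mentioned above) pays off.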