Concedo
b08dca65ed
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/CMakeLists.txt
# common/arg.cpp
# common/chat.cpp
# examples/parallel/README.md
# examples/parallel/parallel.cpp
# ggml/cmake/common.cmake
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/rope.cpp
# models/ggml-vocab-bert-bge.gguf.inp
# models/ggml-vocab-bert-bge.gguf.out
# models/ggml-vocab-command-r.gguf.inp
# models/ggml-vocab-command-r.gguf.out
# models/ggml-vocab-deepseek-coder.gguf.inp
# models/ggml-vocab-deepseek-coder.gguf.out
# models/ggml-vocab-deepseek-llm.gguf.inp
# models/ggml-vocab-deepseek-llm.gguf.out
# models/ggml-vocab-falcon.gguf.inp
# models/ggml-vocab-falcon.gguf.out
# models/ggml-vocab-gpt-2.gguf.inp
# models/ggml-vocab-gpt-2.gguf.out
# models/ggml-vocab-llama-bpe.gguf.inp
# models/ggml-vocab-llama-bpe.gguf.out
# models/ggml-vocab-llama-spm.gguf.inp
# models/ggml-vocab-llama-spm.gguf.out
# models/ggml-vocab-mpt.gguf.inp
# models/ggml-vocab-mpt.gguf.out
# models/ggml-vocab-phi-3.gguf.inp
# models/ggml-vocab-phi-3.gguf.out
# models/ggml-vocab-qwen2.gguf.inp
# models/ggml-vocab-qwen2.gguf.out
# models/ggml-vocab-refact.gguf.inp
# models/ggml-vocab-refact.gguf.out
# models/ggml-vocab-starcoder.gguf.inp
# models/ggml-vocab-starcoder.gguf.out
# requirements/requirements-gguf_editor_gui.txt
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
# tools/mtmd/CMakeLists.txt
# tools/run/run.cpp
# tools/server/CMakeLists.txt
2025-05-31 13:04:21 +08:00
Concedo
c987abf9f5
Merge commit '763d06edb7' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-linux-cross.yml
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-vulkan/CMakeLists.txt
# tools/mtmd/CMakeLists.txt
# tools/mtmd/clip.cpp
# tools/mtmd/mtmd.cpp
# tools/server/CMakeLists.txt
2025-05-31 12:44:18 +08:00
Concedo
0c108f6054
Merge commit '34b7c0439e' into concedo_experimental
...
# Conflicts:
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml.last
# src/CMakeLists.txt
# tools/mtmd/clip.cpp
2025-05-31 12:27:45 +08:00
Concedo
c923e9fe46
added option to unload model from admin control
2025-05-31 11:51:09 +08:00
Concedo
08e0745e7e
added singleinstance flag and local shutdown api
2025-05-31 11:37:32 +08:00
Wagner Bruna
12f99ba907
fix: workaround for default clip_skip issues ( #1572 )
...
Sets the clip_skip value explicitly to 1 or 2 for all generation
requests, aligning with the tests in the Conditioner objects in
conditioner.hpp.
This should fix #1546 regardless of future changes to the default
behavior of sdcpp. This workaround can be removed once a proper
fix is implemented in sdcpp.
2025-05-31 10:36:30 +08:00
Johannes Gäßler
e562eece7c
CUDA: fix typo in FlashAttention code ( #13926 )
2025-05-30 21:22:03 +02:00
Concedo
3829ae728e
attempt to debug
2025-05-31 01:02:30 +08:00
Diego Devesa
b47ab7b8e9
sched : avoid changing cur_copy when a graph is already allocated ( #13922 )
2025-05-30 18:56:19 +02:00
Georgi Gerganov
dd665cc9d4
parallel : increase the variability of the prompt lengths ( #13927 )
...
ggml-ci
2025-05-30 19:38:07 +03:00
Diego Devesa
df0c0c7d02
cuda : prevent using split buffers with 3d/4d matrices ( #13919 )
2025-05-30 16:37:18 +02:00
Akarshan Biswas
b49a8ff96b
SYCL: Add mrope kernel ( #13755 )
...
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
2025-05-30 19:40:57 +05:30
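As a loose illustration of the pairwise rotation described in the commit above (a hedged sketch, not the actual SYCL kernel): the optimization loads and stores two adjacent values per step, the way `sycl::vec<float, 2>` does in `rope_norm`/`rope_neox`/`rope_multi`, halving the number of memory accesses. The function name `rope_pair` is hypothetical.

```python
import math

def rope_pair(src, theta):
    """Rotate adjacent (x0, x1) pairs by theta, two elements per step,
    mirroring the two-at-a-time load/store the commit describes."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(src)
    for i in range(0, len(src) - 1, 2):
        x0, x1 = src[i], src[i + 1]    # one paired load
        out[i]     = x0 * c - x1 * s   # one paired store
        out[i + 1] = x0 * s + x1 * c
    return out

# theta = 0 leaves each pair unchanged
print(rope_pair([1.0, 0.0, 0.5, 0.5], 0.0))
```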
Georgi Gerganov
53f925074d
sync : vendor ( #13901 )
...
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
2025-05-30 16:25:45 +03:00
Sigbjørn Skjæret
db38704f01
convert : fix rwkv bos/eos token ( #13844 )
2025-05-30 14:50:43 +02:00
Xuan-Son Nguyen
07e4351ce6
convert : allow partial update to the chkhsh pre-tokenizer list ( #13847 )
...
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
2025-05-30 12:24:37 +02:00
Đinh Trọng Huy
291f2b6913
llama : add support for DistilBert ( #13907 )
...
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-30 11:56:02 +02:00
zhangkaihuo
2c90da4c7e
llama : use llm_build_granite for minicpm ( #13911 )
2025-05-30 10:31:48 +02:00
Concedo
6529326c59
allow temperatures up to 1.0 when function calling
2025-05-30 15:59:18 +08:00
Concedo
d99f362513
added 2 more readme images
2025-05-30 14:34:20 +08:00
Concedo
fe401ca4c2
fixed a typo
2025-05-30 13:35:42 +08:00
Concedo
a11ab0b08e
reverse clip skip fix as it might be breaking some sdxl models
2025-05-30 10:40:03 +08:00
Christian Kastner
ec9e0301fe
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture ( #13890 )
2025-05-30 01:28:54 +02:00
Sigbjørn Skjæret
e83ba3e460
llama : add support for jina-reranker-v2 ( #13900 )
2025-05-29 21:42:31 +02:00
Concedo
2a309c144d
updated lite
2025-05-30 00:29:46 +08:00
Concedo
c881bb7348
match a few common oai voices
2025-05-29 23:29:17 +08:00
Sigbjørn Skjæret
2b131621e6
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method ( #13561 )
2025-05-29 15:36:05 +02:00
Yibo Cai
54a2c7a8cd
arm64: optimize q4_k_q8_k kernel with i8mm ( #13886 )
...
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 110.12 | 147.83 | 24.36 | 24.28 |
| 128 | 128 | 2 | 121.16 | 172.42 | 46.36 | 47.93 |
| 128 | 128 | 4 | 120.15 | 169.75 | 74.68 | 84.00 |
| 128 | 128 | 8 | 130.97 | 196.81 | 91.04 | 114.74 |
| 128 | 128 | 16 | 131.01 | 196.88 | 101.43 | 135.79 |
| 128 | 128 | 32 | 130.85 | 196.51 | 106.97 | 147.29 |
---------------------------------------------------------------------
```
2025-05-29 14:39:20 +03:00
Christian Kastner
21fcc21ad5
cmake: Factor out CPU architecture detection ( #13883 )
...
* cmake: Define function for querying architecture
The tests and results match exactly those of ggml/src/CMakeLists.txt
* Switch arch detection over to new function
2025-05-29 12:50:25 +02:00
Vineel Abhinav
dd8ba93416
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm ( #13882 )
...
* F32-Mamba-Seq_Scan-SVE
* Fix formatting
* ggml : missing space
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-29 12:18:43 +03:00
Georgi Gerganov
66c92061f5
tests : remove json.hpp from a test ( #13880 )
...
ggml-ci
2025-05-29 12:17:16 +03:00
Sigbjørn Skjæret
5ca82fc1d7
convert : workaround for AutoConfig dummy labels ( #13881 )
2025-05-29 10:00:57 +02:00
Sigbjørn Skjæret
6385b843a8
llama : add RobertaForSequenceClassification reranker support ( #13875 )
2025-05-29 08:15:01 +02:00
Vineel Abhinav
1b8fb8152d
ggml: aarch64: Implement SVE F32 kernels for vector functions ( #13843 )
...
* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
2025-05-29 09:01:33 +03:00
Beinsezii
53ae30640e
gguf-py : fix SafetensorRemote return on undefined size (< 0) ( #13841 )
2025-05-28 23:50:20 +02:00
Xuan-Son Nguyen
763d06edb7
llama : fix KV shift for qwen2vl ( #13870 )
...
* llama : fix KV shift for qwen2vl
* add ref to the PR
2025-05-28 22:35:31 +02:00
Xuan-Son Nguyen
10961339b2
mtmd : move helpers to dedicated library ( ⚠️ breaking change) ( #13866 )
...
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
bandoti
d98f2a35fc
ci: disable LLAMA_CURL for Linux cross-builds ( #13871 )
2025-05-28 15:46:47 -03:00
Đinh Trọng Huy
e0e3aa231d
llama : add support for BertForSequenceClassification reranker ( #13858 )
...
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 19:01:58 +02:00
Concedo
e14aec58bc
embeds no offload qkv
2025-05-29 00:28:02 +08:00
Concedo
fcc1b43c06
embeddings change to encode
2025-05-28 23:24:33 +08:00
Đinh Trọng Huy
aa6dff05be
convert: small addition to support LlamaModel ( #13838 )
...
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
Sky
c962ae3382
server: fix removal of 'image_url'/'input_audio' json-object effectively for 'llama_params' in multimodal-model-mode ( #13853 )
...
[fix]: remove 'image_url'/'input_audio' effectively for 'llama_params' in multimodal-model-mode
2025-05-28 16:33:54 +02:00
Xuan-Son Nguyen
a3938fb53d
convert : fix qwen omni conversion ( #13859 )
...
* convert : fix qwen omni conversion
* fix typo
2025-05-28 16:12:35 +02:00
Alex Fanthome
f7873fc698
tests : change umlaut test ( #11600 )
2025-05-28 15:49:28 +02:00
Johannes Gäßler
a68247439b
CUDA: fix FA tg at long context for CC >= 8.9 ( #13852 )
2025-05-28 13:33:37 +02:00
Xuan-Son Nguyen
26b79b6cb3
convert : fix tensor naming conflict for llama 4 vision ( #13836 )
...
* convert : fix tensor naming conflict for llama 4 vision
* add comment
2025-05-28 10:05:54 +02:00
leo-pony
1e8659e65a
CANN: Add SOC TYPE printing in cmake configuration ( #13837 )
2025-05-28 11:54:20 +08:00
lhez
a3c30846e4
opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm ( #13787 )
...
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
2025-05-27 12:56:08 -07:00
lhez
1701d4c54f
opencl: mark mul_mat f32f32 as supporting non-contiguous tensors ( #13790 )
2025-05-27 12:53:14 -07:00
Jeff Bolz
bef8176387
vulkan: use timestamp queries for GGML_VULKAN_PERF ( #13817 )
...
Also change it to be controlled by an env var rather than cmake flag
2025-05-27 18:39:07 +02:00