Commit graph

10692 commits

Author SHA1 Message Date
Georgi Gerganov
64978340b0
ggml : add asserts (#14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-16 14:43:32 +03:00
Georgi Gerganov
6ffd4e9c44
server : pre-calculate EOG logit biases (#14721)
ggml-ci
2025-07-16 14:04:12 +03:00
Shunta Saito
e4841d24d3
llama : fix parallel processing for plamo2 (#14716) 2025-07-16 12:12:22 +02:00
Georgi Gerganov
538cc77f7f
server : fix handling of the ignore_eos flag (#14710)
ggml-ci
2025-07-16 12:13:57 +03:00
Concedo
2a59adce0f stay on macos 14 2025-07-16 15:47:33 +08:00
Johannes Gäßler
5cae766541
scripts: synthetic prompt mode for server-bench.py (#14695) 2025-07-16 09:33:28 +02:00
Sigbjørn Skjæret
4b91d6f71f
convert : only check for tokenizer folder if we need it (#14704) 2025-07-16 08:52:04 +02:00
Sigbjørn Skjæret
cf91f217f1
convert : add pre-computed hashes first to prevent order mishaps (#14701) 2025-07-16 08:51:12 +02:00
Concedo
cbe9fc87c5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	src/llama-vocab.cpp
2025-07-16 12:03:54 +08:00
Min-Hua
79e0b68c17
llama: add LLAMA_API to deprecated llama_kv_self_seq_div (#14708)
Add LLAMA_API to fix the run-time error with llama-cpp-python in Windows env:
attributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?

Although llama_kv_self_seq_div() has been marked deprecated but
it is necessary to export it to make llama-cpp-python happy.

Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
2025-07-16 07:00:42 +03:00
Ed Addario
c81f4192f9
gguf-py : dump bpw per layer and model in markdown mode (#14703) 2025-07-16 00:04:42 +02:00
Gabriel Larson
4a4f426944
model : add Kimi-K2 support (#14654)
* Kimi-K2 conversion

* add Kimi_K2  pre type

* Kimi-K2

* Kimi-K2 unicode

* Kimi-K2

* LLAMA_MAX_EXPERTS 384

* fix vocab iteration

* regex space fix

* add kimi-k2 to pre_computed_hashes

* Updated with kimi-k2 get_vocab_base_pre hash

* fix whitespaces

* fix flake errors

* remove more unicode.cpp whitespaces

* change set_vocab() flow

* add moonshotai-Kimi-K2.jinja to /models/templates/

* update moonshotai-Kimi-K2.jinja

* add kimi-k2 chat template

* add kimi-k2

* update NotImplementedError

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* except Exception

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-15 21:54:22 +02:00
Jeff Bolz
ba1ceb3456
vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-15 21:51:09 +02:00
Jeff Bolz
10a0351a97
vulkan: add RTE variants for glu/add/sub/mul/div (#14653) 2025-07-15 21:32:11 +02:00
Shunta Saito
68e37a61a7
model : add PLaMo-2 support (#14560)
* Add PLaMo-2 model using hybrid memory module

* Fix z shape

* Add cmath to include from llama-vocab.h

* Explicitly dequantize normalization weights before RoPE apply

* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing

* Use ATTN_K/Q_NORM for k,q weights to prevent quantization

* Remove SSM_BCDT that is not used from anywhere

* Do not duplicate embedding weights for output.weight

* Fix tokenizer encoding problem for multibyte strings

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up

* Remove unnecessary part for Grouped Query Attention

* Fix how to load special token id to gguf

* Remove unused tensor mapping

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix plamo2 tokenizer session to prevent multiple calls of build()

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-15 18:11:42 +02:00
Concedo
ce7aa0d5c0 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-all.txt
2025-07-15 23:59:53 +08:00
Concedo
d3d5e36af6 backwards compat for older flags in config load 2025-07-15 22:05:57 +08:00
Concedo
51cac6f30c add a title to load config 2025-07-15 18:22:44 +08:00
Concedo
8396add5be removed hunyuan autoguess template, fixed multi file loading up to 999 parts 2025-07-15 17:49:49 +08:00
R0CKSTAR
cbc68be51d
cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-15 15:28:53 +08:00
Concedo
f3f6168f85 added new autoguess templates 2025-07-15 14:51:20 +08:00
Anton Mitkov
bdca38376f
sycl: Hotfix for non dnnl codepath (#14677) 2025-07-14 18:12:42 +01:00
Concedo
4db8ba6228 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
2025-07-14 23:16:44 +08:00
Concedo
b7f8d0fe2b handle inconsistent final message content being sent with finish_reason 2025-07-14 22:17:18 +08:00
shalinib-ibm
55c509daf5
ggml : refactor llamafile_sgemm PPC code (#14673)
Remove un-necessary templates from class definition and packing functions
Reduce deeply nested conditionals, if-else switching in mnapck function
Replace repetitive code with inline functions in Packing functions

2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-14 16:16:42 +03:00
Aman Gupta
9c9e4fc635
llama-context: add ability to get logits (#14672) 2025-07-14 21:01:41 +08:00
Johannes Gäßler
494c5899cb
scripts: benchmark for HTTP server throughput (#14668)
* scripts: benchmark for HTTP server throughput

* fix server connection reset
2025-07-14 13:14:30 +02:00
Concedo
7e5582afdb updated lite 2025-07-14 17:41:36 +08:00
Concedo
0e8f96414a add in backwards compatibility for older clients with incorrect json_schema passing 2025-07-14 17:41:08 +08:00
Akarshan Biswas
0f4c6ec0f1
SYCL: use 1D kernel for set_rows (#14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-14 10:37:55 +01:00
Anton Mitkov
65a3ebb0aa
sycl: Batched mulmat rework for oneDNN dispatch (#14617) 2025-07-14 10:37:35 +01:00
Molly Sophia
0d9226763c
llama : add jinja template for rwkv-world (#14665)
* llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-14 07:43:43 +08:00
Ed Addario
982e347255
quantize : fix minor logic flaw in --tensor-type (#14572) 2025-07-13 18:02:17 +02:00
Concedo
aa3623dcce remove unwanted workflow 2025-07-13 23:43:56 +08:00
Concedo
bc2877d2fe test without g3n fix 2025-07-13 23:42:59 +08:00
Concedo
8cebec5128 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakePresets.json
#	README.md
#	common/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tools/run/CMakeLists.txt
2025-07-13 23:39:41 +08:00
Concedo
66755c8fe9 switch to miniaudio, support mp3 for whisper 2025-07-13 23:24:07 +08:00
Sigbjørn Skjæret
923e3ea2e3
cuda : add set rows for bf16 (#14664) 2025-07-13 15:01:24 +02:00
Concedo
e7eb6d3200 increase default ctx size to 8k, rename usecublas to usecuda 2025-07-13 18:27:42 +08:00
Concedo
811463a704 split audio and vision detection separately 2025-07-13 17:47:15 +08:00
Yavor Ivanov
e743cddb60
cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
Georgi Gerganov
05fec5bd29
ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
2025-07-13 10:36:33 +03:00
Yavor Ivanov
dcf7f2ea3c
metal : Add missing unary ops Metal support (#14660) 2025-07-13 08:38:13 +03:00
Yavor Ivanov
84b396e051
cmake : Add CMake presets for Linux and GCC (#14656) 2025-07-13 08:12:36 +03:00
Concedo
0938af7c83 fixed noscript image gen 2025-07-13 11:37:52 +08:00
Concedo
6f4f1b7389 allowing resuming incomplete aria2 downloads 2025-07-13 11:14:39 +08:00
Tarek Dakhran
c31e60647d
tests : cover lfm2 cases in test_ssm_conv (#14651) 2025-07-12 19:10:14 +02:00
Tarek Dakhran
67eade1bf9
docs : add LFM2 to models section (#14650)
* readme : add LFM2 to models section

* fix copy paste...
2025-07-12 19:07:08 +02:00
Concedo
8faf01016d updated lite 2025-07-13 00:39:52 +08:00
Aman Gupta
7de5c7cab6
CUDA: add set rows for f32 and f16 (#14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-12 16:31:38 +03:00