Commit graph

55 commits

Author SHA1 Message Date
Concedo
9f976e9c65 full SWA cache used unless context shift and fast-forward are disabled 2025-05-21 22:47:45 +08:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
2025-05-20 08:05:46 +03:00
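
The SWA work above introduces two user-visible pieces: a context parameter that controls whether a full-size SWA cache is kept, and a query for the oldest position still present in a sequence. A minimal sketch, using the `swa_full` field and `llama_kv_self_seq_pos_min()` named in the commit messages:

    // sketch based on #13194; names taken from the commit messages above
    #include "llama.h"

    llama_context * make_ctx(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx    = 8192;
        cparams.swa_full = false; // use the smaller sliding-window cache, not a full-size one
        return llama_init_from_model(model, cparams);
    }

    void check_swa(llama_context * ctx) {
        // oldest position still cached for sequence 0; a value > 0 means part of
        // the SWA context was discarded (the "partial SWA" case the server now disallows)
        const llama_pos pos_min = llama_kv_self_seq_pos_min(ctx, 0);
        (void) pos_min;
    }
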
Concedo
e5d26a2356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	ggml/src/gguf.cpp
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-05-16 15:30:31 +08:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 (#13546) 2025-05-14 21:22:49 +03:00
Gilad S.
017f10b5fa
fix: crash when calling llama_state_get_size on a context without a KV cache (#13542) 2025-05-14 19:18:18 +03:00
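
The crash fixed here was triggered by querying the serialized-state size of a context that owns no KV cache. A minimal save sketch around the two state calls from llama.h:

    // llama_state_get_size()/llama_state_get_data(); with #13542 this also
    // works for contexts created without a KV cache
    #include <cstdint>
    #include <vector>
    #include "llama.h"

    std::vector<uint8_t> save_state(llama_context * ctx) {
        std::vector<uint8_t> buf(llama_state_get_size(ctx));
        llama_state_get_data(ctx, buf.data(), buf.size());
        return buf;
    }
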
Concedo
21e31e255b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	README.md
#	build-xcframework.sh
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-metal/ggml-metal.m
#	ggml/src/ggml-metal/ggml-metal.metal
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	scripts/compare-llama-bench.py
#	src/CMakeLists.txt
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
#	tools/mtmd/clip.cpp
#	tools/rpc/rpc-server.cpp
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
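
The training PR adds an optimization interface on top of ggml_opt. A rough sketch of the flow using the names from the commit bullets; treat the exact signatures, and especially the save-call name, as assumptions rather than the final merged API:

    // assumption: llama_opt_params carries a parameter filter deciding which
    // tensors receive gradients; the save call is named as in the commit message
    llama_opt_params oparams = {};
    oparams.param_filter = llama_opt_param_filter_all; // train every tensor
    llama_opt_init(ctx, model, oparams);
    // ... run optimization epochs over a dataset here ...
    llama_save_model_to_file(model, "finetuned.gguf");
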
Georgi Gerganov
064cc596ac
context : fix state io for memory-less contexts (#13470)
ggml-ci
2025-05-12 15:12:27 +03:00
David Huang
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
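
`--no-op-offload` keeps host-resident tensor operations on the host instead of copying them to a device, which helps prompt-processing speed when `-ot` pins large expert weights to CPU memory. A hedged sketch of the assumed programmatic counterpart:

    // assumption: the flag maps to an op_offload field in llama_context_params
    llama_context_params cparams = llama_context_default_params();
    cparams.op_offload = false; // run ops where their weights live instead of offloading
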
Concedo
2439014a03 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	examples/embedding/embedding.cpp
#	tools/imatrix/imatrix.cpp
#	tools/perplexity/perplexity.cpp
2025-05-08 23:41:02 +08:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
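
With a cache-less context, embedding models skip KV-cache allocation entirely and go through `llama_encode()`. A minimal sketch for pooled sequence embeddings (the mean-pooling choice is illustrative):

    // embedding extraction without a KV cache, per #13108
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;
    llama_context * ctx = llama_init_from_model(model, cparams);

    llama_encode(ctx, batch);                             // not llama_decode()
    const float * emb = llama_get_embeddings_seq(ctx, 0); // pooled embedding, sequence 0
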
Georgi Gerganov
51fb96b1ff
context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
2025-05-08 14:26:50 +03:00
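
With `logits_all` removed, logits are requested per token through the batch flags instead. A sketch that asks for logits only at the final position, assuming `tokens`, `n_tokens` and `ctx` are in scope:

    llama_batch batch = llama_batch_init(n_tokens, /*embd*/ 0, /*n_seq_max*/ 1);
    for (int32_t i = 0; i < n_tokens; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == n_tokens - 1); // logits only for the last token
    }
    batch.n_tokens = n_tokens;
    llama_decode(ctx, batch);
    const float * logits = llama_get_logits_ith(ctx, n_tokens - 1);
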
Concedo
0951ad9f58 temp merge, not working 2025-05-03 11:42:01 +08:00
Georgi Gerganov
a75cb30dc9
context : fix reorder logic (#13267)
ggml-ci
2025-05-02 20:54:13 +03:00
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl (#12799)
* kv-cache : separate recurrent vs non-recurrent impl (wip)

ggml-ci

* kv-cache : init -> constructor + add llama_memory_params

ggml-ci

* kv-cache : fix callback reference

ggml-ci

* context : llama_kv_cache -> llama_memory_i

ggml-ci

* context : move memory creation logic to model

ggml-ci

* llama : remove reference of memory during encode

ggml-ci

* kv-cache : hide padding details in the implementation

ggml-ci

* kv-cache : add ubatch_next()

ggml-ci

* context : simplify sbatch logic

ggml-ci

* kv-cache : hide defrag logic in the implementation

ggml-ci

* context : hide kv cache details in implementation

ggml-ci

* build : fix

ggml-ci

* cont : another fix

ggml-ci

* kv-cache : simplify interface (wip)

ggml-ci

* kv-cache : use separate KV cell structs for unified/recurrent

ggml-ci

* kv-cache : clean-up

ggml-ci

* model : better llama_model::create_model() signature

ggml-ci

* kv-cache : fix recurrent seq_rm()

ggml-ci

* kv-cache : replace `struct callbacks` with `llama_model &`

ggml-ci

* kv-cache : replace `struct graph_params` with `llama_context &`

ggml-ci

* kv-cache : fix offload check

ggml-ci

* context : avoid passing unique_ptr

ggml-ci

* kv-cache : avoid using the backends from the llama_context

ref #13113

ggml-ci

* kv-cache : more consistent debug logs [no ci]

* kv-cache : do not pass the full llama_context for kv graphs

ggml-ci

* kv-cache : remove comment

* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext

ggml-ci

* kv-cache : fix recurrent multi-user case

ggml-ci

* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
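
The refactor above replaces the single KV-cache type the context used to hold with an abstract memory interface and two implementations. An illustrative shape of the split, not the actual headers:

    // context code talks to "memory" through an interface; attention models get
    // the unified KV cache, recurrent models (Mamba/RWKV-style) get their own impl
    struct llama_memory_i {
        virtual ~llama_memory_i() = default;
        virtual void clear() = 0;
        virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    };

    class llama_kv_cache_unified   : public llama_memory_i { /* non-recurrent KV cells */ };
    class llama_kv_cache_recurrent : public llama_memory_i { /* recurrent state cells  */ };
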
Concedo
ca53d1bedc Merge commit '13c9a3319b' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2025-05-02 16:42:16 +08:00
ddh0
16a457facd
fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221) 2025-04-30 21:28:43 +01:00
Concedo
b2ecfa0f55 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/llama-bench/README.md
#	examples/llama-bench/llama-bench.cpp
#	examples/llava/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-chat-template.cpp
2025-04-29 21:05:16 +08:00
pockers21
fb0471d175
context : do not clear output buffer on reserve (#13152)
Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-04-28 16:45:40 +03:00
Concedo
3f545eadbe Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-backend-ops.cpp
2025-04-26 09:12:40 +08:00
Diego Devesa
295354ea68
llama : fix K-shift with quantized K and BLAS backend (#13113) 2025-04-25 19:40:11 +02:00
Concedo
bce519cee7 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	tests/test-backend-ops.cpp
2025-04-18 12:44:20 +08:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Concedo
06159939d9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	docs/build.md
#	examples/rpc/rpc-server.cpp
#	examples/sycl/build.sh
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	scripts/sync-ggml.last
2025-04-17 00:52:37 +08:00
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation (#12801)
* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm)

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF

* Removed MQA optimisation from `build_attn_mha()` as no gains now

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
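
For context on the PR above: MLA (multi-head latent attention) caches a single low-rank latent per token instead of full K/V heads. In the notation of the DeepSeek-V2 paper:

    c^{KV}_t = W^{DKV} h_t, \qquad
    k^{C}_t  = W^{UK} c^{KV}_t, \qquad
    v^{C}_t  = W^{UV} c^{KV}_t

Since $q_t^\top k^{C}_s = \big( (W^{UK})^\top q_t \big)^\top c^{KV}_s$, the up-projection can be absorbed into the query side, which is what the decomposed `wk_b`/`wv_b` tensors enable.
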
Concedo
822cf2430e Merge commit 'f1e3eb4249' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	docs/backend/SYCL.md
#	examples/llava/clip.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/cmake/host-toolchain.cmake.in
2025-04-08 20:48:53 +08:00
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models (#12756)
ggml-ci
2025-04-04 21:48:10 +03:00
Concedo
103d60ed2c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/common.cpp
#	examples/batched-bench/batched-bench.cpp
#	examples/batched/batched.cpp
#	examples/export-lora/export-lora.cpp
#	examples/gritlm/gritlm.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
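
The override is exposed on the command line as `--override-tensor` (`-ot`) with `regex=buffer-type` pairs, e.g. keeping MoE expert weights in host memory. A hedged sketch of the C-side equivalent; the struct layout is an assumption based on the PR:

    // assumption: a pattern-terminated array of {regex, buffer type} overrides
    static const llama_model_tensor_buft_override overrides[] = {
        { "ffn_.*_exps", ggml_backend_cpu_buffer_type() }, // keep experts in host memory
        { nullptr,       nullptr                        }, // terminator
    };

    llama_model_params mparams = llama_model_default_params();
    mparams.tensor_buft_overrides = overrides;
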
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
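
One of the bullets above fixes the `llama_decode` return code when no KV slot is found. Per the llama.h contract, 0 is success, 1 means no slot was found for the batch (recoverable), and negative values are errors:

    const int ret = llama_decode(ctx, batch);
    if (ret == 1) {
        // no KV slot for this batch: reduce the batch size or free cache
        // space, then retry
    } else if (ret < 0) {
        // hard error; the context state should not be relied on
    }
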
Concedo
e6337ff957 Merge commit 'e408d4351a' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
2025-03-30 18:26:02 +08:00
Xuan-Son Nguyen
af6ae1efb2
llama : fix non-causal mask for gemma 3 (#12615) 2025-03-30 00:07:37 +01:00
Concedo
396875e1c4 update api docs and lite 2025-03-29 15:39:25 +08:00
Georgi Gerganov
b4ae50810e
metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 20:21:59 +02:00
Concedo
ea358369cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/README.md
#	ci/run.sh
#	docs/backend/CUDA-FEDORA.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Georgi Gerganov
2d77d88e70
context : fix worst-case reserve outputs (#12545)
ggml-ci
2025-03-25 09:19:23 +02:00
Concedo
7030ebf401 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
fairydreaming
568013d0cd
context : clear sets containing encoder output sequence ids before storing new values (#12470)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-19 21:01:57 +01:00
Concedo
0c90d2ebcf Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	docs/backend/SYCL.md
#	examples/main/README.md
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Georgi Gerganov
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention (#12449)
* graph : normalize Q, K, V shapes and add comments

ggml-ci

* context : synchronize before getting cross attention data

* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
Georgi Gerganov
8551c44d84
context : always use non-causal attention for encoder graphs (#12447)
* context : always use non-causal attention for encoder graphs

ggml-ci

* context : move the change to llama_context::encode()

ggml-ci
2025-03-18 13:05:49 +02:00
Georgi Gerganov
dc079cfdff
context : fix init of n_outputs (#12397)
ggml-ci
2025-03-16 19:29:36 +02:00
Concedo
6888f5495d allow quantkv with contextshift 2025-03-16 21:48:42 +08:00
Concedo
67851e5415 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/run/run.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
2025-03-15 19:54:19 +08:00
fairydreaming
8fcb563613
Load all MoE experts during warmup (#11571)
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-14 13:47:05 +01:00
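
The new call switches the context into warmup mode so a single dummy decode touches every expert. A minimal sketch; preparing the dummy batch is the caller's responsibility:

    llama_set_warmup(ctx, true);     // next decode loads all MoE experts
    llama_decode(ctx, warmup_batch);
    llama_set_warmup(ctx, false);
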
Concedo
be3bba67ff Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	src/llama-model.cpp
2025-03-14 18:25:21 +08:00
Georgi Gerganov
081bee8c64
hparams : add SWA rope parameters (#12374)
ggml-ci
2025-03-14 09:03:24 +02:00
Concedo
7dc72db9de Merge branch 'upstream' into concedo_experimental 2025-03-14 11:58:53 +08:00
Concedo
0db4ae6237 traded my ink for a pen 2025-03-14 11:58:15 +08:00
Georgi Gerganov
84d5475541
llama : fix Gemma3 SWA KV cache shift (#12373)
* llama : fix Gemma3 SWA KV cache shift

ggml-ci

* hparams : add comment [no ci]
2025-03-13 19:08:07 +02:00