Commit graph

143 commits

Author SHA1 Message Date
Concedo
b9280718b5 fix bge memory excessive usage? 2025-09-21 08:38:37 +08:00
Concedo
1dbd2fc259 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/build-s390x.md
#	docs/ops.md
#	docs/ops/zDNN.csv
#	ggml/include/ggml-zdnn.h
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/concat.cpp
#	ggml/src/ggml-sycl/conv.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/gla.cpp
#	ggml/src/ggml-sycl/im2col.cpp
#	ggml/src/ggml-sycl/mmq.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn-impl.h
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	tools/llama-bench/llama-bench.cpp
2025-09-13 12:25:30 +08:00
Haiyue Wang
f4e664f838
context : remove redundant explicit casting to the same type (#15948)
The function 'output_reserve' return type is 'uint32_t', so need to add
explicit casting.
2025-09-12 18:16:32 +03:00
Concedo
6463f5c26b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CONTRIBUTING.md
#	docs/backend/CANN.md
#	examples/eval-callback/eval-callback.cpp
#	examples/model-conversion/requirements.txt
#	examples/model-conversion/scripts/causal/run-org-model.py
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	models/templates/README.md
#	requirements/requirements-convert_hf_to_gguf.txt
#	requirements/requirements-convert_legacy_llama.txt
#	requirements/requirements-tool_bench.txt
#	tests/.gitignore
#	tests/test-backend-ops.cpp
#	tests/test-chat-parser.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-tokenizer-random.py
2025-09-11 22:34:45 +08:00
Concedo
5de51b77c1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/close-issue.yml
#	docs/build-s390x.md
#	examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-cuda/fattn-tile-f16.cu
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/tool_bench.py
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/server/README.md
2025-09-11 22:28:19 +08:00
Daniel Bevenius
86587da03b
llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893)
This commit adds check for two function pointers returned from
ggml_backend_reg_get_proc_address.

The motivation for this is that the function pointer could be nullptr if
the get proc address function changes in the future. This is also
consistent with all the other calls to ggml_backend_reg_get_proc_address
in the code base.
2025-09-10 05:33:58 +02:00
Georgi Gerganov
663027fd54
context : fix n_outputs during reserve (#15858)
ggml-ci
2025-09-08 10:26:36 +03:00
Concedo
f0d4128e9f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/CANN.md
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/causal/convert-model.sh
#	examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
#	examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
#	examples/model-conversion/scripts/causal/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/convert-model.sh
#	examples/model-conversion/scripts/embedding/modelcard.template
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/utils/create-collection-add-model.sh
#	examples/model-conversion/scripts/utils/inspect-converted-model.sh
#	examples/model-conversion/scripts/utils/inspect-org-model.py
#	examples/model-conversion/scripts/utils/perplexity-gen.sh
#	examples/model-conversion/scripts/utils/perplexity-run-simple.sh
#	examples/model-conversion/scripts/utils/perplexity-run.sh
#	examples/model-conversion/scripts/utils/quantize.sh
#	examples/model-conversion/scripts/utils/run-embedding-server.sh
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	src/llama-context.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
2025-09-05 13:25:34 +08:00
Daniel Bevenius
d1e2adba65
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791)
* llama : set n_outputs to 1 to avoid 0 outputs mean-pooling

This commit modifies the llama_context constructor to set n_outputs to
1.

The motivation for this is that when using pooling, and specifically
mean pooling, for embeddings having n_outputs set to 0 can lead to the
following error:
```console
$ build/bin/llama-embedding -m models/nomic-embed-text-1.5-Q4_K_M.gguf \
   --pooling mean -p "Hello, how are you?"
...
llama_context:        CPU  output buffer size =     0.12 MiB
/home/danbev/work/ai/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
0x0000743c96d107e3 in __GI___wait4 (pid=292978, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
30	in ../sysdeps/unix/sysv/linux/wait4.c
196	        waitpid(child_pid, NULL, 0);
230	        ggml_print_backtrace();
3023	    GGML_ASSERT(ggml_can_mul_mat(a, b));
1823	                cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
18983	    llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
1399	    auto * gf = model.build_graph(gparams);
292	            auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true);
2329	        auto * ctx = new llama_context(*model, params);
913	    llama_context * lctx = llama_init_from_model(model, cparams);
105	    common_init_result llama_init = common_init_from_params(params);
[Inferior 1 (process 292976) detached]
Aborted (core dumped)
```

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add comment about not reserving graphs with zero outputs

* add assert in graph_reserve to ensure n_outputs >= 1

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-04 15:40:44 +02:00
Concedo
2562129271 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	docs/backend/CANN.md
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/flash_attn_f16.cl
#	ggml/src/ggml-opencl/kernels/flash_attn_f32.cl
#	ggml/src/ggml-opencl/kernels/flash_attn_f32_f16.cl
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/gguf.cpp
#	src/llama-context.cpp
#	tests/test-sampling.cpp
#	tools/server/README.md
2025-09-03 17:16:42 +08:00
Diego Devesa
274966226f
llama : fix fattn reserve call n_seqs parameter (#15699)
ggml-ci
2025-08-31 18:47:05 +03:00
Concedo
7e35954695 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/build.md
#	docs/function-calling.md
#	examples/eval-callback/eval-callback.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.h
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	scripts/compare-llama-bench.py
#	scripts/server-bench.py
#	scripts/tool_bench.py
#	tests/test-chat.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-08-31 23:33:36 +08:00
Diego Devesa
9777032dcc
llama : separate compute buffer reserve from fattn check (#15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
2025-08-31 15:49:03 +02:00
Johannes Gäßler
e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Concedo
3060dfb99f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/convert-model.sh
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	scripts/compare-commits.sh
2025-08-28 23:17:29 +08:00
Georgi Gerganov
8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks (#15505)
ggml-ci
2025-08-28 12:27:02 +03:00
Concedo
575eb40950 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/multimodal/minicpmv4.0.md
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/scripts/causal/modelcard.template
#	examples/model-conversion/scripts/utils/hf-create-model.py
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
2025-08-26 19:09:48 +08:00
Georgi Gerganov
85cc1ae998
context : print graph stats for memory-less contexts (#15586)
ggml-ci
2025-08-26 12:47:00 +03:00
Concedo
8b8396c30c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	docs/build-s390x.md
#	examples/llama.vim
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	scripts/compare-llama-bench.py
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-08-23 11:35:28 +08:00
Georgi Gerganov
9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
ggml-ci
2025-08-22 12:22:13 +03:00
Georgi Gerganov
cd36b5e5c7
llama : remove deprecated llama_kv_self API (#15472)
ggml-ci
2025-08-21 19:13:45 +03:00
Georgi Gerganov
715a6db02c
kv-cache : drop the "unified" prefix (#15467)
* kv-cache : drop the "unified" prefix

ggml-ci

* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Concedo
1c41c38a6a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/cuda.Dockerfile
#	CODEOWNERS
#	README.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/test-chat.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/mtmd/clip.h
2025-08-20 20:34:45 +08:00
Georgi Gerganov
9d262f4bad
server : remove swa_full warning (#15399) 2025-08-19 08:45:26 +03:00
Concedo
7ac0102ed3 hope i didnt break anything 2025-08-14 21:42:24 +08:00
Georgi Gerganov
d32e03f449
server : add SWA checkpoints (#15293)
* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
Jonathan Graehl
5cdb27e091
finetune: SGD optimizer, more CLI args (#13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
Concedo
7590a0ea39 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ggml/CMakeLists.txt
#	ggml/cmake/ggml-config.cmake.in
#	ggml/src/CMakeLists.txt
#	models/templates/README.md
#	tools/imatrix/imatrix.cpp
2025-08-05 19:24:29 +08:00
compilade
ee3a9fcf88
context : fix index overflow on huge outputs (#15080)
* context : fix overflow when re-ordering huge outputs

* context : fix logits size overflow for huge batches
2025-08-05 11:27:45 +02:00
Concedo
8bd0a560f0 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-convert_hf_to_gguf_update.txt
#	scripts/compare-llama-bench.py
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/imatrix/README.md
#	tools/imatrix/imatrix.cpp
#	tools/llama-bench/llama-bench.cpp
2025-08-04 22:42:02 +08:00
Georgi Gerganov
a4569c41fd
llama : enable LLAMA_SET_ROWS=1 by default (#14959)
ggml-ci
2025-08-02 17:14:21 +03:00
Concedo
f430916a71 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/CANN.md
#	docs/multimodal/minicpmo2.6.md
#	docs/multimodal/minicpmv2.5.md
#	docs/multimodal/minicpmv2.6.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/cmake/ggml-config.cmake.in
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/repack.cpp
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/add.cl
#	ggml/src/ggml-opencl/kernels/mul.cl
#	scripts/compare-commits.sh
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml.last
#	tools/server/README.md
2025-08-02 10:25:10 +08:00
Georgi Gerganov
ba42794c9e
graph : fix equal_seq() check (#14986)
ggml-ci
2025-08-01 06:38:12 +03:00
Concedo
b8425f5a9c merge but voxtral not working 2025-07-28 22:08:05 +08:00
Daniel Bevenius
ca0ef2dddb
llama : clarify comment about pp and tg graphs [no ci] (#14895)
* llama : clarify comment about pp and tg graphs [no ci]

This commit clarifies the comment in `llama-context.cpp` regarding the
prefill prompt (pp), and token generation (tg) graphs.

The motivation for this is that I've struggled to remember these and had
to look them up more than once, so I thought it would be helpful to add
a comment that makes it clear what these stand for.

* squash! llama : clarify comment about pp and tg graphs [no ci]

Change "pp" to "prompt processing".
2025-07-27 12:10:51 +02:00
Concedo
21b7d0a899 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/rocm.Dockerfile
#	docs/build-s390x.md
#	docs/development/HOWTO-add-model.md
#	docs/ops.md
#	docs/ops/CPU.csv
#	docs/ops/CUDA.csv
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/rms_norm.cl
#	scripts/create_ops_docs.py
#	tests/test-backend-ops.cpp
#	tools/export-lora/export-lora.cpp
2025-07-27 17:10:53 +08:00
Concedo
0fcfbdb93c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/musa.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/close-issue.yml
#	ci/README.md
#	docs/build.md
#	docs/docker.md
#	ggml/CMakeLists.txt
#	ggml/cmake/ggml-config.cmake.in
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn-wmma-f16.cu
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tools/imatrix/README.md
#	tools/imatrix/imatrix.cpp
2025-07-25 19:53:13 +08:00
Georgi Gerganov
c1dbea752a
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggml-ci
2025-07-25 14:28:06 +03:00
Georgi Gerganov
e4868d16d2
context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder after lazily upon access after sync

ggml-ci

* cont : add TODO
2025-07-24 16:31:48 +03:00
Concedo
b0b7a07b34 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/parallel/parallel.cpp
2025-07-18 23:49:45 +08:00
Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models (#14753)
* graph : avoid huge warm-up graphs for MoE models

ggml-ci

* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
Concedo
b8e3280432 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly (#14629)
ggml-ci
2025-07-18 08:29:28 +03:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
2025-07-17 19:08:33 +03:00
Concedo
bdff33e0de Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	ci/run.sh
#	docs/build.md
#	examples/CMakeLists.txt
#	examples/parallel/parallel.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	scripts/server-bench.py
#	src/llama-kv-cache-unified.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Concedo
ce7aa0d5c0 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-all.txt
2025-07-15 23:59:53 +08:00
Aman Gupta
9c9e4fc635
llama-context: add ability to get logits (#14672) 2025-07-14 21:01:41 +08:00
Concedo
ace537d44e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	examples/simple-chat/simple-chat.cpp
#	src/llama-quant.cpp
#	tools/run/run.cpp
#	tools/server/README.md
2025-06-24 23:06:16 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00