Concedo
b9280718b5
fix excessive bge memory usage?
2025-09-21 08:38:37 +08:00
Concedo
1dbd2fc259
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/build-s390x.md
# docs/ops.md
# docs/ops/zDNN.csv
# ggml/include/ggml-zdnn.h
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/im2col.cpp
# ggml/src/ggml-sycl/mmq.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# ggml/src/ggml-zdnn/ggml-zdnn-impl.h
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# tools/llama-bench/llama-bench.cpp
2025-09-13 12:25:30 +08:00
Haiyue Wang
f4e664f838
context : remove redundant explicit casting to the same type ( #15948 )
...
The function 'output_reserve' return type is 'uint32_t', so there is no
need to add explicit casting.
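An illustrative sketch of this class of change (the stand-in function body and variable name below are assumptions, not taken from the actual diff):
```cpp
#include <cstdint>

// hypothetical stand-in for llama_context::output_reserve(), which returns uint32_t
static uint32_t output_reserve(int32_t n_outputs) { return (uint32_t) n_outputs; }

static void example() {
    // before: the cast was redundant - output_reserve() already returns uint32_t
    // const uint32_t n_outputs_max = (uint32_t) output_reserve(512);

    // after: assign directly, both sides are the same type
    const uint32_t n_outputs_max = output_reserve(512);
    (void) n_outputs_max;
}
```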
2025-09-12 18:16:32 +03:00
Concedo
6463f5c26b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CONTRIBUTING.md
# docs/backend/CANN.md
# examples/eval-callback/eval-callback.cpp
# examples/model-conversion/requirements.txt
# examples/model-conversion/scripts/causal/run-org-model.py
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# models/templates/README.md
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_legacy_llama.txt
# requirements/requirements-tool_bench.txt
# tests/.gitignore
# tests/test-backend-ops.cpp
# tests/test-chat-parser.cpp
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-tokenizer-random.py
2025-09-11 22:34:45 +08:00
Concedo
5de51b77c1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/close-issue.yml
# docs/build-s390x.md
# examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# ggml/src/ggml-cuda/fattn-tile-f16.cu
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/tool_bench.py
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-09-11 22:28:19 +08:00
Daniel Bevenius
86587da03b
llama : check returned fn ptrs from ggml_backend_reg_get_proc_address ( #15893 )
...
This commit adds checks for the two function pointers returned from
ggml_backend_reg_get_proc_address.
The motivation for this is that the returned pointers could be nullptr
if the proc-address lookup changes in the future. This also makes the
code consistent with all the other calls to
ggml_backend_reg_get_proc_address in the code base.
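A minimal sketch of the pattern, using the backend-provided "ggml_backend_set_n_threads" proc as the example (the two pointers checked in the actual commit are not named in this log):
```cpp
#include "ggml-backend.h"
#include <cstdio>

// function-pointer type for the proc we look up; defined locally for the sketch
typedef void (*set_n_threads_t)(ggml_backend_t backend, int n_threads);

static void configure_threads(ggml_backend_reg_t reg, ggml_backend_t backend) {
    auto fn = (set_n_threads_t) ggml_backend_reg_get_proc_address(reg, "ggml_backend_set_n_threads");
    if (fn == nullptr) {
        // the backend does not provide this proc (or the name changed) -
        // degrade gracefully instead of dereferencing a null pointer
        fprintf(stderr, "ggml_backend_set_n_threads not available for this backend\n");
        return;
    }
    fn(backend, 8);
}
```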
2025-09-10 05:33:58 +02:00
Georgi Gerganov
663027fd54
context : fix n_outputs during reserve ( #15858 )
...
ggml-ci
2025-09-08 10:26:36 +03:00
Concedo
f0d4128e9f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
# examples/model-conversion/scripts/causal/convert-model.sh
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
# examples/model-conversion/scripts/causal/run-converted-model.sh
# examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
# examples/model-conversion/scripts/embedding/convert-model.sh
# examples/model-conversion/scripts/embedding/modelcard.template
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# examples/model-conversion/scripts/utils/create-collection-add-model.sh
# examples/model-conversion/scripts/utils/inspect-converted-model.sh
# examples/model-conversion/scripts/utils/inspect-org-model.py
# examples/model-conversion/scripts/utils/perplexity-gen.sh
# examples/model-conversion/scripts/utils/perplexity-run-simple.sh
# examples/model-conversion/scripts/utils/perplexity-run.sh
# examples/model-conversion/scripts/utils/quantize.sh
# examples/model-conversion/scripts/utils/run-embedding-server.sh
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# src/llama-context.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-09-05 13:25:34 +08:00
Daniel Bevenius
d1e2adba65
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling ( #15791 )
...
* llama : set n_outputs to 1 to avoid 0 outputs mean-pooling
This commit modifies the llama_context constructor to set n_outputs to 1.
The motivation for this is that when using pooling, and specifically
mean pooling, for embeddings, having n_outputs set to 0 can lead to the
following error:
```console
$ build/bin/llama-embedding -m models/nomic-embed-text-1.5-Q4_K_M.gguf \
--pooling mean -p "Hello, how are you?"
...
llama_context: CPU output buffer size = 0.12 MiB
/home/danbev/work/ai/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
0x0000743c96d107e3 in __GI___wait4 (pid=292978, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
30 in ../sysdeps/unix/sysv/linux/wait4.c
196 waitpid(child_pid, NULL, 0);
230 ggml_print_backtrace();
3023 GGML_ASSERT(ggml_can_mul_mat(a, b));
1823 cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
18983 llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
1399 auto * gf = model.build_graph(gparams);
292 auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true);
2329 auto * ctx = new llama_context(*model, params);
913 llama_context * lctx = llama_init_from_model(model, cparams);
105 common_init_result llama_init = common_init_from_params(params);
[Inferior 1 (process 292976) detached]
Aborted (core dumped)
```
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add comment about not reserving graphs with zero outputs
* add assert in graph_reserve to ensure n_outputs >= 1
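A minimal sketch of the guard from the last bullet (the free-standing function is illustrative; per the log, the real check lives in llama_context::graph_reserve):
```cpp
#include <cassert>
#include <cstdint>

// reserving a graph with zero outputs would build degenerate pooling ops,
// e.g. the 0-column mul_mat that fails ggml_can_mul_mat() in the trace above
static void graph_reserve_checked(uint32_t n_tokens, uint32_t n_seqs, uint32_t n_outputs) {
    assert(n_outputs >= 1 && "graph_reserve: n_outputs must be >= 1");
    (void) n_tokens; (void) n_seqs;
    // ... build and reserve the worst-case graph ...
}
```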
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-04 15:40:44 +02:00
Concedo
2562129271
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# ci/run.sh
# docs/backend/CANN.md
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/flash_attn_f16.cl
# ggml/src/ggml-opencl/kernels/flash_attn_f32.cl
# ggml/src/ggml-opencl/kernels/flash_attn_f32_f16.cl
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/gguf.cpp
# src/llama-context.cpp
# tests/test-sampling.cpp
# tools/server/README.md
2025-09-03 17:16:42 +08:00
Diego Devesa
274966226f
llama : fix fattn reserve call n_seqs parameter ( #15699 )
...
ggml-ci
2025-08-31 18:47:05 +03:00
Concedo
7e35954695
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/build.md
# docs/function-calling.md
# examples/eval-callback/eval-callback.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kernels.cpp
# ggml/src/ggml-cpu/kleidiai/kernels.h
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# scripts/compare-llama-bench.py
# scripts/server-bench.py
# scripts/tool_bench.py
# tests/test-chat.cpp
# tools/batched-bench/batched-bench.cpp
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-08-31 23:33:36 +08:00
Diego Devesa
9777032dcc
llama : separate compute buffer reserve from fattn check ( #15696 )
...
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers, and uses it to split the graph for the automatic Flash Attention check.
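A minimal sketch of the two-phase flow this enables, assuming the split/alloc entry points behave as their names suggest (the real scheduler may redo the split internally when allocating):
```cpp
#include "ggml-backend.h"

// split first to inspect backend assignment; allocate compute buffers only if needed
static bool probe_then_alloc(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    // phase 1: assign nodes to backends and compute the splits,
    // without allocating any compute buffers
    ggml_backend_sched_split_graph(sched, gf);

    // ... inspect the splits here, e.g. to decide whether the flash-attention
    //     ops landed on a backend that supports them ...

    // phase 2: only when the graph will actually be run
    return ggml_backend_sched_alloc_graph(sched, gf);
}
```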
2025-08-31 15:49:03 +02:00
Johannes Gäßler
e81b8e4b7f
llama: use FA + max. GPU layers by default ( #15434 )
...
* llama: use max. GPU layers by default, auto -fa
* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Concedo
3060dfb99f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# scripts/compare-commits.sh
2025-08-28 23:17:29 +08:00
Georgi Gerganov
8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks ( #15505 )
...
ggml-ci
2025-08-28 12:27:02 +03:00
Concedo
575eb40950
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/multimodal/minicpmv4.0.md
# examples/model-conversion/Makefile
# examples/model-conversion/README.md
# examples/model-conversion/logits.cpp
# examples/model-conversion/scripts/causal/modelcard.template
# examples/model-conversion/scripts/utils/hf-create-model.py
# ggml/src/ggml-opencl/ggml-opencl.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
2025-08-26 19:09:48 +08:00
Georgi Gerganov
85cc1ae998
context : print graph stats for memory-less contexts ( #15586 )
...
ggml-ci
2025-08-26 12:47:00 +03:00
Concedo
8b8396c30c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build-s390x.md
# examples/llama.vim
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-08-23 11:35:28 +08:00
Georgi Gerganov
9ebebef62f
llama : remove KV cache defragmentation logic ( #15473 )
...
ggml-ci
2025-08-22 12:22:13 +03:00
Georgi Gerganov
cd36b5e5c7
llama : remove deprecated llama_kv_self API ( #15472 )
...
ggml-ci
2025-08-21 19:13:45 +03:00
Georgi Gerganov
715a6db02c
kv-cache : drop the "unified" prefix ( #15467 )
...
* kv-cache : drop the "unified" prefix
ggml-ci
* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Concedo
1c41c38a6a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/cuda.Dockerfile
# CODEOWNERS
# README.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-opencl/ggml-opencl.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-chat.cpp
# tools/batched-bench/batched-bench.cpp
# tools/mtmd/clip.h
2025-08-20 20:34:45 +08:00
Georgi Gerganov
9d262f4bad
server : remove swa_full warning ( #15399 )
2025-08-19 08:45:26 +03:00
Concedo
7ac0102ed3
hope i didn't break anything
2025-08-14 21:42:24 +08:00
Georgi Gerganov
d32e03f449
server : add SWA checkpoints ( #15293 )
...
* server : add SWA checkpoints
ggml-ci
* cont : server clean-up
* server : handle state restore fails
* llama : add extended llama_state_seq_ API
* server : do not make checkpoints if --swa-full
ggml-ci
* llama : remove flags value for NONE
* server : configure number of SWA checkpoints with CLI arg
ggml-ci
* args : fix scope of new argument
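A rough sketch of taking a per-sequence checkpoint with the pre-existing llama_state_seq_ calls (the extended flags variant this commit adds is not shown, since its exact signature is not in this log):
```cpp
#include "llama.h"

#include <cstdint>
#include <vector>

// snapshot the state of one sequence so it can be restored later,
// e.g. before tokens fall out of the SWA window
static std::vector<uint8_t> checkpoint_seq(llama_context * ctx, llama_seq_id seq_id) {
    const size_t n = llama_state_seq_get_size(ctx, seq_id);
    std::vector<uint8_t> buf(n);
    llama_state_seq_get_data(ctx, buf.data(), buf.size(), seq_id);
    return buf;
}
```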
2025-08-14 14:59:50 +03:00
Jonathan Graehl
5cdb27e091
finetune: SGD optimizer, more CLI args ( #13873 )
...
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the
m, v moment tensors.
support finetune.cpp arg -opt SGD (or sgd); default is adamw, as before.
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only fit 512 batch/context before
OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD converges more slowly but fits a larger 1728 batch/context before
OOM (note especially the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or with a large enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
the -lr-half (halflife) option is useful for SGD to avoid oscillation or
very slow underdamped learning (it makes setting -lr more forgiving).
the terminal -lr is for now set via -lr-halvings, i.e. if you want at
most 1/8 the initial -lr, set -lr-halvings 3.
note: objective loss is not directly comparable between adamw and sgd -
check perplexity or accuracy, or consider relative improvements, to
judge convergence
new finetune args: -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before); see the update-rule sketch
below
caching (1 - wd*alpha) in the 'adamw' opt struct gave no noticeable
perf benefit, so it is disabled (it is still done for the new SGD
though)
since opt. memory is pre-allocated, ggml_opt_get_optimizer_params
could probably switch between SGD and AdamW with each epoch, but it
would need to use adamw for the first (unconfirmed - no cmdline arg to
set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few tests
disabled for sgd only; these probably just need logging values and
alternate reference values); tolerance on the 'regression' test is
broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
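A minimal sketch of the SGD update with weight decay as discussed above; whether -wd is applied exactly as w*(1 - alpha*wd) in the actual ggml op is an assumption based on the "(1 - wd*alpha)" note:
```cpp
#include <cstddef>

// one SGD step: w <- (1 - alpha*wd) * w - alpha * g
// unlike adamw there are no m/v moment tensors to allocate or update,
// which is where the memory savings reported above come from
static void sgd_step(float * w, const float * g, size_t n, float alpha, float wd) {
    const float keep = 1.0f - alpha * wd;   // cached scale, cf. the note above
    for (size_t i = 0; i < n; ++i) {
        w[i] = keep * w[i] - alpha * g[i];
    }
}
```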
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
Concedo
7590a0ea39
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/CMakeLists.txt
# models/templates/README.md
# tools/imatrix/imatrix.cpp
2025-08-05 19:24:29 +08:00
compilade
ee3a9fcf88
context : fix index overflow on huge outputs ( #15080 )
...
* context : fix overflow when re-ordering huge outputs
* context : fix logits size overflow for huge batches
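An illustrative sketch of the overflow class being fixed (names assumed, not the actual diff):
```cpp
#include <cstdint>

// for huge batches, i * n_vocab can exceed INT32_MAX even though both
// factors fit comfortably in 32 bits - widen before multiplying
static float * logits_for_row(float * logits, int32_t i, int32_t n_vocab) {
    // before (overflow-prone): const int32_t off = i * n_vocab;
    const int64_t off = (int64_t) i * n_vocab;
    return logits + off;
}
```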
2025-08-05 11:27:45 +02:00
Concedo
8bd0a560f0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/compare-llama-bench.py
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-04 22:42:02 +08:00
Georgi Gerganov
a4569c41fd
llama : enable LLAMA_SET_ROWS=1 by default ( #14959 )
...
ggml-ci
2025-08-02 17:14:21 +03:00
Concedo
f430916a71
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/add.cl
# ggml/src/ggml-opencl/kernels/mul.cl
# scripts/compare-commits.sh
# scripts/compare-llama-bench.py
# scripts/sync-ggml.last
# tools/server/README.md
2025-08-02 10:25:10 +08:00
Georgi Gerganov
ba42794c9e
graph : fix equal_seq() check ( #14986 )
...
ggml-ci
2025-08-01 06:38:12 +03:00
Concedo
b8425f5a9c
merge but voxtral not working
2025-07-28 22:08:05 +08:00
Daniel Bevenius
ca0ef2dddb
llama : clarify comment about pp and tg graphs [no ci] ( #14895 )
...
* llama : clarify comment about pp and tg graphs [no ci]
This commit clarifies the comment in `llama-context.cpp` regarding the
prefill prompt (pp) and token generation (tg) graphs.
The motivation for this is that I've struggled to remember these and had
to look them up more than once, so I thought it would be helpful to add
a comment that makes it clear what these stand for.
* squash! llama : clarify comment about pp and tg graphs [no ci]
Change "pp" to "prompt processing".
2025-07-27 12:10:51 +02:00
Concedo
21b7d0a899
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/rocm.Dockerfile
# docs/build-s390x.md
# docs/development/HOWTO-add-model.md
# docs/ops.md
# docs/ops/CPU.csv
# docs/ops/CUDA.csv
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/rms_norm.cl
# scripts/create_ops_docs.py
# tests/test-backend-ops.cpp
# tools/export-lora/export-lora.cpp
2025-07-27 17:10:53 +08:00
Concedo
0fcfbdb93c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# .github/workflows/close-issue.yml
# ci/README.md
# docs/build.md
# docs/docker.md
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/fattn-wmma-f16.cu
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
2025-07-25 19:53:13 +08:00
Georgi Gerganov
c1dbea752a
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 ( #14870 )
...
ggml-ci
2025-07-25 14:28:06 +03:00
Georgi Gerganov
e4868d16d2
context : perform output reorder lazily upon access after sync ( #14853 )
...
* context : perform output reorder lazily upon access after sync
ggml-ci
* cont : add TODO
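A minimal sketch of the deferred-reorder pattern described above (structure and names are assumptions, not the actual implementation):
```cpp
// instead of reordering outputs eagerly after decode, remember that a
// reorder is pending and apply it only when results are actually read
struct output_state {
    bool reorder_pending = false;

    void on_decode_done() { reorder_pending = true; }

    // called from accessors such as a get_logits()-style function, after sync
    void ensure_ordered() {
        if (reorder_pending) {
            // ... apply the saved output permutation here ...
            reorder_pending = false;
        }
    }
};
```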
2025-07-24 16:31:48 +03:00
Concedo
b0b7a07b34
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/parallel/parallel.cpp
2025-07-18 23:49:45 +08:00
Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models ( #14753 )
...
* graph : avoid huge warm-up graphs for MoE models
ggml-ci
* cont : bump max nodes to 8x model tensors
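A rough sketch of the kind of cap the last bullet describes (the floor constant is an assumption):
```cpp
#include <algorithm>
#include <cstdint>

// scale the graph node limit with the number of model tensors,
// per the "8x model tensors" note above
static uint32_t graph_max_nodes(uint32_t n_model_tensors) {
    return std::max<uint32_t>(1024u, 8u * n_model_tensors);
}
```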
2025-07-18 14:31:15 +03:00
Concedo
b8e3280432
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly ( #14629 )
...
ggml-ci
2025-07-18 08:29:28 +03:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs ( #14482 )
...
* llama : reuse compute graphs
ggml-ci
* llama-bench : add graph reuse parameter
ggml-ci
* cont : remove the parameter and the sched resets
ggml-ci
* graph : rename update() to can_reuse()
ggml-ci
* params : remove is_same()
ggml-ci
* graph : set res->params in llm_graph_context constructor
ggml-ci
* graph : avoid set_max_nodes in llm_graph_result
ggml-ci
* kv-cache : reuse llama_context's graph result instance
ggml-ci
* context : reset the previous graph result upon memory updates
ggml-ci
* batch : llama_ubatch now carries its data instead of pointing to balloc
ggml-ci
* merge : fix build
ggml-ci
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
ggml-ci
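A minimal sketch of the reuse flow these bullets describe (stand-in types; the real ones are llm_graph_result/llm_graph_params):
```cpp
#include <memory>

struct graph_params { int n_tokens; int n_outputs; };

struct graph_result {
    graph_params params;
    // the real can_reuse() also compares ubatch layout, memory context, etc.
    bool can_reuse(const graph_params & p) const {
        return p.n_tokens == params.n_tokens && p.n_outputs == params.n_outputs;
    }
};

struct context {
    std::unique_ptr<graph_result> res_prev;

    graph_result * get_graph(const graph_params & p) {
        if (res_prev && res_prev->can_reuse(p)) {
            return res_prev.get();                                   // reuse: skip rebuild
        }
        res_prev = std::make_unique<graph_result>(graph_result{p});  // rebuild
        return res_prev.get();
    }
};
```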
2025-07-17 19:08:33 +03:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
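A minimal sketch of the stream mapping implied by the unified/split naming above (behavior is an assumption drawn from the bullet names):
```cpp
#include <cstdint>

// with a unified KV cache all sequences share one stream; otherwise each
// sequence gets its own KV stream, enabling the high-throughput path
static uint32_t seq_to_stream(bool kv_unified, uint32_t seq_id) {
    return kv_unified ? 0u : seq_id;
}
```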
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Concedo
ce7aa0d5c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-all.txt
2025-07-15 23:59:53 +08:00
Aman Gupta
9c9e4fc635
llama-context: add ability to get logits ( #14672 )
2025-07-14 21:01:41 +08:00
Concedo
ace537d44e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# examples/simple-chat/simple-chat.cpp
# src/llama-quant.cpp
# tools/run/run.cpp
# tools/server/README.md
2025-06-24 23:06:16 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00