Concedo
a6efa9d182
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# tests/test-backend-ops.cpp
2026-01-30 20:37:37 +08:00
Daniel Bevenius
83bcdf7217
memory : remove unused tmp_buf (#19199)
...
This commit removes the unused tmp_buf variable from llama-kv-cache.cpp
and llama-memory-recurrent.cpp.
The tmp_buf variable was declared but never used; however, because it has
a non-trivial constructor/destructor, compilers do not emit an
unused-variable warning for it.
2026-01-30 10:37:06 +01:00
Concedo
f6ece6fd37
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/check-vendor.yml
# .github/workflows/close-issue.yml
# .github/workflows/editorconfig.yml
# .github/workflows/gguf-publish.yml
# .github/workflows/labeler.yml
# .github/workflows/pre-tokenizer-hashes.yml
# .github/workflows/python-check-requirements.yml
# .github/workflows/python-lint.yml
# .github/workflows/python-type-check.yml
# .github/workflows/server.yml
# .github/workflows/update-ops-docs.yml
# README.md
# docs/build.md
# examples/model-conversion/scripts/utils/perplexity-gen.sh
# examples/model-conversion/scripts/utils/perplexity-run-simple.sh
# examples/model-conversion/scripts/utils/perplexity-run.sh
# examples/model-conversion/scripts/utils/quantize.sh
# examples/model-conversion/scripts/utils/run-embedding-server.sh
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-opencl/kernels/mul_mv_q6_k_f32.cl
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/compare-llama-bench.py
# tests/test-backend-ops.cpp
# tests/test-gguf.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-01-27 23:06:13 +08:00
Georgi Gerganov
d9c6ce46f7
kv-cache : support V-less cache (#19067)
...
* kv-cache : support V-less cache
* cuda : better check for V_is_K_view
* cuda : improve V_is_K_view check
* graph : add comments
* hparams : refactor
2026-01-25 15:48:56 +02:00
Concedo
e8e7c357c9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cache.yml
# .github/workflows/build-cmake-pkg.yml
# .github/workflows/build-linux-cross.yml
# .github/workflows/build.yml
# .github/workflows/check-vendor.yml
# .github/workflows/close-issue.yml
# .github/workflows/copilot-setup-steps.yml
# .github/workflows/docker.yml
# .github/workflows/editorconfig.yml
# .github/workflows/gguf-publish.yml
# .github/workflows/labeler.yml
# .github/workflows/pre-tokenizer-hashes.yml
# .github/workflows/python-check-requirements.yml
# .github/workflows/python-lint.yml
# .github/workflows/python-type-check.yml
# .github/workflows/release.yml
# .github/workflows/server-webui.yml
# .github/workflows/server.yml
# .github/workflows/update-ops-docs.yml
# .github/workflows/winget.yml
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# requirements/requirements-tool_bench.txt
# src/CMakeLists.txt
# src/llama-quant.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/server/README.md
2026-01-23 14:27:04 +08:00
Georgi Gerganov
a5eaa1d6a3
mla : make the V tensor a view of K (#18986)
...
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-22 22:09:01 +02:00
Concedo
7f618454ff
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# CODEOWNERS
# docs/backend/OPENCL.md
# docs/ops.md
# docs/ops/CANN.csv
# docs/ops/WebGPU.csv
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-opencl/kernels/mul_mv_q6_k.cl
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/cpy.tmpl.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/set_rows.wgsl
# tests/test-backend-ops.cpp
2026-01-18 23:24:29 +08:00
Georgi Gerganov
2fbde785bc
kv-cache : optimize KQ mask construction (#18842)
...
* kv-cache : optimize KQ mask construction
* cont : add explanation + improve
* cont : fix
2026-01-17 15:42:42 +02:00
Concedo
1daeed5d4d
Merge commit '9963b81f63' into concedo_experimental
...
# Conflicts:
# .github/workflows/server.yml
# SECURITY.md
# docs/backend/SYCL.md
# examples/model-conversion/README.md
# examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
2025-12-17 20:30:34 +08:00
Concedo
c93c4c5505
Merge commit '4a4f7e6550' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# CODEOWNERS
# README.md
# ci/run.sh
# docs/development/HOWTO-add-model.md
# grammars/README.md
# src/llama-context.cpp
# src/llama.cpp
# tools/CMakeLists.txt
# tools/completion/README.md
# tools/llama-bench/README.md
2025-12-17 14:30:39 +08:00
Concedo
050a5b1f52
Merge commit '4aced7a631' into concedo_experimental
...
# Conflicts:
# .devops/cann.Dockerfile
# .devops/cpu.Dockerfile
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/tools.sh
# .devops/vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/release.yml
# .gitignore
# docs/ops.md
# docs/ops/SYCL.csv
# examples/batched/batched.cpp
# examples/eval-callback/eval-callback.cpp
# examples/gen-docs/gen-docs.cpp
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/model-conversion/scripts/causal/compare-logits.py
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/model-conversion/scripts/utils/check-nmse.py
# examples/parallel/parallel.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# examples/training/finetune.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/pad.cpp
# ggml/src/ggml-sycl/ssm_conv.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# pyrightconfig.json
# scripts/sync-ggml.last
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/imatrix.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/clip.cpp
# tools/perplexity/perplexity.cpp
# tools/server/README.md
2025-12-16 23:14:12 +08:00
Concedo
e88bf41fdc
Merge commit '12280ae905' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# common/CMakeLists.txt
# docs/docker.md
# examples/model-conversion/scripts/causal/compare-logits.py
# ggml/src/ggml-hexagon/htp/rope-ops.c
# tests/test-backend-ops.cpp
# tests/test-barrier.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2025-12-16 16:29:01 +08:00
ssweens
4529c660c8
kv-cache: fix state restore with fragmented cache (#17982)
...
* kv-cache : fix state restore with fragmented cache (#17527)
Change find_slot to allow non-contiguous allocation during state restore. Fixes the 'failed to find available cells in kv cache' error when restoring state to a fragmented cache.
* tests : update logic
* cleanup: tightened state_read_meta sig, added is_contiguous case
* fix: state_read_meta arg reorder loose ends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-15 19:28:35 +02:00
Johannes Gäßler
b1f3a6e5db
llama: automatically set parameters not set by the user in a way that maximizes GPU utilization (#16653)
...
* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layer hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
2025-12-15 09:24:59 +01:00
Georgi Gerganov
609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
...
* models : fix YaRN regression + consolidate logic
* cont : fix the fix
* cont : remove header
* cont : add header
2025-12-14 08:34:56 +02:00
Georgi Gerganov
7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
...
* models : fix the attn_factor for mistral3 graphs
* cont : rework attn_factor correction logic
* cont : make deepseek2 consistent
* cont : add TODO
* cont : special-case DSv2
* cont : revert Mistral 3 Large changes
* cont : fix DS2 to use the original attn_factor
* cont : minor comments
2025-12-12 17:12:40 +02:00
Georgi Gerganov
4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
...
* ggml : remove GGML_KQ_MASK_PAD constant
* cont : remove comment
2025-12-10 20:53:16 +02:00
Concedo
2b00e55356
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl
# ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl
# requirements/requirements-convert_legacy_llama.txt
# tests/test-backend-ops.cpp
# tests/test-rope.cpp
# tools/server/README.md
2025-10-31 10:52:57 +08:00
JJJYmmm
d261223d24
model: add support for qwen3vl series (#16780)
...
* support qwen3vl series.
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
* bugfix: fix the arch check for qwen3vl-moe.
* use build_ffn
* optimize deepstack structure
* optimize deepstack feature saving
* Revert "optimize deepstack feature saving" as a temporary fix
This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71.
* code clean
* use fused qkv in clip
* clean up / rm is_deepstack_layers for simplification
* add test model
* move test model to "big" section
* fix imrope check
* remove trailing whitespace
* fix rope fail
* metal : add imrope support
* add imrope support for sycl
* vulkan: add imrope w/o check
* fix vulkan
* webgpu: add imrope w/o check
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix tensor mapping
---------
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Concedo
16cbe9f24e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# docs/ops.md
# docs/ops/SYCL.csv
# examples/embedding/README.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-sycl/backend.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/norm.hpp
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# src/llama-batch.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
# tools/llama-bench/README.md
2025-10-30 13:44:46 +08:00
Xuan-Son Nguyen
e3af5563bd
llama: store mrope data in KV cell (#16825)
...
* llama: store mrope data in KV cell
* correct x,y ordering
* address review comments
* add consistency checks
* Update src/llama-kv-cache.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add TODO
* fix asan error
* kv-cells : improve ext handling
* cont : fix headers
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-29 18:09:18 +01:00
Georgi Gerganov
85a7d8677b
memory : remove KV cache size padding (#16812)
...
* memory : remove KV cache size padding
* cont : restore padding for n_kv tensor shape
* server : use slot context size instead of training context size
* server : simplify context limit logic
2025-10-28 20:19:44 +02:00
Johannes Gäßler
7a0e900e36
llama: consistent ctx <-> buf order for KV cache (#16746)
2025-10-28 11:23:54 +01:00
Concedo
6d8f8cd65b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/CMakeLists.txt
2025-10-11 10:01:43 +08:00
Georgi Gerganov
d00cbea63c
server : host-memory prompt caching (#16391)
...
* minor : code style
* server : fix prompt similarity calculation
* server : initial host-memory prompt caching
* cont
* server : refactor
* cont
* cont : make the server task of the slot const
* cont : minor [no ci]
* server : cache prompts and checkpoints only for completion tasks
* server : improve prompt caching logic
* cont : fix check for number of cached prompts [no ci]
* server : improve caching logic, add -cram CLI arg
* server : print prompt mismatch info
* cont : better naming [no ci]
* server : improve prompt cache loading logic
* server : add option to debug the slot contents (#16482)
* server : add option to debug the slot contents
* Update tools/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : add option to disable prompt cache
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-09 18:54:51 +03:00
Concedo
b120e107f9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .clang-tidy
# .devops/musa.Dockerfile
# .github/workflows/build-linux-cross.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# .gitignore
# CODEOWNERS
# CONTRIBUTING.md
# README.md
# build-xcframework.sh
# ci/README-MUSA.md
# ci/run.sh
# common/CMakeLists.txt
# docs/docker.md
# examples/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/model-conversion/Makefile
# examples/model-conversion/README.md
# examples/model-conversion/logits.cpp
# examples/model-conversion/scripts/causal/compare-logits.py
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# examples/model-conversion/scripts/embedding/run-original-model.py
# examples/model-conversion/scripts/utils/check-nmse.py
# examples/model-conversion/scripts/utils/inspect-org-model.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/CMakeLists.txt
# ggml/include/ggml-zdnn.h
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/set_rows.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-quantize-perf.cpp
# tests/test-tokenizers-repo.sh
# tools/perplexity/perplexity.cpp
# tools/server/tests/README.md
2025-09-27 17:09:14 +08:00
Johannes Gäßler
e789095502
llama: print memory breakdown on exit (#15860)
...
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Concedo
5de51b77c1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/close-issue.yml
# docs/build-s390x.md
# examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# ggml/src/ggml-cuda/fattn-tile-f16.cu
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/tool_bench.py
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-09-11 22:28:19 +08:00
Georgi Gerganov
cf0e3ba150
model : avoid ggml_cont_3d for fused QKV weights (#15662)
...
* model : avoid ggml_cont_3d for fused QKV weights
ggml-ci
* kv-cache : make cpy_k and cpy_v implementation more readable
ggml-ci
* cont : add comments
ggml-ci
* cont : minor fix [no ci]
* cont : one more fix
* cont : clarity
ggml-ci
* kv-cache : require contiguous heads of k_cur and v_cur
ggml-ci
2025-09-08 10:25:33 +03:00
Georgi Gerganov
c610b6c11b
kv-cache : fix SWA checks + disable cacheless iSWA (#15811)
...
ggml-ci
2025-09-05 10:39:22 +03:00
Concedo
f0d4128e9f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
# examples/model-conversion/scripts/causal/convert-model.sh
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
# examples/model-conversion/scripts/causal/run-converted-model.sh
# examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
# examples/model-conversion/scripts/embedding/convert-model.sh
# examples/model-conversion/scripts/embedding/modelcard.template
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# examples/model-conversion/scripts/utils/create-collection-add-model.sh
# examples/model-conversion/scripts/utils/inspect-converted-model.sh
# examples/model-conversion/scripts/utils/inspect-org-model.py
# examples/model-conversion/scripts/utils/perplexity-gen.sh
# examples/model-conversion/scripts/utils/perplexity-run-simple.sh
# examples/model-conversion/scripts/utils/perplexity-run.sh
# examples/model-conversion/scripts/utils/quantize.sh
# examples/model-conversion/scripts/utils/run-embedding-server.sh
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# src/llama-context.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-09-05 13:25:34 +08:00
Daniel Bevenius
fb15d649ed
llama : add support for EmbeddingGemma 300m (#15798)
...
This commit adds support for the EmbeddingGemma 300m. This model supports
sliding window attention (SWA) and a new swq_type is introduced to
support symmetric SWA masking.
This commit also extracts the code from the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.
With this commit the EmbeddingGemma 300m model can be converted to GGUF
and used with llama.cpp.
Once the model has been uploaded to HuggingFace it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
2025-09-04 18:10:29 +02:00
Concedo
3060dfb99f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# scripts/compare-commits.sh
2025-08-28 23:17:29 +08:00
Georgi Gerganov
c8d0d14e77
kv-cache : fix find_slot to not search for continuous slot (#15638)
...
ggml-ci
2025-08-28 17:09:05 +03:00
Georgi Gerganov
8a4280ce43
kv-cache : remove LLAMA_SET_ROWS checks (#15505)
...
ggml-ci
2025-08-28 12:27:02 +03:00
Georgi Gerganov
1bded5a3b3
kv-cache : better estimate of n_kv for multi-sequence batches (#15610)
...
ggml-ci
2025-08-27 13:55:12 +03:00
Concedo
f8ee5d9e25
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# src/llama-kv-cache.cpp
# tests/test-backend-ops.cpp
2025-08-25 01:53:26 +08:00
Georgi Gerganov
b730706a49
kv-cache : support layer reuse (#15504)
...
* kv-cache : support layer reuse
ggml-ci
* cont : update comments [no ci]
2025-08-24 13:07:07 +03:00
Concedo
8b8396c30c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build-s390x.md
# examples/llama.vim
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-08-23 11:35:28 +08:00
Georgi Gerganov
9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
...
ggml-ci
2025-08-22 12:22:13 +03:00
Georgi Gerganov
715a6db02c
kv-cache : drop the "unified" prefix (#15467)
...
* kv-cache : drop the "unified" prefix
ggml-ci
* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Georgi Gerganov
0fc16b42e8
kv-cache : split implementation in separate sources (#13920)
...
ggml-ci
2025-06-01 11:39:27 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation (#13845)
...
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i (#13746)
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Xuan-Son Nguyen
763d06edb7
llama : fix KV shift for qwen2vl (#13870)
...
* llama : fix KV shift for qwen2vl
* add ref to the PR
2025-05-28 22:35:31 +02:00
Georgi Gerganov
81713121ee
kv-cells : track min/max used cells and per-sequence positions (#13808)
...
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
2025-05-27 13:49:41 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell (#13706)
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface (#13660)
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00