Commit graph

22 commits

Author SHA1 Message Date
Concedo
ae292c496e handle SWA conflicting with rewind, increased default SWA padding. 2026-04-16 17:00:26 +08:00
Concedo
0251c6dbde added swa padding controls 2026-04-16 16:21:48 +08:00
Concedo
8fa87621d1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	common/chat.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
2026-04-03 16:36:41 +08:00
Georgi Gerganov
39b27f0da0
(revert) kv-cache : do not quantize SWA KV cache (#21332)
This reverts commit 17193cce34.
2026-04-03 09:07:01 +03:00
Concedo
34ad53e950 merged support for gemma4. the e2b, e4b and 26b work, the 31b does not 2026-04-03 11:07:46 +08:00
Georgi Gerganov
17193cce34
kv-cache : do not quantize SWA KV cache (#21277) 2026-04-02 11:54:05 +03:00
Concedo
a0a78dacc4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	pyproject.toml
#	requirements/requirements-convert_legacy_llama.txt
#	src/CMakeLists.txt
#	src/llama-vocab.cpp
#	tests/test-backend-ops.cpp
2026-02-07 15:54:02 +08:00
forforever73
b83111815e
model : support Step3.5-Flash (#19283)
* Support Step3.5-Flash

* fix: norm.weight + 1 (HF zero_centered=true)

* step35: simplify GGUF conversion + drop redundant rope KVs

* Address review feedback

* rename limits -> clamp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rename swiglu limits -> swiglu clamp in LLM_KV

* avoid CI fail

* Apply suggestions from code review

* Apply suggestions from code review

* disabled KV shifting for LLM_ARCH_STEP35

* Apply suggestions from code review

* mistakenly removed cmath

* add model size && apply missed suggestion

* assert partial_rotary_factors

* fix CI errors:

* load freq_base_swa

---------

Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-06 21:06:14 +01:00
LostRuins Concedo
d6a2ad8455 still not really working right 2025-11-09 01:57:48 +08:00
Georgi Gerganov
16bcc1259d
kv-cache : pad the cache size to 256 for performance (#17046)
* kv-cache : pad the size of the small SWA cache for performance

* context : pad the total context to 256

* cont : future-proof the swa pad

* server : adjust test params to new logic
2025-11-07 20:03:25 +02:00
Concedo
1d728bbc89 Merge commit '128d522c04' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-alloc.cpp
#	tests/test-chat.cpp
2025-10-04 23:51:22 +08:00
ddh0
f6dcda3900
server : context checkpointing for hybrid and recurrent models (#16382)
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-03 21:34:51 +03:00
Concedo
b120e107f9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.clang-tidy
#	.devops/musa.Dockerfile
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.gitignore
#	CODEOWNERS
#	CONTRIBUTING.md
#	README.md
#	build-xcframework.sh
#	ci/README-MUSA.md
#	ci/run.sh
#	common/CMakeLists.txt
#	docs/docker.md
#	examples/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/model-conversion/Makefile
#	examples/model-conversion/README.md
#	examples/model-conversion/logits.cpp
#	examples/model-conversion/scripts/causal/compare-logits.py
#	examples/model-conversion/scripts/causal/run-org-model.py
#	examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/check-nmse.py
#	examples/model-conversion/scripts/utils/inspect-org-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/CMakeLists.txt
#	ggml/include/ggml-zdnn.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/set_rows.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-tokenizers-repo.sh
#	tools/perplexity/perplexity.cpp
#	tools/server/tests/README.md
2025-09-27 17:09:14 +08:00
Johannes Gäßler
e789095502
llama: print memory breakdown on exit (#15860)
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Concedo
5de51b77c1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/close-issue.yml
#	docs/build-s390x.md
#	examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-cuda/fattn-tile-f16.cu
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/tool_bench.py
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/server/README.md
2025-09-11 22:28:19 +08:00
Georgi Gerganov
c610b6c11b
kv-cache : fix SWA checks + disable cacheless iSWA (#15811)
ggml-ci
2025-09-05 10:39:22 +03:00
Concedo
f0d4128e9f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/CANN.md
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/causal/convert-model.sh
#	examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
#	examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
#	examples/model-conversion/scripts/causal/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/convert-model.sh
#	examples/model-conversion/scripts/embedding/modelcard.template
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	examples/model-conversion/scripts/utils/create-collection-add-model.sh
#	examples/model-conversion/scripts/utils/inspect-converted-model.sh
#	examples/model-conversion/scripts/utils/inspect-org-model.py
#	examples/model-conversion/scripts/utils/perplexity-gen.sh
#	examples/model-conversion/scripts/utils/perplexity-run-simple.sh
#	examples/model-conversion/scripts/utils/perplexity-run.sh
#	examples/model-conversion/scripts/utils/quantize.sh
#	examples/model-conversion/scripts/utils/run-embedding-server.sh
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	src/llama-context.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
2025-09-05 13:25:34 +08:00
Daniel Bevenius
fb15d649ed
llama : add support for EmbeddingGemma 300m (#15798)
This commit add support for the EmbeddingGemma 300m. This model supports
sliding window attention (SWA) and a new swq_type is introduced to
support symmetric SWA masking.

This commit also extracts the code from the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.

With this commit the EmbeddingGemma 300m model can be converted to
to GGUF and used with llama.cpp.

Once the model has been uploaded to HuggingFace it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
2025-09-04 18:10:29 +02:00
Concedo
f8ee5d9e25 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	src/llama-kv-cache.cpp
#	tests/test-backend-ops.cpp
2025-08-25 01:53:26 +08:00
Georgi Gerganov
b730706a49
kv-cache : support layer reuse (#15504)
* kv-cache : support layer reuse

ggml-ci

* cont : update comments [no ci]
2025-08-24 13:07:07 +03:00
Concedo
8b8396c30c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	docs/build-s390x.md
#	examples/llama.vim
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	scripts/compare-llama-bench.py
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-08-23 11:35:28 +08:00
Georgi Gerganov
715a6db02c
kv-cache : drop the "unified" prefix (#15467)
* kv-cache : drop the "unified" prefix

ggml-ci

* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Renamed from src/llama-kv-cache-unified-iswa.cpp (Browse further)