Commit graph

145 commits

Author SHA1 Message Date
Concedo
b59b5dbbd1 Merge commit '456af35eb7' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-sycl/getrows.cpp
#	src/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
2025-06-20 23:41:27 +08:00
Gabe Goodhart
edc4a29eff
memory : Hybrid recurrent cache (#13979)
* feat: Add llama_model_is_hybrid API call

Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybrid. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
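
The delegation described in this item can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `toy_arch` enum, the `toy_model` struct, and both `toy_`-prefixed functions are hypothetical stand-ins, and which architectures count as recurrent here is only an example.

```cpp
// Hypothetical stand-ins for the real architecture enum and model struct.
enum toy_arch { TOY_ARCH_ATTENTION, TOY_ARCH_RECURRENT };

struct toy_model {
    toy_arch arch;
};

// Architecture-level predicate: usable before any model object exists,
// e.g. while the hparams are still being filled in.
inline bool toy_arch_is_recurrent(toy_arch arch) {
    switch (arch) {
        case TOY_ARCH_RECURRENT: return true;
        default:                 return false;
    }
}

// Model-level predicate: simply delegates to the architecture-level check.
inline bool toy_model_is_recurrent(const toy_model * model) {
    return toy_arch_is_recurrent(model->arch);
}
```

The point of the split is that code holding only the architecture value can ask the same question as code holding a fully initialized model.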

* feat: Add c++ side constants for attention layer indices hparam

Branch: GraniteFour

* feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: rename *_is_hybrid -> *_is_hybrid_recurrent

The implementation of the hybrid cache intentionally does not specify the
types of the child caches, so there was a naming mismatch with these
predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add layer filter to recurrent cache

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use per-layer sizing everywhere in kv caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First pass at llama_kv_cache_hybrid_recurrent

This follows the pattern in iswa where the two child caches are held
explicitly to support the case where a model requires a single attention
cache and a single recurrent cache where each layer uses exactly one of the
caches.

This is a rewrite of the more generic approach in the original hybrid cache
PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
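
A rough sketch of the "two explicit child caches" pattern described in this item, assuming each layer belongs to exactly one child. All types and names here (`toy_cache`, `toy_hybrid_cache`, `layer_filter`, `make_toy_hybrid`) are hypothetical and model only which layer indices each child owns, not any real tensor storage.

```cpp
#include <functional>
#include <vector>

// Hypothetical predicate deciding whether a cache should own layer `il`.
using layer_filter = std::function<bool(int)>;

struct toy_cache {
    std::vector<int> layers; // indices of the layers stored in this cache
};

// Build a cache that keeps only the layers accepted by the filter.
inline toy_cache make_toy_cache(int n_layer, const layer_filter & keep) {
    toy_cache cache;
    for (int il = 0; il < n_layer; ++il) {
        if (keep(il)) {
            cache.layers.push_back(il);
        }
    }
    return cache;
}

// The hybrid cache holds its two children explicitly.
struct toy_hybrid_cache {
    toy_cache attn; // child for attention layers
    toy_cache recr; // child for recurrent layers
};

// Split the layers by a per-layer "is recurrent" flag array.
inline toy_hybrid_cache make_toy_hybrid(const std::vector<bool> & is_recurrent) {
    const int n_layer = static_cast<int>(is_recurrent.size());
    return {
        make_toy_cache(n_layer, [&](int il) { return !is_recurrent[il]; }),
        make_toy_cache(n_layer, [&](int il) { return  is_recurrent[il]; }),
    };
}
```

With the flags `{true, false, true, true}`, the attention child would own layer 1 and the recurrent child layers 0, 2 and 3. Passing different filters to each child is also how overlapping layer sets could be expressed.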

* feat: Construct hybrid recurrent cache for hybrid recurrent models

This includes a refactor of the create_memory logic to avoid needing to use
the arch enum explicitly unless a model needs explicit cache instantiation
logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix wrong bool condition for split equal in hybrid cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix shift logic to defer to unified cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Support hybrid recurrent in llama-graph

NOTE: I intentionally did not add support for s_mask since it will be going
away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix logic for initializing inputs and attn layers for hybrid caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update recurrent cache for changes to remove intermediate kv_cache interface

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix status for init_update sig for recurrent cache state

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing padding to n_ctx for hybrid cache construction

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update clear signature for data argument after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove errant virtual destructor leftover from previous impl attempt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove n_embd_k/v_s from unified cache

No longer needed now that unified isn't also supporting recurrent

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069

Branch: HybridRecurrentCache

* refactor: Remove layer index from n_embd_k/v_s

Now that it's not used at all in the unified cache, we don't need to use
the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove n_embd_k/v_gqa from recurrent cache

This is no longer needed now that there are separate implementations

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow custom layer filters for hybrid recurrent

This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove logits_all after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove llama_model_is_hybrid_recurrent public API

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use llama_memory_state_ptr for child states in hybrid memory state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/
llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
llm_graph_input_attn_kv_hybrid_recurrent * as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix resize vs reserve and skip null tensors in size computation

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: @younesbelkada

* fix: Fix initialization of child states

Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use a common build_recurrent_state method that is cache-agnostic

This reduces the code duplication between the different build_rs impls and
also retains a similar signature to the previous build_recurrent_state
method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* recurrent : rework graph inputs + add TODOs

ggml-ci

* refactor: Make status and child states const in hybrid and iswa

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache

This removes the notion of "kv" from the interface names for these memory
types. There are still many references to kv in the implementation of the
recurrent memory which will need further adjustment.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor!: Rename all k/v related values for recurrent/hybrid to r/s

Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more
generic "mem_" prefix. The specifics of "k" (key) translate to "r"
(recurrent state) and "v" (value) translate to "s" (state-space embedding
states).

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: _recurrent -> _recr for brevity

It just _happens_ to have the same number of letters as _attn!

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix spacing for ref

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: recurrent_layer() -> is_recurrent()

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix spacing for size_s_bytes declaration

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 08:08:14 +03:00
Concedo
4356a00f4a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	docs/function-calling.md
#	examples/gritlm/gritlm.cpp
#	ggml/CMakeLists.txt
#	ggml/cmake/common.cmake
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	requirements/requirements-compare-llama-bench.txt
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Đinh Trọng Huy
ad590be98c
model : add NeoBERT (#14164)
* convert neobert model to gguf

* add inference graph

* fix flake8 lint

* followed reviewer suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* follow reviewers suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* override NeoBERT feed-forward length

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-16 14:53:41 +02:00
Bartowski
d7da8dc83a
model : Add support for Arcee AI's upcoming AFM model (#14185)
* Add Arcee AFM support

* Add draft update code

* Fix linter and update URL, may still not be final

* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Remove accidental blank line

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-06-16 01:04:06 +02:00
Mikko Juola
9ae4143bc6
model : add dots.llm1 architecture support (#14044) (#14118)
Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The architecture is called "dots.llm1" (I decided to shorten it to dots1 or
DOTS1 in the code generally).

The only models that exist as of writing of this commit that follow this
architecture are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

ffe12627b4/src/transformers/models/dots1/modular_dots1.py
2025-06-15 09:52:06 +02:00
Concedo
69e4a32ca2 Merge commit 'd4e0d95cf5' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	common/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
2025-06-14 01:58:53 +08:00
Concedo
4204f111f7 Merge commit '8f47e25f56' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-linux-cross.yml
#	docs/backend/CANN.md
#	examples/batched.swift/Sources/main.swift
#	examples/embedding/embedding.cpp
#	examples/gritlm/gritlm.cpp
#	examples/llama.android/llama/src/main/cpp/llama-android.cpp
#	examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/retrieval/retrieval.cpp
#	examples/save-load-state/save-load-state.cpp
#	examples/simple-chat/simple-chat.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	tools/batched-bench/batched-bench.cpp
#	tools/cvector-generator/cvector-generator.cpp
#	tools/imatrix/imatrix.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
#	tools/run/run.cpp
2025-06-13 22:05:03 +08:00
compilade
dad5c44398
kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
* kv-cache : avoid modifying recurrent cells when setting inputs

* kv-cache : remove inp_s_mask

It was replaced with equivalent and simpler functionality
with rs_z (the first zeroed state) and the already-existing inp_s_copy.

* kv-cache : fix non-consecutive token pos warning for recurrent models

The problem was apparently caused by how the tail cells were swapped.

* graph : simplify logic for recurrent state copies

* kv-cache : use cell without src refs for rs_z in recurrent cache

* llama-graph : fix recurrent state copy

The `state_copy` shuffle assumes everything is moved at once,
which is not true when `states_extra` is copied back to the cache
before copying the range of states between `head` and `head + n_seqs`.
This is only a problem if any of the cells in [`head`, `head + n_seqs`)
have an `src` in [`head + n_seqs`, `head + n_kv`),
which does happen when `n_ubatch > 1` in the `llama-parallel` example.

Changing the order of the operations avoids the potential overwrite
before use, although when copies are avoided (like with Mamba2),
this will require further changes.
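
The overwrite-before-use hazard described in this item can be reproduced with a toy model. This is only a sketch under simplifying assumptions: each cell's state is a single int, `head` is fixed at 0, and the helper names (`gather`, `write_back`, `gather_extra_first`, `gather_main_first`) are hypothetical rather than taken from llama-graph.

```cpp
#include <vector>

// out[k] = cache[src[begin + k]] for cells in [begin, end).
inline std::vector<int> gather(const std::vector<int> & cache,
                               const std::vector<int> & src,
                               int begin, int end) {
    std::vector<int> out;
    for (int i = begin; i < end; ++i) {
        out.push_back(cache[src[i]]);
    }
    return out;
}

// Store vals into cache starting at cell `begin`.
inline void write_back(std::vector<int> & cache,
                       const std::vector<int> & vals, int begin) {
    for (int k = 0; k < static_cast<int>(vals.size()); ++k) {
        cache[begin + k] = vals[k];
    }
}

// Hazardous order: the extra states are written back to the cache first,
// so a main cell whose src lies in [n_seqs, n_kv) reads an overwritten value.
inline std::vector<int> gather_extra_first(std::vector<int> cache,
                                           const std::vector<int> & src,
                                           int n_seqs) {
    const int n_kv = static_cast<int>(cache.size());
    write_back(cache, gather(cache, src, n_seqs, n_kv), n_seqs);
    return gather(cache, src, 0, n_seqs);
}

// Safe order: read the main range before anything in the cache is overwritten.
inline std::vector<int> gather_main_first(std::vector<int> cache,
                                          const std::vector<int> & src,
                                          int n_seqs) {
    const int n_kv = static_cast<int>(cache.size());
    std::vector<int> main_states = gather(cache, src, 0, n_seqs);
    write_back(cache, gather(cache, src, n_seqs, n_kv), n_seqs);
    return main_states;
}
```

For example, with `cache = {10, 20, 30, 40}`, `src = {2, 1, 3, 3}` and `n_seqs = 2`, cell 0 takes its state from cell 2, which sits in the extra range. The safe order yields `{30, 20}` for the main states, while the hazardous order yields `{40, 20}` because cell 2 was already overwritten.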

* llama-graph : rename n_state to state_size in build_recurrent_state

This naming should reduce confusion between the state size
and the number of states.
2025-06-10 18:20:14 -04:00
Sigbjørn Skjæret
3678b838bb
llama : support GEGLU for jina-bert-v2 (#14090) 2025-06-10 18:02:08 +02:00
Sigbjørn Skjæret
0974ad7a7c
llama : fix llama_model_chat_template with template name (LLM_KV with suffix) (#14050) 2025-06-07 14:13:12 +02:00
Concedo
d33c88b1f4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	examples/embedding/embedding.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	src/CMakeLists.txt
2025-06-06 17:56:51 +08:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels (#13940) 2025-06-06 09:03:25 +02:00
Concedo
bc89b465a8 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	.github/workflows/server.yml
#	README.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
2025-06-05 11:03:34 +08:00
Georgi Gerganov
5582c49c39
gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B

* cont : apply scale before attn

* cont : consistent attention scaling
2025-06-02 20:54:26 +03:00
Concedo
5e667659ec Merge commit '0fc16b42e8' into concedo_experimental
# Conflicts:
#	src/CMakeLists.txt
#	src/llama-kv-cache.cpp
2025-06-02 23:14:23 +08:00
Concedo
6ce85c54d6 not working correctly 2025-06-02 22:12:10 +08:00
Georgi Gerganov
0fc16b42e8
kv-cache : split implementation in separate sources (#13920)
ggml-ci
2025-06-01 11:39:27 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
Concedo
b08dca65ed Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/CMakeLists.txt
#	common/arg.cpp
#	common/chat.cpp
#	examples/parallel/README.md
#	examples/parallel/parallel.cpp
#	ggml/cmake/common.cmake
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	models/ggml-vocab-bert-bge.gguf.inp
#	models/ggml-vocab-bert-bge.gguf.out
#	models/ggml-vocab-command-r.gguf.inp
#	models/ggml-vocab-command-r.gguf.out
#	models/ggml-vocab-deepseek-coder.gguf.inp
#	models/ggml-vocab-deepseek-coder.gguf.out
#	models/ggml-vocab-deepseek-llm.gguf.inp
#	models/ggml-vocab-deepseek-llm.gguf.out
#	models/ggml-vocab-falcon.gguf.inp
#	models/ggml-vocab-falcon.gguf.out
#	models/ggml-vocab-gpt-2.gguf.inp
#	models/ggml-vocab-gpt-2.gguf.out
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	models/ggml-vocab-llama-spm.gguf.inp
#	models/ggml-vocab-llama-spm.gguf.out
#	models/ggml-vocab-mpt.gguf.inp
#	models/ggml-vocab-mpt.gguf.out
#	models/ggml-vocab-phi-3.gguf.inp
#	models/ggml-vocab-phi-3.gguf.out
#	models/ggml-vocab-qwen2.gguf.inp
#	models/ggml-vocab-qwen2.gguf.out
#	models/ggml-vocab-refact.gguf.inp
#	models/ggml-vocab-refact.gguf.out
#	models/ggml-vocab-starcoder.gguf.inp
#	models/ggml-vocab-starcoder.gguf.out
#	requirements/requirements-gguf_editor_gui.txt
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/run/run.cpp
#	tools/server/CMakeLists.txt
2025-05-31 13:04:21 +08:00
Đinh Trọng Huy
291f2b6913
llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-30 11:56:02 +02:00
zhangkaihuo
2c90da4c7e
llama : use llm_build_granite for minicpm (#13911) 2025-05-30 10:31:48 +02:00
Sigbjørn Skjæret
e83ba3e460
llama : add support for jina-reranker-v2 (#13900) 2025-05-29 21:42:31 +02:00
Sigbjørn Skjæret
6385b843a8
llama : add RobertaForSequenceClassification reranker support (#13875) 2025-05-29 08:15:01 +02:00
Concedo
868cb6aff7 Merge commit 'e121edc432' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	common/CMakeLists.txt
#	docs/function-calling.md
#	ggml/src/ggml-sycl/binbcast.cpp
#	models/templates/README.md
#	scripts/tool_bench.py
#	src/llama-kv-cache.cpp
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/mtmd/clip.h
#	tools/rpc/rpc-server.cpp
#	tools/server/README.md
2025-05-28 00:20:45 +08:00
Piotr Jasiukajtis
4032ca4066
llama : add support for Qwen3 MoE tied word embeddings (#13768) 2025-05-25 10:29:43 +02:00
Concedo
55cc9acec5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	README.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/clip.cpp
#	tools/mtmd/clip.h
2025-05-24 12:10:36 +08:00
Georgi Gerganov
d13d0f6135
hparams : initialize arrays (#13728)
ggml-ci
2025-05-23 20:16:13 +03:00
Xuan-Son Nguyen
8a2afb7520
llama : allow custom list of swa_layers (#13726) 2025-05-23 17:07:04 +02:00
Concedo
22ef97d7d3 Merge commit 'ab86335760' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	examples/retrieval/retrieval.cpp
#	examples/simple-chat/simple-chat.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-convert_hf_to_gguf.txt
#	requirements/requirements-convert_hf_to_gguf_update.txt
#	requirements/requirements-convert_lora_to_gguf.txt
#	tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Georgi Gerganov
8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci
2025-05-22 22:21:07 +03:00
Concedo
da7fd4aa57 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/musa.Dockerfile
#	.github/workflows/build.yml
#	README.md
#	ci/README.md
#	docs/docker.md
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65 swa full used unless ctx shift and fast forward disabled 2025-05-21 22:47:45 +08:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface (#13660)
* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
b44890df2e
model : disable SWA for Phi models (#13676)
* model : disable SWA for Phi models

ggml-ci

* model : update warning message

* model : print warning only if n_swa > 0

* model : fix typo
2025-05-21 13:09:21 +03:00
Georgi Gerganov
be0239693c
model : fix llama4 graph (#13663)
ggml-ci
2025-05-20 19:21:04 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
2025-05-20 08:05:46 +03:00
Concedo
e5d26a2356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	ggml/src/gguf.cpp
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-05-16 15:30:31 +08:00
Concedo
12e6928ec2 i'm gonna regret this, aren't i? 2025-05-15 23:59:55 +08:00
Gabe Goodhart
5e7d95e22e
fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)
This matches how others do it, but will still avoid the extra
initialization when rope is disabled.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-14 15:53:59 +03:00
Gabe Goodhart
d590cd4c24
model : Granite MoE shared (#13269)
* feat: Add GGUF conversion for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: hparam and arch plumbing for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Split MoE fused tensors for shared experts in conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First WIP cut at model arch in cpp

The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Cleaner (maybe more correct?) splitting for gate/up

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix the input to the shared experts

I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Avoid architecture-specific checks for Granite MoE Shared

This is a cleaner way that will allow more flexibility in architecture
strings going forward.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Split granite architectures out of llm_build_llama

This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).

NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix compiler warning about uninitialized inp_pos

This should not have been reachable, but it warns on some compilers

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-13 15:12:01 +02:00
Concedo
21e31e255b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	README.md
#	build-xcframework.sh
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-metal/ggml-metal.m
#	ggml/src/ggml-metal/ggml-metal.metal
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	scripts/compare-llama-bench.py
#	src/CMakeLists.txt
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
#	tools/mtmd/clip.cpp
#	tools/rpc/rpc-server.cpp
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Concedo
6bb44391bd Merge commit '5c86c9ed3e' into concedo_experimental
# Conflicts:
#	tools/imatrix/imatrix.cpp
#	tools/mtmd/README.md
#	tools/run/README.md
#	tools/run/run.cpp
2025-05-10 00:30:18 +08:00
Diego Devesa
27ebfcacba
llama : do not crash if there is no CPU backend (#13395)
* llama : do not crash if there is no CPU backend

* add checks to examples
2025-05-09 13:02:07 +02:00
Xuan-Son Nguyen
3f96aeff39
llama : one-off chat template fix for Mistral-Small-2503 (#13398)
* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken
2025-05-09 11:17:51 +02:00
Concedo
2439014a03 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	examples/embedding/embedding.cpp
#	tools/imatrix/imatrix.cpp
#	tools/perplexity/perplexity.cpp
2025-05-08 23:41:02 +08:00
Concedo
b6220669f4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	examples/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/convert.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/sync-ggml.last
2025-05-08 23:07:33 +08:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00