145 commits

Author SHA1 Message Date
Diego Devesa
f061021206
llama : print size and type of overridden tensors (#13364) 2025-05-08 13:15:15 +02:00
Sigbjørn Skjæret
bc4e1128f7
llama : deci : support ffn-free with attention (#13296) 2025-05-07 12:49:27 +02:00
piDack
6c7fd67b64
llama : support tie embedding for chatglm models (#13328) 2025-05-07 09:23:11 +02:00
Concedo
1377a93a73 Merge commit '5215b91e93' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	cmake/x64-windows-llvm.cmake
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/CMakeLists.txt
#	tools/imatrix/imatrix.cpp
#	tools/llava/clip.cpp
#	tools/rpc/rpc-server.cpp
2025-05-06 23:15:04 +08:00
ymcki
3bf785f3ef
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843) 2025-05-03 17:39:51 +02:00
Concedo
0951ad9f58 temp merge, not working 2025-05-03 11:42:01 +08:00
Jared Van Bortel
2f567611c0
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245) 2025-05-02 11:42:30 -04:00
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl (#12799)
* kv-cache : separate recurrent vs non-recurrent impl (wip)

ggml-ci

* kv-cache : init -> constructor + add llama_memory_params

ggml-ci

* kv-cache : fix callback reference

ggml-ci

* context : llama_kv_cache -> llama_memory_i

ggml-ci

* context : move memory creation logic to model

ggml-ci

* llama : remove reference of memory during encode

ggml-ci

* kv-cache : hide padding details in the implementation

ggml-ci

* kv-cache : add ubatch_next()

ggml-ci

* context : simplify sbatch logic

ggml-ci

* kv-cache : hide defrag logic in the implementation

ggml-ci

* context : hide kv cache details in implementation

ggml-ci

* build : fix

ggml-ci

* cont : another fix

ggml-ci

* kv-cache : simplify interface (wip)

ggml-ci

* kv-cache : use separate KV cell structs for unified/recurrent

ggml-ci

* kv-cache : clean-up

ggml-ci

* model : better llama_model::create_model() signature

ggml-ci

* kv-cache : fix recurrent seq_rm()

ggml-ci

* kv-cache : replace `struct callbacks` with `llama_model &`

ggml-ci

* kv-cache : replace `struct graph_params` with `llama_context &`

ggml-ci

* kv-cache : fix offload check

ggml-ci

* context : avoid passing unique_ptr

ggml-ci

* kv-cache : avoid using the backends from the llama_context

ref #13113

ggml-ci

* kv-cache : more consistent debug logs [no ci]

* kv-cache : do not pass the full llama_context for kv graphs

ggml-ci

* kv-cache : remove comment

* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext

ggml-ci

* kv-cache : fix recurrent multi-user case

ggml-ci

* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
Concedo
17cbf9fd49 plamo fixed 2025-05-02 22:46:17 +08:00
Sigbjørn Skjæret
cb06a3c363
llama : orion rope type is neox (#13261) 2025-05-02 12:44:24 +02:00
Sigbjørn Skjæret
626083faf7
llama : plamo rope type is neox (#13260) 2025-05-02 12:40:56 +02:00
Concedo
ca53d1bedc Merge commit '13c9a3319b' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2025-05-02 16:42:16 +08:00
Jared Van Bortel
a70183eb00
llama-model : fix the reported size class for nomic-embed-text-v2-moe (#13223) 2025-05-01 10:09:41 +03:00
Concedo
8273739412 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/cpu.Dockerfile
#	.devops/cuda.Dockerfile
#	.devops/intel.Dockerfile
#	.devops/llama-cli-cann.Dockerfile
#	.devops/musa.Dockerfile
#	.devops/rocm.Dockerfile
#	.devops/vulkan.Dockerfile
#	examples/llama-bench/llama-bench.cpp
#	examples/rpc/rpc-server.cpp
#	scripts/compare-llama-bench.py
#	tests/test-quantize-stats.cpp
2025-04-30 17:22:18 +08:00
Johannes Gäßler
cdf76586b2
CUDA: fix non-cont. inputs for batched mat mul (#13155) 2025-04-29 16:00:27 +02:00
Concedo
b2ecfa0f55 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/llama-bench/README.md
#	examples/llama-bench/llama-bench.cpp
#	examples/llava/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-chat-template.cpp
2025-04-29 21:05:16 +08:00
Sigbjørn Skjæret
7d3af70b08
llama : llm_type order by size (#13177) 2025-04-29 13:25:53 +02:00
Sigbjørn Skjæret
e98b3692be
llama : set qwen3 model type sizes (#13175) 2025-04-29 11:00:31 +02:00
AT
5f5e39e1ba
model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture (#12466)
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture

- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.

https://www.nomic.ai/blog/posts/nomic-embed-text-v2

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

* fix tokenizer

* don't rename this tensor

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2025-04-28 22:52:15 +03:00
Johannes Gäßler
69699be48a
CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137) 2025-04-28 09:29:26 +02:00
Concedo
bce519cee7 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	tests/test-backend-ops.cpp
2025-04-18 12:44:20 +08:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Concedo
06159939d9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	docs/build.md
#	examples/rpc/rpc-server.cpp
#	examples/sycl/build.sh
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	scripts/sync-ggml.last
2025-04-17 00:52:37 +08:00
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation (#12801)
* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm)

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF

* Removed MQA optimisation from `build_attn_mha()` as no gains now

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
Concedo
a0ae187563 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	README.md
#	build-xcframework.sh
#	examples/llava/CMakeLists.txt
#	examples/llava/clip.cpp
#	examples/rpc/rpc-server.cpp
#	examples/run/run.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
2025-04-12 10:06:47 +08:00
Concedo
ea9bd61e47 Merge commit '64eda5deb9' into concedo_experimental
# Conflicts:
#	.devops/cuda.Dockerfile
#	.devops/intel.Dockerfile
#	.devops/llama-cli-cann.Dockerfile
#	.devops/musa.Dockerfile
#	.devops/rocm.Dockerfile
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	README.md
#	docs/backend/SYCL.md
#	examples/llava/clip.cpp
#	examples/server_embd.py
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	src/CMakeLists.txt
#	tests/test-chat-template.cpp
2025-04-12 08:31:22 +08:00
Yuxuan Zhang
06bb53ad9b
llama-model : add Glm4Model implementation for GLM-4-0414 (#12867)
* GLM-4-0414

* use original one

* Using with tensor map

* fix bug

* change order

* change order

* format with flake8
2025-04-11 12:10:10 +02:00
Xuan-Son Nguyen
8b91d5355a
llama : correct rms norm for llama 4 (#12882) 2025-04-11 08:49:50 +02:00
Bo Zheng
d3bd7193ba
llama : Support Qwen3 and Qwen3MoE (#12828)
* add qwen3 & qwen3moe support.

* fix

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
2025-04-09 11:47:36 +02:00
Concedo
ebf924c5d1 Merge branch 'upstream' into concedo_experimental 2025-04-08 21:46:30 +08:00
Xuan-Son Nguyen
1466621e73
llama : Support llama 4 text-only (#12791)
* llama4 conversion

* initial support, no chat template

* clean up a bit

* fix tokenizer conversion

* correct hparams

* try this

* fix shexp

* ffn_inp_normed

* chat template

* clean up model conversion

* add_bos

* add scale_before_ffn

* fix order

* weight_before_ffn

* llm_graph_input_attn_temp

* add chunk attn mask

* build_inp_attn_scale()

* add comment about ggml_repeat

* clarify comments

* fix build
2025-04-07 23:06:44 +02:00
Concedo
103d60ed2c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/common.cpp
#	examples/batched-bench/batched-bench.cpp
#	examples/batched/batched.cpp
#	examples/export-lora/export-lora.cpp
#	examples/gritlm/gritlm.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
Concedo
9e182b3e78 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	docs/backend/SYCL.md
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	scripts/sync-ggml.last
#	tests/test-chat-template.cpp
2025-04-01 20:16:07 +08:00
Sigbjørn Skjæret
2c3f8b850a
llama : support BailingMoE (Ling) (#12634) 2025-03-30 22:21:03 +02:00
Concedo
ce05aa722d Merge commit '0bb2919335' into concedo_experimental
# Conflicts:
#	ggml/src/CMakeLists.txt
#	src/llama-model.cpp
2025-03-30 18:18:20 +08:00
Djip007
0bb2919335
llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632)
This allows using GPU host when possible instead of CPU repack.
It has the same effect of resolving this issue (#12498) without
completely disabling the CPU extra buffer.

Co-authored-by: philou <philou@framework>
2025-03-29 14:07:37 +01:00
Concedo
396875e1c4 update api docs and lite 2025-03-29 15:39:25 +08:00
Sigbjørn Skjæret
3714c3ee1a
llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631) 2025-03-28 22:13:02 +01:00
Si1w
f125b8dccf
llama : add PLM GGUF Conversion & Inference Support (#12457)
* add edgellm model arch[conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refactor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig errors

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch
2025-03-27 12:49:15 +02:00
HighDoping
953c2a62cf
model : restore support for T5Encoder (#12590) 2025-03-27 11:43:33 +01:00
Concedo
ea358369cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/README.md
#	ci/run.sh
#	docs/backend/CUDA-FEDORA.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Xuan-Son Nguyen
fbdfefe74e
llama : gemma3 : use output tensor if it exists in model weight (#12506)
* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names
2025-03-22 23:28:19 +01:00
Concedo
ae670dbe0e no repacking for avx2 for kcpp because it breaks 4_0_4_4 quants 2025-03-22 00:33:27 +08:00
Concedo
7030ebf401 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
Georgi Gerganov
af04481e6b
model : do not repack if a GPU device is present (#12498)
ggml-ci
2025-03-21 16:14:29 +02:00
Sigbjørn Skjæret
960e726077
chore : cleanup llama_model_loader::TENSOR_ usage (#12492) 2025-03-21 10:21:36 +01:00
Sigbjørn Skjæret
dbb3a4739e
llama : make Qwen2MoE QKV bias optional (#12477) 2025-03-20 12:49:59 +01:00
Concedo
0c90d2ebcf Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	docs/backend/SYCL.md
#	examples/main/README.md
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Sigbjørn Skjæret
108e53c2f1
llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)
* Add support for GPT2, Bloom and CodeShell tied word embeddings

* Deduplicate tied word embeddings weights

* Workaround for incorrect weight map

It appears transformer.wte.weight is in the weight map even though the weights are not there; remove it if output weights are encountered first.

* check++

* fatfingers--
2025-03-19 09:08:49 +01:00