Concedo
b08dca65ed
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/CMakeLists.txt
# common/arg.cpp
# common/chat.cpp
# examples/parallel/README.md
# examples/parallel/parallel.cpp
# ggml/cmake/common.cmake
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/rope.cpp
# models/ggml-vocab-bert-bge.gguf.inp
# models/ggml-vocab-bert-bge.gguf.out
# models/ggml-vocab-command-r.gguf.inp
# models/ggml-vocab-command-r.gguf.out
# models/ggml-vocab-deepseek-coder.gguf.inp
# models/ggml-vocab-deepseek-coder.gguf.out
# models/ggml-vocab-deepseek-llm.gguf.inp
# models/ggml-vocab-deepseek-llm.gguf.out
# models/ggml-vocab-falcon.gguf.inp
# models/ggml-vocab-falcon.gguf.out
# models/ggml-vocab-gpt-2.gguf.inp
# models/ggml-vocab-gpt-2.gguf.out
# models/ggml-vocab-llama-bpe.gguf.inp
# models/ggml-vocab-llama-bpe.gguf.out
# models/ggml-vocab-llama-spm.gguf.inp
# models/ggml-vocab-llama-spm.gguf.out
# models/ggml-vocab-mpt.gguf.inp
# models/ggml-vocab-mpt.gguf.out
# models/ggml-vocab-phi-3.gguf.inp
# models/ggml-vocab-phi-3.gguf.out
# models/ggml-vocab-qwen2.gguf.inp
# models/ggml-vocab-qwen2.gguf.out
# models/ggml-vocab-refact.gguf.inp
# models/ggml-vocab-refact.gguf.out
# models/ggml-vocab-starcoder.gguf.inp
# models/ggml-vocab-starcoder.gguf.out
# requirements/requirements-gguf_editor_gui.txt
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
# tools/mtmd/CMakeLists.txt
# tools/run/run.cpp
# tools/server/CMakeLists.txt
2025-05-31 13:04:21 +08:00
Đinh Trọng Huy
291f2b6913
llama : add support for DistilBert ( #13907 )
...
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-30 11:56:02 +02:00
zhangkaihuo
2c90da4c7e
llama : use llm_build_granite for minicpm ( #13911 )
2025-05-30 10:31:48 +02:00
Sigbjørn Skjæret
e83ba3e460
llama : add support for jina-reranker-v2 ( #13900 )
2025-05-29 21:42:31 +02:00
Sigbjørn Skjæret
6385b843a8
llama : add RobertaForSequenceClassification reranker support ( #13875 )
2025-05-29 08:15:01 +02:00
Concedo
868cb6aff7
Merge commit 'e121edc432' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# common/CMakeLists.txt
# docs/function-calling.md
# ggml/src/ggml-sycl/binbcast.cpp
# models/templates/README.md
# scripts/tool_bench.py
# src/llama-kv-cache.cpp
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/mtmd/clip.h
# tools/rpc/rpc-server.cpp
# tools/server/README.md
2025-05-28 00:20:45 +08:00
Piotr Jasiukajtis
4032ca4066
llama : add support for Qwen3 MoE tied word embeddings ( #13768 )
2025-05-25 10:29:43 +02:00
Concedo
55cc9acec5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# README.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/clip.cpp
# tools/mtmd/clip.h
2025-05-24 12:10:36 +08:00
Georgi Gerganov
d13d0f6135
hparams : initialize arrays ( #13728 )
...
ggml-ci
2025-05-23 20:16:13 +03:00
Xuan-Son Nguyen
8a2afb7520
llama : allow custom list of swa_layers ( #13726 )
2025-05-23 17:07:04 +02:00
Concedo
22ef97d7d3
Merge commit 'ab86335760' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/retrieval/retrieval.cpp
# examples/simple-chat/simple-chat.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# requirements/requirements-convert_lora_to_gguf.txt
# tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Georgi Gerganov
8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less ( #13713 )
...
ggml-ci
2025-05-22 22:21:07 +03:00
Concedo
da7fd4aa57
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# README.md
# ci/README.md
# docs/docker.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65
SWA full used unless ctx shift and fast forward are disabled
2025-05-21 22:47:45 +08:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
b44890df2e
model : disable SWA for Phi models ( #13676 )
...
* model : disable SWA for Phi models
ggml-ci
* model : update warning message
* model : print warning only if n_swa > 0
* model : fix typo
2025-05-21 13:09:21 +03:00
Georgi Gerganov
be0239693c
model : fix llama4 graph ( #13663 )
...
ggml-ci
2025-05-20 19:21:04 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Concedo
e5d26a2356
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/CMakeLists.txt
# docs/backend/SYCL.md
# ggml/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# ggml/src/gguf.cpp
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-05-16 15:30:31 +08:00
Concedo
12e6928ec2
i'm gonna regret this, aren't i?
2025-05-15 23:59:55 +08:00
Gabe Goodhart
5e7d95e22e
fix: Move build_inp_pos to the top of the graph section for build_granite ( #13538 )
...
This matches how others do it, but will still avoid the extra
initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-14 15:53:59 +03:00
Gabe Goodhart
d590cd4c24
model : Granite MoE shared ( #13269 )
...
* feat: Add GGUF conversion for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: hparam and arch plumbing for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Split MoE fused tensors for shared experts in conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First WIP cut at model arch in cpp
The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Cleaner (maybe more correct?) splitting for gate/up
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix the input to the shared experts
I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Avoid architecture-specific checks for Granite MoE Shared
This is a cleaner way that will allow more flexibility in architecture
strings going forward.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Split granite architectures out of llm_build_llama
This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).
NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix compiler warning about uninitialized inp_pos
This should not have been reachable, but it warns on some compilers
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-13 15:12:01 +02:00
Concedo
21e31e255b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# common/CMakeLists.txt
# examples/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-metal/ggml-metal.m
# ggml/src/ggml-metal/ggml-metal.metal
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/backend.hpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/README.md
# tools/mtmd/clip.cpp
# tools/rpc/rpc-server.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Concedo
6bb44391bd
Merge commit '5c86c9ed3e' into concedo_experimental
...
# Conflicts:
# tools/imatrix/imatrix.cpp
# tools/mtmd/README.md
# tools/run/README.md
# tools/run/run.cpp
2025-05-10 00:30:18 +08:00
Diego Devesa
27ebfcacba
llama : do not crash if there is no CPU backend ( #13395 )
...
* llama : do not crash if there is no CPU backend
* add checks to examples
2025-05-09 13:02:07 +02:00
Xuan-Son Nguyen
3f96aeff39
llama : one-off chat template fix for Mistral-Small-2503 ( #13398 )
...
* llama : one-off chat template fix for Mistral-Small-2503
* update readme
* add mistral-v7-tekken
2025-05-09 11:17:51 +02:00
Concedo
2439014a03
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# examples/embedding/embedding.cpp
# tools/imatrix/imatrix.cpp
# tools/perplexity/perplexity.cpp
2025-05-08 23:41:02 +08:00
Concedo
b6220669f4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# Makefile
# examples/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml.last
2025-05-08 23:07:33 +08:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings ( #13108 )
...
* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
Diego Devesa
f061021206
llama : print size and type of overridden tensors ( #13364 )
2025-05-08 13:15:15 +02:00
Sigbjørn Skjæret
bc4e1128f7
llama : deci : support ffn-free with attention ( #13296 )
2025-05-07 12:49:27 +02:00
piDack
6c7fd67b64
llama : support tie embedding for chatglm models ( #13328 )
2025-05-07 09:23:11 +02:00
Concedo
1377a93a73
Merge commit '5215b91e93' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# cmake/x64-windows-llvm.cmake
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/CMakeLists.txt
# tools/imatrix/imatrix.cpp
# tools/llava/clip.cpp
# tools/rpc/rpc-server.cpp
2025-05-06 23:15:04 +08:00
ymcki
3bf785f3ef
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support ( #12843 )
2025-05-03 17:39:51 +02:00
Concedo
0951ad9f58
temp merge, not working
2025-05-03 11:42:01 +08:00
Jared Van Bortel
2f567611c0
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken ( #13245 )
2025-05-02 11:42:30 -04:00
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl ( #12799 )
...
* kv-cache : separate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> constructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
Concedo
17cbf9fd49
plamo fixed
2025-05-02 22:46:17 +08:00
Sigbjørn Skjæret
cb06a3c363
llama : orion rope type is neox ( #13261 )
2025-05-02 12:44:24 +02:00
Sigbjørn Skjæret
626083faf7
llama : plamo rope type is neox ( #13260 )
2025-05-02 12:40:56 +02:00
Concedo
ca53d1bedc
Merge commit '13c9a3319b' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cpu/CMakeLists.txt
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2025-05-02 16:42:16 +08:00
Jared Van Bortel
a70183eb00
llama-model : fix the reported size class for nomic-embed-text-v2-moe ( #13223 )
2025-05-01 10:09:41 +03:00
Concedo
8273739412
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/cpu.Dockerfile
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/vulkan.Dockerfile
# examples/llama-bench/llama-bench.cpp
# examples/rpc/rpc-server.cpp
# scripts/compare-llama-bench.py
# tests/test-quantize-stats.cpp
2025-04-30 17:22:18 +08:00
Johannes Gäßler
cdf76586b2
CUDA: fix non-cont. inputs for batched mat mul ( #13155 )
2025-04-29 16:00:27 +02:00
Concedo
b2ecfa0f55
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/llama-bench/README.md
# examples/llama-bench/llama-bench.cpp
# examples/llava/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-chat-template.cpp
2025-04-29 21:05:16 +08:00
Sigbjørn Skjæret
7d3af70b08
llama : llm_type order by size ( #13177 )
2025-04-29 13:25:53 +02:00
Sigbjørn Skjæret
e98b3692be
llama : set qwen3 model type sizes ( #13175 )
2025-04-29 11:00:31 +02:00
AT
5f5e39e1ba
model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture ( #12466 )
...
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture
- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.
https://www.nomic.ai/blog/posts/nomic-embed-text-v2
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* fix tokenizer
* don't rename this tensor
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2025-04-28 22:52:15 +03:00
Johannes Gäßler
69699be48a
CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 ( #13137 )
2025-04-28 09:29:26 +02:00