Concedo
a0ae187563
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# examples/llava/CMakeLists.txt
# examples/llava/clip.cpp
# examples/rpc/rpc-server.cpp
# examples/run/run.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-04-12 10:06:47 +08:00
Concedo
ea9bd61e47
Merge commit '64eda5deb9' into concedo_experimental
# Conflicts:
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# docs/backend/SYCL.md
# examples/llava/clip.cpp
# examples/server_embd.py
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# src/CMakeLists.txt
# tests/test-chat-template.cpp
2025-04-12 08:31:22 +08:00
Yuxuan Zhang
06bb53ad9b
llama-model : add Glm4Model implementation for GLM-4-0414 (#12867)
* GLM-4-0414
* use original one
* Using with tensor map
* fix bug
* change order
* change order
* format with flake8
2025-04-11 12:10:10 +02:00
Xuan-Son Nguyen
8b91d5355a
llama : correct rms norm for llama 4 (#12882)
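A brief aside for context, since the one-line subject assumes familiarity with RMS norm: unlike standard LayerNorm, it scales each hidden vector by its root mean square (no mean subtraction), with a small epsilon for stability. Sketched in LaTeX:

```latex
% RMS norm over a hidden vector x of dimension d, with learned gain g:
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}} \, g_i
```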
2025-04-11 08:49:50 +02:00
Bo Zheng
d3bd7193ba
llama : Support Qwen3 and Qwen3MoE (#12828)
* add qwen3 & qwen3moe support.
* fix
---------
Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
2025-04-09 11:47:36 +02:00
Plamen Minev
381603a775
ci: detach common from the library (#12827)
* fix: detach common from the library
* fix: building chat test template
2025-04-09 10:11:11 +02:00
Georgi Gerganov
a19b5cef16
llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
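A minimal sketch of the casting idea in the bullets above, assuming the public ggml API (`ggml_cast`, `ggml_flash_attn_ext`); the function and variable names are hypothetical, not the actual `llama.cpp` graph-builder code:

```cpp
#include "ggml.h"

// Hypothetical fragment: with no KV cache, K and V come straight from the
// graph as F32; casting them to F16 lets the existing F16 flash-attention
// path handle the embeddings case.
struct ggml_tensor * build_fa_no_cache(struct ggml_context * ctx,
                                       struct ggml_tensor * q,
                                       struct ggml_tensor * k,
                                       struct ggml_tensor * v,
                                       struct ggml_tensor * mask,
                                       float kq_scale) {
    struct ggml_tensor * k_f16 = ggml_cast(ctx, k, GGML_TYPE_F16);
    struct ggml_tensor * v_f16 = ggml_cast(ctx, v, GGML_TYPE_F16);
    // max_bias = 0 (no ALiBi), logit_softcap = 0 (disabled)
    return ggml_flash_attn_ext(ctx, q, k_f16, v_f16, mask, kq_scale, 0.0f, 0.0f);
}
```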
2025-04-08 19:54:51 +03:00
Concedo
ebf924c5d1
Merge branch 'upstream' into concedo_experimental
2025-04-08 21:46:30 +08:00
Concedo
822cf2430e
Merge commit 'f1e3eb4249' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/backend/SYCL.md
# examples/llava/clip.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/cmake/host-toolchain.cmake.in
2025-04-08 20:48:53 +08:00
Xuan-Son Nguyen
1466621e73
llama : Support llama 4 text-only (#12791)
* llama4 conversion
* initial support, no chat template
* clean up a bit
* fix tokenizer conversion
* correct hparams
* try this
* fix shexp
* ffn_inp_normed
* chat template
* clean up model conversion
* add_bos
* add scale_before_ffn
* fix order
* weight_before_ffn
* llm_graph_input_attn_temp
* add chunk attn mask
* build_inp_attn_scale()
* add comment about ggml_repeat
* clarify comments
* fix build
2025-04-07 23:06:44 +02:00
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models (#12756)
ggml-ci
2025-04-04 21:48:10 +03:00
Concedo
4e740311fe
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# tests/test-chat-template.cpp
2025-04-04 15:07:47 +08:00
yumeyao
5dd5d1ab00
vocab : use string_view::find() to avoid unnecessary lookups beyond the fragment range (#12706)
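The idea in the title, illustrated with a standalone sketch (the function and names here are illustrative, not the actual vocab code): slicing a `string_view` down to the fragment first means `find()` can never scan past the fragment's end:

```cpp
#include <string>
#include <string_view>

// Search for `needle` only inside text[offset, offset + length): taking a
// string_view slice first bounds the search, so find() cannot look beyond
// the fragment range. Returns an offset into the full text, or npos.
size_t find_in_fragment(std::string_view text, std::string_view needle,
                        size_t offset, size_t length) {
    std::string_view frag = text.substr(offset, length);
    size_t pos = frag.find(needle);
    return pos == std::string_view::npos ? std::string_view::npos : offset + pos;
}
```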
2025-04-03 18:32:54 +03:00
Concedo
103d60ed2c
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# common/common.cpp
# examples/batched-bench/batched-bench.cpp
# examples/batched/batched.cpp
# examples/export-lora/export-lora.cpp
# examples/gritlm/gritlm.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-vulkan/CMakeLists.txt
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Georgi Gerganov
833e2b7409
model : print tensor size during load (#12711)
* model : print tensor size during load
* cont : fix units MB -> MiB
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
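On the MB -> MiB fix: the two units differ (1 MB = 10^6 bytes, 1 MiB = 2^20 = 1,048,576 bytes), so sizes computed by dividing by 1024*1024 should be labeled MiB. A trivial illustration (not the actual loader code):

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t nbytes = 178257920;                      // example tensor size
    std::printf("%8.2f MiB\n", nbytes / (1024.0 * 1024.0)); //   170.00 MiB
    std::printf("%8.2f MB\n",  nbytes / 1e6);               //   178.26 MB
}
```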
2025-04-02 16:38:54 +03:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
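A sketch of how the option from this PR might be used from the C API; the struct and field names (`llama_model_tensor_buft_override`, `tensor_buft_overrides`) and the null-pattern terminator follow my reading of #11397 and should be treated as assumptions, not a verified API reference:

```cpp
#include "llama.h"
#include "ggml-backend.h"

// Hypothetical sketch: keep tensors whose names match an "exps" pattern in
// plain CPU memory while the rest follow the default placement.
int main() {
    llama_model_tensor_buft_override overrides[] = {
        { "\\.ffn_.*_exps\\.", ggml_backend_cpu_buffer_type() },
        { nullptr,             nullptr                        }, // assumed terminator
    };

    llama_model_params mparams   = llama_model_default_params();
    mparams.tensor_buft_overrides = overrides;

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    // ... use the model ...
    llama_model_free(model);
}
```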
2025-04-02 14:52:01 +02:00
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard
ggml-ci
* cont : fix comment [no ci]
* llama : fix kv_cache restore logic
ggml-ci
* context : simplify kv cache updates
ggml-ci
* cont : better name [no ci]
* llama : fix llama_decode return code when could not find KV slot
ggml-ci
* context : change log err -> warn [no ci]
* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Sigbjørn Skjæret
83a88bd6af
vocab : BailingMoE : change possessive quantifiers to greedy (#12677)
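Context for the title (my illustration, not code from the change): a possessive quantifier such as `a++` never gives back characters once matched, while a greedy `a+` backtracks; notably, `std::regex` has no possessive quantifiers at all, which makes the greedy form the portable choice:

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Greedy: "a+" first grabs all three 'a's, then backtracks one so the
    // trailing literal 'a' can match -> the overall match succeeds.
    // Possessive "a++a" (unsupported by std::regex) would consume every 'a'
    // and refuse to give one back, so the same input would fail to match.
    std::regex  greedy("a+a");
    std::string s = "aaa";
    std::cout << std::boolalpha << std::regex_match(s, greedy) << "\n"; // true
}
```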
2025-04-02 11:21:48 +02:00
jklincn
e39e727e9a
llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672)
2025-04-01 14:54:28 +02:00
Concedo
9e182b3e78
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/backend/SYCL.md
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# scripts/sync-ggml.last
# tests/test-chat-template.cpp
2025-04-01 20:16:07 +08:00
Daniel Bevenius
c80a7759da
vocab : add special infill tokens for CodeLlama (#11850)
* vocab : add special infill tokens for CodeLlama
The commit adds the following special tokens for CodeLlama infill:
- `▁<PRE>`
- `▁<SUF>`
- `▁<MID>`
The motivation for this is that the infill example currently suggests CodeLlama as a model, but when using this model the following error is generated:
```console
/llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
305251 Aborted (core dumped)
./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \
-c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \
--in-prefix "def helloworld():\n print(\"hell" \
--in-suffix "\n print(\"goodbye world\")\n "
```
* squash! vocab : add special infill tokens for CodeLlama
Add `▁<EOT>` as well.
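To make the token layout concrete, a hedged sketch (assuming the public `llama_vocab_fim_*` accessors; the helper itself is hypothetical) of how an infill prompt is typically assembled as `<PRE> prefix <SUF> suffix <MID>`, after which the model generates the middle and stops at `<EOT>`:

```cpp
#include "llama.h"
#include <vector>

// Hypothetical helper: build a fill-in-the-middle token sequence from
// pre-tokenized prefix and suffix, using the vocab's special FIM tokens.
static std::vector<llama_token> build_fim_prompt(const llama_vocab * vocab,
                                                 const std::vector<llama_token> & prefix,
                                                 const std::vector<llama_token> & suffix) {
    std::vector<llama_token> out;
    out.push_back(llama_vocab_fim_pre(vocab));    // ▁<PRE>
    out.insert(out.end(), prefix.begin(), prefix.end());
    out.push_back(llama_vocab_fim_suf(vocab));    // ▁<SUF>
    out.insert(out.end(), suffix.begin(), suffix.end());
    out.push_back(llama_vocab_fim_mid(vocab));    // ▁<MID>; the middle is generated next
    return out;
}
```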
2025-03-31 18:40:56 +02:00
Sigbjørn Skjæret
2c3f8b850a
llama : support BailingMoE (Ling) (#12634)
2025-03-30 22:21:03 +02:00
Juyoung Suk
b3de7cac73
llama : add Trillion 7B model support (#12556)
* Support Trillion 7B
* Update llama.h
* Update llama.h
* Update llama-vocab.cpp for Trillion
* Update llama-vocab.cpp
2025-03-30 20:38:33 +02:00
Sergei Vorobyov
7242dd9675
llama-chat : Add Yandex instruct model template support (#12621)
* add yandex template
* update yandex chat template
* fix tests
* adjust chat template
* fix style
* fix tool macro in template
* add clarify comment
---------
Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-30 20:12:03 +02:00
Concedo
e6337ff957
Merge commit 'e408d4351a' into concedo_experimental
# Conflicts:
# ggml/CMakeLists.txt
2025-03-30 18:26:02 +08:00
Concedo
ce05aa722d
Merge commit '0bb2919335' into concedo_experimental
# Conflicts:
# ggml/src/CMakeLists.txt
# src/llama-model.cpp
2025-03-30 18:18:20 +08:00
Xuan-Son Nguyen
af6ae1efb2
llama : fix non-causal mask for gemma 3 (#12615)
2025-03-30 00:07:37 +01:00
Djip007
0bb2919335
llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632)
This allows using the GPU host buffer type when possible instead of CPU repack.
This has the same effect as resolving issue #12498 without completely disabling the CPU extra buffer.
Co-authored-by: philou <philou@framework>
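The ordering described above, as a tiny illustrative priority list (the names are mine, not the actual `cpu_buft_list` code): candidate buffer types are tried front to back and the first usable one wins, so placing GPU host before CPU extra prefers pinned host memory over CPU repack buffers:

```cpp
#include <vector>

// Illustrative only: the preference order from this commit. A real
// implementation would scan this list and pick the first buffer type
// that can actually allocate the tensor in question.
enum class buft_kind { ACCEL, GPU_HOST, CPU_EXTRA, CPU };

static const std::vector<buft_kind> cpu_buft_order = {
    buft_kind::ACCEL,      // accelerator-specific buffers first
    buft_kind::GPU_HOST,   // pinned host memory usable by the GPU
    buft_kind::CPU_EXTRA,  // CPU repack/extra buffers
    buft_kind::CPU,        // plain CPU memory as the fallback
};
```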
2025-03-29 14:07:37 +01:00
Concedo
396875e1c4
update api docs and lite
2025-03-29 15:39:25 +08:00
Sigbjørn Skjæret
3714c3ee1a
llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631)
2025-03-28 22:13:02 +01:00
Georgi Gerganov
b4ae50810e
metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
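One bullet above restricts the op on other backends to equal K/V head sizes; a hedged sketch of what such a `supports_op`-style check can look like (in ggml, `ne[0]` of K and V is the head-size dimension; the function name is mine):

```cpp
#include "ggml.h"

// Hypothetical check: reject flash-attention ops whose K and V head sizes
// differ, for backends that only implement the equal-head-size case.
// For GGML_OP_FLASH_ATTN_EXT, src[0] = Q, src[1] = K, src[2] = V.
static bool fa_equal_head_sizes(const struct ggml_tensor * op) {
    const struct ggml_tensor * k = op->src[1];
    const struct ggml_tensor * v = op->src[2];
    return k->ne[0] == v->ne[0];
}
```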
2025-03-28 20:21:59 +02:00
Johannes Gäßler
dd373dd3bf
llama: fix error on bad grammar (#12628)
2025-03-28 18:08:52 +01:00
Si1w
f125b8dccf
llama : add PLM GGUF Conversion & Inference Support (#12457)
* add edgellm model arch [conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of model arch in convert gguf
* [Model] Refactor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig errors
* [Code] Remove Trailing whitespace
* [Code] Remove Trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 Lint errors
* Remove trailing white space
* [Code] Remove call in model arch
2025-03-27 12:49:15 +02:00
HighDoping
953c2a62cf
model : restore support for T5Encoder (#12590)
2025-03-27 11:43:33 +01:00
Georgi Gerganov
f28bc4c286
llama : make loras compatible with repacking (#12593)
* llama : make loras compatible with repacking
ggml-ci
* cont : simplify
ggml-ci
* cont : add TODO [no ci]
2025-03-27 08:24:10 +02:00
Concedo
ea358369cc
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# ci/README.md
# ci/run.sh
# docs/backend/CUDA-FEDORA.md
# docs/build.md
# docs/install.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/common.cuh
# tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Georgi Gerganov
2d77d88e70
context : fix worst-case reserve outputs (#12545)
ggml-ci
2025-03-25 09:19:23 +02:00
compilade
00d53800e0
llama-vocab : add SuperBPE pre-tokenizer (#12532)
2025-03-24 11:47:24 +01:00
Prajwal B Mehendarkar
c54f6b7988
mmap : skip resource limit checks on AIX (#12541)
2025-03-24 12:17:10 +02:00
Xuan-Son Nguyen
fbdfefe74e
llama : gemma3 : use output tensor if it exists in model weight (#12506)
* llama : gemma3 : use output tensor if it exists in model weight
* also add to the llm_tensor_names
2025-03-22 23:28:19 +01:00
Concedo
ae670dbe0e
no repacking for avx2 for kcpp because it breaks 4_0_4_4 quants
2025-03-22 00:33:27 +08:00
Concedo
7030ebf401
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# docs/backend/SYCL.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
Georgi Gerganov
af04481e6b
model : do not repack if a GPU device is present (#12498)
ggml-ci
2025-03-21 16:14:29 +02:00
Sigbjørn Skjæret
960e726077
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)
2025-03-21 10:21:36 +01:00
Sigbjørn Skjæret
dbb3a4739e
llama : make Qwen2MoE QKV bias optional (#12477)
2025-03-20 12:49:59 +01:00
fairydreaming
568013d0cd
context : clear sets containing encoder output sequence ids before storing new values (#12470)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-19 21:01:57 +01:00
Concedo
0c90d2ebcf
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# cmake/common.cmake
# docs/backend/SYCL.md
# examples/main/README.md
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Sigbjørn Skjæret
108e53c2f1
llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)
* Add support for GPT2, Bloom and CodeShell tied word embeddings
* Deduplicate tied word embeddings weights
* Workaround for incorrect weight map
It appears transformer.wte.weight is in the weight map even though the weights are not there; remove it if output weights are encountered first.
* check++
* fatfingers--
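The tied-embedding idea, sketched (hedged: an illustrative fragment, not the loader's actual structure): when a model ships no separate output head, the output projection reuses the token-embedding matrix:

```cpp
#include "ggml.h"

// Illustrative fragment only (the real loader uses llama.cpp's tensor-loading
// machinery): with tied word embeddings, the LM head is the same tensor as
// the input embedding, so a missing "output.weight" falls back to "tok_embd".
struct toy_model {
    struct ggml_tensor * tok_embd; // token embedding [n_embd, n_vocab]
    struct ggml_tensor * output;   // LM head, may be absent in the file
};

static void tie_output_if_missing(toy_model & m) {
    if (m.output == nullptr) {
        m.output = m.tok_embd; // tied: the logits projection reuses the embedding
    }
}
```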
2025-03-19 09:08:49 +01:00
Georgi Gerganov
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention (#12449)
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
Xuan-Son Nguyen
99aa304fb9
llama : add support for EXAONE tied word embeddings (#12451)
2025-03-18 17:24:33 +01:00