Concedo
bce519cee7
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# tests/test-backend-ops.cpp
2025-04-18 12:44:20 +08:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels ( #12953 )
...
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
2025-04-17 18:16:36 +03:00
Concedo
06159939d9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# Makefile
# docs/build.md
# examples/rpc/rpc-server.cpp
# examples/sycl/build.sh
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# scripts/sync-ggml.last
2025-04-17 00:52:37 +08:00
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation ( #12801 )
...
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
Concedo
a0ae187563
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# examples/llava/CMakeLists.txt
# examples/llava/clip.cpp
# examples/rpc/rpc-server.cpp
# examples/run/run.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-04-12 10:06:47 +08:00
Concedo
ea9bd61e47
Merge commit '64eda5deb9' into concedo_experimental
...
# Conflicts:
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# docs/backend/SYCL.md
# examples/llava/clip.cpp
# examples/server_embd.py
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# src/CMakeLists.txt
# tests/test-chat-template.cpp
2025-04-12 08:31:22 +08:00
Yuxuan Zhang
06bb53ad9b
llama-model : add Glm4Model implementation for GLM-4-0414 ( #12867 )
...
* GLM-4-0414
* use original one
* Using with tensor map
* fix bug
* change order
* change order
* format with flake8
2025-04-11 12:10:10 +02:00
Xuan-Son Nguyen
8b91d5355a
llama : correct rms norm for llama 4 ( #12882 )
2025-04-11 08:49:50 +02:00
Bo Zheng
d3bd7193ba
llama : Support Qwen3 and Qwen3MoE ( #12828 )
...
* add qwen3 & qwen3moe support.
* fix
---------
Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
2025-04-09 11:47:36 +02:00
Concedo
ebf924c5d1
Merge branch 'upstream' into concedo_experimental
2025-04-08 21:46:30 +08:00
Xuan-Son Nguyen
1466621e73
llama : Support llama 4 text-only ( #12791 )
...
* llama4 conversion
* initial support, no chat template
* clean up a bit
* fix tokenizer conversion
* correct hparams
* try this
* fix shexp
* ffn_inp_normed
* chat template
* clean up model conversion
* add_bos
* add scale_before_ffn
* fix order
* weight_before_ffn
* llm_graph_input_attn_temp
* add chunk attn mask
* build_inp_attn_scale()
* add comment about ggml_repeat
* clarify comments
* fix build
2025-04-07 23:06:44 +02:00
Concedo
103d60ed2c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/common.cpp
# examples/batched-bench/batched-bench.cpp
# examples/batched/batched.cpp
# examples/export-lora/export-lora.cpp
# examples/gritlm/gritlm.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-vulkan/CMakeLists.txt
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers ( #11397 )
...
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
Concedo
9e182b3e78
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/backend/SYCL.md
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# scripts/sync-ggml.last
# tests/test-chat-template.cpp
2025-04-01 20:16:07 +08:00
Sigbjørn Skjæret
2c3f8b850a
llama : support BailingMoE (Ling) ( #12634 )
2025-03-30 22:21:03 +02:00
Concedo
ce05aa722d
Merge commit '0bb2919335' into concedo_experimental
...
# Conflicts:
# ggml/src/CMakeLists.txt
# src/llama-model.cpp
2025-03-30 18:18:20 +08:00
Djip007
0bb2919335
llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU ( #12632 )
...
This allows using the GPU host buffer when possible instead of CPU repack.
It has the same effect of resolving the issue (#12498) without
completely disabling the CPU extra buffer.
Co-authored-by: philou <philou@framework>
2025-03-29 14:07:37 +01:00
Concedo
396875e1c4
update api docs and lite
2025-03-29 15:39:25 +08:00
Sigbjørn Skjæret
3714c3ee1a
llama : fix incorrect Qwen2Moe ffn_moe_out graph callback ( #12631 )
2025-03-28 22:13:02 +01:00
Si1w
f125b8dccf
llama : add PLM GGUF Conversion & Inference Support ( #12457 )
...
* add edgellm model arch[conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of model arch in convert gguf
* [Model] Refactor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig errors
* [Code] Remove Trailing whitespace
* [Code] Remove Trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 Lint errors
* Remove trailing white space
* [Code] Remove call in model arch
2025-03-27 12:49:15 +02:00
HighDoping
953c2a62cf
model : restore support for T5Encoder ( #12590 )
2025-03-27 11:43:33 +01:00
Concedo
ea358369cc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ci/README.md
# ci/run.sh
# docs/backend/CUDA-FEDORA.md
# docs/build.md
# docs/install.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/common.cuh
# tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Xuan-Son Nguyen
fbdfefe74e
llama : gemma3 : use output tensor if it exists in model weight ( #12506 )
...
* llama : gemma3 : use output tensor if it exists in model weight
* also add to the llm_tensor_names
2025-03-22 23:28:19 +01:00
Concedo
ae670dbe0e
no repacking for avx2 for kcpp because it breaks 4_0_4_4 quants
2025-03-22 00:33:27 +08:00
Concedo
7030ebf401
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/SYCL.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
Georgi Gerganov
af04481e6b
model : do not repack if a GPU device is present ( #12498 )
...
ggml-ci
2025-03-21 16:14:29 +02:00
Sigbjørn Skjæret
960e726077
chore : cleanup llama_model_loader::TENSOR_ usage ( #12492 )
2025-03-21 10:21:36 +01:00
Sigbjørn Skjæret
dbb3a4739e
llama : make Qwen2MoE QKV bias optional ( #12477 )
2025-03-20 12:49:59 +01:00
Concedo
0c90d2ebcf
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# cmake/common.cmake
# docs/backend/SYCL.md
# examples/main/README.md
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Sigbjørn Skjæret
108e53c2f1
llama : add support for GPT2, Bloom and CodeShell tied word embeddings ( #12456 )
...
* Add support for GPT2, Bloom and CodeShell tied word embeddings
* Deduplicate tied word embeddings weights
* Workaround for incorrect weight map
It appears transformer.wte.weight is in the weight map even though the weights are not there; remove it if output weights are encountered first.
* check++
* fatfingers--
2025-03-19 09:08:49 +01:00
Georgi Gerganov
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention ( #12449 )
...
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
Xuan-Son Nguyen
99aa304fb9
llama : add support for EXAONE tied word embeddings ( #12451 )
2025-03-18 17:24:33 +01:00
Molly Sophia
7dfad387e3
llama: Add support for RWKV v7 architecture ( #12412 )
...
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
Sigbjørn Skjæret
8ba95dca20
llama : fix OLMo-2-0325-32B-Instruct K-norm size ( #12400 )
2025-03-16 19:46:36 +02:00
Concedo
be3bba67ff
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# src/llama-model.cpp
2025-03-14 18:25:21 +08:00
Georgi Gerganov
c522ce4143
graph : simplify attn input build for unified KV cache ( #12381 )
...
ggml-ci
2025-03-14 10:47:44 +02:00
Georgi Gerganov
081bee8c64
hparams : add SWA rope parameters ( #12374 )
...
ggml-ci
2025-03-14 09:03:24 +02:00
Concedo
7dc72db9de
Merge branch 'upstream' into concedo_experimental
2025-03-14 11:58:53 +08:00
Concedo
0db4ae6237
traded my ink for a pen
2025-03-14 11:58:15 +08:00
Georgi Gerganov
84d5475541
llama : fix Gemma3 SWA KV cache shift ( #12373 )
...
* llama : fix Gemma3 SWA KV cache shift
ggml-ci
* hparams : add comment [no ci]
2025-03-13 19:08:07 +02:00
Georgi Gerganov
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context ( #12181 )
...
* llama : refactor llama_context, llama_kv_cache, llm_build_context
ggml-ci
* graph : don't mutate the KV cache during defrag
ggml-ci
* context : reduce virtuals + remove test function
ggml-ci
* context : move interface implementation to source file + factory
ggml-ci
* graph : move KV cache build functions to llama_context impl
ggml-ci
* graph : remove model reference from build_pooling
ggml-ci
* graph : remove llama_model reference
ggml-ci
* kv_cache : provide rope factors
ggml-ci
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
ggml-ci
* context : remove llama_context_i abstraction
ggml-ci
* context : clean-up
ggml-ci
* graph : clean-up
ggml-ci
* llama : remove redundant keywords (struct, enum)
ggml-ci
* model : adapt gemma3
ggml-ci
* graph : restore same attention ops as on master
ggml-ci
* llama : remove TODO + fix indent
ggml-ci
2025-03-13 12:35:44 +02:00
Concedo
77debb1b1b
gemma3 vision works, but is using more tokens than expected - may need resizing
2025-03-13 00:31:16 +08:00
Xuan-Son Nguyen
7841fc723e
llama : Add Gemma 3 support (+ experimental vision capability) ( #12343 )
...
* llama : Add Gemma 3 text-only support
* fix python coding style
* fix compile on ubuntu
* python: fix style
* fix ubuntu compile
* fix build on ubuntu (again)
* fix ubuntu build, finally
* clip : Experimental support for Gemma 3 vision (#12344)
* clip : Experimental support for Gemma 3 vision
* fix build
* PRId64
2025-03-12 09:30:24 +01:00
Concedo
6b7d2349a7
Rewrite history to fix bad vulkan shader commits without increasing repo size
...
added dpe colab (+8 squashed commit)
Squashed commit:
[b8362da4] updated lite
[ed6c037d] move nsigma into the regular sampler stack
[ac5f61c6] relative filepath fixed
[05fe96ab] export template
[ed0a5a3e] nix_example.md: refactor (#1401)
* nix_example.md: add override example
* nix_example.md: drop graphics example, already basic nixos knowledge
* nix_example.md: format
* nix_example.md: Vulkan is disabled on macOS
Disabled in: 1ccd253acc
* nix_examples.md: nixpkgs.config.cuda{Arches -> Capabilities}
Fixes: https://github.com/LostRuins/koboldcpp/issues/1367
[675c62f7] AutoGuess: Phi 4 (mini) (#1402)
[4bf56982] phrasing
[b8c0df04] Add Rep Pen to Top N Sigma sampler chain (#1397)
- place after nsigma and before xtc (+3 squashed commit)
Squashed commit:
[87c52b97] disable VMM from HIP
[ee8906f3] edit description
[e85c0e69] Remove Unnecessary Rep Counting (#1394)
* stop counting reps
* fix range-based initializer
* strike that - reverse it
2025-03-05 00:02:20 +08:00
Xuan-Son Nguyen
c43a3e7996
llama : add Phi-4-mini support (supersede #12099 ) ( #12108 )
...
* Added Phi-4-mini-instruct support
* Update regex per ngxson
* Change the vocab base to Xenova/gpt-4o
* fix conversion update script
* no need to check longrope
* minor style fix
* fix python style
---------
Co-authored-by: Nicholas Sparks <nisparks@microsoft.com>
2025-02-28 12:44:11 +01:00
Vitali Lovich
3e9a2860e9
llama : expose llama_model_n_head_kv in the API ( #11997 )
...
It's useful to be able to have this from the library layer as it's a key
parameter of the model (e.g. to figure out how much KV cache memory is
needed).
2025-02-25 11:29:33 +02:00
Concedo
159c47f0e6
Merge commit '335eb04a91' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CONTRIBUTING.md
# Makefile
# docs/build.md
# examples/llama.swiftui/llama.swiftui/UI/ContentView.swift
# examples/run/run.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
2025-02-24 11:55:14 +08:00
Georgi Gerganov
51f311e057
llama : skip loading unused tensors ( #12004 )
...
* llama : assign unknown/unused tensors to host buffer type
ggml-ci
* llama : skip unused tensors
ggml-ci
2025-02-21 18:33:18 +02:00
Concedo
3fa4843850
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/server/README.md
# src/llama-model.cpp
2025-02-08 22:57:18 +08:00
Concedo
a83f2d5fce
reduce some spamminess
2025-02-08 22:49:48 +08:00