Concedo
b8c1fc7c9e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/development/HOWTO-add-model.md
# ggml/src/ggml-sycl/rope.cpp
# tests/test-backend-ops.cpp
2025-07-09 19:25:28 +08:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe ( #14425 )
...
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
2025-07-08 11:24:06 +03:00
Concedo
ace537d44e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# examples/simple-chat/simple-chat.cpp
# src/llama-quant.cpp
# tools/run/run.cpp
# tools/server/README.md
2025-06-24 23:06:16 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Ed Addario
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) ( #13037 )
2025-06-22 23:16:26 +02:00
Concedo
4f2fcaa2ef
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ci/run.sh
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/im2col.cpp
# ggml/src/ggml-sycl/mmq.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# tests/test-backend-ops.cpp
2025-06-21 00:32:22 +08:00
Ruikai Peng
dd6e6d0b6a
vocab : prevent tokenizer overflow ( #14301 )
...
* vocab : prevent stack overflow in tokenize
* vocab : return error instead of aborting on oversized token count
* vocab : INT32_MIN from llama_tokenize on overflow
2025-06-20 07:13:06 -07:00
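For reference, a minimal sketch of the calling pattern this change implies: llama_tokenize() now reports an oversized token count through its return value (INT32_MIN, per the bullets above) instead of aborting. The buffer-retry convention (negative return = required size) and the exact signature are taken from the current llama.cpp C API and should be treated as assumptions if your headers differ.

```cpp
// Hedged sketch: defensive tokenization against oversized inputs.
// Assumes the vocab-based llama_tokenize() signature, where a negative
// return means the output buffer was too small and (per the commit above)
// INT32_MIN signals an unrepresentable token count instead of an abort.
#include <climits>
#include <cstdio>
#include <string>
#include <vector>
#include "llama.h"

static std::vector<llama_token> tokenize_checked(const llama_vocab * vocab, const std::string & text) {
    std::vector<llama_token> tokens(text.size() + 8); // rough initial guess

    int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special*/ true, /*parse_special*/ false);
    if (n == INT32_MIN) {
        fprintf(stderr, "input produces more tokens than can be represented\n");
        return {};
    }
    if (n < 0) {
        // buffer too small: -n is the required size, retry once
        tokens.resize((size_t) -n);
        n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                           tokens.data(), (int32_t) tokens.size(), true, false);
    }
    tokens.resize(n > 0 ? (size_t) n : 0);
    return tokens;
}
```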
Sigbjørn Skjæret
88fc854b4b
llama : improve sep token handling ( #14272 )
2025-06-20 14:04:09 +02:00
Concedo
4356a00f4a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ci/run.sh
# docs/function-calling.md
# examples/gritlm/gritlm.cpp
# ggml/CMakeLists.txt
# ggml/cmake/common.cmake
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# requirements/requirements-compare-llama-bench.txt
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Georgi Gerganov
89fea80d29
server : fix incorrect usage of llama_get_embeddings() ( #14225 )
...
* server : fix incorrect usage of llama_get_embeddings()
ggml-ci
* cont : fix the fix
ggml-ci
2025-06-16 22:33:27 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
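As a rough illustration of the embeddings path this series touches, a sketch of reading a pooled per-sequence embedding after a decode/encode call. It assumes the llama.cpp accessors llama_get_embeddings_seq() and llama_get_embeddings_ith() and that embeddings were enabled on the context; it is not a reproduction of the reworked internals.

```cpp
// Hedged sketch: reading embeddings after llama_decode()/llama_encode().
// Assumes embeddings were enabled on the context; when a pooling type is
// configured, a pooled per-sequence vector is available, otherwise we fall
// back to the last token's embedding.
#include <cstdio>
#include "llama.h"

static void print_embedding(llama_context * ctx, const llama_model * model, llama_seq_id seq) {
    const int n_embd = llama_model_n_embd(model);

    // pooled embedding for the whole sequence (mean/cls/last, depending on pooling)
    const float * emb = llama_get_embeddings_seq(ctx, seq);
    if (emb == nullptr) {
        emb = llama_get_embeddings_ith(ctx, -1); // last output token
    }
    if (emb == nullptr) {
        fprintf(stderr, "no embeddings available for seq %d\n", seq);
        return;
    }
    for (int i = 0; i < n_embd && i < 8; ++i) {
        printf("%s%.4f", i ? " " : "", emb[i]);
    }
    printf(" ...\n");
}
```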
Georgi Gerganov
b9912ac570
batch : auto-gen positions + verify multi-sequence input ( #14177 )
...
* batch : verify multi-sequence input batches
ggml-ci
* cont : auto-gen positions + verify multi-seq input
ggml-ci
* cont : first print debug info, then perform validation
ggml-ci
* cont : fix position auto-gen + add comments
ggml-ci
2025-06-15 09:18:37 +03:00
Concedo
4204f111f7
Merge commit '8f47e25f56' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-linux-cross.yml
# docs/backend/CANN.md
# examples/batched.swift/Sources/main.swift
# examples/embedding/embedding.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/simple-chat/simple-chat.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# tools/batched-bench/batched-bench.cpp
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
# tools/run/run.cpp
2025-06-13 22:05:03 +08:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
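A minimal sketch of the migration direction this deprecation implies, using the llama_memory_* entry points named in the commit (llama_memory_clear() with its optional data-clear flag). The sequence-removal call follows the usual llama.cpp naming pattern and should be treated as an assumption.

```cpp
// Hedged sketch: moving from the deprecated llama_kv_self_* calls to the
// llama_memory_* API referenced above. llama_memory_clear() and its data
// flag are named in the commit; llama_memory_seq_rm() is assumed from the
// usual naming pattern.
#include "llama.h"

static void reset_context_memory(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);

    // old: llama_kv_self_clear(ctx);
    // new: clear cell metadata; pass true to also wipe the data buffers
    llama_memory_clear(mem, /*data*/ false);
}

static void drop_sequence(llama_context * ctx, llama_seq_id seq) {
    llama_memory_t mem = llama_get_memory(ctx);

    // old: llama_kv_self_seq_rm(ctx, seq, -1, -1);
    llama_memory_seq_rm(mem, seq, /*p0*/ -1, /*p1*/ -1);
}
```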
Concedo
d33c88b1f4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# ci/run.sh
# examples/embedding/embedding.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# src/CMakeLists.txt
2025-06-06 17:56:51 +08:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Concedo
6ce85c54d6
not working correctly
2025-06-02 22:12:10 +08:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls ( #13921 )
...
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Concedo
8c701d7ded
Merge commit '72b090da2c' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/function-calling.md
# examples/embedding/embedding.cpp
# examples/retrieval/retrieval.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/Doxyfile
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# scripts/compare-commits.sh
# tests/test-chat.cpp
# tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7
Merge commit 'e121edc432' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# common/CMakeLists.txt
# docs/function-calling.md
# ggml/src/ggml-sycl/binbcast.cpp
# models/templates/README.md
# scripts/tool_bench.py
# src/llama-kv-cache.cpp
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/mtmd/clip.h
# tools/rpc/rpc-server.cpp
# tools/server/README.md
2025-05-28 00:20:45 +08:00
Georgi Gerganov
22229314fc
llama : clarify deprecation message ( #13794 )
2025-05-26 12:57:50 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Concedo
bd960a90a6
removed unnecessary function
2025-05-24 23:59:31 +08:00
Concedo
f97bbdde00
fix to allow all EOGs to trigger a stop, occam's glm4 fix
2025-05-24 22:55:11 +08:00
Concedo
22ef97d7d3
Merge commit 'ab86335760' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/retrieval/retrieval.cpp
# examples/simple-chat/simple-chat.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# requirements/requirements-convert_lora_to_gguf.txt
# tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Concedo
da7fd4aa57
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# README.md
# ci/README.md
# docs/docker.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65
swa full used unless ctx shift and fast forward disabled
2025-05-21 22:47:45 +08:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
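The series above mentions "a param to control SWA cache size". A sketch of how that surfaces at context-creation time, assuming the swa_full field of llama_context_params; the trade-off noted in the comment follows the later "n_swa + n_ubatch cells" change in this log.

```cpp
// Hedged sketch: choosing between a full-size SWA cache and the reduced
// one added in this series. The field name swa_full is taken from
// llama.cpp's llama_context_params; treat it as an assumption if your
// headers differ.
#include "llama.h"

static llama_context * make_ctx(llama_model * model, bool allow_partial_swa) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 8192;
    // when false, the SWA cache is sized to roughly n_swa + n_ubatch cells,
    // saving memory at the cost of cache reuse for out-of-window tokens
    cparams.swa_full = !allow_partial_swa;
    return llama_init_from_model(model, cparams);
}
```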
Concedo
12e6928ec2
i'm gonna regret this, aren't i?
2025-05-15 23:59:55 +08:00
Diego Devesa
cf0a43bb64
llama-bench : add defrag-thold, check for invalid ranges ( #13487 )
2025-05-13 00:31:37 +02:00
Concedo
21e31e255b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# common/CMakeLists.txt
# examples/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-metal/ggml-metal.m
# ggml/src/ggml-metal/ggml-metal.metal
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/backend.hpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/README.md
# tools/mtmd/clip.cpp
# tools/rpc/rpc-server.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)
2025-05-11 14:18:39 +02:00
Sigbjørn Skjæret
d2a4ef05c6
vocab : add ByteDance-Seed/Seed-Coder ( #13423 )
2025-05-10 22:08:07 +02:00
Concedo
2439014a03
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# examples/embedding/embedding.cpp
# tools/imatrix/imatrix.cpp
# tools/perplexity/perplexity.cpp
2025-05-08 23:41:02 +08:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings ( #13108 )
...
* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
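A sketch of the embeddings-style pass this change targets, using llama_encode() as the commit's "examples : use llama_encode() when appropriate" bullet suggests. Whether the KV cache is actually omitted is decided inside llama.cpp based on the model; this code only shows the caller side, and the batch-field usage is the standard llama_batch layout, not something specific to this commit.

```cpp
// Hedged sketch: an embeddings pass via llama_encode(), which the commit
// above prefers for models that do not need a KV cache (e.g. encoder-only
// / nomic-bert-moe style models).
#include <vector>
#include "llama.h"

static bool embed_tokens(llama_context * ctx, const std::vector<llama_token> & toks) {
    llama_batch batch = llama_batch_init((int32_t) toks.size(), /*embd*/ 0, /*n_seq_max*/ 1);
    for (size_t i = 0; i < toks.size(); ++i) {
        batch.token   [i]    = toks[i];
        batch.pos     [i]    = (llama_pos) i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i + 1 == toks.size()); // request output on last token only
    }
    batch.n_tokens = (int32_t) toks.size();

    const int ret = llama_encode(ctx, batch);   // encode() instead of decode()
    llama_batch_free(batch);
    return ret == 0;
}
```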
Georgi Gerganov
51fb96b1ff
context : remove logits_all flag ( #13284 )
...
* context : remove logits_all flag
ggml-ci
* llama : remove logits_all flag + reorder llama_context_params
ggml-ci
2025-05-08 14:26:50 +03:00
Concedo
8273739412
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/cpu.Dockerfile
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/vulkan.Dockerfile
# examples/llama-bench/llama-bench.cpp
# examples/rpc/rpc-server.cpp
# scripts/compare-llama-bench.py
# tests/test-quantize-stats.cpp
2025-04-30 17:22:18 +08:00
Georgi Gerganov
d9d398f84f
sampling : when top-k <= 0 -> noop ( #13173 )
...
ggml-ci
2025-04-29 20:22:57 +03:00
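A short note on what this sampler change means in practice: with the chain-based sampler API, a top-k sampler constructed with k <= 0 now passes candidates through untouched instead of clamping to the vocabulary size. The sketch below assumes the llama_sampler_chain_* helpers and llama_sampler_init_top_k()/llama_sampler_init_dist() from the current API.

```cpp
// Hedged sketch: after this change, top_k <= 0 makes the top-k stage a
// no-op filter, so the chain below samples from the full candidate list
// rather than applying an implicit "top-k = vocab size" clamp.
#include "llama.h"

static llama_sampler * make_chain(int32_t top_k, uint32_t seed) {
    llama_sampler_chain_params sp = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(sp);

    llama_sampler_chain_add(chain, llama_sampler_init_top_k(top_k)); // k <= 0 -> noop
    llama_sampler_chain_add(chain, llama_sampler_init_dist(seed));   // final pick

    return chain; // caller frees with llama_sampler_free()
}
```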
Concedo
28a2723100
merged pixtral support, not fully working
2025-04-24 15:27:02 +08:00
Xuan-Son Nguyen
ecda2ec4b3
mtmd : Support Pixtral 12B ( #13065 )
...
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
2025-04-23 20:21:59 +02:00
Concedo
06159939d9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# Makefile
# docs/build.md
# examples/rpc/rpc-server.cpp
# examples/sycl/build.sh
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# scripts/sync-ggml.last
2025-04-17 00:52:37 +08:00
Ed Addario
71e90e8813
quantize: Handle user-defined quantization levels for additional tensors ( #12511 )
...
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' coding guidelines
* Update descriptions to match existing style
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' guidelines
* Implement general --tensor-type instead of tensor-specific command option
* Fix implied type bug
* Restore missing #includes
* Add regex capability for tensor selection
* Refactor function name and update ALLOWED_TENSOR_TYPE
* Add missing #include
* Handle edge case when tensor name is cls.output
* Minor logging improvement
2025-04-13 21:29:28 +03:00
Concedo
ebf924c5d1
Merge branch 'upstream' into concedo_experimental
2025-04-08 21:46:30 +08:00
Xuan-Son Nguyen
1466621e73
llama : Support llama 4 text-only ( #12791 )
...
* llama4 conversion
* initial support, no chat template
* clean up a bit
* fix tokenizer conversion
* correct hparams
* try this
* fix shexp
* ffn_inp_normed
* chat template
* clean up model conversion
* add_bos
* add scale_before_ffn
* fix order
* weight_before_ffn
* llm_graph_input_attn_temp
* add chunk attn mask
* build_inp_attn_scale()
* add comment about ggml_repeat
* clarify comments
* fix build
2025-04-07 23:06:44 +02:00