Concedo
0fcfbdb93c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# .github/workflows/close-issue.yml
# ci/README.md
# docs/build.md
# docs/docker.md
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/fattn-wmma-f16.cu
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
2025-07-25 19:53:13 +08:00
Georgi Gerganov
e4868d16d2
context : perform output reorder lazily upon access after sync ( #14853 )
...
* context : perform output reorder after lazily upon access after sync
ggml-ci
* cont : add TODO
2025-07-24 16:31:48 +03:00
Concedo
b8e3280432
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs ( #14482 )
...
* llama : reuse compute graphs
ggml-ci
* llama-bench : add graph reuse parameter
ggml-ci
* cont : remove the parameter and the sched resets
ggml-ci
* graph : rename update() to can_reuse()
ggml-ci
* params : remove is_same()
ggml-ci
* graph : set res->params in llm_graph_context constructor
ggml-ci
* graph : avoid set_max_nodes in llm_graph_result
ggml-ci
* kv-cache : reuse llama_context's graph result instance
ggml-ci
* context : reset the previous graph result upon memory updates
ggml-ci
* batch : llama_ubatch now carries its data instead of pointing to balloc
ggml-ci
* merge : fix build
ggml-ci
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
ggml-ci
2025-07-17 19:08:33 +03:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Aman Gupta
ab14019821
Support diffusion models: Add Dream 7B ( #14644 )
...
* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formating, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
2025-07-16 20:03:51 +08:00
Concedo
cbe9fc87c5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# src/llama-vocab.cpp
2025-07-16 12:03:54 +08:00
Min-Hua
79e0b68c17
llama: add LLAMA_API to deprecated llama_kv_self_seq_div ( #14708 )
...
Add LLAMA_API to fix the run-time error with llama-cpp-python in Windows env:
attributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?
Although llama_kv_self_seq_div() has been marked deprecated but
it is necessary to export it to make llama-cpp-python happy.
Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
2025-07-16 07:00:42 +03:00
Shunta Saito
68e37a61a7
model : add PLaMo-2 support ( #14560 )
...
* Add PLaMo-2 model using hybrid memory module
* Fix z shape
* Add cmath to include from llama-vocab.h
* Explicitly dequantize normalization weights before RoPE apply
* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing
* Use ATTN_K/Q_NORM for k,q weights to prevent quantization
* Remove SSM_BCDT that is not used from anywhere
* Do not duplicate embedding weights for output.weight
* Fix tokenizer encoding problem for multibyte strings
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up
* Remove unnecessary part for Grouped Query Attention
* Fix how to load special token id to gguf
* Remove unused tensor mapping
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix plamo2 tokenizer session to prevent multiple calls of build()
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-15 18:11:42 +02:00
Concedo
8cebec5128
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakePresets.json
# README.md
# common/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/run/CMakeLists.txt
2025-07-13 23:39:41 +08:00
Georgi Gerganov
0d5375d54b
llama : move enum llama_vocab_pre_type to implementation ( #14631 )
...
ggml-ci
2025-07-11 13:46:07 +03:00
Concedo
b8c1fc7c9e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/development/HOWTO-add-model.md
# ggml/src/ggml-sycl/rope.cpp
# tests/test-backend-ops.cpp
2025-07-09 19:25:28 +08:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe ( #14425 )
...
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
2025-07-08 11:24:06 +03:00
Concedo
ace537d44e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# examples/simple-chat/simple-chat.cpp
# src/llama-quant.cpp
# tools/run/run.cpp
# tools/server/README.md
2025-06-24 23:06:16 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Ed Addario
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) ( #13037 )
2025-06-22 23:16:26 +02:00
Concedo
4f2fcaa2ef
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ci/run.sh
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/im2col.cpp
# ggml/src/ggml-sycl/mmq.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# tests/test-backend-ops.cpp
2025-06-21 00:32:22 +08:00
Ruikai Peng
dd6e6d0b6a
vocab : prevent tokenizer overflow ( #14301 )
...
* vocab : prevent stack overflow in tokenize
* vocab : return error instead of aborting on oversized token count
* vocab : INT32_MIN from llama_tokenize on overflow
2025-06-20 07:13:06 -07:00
Sigbjørn Skjæret
88fc854b4b
llama : improve sep token handling ( #14272 )
2025-06-20 14:04:09 +02:00
Concedo
4356a00f4a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ci/run.sh
# docs/function-calling.md
# examples/gritlm/gritlm.cpp
# ggml/CMakeLists.txt
# ggml/cmake/common.cmake
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# requirements/requirements-compare-llama-bench.txt
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Georgi Gerganov
89fea80d29
server : fix incorrect usage of llama_get_embeddings() ( #14225 )
...
* server : fix incorrect usage of llama_get_embeddings()
ggml-ci
* cont : fix the fix
ggml-ci
2025-06-16 22:33:27 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
Georgi Gerganov
b9912ac570
batch : auto-gen positions + verify multi-sequence input ( #14177 )
...
* batch : verify multi-sequence input batches
ggml-ci
* cont : auto-gen positions + verify multi-seq input
ggml-ci
* cont : first print debug info, then perform validation
ggml-ci
* cont : fix position auto-gen + add comments
ggml-ci
2025-06-15 09:18:37 +03:00
Concedo
4204f111f7
Merge commit ' 8f47e25f56
' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-linux-cross.yml
# docs/backend/CANN.md
# examples/batched.swift/Sources/main.swift
# examples/embedding/embedding.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/simple-chat/simple-chat.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# tools/batched-bench/batched-bench.cpp
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
# tools/run/run.cpp
2025-06-13 22:05:03 +08:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
Concedo
d33c88b1f4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# ci/run.sh
# examples/embedding/embedding.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# src/CMakeLists.txt
2025-06-06 17:56:51 +08:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Concedo
6ce85c54d6
not working correctly
2025-06-02 22:12:10 +08:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls ( #13921 )
...
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Concedo
8c701d7ded
Merge commit ' 72b090da2c
' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/function-calling.md
# examples/embedding/embedding.cpp
# examples/retrieval/retrieval.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/Doxyfile
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# scripts/compare-commits.sh
# tests/test-chat.cpp
# tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7
Merge commit ' e121edc432
' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# common/CMakeLists.txt
# docs/function-calling.md
# ggml/src/ggml-sycl/binbcast.cpp
# models/templates/README.md
# scripts/tool_bench.py
# src/llama-kv-cache.cpp
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/mtmd/clip.h
# tools/rpc/rpc-server.cpp
# tools/server/README.md
2025-05-28 00:20:45 +08:00
Georgi Gerganov
22229314fc
llama : clarify deprecation message ( #13794 )
2025-05-26 12:57:50 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Concedo
bd960a90a6
removed unnecessary function
2025-05-24 23:59:31 +08:00
Concedo
f97bbdde00
fix to allow all EOGs to trigger a stop, occam's glm4 fix,
2025-05-24 22:55:11 +08:00
Concedo
22ef97d7d3
Merge commit ' ab86335760
' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/retrieval/retrieval.cpp
# examples/simple-chat/simple-chat.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# requirements/requirements-convert_lora_to_gguf.txt
# tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Concedo
da7fd4aa57
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# README.md
# ci/README.md
# docs/docker.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65
swa full used unless ctx shift and fast forward disabled
2025-05-21 22:47:45 +08:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Concedo
12e6928ec2
i'm gonna regret this, aren't i?
2025-05-15 23:59:55 +08:00
Diego Devesa
cf0a43bb64
llama-bench : add defrag-thold, check for invalid ranges ( #13487 )
2025-05-13 00:31:37 +02:00
Concedo
21e31e255b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# common/CMakeLists.txt
# examples/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-metal/ggml-metal.m
# ggml/src/ggml-metal/ggml-metal.metal
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/backend.hpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/README.md
# tools/mtmd/clip.cpp
# tools/rpc/rpc-server.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang
7f323a589f
Add --no-op-offload
to improve -ot
pp perf in MoE models like llama4 400B ( #13386 )
2025-05-11 14:18:39 +02:00