Concedo
b0b7a07b34
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/parallel/parallel.cpp
2025-07-18 23:49:45 +08:00
Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models ( #14753 )
...
* graph : avoid huge warm-up graphs for MoE models
ggml-ci
* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
Concedo
b8e3280432
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly ( #14629 )
...
ggml-ci
2025-07-18 08:29:28 +03:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs ( #14482 )
...
* llama : reuse compute graphs
ggml-ci
* llama-bench : add graph reuse parameter
ggml-ci
* cont : remove the parameter and the sched resets
ggml-ci
* graph : rename update() to can_reuse()
ggml-ci
* params : remove is_same()
ggml-ci
* graph : set res->params in llm_graph_context constructor
ggml-ci
* graph : avoid set_max_nodes in llm_graph_result
ggml-ci
* kv-cache : reuse llama_context's graph result instance
ggml-ci
* context : reset the previous graph result upon memory updates
ggml-ci
* batch : llama_ubatch now carries its data instead of pointing to balloc
ggml-ci
* merge : fix build
ggml-ci
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
ggml-ci
2025-07-17 19:08:33 +03:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Concedo
ce7aa0d5c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-all.txt
2025-07-15 23:59:53 +08:00
Aman Gupta
9c9e4fc635
llama-context: add ability to get logits ( #14672 )
2025-07-14 21:01:41 +08:00
Concedo
ace537d44e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# examples/simple-chat/simple-chat.cpp
# src/llama-quant.cpp
# tools/run/run.cpp
# tools/server/README.md
2025-06-24 23:06:16 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Concedo
fb13e3e51b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# src/llama-context.cpp
# tests/test-backend-ops.cpp
2025-06-22 23:26:15 +08:00
Georgi Gerganov
692e3cdd0a
memory : rename interface to llama_memory_context_i ( #14296 )
...
* memory : rename interface to llama_memory_context_i
ggml-ci
* cont : fix comments
* cont : use "mctx" for referencing a memory context
ggml-ci
2025-06-21 08:03:46 +03:00
Concedo
c16d672ce4
Merge commit ' 9230dbe2c7' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cpu/CMakeLists.txt
# src/llama-graph.cpp
# tools/server/README.md
2025-06-21 00:01:29 +08:00
Georgi Gerganov
4c9fdfbe15
ubatch : new splitting logic ( #14217 )
...
ggml-ci
2025-06-20 10:14:14 +03:00
Concedo
4356a00f4a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ci/run.sh
# docs/function-calling.md
# examples/gritlm/gritlm.cpp
# ggml/CMakeLists.txt
# ggml/cmake/common.cmake
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# requirements/requirements-compare-llama-bench.txt
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
Georgi Gerganov
c311ac664d
cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ ( #14188 )
...
ggml-ci
2025-06-15 10:08:58 +03:00
Georgi Gerganov
b9912ac570
batch : auto-gen positions + verify multi-sequence input ( #14177 )
...
* batch : verify multi-sequence input batches
ggml-ci
* cont : auto-gen positions + verify multi-seq input
ggml-ci
* cont : first print debug info, then perform validation
ggml-ci
* cont : fix position auto-gen + add comments
ggml-ci
2025-06-15 09:18:37 +03:00
Concedo
5f9e96e82d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# CMakeLists.txt
# README.md
# common/CMakeLists.txt
# docs/multimodal.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# src/llama-context.cpp
2025-06-14 09:05:45 +08:00
Concedo
4204f111f7
Merge commit ' 8f47e25f56' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-linux-cross.yml
# docs/backend/CANN.md
# examples/batched.swift/Sources/main.swift
# examples/embedding/embedding.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/simple-chat/simple-chat.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# tools/batched-bench/batched-bench.cpp
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
# tools/run/run.cpp
2025-06-13 22:05:03 +08:00
Georgi Gerganov
60c666347b
batch : rework llama_batch_allocr ( #14153 )
...
* batch : rework llama_batch_allocr
ggml-ci
* cont : move validation inside class
ggml-ci
* cont : move output counting to class
ggml-ci
* cont : minor
ggml-ci
* batch : add TODOs
ggml-ci
2025-06-13 13:47:55 +03:00
Georgi Gerganov
f6e1a7aa87
context : simplify output counting logic during decode ( #14142 )
...
* batch : remove logits_all flag
ggml-ci
* context : simplify output counting logic during decode
ggml-ci
* cont : fix comments
2025-06-12 11:50:01 +03:00
Georgi Gerganov
c3ee46fab4
batch : remove logits_all flag ( #14141 )
...
ggml-ci
2025-06-12 11:49:26 +03:00
Georgi Gerganov
9596506965
kv-cache : fix split_equal handling in unified implementation ( #14130 )
...
ggml-ci
2025-06-12 10:02:15 +03:00
compilade
a20b2b05bc
context : round n_tokens to next multiple of n_seqs when reserving ( #14140 )
...
This fixes RWKV inference which otherwise failed
when the worst case ubatch.n_seq_tokens rounded to 0.
2025-06-12 02:56:04 -04:00
Concedo
7d8aa31f1f
fixed embeddings, added new parameter to limit max embeddings context
2025-06-10 01:11:55 +08:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov
487a5e0401
context : fix SWA-related warning for multiple sequences ( #14045 )
2025-06-06 13:29:18 +03:00
Concedo
d33c88b1f4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# ci/run.sh
# examples/embedding/embedding.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# src/CMakeLists.txt
2025-06-06 17:56:51 +08:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Georgi Gerganov
9e31bec4fd
context : fix pos_min initialization upon error decode ( #14008 )
...
ggml-ci
2025-06-05 09:06:29 +03:00
Concedo
bc89b465a8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# .github/workflows/server.yml
# README.md
# docs/build.md
# docs/install.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-06-05 11:03:34 +08:00
Georgi Gerganov
3e63a58ef7
kv-cache : refactor the update/defrag mechanism ( #13988 )
...
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
2025-06-04 18:58:20 +03:00
Concedo
6ce85c54d6
not working correctly
2025-06-02 22:12:10 +08:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls ( #13921 )
...
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation ( #13845 )
...
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Concedo
8c701d7ded
Merge commit ' 72b090da2c' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/function-calling.md
# examples/embedding/embedding.cpp
# examples/retrieval/retrieval.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/Doxyfile
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# scripts/compare-commits.sh
# tests/test-chat.cpp
# tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7
Merge commit ' e121edc432' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# common/CMakeLists.txt
# docs/function-calling.md
# ggml/src/ggml-sycl/binbcast.cpp
# models/templates/README.md
# scripts/tool_bench.py
# src/llama-kv-cache.cpp
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/mtmd/clip.h
# tools/rpc/rpc-server.cpp
# tools/server/README.md
2025-05-28 00:20:45 +08:00
Georgi Gerganov
4f81b33e32
llama : validate seq id batch input ( #13809 )
...
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
2025-05-27 09:40:59 +03:00
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
...
ggml-ci
2025-05-26 14:03:54 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Concedo
22ef97d7d3
Merge commit ' ab86335760' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/retrieval/retrieval.cpp
# examples/simple-chat/simple-chat.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# requirements/requirements-convert_lora_to_gguf.txt
# tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Concedo
da7fd4aa57
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# README.md
# ci/README.md
# docs/docker.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65
swa full used unless ctx shift and fast forward disabled
2025-05-21 22:47:45 +08:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
2025-05-20 16:13:16 +03:00