Commit graph

173 commits

Author SHA1 Message Date
Concedo
7ac0102ed3 hope I didn't break anything 2025-08-14 21:42:24 +08:00
Georgi Gerganov
d32e03f449
server : add SWA checkpoints (#15293)
* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API (see the sketch after this commit)

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
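
[Editor's note] A minimal sketch of what a per-sequence checkpoint in the commit above amounts to, using the long-standing llama_state_seq_get_size / get_data / set_data calls from llama.h; the extended (_ext/flags) variants added by this PR and the server's checkpoint policy are not shown, and the struct/function names here are made up for illustration.

```cpp
// Sketch: save and restore a per-sequence state checkpoint with the base
// llama_state_seq_* API (the _ext/flags variants added by this PR are omitted).
#include "llama.h"
#include <cstdint>
#include <vector>

struct seq_checkpoint {
    llama_seq_id         seq;
    std::vector<uint8_t> data; // serialized state for this sequence
};

// take a checkpoint of sequence `seq` from context `ctx`
static seq_checkpoint checkpoint_save(llama_context * ctx, llama_seq_id seq) {
    seq_checkpoint cp;
    cp.seq = seq;
    cp.data.resize(llama_state_seq_get_size(ctx, seq));
    llama_state_seq_get_data(ctx, cp.data.data(), cp.data.size(), seq);
    return cp;
}

// restore it later, e.g. instead of fully re-processing the prompt
static bool checkpoint_restore(llama_context * ctx, const seq_checkpoint & cp) {
    const size_t n = llama_state_seq_set_data(ctx, cp.data.data(), cp.data.size(), cp.seq);
    return n > 0; // 0 means the restore failed and no state was applied
}
```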
Jonathan Graehl
5cdb27e091
finetune: SGD optimizer, more CLI args (#13873)
* examples/finetune: -opt SGD (stochastic gradient descent) memory optimization

add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the
m, v moment tensors.

support the finetune.cpp arg -opt SGD (or sgd); the default remains adamw as before.

llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch)
when using SGD instead of 19 GB (55 sec/epoch) using adamw
(wikipedia 100-line finetune)

(
with the same GPU memory, adamw can only reach 512 batch/context
before OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior here, though it converges more slowly: its max before OOM is
1728 batch/context (note especially the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or with a large enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

the -lr-half (half-life) option is useful for SGD to avoid oscillation or
very slow underdamped learning (it makes setting -lr more forgiving).
the terminal -lr is for now set by -lr-halvings, i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3 (see the sketch after this commit)

note: objective loss is not directly comparable between adamw and sgd -
check perplexity or accuracy, or consider relative improvements,
when judging convergence

new finetune args: -wd 1e-9 to enable weight decay in sgd or adamw,
and -epochs N to cap the number of epochs (default 2 as before)

caching (1 - wd*alpha) in the 'adamw' opt struct showed
no noticeable perf benefit, so it is disabled (still done
for the new SGD, though)

since optimizer memory is pre-allocated, ggml_opt_get_optimizer_params
could probably switch between SGD and AdamW with each epoch,
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few tests
disabled for sgd only; these probably just need logged values and
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param stores weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
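
[Editor's note] For reference, a small sketch of the two behaviors described in the commit above: the plain SGD step with weight decay (no per-parameter m/v state, which is where the memory saving over AdamW comes from) and a half-life learning-rate decay clamped by the number of halvings. This only illustrates the commit message; it is not the ggml/finetune.cpp implementation, and the function names are made up.

```cpp
// Illustrative sketch, not the ggml code.
#include <cmath>
#include <cstddef>

// SGD with weight decay: w -= alpha * (g + wd * w). No per-parameter m/v
// moment tensors are needed, unlike AdamW -- hence the lower GPU memory use.
static void sgd_step(float * w, const float * g, size_t n, float alpha, float wd) {
    for (size_t i = 0; i < n; ++i) {
        w[i] -= alpha * (g[i] + wd * w[i]);
    }
}

// Half-life decay of the learning rate: lr is halved every lr_half epochs,
// but never drops below lr0 / 2^lr_halvings
// (e.g. -lr-halvings 3 -> terminal lr is at most 1/8 of the initial -lr).
static float lr_at_epoch(float lr0, float epoch, float lr_half, int lr_halvings) {
    const float lr_min = lr0 / std::pow(2.0f, (float) lr_halvings);
    const float lr     = lr0 * std::pow(0.5f, epoch / lr_half);
    return lr < lr_min ? lr_min : lr;
}
```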
Concedo
6eea7b88d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
2025-08-06 10:51:29 +08:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4 (see the sketch after this commit)

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
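
[Editor's note] For orientation, a hedged sketch of what MXFP4 decoding involves, based on the OCP microscaling format (blocks of 32 FP4/E2M1 values sharing one E8M0 scale) rather than on ggml's actual block layout or kernels; the table, struct, and nibble ordering below are assumptions for illustration only.

```cpp
// Illustrative MXFP4 decode: 32 E2M1 (4-bit) values share one E8M0 scale.
// Layout and names are assumptions; ggml's real block struct may differ.
#include <cmath>
#include <cstdint>

// the 16 representable E2M1 values (sign in the high bit of the nibble)
static const float kvalues_e2m1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

struct block_mxfp4_example {
    uint8_t e;      // E8M0 shared scale: value is 2^(e - 127), no mantissa
    uint8_t qs[16]; // 32 packed 4-bit elements
};

static void dequant_mxfp4_example(const block_mxfp4_example & b, float * out /* 32 floats */) {
    // e8m0 conversion: a pure power of two, which is why powf can be replaced
    // by direct exponent manipulation in optimized code
    const float d = std::ldexp(1.0f, (int) b.e - 127);
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = d * kvalues_e2m1[b.qs[i] & 0x0F];
        out[2*i + 1] = d * kvalues_e2m1[b.qs[i] >> 4];
    }
}
```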
Concedo
f430916a71 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/CANN.md
#	docs/multimodal/minicpmo2.6.md
#	docs/multimodal/minicpmv2.5.md
#	docs/multimodal/minicpmv2.6.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/cmake/ggml-config.cmake.in
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/repack.cpp
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/add.cl
#	ggml/src/ggml-opencl/kernels/mul.cl
#	scripts/compare-commits.sh
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml.last
#	tools/server/README.md
2025-08-02 10:25:10 +08:00
Diego Devesa
d6818d06a6
llama : allow other bufts when overriding to CPU, add --no-repack option (#14990) 2025-07-31 18:11:34 +02:00
Aman Gupta
8a4a856277
Add LLaDA 8b Diffusion model (#14771)
* Add support for Llada-8b: diffusion model

* Add README

* Fix README and convert_hf_to_gguf

* convert_hf_to_gguf.py: address review comments

* Make everything in a single example

* Remove model-specific sampling

* Remove unused argmax

* Remove braced initializers, improve README.md a bit

* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps

* Remove adding the mask token

* Move add_add_bos_token to set_vocab

* use add_bool in gguf_writer.py
2025-07-31 19:49:09 +08:00
Concedo
0fcfbdb93c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/musa.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/close-issue.yml
#	ci/README.md
#	docs/build.md
#	docs/docker.md
#	ggml/CMakeLists.txt
#	ggml/cmake/ggml-config.cmake.in
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn-wmma-f16.cu
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tools/imatrix/README.md
#	tools/imatrix/imatrix.cpp
2025-07-25 19:53:13 +08:00
Georgi Gerganov
e4868d16d2
context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder lazily upon access after sync

ggml-ci

* cont : add TODO
2025-07-24 16:31:48 +03:00
Concedo
b8e3280432 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
2025-07-17 19:08:33 +03:00
Concedo
bdff33e0de Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	ci/run.sh
#	docs/build.md
#	examples/CMakeLists.txt
#	examples/parallel/parallel.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	scripts/server-bench.py
#	src/llama-kv-cache-unified.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified (see the sketch after this commit)

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
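
[Editor's note] A hedged sketch of how the high-throughput mode above is meant to be used from the C API: one context serving many sequences, with the KV cache either kept in a single unified buffer or split into per-sequence streams. The field names (n_seq_max, kv_unified) follow the rename described in the commit message and are assumptions; consult llama.h for the exact parameters.

```cpp
// Sketch: configure a context for many parallel sequences ("streams").
// Field names are assumptions based on the commit message; check llama.h.
#include "llama.h"

static llama_context * make_multi_seq_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 8192;
    cparams.n_seq_max  = 32;    // number of parallel sequences
    cparams.kv_unified = false; // false: one KV stream per sequence (high-throughput path)
                                // true : single unified KV buffer shared by all sequences
    return llama_init_from_model(model, cparams);
}
```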
Aman Gupta
ab14019821
Support diffusion models: Add Dream 7B (#14644)
* Support diffusion models: Add Dream 7B

* Move diffusion to examples

* Move stuff to examples. Add patch to not use kv-cache

* Address review comments

* Make sampling fast

* llama: remove diffusion functions

* Add basic timings + cleanup

* More cleanup

* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length

* fixup!

* Review: move everything to diffusion-cli for now
2025-07-16 20:03:51 +08:00
Concedo
cbe9fc87c5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	src/llama-vocab.cpp
2025-07-16 12:03:54 +08:00
Min-Hua
79e0b68c17
llama: add LLAMA_API to deprecated llama_kv_self_seq_div (#14708)
Add LLAMA_API to fix the run-time error with llama-cpp-python in a Windows env:
AttributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?

Although llama_kv_self_seq_div() has been marked deprecated, it is still
necessary to export it to keep llama-cpp-python happy (see the sketch after this commit).

Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
2025-07-16 07:00:42 +03:00
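
[Editor's note] The fix above boils down to keeping the export macro on the deprecated declaration, so the symbol still lands in the Windows DLL's export table where ctypes-based bindings such as llama-cpp-python resolve it by name. Below is a simplified, self-contained sketch of that declaration pattern; the macro spellings and the hint string approximate the usual scheme and are not copied verbatim from llama.h.

```cpp
// Simplified sketch of an exported-but-deprecated C API declaration.
// Macro definitions approximate the common pattern, not verbatim llama.h.
#include <cstdint>

#ifdef _WIN32
#    define LLAMA_API __declspec(dllexport)                   // keeps the symbol in the DLL
#else                                                         // export table, where ctypes
#    define LLAMA_API __attribute__((visibility("default")))  // lookups find it
#endif

#if defined(_MSC_VER)
#    define DEPRECATED(decl, hint) __declspec(deprecated(hint)) decl
#else
#    define DEPRECATED(decl, hint) decl __attribute__((deprecated(hint)))
#endif

struct llama_context;                    // opaque stand-in for the real type
typedef std::int32_t llama_seq_id;       // stand-ins for the real typedefs
typedef std::int32_t llama_pos;

// deprecated, but still exported so existing bindings keep resolving the symbol:
LLAMA_API DEPRECATED(void llama_kv_self_seq_div(
        struct llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d),
        "use the llama_memory_* API instead");
```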
Shunta Saito
68e37a61a7
model : add PLaMo-2 support (#14560)
* Add PLaMo-2 model using hybrid memory module

* Fix z shape

* Add cmath to include from llama-vocab.h

* Explicitly dequantize normalization weights before RoPE apply

* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing

* Use ATTN_K/Q_NORM for k,q weights to prevent quantization

* Remove SSM_BCDT that is not used from anywhere

* Do not duplicate embedding weights for output.weight

* Fix tokenizer encoding problem for multibyte strings

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up

* Remove unnecessary part for Grouped Query Attention

* Fix how to load special token id to gguf

* Remove unused tensor mapping

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix plamo2 tokenizer session to prevent multiple calls of build()

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-15 18:11:42 +02:00
Concedo
8cebec5128 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakePresets.json
#	README.md
#	common/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tools/run/CMakeLists.txt
2025-07-13 23:39:41 +08:00
Georgi Gerganov
0d5375d54b
llama : move enum llama_vocab_pre_type to implementation (#14631)
ggml-ci
2025-07-11 13:46:07 +03:00
Concedo
b8c1fc7c9e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/development/HOWTO-add-model.md
#	ggml/src/ggml-sycl/rope.cpp
#	tests/test-backend-ops.cpp
2025-07-09 19:25:28 +08:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe (#14425)
* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
2025-07-08 11:24:06 +03:00
Concedo
ace537d44e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	examples/simple-chat/simple-chat.cpp
#	src/llama-quant.cpp
#	tools/run/run.cpp
#	tools/server/README.md
2025-06-24 23:06:16 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00
Ed Addario
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037) 2025-06-22 23:16:26 +02:00
Concedo
4f2fcaa2ef Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/run.sh
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/repack.cpp
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/concat.cpp
#	ggml/src/ggml-sycl/conv.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/gla.cpp
#	ggml/src/ggml-sycl/im2col.cpp
#	ggml/src/ggml-sycl/mmq.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	tests/test-backend-ops.cpp
2025-06-21 00:32:22 +08:00
Ruikai Peng
dd6e6d0b6a
vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow (see the sketch after this commit)
2025-06-20 07:13:06 -07:00
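
[Editor's note] A hedged usage sketch of the error path described above: llama_tokenize returns the negated required token count when the output buffer is too small, and after this change an oversized token count is reported as INT32_MIN instead of aborting. The signature is taken from recent llama.h and may differ slightly between versions.

```cpp
// Sketch: tokenize with buffer-size handling and the overflow error path.
#include "llama.h"
#include <cstdint>
#include <string>
#include <vector>

static bool tokenize_text(const llama_vocab * vocab, const std::string & text,
                          std::vector<llama_token> & out) {
    out.resize(text.size() + 2); // rough upper bound; grown below if too small
    int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                               out.data(), (int32_t) out.size(),
                               /*add_special*/ true, /*parse_special*/ false);
    if (n == INT32_MIN) {
        return false; // token count overflowed int32_t -- error instead of abort
    }
    if (n < 0) {
        out.resize((size_t) -n); // buffer too small: -n is the required size
        n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                           out.data(), (int32_t) out.size(), true, false);
    }
    if (n < 0) {
        return false;
    }
    out.resize((size_t) n);
    return true;
}
```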
Sigbjørn Skjæret
88fc854b4b
llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
Concedo
4356a00f4a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ci/run.sh
#	docs/function-calling.md
#	examples/gritlm/gritlm.cpp
#	ggml/CMakeLists.txt
#	ggml/cmake/common.cmake
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	requirements/requirements-compare-llama-bench.txt
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
2025-06-18 00:16:54 +08:00
Georgi Gerganov
89fea80d29
server : fix incorrect usage of llama_get_embeddings() (#14225)
* server : fix incorrect usage of llama_get_embeddings()

ggml-ci

* cont : fix the fix

ggml-ci
2025-06-16 22:33:27 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
Georgi Gerganov
b9912ac570
batch : auto-gen positions + verify multi-sequence input (#14177)
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
2025-06-15 09:18:37 +03:00
Concedo
4204f111f7 Merge commit '8f47e25f56' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-linux-cross.yml
#	docs/backend/CANN.md
#	examples/batched.swift/Sources/main.swift
#	examples/embedding/embedding.cpp
#	examples/gritlm/gritlm.cpp
#	examples/llama.android/llama/src/main/cpp/llama-android.cpp
#	examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/retrieval/retrieval.cpp
#	examples/save-load-state/save-load-state.cpp
#	examples/simple-chat/simple-chat.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	tools/batched-bench/batched-bench.cpp
#	tools/cvector-generator/cvector-generator.cpp
#	tools/imatrix/imatrix.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
#	tools/run/run.cpp
2025-06-13 22:05:03 +08:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
2025-06-06 14:11:15 +03:00
Concedo
d33c88b1f4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	examples/embedding/embedding.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	src/CMakeLists.txt
2025-06-06 17:56:51 +08:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels (#13940) 2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
2025-06-05 15:29:22 +03:00
Concedo
6ce85c54d6 not working correctly 2025-06-02 22:12:10 +08:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
Concedo
8c701d7ded Merge commit '72b090da2c' into concedo_experimental
# Conflicts:
#	docs/backend/CANN.md
#	docs/function-calling.md
#	examples/embedding/embedding.cpp
#	examples/retrieval/retrieval.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/Doxyfile
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/concat.cpp
#	ggml/src/ggml-sycl/conv.cpp
#	ggml/src/ggml-sycl/cpy.cpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/gla.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-sycl/outprod.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/tsembd.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	scripts/compare-commits.sh
#	tests/test-chat.cpp
#	tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7 Merge commit 'e121edc432' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	common/CMakeLists.txt
#	docs/function-calling.md
#	ggml/src/ggml-sycl/binbcast.cpp
#	models/templates/README.md
#	scripts/tool_bench.py
#	src/llama-kv-cache.cpp
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/mtmd/clip.h
#	tools/rpc/rpc-server.cpp
#	tools/server/README.md
2025-05-28 00:20:45 +08:00
Georgi Gerganov
22229314fc
llama : clarify deprecation message (#13794) 2025-05-26 12:57:50 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell (#13706)
* kv-cache : rework kv_cell

ggml-ci

* kv-cells : use "shift" instead of "delta" consistently

ggml-ci

* llama : add llama_max_parallel_sequences()

ggml-ci

* kv-cells : update comments [no ci]

* context : fail upon construction if sequences exceed max value

ggml-ci

* kv-cells : get_pos() -> pos_get() + comments

ggml-ci

* kv-cells : fix tracking of "used" cells

ggml-ci
2025-05-25 16:34:36 +03:00
Concedo
bd960a90a6 removed unnecessary function 2025-05-24 23:59:31 +08:00
Concedo
f97bbdde00 fix to allow all EOGs to trigger a stop, occam's glm4 fix, 2025-05-24 22:55:11 +08:00
Concedo
22ef97d7d3 Merge commit 'ab86335760' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	examples/retrieval/retrieval.cpp
#	examples/simple-chat/simple-chat.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-convert_hf_to_gguf.txt
#	requirements/requirements-convert_hf_to_gguf_update.txt
#	requirements/requirements-convert_lora_to_gguf.txt
#	tools/run/run.cpp
2025-05-23 11:41:36 +08:00
Concedo
da7fd4aa57 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/musa.Dockerfile
#	.github/workflows/build.yml
#	README.md
#	ci/README.md
#	docs/docker.md
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-arg-parser.cpp
2025-05-21 23:12:22 +08:00
Concedo
9f976e9c65 swa full used unless ctx shift and fast forward disabled 2025-05-21 22:47:45 +08:00