Concedo
5de51b77c1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/close-issue.yml
# docs/build-s390x.md
# examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# ggml/src/ggml-cuda/fattn-tile-f16.cu
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/tool_bench.py
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-09-11 22:28:19 +08:00
Eric Curtin
408ff524b4
Implement --log-colors with always/never/auto (#15792)
...
With auto by default
Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
2025-09-05 19:43:59 +01:00
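A minimal sketch of the new flag in use (a hypothetical invocation: the llama-cli binary, model path, and prompt are placeholders; the always/never/auto values and the auto default come from the commit above):

```python
import subprocess

# Hypothetical run that forces colored log output off, e.g. when capturing logs to a file.
# Omitting --log-colors is equivalent to --log-colors auto after this change.
subprocess.run(
    ["./llama-cli", "-m", "model.gguf", "-p", "Hello", "--log-colors", "never"],
    check=True,
)
```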
Concedo
f0d4128e9f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
# examples/model-conversion/scripts/causal/convert-model.sh
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
# examples/model-conversion/scripts/causal/run-converted-model.sh
# examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
# examples/model-conversion/scripts/embedding/convert-model.sh
# examples/model-conversion/scripts/embedding/modelcard.template
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# examples/model-conversion/scripts/utils/create-collection-add-model.sh
# examples/model-conversion/scripts/utils/inspect-converted-model.sh
# examples/model-conversion/scripts/utils/inspect-org-model.py
# examples/model-conversion/scripts/utils/perplexity-gen.sh
# examples/model-conversion/scripts/utils/perplexity-run-simple.sh
# examples/model-conversion/scripts/utils/perplexity-run.sh
# examples/model-conversion/scripts/utils/quantize.sh
# examples/model-conversion/scripts/utils/run-embedding-server.sh
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# src/llama-context.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-09-05 13:25:34 +08:00
Eric Curtin
badb80cadb
Document the new max GPU layers default in help (#15771)
...
This is a key change, just letting users know.
Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
2025-09-04 10:49:44 +01:00
Concedo
2562129271
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# ci/run.sh
# docs/backend/CANN.md
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/flash_attn_f16.cl
# ggml/src/ggml-opencl/kernels/flash_attn_f32.cl
# ggml/src/ggml-opencl/kernels/flash_attn_f32_f16.cl
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/gguf.cpp
# src/llama-context.cpp
# tests/test-sampling.cpp
# tools/server/README.md
2025-09-03 17:16:42 +08:00
Johannes Gäßler
c466abe158
llama: -fa 1/0/-1 aliases for -fa on/off/auto (#15746)
2025-09-02 18:17:26 +02:00
Georgi Gerganov
0d161f021a
server : enable /slots by default and make it secure (#15630)
...
* server : enable /slots by default and make it secure
ggml-ci
* server : fix tests to pass `--no-slots` when necessary
* server : extend /props with info about enabled endpoints
2025-08-31 20:11:58 +03:00
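A small sketch of probing the endpoints this change touches (assuming a llama-server instance on localhost:8080; the /slots and /props paths and the --no-slots opt-out come from the commit, while the response shape and any auth requirements implied by "make it secure" are not specified there):

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed local llama-server instance

def get_json(path: str):
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

# /props now also reports which endpoints are enabled on this server.
print(get_json("/props"))

# /slots is enabled by default after this change; start the server with --no-slots to disable it.
print(get_json("/slots"))
```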
Concedo
7e35954695
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/build.md
# docs/function-calling.md
# examples/eval-callback/eval-callback.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kernels.cpp
# ggml/src/ggml-cpu/kleidiai/kernels.h
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# scripts/compare-llama-bench.py
# scripts/server-bench.py
# scripts/tool_bench.py
# tests/test-chat.cpp
# tools/batched-bench/batched-bench.cpp
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-08-31 23:33:36 +08:00
Johannes Gäßler
e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
...
* llama: use max. GPU layers by default, auto -fa
* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Concedo
3060dfb99f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# scripts/compare-commits.sh
2025-08-28 23:17:29 +08:00
Sigbjørn Skjæret
84ab83cc0b
model : jina-embeddings-v3 support (#13693)
...
* initial jina-embeddings-v3 support
* initial jina-embeddings-v3 support
* initial jina-embeddings-v3 support
* fix vocab parsing with only tokenizer.json
* set mask token lstrip attribute
* additional unk_token_id fallback just in case [no ci]
* revert vocab_size() change [no ci]
* merge tensor loading into general bert
* rope
* add lora embedding and loading (non-functional)
* export separate lora ggufs instead
* add adapter metadata api
* use std::string
* convert_hf_to_lora compatibility
* fix assert
* apply suggestions from review
* apply suggestion from review
2025-08-28 15:49:50 +02:00
Georgi Gerganov
da54f9f1a2
presets : add qwen3-30B-a3b FIM (#15616)
2025-08-27 15:48:07 +03:00
Concedo
654b9eee73
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/Makefile
# examples/model-conversion/README.md
# examples/model-conversion/scripts/utils/quantize.sh
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/group_norm.cl
# ggml/src/ggml-opencl/kernels/norm.cl
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tools/batched-bench/batched-bench.cpp
# tools/mtmd/CMakeLists.txt
2025-08-27 17:39:24 +08:00
Daniel Bevenius
fcca2182a1
common : add -m to bash completion for --model [no ci] (#15591)
...
This commit updates the bash completion script to include the -m
short option for the --model argument.
The motivation for this is that currently tab completion only works for the
full --model option, and it is nice to have it work for the short option
as well.
2025-08-27 10:28:53 +02:00
Concedo
8b8396c30c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build-s390x.md
# examples/llama.vim
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-08-23 11:35:28 +08:00
Georgi Gerganov
9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
...
ggml-ci
2025-08-22 12:22:13 +03:00
Diego Devesa
54a241f505
sched : fix possible use of wrong ids tensor when offloading moe prompt processing (#15488)
2025-08-21 23:09:32 +02:00
Concedo
90706ddb14
Merge commit 'fec9519802' into concedo_experimental
...
# Conflicts:
# Makefile
# examples/lookahead/README.md
# tools/server/CMakeLists.txt
2025-08-21 19:19:20 +08:00
Concedo
1c41c38a6a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/cuda.Dockerfile
# CODEOWNERS
# README.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-opencl/ggml-opencl.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-chat.cpp
# tools/batched-bench/batched-bench.cpp
# tools/mtmd/clip.h
2025-08-20 20:34:45 +08:00
Jie Fu (傅杰)
ec5ab1a36c
common : fix context shift help message (#15448)
...
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-20 13:33:30 +03:00
Gian-Carlo Pascutto
1e19f5d462
common : Add top-nsigma sampler to help globally (#15428)
...
Fixes #15423.
2025-08-19 19:58:14 +03:00
Georgi Gerganov
d2fcd91cf9
server : disable context shift by default (#15416)
...
* server : disable context shift by default
ggml-ci
* server : make scope of test parameters local
2025-08-19 16:46:37 +03:00
Concedo
7ac0102ed3
hope i didnt break anything
2025-08-14 21:42:24 +08:00
Concedo
d5876024ec
Merge commit 'f4586ee598' into concedo_experimental
...
# Conflicts:
# README.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmv2.6.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/add.cl
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tools/perplexity/perplexity.cpp
# tools/server/README.md
2025-08-14 21:29:52 +08:00
Georgi Gerganov
d32e03f449
server : add SWA checkpoints (#15293)
...
* server : add SWA checkpoints
ggml-ci
* cont : server clean-up
* server : handle state restore fails
* llama : add extended llama_state_seq_ API
* server : do not make checkpoints if --swa-full
ggml-ci
* llama : remove flags value for NONE
* server : configure number of SWA checkpoints with CLI arg
ggml-ci
* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
Jonathan Graehl
5cdb27e091
finetune: SGD optimizer, more CLI args (#13873)
...
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch)
when using SGD instead of 19 GB (55 sec/epoch) using adamw
(wikipedia 100-line finetune)
(
using the same GPU memory, adamw can only manage 512
batch/context before OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges more slowly, with a max of 1728
batch/context before OOM (esp. note the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or with a large enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting').
The -lr-half (half-life) option is useful for SGD to avoid oscillation or
very slow underdamped learning (it makes setting -lr more forgiving).
The terminal -lr for now is set by -lr-halvings, i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.
note: objective loss is not directly comparable between adamw and sgd -
check perplexity or accuracy, or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
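A hypothetical invocation pulling together the CLI args described in the commit message above (-opt sgd, -lr, -lr-halvings, -wd, -epochs); the binary name and the model/data paths are placeholders, and -lr-halvings 3 means the terminal learning rate is 1/2^3 = 1/8 of the initial -lr:

```python
import subprocess

# Placeholder finetune run using the SGD optimizer instead of the default adamw.
subprocess.run(
    [
        "./llama-finetune",
        "-m", "llama-3.2-1b-F32.gguf",    # placeholder model
        "-f", "wikipedia-100-lines.txt",  # placeholder training data
        "-opt", "sgd",                    # SGD instead of the default adamw
        "-lr", "1e-4",                    # placeholder learning rate
        "-lr-halvings", "3",              # terminal lr = initial lr / 2^3
        "-wd", "1e-9",                    # weight decay (new arg per the commit)
        "-epochs", "2",                   # max epochs (default 2, shown explicitly)
    ],
    check=True,
)
```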
Sigbjørn Skjæret
b3e16665e1
server : enable -td and -tbd parameters (#15172)
2025-08-13 15:43:00 +02:00
Copilot
d8914fc47e
common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191)
...
* Checkpoint from VS Code for coding agent session
* Initial plan
* Fix typo in --override-tensor-draft flag implementation
* Add null termination for speculative tensor buffer overrides
* Apply suggestions from code review
* Apply suggestions from code review
* Extract tensor override parsing logic to common function (addresses @slaren's feedback)
* Apply suggestions from code review
* Apply suggestions
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-13 12:44:40 +02:00
Xuan-Son Nguyen
53d0a12658
server : allow specifying reasoning_format in HTTP request (#15238)
2025-08-11 14:48:41 +02:00
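A sketch of what setting reasoning_format per request might look like (assuming a local llama-server and its OpenAI-compatible /v1/chat/completions route; the reasoning_format field name is from the commit, the value "none" is only an illustrative choice):

```python
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "reasoning_format": "none",  # per-request override added by the commit above; value is illustrative
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local llama-server
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```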
Concedo
6eea7b88d2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
2025-08-06 10:51:29 +08:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
...
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Concedo
7590a0ea39
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/CMakeLists.txt
# models/templates/README.md
# tools/imatrix/imatrix.cpp
2025-08-05 19:24:29 +08:00
Diego Devesa
ec428b02c3
llama : add --n-cpu-moe option (#15077)
...
* llama : add --n-cpu-moe option
Keeps the MoE weights of the first N layers in the CPU
2025-08-05 01:05:36 +02:00
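A hypothetical launch showing the new option (binary, model, and the value 10 are placeholders; per the commit, --n-cpu-moe N keeps the MoE weights of the first N layers on the CPU):

```python
import subprocess

# Placeholder invocation: offload the model normally, but keep the MoE weights
# of the first 10 layers on the CPU to save VRAM.
subprocess.run(
    ["./llama-server", "-m", "moe-model.gguf", "--n-cpu-moe", "10"],
    check=True,
)
```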
compilade
19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076)
...
* imatrix : add warning when suffix is not .gguf for GGUF imatrix
* imatrix : only warn about suffix when output format is unspecified
2025-08-04 23:26:52 +02:00
Concedo
8bd0a560f0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/compare-llama-bench.py
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-04 22:42:02 +08:00
compilade
d31192b4ee
imatrix : use GGUF by default (#14842)
...
* imatrix : use GGUF by default
* imatrix : use GGUF regardless of the output filename
The legacy format can only be produced with --output-format dat
2025-08-03 22:00:05 +02:00
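A sketch of the two output modes described above (hypothetical invocations: the llama-imatrix binary plus the -m/-f/-o arguments are assumptions used for illustration, while --output-format dat comes from the commit):

```python
import subprocess

# Default after this change: the importance matrix is written as GGUF,
# regardless of the output filename.
subprocess.run(
    ["./llama-imatrix", "-m", "model.gguf", "-f", "calibration.txt", "-o", "imatrix.gguf"],
    check=True,
)

# The legacy binary format now has to be requested explicitly.
subprocess.run(
    ["./llama-imatrix", "-m", "model.gguf", "-f", "calibration.txt",
     "-o", "imatrix.dat", "--output-format", "dat"],
    check=True,
)
```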
Concedo
f430916a71
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/add.cl
# ggml/src/ggml-opencl/kernels/mul.cl
# scripts/compare-commits.sh
# scripts/compare-llama-bench.py
# scripts/sync-ggml.last
# tools/server/README.md
2025-08-02 10:25:10 +08:00
Diego Devesa
a06ed5feae
llama : add simple option to enable CPU for MoE weights (--cpu-moe) (#14992)
2025-07-31 20:15:41 +02:00
Diego Devesa
d6818d06a6
llama : allow other bufts when overriding to CPU, add --no-repack option (#14990)
2025-07-31 18:11:34 +02:00
g2mt
94933c8c2e
server : implement universal assisted decoding (#12635)
...
* llama-server : implement universal assisted decoding
* Erase prompt tail for kv-cache
* set vocab_dft_compatible in common_speculative
* rename ctx_main to ctx_tgt
* move vocab_dft_compatible to spec struct
* clear mem_dft, remove mem
* detokenize id_last for incompatible models
* update comment
* add --spec-replace flag
* accept special tokens when translating between draft/main models
* Escape spec-replace
* clamp draft result size to params.n_draft
* fix comment
* clean up code
* restore old example
* log common_speculative_are_compatible in speculative example
* fix
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-31 14:25:23 +02:00
Aman Gupta
8a4a856277
Add LLaDA 8b Diffusion model (#14771)
...
* Add support for Llada-8b: diffusion model
* Add README
* Fix README and convert_hf_to_gguf
* convert_hf_to_gguf.py: address review comments
* Make everything in a single example
* Remove model-specific sampling
* Remove unused argmax
* Remove braced initializers, improve README.md a bit
* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps
* Remove adding the mask token
* Move add_add_bos_token to set_vocab
* use add_bool in gguf_writer.py
2025-07-31 19:49:09 +08:00
Concedo
0fcfbdb93c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/musa.Dockerfile
# .github/workflows/build.yml
# .github/workflows/close-issue.yml
# ci/README.md
# docs/build.md
# docs/docker.md
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/fattn-wmma-f16.cu
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
2025-07-25 19:53:13 +08:00
Concedo
0d72c794fa
Merge commit 'c8ade30036' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/im2col_f16.cl
# ggml/src/ggml-opencl/kernels/im2col_f32.cl
# ggml/src/ggml-sycl/im2col.cpp
# tools/mtmd/clip.cpp
2025-07-25 19:42:45 +08:00
Ed Addario
d1aa0cc5d1
imatrix: add option to display importance score statistics for a given imatrix file (#12718)
...
* Add --show-statistics option
* Add --show-statistics logic
* Add tensor name parsing
* Tidy output format
* Fix typo in title
* Improve tensor influence ranking
* Add better statistics
* Change statistics' sort order
* Add Cosine Similarity
* Add header search path
* Change header search path to private
* Add weighted statistics per layer
* Update report title
* Refactor compute_statistics out of main
* Refactor compute_cossim out of load_imatrix
* Refactor compute_statistics out of load_imatrix
* Move imatrix statistics calculation into its own functions
* Add checks and validations
* Remove unnecessary include directory
* Rename labels
* Add m_stats getter and refactor compute_statistics out of load_imatrix
* Refactor variable names
* Minor cosmetic change
* Retrigger checks (empty commit)
* Rerun checks (empty commit)
* Fix unnecessary type promotion
Co-authored-by: compilade <git@compilade.net>
* Reverting change to improve code readability
* Rerun checks (empty commit)
* Rerun checks (empty commit)
* Rerun checks - third time's the Charm 🤞 (empty commit)
* Minor cosmetic change
* Update README
* Fix typo
* Update README
* Rerun checks (empty commit)
* Re-implement changes on top of #9400
* Update README.md
* Update README
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Update README.md
* Remove duplicate option in print_usage()
* Update README.md
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Remove input check
* Remove commented out code
---------
Co-authored-by: compilade <git@compilade.net>
2025-07-22 14:33:37 +02:00
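A sketch of inspecting an existing imatrix with the new option (hypothetical: --show-statistics is from the commit, but pointing at the file via --in-file is an assumption worth checking against the tool's --help):

```python
import subprocess

# Hypothetical: print importance score statistics for an existing imatrix file.
subprocess.run(
    ["./llama-imatrix", "--in-file", "imatrix.gguf", "--show-statistics"],
    check=True,
)
```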
Molly Sophia
adef81781a
server : allow setting --reverse-prompt arg (#14799)
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 09:24:22 +08:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode (#14363)
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628)
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Aman Gupta
ab14019821
Support diffusion models: Add Dream 7B (#14644)
...
* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
2025-07-16 20:03:51 +08:00
Concedo
b8c1fc7c9e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/development/HOWTO-add-model.md
# ggml/src/ggml-sycl/rope.cpp
# tests/test-backend-ops.cpp
2025-07-09 19:25:28 +08:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix (#14544)
...
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
2025-07-08 11:47:33 +03:00
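A sketch of talking to a server mounted under a prefix (assuming llama-server was started with --api-prefix /llm, where /llm is a placeholder value; /health is a standard llama-server endpoint):

```python
import urllib.request

# Assuming the server was launched with something like:
#   llama-server -m model.gguf --api-prefix /llm
# every route is then served under that prefix.
with urllib.request.urlopen("http://localhost:8080/llm/health") as resp:
    print(resp.status, resp.read().decode("utf-8"))
```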