koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-07 09:02:04 +00:00

Author	SHA1	Message	Date
Concedo	d577187875	update sdui	2025-12-21 20:35:19 +08:00
Concedo	c93c4c5505	Merge commit '`4a4f7e6550`' into concedo_experimental # Conflicts: # .github/ISSUE_TEMPLATE/011-bug-results.yml # CODEOWNERS # README.md # ci/run.sh # docs/development/HOWTO-add-model.md # grammars/README.md # src/llama-context.cpp # src/llama.cpp # tools/CMakeLists.txt # tools/completion/README.md # tools/llama-bench/README.md	2025-12-17 14:30:39 +08:00
Concedo	050a5b1f52	Merge commit '`4aced7a631`' into concedo_experimental # Conflicts: # .devops/cann.Dockerfile # .devops/cpu.Dockerfile # .devops/cuda.Dockerfile # .devops/intel.Dockerfile # .devops/musa.Dockerfile # .devops/rocm.Dockerfile # .devops/tools.sh # .devops/vulkan.Dockerfile # .github/workflows/build.yml # .github/workflows/release.yml # .gitignore # docs/ops.md # docs/ops/SYCL.csv # examples/batched/batched.cpp # examples/eval-callback/eval-callback.cpp # examples/gen-docs/gen-docs.cpp # examples/lookahead/lookahead.cpp # examples/lookup/lookup-create.cpp # examples/lookup/lookup-stats.cpp # examples/lookup/lookup.cpp # examples/model-conversion/scripts/causal/compare-logits.py # examples/model-conversion/scripts/causal/run-org-model.py # examples/model-conversion/scripts/utils/check-nmse.py # examples/parallel/parallel.cpp # examples/retrieval/retrieval.cpp # examples/save-load-state/save-load-state.cpp # examples/speculative-simple/speculative-simple.cpp # examples/speculative/speculative.cpp # examples/training/finetune.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/repack.cpp # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/dequantize.hpp # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/mmvq.cpp # ggml/src/ggml-sycl/pad.cpp # ggml/src/ggml-sycl/ssm_conv.cpp # ggml/src/ggml-sycl/vecdotq.hpp # pyrightconfig.json # scripts/sync-ggml.last # tests/test-arg-parser.cpp # tests/test-backend-ops.cpp # tools/cvector-generator/cvector-generator.cpp # tools/imatrix/imatrix.cpp # tools/mtmd/CMakeLists.txt # tools/mtmd/clip.cpp # tools/perplexity/perplexity.cpp # tools/server/README.md	2025-12-16 23:14:12 +08:00
Concedo	e88bf41fdc	Merge commit '`12280ae905`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # common/CMakeLists.txt # docs/docker.md # examples/model-conversion/scripts/causal/compare-logits.py # ggml/src/ggml-hexagon/htp/rope-ops.c # tests/test-backend-ops.cpp # tests/test-barrier.cpp # tools/server/CMakeLists.txt # tools/server/README.md	2025-12-16 16:29:01 +08:00
Johannes Gäßler	b1f3a6e5db	llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653 ) * llama: automatically fit args to free memory llama-fit-params tool * fix CI * hints for bug reports, ensure no reallocation * fix segfault with Vulkan * add llama-fit-params to CI * fix CI * fix CI * fix CI * minor adjustments * fix assignment of 1 dense layer * fix logger not being reset on model load failure * remove --n-gpu-layer hint on model load failure * fix llama-fit-params verbosity * fix edge case * fix typo [no ci]	2025-12-15 09:24:59 +01:00
Georgi Gerganov	609a2d0268	models : fix YaRN regression + consolidate logic (#18006) * models : fix YaRN regression + consolidate logic * cont : fix the fix * cont : remove header * cont : add header	2025-12-14 08:34:56 +02:00
Jeff Bolz	5266379bca	llama_context: synchronize before reallocating output buffer (#17974)	2025-12-13 09:19:51 -06:00
Concedo	278e45becf	Merge commit '`2fa51c19b0`' into concedo_experimental # Conflicts: # .github/actions/windows-setup-cuda/action.yml # .github/workflows/build-linux-cross.yml # .github/workflows/release.yml # README.md # docs/build-riscv64-spacemit.md # examples/model-conversion/logits.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # models/templates/Kimi-K2-Instruct.jinja # models/templates/Kimi-K2-Thinking.jinja # tests/test-chat.cpp # tools/server/README.md	2025-12-11 23:04:48 +08:00
Concedo	fd0d0cab03	move pipeline parallelism to a --pipelineparallel launch flag	2025-12-11 21:03:41 +08:00
Georgi Gerganov	4dff236a52	ggml : remove GGML_KQ_MASK_PAD constant (#17910) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-10 20:53:16 +02:00
Piotr Wilkin (ilintar)	e4e9c4329c	Make graph_max_nodes vary by ubatch size (#17794 ) * Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph * Update src/llama-context.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add missing const --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-08 14:32:41 +01:00
Concedo	bf5efcf86d	Merge commit '`d82b7a7c1d`' into concedo_experimental # Conflicts: # ci/run.sh # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-cuda/common.cuh # tests/CMakeLists.txt	2025-11-30 15:43:11 +08:00
Diego Devesa	e072b2052e	ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276 ) * ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched Enabled in ggml-ci for testing. * llama : update worst-case graph for unified cache * ci : disable op offload in some tests * fix spelling --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-28 17:33:23 +02:00
Concedo	0ccb298087	Merge commit '`ddf9f94389`' into concedo_experimental # Conflicts: # examples/model-conversion/scripts/causal/run-converted-model.sh # examples/model-conversion/scripts/causal/run-org-model.py # src/CMakeLists.txt # src/llama-quant.cpp # tools/server/README.md	2025-11-28 23:27:50 +08:00
Piotr Wilkin (ilintar)	ff55414c42	model : Qwen3 Next (#16095 ) * Qwen3 Next - cleaned up version * Whitespaces and stuff * Correct minor errors * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Misc. fixes. * Clean up code, add missing hybrid qualifier * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Whitespace * Proper tensors for cb calls * Use llama-graph.h vertical alignment * BROKEN: chunking * Set new tensors as inputs. * Proper chunk logic * It's the circle of life... * More shenanigans for n_seq > 1 * Nail in the coffin? * Fix Windows build * Eh, one fails on Windows, the other fails on Mac... just use general capture. * quant : cleanup * model : cleanup * qwen3 : cleanup * cont : cleanup * cont : cleanup * ggml : revert change * qwen3 : cleanup * cont : cleanup * Readd cmath * qwen3 : fix typo * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Usual suspects * fix my bad suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-28 12:02:56 +01:00
Concedo	724763fdec	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/vulkan.Dockerfile # .github/workflows/build.yml # .github/workflows/server.yml # common/common.cpp # examples/batched/README.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-cpu/arch-fallback.h # ggml/src/ggml-opencl/ggml-opencl.cpp # scripts/sync-ggml.last # src/CMakeLists.txt # tests/test-backend-ops.cpp # tools/server/CMakeLists.txt	2025-11-25 16:38:07 +08:00
Daniel Bevenius	134e6940ca	llama : skip output reordering for single token batches (#17466) This commit adds a check to skip the output reordering logic when n_outputs == 1. With a single output token, the data is trivially sorted and the reordering code is currently doing unnecessary work (resetting and rebuilding output_ids to the same values). The motivation for this change is improved code clarity and avoiding confusion when debugging. While the performance impact is probably negligible, this unnecessary work happens on every decode call in llama-server when processing batches with single-token outputs.	2025-11-24 21:06:17 +01:00
LostRuins Concedo	d6a2ad8455	still not really working right	2025-11-09 01:57:48 +08:00
LostRuins Concedo	fdcb281a3a	Merge commit '`2f966b8ed8`' into concedo_experimental # Conflicts: # .github/workflows/release.yml # docs/docker.md # ggml/src/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # tests/test-backend-ops.cpp # tests/test-thread-safety.cpp # tools/batched-bench/batched-bench.cpp # tools/mtmd/clip.cpp	2025-11-08 10:34:17 +08:00
Sigbjørn Skjæret	9008027aa3	hparams : add n_embd_inp() to support extended embed (#16928 ) * add n_embd_full to support extended embed * don't change output * rename to n_embd_inp * restore n_embd where applicable	2025-11-07 19:27:58 +01:00
Georgi Gerganov	16bcc1259d	kv-cache : pad the cache size to 256 for performance (#17046 ) * kv-cache : pad the size of the small SWA cache for performance * context : pad the total context to 256 * cont : future-proof the swa pad * server : adjust test params to new logic	2025-11-07 20:03:25 +02:00
Johannes Gäßler	aa374175c3	CUDA: fix crash on uneven context without FA (#16988)	2025-11-06 14:05:47 +01:00
Georgi Gerganov	cd5e3b5754	server : support unified cache across slots (#16736 ) * server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning	2025-11-02 18:14:04 +02:00
Concedo	472438aad3	Merge commit '`5a4ff43e7d`' into concedo_experimental # Conflicts: # docs/build.md # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # src/llama-context.cpp # tests/test-backend-ops.cpp	2025-10-30 13:13:00 +08:00
Diego Devesa	5a4ff43e7d	llama : disable pipeline parallelism if compute buffer allocation fails (#16748)	2025-10-27 21:51:28 +01:00
Concedo	12a8bfd453	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CODEOWNERS # README.md # docs/ops.md # docs/ops/SYCL.csv # docs/ops/Vulkan.csv # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/backend.hpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # tests/test-backend-ops.cpp # tests/test-thread-safety.cpp	2025-10-23 17:22:17 +08:00
takuya kodama	7062dd8460	llama-context: only warn on pooling_type when user specified (#16674) The unexpected pooling_type warning was incorrectly shown when users did not specify the --pooling-type parameter. In this case, the parameter defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code automatically applies the model's default pooling type. Example of spurious warning: ``` $ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello" ... llama_init_from_model: model default pooling_type is [2], but [-1] was specified ... ``` This fix ensures the warning only appears when users explicitly specify a pooling type that differs from the model's default (e.g., using --pooling-type mean on a model that expects CLS pooling).	2025-10-20 10:44:21 +03:00
Concedo	5b6ba02167	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # ci/run.sh # examples/model-conversion/Makefile # examples/model-conversion/README.md # examples/model-conversion/logits.cpp # examples/model-conversion/requirements.txt # examples/model-conversion/scripts/embedding/convert-model.sh # examples/model-conversion/scripts/embedding/run-converted-model.sh # examples/model-conversion/scripts/embedding/run-original-model.py # examples/model-conversion/scripts/utils/semantic_check.py # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/kleidiai/kernels.cpp # ggml/src/ggml-cpu/kleidiai/kernels.h # ggml/src/ggml-cpu/kleidiai/kleidiai.cpp # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/softmax.cpp # ggml/src/ggml-sycl/softmax.hpp # requirements/requirements-all.txt # tests/test-chat-parser.cpp # tools/server/README.md	2025-10-09 23:46:56 +08:00
Saba Fallah	e08db42595	model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (#16367 ) * model: EmbeddingGemma sentence-transformers dense linear projections support * model: add support for EmbeddingGemma SentenceTransformers dense linear projections Adding support for the Dense modules used in EmbeddingGemma models. EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone. See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/ * model: add support for EmbeddingGemma SentenceTransformers dense linear projections - converting model with dense-layers is optional - introduced dense config params * Update convert_hf_to_gguf.py Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> * fixed formatting issues * Update src/llama-graph.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * - removed pooling_type_opt, always allow overriding pooling_type - asserts checking dense features dims * fix python lint * fix ubuntu gcc build warning * - fixed thread-safety test - moved asserts to load_hparams * - tidying up code - simplifying graph-context expecting both dense weights * minor : add TODO --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-09 09:39:18 +03:00
Concedo	b120e107f9	Merge branch 'upstream' into concedo_experimental # Conflicts: # .clang-tidy # .devops/musa.Dockerfile # .github/workflows/build-linux-cross.yml # .github/workflows/build.yml # .github/workflows/docker.yml # .gitignore # CODEOWNERS # CONTRIBUTING.md # README.md # build-xcframework.sh # ci/README-MUSA.md # ci/run.sh # common/CMakeLists.txt # docs/docker.md # examples/CMakeLists.txt # examples/eval-callback/CMakeLists.txt # examples/model-conversion/Makefile # examples/model-conversion/README.md # examples/model-conversion/logits.cpp # examples/model-conversion/scripts/causal/compare-logits.py # examples/model-conversion/scripts/causal/run-org-model.py # examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh # examples/model-conversion/scripts/embedding/run-converted-model.sh # examples/model-conversion/scripts/embedding/run-original-model.py # examples/model-conversion/scripts/utils/check-nmse.py # examples/model-conversion/scripts/utils/inspect-org-model.py # examples/model-conversion/scripts/utils/semantic_check.py # ggml/CMakeLists.txt # ggml/include/ggml-zdnn.h # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/set_rows.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/set_rows.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-zdnn/ggml-zdnn.cpp # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tests/test-quantize-perf.cpp # tests/test-tokenizers-repo.sh # tools/perplexity/perplexity.cpp # tools/server/tests/README.md	2025-09-27 17:09:14 +08:00
Johannes Gäßler	e789095502	llama: print memory breakdown on exit (#15860) * llama: print memory breakdown on exit	2025-09-24 16:53:48 +02:00
Concedo	3e72aaff5b	Merge commit '`8f8f2274ee`' into concedo_experimental # Conflicts: # .devops/rocm.Dockerfile # .github/workflows/build.yml # .github/workflows/release.yml # CMakeLists.txt # examples/simple/simple.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-opencl/kernels/tsembd.cl # ggml/src/ggml-sycl/binbcast.cpp # ggml/src/ggml-sycl/binbcast.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/tsembd.cpp # ggml/src/ggml-zdnn/ggml-zdnn.cpp # src/llama-model.cpp # tools/batched-bench/CMakeLists.txt # tools/cvector-generator/CMakeLists.txt # tools/export-lora/CMakeLists.txt # tools/gguf-split/CMakeLists.txt # tools/imatrix/CMakeLists.txt # tools/llama-bench/CMakeLists.txt # tools/llama-bench/llama-bench.cpp # tools/main/CMakeLists.txt # tools/main/README.md # tools/mtmd/CMakeLists.txt # tools/perplexity/CMakeLists.txt # tools/perplexity/perplexity.cpp # tools/quantize/CMakeLists.txt # tools/rpc/rpc-server.cpp # tools/run/CMakeLists.txt # tools/run/run.cpp # tools/tokenize/CMakeLists.txt # tools/tts/CMakeLists.txt	2025-09-21 08:58:23 +08:00
Concedo	b9280718b5	fix bge memory excessive usage?	2025-09-21 08:38:37 +08:00
Sigbjørn Skjæret	b8e09f08b9	model : add grok-2 support (#15539 ) * add grok-2 support * type fix * type fix * type fix * "fix" vocab for invalid sequences * fix expert tensor mapping and spaces in vocab * add chat template * fix norm tensor mapping * rename layer_out_norm to ffn_post_norm * ensure ffn_post_norm is mapped * fix experts merging * remove erroneous FFN_GATE entry * concatenate split tensors and add more metadata * process all expert layers and try cat instead of hstack * add support for community BPE vocab * fix expert feed forward length and ffn_down concat * commit this too * add ffn_up/gate/down, unsure if sequence is right * add ffn_gate/down/up to tensor names * correct residual moe (still not working) * mess-- * fix embedding scale being applied twice * add built in chat template * change beta fast for grok if default value * remove spm vocab in favor of community bpe vocab * change attention temp length metadata type to integer * update attention temp length metadata * remove comment * replace M_SQRT2 with std::sqrt(2) * add yarn metadata, move defaults to hparams	2025-09-14 23:00:59 +02:00
Concedo	1dbd2fc259	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/build-s390x.md # docs/ops.md # docs/ops/zDNN.csv # ggml/include/ggml-zdnn.h # ggml/src/ggml-sycl/binbcast.cpp # ggml/src/ggml-sycl/concat.cpp # ggml/src/ggml-sycl/conv.cpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/cpy.cpp # ggml/src/ggml-sycl/dmmv.cpp # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/getrows.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/gla.cpp # ggml/src/ggml-sycl/im2col.cpp # ggml/src/ggml-sycl/mmq.cpp # ggml/src/ggml-sycl/mmvq.cpp # ggml/src/ggml-sycl/norm.cpp # ggml/src/ggml-sycl/rope.cpp # ggml/src/ggml-sycl/set_rows.cpp # ggml/src/ggml-sycl/softmax.cpp # ggml/src/ggml-sycl/tsembd.cpp # ggml/src/ggml-sycl/wkv.cpp # ggml/src/ggml-zdnn/ggml-zdnn-impl.h # ggml/src/ggml-zdnn/ggml-zdnn.cpp # tools/llama-bench/llama-bench.cpp	2025-09-13 12:25:30 +08:00
Haiyue Wang	f4e664f838	context : remove redundant explicit casting to the same type (#15948) The function 'output_reserve' return type is 'uint32_t', so there is no need to add explicit casting.	2025-09-12 18:16:32 +03:00
Concedo	6463f5c26b	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/release.yml # CONTRIBUTING.md # docs/backend/CANN.md # examples/eval-callback/eval-callback.cpp # examples/model-conversion/requirements.txt # examples/model-conversion/scripts/causal/run-org-model.py # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-cpu/kleidiai/kleidiai.cpp # ggml/src/ggml-cuda/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-zdnn/ggml-zdnn.cpp # models/templates/README.md # requirements/requirements-convert_hf_to_gguf.txt # requirements/requirements-convert_legacy_llama.txt # requirements/requirements-tool_bench.txt # tests/.gitignore # tests/test-backend-ops.cpp # tests/test-chat-parser.cpp # tests/test-chat.cpp # tests/test-json-schema-to-grammar.cpp # tests/test-tokenizer-random.py	2025-09-11 22:34:45 +08:00
Concedo	5de51b77c1	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/close-issue.yml # docs/build-s390x.md # examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-cpu/kleidiai/kleidiai.cpp # ggml/src/ggml-cuda/fattn-tile-f16.cu # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-webgpu/ggml-webgpu.cpp # scripts/tool_bench.py # tests/test-backend-ops.cpp # tools/batched-bench/batched-bench.cpp # tools/server/README.md	2025-09-11 22:28:19 +08:00
Daniel Bevenius	86587da03b	llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893) This commit adds a check for two function pointers returned from ggml_backend_reg_get_proc_address. The motivation for this is that the function pointer could be nullptr if the get proc address function changes in the future. This is also consistent with all the other calls to ggml_backend_reg_get_proc_address in the code base.	2025-09-10 05:33:58 +02:00
Georgi Gerganov	663027fd54	context : fix n_outputs during reserve (#15858) ggml-ci	2025-09-08 10:26:36 +03:00
Concedo	f0d4128e9f	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/backend/CANN.md # examples/model-conversion/Makefile # examples/model-conversion/scripts/causal/compare-embeddings-logits.sh # examples/model-conversion/scripts/causal/convert-model.sh # examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py # examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh # examples/model-conversion/scripts/causal/run-converted-model.sh # examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh # examples/model-conversion/scripts/embedding/convert-model.sh # examples/model-conversion/scripts/embedding/modelcard.template # examples/model-conversion/scripts/embedding/run-converted-model.sh # examples/model-conversion/scripts/utils/create-collection-add-model.sh # examples/model-conversion/scripts/utils/inspect-converted-model.sh # examples/model-conversion/scripts/utils/inspect-org-model.py # examples/model-conversion/scripts/utils/perplexity-gen.sh # examples/model-conversion/scripts/utils/perplexity-run-simple.sh # examples/model-conversion/scripts/utils/perplexity-run.sh # examples/model-conversion/scripts/utils/quantize.sh # examples/model-conversion/scripts/utils/run-embedding-server.sh # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # src/llama-context.cpp # tests/test-backend-ops.cpp # tests/test-chat.cpp	2025-09-05 13:25:34 +08:00
Daniel Bevenius	d1e2adba65	llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791 ) * llama : set n_outputs to 1 to avoid 0 outputs mean-pooling This commit modifies the llama_context constructor to set n_outputs to 1. The motivation for this is that when using pooling, and specifically mean pooling, for embeddings having n_outputs set to 0 can lead to the following error: ```console $ build/bin/llama-embedding -m models/nomic-embed-text-1.5-Q4_K_M.gguf \ --pooling mean -p "Hello, how are you?" ... llama_context: CPU output buffer size = 0.12 MiB /home/danbev/work/ai/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed 0x0000743c96d107e3 in __GI___wait4 (pid=292978, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30 warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory 30 in ../sysdeps/unix/sysv/linux/wait4.c 196 waitpid(child_pid, NULL, 0); 230 ggml_print_backtrace(); 3023 GGML_ASSERT(ggml_can_mul_mat(a, b)); 1823 cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean); 18983 llm->build_pooling(cls, cls_b, cls_out, cls_out_b); 1399 auto * gf = model.build_graph(gparams); 292 auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true); 2329 auto * ctx = new llama_context(model, params); 913 llama_context lctx = llama_init_from_model(model, cparams); 105 common_init_result llama_init = common_init_from_params(params); [Inferior 1 (process 292976) detached] Aborted (core dumped) ``` Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add comment about not reserving graphs with zero outputs * add assert in graph_reserve to ensure n_outputs >= 1 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-04 15:40:44 +02:00
Concedo	2562129271	Merge branch 'upstream' into concedo_experimental # Conflicts: # README.md # ci/run.sh # docs/backend/CANN.md # examples/speculative/speculative.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/flash_attn_f16.cl # ggml/src/ggml-opencl/kernels/flash_attn_f32.cl # ggml/src/ggml-opencl/kernels/flash_attn_f32_f16.cl # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/gguf.cpp # src/llama-context.cpp # tests/test-sampling.cpp # tools/server/README.md	2025-09-03 17:16:42 +08:00
Diego Devesa	274966226f	llama : fix fattn reserve call n_seqs parameter (#15699) ggml-ci	2025-08-31 18:47:05 +03:00
Concedo	7e35954695	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/build.md # docs/function-calling.md # examples/eval-callback/eval-callback.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-cpu/kleidiai/kernels.cpp # ggml/src/ggml-cpu/kleidiai/kernels.h # ggml/src/ggml-cpu/kleidiai/kleidiai.cpp # scripts/compare-llama-bench.py # scripts/server-bench.py # scripts/tool_bench.py # tests/test-chat.cpp # tools/batched-bench/batched-bench.cpp # tools/llama-bench/llama-bench.cpp # tools/server/README.md	2025-08-31 23:33:36 +08:00
Diego Devesa	9777032dcc	llama : separate compute buffer reserve from fattn check (#15696) Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.	2025-08-31 15:49:03 +02:00
Johannes Gäßler	e81b8e4b7f	llama: use FA + max. GPU layers by default (#15434) * llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault	2025-08-30 16:32:10 +02:00
Concedo	3060dfb99f	Merge branch 'upstream' into concedo_experimental # Conflicts: # examples/model-conversion/Makefile # examples/model-conversion/scripts/causal/convert-model.sh # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cuda/CMakeLists.txt # scripts/compare-commits.sh	2025-08-28 23:17:29 +08:00
Georgi Gerganov	8a4280ce43	kv-cache : remove LLAMA_SET_ROWS checks (#15505) ggml-ci	2025-08-28 12:27:02 +03:00
Concedo	575eb40950	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/multimodal/minicpmv4.0.md # examples/model-conversion/Makefile # examples/model-conversion/README.md # examples/model-conversion/logits.cpp # examples/model-conversion/scripts/causal/modelcard.template # examples/model-conversion/scripts/utils/hf-create-model.py # ggml/src/ggml-opencl/ggml-opencl.cpp # tests/test-backend-ops.cpp # tools/batched-bench/batched-bench.cpp	2025-08-26 19:09:48 +08:00


176 commits