koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-18 06:19:19 +00:00

Author	SHA1	Message	Date
Concedo	9b38d83377	updated the readme for more docker information to make it clearer what to expect. please don't use the docker on a M series macOS	2026-04-27 18:21:22 +08:00
Concedo	095cfd6354	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # tests/test-chat-auto-parser.cpp # tests/test-chat.cpp	2026-04-26 15:57:35 +08:00
Oliver Simons	b1a5bd4e0c	CUDA: better coalesce data-access for contiguous concat (#22330 ) Also, distribute all elements across CTAs evenly instead of launching one CTA per dim	2026-04-26 09:21:45 +02:00
Concedo	f679e3fec5	fix missing ipv4 support	2026-04-26 14:44:26 +08:00
Sigbjørn Skjæret	0c6ee1cade	ggml-cpu : re-enable fast gelu_quick_f16 (#22339 )	2026-04-26 09:28:14 +03:00
Eve	2dd84169d1	ggml-cpu: optimize avx2 q6_k (#22345 )	2026-04-26 09:27:50 +03:00
lhez	f454bd7eb8	opencl: add iq4_nl support (#22272 ) * opencl: add general support for iq4_nl * opencl: add iq4_nl gemm/gemv for adreno * opencl: pack 2 lut entries into a uint	2026-04-25 21:21:58 -07:00
Trivikram Reddy	b760272f1a	hexagon: guard HMX clock request for v75+ platforms (#22377 )	2026-04-25 17:58:26 -07:00
Piotr Wilkin (ilintar)	dcad77cc3b	chat: fix handling of space in reasoning markers (#22353 ) * chat: fix handling of space in reasoning markers * fix tests * whitespace	2026-04-25 21:24:13 +02:00
Georgi Gerganov	98dc1418ea	spec : fix vocab compat checks (#22358 )	2026-04-25 20:11:35 +03:00
Concedo	929f214bf6	updated docs, handle seed oss thinking	2026-04-25 22:44:40 +08:00
Johannes Gäßler	9725a313be	CUDA: reduce MMQ stream-k overhead (#22298 ) * CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc	2026-04-25 14:15:03 +02:00
Developer-Ecosystem-Engineering	d1649047a3	metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962 ) * Optimize Metal Tensor API usage for matmul2d Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure. * cont : cleanup * cont : cleanup * cont : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-25 15:14:28 +03:00
Concedo	b31877e8ec	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/pull_request_template.md # .gitignore # docs/backend/SYCL.md # docs/ops.md # docs/ops/WebGPU.csv # examples/sycl/test.sh # examples/sycl/win-test.bat # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/sycl_hw.cpp # ggml/src/ggml-sycl/sycl_hw.hpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp	2026-04-25 19:06:32 +08:00
Wagner Bruna	c04832bb2b	sd: add eta support (#2164 )	2026-04-25 19:04:13 +08:00
Concedo	18a3bedf63	fixed a deadlock	2026-04-25 19:03:03 +08:00
ddh0	9d34231bb8	llama-quant : default ftype param `Q5_1` --> `Q8_0` (#20828 ) Change the default `ftype` in `llama_model_quantize_params` from `LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`. In case some external program naively uses the default quantization params, we should probably default to a known-good type like Q8_0 rather than Q5_1, which is rather old.	2026-04-25 09:25:35 +03:00
Georgi Gerganov	8ea8fee966	gitignore : add .pi + personal SYSTEM.md (#22316 ) * gitignore : add .pi + personal SYSTEM.md * cont : fix requirements heading in PR template * cont : shorten line	2026-04-25 09:20:45 +03:00
Neo Zhang	eddd7a13a5	[SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291 ) * opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter	2026-04-25 09:20:14 +03:00
Reese Levine	dd2914dc81	ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327 ) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support	2026-04-25 09:18:15 +03:00
Concedo	ee2ecfbf81	updated sdui	2026-04-25 12:18:20 +08:00
Concedo	340b22283e	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/intel.Dockerfile # .github/workflows/build-android.yml # .github/workflows/build.yml # .github/workflows/release.yml # .gitignore # docs/backend/SYCL.md # docs/backend/snapdragon/README.md # examples/model-conversion/scripts/causal/convert-model.sh # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/hex-utils.h # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/htp-ctx.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/htp_iface.idl # ggml/src/ggml-hexagon/htp/hvx-base.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/libggml-htp.inf # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/mmvq.cpp # ggml/src/ggml-sycl/mmvq.hpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_blk.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl # scripts/server-test-structured.py # scripts/snapdragon/adb/run-bench.sh # scripts/snapdragon/adb/run-cli.sh # scripts/snapdragon/adb/run-completion.sh # scripts/snapdragon/adb/run-mtmd.sh # scripts/snapdragon/adb/run-tool.sh # scripts/snapdragon/qdc/requirements.txt # scripts/snapdragon/windows/run-bench.ps1 # scripts/snapdragon/windows/run-cli.ps1 # scripts/snapdragon/windows/run-completion.ps1 # scripts/snapdragon/windows/run-mtmd.ps1 # scripts/snapdragon/windows/run-tool.ps1 # tests/test-backend-ops.cpp # tools/cli/cli.cpp # ty.toml	2026-04-25 12:13:14 +08:00
Concedo	4090400dff	improved gemma toolcall handling	2026-04-25 09:51:29 +08:00
Piotr Wilkin (ilintar)	0adede866d	parser: fix structured output bug (#22302 ) * fix very stupid structured output bug * Things just cannot be too easy.	2026-04-24 23:19:55 +02:00
Trivikram Reddy	361fe72acb	Hexagon: Bump HMX Frequency to Max Corner (#22334 ) * hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg	2026-04-24 13:55:17 -07:00
Shreya Jain	a702f39597	CI Snapdragon: Switch ubuntu-latest to ubuntu-slim runner (#22303 ) * switch ubuntu-latest to ubuntu-slim * Fix the path for upload so CI doesn't fail * Update .github/workflows/build-and-test-snapdragon.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use -slim image for key check and consistent naming for artifact dir Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com> * Remove check-secret extra job * move QDC key check for Run QDC jobs step specifically * add a step before to check the secret for qdc jobs --------- Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-24 21:21:36 +02:00
Zheyuan Chen	13d36cf891	ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199 ) * ggml-webgpu: add tile flash attention fallback * ggml-webgpu: add new fields and discard usage of mnk for tile version * ggml-webgpu: modify the vec path to discard the mnk parameter * ggml-webgpu: enable flash attention vec and tile version for broswer * ggml-webgpu: stagging KV for flash attention tile version * formatting * turn on subgroup uniformity check * remove Q_TILE as it is always 1 for vec path * make row_max and exp_sum to local register * make different bindings with same underlying buffer to have the same usage flags * move path selection into the shader library and have the host consume a single flash-attn decision object. * turn off skip_validation and address buffer overlapping when nwg==1 * formatting * merge binding when kv overlap	2026-04-24 10:39:09 -07:00
Mengsheng Wu	f65bc34c68	hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (#22306 )	2026-04-24 09:21:33 -07:00
Georgi Gerganov	15fa3c493b	metal : print GPU description (#22318 )	2026-04-24 13:56:03 +03:00
Adrien Gallouët	dc80c5252a	common : fix jinja warnings with clang 21 (#22313 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-24 12:36:02 +02:00
Georgi Gerganov	e583f3b4f5	ggml : minor coding style (#22308 )	2026-04-24 11:02:00 +03:00
Georgi Gerganov	017f090442	jinja : remove unused header (#22310 )	2026-04-24 11:01:46 +03:00
Georgi Gerganov	ffdd983fb8	server : fix swa-full logic (#22288 )	2026-04-24 10:17:37 +03:00
Yes You Can Have Your Own	793d0a7931	server: rename debug tags to match --cache-idle-slots naming (#22292 )	2026-04-24 09:28:44 +03:00
Mengsheng Wu	8bc492ebb4	hexagon: add SOLVE_TRI op (#21974 ) * hexagon: add SOLVE_TRI op * ggml: fix TODO description for solve_tri * hexagon: rm unused variable/function warnings * hexagon: chunk vs batch processingfor better thread utilization * hexagon: vectorize partial f32 loads * hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-04-23 18:39:13 -07:00
Chen Yuan	e5f070a1dc	fix(shader): handle the buffer aliasing for rms fuse (#22266 )	2026-04-23 16:32:59 -07:00
Ethan Turner	fa0b8a70a8	cli: Remove redundant local sampling variables (#20429 ) (#22264 ) This change implements the third requested change in issue 20429. Because defaults.sampling contains the reasoning budget token count and the reasoning budget message, it's not necessary to assign them to struct variables.	2026-04-24 00:53:23 +02:00
Max Krasnyansky	5d2b52d80d	hexagon: add support for basic and extended Op profiling (#22269 ) * hexagon: restore HTP_OPMASK_QUEUE * hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul * hex-prof: restore op profiling * hex-prof: enable PMU * hexagon: simplify and improve op-queuing with full profiling support Add separate profile descriptors. * hexagon: remove opsync and rename opmask into opstage opsync is no longer needed since the profiler is fully async now. opmask name was confusing and opstage is more accurate. * hexagon: refactor opbatch queue handling * hexagon: add iface hooks for enabling profiler from the host Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use. * hexagon: make profiler mode configurable On older devices getting PMU counters is expensive so it's now optional. * hexagon: add support for setting profiler pmu events from env * hexagon: simplify profiler output (no need to print buffs, etc) * hexagon: simplify pmu counter formating * hexagon: add a simple profile post-proc tool * hex-prof: add support for reading logs from stdin * hexagon: document GGML_HEXAGON_PROFILE * hex-prof: update default width for dims field * hex-prof: fix linter warnings and errors * Update ggml/src/ggml-hexagon/htp/htp-ops.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/snapdragon/ggml-hexagon-profile.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-23 14:17:21 -07:00
Shreya Jain	187a456370	Enable testing on Snapdragon devices (#21051 ) * Add the tests that we want to run on external CI * remove extra files * Fixes python issues, reove the deadlock on CI * remove unecessary changes * use override to ty.toml * fix pre-commit and try tests with secret in external repo not upstream * skip if key is unavailable * Fix feedback * switch hexagon to snapdragon * cleanup * fix secrets * remove the copyrights at the top of the files	2026-04-23 13:08:10 -07:00
srkizer	185cbff6f1	server : convert_anthropic_to_oai: also copy chat_template_kwargs (#22154 )	2026-04-23 13:32:46 -05:00
Concedo	4e07c90eca	make buttons desc shorter	2026-04-24 00:42:39 +08:00
Song Li	c78fb909b2	server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) (#22267 ) * server: clamp n_discard to non-negative at JSON parse boundary (CVE-2026-21869) A negative n_discard from client JSON causes heap-buffer-overflow in update_slots() context-shift loop (CWE-787, CVSS 8.8). Clamp to 0 at ingress; n_discard=0 already triggers auto-discard (n_left/2). Ref: GHSA-8947-pfff-2f3c * cont : cleaner * cont : cleanerer * cont : cleanest --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-23 18:39:07 +02:00
Adrien Gallouët	12568ca8c8	vendor : update LibreSSL to 4.3.1 (#22285 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-23 17:45:56 +02:00
kvc0	c807c6e3b0	server: (anthropic API) fix prefix caching (#21793 ) When testing claude code against llama.cpp, I noticed that only n_past 18577 was used even when context was 60k or more. The log in llama-server says: ``` slot update_slots: id 3 \| task 10342 \| old: ... ; cch= \| defa0;You are slot update_slots: id 3 \| task 10342 \| new: ... ; cch= \| 1c8b4; ``` I observed that the cch value changed every time. Reading about that, the x-anthropic-billing-header system message seems to be specially handled inside of the anthropic api. I could remove it, but there is a meaningful string sometimes included at the end. So instead, I just replace the changing cch checksum with fffff. I'm treating this as an anthropic message body API detail - I think this is the right way to do this, but by all means please correct me! It's always 5 hexadecimal characters, but I've written the replacement defensively in case they change the protocol.	2026-04-23 17:45:02 +02:00
dpapasia	a135d0fd04	Provide a mechanism for downloading the params (json format) as well as the track. (#2154 ) This is very similar to 'load params' followed by 'Export JSON' but this names the json immediately following the same convention that is used for the music download. This is very useful if intending to always download the params for every track that you're downloading, as it saves you from having to rename one of the downloads.	2026-04-23 23:33:45 +08:00
Sigbjørn Skjæret	0949beb5a3	fix build number for sycl release (#22283 )	2026-04-23 21:38:58 +08:00
Daniel Bevenius	9012c50fc8	model-conversion : fix mmproj output file name [no ci] (#22274 ) * model-conversion : fix mmproj output file name [no ci] This commit updates the convert-model.sh script to properly handle mmproj output files. The motivation for this that currently the same name as the original model is used as the mmproj file, which causes the original model to be overwritten and no mmproj-<model_name>.gguf to be created. * model-conversion : use MODEL_NAME [no ci]	2026-04-23 15:07:38 +02:00
Matthias Straka	0dd7f915fd	cli : cleanup auto-completion code (#21745 )	2026-04-23 15:03:28 +02:00
Concedo	2cde0bffd2	minor text edit	2026-04-23 20:06:04 +08:00
Tarek Dakhran	550d684bd1	server: Enable transcriptions API for LFM2-Audio (#22000 )	2026-04-23 10:47:26 +02:00

13013 commits