koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-06 08:01:27 +00:00

Author	SHA1	Message	Date
Concedo	b6f6338bba	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build-linux-cross.yml # .github/workflows/build.yml # CODEOWNERS # ggml/CMakeLists.txt # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-webgpu/CMakeLists.txt # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl # tests/test-backend-ops.cpp # tests/test-chat-template.cpp # tools/llama-bench/llama-bench.cpp # tools/rpc/README.md # tools/server/README.md	2025-10-09 01:33:27 +08:00
Concedo	224800b33b	revert https://github.com/ggml-org/llama.cpp/pull/14904 , segfault on repacked q4_0 on avx2 cpu.	2025-10-09 00:37:05 +08:00
issixx	d2ee056e1d	server : fix cancel pending task (#16467) Co-authored-by: DevAI <DevAI@gmail.com>	2025-10-08 11:20:18 +03:00
Georgi Gerganov	b2c08c9ec4	metal : mark FA blocks (#16372 ) * metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic	2025-10-08 10:57:53 +03:00
Georgi Gerganov	7fdd16b432	server : improve context checkpoint logic (#16440 )	2025-10-08 10:57:29 +03:00
Reese Levine	74b8fc17f9	ggml webgpu: profiling, CI updates, reworking of command submission (#16452 ) * Add profiling * More detailed profiling * Rework command submission to avoid global locks * Update wait handling * try new method of waiting on futures * Add serializing of command submission in some cases * Add new pool for timestamp queries and clean up logging * Serialize command submission in CI and leave a TODO note * Update webgpu CI * Add myself as WebGPU codeowner * Deadlock avoidance * Leave WebGPU/Vulkan CI serialized * Fix divide by 0 * Fix logic in division by inflight_threads * Update CODEOWNERS and remove serialize submit option	2025-10-07 13:48:56 -07:00
Tarek Dakhran	aeaf8a36f0	llama : support LiquidAI LFM2-MoE hybrid model (#16464 ) * llama : support LiquidAI LFM2-MoE hybrid model Add support for [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B) model. For more information about models, please read [the blog post](https://www.liquid.ai/company/news). [HF PR](https://github.com/huggingface/transformers/pull/41401) [GGUFs](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) * Do not use defaultdict * Address PR feedback	2025-10-07 20:03:35 +02:00
Concedo	9919fdc83a	granite 4 template	2025-10-08 00:09:46 +08:00
Concedo	c1a246c1de	fixed typo	2025-10-07 21:51:15 +08:00
Georgi Gerganov	df1b612e29	server : add `/v1/health` endpoint (#16461 ) * server : add /v1/health endpoint * cont : update readme	2025-10-07 15:57:14 +03:00
Concedo	3b30f12ca7	future proof handling of rnn models	2025-10-07 19:12:47 +08:00
Sascha Rogmann	4e0388aa8a	webui : added download action (#13552 ) (#16282 ) * webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * webui : Updated static build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-10-07 11:11:08 +02:00
Georgi Gerganov	ef4c5b87ea	presets : fix pooling param for embedding models (#16455 )	2025-10-07 10:32:32 +03:00
Radoslav Gerganov	c61ae20d05	rpc : update documentation (#16441 ) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-07 06:59:13 +00:00
Concedo	7857578f45	handle more rnn models	2025-10-07 13:47:15 +08:00
Georgi Gerganov	0123ff38f5	memory : use sequential equal splits for recurrent modules (#16442 )	2025-10-07 08:24:17 +03:00
Georgi Gerganov	0a319bb75e	metal : add support for non-padded FA KV (#16148 ) * metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement	2025-10-07 08:23:30 +03:00
Georgi Gerganov	1d6092fc72	tests : add -INF blocks to the KQ mask in the FA tests (#16380 ) * tests : add -INF blocks to the KQ mask in the FA tests * cont : bump -INF block size to 64 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * ggml : prevent division by zero in FA CPU op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-07 08:22:35 +03:00
Georgi Gerganov	8ae32dc9ec	metal : various optimizations + refactoring (#16446 ) * metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt	2025-10-07 08:21:40 +03:00
Gadflyii	3df2244df4	llama : add --no-host to disable host buffers (#16310 ) * implement --no-host to disable host buffer * fix equal_mparams * move no-host enumeration order together with other model params --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-10-06 19:55:53 +02:00
Gabe Goodhart	c08002a198	chat : Granite Docling stopping (#16438 ) * fix: Fix duplicate fake image before token on first slice Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use double-newline before overview image Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove incorrect newline at the end of granite chat template gen prompt There should not be one, even for the language models. Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * tests: Remove bad newline from granite chat template test (legacy) Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-10-06 18:59:40 +02:00
Sigbjørn Skjæret	3a002afafa	ci : refactor sdk caching to minimize storage (#16414 ) * refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]	2025-10-06 17:40:21 +02:00
Concedo	bb5cef1756	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/nix/package.nix # ci/run.sh # ggml/src/ggml-cpu/amx/amx.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/rms_norm.wgsl # tools/server/README.md	2025-10-06 22:41:46 +08:00
Concedo	f2b9b93838	updated lite	2025-10-06 21:59:51 +08:00
Georgi Gerganov	a23b9bdbd3	ggml : fix unaligned access in AMX code (#16315)	2025-10-06 16:05:27 +03:00
Daniel Bevenius	04e632a4aa	ci : remove missing reranker model files (#16444 ) This commit removes jina-reranker-v1-tiny-en model files that are no longer present on Hugging Face. The motivation for this that it clears up the CI logs from 404 errors which can be a little confusing when looking at the logs the first time. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649	2025-10-06 14:56:59 +02:00
Daniel Bevenius	a80ff183ab	ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (#16443) This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2*ggml_f32_epr elements per iteration, there can be up to (2*ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled. Example scenario with 256-bit SVE: ``` ggml_f32_epr = 8 (elements per register) ggml_f32_step = 16 (two registers per iteration) n = 25 np = 16 leftovers = 9 elements (16-24) Original : processes only elements 16-23, misses element 24 This commit : loop processes elements 16-23, then element 24 ``` Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630	2025-10-06 14:17:12 +02:00
Yuannan	1d49ca3759	nix : removed metal for nix (#16118 )	2025-10-06 12:29:56 +03:00
Oleksandr Kuvshynov	c5fef0fcea	server: update readme to mention n_past_max metric (#16436 ) https://github.com/ggml-org/llama.cpp/pull/15361 added new metric exported, but I've missed this doc.	2025-10-06 10:53:31 +03:00
Concedo	2fa28fdcf8	wrap sd_parse_meta_field in trycatch	2025-10-06 00:05:19 +08:00
Wagner Bruna	c48999f7c0	additional options for image generation (#1765 ) * sd: add backend support for choosing the default sampler * use the default sampler on the API * sd: add backend support for the scheduler * sd: add backend support for distilled guidance * sd: add backend support for timestep-shift * sd: add a config field to set default image gen options	2025-10-05 23:36:20 +08:00
Gabe Goodhart	ca71fb9b36	model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-10-05 14:57:47 +02:00
Concedo	75272f62af	remove gif-h	2025-10-05 17:49:29 +08:00
Concedo	a09d8333b5	allow lowvram (nkvo) to be used with vulkan.	2025-10-05 16:18:58 +08:00
Concedo	b5bec86231	simple quick triage for vulkan compilation	2025-10-05 14:25:35 +08:00
Reese Levine	35266573b9	ggml webgpu: actually add softmax, fix rms_norm offset (#16400 ) * implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit	2025-10-04 20:59:31 -07:00
Concedo	c83dde8a34	not working commit, need to fix vulkan shaders gen	2025-10-05 11:32:50 +08:00
Concedo	76818cb67a	update readme	2025-10-05 10:37:43 +08:00
Eve	86df2c9ae4	vulkan: use a more appropriate amount of threads when generating shaders (#16418 ) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax	2025-10-04 22:04:27 +02:00
Concedo	1d728bbc89	Merge commit '128d522c04' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/release.yml # ggml/src/ggml-vulkan/ggml-vulkan.cpp # tests/test-alloc.cpp # tests/test-chat.cpp	2025-10-04 23:51:22 +08:00
Concedo	b8680cd0c5	Revert "preserve rocm6 ci" This reverts commit `9df2a02c4c`. (+1 squashed commits) Squashed commits: [2e96da6f0] Revert "ROCm 7 CI (#1752)" This reverts commit `118e589743`.	2025-10-04 23:33:59 +08:00
Concedo	ef773cd8cc	updated lite	2025-10-04 23:29:47 +08:00
Radoslav Gerganov	f39283960b	rpc : check src buffer when copying tensor (#16421 ) Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.	2025-10-04 16:22:45 +03:00
Wagner Bruna	a27d71f95f	fix VAE tiling for Qwen Image (#1774 ) leejet/stable-diffusion.cpp#873	2025-10-04 20:44:43 +08:00
Concedo	a98b63013e	allow tiling on qwen image	2025-10-04 20:43:36 +08:00
Radoslav Gerganov	898acba681	rpc : add support for multiple devices (#16276 ) * rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order	2025-10-04 12:49:16 +03:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Concedo	bb06956b2d	allow wan to use img2img via init image	2025-10-04 11:25:46 +08:00
Concedo	db37688b47	qwen image disable VAE tiling as it's broken	2025-10-04 11:19:19 +08:00
henk717	118e589743	ROCm 7 CI (#1752 ) * Bump ROCm * Container experiment * Can 7.0 compile it on its own? * Clean the env before pulling docker * Cleanup attempt 2 * Fix cleanup test 2 * Bing attempts to save ROCm users * CI binary location fix attempt * Attempt to fix Docker env vars (make it compile rocm again) * Update kcpp-build-release-linux-rocm.yaml * Less fancy ROCm spelling	2025-10-04 09:11:12 +08:00
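
Note on a80ff183ab (ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE): the scenario in that commit message can be modelled in plain Python. This is only a scalar sketch of the described behaviour, not the SVE code itself; the function names `scale_single_pass` and `scale_looped` are hypothetical and exist only for this illustration. It uses the numbers from the message (8 elements per register, step of 16, n = 25).

```python
# Scalar model of the chunked scaling loop described in a80ff183ab.
# The real implementation uses SVE intrinsics; names here are hypothetical.

def scale_single_pass(x, v, epr):
    """Behaviour described as 'original': one leftover pass of at most epr elements."""
    y = list(x)
    step = 2 * epr                 # main loop handles two registers per iteration
    n = len(y)
    np_ = (n // step) * step       # elements covered by the main loop
    for i in range(np_):
        y[i] *= v
    # Single leftover pass: can touch at most epr elements.
    for i in range(np_, min(np_ + epr, n)):
        y[i] *= v
    return y

def scale_looped(x, v, epr):
    """Behaviour described as 'this commit': loop over leftovers until none remain."""
    y = list(x)
    step = 2 * epr
    n = len(y)
    np_ = (n // step) * step
    for i in range(np_):
        y[i] *= v
    # Leftover loop: consume chunks of up to epr elements until n is reached.
    i = np_
    while i < n:
        end = min(i + epr, n)
        for j in range(i, end):
            y[j] *= v
        i = end
    return y

x = [1.0] * 25
old = scale_single_pass(x, 2.0, epr=8)
new = scale_looped(x, 2.0, epr=8)
# Indices the single-pass version left unscaled.
missed = [i for i, val in enumerate(old) if val != 2.0]
print(missed)
```

With these inputs the single-pass model leaves index 24 untouched, matching the "misses element 24" line in the commit message, while the looped model scales all 25 elements.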

9871 commits
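
Note on b2c08c9ec4 and 1d6092fc72 (division by zero in the FA kernels with -INF blocks in the KQ mask): the commit messages do not spell out the mechanism, so the following is a hedged sketch of one way the problem can arise. It assumes a streaming softmax over a single attention row, with scalar values for simplicity; the function `attention_row` and the guard on the inverse sum are illustrative assumptions, not the actual ggml or Metal code.

```python
import math

def attention_row(scores, mask, values):
    """Streaming softmax-weighted sum for one query row (scalar values).

    Positions whose mask entry is -inf are skipped. If every position is
    masked, the running sum stays 0 and a naive 1/sum would divide by zero.
    """
    m = -math.inf   # running maximum of the logits seen so far
    s = 0.0         # running sum of exp(logit - m)
    acc = 0.0       # running weighted sum of values
    for sc, mk, v in zip(scores, mask, values):
        if mk == -math.inf:
            continue                    # masked position contributes nothing
        x = sc + mk
        m_new = max(m, x)
        scale = math.exp(m - m_new)     # rescale earlier terms; 0.0 on first hit
        p = math.exp(x - m_new)
        acc = acc * scale + p * v
        s = s * scale + p
        m = m_new
    # Assumed guard: treat a fully masked row as producing zero output.
    inv = 0.0 if s == 0.0 else 1.0 / s
    return acc * inv

ninf = -math.inf
fully_masked = attention_row([1.0, 2.0], [ninf, ninf], [3.0, 4.0])
unmasked = attention_row([0.0, 0.0], [0.0, 0.0], [1.0, 3.0])
half_masked = attention_row([0.0, 5.0], [0.0, ninf], [1.0, 3.0])
```

Without the guard, the fully masked row would evaluate `1.0 / 0.0`, which raises `ZeroDivisionError` in Python; in floating-point C code the same division yields inf, and the subsequent 0 * inf yields NaN.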