Commit graph

12436 commits

Author SHA1 Message Date
Concedo
df6b7b5fdb Merge branch 'concedo_experimental' of https://github.com/LostRuins/koboldcpp into concedo_experimental 2026-03-29 01:25:07 +08:00
Concedo
3eedde8ab5 Merge commit 'ded446b34c' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	tests/test-backend-ops.cpp
2026-03-29 01:24:31 +08:00
Wagner Bruna
9223f41320 sd: call SetCircularAxesAll directly (#2078) 2026-03-29 01:17:48 +08:00
Concedo
8760d22a84 switch back to newly updated jimver github cuda toolkit 2026-03-29 01:17:11 +08:00
Concedo
aac220f7e3 Merge commit '0fac87b157' into concedo_experimental
# Conflicts:
#	.github/workflows/build-android.yml
#	.github/workflows/hip-quality-check.yml
#	docs/multimodal.md
#	scripts/hip/gcn-cdna-vgpr-check.py
#	scripts/snapdragon/windows/run-bench.ps1
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/snapdragon/windows/run-tool.ps1
#	tests/test-backend-ops.cpp
#	tests/test-llama-archs.cpp
#	tools/imatrix/imatrix.cpp
#	tools/mtmd/CMakeLists.txt
2026-03-29 01:14:33 +08:00
Concedo
674b7f5eee indicate support for claude messages api 2026-03-29 00:57:58 +08:00
Concedo
e3b7905e1c added anthropic messages api support 2026-03-29 00:55:32 +08:00
Concedo
5ad9e3ee31 crude openai responses streaming 2026-03-29 00:16:30 +08:00
Concedo
94b266a6b0 musicui fix reset defaults 2026-03-28 21:09:40 +08:00
Concedo
1e787cd03a improve responses api 2026-03-28 18:42:15 +08:00
Concedo
f768b2a4bd whatever, i tried 2026-03-28 17:32:07 +08:00
Concedo
f80fdd4314 updated sdui 2026-03-28 11:24:03 +08:00
Concedo
547659fdbf allow planning music with llm (+1 squashed commits)
Squashed commits:

[9a3bbf072] allow planning music with llm
2026-03-28 11:19:39 +08:00
Concedo
3ec6381123 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/build.yml
#	.github/workflows/copilot-setup-steps.yml
#	.github/workflows/gguf-publish.yml
#	ci/run.sh
#	docs/backend/OPENVINO.md
#	examples/llama.android/lib/src/main/cpp/ai_chat.cpp
#	ggml/src/ggml-sycl/add-id.cpp
#	requirements/requirements-pydantic.txt
#	tests/test-gguf.cpp
#	tests/test-jinja.cpp
#	tests/test-llama-archs.cpp
#	tools/gguf-split/README.md
#	tools/llama-bench/llama-bench.cpp
2026-03-28 01:18:20 +08:00
Concedo
2cdf02102e preserve previous filename 2026-03-28 01:13:03 +08:00
Wagner Bruna
e3c6227d46 sd: report back image generation parameters and metadata (#2062)
* sd: refactor image generation result handling

* sd: report back image generation metadata
2026-03-28 00:49:03 +08:00
Concedo
0c2b679ea3 support bf16 quantkv cache type 2026-03-28 00:01:17 +08:00
Concedo
326542f480 rudimentary responses api, not usable yet 2026-03-27 23:38:08 +08:00
Concedo
81cebb6179 remove unused field 2026-03-27 22:52:36 +08:00
scottf007
f0818e1eae Add socket timeout to is_port_in_use() to fix ~280s startup delay on WSL2 (#2077)
On WSL2 with networkingMode=mirrored, connect_ex() to non-listening ports
gets black-holed through the Windows host networking stack instead of
returning ECONNREFUSED. Without a timeout, TCP SYN retransmits with
exponential backoff (1+2+4+8+16+32+64 = 127s per port), causing Router
Mode's port scan of 15001-15010 to stall for ~280 seconds on startup.

Adding a 1-second timeout makes connect_ex() fail fast, reducing startup
from ~303s to ~23s on affected systems.

Tested on WSL2 Ubuntu 24.04 with mirrored networking, KoboldCpp v1.110,
RTX 3090 Ti, Qwen3.5-27B Q4_K_M.
2026-03-27 22:50:59 +08:00
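For illustration, a minimal Python sketch of the fix described above (the helper name matches the commit title, but the signature and defaults here are assumptions, not the exact KoboldCpp code):

    import socket

    def is_port_in_use(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
        # The 1-second timeout makes connect_ex() fail fast when SYNs are
        # black-holed (WSL2 mirrored networking) instead of waiting out the
        # full TCP retransmit backoff of ~127s per port.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            return s.connect_ex((host, port)) == 0  # 0 => something is listening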
Concedo
a03998bed6 added jinja kwargs support 2026-03-27 00:28:59 +08:00
lhez
ded446b34c opencl: allow large buffer for adreno (#20997) 2026-03-26 08:52:21 -07:00
Michael Wand
f8d4abae86 convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505)
* convert : fix Qwen3.5 NVFP4 conversion

* Addressed copilot concerns and rebased

* move into _LinearAttentionVReorderBase and simplify

* --flake

* new_name not needed

* Added input_scale to gguf

* Fixed input_scale addition as tensor

* Added input scale to loader and named _in_s

* Update convert_hf_to_gguf.py

Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-26 16:52:06 +01:00
Pavel Zloi
3d5acab3e7 convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011)
* Added support for the ruGPT3XL model

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* chkhsh for ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fixing chkhsh for ruGPT3XL, rerun updated and _qkv_parts in RuGPT3XLModel

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-26 16:49:09 +01:00
Concedo
c91f350ed5 increase max images, take images from the end instead of beginning if too many images 2026-03-26 23:03:52 +08:00
Adrien Gallouët
9900b29c3a common : filter out imatrix when finding models (#21023)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-26 15:37:18 +01:00
Concedo
4a5c903718 sd model replacement logic: adjusted approach for easy merge 2026-03-26 21:57:42 +08:00
ihb2032
dc8d14c582 fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (#20888)
Signed-off-by: ihb2032 <hebome@foxmail.com>
2026-03-26 13:08:41 +02:00
Adrien Gallouët
93dfbc1291 common : make LLAMA_CACHE the one cache for everything (#21009)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-26 12:04:57 +01:00
Adrien Gallouët
3cba8bba18 common : fix split model migration (#21019)
Sadly, the manifest does not list all required files; I honestly thought
it did.

Without the files listed we don't have their sha256 hashes, so if the first
file is valid and all the others have the correct size, we assume we are
good and do the migration.

Here is my test:

    $ find /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ build/bin/llama-server
    ================================================================================
    WARNING: Migrating cache to HuggingFace cache directory
      Old cache: /home/angt/.cache/llama.cpp/
      New cache: /home/angt/.cache/huggingface/hub
    This one-time migration moves models previously downloaded with -hf
    from the legacy llama.cpp cache to the standard HuggingFace cache.
    Models downloaded with --model-url are not affected.
    ================================================================================
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    migrate_old_cache_to_hf_cache: migration complete, deleting manifest: /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ find /home/angt/.cache/llama.cpp /home/angt/.cache/huggingface
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/huggingface
    /home/angt/.cache/huggingface/hub
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/50d019817c2626eb9e8a41f361ff5bfa538757e6f708a3076cd3356354a75694
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/7b273e1dbfab11dc67dce479deb5923fef27c39cbf56a20b3a928a47b77dab3c
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs/main
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-26 12:04:37 +01:00
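A hedged Python sketch of the heuristic described above (hypothetical helper; the real migration logic is C++ in common, and where the expected shard sizes come from is assumed here):

    import hashlib
    from pathlib import Path

    def can_migrate_split_model(parts: list[Path], first_sha256: str, sizes: list[int]) -> bool:
        # Only the first shard's sha256 is known from the manifest: verify it fully.
        digest = hashlib.sha256(parts[0].read_bytes()).hexdigest()
        if digest != first_sha256:
            return False
        # For the remaining shards no hash is available, so a matching file
        # size is treated as good enough to proceed with the migration.
        return all(p.stat().st_size == n for p, n in zip(parts[1:], sizes[1:]))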
Concedo
25216a0793 update cuda toolkit to use node24 with a fork 2026-03-26 17:16:22 +08:00
Michael Wand
112c78159f ggml-cuda: Add NVFP4 dp4a kernel (#20644)
Added check for dst_t to cuda_cast template for float
Restored ggml_cuda_ue4m3_to_fp32, changed vecdot ints to int32ts
Added CUDART/HIP Check and HIP/fp8 include
Added NVFP4 to Test-backend-ops
Added hip_fp8_e4m3 to __nv_fp8_e4m3 typedef

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-26 09:54:03 +01:00
Concedo
633222d2e3 fix tool builds 2026-03-26 15:15:58 +08:00
SamareshSingh
0fac87b157 imatrix : fix crash when using --show-statistics with zero counts (#19532)
* imatrix: fix crash when using --show-statistics with zero counts

Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors.

Fixes bug #19190

* imatrix: lower log level for zero-count skip message to DBG
2026-03-26 08:14:36 +01:00
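The guard is straightforward; a minimal Python sketch of the same idea (the actual fix is C++ in tools/imatrix, names here are illustrative):

    def activation_means(sums: list[float], counts: list[int]) -> list[float]:
        means = []
        for s, c in zip(sums, counts):
            if c == 0:
                continue  # skip zero-count entries instead of dividing by zero
            means.append(s / c)
        return means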
Yihao Wang
0a524f2404 CUDA & CPU: support F32 kernel type for CONV_TRANSPOSE_2D (#17094)
* Refactor CUDA 2D transpose implementation to support multiple kernel types and improve parameter handling

- Introduced a `conv2d_transpose_params` struct for better parameter management.
- Updated `conv2d_transpose_kernel` to be templated for different kernel types (float and half).
- Modified `ggml_cuda_conv_2d_transpose_p0` to handle both F16 and F32 kernel types.
- Enhanced test cases to validate functionality for both kernel types.

* Refactor test cases for 2D convolution transpose to support dynamic kernel types

- Updated `test_conv_transpose_2d` structure to improve parameter handling by reordering constructor arguments.
- Enhanced test case generation to iterate over kernel types, allowing for flexible testing of different configurations.
- Removed hardcoded kernel type instances in favor of a loop for better maintainability and scalability.

* Refactor ggml_compute_forward_conv_transpose_2d to support both F16 and F32 tensor types.

* Refactor conv2d transpose kernel to use a template for kernel type, enhancing flexibility for different data types.
Update test cases to include both F16 and F32 tensor types for comprehensive coverage.

* Update ggml/src/ggml-cuda/conv2d-transpose.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Update ggml/src/ggml-cpu/ggml-cpu.c

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Refactor conv2d transpose implementation by removing the conv2d_transpose_params struct and dispatching with direct kernel launch.

* Enhance cpu conv2d transpose implementation by introducing a templated kernel type for improved flexibility with F16 and F32 data types.

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-03-26 10:19:14 +08:00
Adrien Gallouët
c0159f9c1f common : do not delete old files from the old cache when updating (#21000)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 22:28:04 +01:00
Saba Fallah
a970515bdb mtmd: Add DeepSeekOCR Support (#17400)
* mtmd: llama.cpp DeepSeekOCR support
init commit

* loading sam tensors

* mtmd: fix vision model processing

* deepseek-ocr clip-vit model impl

* mtmd: add DeepSeek-OCR LM support with standard attention

* mtmd: successfully runs DeepSeek-OCR LM in llama-cli

* mtmd: Fix RoPE type for DeepSeek-OCR LM.

* loading LM
testing Vision model loading

* sam warmup working

* sam erroneous return corrected

* clip-vit:  corrected cls_embd concat

* clip-vit: model convert  qkv_proj split

* corrected combining of image encoders' results

* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model

* concat image_newline and image_seperator tokens

* visual_model warmup (technically) works

* window partitioning using standard ggml ops

* sam implementation without using CPU only ops

* clip: fixed warnings

* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr

* mtmd: fix get_rel_pos

* mtmd: fixed the wrong scaler for get_rel_pos

* image encoding technically works but the output can't be checked since image decoding fails

* mtmd: minor changes

* mtmd: add native resolution support

* - image encoding debugged
- issues fixed, mainly related to wrong config values like n_patches etc.
- configs need to be corrected in the converter

* mtmd: correct token order

* - dynamic resizing
- the changes concern PR https://github.com/sfallah/llama.cpp/pull/4

* mtmd: quick fix token order

* mtmd: fix dangling pointer

* mtmd: SAM numerically works

* mtmd: debug CLIP-L (vit_pre_ln)

* mtmd: debug CLIP-L & first working DeepSeek-OCR model

* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work

* mtmd: simplify SAM patch embedding

* mtmd: adapt Pillow image resizing function

* mtmd:  simplify DeepSeek-OCR dynamic resolution preprocessing

* mtmd: remove --dsocr-mode argument

* mtmd: refactor code & remove unused helper functions

* mtmd: fix tensor names for image newlines and view separator

* clean up

* reverting automatically removed spaces

* reverting automatically removed spaces

* mtmd: fixed bad ocr check in Deepseek2 (LM)

* mtmd: support combined QKV projection in build_vit

* using common build_attn in sam

* corrected code branch when flash-attn is disabled,
enabling usage of the --flash-attn option

* mtmd: minor fix

* minor formatting and style

* fixed flake8 lint issues

* minor editorconfig-check fixes

* minor editorconfig-check fixes

* mtmd: simplify get_rel_pos

* mtmd: make sam hparams configurable

* mtmd: add detailed comments for resize_bicubic_pillow

* mtmd: fixed wrong input setting

* mtmd: convert model in FP16

* mtmd: minor fix

* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template

* fix: test-1.jpg OCR issue with small (640) resolution;
set min resolution to base (1024) and max to large (1280) for dynamic resolution

* minor: editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn

* minor: editorconfig-check fix

* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR

* quick and (potentially) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909

* refactoring, one single builder function and static helpers

* added deepseek-ocr test to tests.sh

* minor formatting fixes

* check with fixed expected results

* minor formatting

* editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042

* minor
- added GLM-4.6V to big tests
- added missing deps for python test

* convert: minor fix

* mtmd: format code

* convert: quick fix

* convert: quick fix

* minor python formatting

* fixed merge build issue

* merge resolved
- fixed issues in convert
- tested several deepseek models

* minor fix

* minor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions

* - cleaning commented out code

* fix instability issues by reintroducing resize_bicubic_pillow

* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr

* rename fc_w --> mm_fc_w

* add links to OCR discussion

* cleaner loading code

* add missing .weight to some tensors

* add default jinja template (to be used by server)

* move test model to ggml-org

* rolling back upscale change

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-03-25 19:57:40 +01:00
Adrien Gallouët
056b50c319 common : fix verbosity setup (#20989)
The verbosity threshold was set at the end of common_params_parse_ex(),
after doing many things (like downloading files).

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:41:01 +01:00
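A small Python analogue of the ordering fix (hypothetical names; the real code is C++): set the log threshold as soon as the flag is parsed, before any side-effecting work runs.

    import logging

    def setup_and_run(verbosity: int) -> None:
        # Apply the verbosity threshold first, so that later steps such as
        # downloading files already log at the requested level.
        logging.basicConfig(level=max(logging.WARNING - 10 * verbosity, logging.DEBUG))
        logging.info("downloading model files...")  # now honours the -v flags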
Adrien Gallouët
f2c72b8f1f common : fix gguf selection in common_list_cached_models (#20996)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:18:06 +01:00
uvos
ec54ac13a8 ci : fix parsing of vgpr counts in hip-quality-check (#20987)
* scripts: hip: gcn-cdna-vgpr-check: fix parsing of vgpr counts when an amdclang Remark block is interleaved with another from a different process

* Return warning ignore

* obey PEP 8: double space before inline comments

* add # noqa: NP100 for other prints too

* Add script changes to cause autotrigger
2026-03-25 19:00:37 +01:00
Saba Fallah
80322ebdaf model: codefuse-ai/F2LLM-v2 support 2026-03-25 18:33:42 +01:00
Dowon
44c51e526b model : allow causal_attn and pooling_type on all architectures (#20973)
* models : allow causal_attn and pooling_type on all architectures

* fix: move location
2026-03-25 18:12:38 +01:00
Aparna M P
1922f87c2f snapdragon: add missing features to WoS scripts to achieve parity with ADB scripts (#20884)
* Add missing features to WoS scripts to achieve parity with ADB scripts

* Fix line-ending in run-mtmd.ps1

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>

---------

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-25 09:43:12 -07:00
Shreya Jain
345de3cd87 Use docker in build-android.yml (#20928)
* use docker instead of SDK separately

* fix whitespaces

* Update .github/workflows/build-android.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-25 09:36:27 -07:00
Concedo
9de6e0db8b up version for github actions except for jimver (not available yet) 2026-03-25 23:46:03 +08:00
Concedo
c00fe0af5a Merge commit '9f102a1407' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/pull_request_template.md
#	CODEOWNERS
#	README.md
#	common/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/binary-ops.c
#	ggml/src/ggml-hexagon/htp/hex-dma.c
#	ggml/src/ggml-hexagon/htp/hex-dma.h
#	ggml/src/ggml-hexagon/htp/hex-dump.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/ssm-conv.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/sync_vendor.py
#	tests/test-backend-ops.cpp
#	tools/llama-bench/llama-bench.cpp
2026-03-25 23:45:41 +08:00
Concedo
39938e19d3 allow router mode to auto-wake other endpoints if put to sleep by auto unload 2026-03-25 23:17:20 +08:00
Concedo
8a6c41dc5c Merge commit '841bc203e2' into concedo_experimental
# Conflicts:
#	.github/workflows/ai-issues.yml
#	embd_res/templates/HuggingFaceTB-SmolLM3-3B.jinja
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-openvino/ggml-openvino.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-chat-auto-parser.cpp
#	tests/test-jinja.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-03-25 22:49:53 +08:00
Concedo
c6213e9be6 Revert "Revert "llama : disable graph reuse with pipeline parallelism (#20463)""
This reverts commit 8043f35b22.
2026-03-25 22:25:20 +08:00
Concedo
b81103d6ba clean up colab a bit 2026-03-25 22:14:38 +08:00