koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-12 14:11:27 +00:00

Author	SHA1	Message	Date
lhez	5016b72862	opencl: fix build targeting CL 2 (#16554 )	2025-10-13 11:50:37 -07:00
Johannes Gäßler	7049736b2d	CUDA: fix numerical issues in tile FA kernel (#16540 )	2025-10-13 17:29:45 +03:00
Jie Fu (傅杰)	01d2bdc2bc	ggml : fix build broken with -march=armv9-a on MacOS (#16520 ) * ggml : fix build broken with -march=armv9-a on MacOS Signed-off-by: Jie Fu <jiefu@tencent.com> * Add #pragma message Signed-off-by: Jie Fu <jiefu@tencent.com> * Address review comment. Signed-off-by: Jie Fu <jiefu@tencent.com> * Update ggml/src/ggml-cpu/ggml-cpu.c --------- Signed-off-by: Jie Fu <jiefu@tencent.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-13 15:48:47 +03:00
Chenguang Li	56fc38b965	CANN: fix CPU memory leak in CANN backend (#16549 ) This commit fixes a CPU-side memory leak issue in the CANN backend, which occurred when intermediate aclTensorList objects were not properly released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), leading to increasing host memory usage over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.	2025-10-13 17:01:24 +08:00
Pascal	1fb9504eb7	fix: add remark plugin to render raw HTML as literal text (#16505 ) * fix: add remark plugin to render raw HTML as literal text Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs do ensuring consistent and safe Markdown rendering Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the Markdown AST into plain-text equivalents while preserving indentation and line breaks. This ensures consistent rendering and prevents unintended HTML execution, without altering valid Markdown structure Kept 'remarkRehype' in the pipeline since it performs the required conversion from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization Refined the link-enhancement logic to skip unnecessary DOM rewrites, fixing a subtle bug where extra paragraphs were injected after the first line due to full innerHTML reconstruction, and ensuring links open in new tabs only when required Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml -> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify * fix: address review feedback from allozaur * chore: update webui build output	2025-10-13 10:55:32 +02:00
Concedo	833a778b18	try fix cu11 fa again	2025-10-13 16:36:59 +08:00
Sam/Samuel	3f750f8d76	metal: add support for opt_step_sgd (#16539 ) * metal: add support for opt_step_sgd * add newline to pass EditorConfig check	2025-10-13 11:25:02 +03:00
Georgi Gerganov	c515fc5771	ggml : fix scalar path for computing norm (#16558 )	2025-10-13 11:22:27 +03:00
Concedo	3a42c6b523	apply fix from https://github.com/ggml-org/llama.cpp/pull/16558	2025-10-13 15:31:26 +08:00
Concedo	c6884a1462	Revert "revert https://github.com/ggml-org/llama.cpp/pull/15953 for now as it breaks kokoro" This reverts commit `20678ddca1`.	2025-10-13 15:25:53 +08:00
Concedo	ca8f36195f	try fix cu11 fa	2025-10-13 14:50:47 +08:00
Concedo	8b787866c6	fixed a typo	2025-10-13 11:14:38 +08:00
Concedo	59aa1529dc	add embeddings vulkan to makefile	2025-10-13 11:05:45 +08:00
Concedo	20678ddca1	revert https://github.com/ggml-org/llama.cpp/pull/15953 for now as it breaks kokoro	2025-10-13 10:36:51 +08:00
hipudding	f9bc66c3eb	CANN: Update several operators to support FP16 data format (#16251 ) Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation, and then cast back to FP32 after computation, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less workload compared to FP32, leading to noticeable efficiency improvements. In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5b model shows correct accuracy and about 10% performance gain in concurrent scenarios. Co-authored-by: noemotiovon <757486878@qq.com>	2025-10-13 08:52:22 +08:00
Sam/Samuel	a31cf36ad9	metal : add opt_step_adamw and op_sum (#16529 ) * scaffold to support opt step adamw on metal (not written so far) * add opt-step-adamw kernel for metal * pass op->src[4] as a separate buffer to the pipeline * add bounds check to opt-step-adamw kernel * complete scaffold for GGML_OP_SUM * naive GGML_OP_SUM kernel * remove unwanted comment * change OP_SUM capability gate * Add has_simdgroup_reduction to both ops to pass CI	2025-10-12 21:43:14 +03:00
Pascal	81d54bbfd5	webui: remove client-side context pre-check and rely on backend for limits (#16506 ) * fix: make SSE client robust to premature [DONE] in agentic proxy chains * webui: remove client-side context pre-check and rely on backend for limits Removed the client-side context window pre-check and now simply sends messages while keeping the dialog imports limited to core components, eliminating the maximum context alert path Simplified streaming and non-streaming chat error handling to surface a generic 'No response received from server' error whenever the backend returns no content Removed the obsolete maxContextError plumbing from the chat store so state management now focuses on the core message flow without special context-limit cases * webui: cosmetic rename of error messages * Update tools/server/webui/src/lib/stores/chat.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/stores/chat.svelte.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-10-12 18:06:41 +02:00
Neo Zhang Jianyu	c7be9febcb	[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521 ) * fix/refactor OP argsort, pad * fix count-equal op * update SYCL OP list * fix format issue --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-12 21:53:35 +08:00
Mathieu Baudier	8415f61e23	ci : add Vulkan on Ubuntu with default packages build (#16532 ) * ci: build Vulkan on Ubuntu with default packages * ci: disable tests in Vulkan build with default Ubuntu packages	2025-10-12 15:48:03 +02:00
Aldehir Rojas	2c301e91ab	common : handle unicode during partial json parsing (#16526 ) * common : handle unicode during partial json parsing * common : set missing `ensure_ascii = true` during json dump	2025-10-12 16:18:47 +03:00
Concedo	121e2fefc8	updated lite	2025-10-12 20:52:16 +08:00
Concedo	54db35cd7a	fix t5 scale as well	2025-10-12 20:35:46 +08:00
Concedo	e0ba01c65e	fix cuda builds	2025-10-12 20:09:16 +08:00
Concedo	1a360b8458	sdcpp: optimize the handling of the FeedForward precision fix (+1 squashed commits) Squashed commits: [621ff6392] sdcpp: optimize the handling of the FeedForward precision fix (+1 squashed commits) Squashed commits: [05b16906c] sdcpp: optimize the handling of the FeedForward precision fix	2025-10-12 17:49:38 +08:00
Concedo	9503547ca1	Merge remote-tracking branch 'lcpp/gg/cacheless-embd' into concedo_experimental	2025-10-12 16:47:48 +08:00
Concedo	7e7da2583e	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/ggml-cuda/CMakeLists.txt # ggml/src/ggml-cuda/common.cuh # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-musa/CMakeLists.txt	2025-10-12 16:42:51 +08:00
Concedo	76d5fcbe49	fix the issue that occurs when using CUDA with k-quants weights	2025-10-12 16:18:03 +08:00
Georgi Gerganov	d4d465bce4	graph : support cacheless embeddings with FA and iSWA	2025-10-12 10:35:38 +03:00
Georgi Gerganov	4b2dae383d	common : update presets (#16504 ) * presets : add --embd-gemma-default and remove old embedding presets * presets : add gpt-oss presets * presets : add vision presets * cont : remove reasoning overrides [no ci] * cont : fix batch size for embedding gemma [no ci]	2025-10-12 09:29:13 +03:00
sirus20x6	41aac5c69b	ggml : Fix FP16 ELU positive branch (#16519 ) Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-12 08:25:37 +03:00
Daniel Bevenius	a2fba89a42	hparams : add check for layer index in is_recurrent (#16511 ) * hparams : add check for layer index in is_recurrent This commit adds a check in the is_recurrent method to ensure that the provided layer index is within the valid range. The motivation for this change is to prevent potential out-of-bounds and also be consistent with other methods in the class that perform similar checks, like is_swa.	2025-10-12 07:19:06 +02:00
sirus20x6	20cc625edc	ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (#16518 ) The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion. Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-12 08:15:00 +03:00
Concedo	a0ed446e61	handle numbers outside int32 range with wrapping	2025-10-12 12:46:45 +08:00
Wagner Bruna	9f9494cf3f	sd: add 'default' to the list of supported samplers (#1788 )	2025-10-12 12:35:56 +08:00
Concedo	65c2129f65	https://github.com/leejet/stable-diffusion.cpp/pull/877/commits/47c0f8e4bd6916442d04b0a4412554cf3a043e8d	2025-10-12 10:01:29 +08:00
Johannes Gäßler	11f0af5504	CUDA: faster tile FA, add oob checks, more HSs (#16492 )	2025-10-11 20:54:32 +02:00
Concedo	720fc30832	Merge branch 'upstream' into concedo_experimental	2025-10-11 23:19:38 +08:00
Concedo	e92f9fd422	cursed hack for RNN models	2025-10-11 23:14:55 +08:00
Georgi Gerganov	a3cb04744f	metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494 )	2025-10-11 16:54:10 +03:00
Pascal	4a8fbe0a5e	feat: render user content as markdown option (#16358 ) * feat: render user content as markdown option - Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options - Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle - Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise - chore: update webui build output * chore: update webui build output	2025-10-11 15:50:49 +02:00
Yann Follet	31d0ff1869	server / ranking : add sorting and management of top_n (#16403 ) * server / ranking : add sorting and management of top_n * Make the retro compatible if no top_n will return all results here is a script to make some test ```script URL=${1:-http://127.0.0.1:8181} curl "$URL/v1/rerank" -H "Content-Type: application/json" \ -d '{ "model": "M", "query": "What is the recipe to make bread ?", "return_text" : true, "texts" : true, "top_n": 6, "documents": [ "voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel", "it is a bear", "bread recipe : floor, water, yest, salt", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", "here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt", "recipe to make cookies : floor, eggs, water, chocolat", "here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt", "il fait tres beau aujourd hui", "je n ai pas faim, je ne veux pas manger", "je suis a paris" ] }' | jq ``` * use resize() instead for(...) 
* simplify top_n init since no need to return error result to test : ./tests.sh unit/test_rerank.py -v -x ==================================================== test session starts ===================================================== platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3 cachedir: .pytest_cache rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests configfile: pytest.ini plugins: anyio-4.11.0 collected 8 items unit/test_rerank.py::test_rerank PASSED [ 12%] unit/test_rerank.py::test_rerank_tei_format PASSED [ 25%] unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 37%] unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 50%] unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 62%] unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 75%] unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 87%] unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [100%] ===================================================== 8 passed in 4.31s ====================================================== * add rerank top_n unit test here is the result : ./tests.sh unit/test_rerank.py -v -x =================================================================== test session starts =================================================================== platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3 cachedir: .pytest_cache rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests configfile: pytest.ini plugins: anyio-4.11.0 collected 16 items unit/test_rerank.py::test_rerank PASSED [ 6%] unit/test_rerank.py::test_rerank_tei_format PASSED [ 12%] unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 18%] unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 25%] 
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 31%] unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 37%] unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 43%] unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [ 50%] unit/test_rerank.py::test_rerank_top_n[None-4] PASSED [ 56%] unit/test_rerank.py::test_rerank_top_n[2-2] PASSED [ 62%] unit/test_rerank.py::test_rerank_top_n[4-4] PASSED [ 68%] unit/test_rerank.py::test_rerank_top_n[99-4] PASSED [ 75%] unit/test_rerank.py::test_rerank_tei_top_n[None-4] PASSED [ 81%] unit/test_rerank.py::test_rerank_tei_top_n[2-2] PASSED [ 87%] unit/test_rerank.py::test_rerank_tei_top_n[4-4] PASSED [ 93%] unit/test_rerank.py::test_rerank_tei_top_n[99-4] PASSED [100%] =================================================================== 16 passed in 8.84s =================================================================== * editor config check fix	2025-10-11 16:39:04 +03:00
Diego Devesa	97870e6497	cuda : avoid initializing unused devices (#16510 )	2025-10-11 13:02:26 +02:00
amirai21	477a66b035	convert : correctly handle LLaMA tokenizer for Jamba (#16470 ) * fix: convert_hf_to_gguf - change Jamba non-sentencepiece mode (tokenizer.json) vocab construction * fix: convert_hf_to_gguf - jamba non-sentencepiece tokenizer to use _set_vocab_llama_hf func * fix: convert_hf_to_gguf - removed get_vocab_base_pre from jamba	2025-10-11 10:33:41 +02:00
Concedo	0cc0ea4cf9	reset prompt template idx	2025-10-11 12:30:07 +08:00
Concedo	5cea2fe944	don't enforce dims	2025-10-11 11:34:47 +08:00
Concedo	80f88eb703	wip qwen image edit. not working yet	2025-10-11 11:24:17 +08:00
Concedo	6d8f8cd65b	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/CMakeLists.txt	2025-10-11 10:01:43 +08:00
Georgi Gerganov	e60f01d941	server : fix division by zero when reporting stats (#16501 )	2025-10-10 22:15:05 +03:00
Georgi Gerganov	81086cd6a3	vocab : mark EOT token for Granite models (#16499 ) * vocab : mark EOT token for Granite models * sampling : fallback to EOS when EOT is not found	2025-10-10 17:17:31 +03:00
Radoslav Gerganov	68ee98ae18	server : return HTTP 400 if prompt exceeds context length (#16486 ) In streaming mode when prompt exceeds context length, the server returns HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with all other inference engines which return HTTP 4xx error in this case. This patch fixes this problem and makes the server return HTTP 400 in such cases.	2025-10-10 16:11:07 +02:00

10692 commits