koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-22 11:16:08 +00:00

Author	SHA1	Message	Date
Concedo	7e08e8d8b4	add some rpc dependencies (+1 squashed commits) Squashed commits: [b092a94e5] add some rpc dependencies	2026-05-18 22:17:30 +08:00
Concedo	fecf2dc3fa	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/server-self-hosted.yml # CMakeLists.txt # CODEOWNERS # ci/run.sh # cmake/llama-config.cmake.in # common/chat.cpp # examples/sycl/start-svr.sh # examples/sycl/test.sh # examples/sycl/win-start-svr.bat # examples/sycl/win-test.bat # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/vecdotq.hpp # ggml/src/ggml-vulkan/CMakeLists.txt # scripts/wc2wt.sh # tests/test-backend-ops.cpp # tests/test-chat.cpp	2026-05-18 21:27:23 +08:00
Wagner Bruna	90326f8585	sd: sync to master-612-d7ecbe1 (#2213 )	2026-05-18 21:19:12 +08:00
Aleksander Grygier	a135ec0baa	ui: Centralize monospace font styles in app.css (#23272 )	2026-05-18 15:10:14 +02:00
Martin Andersson	232f466583	webui: fix Tailwind v4 utility classes missing when built via cmake (#23253 )	2026-05-18 14:08:02 +02:00
Andrei	49c21f97cd	llama: initialize pre-norm embedding mask flag (#23256 )	2026-05-18 14:20:49 +03:00
Sigbjørn Skjæret	77e38d68f2	add myself to conversion (#23261 )	2026-05-18 12:42:56 +02:00
Martin Klacer	053e01dff6	ci : added kleidiai-server to server-self-hosted workflow (#22435 ) * kleidiai: added kleidiai-server to server-self-hosted workflow * Added KleidiAI-enabled Arm64 Linux llama-server CI/integration test workflow into the server-self-hosted.yml configuration file Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I032e33c525b7e26bc5d53719f638bee610cec1ee * Added self-hosted executor for KleidiAI server workflow Signed-off-by: Martin Klacer <martin.klacer@arm.com> * Update .github/workflows/server-self-hosted.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-18 11:14:57 +02:00
Georgi Gerganov	c3f95c1f06	scripts : allow wc2wt with an existing branch (#23189 )	2026-05-18 08:57:28 +03:00
Intel AI Get-to Market Customer Success and Solutions	0caf2a1d48	sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156 ) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions	5511965b19	sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150 ) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-18 08:11:51 +03:00
Neo Zhang	e98bcfec28	sycl : fix error when use -mg 1 error (#23140 )	2026-05-18 08:11:19 +03:00
Incarnas	1867a0c692	update bid to match each layers MTP source (#23237 ) * update bid to match each layers MTP source * Update conversion/qwen.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-18 12:37:12 +08:00
Sigbjørn Skjæret	dd7cad7197	cmake : do not check for bin install dir (#23234 )	2026-05-18 02:33:14 +02:00
Gabe Goodhart	726704a160	feat: Support d_conv=15 for ssm-conv.cu (#23017 ) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2026-05-17 23:05:11 +02:00
Aldehir Rojas	87589042ca	cmake : fix LLAMA_BUILD_UI logic (#23190 )	2026-05-17 14:42:26 -04:00
Sigbjørn Skjæret	e0de4c2419	cmake : do not install conversion script (#23204 )	2026-05-17 18:07:21 +02:00
Oliver Simons	84c678242a	CUDA: Continue directly including cuda/iterator (#23102 ) Cont of #22936, forgot to update one site	2026-05-17 18:00:10 +02:00
Aman Gupta	3e12fbdea5	llama: avoid copying logits during prompt decode in MTP (#23198 ) * llama: avoid copying logits during prompt decode in MTP * review: update comment * llama-graph: call set_output for t_h_pre_norm	2026-05-17 23:30:25 +08:00
Concedo	b91ee904d2	colab link to download directly, set a different fallback url	2026-05-17 23:11:32 +08:00
Aldehir Rojas	39cf5d6191	common : delegate assistant continuation to underlying template handlers (#23089 ) * common : delegate assistant continuation to template handler * server : implement echo parameter to exclude assistant prefill in the response * server : fix tests for prefill * server : use existing llama template * cont : clean up	2026-05-17 13:36:05 +02:00
Jan Ekström	a6d6183dbc	ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009 ) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`), nor what gets installed when you run the cmake build+install for SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`). This allows for SPIRV-Headers to be found, as currently the CI runner's setup does not seem to include the relevant path in list of search locations. * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers This is installed by the project if it is built and installed. Receiving an error during the configuration step is generally preferred to receiving an error in the middle of a build.	2026-05-17 13:12:11 +02:00
Pascal	fcae601e44	vulkan: add cpy bf16 -> f32 pipelines (#22677 )	2026-05-17 11:31:20 +02:00
Jeff Bolz	7ba22c6a09	vulkan: Support unaligned tensors for ROPE (#22637 )	2026-05-17 11:30:16 +02:00
Concedo	de2c5f1cef	detokenize add special token ids	2026-05-17 17:04:47 +08:00
Concedo	c1514e328b	updated lite	2026-05-17 16:47:19 +08:00
Aldehir Rojas	f4cc787b9f	common : enable streaming JSON argument values (#23173 ) * common : remove atomic from json arguments * common : remove parsing logic on JSON arguments	2026-05-17 03:44:34 -05:00
Jeff Bolz	3fbadb06dc	vulkan: fuse SSM_CONV + BIAS + SILU (#22653 )	2026-05-17 10:25:50 +02:00
Rares Vernica	1a68ec9378	server : honor --embd-normalize CLI arg (#23125 ) The --embd-normalize flag was registered only for the embedding and debug examples, so llama-server rejected it and the /embedding handler used a hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's example set and read params.embd_normalize as the handler's default. The per-request "embd_normalize" body field continues to override.	2026-05-17 09:39:04 +03:00
ddh0	a16cce81d3	ngram : reduce noisy logs (#23185 ) * ngram : reduce noisy logs * ngram : reduce noisy logs	2026-05-17 09:38:17 +03:00
Wagner Bruna	1ae9a79ecc	sd: sync to master-607-fd1a279 (#2212 )	2026-05-17 11:37:16 +08:00
Concedo	1e828ccabf	Merge branch 'upstream' into concedo_experimental # Conflicts: # common/common.cpp # ggml/CMakeLists.txt # scripts/sync-ggml.last # scripts/sync_vendor.py # src/llama-context.cpp # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-05-17 11:26:18 +08:00
Judd	4f13cb7424	webui: support video files as input (#22830 )	2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen	b64739ea39	server: (router) alloc tmp buffer on heap (#23159 )	2026-05-16 23:42:16 +02:00
Pascal	64b38b561b	server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137 )	2026-05-16 21:21:06 +02:00
Winston Ma	6049906133	vulkan: removed duplicate #include <memory> in headers (#23144 )	2026-05-16 19:57:35 +02:00
Concedo	9d38a9edc0	quick fix for colab	2026-05-17 00:18:02 +08:00
Concedo	0d320f60a6	fix multiuser regression	2026-05-17 00:17:12 +08:00
Concedo	47d5772fbe	add batching failure spam logs	2026-05-16 23:21:01 +08:00
Concedo	9203b6a051	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/labeler.yml # .github/workflows/build-self-hosted.yml # .github/workflows/release.yml # .github/workflows/server-sanitize.yml # .github/workflows/server-self-hosted.yml # .github/workflows/server.yml # .github/workflows/ui-build.yml # .github/workflows/ui-ci.yml # .github/workflows/ui-publish.yml # .gitignore # CMakeLists.txt # CODEOWNERS # scripts/ui-download.cmake # scripts/xxd.cmake # tests/test-backend-ops.cpp # tests/test-reasoning-budget.cpp # tools/CMakeLists.txt # tools/server/CMakeLists.txt # tools/server/README.md	2026-05-16 22:56:33 +08:00
Concedo	3095da076a	only fetch new popped horde requests if model is not blocked queue	2026-05-16 22:27:12 +08:00
Aleksander Grygier	0253fb21f5	ui: Add request timeout for MCP tool calls (#23138 ) * feat: Add request timeout for MCP tool calls in llama-ui * feat: MCP Settings tab with max timeout setting	2026-05-16 15:20:27 +02:00
Georgi Gerganov	3a92bc99db	sync : ggml	2026-05-16 16:11:29 +03:00
Georgi Gerganov	e6c37a1adc	ggml : bump version to 0.12.0 (ggml/1494)	2026-05-16 16:11:29 +03:00
CrispStrobe	560445bf34	metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477) For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs). The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs. Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0). Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.	2026-05-16 16:11:29 +03:00
Steve Lhomme	2eb3e6b242	ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480) That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code	2026-05-16 16:11:29 +03:00
Holger Voormann	25b1bc9c2f	ui: Correct links in `tools/ui/README.md` [no ci] (#23139 ) In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`. See `59778f0196`.	2026-05-16 14:42:38 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	18675b6bbc	vendor : update cpp-httplib to 0.45.0 (#23103 )	2026-05-16 15:25:21 +03:00
Aman Gupta	255582687b	llama + spec: MTP Support (#22673 ) * spec: support MTP * fix batch size * rename files * cont : simplify (#7) * MTP: clean-up (#9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. - Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: `8c05923630` Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (#11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (#13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-16 20:06:23 +08:00
kubawoo	b81c2cdd74	ui: Fix handling of MCP resource template parameters (#23117 ) * Fix handling of MCP resource template parameters * Fix formatting for uri-template.test.ts --------- Co-authored-by: kuba <kuba@laptop.local.net>	2026-05-16 13:25:41 +02:00
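
The loop-bound argument in commit `560445bf34` (kernel_conv_transpose_1d) can be checked with a small sketch. This is not the Metal kernel: it is a single-channel, pure-Python illustration, and the function names `conv_transpose_1d_full` and `conv_transpose_1d_tight` are hypothetical. It assumes the output length is (IL - 1) * s0 + K and that tap k = j - i * s0 is the one applied to input i, which is what the condition i*s0 <= j < i*s0 + K in the commit message implies.

```python
def conv_transpose_1d_full(x, w, s0):
    """Scan every input position i for each output j and filter with an `if`."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    for j in range(len(out)):
        acc = 0.0
        for i in range(IL):
            k = j - i * s0          # kernel tap that maps input i to output j
            if 0 <= k < K:          # same as i*s0 <= j < i*s0 + K
                acc += x[i] * w[k]
        out[j] = acc
    return out


def conv_transpose_1d_tight(x, w, s0):
    """Iterate only i in [i_min, i_max]; also return the longest inner loop."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    longest = 0
    for j in range(len(out)):
        # ceil((j - K + 1) / s0) written with integer floor division
        i_min = max(0, -((K - 1 - j) // s0))
        i_max = min(IL - 1, j // s0)
        acc = 0.0
        for i in range(i_min, i_max + 1):
            acc += x[i] * w[j - i * s0]
        out[j] = acc
        longest = max(longest, i_max - i_min + 1)
    return out, longest


if __name__ == "__main__":
    # Shape quoted in the commit message: IL=320, K=10, s0=5
    x = [0.1 * (n % 7) - 0.3 for n in range(320)]
    w = [0.05 * (n + 1) for n in range(10)]
    tight, longest = conv_transpose_1d_tight(x, w, 5)
    print(tight == conv_transpose_1d_full(x, w, 5), longest)
```

Because both versions visit the contributing inputs in ascending i, the sums are accumulated in the same order, which is why the two outputs compare equal exactly rather than only approximately.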

13375 commits