koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-22 19:47:49 +00:00

Author	SHA1	Message	Date
Johannes Gäßler	4f0e43da6f	CUDA: fix PDL CC check for JIT compilation (#23471 )	2026-05-21 23:35:29 +02:00
Pascal	47c0eda9d4	vulkan: fuse snake activation (mul, sin, sqr, mul, add) (#22855 ) * vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (BigVGAN, Vocos) for snake activation y = x + sin(ax)^2 inv_b and rewrites it to a single elementwise kernel. test_snake_fuse from the CUDA PR now also compares CPU naive vs Vulkan fused across F32 / F16 / BF16. * vulkan: address jeffbolznv review for fused snake activation Rename T / C to ne0 / ne1 in the shader and push constants to match the standard naming convention used across the Vulkan backend. Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous (the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be tightly packed on the broadcast dim (the shader reads data_a[i1]). * vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review) * vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review) * vulkan: address 0cc4m review for fused snake activation snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention. A_TYPE now applies to the activation tensor data_a instead of the broadcast multiplier, and the bindings become data_a (A_TYPE), data_b (float), data_c (float) and data_d (D_TYPE). A header at the top of the shader maps each buffer to its role in y = x + sin(b * x)^2 * c. On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern constant instead of duplicating the op list, sin_node is extracted as a named local alongside the other chain nodes, and the broadcast operands a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded float bindings on data_b and data_c (the previous a->type == x->type would silently reject any future BF16 or F16 chain once the supports_op gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1] is refreshed to match the new binding names.	2026-05-21 19:39:42 +02:00
Chen Yuan	5306f4b3b5	fix(flash-attn): replace f32 with kv_type and q_type (#23372 )	2026-05-21 07:58:49 -07:00
Georgi Gerganov	a1a69f777a	metal : optimize concat kernel and fix set kernel threads (#23411 ) * metal : fix GGML_OP_SET kernel threads * tests : extend test_cpy to support different src/dst shapes Extend test_cpy to support different source and destination tensor shapes for CPY operations (reshaping), where the total number of elements must match. - Renamed ne -> ne_src, added ne_dst parameter (default: use src shape) - Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions - Tests exercise 1024 boundary, small shapes, and large dimensionality changes - Fixed dangling reference bug (storing & to temporary std::array) - Updated all existing test calls with permute/transpose args for compatibility Assisted-by: llama.cpp:local pi * metal : optimize concat kernel with row batching for small widths When ne0 < 256, batch multiple rows into a single threadgroup to improve occupancy. This avoids underutilizing the GPU when processing narrow tensors. - Dispatch nth = min(256, ne0) threads per group - Calculate nrptg (rows per threadgroup) to fill up to 256 threads - Update kernel index calculation to handle the row batching - Add boundary check for i1 >= ne1 Assisted-by: llama.cpp:local pi * tests : clean-up * tests : refactor CPY shape tests to use dimension permutations Replace 75 hardcoded test cases with a loop over permutations of {3, 5, 7, 32} (total elements: 3360). Each src permutation is tested against canonical sorted and reverse dst, skipping identical shapes. Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32). Assisted-by: llama.cpp:local pi	2026-05-21 13:34:08 +03:00
Matt Corallo	2754ce1b3e	ggml : Check the right iface method before using the fallback 2d get (#23306 ) Probably no backends implement only one of 2d get/set, but this might be annoying for some future backend developer trying to add 2d get/set.	2026-05-21 09:24:40 +03:00
Todor Boinovski	0be84685bd	hexagon: ssm-conv fix for large prompts (#23307 ) * hexagon: remove gathers and better handling of vtcm in ssm-conv * hexagon: relax ssm-conv gating requirements * hexagon: add new prefill ssm-conv backend test * hexagon: remove trailing white space * hex-rope: uninline rope_cache_init, otherwise it breaks after rebaseing with SSM_CONV changes --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-20 22:14:13 -07:00
lhez	3a6db741a8	opencl: refactor backend initilization (#23318 ) * opencl: refactor initialization * opencl: refactor GPU identification * opencl: rename for consistency * opencl: cache global mem size in dev_ctx * opencl: adjust log level * opencl: load argsort and flash_attn kernels in supports_op * argsort kernel must be built for supports_op for querying the max workgroups * flash_attn kernel has many variants, only load them when needed	2026-05-20 09:57:36 -07:00
Daniele	acd604fb27	vulkan: optimize operations in the IM2COL shader (#22685 ) * vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting	2026-05-20 17:15:13 +02:00
Max Krasnyansky	c9872a2575	hexagon: HMX quantized matmul rework (#23368 ) * hmx-mm: update debug logging in hmx-mm * hmx-mm: update dequant logic to use HVX_vector_x2/4 * hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't reall need non-pipelined version * hmx-mm: use activation depth mode and update naming Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> * hex-mm: minor hmx matmul naming updates * hmx-mm: remove unused vars * snapdragon: scripts bump default ubatch-size to 1K * hexagon: combine HMX and power and clock settings into a single set_power call * hmx-mm: remove leftover of the scale repl helper * hexagon: fix editconf error --------- Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>	2026-05-20 07:39:01 -07:00
Andreas Kieslinger	e947228222	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522 ) * Adds initial PDL setup. * Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst. * Further optimization pass of the first half of kernels * Optimized PDL barriers for the second batch of kernels * Further refinements after rebase. * Moves pdl logic to separate function, removes some whitespace * Strips post-hoc PDL logic * Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels * Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL * Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL * Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility * Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32 * Enrolls flash_attn_combine_results * Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect. * Enrolls flash-attention kernels to pdl * Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL. * Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via and template alias and template expansion * Enrolls all remaining kernels for qwen3-coder-next into PDL * Remove all PDL LC calls to create a baseline * Added LC according to internal guidance and tested kernel performance. * Enrols missing qwen3-5 kernels passively into PDL. * Kernel optimizations (LC signals) for qwen3.5 * Enrolls ssm-scan kernels into PDL * Adds GGML_CUDA_PDL command line option to toggle PDL. * Fix: Ada and lower compilation by guarding PDL calls correctly * Cleanup: Removes commented out GGML_CUDA_PDL_LC * Cleanup: Removes experimental comments * Adds 90-virtual to build script so that Hopper GPUs can leverage PDL. * Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL. * Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions * Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1 * Enable PDL by default for Hopper+ devices * Enrolls softcap_f32 and two flash_attn kernels into PDL. * Improves flash attn PDL barrier placement * Fix: Perf regression on ada; excludes ada and below from PDL launches * Improves some sync barrier placements * Drops superfluous constructor * Adds #endif guard comments * Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact. * Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled * Revert "Drops superfluous constructor". Adds const to remaining arguments This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1. * Cleanup: Removes and fixes some comments and whitespace * Clarifies comment of sync-barrier position * Relocates and refactors PDL launch functions and accessories * Adds error checking to the regular kernel launch path * Drops "auto" in favor of "ggml_cuda_kernel_params" * Adds "const" to ggml_cuda_kernel_launch_params * [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy	2026-05-20 13:59:02 +02:00
Georgi Gerganov	57ebaf4edd	metal : optimize pad + cpy (#23354 ) * metal : optimize pad * metal : optinmize cpy * cont : better row packing in threadgroup	2026-05-20 09:42:00 +03:00
ravel7524	b39a7bf1b0	ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (#23349 )	2026-05-20 09:52:21 +08:00
shaofeiqi	b28a2f372a	opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (#23303 ) * opencl: add q4_k moe support * opencl: add q5_k moe support * opencl: add q6_k moe support * opencl: adjust format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-05-19 14:29:00 -07:00
Aparna M P	17d22a35b2	hexagon: add MROPE and IMROPE support in HTP rope op (#23317 )	2026-05-19 14:10:13 -07:00
Aparna M P	ac76808e4d	hexagon: enable support for NORM op (#23319 )	2026-05-19 09:48:21 -07:00
Reese Levine	c85a242ed0	ggml-webgpu : extend GDN for K>1 (#23299 )	2026-05-19 09:45:41 +03:00
Intel AI Get-to Market Customer Success and Solutions	439f1b193d	sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (#22153 ) * sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle Signed-off-by: Chun Tao <chun.tao@intel.com> * Use async mem ops for correctness when SYCL graphs are explicitly on. Signed-off-by: Tao, Chun <chun.tao@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Tao, Chun <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-19 09:44:02 +03:00
Radoslav Gerganov	c3e9ade6dd	rpc : keep last_graph_uid in the device context (#23273 ) With the introduction of MTP we can have multiple compute contexts for the same RPC device. In this case last_graph_uid is not updated properly when contexts are being switched. This patch fixes this by moving last_graph_uid to the device context, making sure it is always updated. closes: #23242	2026-05-19 09:42:36 +03:00
Pranav Dhinakar	9a532ae4ba	hexagon: add support for TRI op (#22822 ) * Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context * addressed PR review comments for TRI op * hexagon: clang format * hex-unary: remove merge conflict markers * hex-ggml: remove duplicate op cases (merge conflict) * hex-ggml: fix editor config errors --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-18 14:04:57 -07:00
Pranav Dhinakar	b7340443d4	ggml-hexagon: add PAD op HVX kernel (#23078 ) * ggml-hexagon: add PAD op HVX kernel Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized kernels. Supports zero-padding and circular padding across all 4 tensor dimensions. * hex-ggml: remove duplicate op cases (merge conflict) * hex-pad: fix editorconfig checks and macro alignment --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-18 13:39:36 -07:00
Intel AI Get-to Market Customer Success and Solutions	0caf2a1d48	sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156 ) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions	5511965b19	sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150 ) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-18 08:11:51 +03:00
Gabe Goodhart	726704a160	feat: Support d_conv=15 for ssm-conv.cu (#23017 ) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2026-05-17 23:05:11 +02:00
Oliver Simons	84c678242a	CUDA: Continue directly including cuda/iterator (#23102 ) Cont of #22936, forgot to update one site	2026-05-17 18:00:10 +02:00
Jan Ekström	a6d6183dbc	ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009 ) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`), nor what gets installed when you run the cmake build+install for SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`). This allows for SPIRV-Headers to be found, as currently the CI runner's setup does not seem to include the relevant path in list of search locations. * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers This is installed by the project if it is built and installed. Receiving an error during the configuration step is generally preferred to receiving an error in the middle of a build.	2026-05-17 13:12:11 +02:00
Pascal	fcae601e44	vulkan: add cpy bf16 -> f32 pipelines (#22677 )	2026-05-17 11:31:20 +02:00
Jeff Bolz	7ba22c6a09	vulkan: Support unaligned tensors for ROPE (#22637 )	2026-05-17 11:30:16 +02:00
Jeff Bolz	3fbadb06dc	vulkan: fuse SSM_CONV + BIAS + SILU (#22653 )	2026-05-17 10:25:50 +02:00
Winston Ma	6049906133	vulkan: removed duplicate #include <memory> in headers (#23144 )	2026-05-16 19:57:35 +02:00
Georgi Gerganov	e6c37a1adc	ggml : bump version to 0.12.0 (ggml/1494)	2026-05-16 16:11:29 +03:00
CrispStrobe	560445bf34	metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477) For a given output position j on the time axis, only input positions i such that is0 <= j < is0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs). The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs. Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0). Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.	2026-05-16 16:11:29 +03:00
Steve Lhomme	2eb3e6b242	ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480) That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code	2026-05-16 16:11:29 +03:00
Aman Gupta	255582687b	llama + spec: MTP Support (#22673 ) * spec: support MTP * fix batch size * rename files * cont : simplify (#7) * MTP: clean-up (#9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. - Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: `8c05923630` Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (#11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (#13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-16 20:06:23 +08:00
Pranav Dhinakar	5c0e946837	ggml-hexagon: cpy: add contiguous fast-path in reshape copy (#23076 )	2026-05-14 16:55:54 -07:00
Johannes Gäßler	3e037f313c	HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (#22880 ) Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout. I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128.	2026-05-14 22:58:58 +02:00
Zheyuan Chen	5ec717d125	ggml-webgpu: makes the flash attn vec path subgroup-aware (#23040 ) * ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size. * ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap	2026-05-14 09:31:36 -07:00
Georgi Gerganov	67b2b7f2f2	logs : reduce (#23021 ) Some checks failed Python Type-Check / python type-check (push) Waiting to run Details Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled Details Python check requirements.txt / check-requirements (push) Has been cancelled Details Update Operations Documentation / update-ops-docs (push) Has been cancelled Details * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces	2026-05-14 13:05:52 +03:00
alex-spacemit	81b0d882ae	ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863 )	2026-05-14 17:39:30 +08:00
Ruben Ortlam	dbe7901ca6	vulkan: fix matmul integer pipeline selection (#23005 ) * vulkan: fix matmul integer pipeline selection * gate pipeline creation with the right bools	2026-05-14 10:36:54 +02:00
Katostrofik	9ed6e19b9d	SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597 ) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable. * SYCL: address review feedback - remove try/catch, check device types, deduplicate - Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu) - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw) * SYCL: add build/runtime flags for Level Zero, address review feedback Implements the architecture suggested by @arthw: compile-time and runtime flags to cleanly separate Level Zero and SYCL memory API paths. - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling. - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning. - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control. - Update SYCL.md with documentation for both new parameters. Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development (Claude). Code reviewed and tested on my hardware. * SYCL: unify Level Zero malloc/free call sites, address review feedback Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device. Both functions are now unconditionally available — Level Zero code is #ifdef'd inside the functions, not at call sites. All call sites use uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks. Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack traces on failure, eliminate duplicated #ifdef/else patterns at 6 call sites (-29 lines net). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs so the Level Zero code path is compiled and tested in CI. Fix two bugs found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set): - The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked. - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: address arthw review feedback on Level Zero memory API structure - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp - Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks - Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in - Update init loop comment to explain GPU-only device check - CMakeLists: message(STATUS) before the if block; align option wording AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL: remove unused cstdio/cstdlib includes from common.cpp Leftover from the deleted ggml_sycl_queue_supports_level_zero helper. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * SYCL: preserve Level Zero allocation path during early malloc * ci: fix Level Zero package conflict in Intel Docker build * ci: find Level Zero loader in oneAPI package step * ci: allow Windows SYCL package without Level Zero DLL --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-14 13:39:14 +08:00
Zheyuan Chen	4c1c3ac09d	ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (#23020 )	2026-05-13 15:12:40 -07:00
scutler-nv	7f3f843c31	Fix for issue #22974 . Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (#22994 )	2026-05-13 22:36:14 +02:00
shaofeiqi	ec562eb673	opencl: add q5_0 and q5_1 MoE for Adreno (#22985 ) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-05-13 11:57:31 -07:00
lhez	1e4579fbb8	opencl: fix crash when warming up MoE on Adreno (#22876 )	2026-05-13 11:24:33 -07:00
Masashi Yoshimura	527045bfb0	flush the gpu profile timestamp before the queryset is overflowed (#22995 )	2026-05-13 10:22:44 -07:00
Max Krasnyansky	ad96bb8c0c	hexagon: add unary tanh op (#22999 )	2026-05-13 06:59:28 -07:00
Sachin Sharma	61af07c22d	ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (#22681 ) * ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled) * ggml-zendnn : restore original fallback logic when adaptive fallback is disabled	2026-05-13 09:13:47 +03:00
Trivikram Reddy	856c3adac1	hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993 ) * hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase * hmx-mm: optimize per-group scale handling * hmx-fa: optimize slope load from vtcm * hmx-fa: use aligned access where possible in hmx-utils * hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-12 17:28:02 -07:00
yzyyzyhhh	a9883db8ee	opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (#22755 ) * ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill * ggml-opencl: address Adreno xmem review comments * ggml-opencl: align xmem gemm kernel naming --------- Co-authored-by: Your Name <your@email.com>	2026-05-12 13:10:37 -07:00
Masashi Yoshimura	927dada6c9	ggml-webgpu: Enables running gpt-oss-20b (#22906 ) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu	2026-05-12 07:27:40 -07:00

1 2 3 4 5 ...

2481 commits