koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-11 04:51:25 +00:00

Author	SHA1	Message	Date
Concedo	f6ece6fd37	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/check-vendor.yml # .github/workflows/close-issue.yml # .github/workflows/editorconfig.yml # .github/workflows/gguf-publish.yml # .github/workflows/labeler.yml # .github/workflows/pre-tokenizer-hashes.yml # .github/workflows/python-check-requirements.yml # .github/workflows/python-lint.yml # .github/workflows/python-type-check.yml # .github/workflows/server.yml # .github/workflows/update-ops-docs.yml # README.md # docs/build.md # examples/model-conversion/scripts/utils/perplexity-gen.sh # examples/model-conversion/scripts/utils/perplexity-run-simple.sh # examples/model-conversion/scripts/utils/perplexity-run.sh # examples/model-conversion/scripts/utils/quantize.sh # examples/model-conversion/scripts/utils/run-embedding-server.sh # ggml/src/ggml-cpu/ggml-cpu.c # ggml/src/ggml-hexagon/htp/flash-attn-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-opencl/kernels/mul_mv_q6_k_f32.cl # ggml/src/ggml-sycl/ggml-sycl.cpp # scripts/compare-llama-bench.py # tests/test-backend-ops.cpp # tests/test-gguf.cpp # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-01-27 23:06:13 +08:00
Johannes Gäßler	a5bb8ba4c5	CUDA: tune GLM 4.7 Flash FA kernel selection logic (#19097 )	2026-01-27 14:28:56 +01:00
Alberto Cabrera Pérez	be8890e721	ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (#18888 ) * Boilerplate for q6_K repack * q6_K repack to q6_Kx8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q6_K generic gemv and gemm * wip, gemm_q6_K 8x8 * Still WIP: loading of q8s, q6h and q6l * first working version of q6_K gemm * Moved q6 loads outside of sb block, Unrolled inner loop * Replaced modulo with mask * First implementation of GEMV * ggml_vdotq_s32 -> vdotq_s32 * Reduce width of accumulators in q6_K gemv * Bsums instead of calc bias. Preload scales to use vget_lane. Unroll. * Reuse scales in GEMM (same GEMV opt) * Added todos for bsum and different qh repack * Arch fallback * VSLIQ for merging qh and ql * Removed TODO, already tested * Apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Removed unused import --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-27 11:08:10 +02:00
Gaurav Garg	a83c73a18a	[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042 ) * [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size. * Set the env variable in the CUDA backend registry allocation * Add link to PR in code comment * Remove warning logs and update documentation	2026-01-27 08:52:44 +02:00
shalinib-ibm	7afdfc9b84	ggml-cpu: Enable FP16 MMA kernels on PPC (#19060 )	2026-01-27 11:52:34 +08:00
lhez	94eeb5967c	opencl: add flattened q6_K mv (#19054 ) * opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat` * opencl: clean up * opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat` * opencl: tweak the workgroup size a bit * opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat` * opencl: proper alignment for q6_K * opencl: boundary handling for flattened q6_K mv * opencl: rename q6_K mv kernel file * opencl: put flattened q6_K mv in its own file * opencl: use lower k in file name * opencl: use K in variable names	2026-01-26 19:36:24 -08:00
Johannes Gäßler	b0311c16d2	CUDA: fix padding of GQA to power of 2 in FA (#19115 )	2026-01-26 23:24:58 +01:00
Johannes Gäßler	0c21677e43	CUDA: faster FA for GQA > 1 but not power of 2 (#19092 )	2026-01-25 21:19:47 +01:00
ccbinn	0440bfd160	metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088 ) Co-authored-by: chenbin11 <chenbin11@kuaishou.com>	2026-01-25 20:07:19 +02:00
Aman Gupta	bcb43163ae	ggml-cpu: Use tiled FA for prompt-processing (#19012 ) * ggml-cpu: Use tiled FA for prompt-processing the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine. * fix out of bounds for mask * skip rows where there are all masks * skip tile if mask is inf * store mask in worksize * check inf tile earlier	2026-01-25 23:25:58 +08:00
Georgi Gerganov	d9c6ce46f7	kv-cache : support V-less cache (#19067 ) * kv-cache : support V-less cache * cuda : better check for V_is_K_view * cuda : improve V_is_K_view check * graph : add comments * hparams : refactor	2026-01-25 15:48:56 +02:00
Johannes Gäßler	4e5b83b226	GGUF: check that tensor size is representable (#19072 )	2026-01-24 21:57:51 +01:00
Johannes Gäßler	8f91ca54ec	CUDA: re-use MLA K data for V in MMA FA (#19057 )	2026-01-24 10:09:36 +01:00
Aman Gupta	81ab64f3c8	ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934 ) * ggml-cuda: add split-wise cuda graph * add n-cpu-moe compare_llama_bench.py * fix hip/musa builds	2026-01-24 14:25:20 +08:00
nullname	8af1f5f430	ggml-hexagon: flash-attn opt (#19025 ) * optimize flash attention kernel by improving score computation and online softmax update * wip * Refactor online softmax update in flash attention kernel for improved performance * Optimize flash attention kernel by replacing float array with HVX_Vector for score computation * wip	2026-01-23 22:02:07 -08:00
Neo Zhang	cb6caca191	[SYCL] use malloc to support both iGPU and dGPU in same time (#18992 ) * use malloc to support both iGPU and dGPU in same time * support windows --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-23 20:54:10 +08:00
Alberto Cabrera Pérez	091a46cb8d	ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860 ) * Boilerplate for q5_Kx8 REPACK on ARM and fallback Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Implements make_block_q5_Kx8 by extending make_block_q4_Kx8 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q5_K repack gemm and gemv generics * Gemm and Gemv ARM implementations (i8mm) * Improved qh manipulation looking at non-repack vec_dot implementation * Full unroll * Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments. Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix wrong fallback definitions of Q5_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed comments. Reverted unnecessary formatting Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed typo in generic definitions * Switching AND + Shift with Shift Insert. Better op interleaving. * Vectorize + unroll the block scales * Apply gemm optimizations to gemv * Improve bias calculation --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>	2026-01-23 09:55:08 +02:00
Concedo	e8e7c357c9	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build-cache.yml # .github/workflows/build-cmake-pkg.yml # .github/workflows/build-linux-cross.yml # .github/workflows/build.yml # .github/workflows/check-vendor.yml # .github/workflows/close-issue.yml # .github/workflows/copilot-setup-steps.yml # .github/workflows/docker.yml # .github/workflows/editorconfig.yml # .github/workflows/gguf-publish.yml # .github/workflows/labeler.yml # .github/workflows/pre-tokenizer-hashes.yml # .github/workflows/python-check-requirements.yml # .github/workflows/python-lint.yml # .github/workflows/python-type-check.yml # .github/workflows/release.yml # .github/workflows/server-webui.yml # .github/workflows/server.yml # .github/workflows/update-ops-docs.yml # .github/workflows/winget.yml # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-zdnn/ggml-zdnn.cpp # requirements/requirements-tool_bench.txt # src/CMakeLists.txt # src/llama-quant.cpp # tests/test-backend-ops.cpp # tests/test-chat.cpp # tools/cli/cli.cpp # tools/server/README.md	2026-01-23 14:27:04 +08:00
Concedo	5c6cc02985	remove clblast, part 2	2026-01-23 14:09:46 +08:00
Georgi Gerganov	a5eaa1d6a3	mla : make the V tensor a view of K (#18986 ) * mla : pass V as a view of K to the FA op * cuda : adjust mla logic to new layout * kv-cache : fix rope shift * tests : remove comment * cuda : fix reusable_cutoff Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-22 22:09:01 +02:00
Johannes Gäßler	e2baf02162	CUDA: fix alignment check for FA (#19023 )	2026-01-22 20:39:25 +01:00
lhez	9c96465f99	opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970 ) * opencl: add `copy_to_contiguous` and utilize mm kernels * opencl: only copy to cont for f32 and f16 tensors * opencl: use cont mm for fallback when dst is large * opencl: use nb local to copy-to-cont * opencl: use local offset as well	2026-01-22 10:29:25 -08:00
Aman Gupta	b70d251076	CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953 )	2026-01-22 18:51:53 +08:00
shaofeiqi	5516b9c16a	opencl: add TRI op support (#18979 )	2026-01-21 22:05:54 -08:00
Aleksei Nikiforov	94242a62c0	ggml-zdnn : mark zDNN buffers as non-host (#18967 ) While buffers reside in host memory, additional transformation is needed to use buffers with zDNN. Fixes #18848	2026-01-22 01:16:21 +01:00
Jeff Bolz	bd544c94a3	vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945 ) * vulkan: Remove transfer_ctx, do everything in compute_ctx. We had a bug where a set_tensor_async (using transfer_ctx) didn't get submitted before the graph_compute (using compute_ctx) that came after it. To avoid this sort of issue, just do everything in compute_ctx. Remove transfer_cmd_pool, which was already unused. * fix crash with perf logger	2026-01-21 18:01:40 +01:00
Jeff Bolz	33f890e579	vulkan: support flash attention GQA/split_k with small batches (#18938 )	2026-01-21 17:43:43 +01:00
Masato Nakasaka	067b8d7af3	Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356 )" (#18831 ) This reverts commit `980b7cd17e`.	2026-01-21 17:13:43 +01:00
Jeff Bolz	50b7f076a5	vulkan: Use mul_mat_vec_id for small values of n (#18918 ) Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and update the indexing calculations in get_offsets. Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this the cost is linear in n rather than n>1 being far slower than n==1.	2026-01-21 16:22:02 +01:00
Concedo	4984c9bc16	Merge commit '`12a4a47e6a`' into concedo_experimental # Conflicts: # ci/run.sh # examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh # examples/model-conversion/scripts/causal/run-converted-model.sh # examples/model-conversion/scripts/embedding/run-converted-model.sh # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-zdnn/ggml-zdnn.cpp # ggml/src/ggml-zendnn/ggml-zendnn.cpp # tests/CMakeLists.txt # tests/test-chat-parser.cpp # tests/test-chat-peg-parser.cpp # tests/test-chat.cpp # tools/cli/cli.cpp	2026-01-21 21:00:44 +08:00
Matthieu Coudron	37c35f0e1c	gguf: display strerrno when can't load a model (#18884 ) I've had issues loading models with llama-server: [44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' and I was sure it could access the file. Seems like --models-dir and --models-presets don't interact like I thought they would but I salvaged this snippet that helps troubleshooting [44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)	2026-01-21 08:52:46 +02:00
Oliver Simons	5bd341c9a1	CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964 ) * CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5) * Unindent as per code review request	2026-01-21 02:34:29 +01:00
Oliver Simons	d1e3556481	CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930 ) * CUDA: Replace `init_offsets` with iterators in argsort This is a QOL improvement, saving us the cost of materializing the iterator * Remove unnecessary include from top-k.cu	2026-01-20 20:11:01 +08:00
Adrien Gallouët	08f3f4a8a3	ggml : cleanup path_str() (#18928 ) - Remove pragmas as `std::codecvt_utf8` is not used. - Avoid implicit `strlen()`. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-20 11:42:49 +01:00
Georgi Gerganov	271191906c	metal : enable FA for MLA heads (#18950 )	2026-01-20 12:21:28 +02:00
Georgi Gerganov	365a3e8c31	ggml : add ggml_build_forward_select (#18550 ) * ggml : add ggml_build_forward_select * cuda : adapt CUDA graph compat to new feature * vulkan : update logic to handle command buffer closing * ggml : check compute for fusion * ggml : add comment	2026-01-19 20:03:19 +02:00
Concedo	7f618454ff	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/labeler.yml # CODEOWNERS # docs/backend/OPENCL.md # docs/ops.md # docs/ops/CANN.csv # docs/ops/WebGPU.csv # ggml/src/ggml-blas/CMakeLists.txt # ggml/src/ggml-opencl/kernels/mul_mv_q6_k.cl # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/cpy.tmpl.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/set_rows.wgsl # tests/test-backend-ops.cpp	2026-01-18 23:24:29 +08:00
lhez	d1b4757ded	opencl: fix q6_K mv for m=1 (#18893 )	2026-01-17 13:50:32 -08:00
Reese Levine	a89002f07b	ggml webgpu: support for backend sampling (#18880 ) * ggml webgpu: add SOFTPLUS unary operator Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * Follow Vulkan backend numerical stability pattern * ggml webgpu: add EXPM1 unary operator Implements EXPM1 (exp(x) - 1) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add FLOOR unary operator Implements FLOOR (rounds down to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add CEIL unary operator Implements CEIL (rounds up to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add ROUND unary operator Implements ROUND (rounds to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add TRUNC unary operator Implements TRUNC (truncates towards zero) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS) * Updates to webgpu get_memory * Add argmax * Add argmax,cumsum,sum,sum_rows * Add necessary CPY/GET_ROWS operators * Support for argsort using multi-pass strategy * Update set_rows for i32 indices, move to pre-wgsl * Port unary operators to pre-wgsl and support FILL * Implement PAD * Add support for top-k * clean up, scope pipeline init mutex * fix newline * Add support for log * Update LOG for better precision, and ops doc --------- Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>	2026-01-16 16:12:43 -08:00
Concedo	0d43bdc46d	Merge branch 'upstream' into concedo_experimental # Conflicts: # examples/batched/batched.cpp # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # src/llama-context.cpp # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-01-17 00:41:28 +08:00
Thore Koritzius	388ce82241	ggml : extend ggml_pool_1d + metal (#16429 ) * chore: resolve conflicts * feat: ggml metal impl * fix: ggml_metal_kargs_pool_1d struct * fix: require contiguous input * chore: test pool_1d * chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts * chore: add p0 and s0 to testing * fix: allow padding for cpu and metal * Update ggml/src/ggml-metal/ggml-metal.metal * fix: correct single-threaded loop * ggml : cleanup * tests : add ne[1] != 1 tests * fix: ne[1] handling in np * cont : fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-16 16:59:56 +02:00
Concedo	22af5f1250	Merge commit '`2a13180100`' into concedo_experimental # Conflicts: # .devops/cann.Dockerfile # .devops/cpu.Dockerfile # .devops/cuda-new.Dockerfile # .devops/cuda.Dockerfile # .devops/intel.Dockerfile # .devops/llama-cli-cann.Dockerfile # .devops/musa.Dockerfile # .devops/nix/package.nix # .devops/rocm.Dockerfile # .devops/s390x.Dockerfile # .devops/vulkan.Dockerfile # .github/workflows/build-cmake-pkg.yml # .github/workflows/build-linux-cross.yml # .github/workflows/build.yml # .github/workflows/copilot-setup-steps.yml # .github/workflows/release.yml # .github/workflows/server-webui.yml # .github/workflows/server.yml # CMakeLists.txt # README.md # build-xcframework.sh # ci/run.sh # cmake/common.cmake # common/CMakeLists.txt # docs/backend/hexagon/CMakeUserPresets.json # docs/backend/hexagon/README.md # docs/build-riscv64-spacemit.md # docs/build.md # examples/debug/debug.cpp # examples/eval-callback/CMakeLists.txt # examples/eval-callback/eval-callback.cpp # examples/llama.android/lib/build.gradle.kts # examples/sycl/build.sh # examples/sycl/win-build-sycl.bat # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/act-ops.c # ggml/src/ggml-hexagon/htp/binary-ops.c # ggml/src/ggml-hexagon/htp/flash-attn-ops.c # ggml/src/ggml-hexagon/htp/get-rows-ops.c # ggml/src/ggml-hexagon/htp/hex-dma.c # ggml/src/ggml-hexagon/htp/hex-dma.h # ggml/src/ggml-hexagon/htp/htp-ctx.h # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/set-rows-ops.c # ggml/src/ggml-hexagon/htp/softmax-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-hexagon/htp/worker-pool.c # scripts/debug-test.sh # scripts/serve-static.js # scripts/snapdragon/adb/run-bench.sh # scripts/snapdragon/adb/run-cli.sh # 
scripts/snapdragon/adb/run-mtmd.sh # scripts/snapdragon/adb/run-tool.sh # scripts/tool_bench.py # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tools/mtmd/clip.cpp	2026-01-16 21:52:01 +08:00
Perry Naseck	0802d4cfb3	ggml-blas: hide warnings from included BLAS headers (#18818 ) * fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set * ggml-blas: hide warnings from included BLAS headers	2026-01-16 13:38:25 +02:00
Concedo	af7811dbe1	Merge commit '`3e4bb29666`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # ci/run.sh # cmake/common.cmake # examples/eval-callback/CMakeLists.txt # examples/model-conversion/scripts/causal/modelcard.template # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-metal/CMakeLists.txt # src/CMakeLists.txt # tests/CMakeLists.txt # tests/test-arg-parser.cpp	2026-01-16 17:55:22 +08:00
Raul Torres	4ea2eaac01	CANN: Remove unused `ggml_cann_get_device` function (#18625 )	2026-01-16 16:34:09 +08:00
Chenguang Li	e20fa27a02	CANN: fix an issue where get_env was not fully renamed (#18796 ) * CANN: fix an issue where get_env was not fully renamed * ci: add cann with acl group * ci: define use_acl_graph using GitHub Action * ci: update cann dockerfile with acl graph	2026-01-16 16:24:04 +08:00
hipudding	baa4ba0aec	CANN: support gated linear attn (#18653 ) * CANN: support gated linear attn This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator. The feature was implemented by YushengZhao. Because the previous submission was based on an outdated codebase, this PR was rebased to merge. Co-authored-by: YushengZhao <yusheng.chao@outlook.com> Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: optimize OP gla Optimize gla for high performance * Remove unused comments --------- Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain> Co-authored-by: YushengZhao <yusheng.chao@outlook.com>	2026-01-16 16:18:49 +08:00
shaofeiqi	785a710085	OpenCL: add SOLVE_TRI op support (#18846 )	2026-01-15 11:17:17 -08:00
Georgi Gerganov	6e7fc8a146	cuda : print less debug logs when disabling cuda graphs (#18868 )	2026-01-15 20:53:01 +02:00
Johannes Gäßler	5c662d21a3	CUDA: fix alignment on register spill for FA (#18815 )	2026-01-15 15:14:50 +01:00
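
Note on commit a89002f07b (ggml webgpu: support for backend sampling): the message defines SOFTPLUS as log(1 + exp(x)) and EXPM1 as exp(x) - 1, and mentions keeping intermediates in f32 to avoid f16 overflow. The sketch below only illustrates why the textbook formulas are fragile. It is plain Python on 64-bit floats, not the WGSL shader from the commit, and the rearranged softplus form used here is a common identity chosen for illustration; whether the backend uses the same rearrangement is an assumption. The function names are hypothetical.

```python
import math


def softplus_naive(x: float) -> float:
    # Direct transcription of log(1 + exp(x)); exp(x) overflows for large x.
    return math.log(1.0 + math.exp(x))


def softplus_stable(x: float) -> float:
    # Identity: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    # exp(-|x|) is always in (0, 1], so it cannot overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def expm1_naive(x: float) -> float:
    # exp(x) - 1 loses all precision when x is tiny, because exp(x) rounds to 1.
    return math.exp(x) - 1.0


if __name__ == "__main__":
    for x in (-1000.0, 0.0, 1000.0):
        print(x, softplus_stable(x))
    # math.expm1 is the standard-library function that avoids the cancellation.
    print(expm1_naive(1e-20), math.expm1(1e-20))
```

The same overflow appears much earlier in half precision, where the largest finite value is 65504, which is consistent with the commit's remark about doing the intermediate math in f32.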


2389 commits