koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-08-02 20:53:38 +00:00

Author	SHA1	Message	Date
Ruixiang Wang	000547513f	server: correct accepted tokens when need draft token replay (#26320 ) * spec: correct accepted tokens when need draft token replay * cont : naming --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-07-31 11:16:17 +03:00
David Friehs	15e755f30d	cuda: extract Q2_0 elements via __byte_perm (#25603 )	2026-07-31 11:15:44 +03:00
Ozymandias_EBON	9d9a6d29f6	SYCL: add oneMKL GEMM flash attention for XMX-accelerated prompt proc… (#25025 ) * SYCL: add oneMKL GEMM flash attention for XMX-accelerated prompt processing * fattn-mkl: fix interleaved dst layout in normalize kernel - Fix mkl_fa_normalize_head: use interleaved dst layout ((query * n_q_heads + head) * DV) matching TILE's flash_attn_combine_results. Previously used dense head-major layout which wrote head outputs to wrong addresses, corrupting attention for all models except Qwen3.6-27B (where GQA=6 heads were sparse enough to avoid visible overlap). - Remove 7 redundant stream->wait() calls — SYCL in-order queue already serializes pure SYCL kernel dependencies. Retain only the 4 MKL GEMM ↔ SYCL handshake barriers (oneMKL GEMM uses its own internal queue that does not respect SYCL in-order). - Remove unused dst_row_stride, diagnostic clutter, and dead K/V hex dump (fa_diag block in fattn-mkl.cpp). - Add MKL_FA_DISABLE=1 env var for A/B testing. - Add FA-DISP watchdog (MKL_FA_DEBUG=1) and FA-DIAG output fingerprint (MKL_FA_DIAG=1) in fattn.cpp. Tested: Gemma-4-26B, Gemma-4-31B, Qwen3.6-27B, Qwen3.6-35B-A3B Perf (B70/Battlemage, 32K, q8_0 KV): Gemma-4-26B: 1473 t/s MKL vs 746 TILE (1.97x) Qwen3.6-27B: 609 t/s MKL vs 330 TILE (1.85x) Co-Authored-By: Claude Code on DeepSeek-v4-Pro * Thank you for the review feedback: rename env vars, use GGML_LOG_INFO, document in SYCL.md Completed the following: - Rename MKL_FA_DISABLE → GGML_SYCL_ENABLE_MKL_FA (inverted: 0 to disable) - Rename MKL_FA_DEBUG → GGML_SYCL_MKL_FA_DEBUG - Rename MKL_FA_DIAG → GGML_SYCL_MKL_FA_DIAG - Replace fprintf(stderr, ...) / fflush(stderr) with GGML_LOG_INFO() macro - Document all three env vars in docs/backend/SYCL.md under Runtime - Add comment explaining MKL FA activation trigger (flash-attn + quantized KV cache + batch-size >= 1024 + n_kv >= 1024) Resolves review feedback from arthw. Again, thank you!!! 
Co-Authored-By: Claude Code on DeepSeek-v4-Pro * Thank you for the review feedback round 2: use ggml_sycl_get_env, remove dup waits, gate perf macros - Replace raw getenv() with ggml_sycl_get_env() in all 4 env-var checks (fattn.cpp: GGML_SYCL_ENABLE_MKL_FA, GGML_SYCL_MKL_FA_DEBUG, GGML_SYCL_MKL_FA_DIAG; fattn-mkl.cpp: GGML_SYCL_MKL_FA_DEBUG) - Remove duplicated stream->wait() before ev.wait_and_throw() in GEMM KQ and GEMM VKQ — ev.wait_and_throw() already waits for completion - Gate MKL_ACCUM macro behind do_print so timing accumulators are no-ops in normal operation - Remove redundant MIT/Intel copyright header from fattn-mkl.cpp - Remove unused #include <cfloat> - Expand SYCL.md MKL FA docs with step-by-step activation trigger and example llama-cli command Again, thank you!!! Co-Authored-By: Claude Code on DeepSeek-v4-Pro * fattn-mkl: enable MKL FA for all KV cache types Remove the quantized-only restriction on MKL activation — the MKL kernel converts any non-F16 K/V to F16 via to_fp16_sycl before GEMM, so F16 (default), BF16, and F32 caches all benefit from XMX hardware acceleration. The type restriction was an unnecessary gate. Before (F16/BF16 default cache + FA on at 32K prefill): ~356 t/s (TILE path) After: ~670 t/s (MKL path, matching quantized-cache baseline) Minimal change: two conditions removed, one comment updated in fattn.cpp. No kernel or conversion code changes — the dequant pipeline already covers all types. * fattn-mkl: rename mkl_disable -> mkl_enable for clarity * fattn-mkl: refine MKL FA dispatch gates Three changes: 1. Remove quantized-only restriction - MKL FA activates for all KV cache types (F16 default, BF16, F32, quantized). The MKL kernel converts non-F16 K/V via to_fp16_sycl before GEMM. 2. Rename mkl_disable -> mkl_enable to match env var (GGML_SYCL_ENABLE_MKL_FA). 3. Replace batch-size threshold with Q->ne[1] >= 32 gate. Keeps TG (Q=1) and MTP drafts (Q=3-8) on VEC path where fused kernel beats MKL launch overhead. 
Routes all multi-token prefill through XMX-accelerated GEMM. Production data confirms Q patterns: 1-8 TG, 32-127 cache reuse, 128+ full reprocess. At 32K F16/BF16 FA-on: 356 -> 670 t/s. * ggml-sycl: fix F16 cache + MKL FA multi-turn corruption; add gate guards Two changes: 1. Always copy F16 K/V to dense row-major buffers before MKL GEMM. Previously F16 was read in-place with raw tensor strides. During multi-turn conversations, the accumulated KV cache had different stride properties than a fresh prefill, producing corrupted outputs. Now dense F16 gets a fast memcpy; interleaved (Gemma) gets a strided copy kernel. This matches what the quantized paths already did through to_fp16_sycl. 2. Gate MKL FA on unsupported op params (max_bias, logit_softcap, batch dim mismatch) and pathological F16 strides (nb[1] not a multiple of ne[0]*2). These conditions would previously crash inside the MKL kernel. Pathological strides (test-only) and ALiBi/softcap fall through to TILE/VEC which handle them correctly. The stride check uses modulo rather than equality, so both dense (nb1 == ne0*2) and interleaved (nb1 == H * ne0*2) pass — all real models use these layouts. Only test cases with overlapping rows (nb1=32 or nb1=75 for ne0=40) are blocked. Thanks to hmscider for the oneDNN FA PR (#25222) which surfaced the same insight: always normalize inputs to contiguous F16 before GEMM. Co-Authored-By: Claude Code using DeepSeek-V4-Pro <noreply@anthropic.com> fattn-mkl: fix quant+GQA KV strides, tighten MKL gate, add K>=1024 tests Adding K>=1024 flash-attn test cases surfaced several MKL bugs: - Quant K/V with a padded seq-view (real KV cache) used the wrong strides in the dequant path... only the true Gemma interleave layout should reconstruct strides. nb[2] vs ne[1]*nb[1] - Gate was firing on shapes the kernel doesn't handle: head_dim < 64 or not a multiple of 64, MHA, attention sinks, and bf16 decode... fell through to vec which no bf16 case. 
Gate MKL to the validated envelope: gqa>=2, head_dim 64 through 512 (has to be a multiple of 64) with matching K/V head size, mask, no sinks/alibi/softcap... everything else falls back to tile. Covers Qwen Dense/MoE and Gemma4 Dense/MoE Ran test-backend-ops -o FLASH_ATTN_EXT: 3641/3641 pass. Perplexity unchanged... 6.7267 MKL vs 6.7290 stock using Qwen 27b q5_k_xl Update ggml/src/ggml-sycl/fattn.cpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Update ggml/src/ggml-sycl/fattn.cpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Update ggml/src/ggml-sycl/fattn.cpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * fattn-mkl: bound attention scratch so it doesn't grow with batch or context... also dropped the bf16 comment in fattn.cpp per arthw review. * Update ggml/src/ggml-sycl/fattn-mkl.cpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Update ggml/src/ggml-sycl/fattn-mkl.cpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * apply arthw suggestions: enum for dequant modes, macro for wg_size, env-var one-liners --------- Co-authored-by: Claude Code using DeepSeek-V4-Pro <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-07-31 10:43:16 +03:00
Neo Zhang	d5d3e05bf8	[SYCL] support the missed types in cpy (#26005 ) * support the missed types in cpy * use correct funct * rm unused code	2026-07-31 10:25:16 +03:00
fairydreaming	69e62fc77c	llama : enforce the same K and V cache types for DeepSeek V4; enable FA if V cache is quantized (#25871 ) * llama : enforce the same K and V cache types for DeepSeek V4; enable FA if V cache is quantized * llama : enforce the same K and V cache types for MLA models --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-07-31 10:03:30 +03:00
Sachin Sharma	1e22599522	ggml-zendnn : group matmul direct API for mul_mat_id (#25918 ) * ggml-zendnn : group matmul API for mul_mat_id * ggml-zendnn : scale MUL_MAT_ID fallback threshold by expert count	2026-07-31 09:40:52 +03:00
Neo Zhang	1c5b89ff63	sycl : support dev2dev memcpy by DEV2DEV_MEMCPY_FORWARD (#26234 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-07-31 09:20:28 +03:00
Neo Zhang	a2be61dc87	[SYCL] Support q2 mul_mat (#26231 ) * support q2_0 in mul_mat * support more q2_0 case	2026-07-31 09:19:41 +03:00
Titaniumtown	1553725965	sycl: fuse RMS_NORM + MUL (#26015 )	2026-07-31 09:17:53 +03:00
Masashi Yoshimura	8f4646a63e	ggml-webgpu: improve flash_attn_vec for quantized KV at long contexts (#25956 ) * improve fa of quantized kv cache * Fix some bugs and some comments. * fix v type check and some comments * Fix build error caused by rebasing * editorconfig checking pass	2026-07-31 09:08:40 +03:00
Xuan-Son Nguyen	5f55650a78	mtmd: add lanczos resize method [no release] (#26341 )	2026-07-30 21:59:49 +02:00
Xuan-Son Nguyen	b4ca032ae3	server: support inp embd to generate next token (#26313 ) * server: support embd for sampled token * fix ~server_batch()	2026-07-30 21:40:38 +02:00
Jeff Bolz	ea63b4d32e	vulkan: Support quantized concat (#25684 )	2026-07-30 13:11:32 -05:00
pmaybank	958d9c0b61	Test support for alternative conv layout (#25617 ) * add bool cwhn = true to conv_2d test cases * add layout check at graph building time * extend layout checks for conv2d.cu kernel * in CPU back-end kernel needs to be stored contiguously to prevent test failures with cwhn=1 * trim white space * do op support check in vulkan backend * fix CI failure and vulkan run-time assert failure by introducing new graph build-time check in ggml_backend_vk_device_supports_op * add additional check in support_op function for Vulkan to fix run-time assert failure	2026-07-31 01:14:16 +08:00
o7si	432d7ffe2c	llama-context : sync pending async copies before clearing embd_seq (#25676 )	2026-07-30 19:48:00 +03:00
Georgi Gerganov	47f686f53f	tests : avoid building get-model.cpp many times (#26317 ) * tests : remove get-model.cpp * tests : fix quant type selection	2026-07-30 19:34:04 +03:00
Robert Esclapez	e1a1abb787	ggml-cuda: Allow transpose-free gemmv computation (#26171 ) When a matrix's weights are shaped 1xK, leverage a transpose-free computation to use mat_mul_vec_f.	2026-07-30 21:39:46 +08:00
Georgi Gerganov	6b36c23056	readme : refresh (#26280 ) * docs : center badges and links, remove Hot topics - Use <div align="center"> for GitHub-compatible centering - Add dev branches and compile times links - Remove Hot topics section Assisted-by: llama.cpp:Qwen3.6-27B * readme : remove sections * docs : center badges, remove Hot topics, extract sections, remove tools - Use <div align="center"> for GitHub-compatible centering - Add dev branches and compile times links - Add lib llama API and llama-server REST API links - Remove Hot topics section - Remove Recent API changes section - Extract XCFramework section into docs/xcframework.md - Extract Completions section into docs/completions.md - Extract Obtaining and quantizing models into docs/models.md - Remove tools usage sections (llama-cli, llama-server, etc.) - Move Contributing section to the end Assisted-by: llama.cpp:Qwen3.6-27B * cont : arrange links * cont : fix ws * cont : remove seminal papers * cont : change sample model * cont : trim-down contributing section * cont : sort backends alphabetically * cont : words * cont : add fig captions * docs : models words * readme : shorter caption * cont : fix typo * cont : add window frame to screenshot	2026-07-30 16:14:37 +03:00
Georgi Gerganov	9ebfc3a8cf	sync : ggml	2026-07-30 15:44:24 +03:00
Georgi Gerganov	6a4c3357c8	ggml : bump version to 0.18.0 (ggml/1576)	2026-07-30 15:44:24 +03:00
Pasha Khosravi	9b2a088819	CUDA: add Q2_0 support (#25707 )	2026-07-30 12:33:25 +03:00
timkhronos	b2f221684f	Remove custom cpu op from the M3 graph, express with stock ops (#26297 )	2026-07-30 16:30:18 +08:00
Niklas Wenzel	d0bfb19812	metal: fix memory unwire if model is freed without any GPU operations (#26082 ) * metal: fix memory leak if model is freed without any GPU operations * metal: run dummy work only if residency sets are used * metal: wrap function in #if defined * metal: measure system-wide wired memory in test * metal: always build regression test Co-authored-by: YiChen Lv <63285796+forforever73@users.noreply.github.com> --------- Co-authored-by: YiChen Lv <63285796+forforever73@users.noreply.github.com>	2026-07-30 11:11:27 +03:00
Aleksander Grygier	21a5f5b7f9	ui: IndexedDB and Conversations data fixes (#26278 ) * fix: single-flight conversations store init * refactor: remove unused legacy-migration util * fix: make createSystemMessage transactional * fix: delete message branches cascading on edit/regenerate * fix: stop stamping lastModified on conversation metadata updates * fix: count cascaded forks in bulk delete toast, bulkify deleteAll * refactor: drop redundant conversation list respreads * refactor: create conversation in a single write * fix: use table constant in toggleConversationPin * fix: keep the system message placeholder out of the edit form * fix: keep focus in the system message editor after opening it * fix: focus the main chat form after submitting a system message * fix: update timestamp of the correct conversation on stream completion	2026-07-30 10:10:37 +02:00
Jonathan Clohessy	32703b42d6	ggml : Fix issue with kleidiai ci and stringop overflow warning (#26277 ) Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>	2026-07-30 09:17:30 +03:00
Neo Zhang	a6a77bc48d	[UT] enhance UT to show all real unsupported backends (#25234 ) * enhance UT to show real unsupported backends * cont : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-07-30 14:04:58 +08:00
Tunahan	64d528be72	mimo2: address MTP review feedback (#26228 ) Co-authored-by: tnhnyc <115956684+tnhnyc@users.noreply.github.com>	2026-07-30 11:55:58 +08:00
Aleksander Grygier	3018a11e79	fix: increase greeting spacing on md screens (#26287 )	2026-07-29 19:25:13 +02:00
Xuan-Son Nguyen	afeebe103b	llama: move suppress_tokens handling to common/sampling (#26276 ) * llama: move suppress_tokens handling to common/sampling * address security issues * rm has_logit_bias	2026-07-29 18:02:30 +02:00
Kakaru	caa596ab3f	ggml-cuda : disable MMQ on devices with less than 48 KiB shared memory (#26141 ) ggml_cuda_should_use_mmq() selects MMQ purely from the quantization type. The current MMQ configurations are designed and maintained against a minimum of 48 KiB per-block shared memory, the limit provided by NVIDIA Pascal GPUs and later. On devices that report less, no supported MMQ tile fits and mul_mat_q_switch_J() aborts when every tile size exceeds the device's per-block shared memory budget. Disable MMQ when smpbo < 48 KiB so the caller falls back to the BLAS path instead of hitting GGML_ABORT. Some current MUSA QY1 devices report only 28 KiB and are covered by this guard. Reproduced on a Moore Threads MTT S70 (arch mp_21, 28 KiB shared memory per block) with an RWKV-7 0.1B Q8_0 model: $ llama-bench -m rwkv7-g1d-0.1b-Q8_0.gguf -p 128 -n 0 J_best=0 ggml/src/ggml-cuda/template-instances/../mmq.cuh:1521: fatal error (core dumped) Only prefill (batch > 1) is affected; token generation is fine. After the fix the same device falls back to the BLAS path: Q8_0 pp128 1470.7 t/s, tg8 55.3 t/s (was: abort) FP16 unchanged Q4_K_M unchanged This matches a -DGGML_CUDA_FORCE_CUBLAS=ON build (pp128 1464.2 t/s), which confirms the fallback path is the one being taken. This is not MUSA-specific: any device with less than 48 KiB per-block shared memory is affected. Co-authored-by: KakaruHayate <KakaruHayate@users.noreply.github.com>	2026-07-29 20:27:35 +08:00
Titaniumtown	11b068d066	sycl: contiguous fast path + 32-bit index math for unary elementwise ops (#25946 ) * sycl: contiguous fast path + 32-bit index math for unary elementwise ops * sycl: use fastdiv for elementwise index math	2026-07-29 15:16:57 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	e2f59ed71d	vendor: update BoringSSL to 0.20260728.0 (#26241 )	2026-07-29 15:16:02 +03:00
Georgi Gerganov	992c325323	server : add trace logging for slot similarity checking (#26271 ) Adds trace logging in server-context.cpp for slot similarity checking during prompt cache slot selection, including skip reasons and similarity calculation details. Assisted-by: llama.cpp:Qwen3.6-27B	2026-07-29 14:59:44 +03:00
Kaben Nanlohy	e1af89a681	conversion: fix Qwen2.5-Omni mmproj conversion regression (#26262 )	2026-07-29 12:53:44 +02:00
Aman Gupta	f5b9bd39b5	RPC: add tensor_memset (#25912 )	2026-07-29 15:04:30 +08:00
Geramy Loveless	60bccc3763	add rdna3.5, and 3 to mmq configs so they can be tuned independently. (#26199 )	2026-07-29 08:43:45 +02:00
Satinder Grewal	7be2c65dc9	model: add NextN/MTP speculative decoding support for GLM_DSA (GLM-5.2) (#25980 ) * model: add NextN/MTP speculative decoding support for GLM_DSA (GLM-5.2) Adds GLM-5.2 NextN/MTP as a --spec-type draft-mtp target: nextn tensor loading via the qwen35moe/step35-style presence probe, a graph_mtp builder (enorm/hnorm/eh_proj + dense MLA + sigmoid-gated MoE with shared expert + shared head with fallbacks, _s scale tensors passed for NVFP4), t_h_nextn extraction in the trunk graph, and MTP-context KV setup: the draft head runs dense MLA, so the MTP context uses a plain attention KV cache holding only the nextn layer(s) (same pattern as the hybrid Qwen3.5 MTP context) while the main context keeps the DSA cache, now filtered to trunk layers only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * convert : support --mtp/--no-mtp export for GlmMoeDsaForCausalLM (GLM-5.2) Opt GLM-5.2 into the supports_mtp_export contract (post-#25641 shape, mirroring HYV3Model/Step35Model): --no-mtp drops the appended NextN block (blk.78) and its nextn_predict_layers KV; --mtp keeps only the NextN block plus shared embeddings/norm/lm_head. Default (bundled) output is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-07-29 14:02:31 +08:00
Guido Imperiale	e9fa0781f1	model: Add Laguna-S-2.1 LLM_TYPE (#26233 )	2026-07-28 21:02:33 +02:00
Reese Levine	bc71c24c9d	ggml-webgpu: Fix some binding alias issues to support all archs, fix recurrent-state-rollback test (#25931 ) * Add overlap glu variant to support all archs, fix recurrent-state-rollback test * format * Fix all arch overlapped ranges * format * diagnose bus error on apple ci * More testing * more testing * more targeted testing * Fix bug in alignment for > 4gb buffer offsets * Fix bug in view offsets * Try avoiding multi_buffers * not fixed yet, more logging :( * Handle edge case in set_rows * Try looking at view source * Skip deepseek32 for now and clean up trace infrastructure * simplify skipping * last cleanup * actually final cleanup * update handling of overlap * format * try skipping other failing model	2026-07-28 21:13:06 +03:00
Hongqiang Wang	8190848bb3	opencl: skip the Adreno KQ/KQV image kernels for multi-stream batches (#26189 ) The Adreno KQ/KQV image1d kernels (ggml_cl_mul_mat_kq_kqv_adreno) ignore dim 3 entirely: the sub-buffer covers only nb02*ne02 bytes and the kernel receives no ne03/ne13/nb03/nb13 arguments. With the unified KV cache, multi-sequence batches (e.g. llama-perplexity with its default -b 2048, n_seq=4, or a multi-slot llama-server) present KQ/KQV as 4D tensors with ne3 = n_stream, so every stream past the first reads the first stream's K/V and produces garbage. Flash attention masks the bug where it is enabled; devices where FA is declined (e.g. Adreno 740) hit it with default settings. Route ne03/ne13 > 1 to the general path, which handles dim 3, and honor view_offs when creating the sub-buffers (currently always 0 for tensors reaching this function, but the function would silently misread any future view). Llama-3.2-1B-Instruct Q4_0, wiki.test.raw, 8 chunks, -ngl 99: - Adreno 740, default: PPL 1817.64 -> 15.61 - Adreno 740, -fa 0: PPL 1941.64 -> 15.61 - Adreno 840, -fa 0: PPL 1943.90 -> 15.50 - single-stream (-b 512) results unchanged (15.6090) - test-backend-ops -o MUL_MAT on 740: identical before/after (909 OK, 12 pre-existing q6_K failures)	2026-07-28 11:04:42 -07:00
Daniel Bevenius	7e1e28cae3	mtmd : add Nemotron 3 Nano Omni support (parakeet) (#22520 ) * mtmd : add Nemotron 3 Nano Omni support (parakeet) This commit adds support for the subsampling and encoder part of Nemotron Nemo 3 omni model. The Parakeet subsampling/encoder were taken from parakeet.cpp which is currently a pull request against whisper.cpp. I've tried to copy the code a close as possible to hopefully enable easy patching between the these two project later. Refs: https://github.com/ggml-org/whisper.cpp/pull/3735 * mtmd : generate rel pos tensor in graph instead of in conversion [no ci] This commit removes the generation of the relative positional tensor in the model conversion script and instead computes it in the encoder graph. This is only done for the window of positions required for the current audio sample. * mtmd : add clip_get_model to clip API [no ci] This commit adds a function to get access to the clip_model. It also removes the two functions clip_get_mel_filter_tensor, and clip_get_window_tensor(const struct clip_ctx * ctx) which can now use clip_get_model to access the model tensors that it needs. * mtmd : read mel_filters and window into hparams * mtmd : use set_input_f32 lambda [no ci] * mtmd : add better asserts for mel_filters and hann window [no ci] * mtmd : add missing size_t cast * mtmd : change type of pad to size_t * mtmd : zero initialize samples_padded * mtmd : remove unsued ctx member from parakeet preprocessor * mtmd : make log_mel_spectrogram_parakeet_worker_thread private static * mtmd : sync/update parakeeet impl with latest whisper.cpp This commit updates the parakeet code in mtmd to reflect the latest updates to parakeet.cpp in whisper.cpp. A follow up commit will address the currently hardcoded dw_pad and see if we can add n_conv_kernel as a model metadata field. 
* mtmd : add audio_conv_kernel_size to model conversion This commit updates the model conversion to read the conv_kernel_size field from the sound_config section of the models config.json file. It then uses this field instead of the hardcoded values in parakeet.cpp. * mtmd : cleanup [no ci] * conversion : call super().filter_tensors [no ci] * do not discard result of super filter_tensors * mtmd : use build_mm instead of ggml_mul_mat * mtmd : use build_ffn * mtmd : move and reuse get_vector lambda * mtmd : use build_inp_raw for parakeet * mtmd : throw exception in get_scalar instead of assert * mtmd : fix std::min call * mtmt : use .c_str in throw clause in get_vector * mtmd : check for F32 type and non-empty tensor in get_vector The get_vector lambda is used by get_scalar but also standalone to read in the mel_filters and the window data. Therefor we are not checking for 1D tensors but allowing multiple dimensions. We do have a check in get_scalar to verify the size of the vector. * mtmd : replace hardcoded 1101 for n_tokens_real * mtmd : assert subsampling_factor is 8 This commit adds an assert of the parakeet subsampling factor to check that it is 8. The motivation for this is that this model currently has three convolutions with a stride of 2. If the underlying model updates the subsampling factor these convolution operations will need to be updated and this will produce and error if this occurs. * mtmd : remove unused ggml_tensors attn_pos_w and mm_norm_w * mtmd : remove single thread path This commit removes the single thread path which was a left over from the original parakeet.cpp where n_threads is configurable. * fix some security issues --------- Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-07-28 17:20:25 +02:00
Aleksander Grygier	6e2bc65fb2	ui: rendering performance follow-up (#26097 )	2026-07-28 17:13:25 +02:00
Julien Jerphanion	ad77bd31a6	docs: Adapt conda-forge package name (#26229 ) Co-authored-by: dev-tinker <dev-tinker@users.noreply.github.com>	2026-07-28 16:51:20 +02:00
Xuan-Son Nguyen	ee3d1b54c1	server: abstract llama_memory calls to common_memory (#26221 )	2026-07-28 16:35:20 +02:00
Aman Gupta	da5b448622	ggml : set output of view src (#25729 ) * llama-graph: set_outputs to t->view_src * change set_output to GGML_ASSERT about views not being outputs * sampler : avoid views in outputs * cont : fix dist sampler * cont : consistent logits handling * ggml : set output of view src * graph : simplify set_outputs() * cont : cleanup Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>	2026-07-28 16:23:24 +03:00
Jeff Bolz	8161641005	vulkan: add iq4_nl support back to FA (#24585 ) * vulkan: add iq4_nl support back to FA I was originally concerned about wasting shared memory on the LUT, but it's small and unlikely to matter in practice. Also support q1_0 for non-coopmat2. Fixes #23681 * remove q1_0 FA support	2026-07-28 07:06:03 -05:00
Bhavik Sharda	b62b350981	ggml-cuda: add chunked SSD matmul for Mamba-2 prefill acceleration (#22675 ) * ggml-cuda: add chunked SSD matmul for Mamba-2 prefill acceleration * cuda: added SSD CICD fixes for CUDA / HIP / MUSA / MSVC. * ggml-cuda: review comments fixed. * ggml-cuda: Fuse M matrix materialization into pre_matmul kernel and enabled test. * ggml-cuda: test updates and fixes * ggml-cuda: test updates to remove hardcoding of tensor initialise data limits. * ggml-cuda: ssd minor review comment fixed. * ggml-cuda: ssd minor CICD fixed. * CUDA SSD: Fixes correctness by promoting s0_stride_seq to int64_t, improves memory coalescing in ssm_ssd_prepare_dt_kernel, and boosts efficiency by merging B_weighted and C_scaled; also addresses prior review comments. * cuda: fix sdata read-write race in prepare_dt fallback scan loop	2026-07-28 17:33:42 +05:30
王金旭	84075273c8	spec: add DSpark speculative decoding (#25173 ) * spec: add DSpark speculative decoding DSpark (DeepSpec, 2026) on top of the merged DFlash drafter. It reuses the DFlash encoder/decoder graph, target feature extraction and KV-cache injection, and the verify/accept path unchanged; the draft model is a new "dspark" arch adding a low-rank Markov head (markov_w1/w2) and an optional (unused here) confidence head. No new public APIs. The proposal is the only change: the block is anchor-first (position 0 already predicts the first draft) and the decoder graph applies a semi-autoregressive, previous-token conditioned logit bias in-graph, chained per block position: logits'(i) = logits(i) + markov_w2 . markov_w1[prev(i)] prev(0) = the block's anchor token, prev(i>0) = argmax(logits'(i-1)) vectorized across all blocks in the batch; the anchors are fed through a dedicated graph input (token 0 of every block). Greedy stays lossless (verify unchanged, same as DFlash). - new arch "dspark" (llama_model_dspark : llama_model_dflash, reuses the graph, loads the markov/confidence tensors; shares the target's embed/lm_head). - Qwen3DSparkModel converter. - new spec type "draft-dspark" (common_speculative_impl_draft_dspark : common_speculative_impl_draft_dflash, overrides draft() only: submits whole anchor-first blocks and greedily reads back the biased logits). * spec: read draft block size in the dflash impl * docs: add DSpark section to speculative.md * spec: keep dspark block size read in the dspark impl * dspark : add TODOs for incomplete parts - confidence head is loaded but not used yet - confidence-scheduled prefix pruning is not implemented - the in-graph Markov chain is greedy-only - only Qwen3 backbones are supported for now (also noted in docs) * spec: fold DSpark into the DFlash arch Address review: drop LLM_ARCH_DSPARK and the dspark.block_size / markov_rank GGUF keys. 
A DSpark draft now converts to a DFlash GGUF; the Markov head tensors are detected by presence (like eagle3 d2t), block_size is read from the existing dflash.block_size key, and the block anchors are taken as a strided view of the decoder's token input instead of a separate graph input. * spec: add confidence-based draft pruning for DSpark The DSpark confidence head predicts per-position acceptance of the drafted block. --spec-draft-conf-min truncates the block at the first position below the threshold (default 0 = disabled). * fold the dspark impl into dflash, selected by spec type * address review comments * dspark: clean up and improve naming * update readme * remove trailing whitespace * dflash: draft full n_max blocks, defer dp.n_max to the central truncation The DSpark markov head views the draft batch as a uniform [n_seqs x block] grid, but the per-seq dp.n_max clamp could produce blocks of different sizes, silently corrupting the strided views and the resulting logits. Drop the clamp and always draft the full n_max block for every sequence: dp.n_max is already enforced by the central truncation in common_speculative_draft(), the same way eagle3 handles it. Co-authored-by: Zaire404 <3147879462@qq.com> * dflash: assert the markov head block-uniformity invariant, require the conf head With the draft batch always submitting equal-size n_max blocks, a non-divisible token count can only mean the batch was split across ubatches or a caller broke the layout - fail loudly instead of silently dropping the markov bias. The block_drafts > block_size early return stays: worst-case graph reserve passes legitimately build with n_seq_tokens > block_size. Also make conf_proj required when the markov head is present: the confidence head is part of the DSpark checkpoint format, and a missing head would otherwise leave --spec-draft-conf-min silently reading stale embeddings instead of confidences. 
Co-authored-by: Zaire404 <3147879462@qq.com> * dspark: fold conf_min into p_min p_min and conf_min express the same thing - the minimum predicted survival probability for a drafted position - differing only in how the estimate is obtained: token probability for regular drafters, the trained confidence head for DSpark. The DSpark readback never used p_min, so reuse it for the confidence threshold and drop the separate --spec-draft-conf-min flag. Both defaulted to 0 (disabled), so behavior is unchanged. Co-authored-by: Zaire404 <3147879462@qq.com> * dflash: note the confidence broadcast workaround Requested in review: the ggml_repeat only adapts the [1, n_tok] confidences to the n_embd-wide embd_nextn transport so that llama_get_embeddings_nextn can be reused - not a placeholder. Co-authored-by: Zaire404 <3147879462@qq.com> * cont : clarify [no ci] --------- Co-authored-by: Ruixiang Wang <wangruixiang07@outlook.com> Co-authored-by: Zaire404 <3147879462@qq.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-07-28 14:43:27 +03:00
Aldehir Rojas	6ba5ef2470	common/chat: add specialized minimax m3 parser (#26210 )	2026-07-28 04:27:20 -05:00
meatposes	d6b61ac0d3	sycl: fix use-after-return of the SDPA scale in the oneDNN flash-attention path (#25880 ) * sycl: fix use-after-return of the SDPA scale in the oneDNN flash-attention path The scale was uploaded with an async memcpy sourced from a stack local. On the in-order queue that copy is ordered behind the K/V staging kernels; once n_kv is large enough (>= ~26k observed on Arc Pro B70) the staging outlives the host stack frame and the copy reads recycled memory, feeding the SDPA a garbage scale. Output then collapses to a single repeated token and the KV cache is poisoned for the rest of the session. Short contexts win the race by accident, and test-backend-ops caps FLASH_ATTN_EXT at kv=1024, which is why CI never caught it. The previous device_count > 1 wait_and_throw() gate (and reverting it, PR #25741) fixes the symptom only by keeping the frame alive across the copy at the cost of a host sync on every FA call. Fix: cache one device scalar per (device, value) -- the scale is constant per model -- and upload it synchronously once. The single-device fast path (no per-call host sync) is then safe: every device-side hazard already serializes on the in-order queue. The multi-GPU conservative wait is kept unchanged. Also: - GGML_SYCL_FA_ONEDNN_MAX_KV env (0 = unlimited): optional n_kv ceiling that routes very long sequences to the native FA kernel. - test-backend-ops: FLASH_ATTN_EXT F16 cases up to kv=65536 (Qwen3.6-27B geometry hsk=hsv=256 GQA 6, and hsk=128 GQA 4), closing the kv=1024 blind spot. Note the race itself needs a live multi-op pipeline to reproduce; single-op runs pass even on broken builds. Verified on Arc Pro B70 (bmg_g31), Qwen3.6-27B Q4_K, -c 131072: output byte-identical at temp 0 to the native FA path through 32k-deep prefill, with prefill depth-flat at 820-840 t/s (vs 340-350 native at 32k depth). 
Assisted-by: Claude Fable 5 * sycl: handle GGML_SYCL_FA_ONEDNN_MAX_KV like the other runtime env vars and document it Review feedback on #25880: - read the variable once at backend init into g_ggml_sycl_fa_onednn_max_kv via ggml_sycl_get_env, and print it in the startup env listing (-lv 4 shows it) - document GGML_SYCL_FA_ONEDNN and GGML_SYCL_FA_ONEDNN_MAX_KV in the SYCL.md runtime table Also trim the added FLASH_ATTN_EXT cases to kv={4096,16384}: the 32768/65536 shapes exceed the legacy NMSE threshold on both the oneDNN and native kernels (long-sequence fp16 accumulation drift, present before this PR) and would fail CI for an unrelated reason. Assisted-by: Claude Fable 5 * sycl: clarify GGML_SYCL_FA_ONEDNN_MAX_KV default is disabled Assisted-by: Claude Fable 5 * sycl: state default behavior of GGML_SYCL_FA_ONEDNN_MAX_KV explicitly Assisted-by: Claude Fable 5 * Update ggml/src/ggml-sycl/fattn-onednn.cpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * sycl: write the SDPA scale from a kernel instead of caching it The per-(device, value) scale cache was a function-local static unordered_map with no synchronization, so concurrent backend instances could access and rehash it at the same time. Write the scalar with a single_task instead. The value is captured into the command, so no host memory has to outlive the call -- which is what the use-after-return fix needed in the first place. That removes the shared container, the leaked device allocation and the string key, and it also closes the remaining async-memcpy-from-a-stack-local on the first flash-attention call. Ordering does not rely on timing: the queue is created with sycl::property::queue::in_order and the dnnl stream wraps that same queue, so the write completes before the SDPA reads the scalar. The multi-GPU wait_and_throw() branch is unchanged. Also drop the <cstdlib> include, which is unused. 
Assisted-by: Claude Opus 5 --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-07-28 11:37:25 +03:00
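
The DSpark commit (84075273c8) above describes a previous-token conditioned logit bias, logits'(i) = logits(i) + markov_w2 . markov_w1[prev(i)], with prev(0) set to the block's anchor token and prev(i>0) = argmax(logits'(i-1)). The following is a minimal pure-Python sketch of that chain for a single block, not the in-graph ggml implementation. The function name and the tensor shapes (markov_w1 and markov_w2 both stored as [vocab][rank] lists) are assumptions made for illustration; the commit message does not state the actual layouts.

```python
def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def markov_biased_draft(logits, anchor, markov_w1, markov_w2):
    """Apply the chained Markov bias to one draft block.

    logits    : [block][vocab] raw draft logits (hypothetical layout)
    anchor    : token id that precedes position 0 of the block
    markov_w1 : [vocab][rank] low-rank row per previous token (assumed shape)
    markov_w2 : [vocab][rank] projection row per output token (assumed shape)
    Returns (biased_logits, greedy_tokens).
    """
    biased_block, tokens = [], []
    prev = anchor  # prev(0) is the block's anchor token
    for row in logits:
        hidden = markov_w1[prev]
        # bias[v] = dot(markov_w2[v], markov_w1[prev])
        biased = [
            x + sum(a * b for a, b in zip(w2_row, hidden))
            for x, w2_row in zip(row, markov_w2)
        ]
        biased_block.append(biased)
        prev = argmax(biased)  # prev(i>0) = argmax(logits'(i-1))
        tokens.append(prev)
    return biased_block, tokens


if __name__ == "__main__":
    logits = [[1.0, 0.0, 0.0], [0.5, 0.0, 1.0]]
    w1 = [[1.0], [0.0], [-1.0]]
    w2 = [[0.0], [2.0], [0.0]]
    _, tokens = markov_biased_draft(logits, 0, w1, w2)
    print(tokens)  # [1, 2]; without the bias, greedy would pick [0, 2]
```

The sketch is greedy-only, which matches the TODO in the commit noting that the in-graph Markov chain is greedy-only.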

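The SYCL flash-attention commit (9d9a6d29f6) gates the MKL path on F16 row strides with a modulo check rather than an equality check. A small sketch of that predicate follows, assuming the row stride nb1 is measured in bytes and an F16 element is 2 bytes; the function name is hypothetical and the real gate in fattn.cpp checks further conditions as well.

```python
F16_BYTES = 2  # size of one half-precision element


def f16_row_stride_ok(ne0, nb1):
    """True when the row stride nb1 (bytes) is a whole multiple of one
    dense F16 row of ne0 elements. Dense rows (nb1 == ne0*2) and
    interleaved rows (nb1 == H * ne0*2) both pass."""
    return nb1 % (ne0 * F16_BYTES) == 0


if __name__ == "__main__":
    ne0 = 40
    for nb1 in (80, 8 * 80, 32, 75):
        print(nb1, f16_row_stride_ok(ne0, nb1))
```

With ne0=40 the dense stride 80 and an interleaved stride such as 8 * 80 are accepted, while the overlapping-row test strides 32 and 75 named in the commit are rejected.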

10210 commits