Commit graph

11022 commits

Author SHA1 Message Date
Concedo
82d562ad7b unstable merge 2025-12-28 23:03:03 +08:00
Concedo
9082403a43 disable vk events until directio pr or jeff's fix is added. (+1 squashed commits)
Squashed commits:

[4796db21a] disable vk events until directio pr or jeff's fix is added.
2025-12-28 21:54:25 +08:00
Concedo
a94d5ffbec Revert "Triage: revert https://github.com/ggml-org/llama.cpp/pull/18047 and https://github.com/ggml-org/llama.cpp/pull/18302"
This reverts commit dfa1b72d2f.
2025-12-28 21:48:55 +08:00
Concedo
4c1daf886a updated lite 2025-12-28 21:43:18 +08:00
Concedo
07fb18a04b handle case differences 2025-12-28 21:41:56 +08:00
Concedo
46891b3c0a updated lite 2025-12-28 18:07:13 +08:00
Concedo
21d801f6d5 init total weight for adaptive p 2025-12-28 15:33:06 +08:00
Concedo
ec95655f3c fixed default handling for special keys 2025-12-28 13:56:05 +08:00
Concedo
27261bfc26 adaptive decay as an overridable param (+1 squashed commits)
Squashed commits:

[d94df7843] adaptive decay as an overridable param
2025-12-28 13:34:20 +08:00
Concedo
1051313cb2 added deprecated item sdgendefaults (+1 squashed commits)
Squashed commits:

[efc14a5d9] fixed sd error
2025-12-27 22:47:43 +08:00
Concedo
f5282e114d allow ANY api field to have specified defaults, and to be overwritten by value specified at load time 2025-12-27 18:57:04 +08:00
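
One plausible precedence for the feature above, as a minimal C++ sketch (field names and the ordering are assumptions, not confirmed by the commit): a value in the request wins, otherwise a default specified at load time, otherwise the built-in default.

```cpp
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::string> builtin   = {{"temperature", "0.7"}};
    std::map<std::string, std::string> load_time = {{"temperature", "1.0"}};  // set at load
    std::map<std::string, std::string> request;                               // field omitted

    auto resolve = [&](const std::string & field) -> std::string {
        if (request.count(field))   return request.at(field);    // explicit value wins
        if (load_time.count(field)) return load_time.at(field);  // load-time default
        return builtin.at(field);                                 // built-in fallback
    };
    printf("temperature = %s\n", resolve("temperature").c_str());  // prints: 1.0
    return 0;
}
```
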
Concedo
6548645aaa rename power law sampler to adaptive p 2025-12-27 17:50:58 +08:00
Johannes Gäßler
9045c9afe5
llama-fit-params: fix Gemma 3 calculation (#18372) 2025-12-27 09:56:04 +01:00
Concedo
445aad5e00 remove sdcpp qwen image lora hack 2025-12-27 16:31:29 +08:00
Wagner Bruna
84765f5967
sd: sync to master-447-ccb6b0a (#1898)
* sd: sync to master-438-298b110

* sd: sync to master-440-3e81246

* sd: sync to master-444-a0adcfb

* sd: sync to master-447-ccb6b0a
2025-12-27 16:30:52 +08:00
Concedo
9bb362cce9 revised power law sampling 2025-12-27 10:59:46 +08:00
Concedo
91d8863f18 power law sampler added 2025-12-27 09:46:06 +08:00
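
The message names the technique without details; purely as a generic illustration of power-law reshaping (whether this matches the sampler added here is an assumption), token probabilities are raised to an exponent and renormalized against a running total weight:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> p = {0.5f, 0.3f, 0.2f};  // token probabilities
    const float alpha = 2.0f;                   // hypothetical shaping exponent

    float total = 0.0f;
    for (float & v : p) { v = std::pow(v, alpha); total += v; }  // reshape, track total
    for (float & v : p) v /= total;                              // renormalize

    for (float v : p) printf("%.3f ", v);  // prints: 0.658 0.237 0.105
    printf("\n");
    return 0;
}
```
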
Jeff Bolz
c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-26 16:12:58 -06:00
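
As context for the commit above, a minimal CPU sketch of the counting idea, assuming a routing array of expert ids per token (names and layout are illustrative, not the actual Vulkan shader):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n_expert = 8;
    // expert ids chosen by the router, one per (token, top-k slot) -- assumed input
    const std::vector<int> expert_ids = {2, 5, 2, 7, 5, 2};

    // preprocess pass: count how many rows each expert must process
    std::vector<int> counts(n_expert, 0);
    for (int id : expert_ids) counts[id]++;

    // main pass: experts with zero rows are discarded up front
    for (int e = 0; e < n_expert; e++) {
        if (counts[e] == 0) continue;  // the quickly-discarded "workgroup"
        printf("expert %d: %d rows\n", e, counts[e]);
    }
    return 0;
}
```
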
Jeff Bolz
7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-26 18:15:50 +01:00
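
The index identity behind the last paragraph can be checked on the CPU; a small sketch, assuming BN is a power of two as the '& (BN - 1)' mask requires:

```cpp
#include <cassert>

int main() {
    const int BN = 64;  // tile height; must be a power of two for the mask to work
    for (int ic = 0; ic < 4; ic++) {
        for (int j = 0; j < BN; j++) {
            // old scheme: slice B at row ic * BN, then mask the coord back into [0, BN)
            const int masked = (ic * BN + j) & (BN - 1);
            // new scheme: slice at row offset zero, no mask needed -- same coordinate
            assert(masked == j);
        }
    }
    return 0;
}
```
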
Jeff Bolz
9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332) 2025-12-26 18:15:02 +01:00
Eve
cb999704fb
vulkan: small dequantization improvements (#18380)
* iq4_xs

* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz
b96b82fc85
vulkan: Support UPSCALE w/antialias (#18327) 2025-12-26 17:00:57 +01:00
Jeff Bolz
10dc500bdb
vulkan: handle rope with large number of rows (#18306) 2025-12-26 16:53:46 +01:00
o7si
4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models

* server : add allow_processing param to clear_slot
2025-12-26 16:35:29 +01:00
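
A hedged sketch of the fallback pattern the fix implies, against llama.cpp's memory API from around this period (treat the exact signatures as assumptions): partial removal can fail for hybrid/recurrent models, and the slot must then be cleared rather than crashing.

```cpp
#include "llama.h"

// Fragment, not a full program. Returns true if the prompt prefix survived;
// false if the caller must reprocess from position 0 because the whole
// sequence was dropped.
static bool trim_sequence(llama_memory_t mem, llama_seq_id seq, llama_pos keep_until) {
    if (llama_memory_seq_rm(mem, seq, keep_until, -1)) {
        return true;  // partial removal worked (attention-style cache)
    }
    // hybrid/recurrent state cannot be rewound mid-sequence: clear it all
    llama_memory_seq_rm(mem, seq, -1, -1);
    return false;
}
```
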
Francisco Herrera
af3be131c0
docs: added note for pre SYCL Intel hardware (#18016)
Specify that it's for pre-SYCL hardware
2025-12-26 10:34:30 +08:00
0Marble
b07cda687c
CANN: implement the SSM_CONV operator (#17737)
* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-26 09:12:04 +08:00
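
For reference, SSM_CONV in ggml denotes a short causal, per-channel convolution (the form used by Mamba-style models); a CPU sketch for one channel, with the tap ordering an assumption and the CANN layout details omitted:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int d_conv = 3;
    const std::vector<float> x = {1, 2, 3, 4};        // one channel's sequence
    const std::vector<float> w = {0.5f, 0.3f, 0.2f};  // that channel's taps

    for (size_t t = 0; t < x.size(); t++) {
        float y = 0.0f;
        for (int j = 0; j < d_conv; j++) {  // only current and past inputs: causal
            const int idx = (int) t - j;
            if (idx >= 0) y += w[j] * x[idx];
        }
        printf("%g ", y);  // prints: 0.5 1.3 2.3 3.3
    }
    printf("\n");
    return 0;
}
```
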
Aman Gupta
85c40c9b02
ggml-cuda: fix regex for arch list (#18371)
* ggml-cuda: fix regex for arch list

* make regex exact
2025-12-26 01:35:14 +08:00
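
The real fix lives in ggml-cuda's CMake, but the failure mode of a non-exact pattern is easy to demonstrate; an illustrative C++ sketch with made-up patterns:

```cpp
#include <cassert>
#include <regex>
#include <string>

int main() {
    const std::string arch = "120f";
    // loose: a substring search on "12." also fires on "120f"
    assert(std::regex_search(arch, std::regex("12.")));
    // exact: requiring the whole token to match "12\d" rejects "120f"
    assert(!std::regex_match(arch, std::regex("12\\d")));
    return 0;
}
```
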
Concedo
dfa1b72d2f Triage: revert https://github.com/ggml-org/llama.cpp/pull/18047 and https://github.com/ggml-org/llama.cpp/pull/18302
Revert "vulkan: Implement set_tensor_async and the event interfaces (#18047)"

This reverts commit e1f15b454f. (+1 squashed commits)

Squashed commits:

[3cfbc7b1a] Revert "vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302)"

This reverts commit 2a9ea2020c.
2025-12-26 01:20:31 +08:00
Concedo
399fc9c57e rename tokens tab to context, move fa to hardware 2025-12-26 00:06:07 +08:00
Aman Gupta
83b3b1c271
cuda: optimize cumsum cub path (#18362)
* cuda: optimize cumsum cub path

* remove heavy perf test
2025-12-25 23:55:38 +08:00
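
The commit routes cumsum through CUB's device-wide scan; as a plain-C++ reference for what that path computes (not the CUDA implementation itself):

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const std::vector<float> x = {1, 2, 3, 4};
    std::vector<float> y(x.size());
    // inclusive scan: y[i] = x[0] + ... + x[i]
    std::inclusive_scan(x.begin(), x.end(), y.begin());
    for (float v : y) printf("%g ", v);  // prints: 1 3 6 10
    printf("\n");
    return 0;
}
```
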
Concedo
062f8b28eb fixed sdui gen queue 2025-12-25 23:21:33 +08:00
Aman Gupta
b0fb0f0aee
ggml-cuda: fix blackwell native builds (#18361)
* ggml-cuda: fix blackwell native builds

Replace 12x in native architectures by 12xa

* replace for GGML_NATIVE=OFF too

* only replace for native

* remove 120f-virtual for default compilation

---------

Co-authored-by: Aman Gupta <aman>
2025-12-25 22:12:11 +08:00
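
A toy version of the substitution described above (the real edit is in CMake; the pattern here is an illustrative assumption): append the architecture-specific 'a' suffix to 12x compute capabilities.

```cpp
#include <cstdio>
#include <regex>
#include <string>

int main() {
    std::string archs = "86;89;120";
    // rewrite any 12x entry to 12xa so Blackwell gets the arch-specific target
    archs = std::regex_replace(archs, std::regex("(12[0-9])"), "$1a");
    printf("%s\n", archs.c_str());  // prints: 86;89;120a
    return 0;
}
```
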
Concedo
cf4201e213 wip power law sampling 2025-12-25 22:01:16 +08:00
Penglin Cai
e68c19b0fd
CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934)
* CONV_TRANSPOSE_1D kernel_size>255

* remove condition check

* fix the bug of type conversion

* removing trailing whitespaces

* fix: return true in the switch case
2025-12-25 16:46:09 +08:00
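
To pin down the operation involved: a minimal single-channel, stride-1 reference for 1-D transposed convolution (the CANN kernel handles the general case, including kernels longer than 255 taps).

```cpp
#include <cstdio>
#include <vector>

int main() {
    const std::vector<float> x = {1, 2, 3};  // input
    const std::vector<float> k = {1, 1, 1};  // kernel (could be > 255 taps)
    std::vector<float> y(x.size() + k.size() - 1, 0.0f);

    for (size_t i = 0; i < x.size(); i++)
        for (size_t j = 0; j < k.size(); j++)
            y[i + j] += x[i] * k[j];  // scatter-add: the transpose of convolution

    for (float v : y) printf("%g ", v);  // prints: 1 3 6 5 3
    printf("\n");
    return 0;
}
```
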
Aadeshveer Singh
c54bba869d
ggml : optimize cuda cumsum fallback kernel (#18343) 2025-12-25 12:11:13 +08:00
Xuan-Son Nguyen
f5acfb2ffa
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option

* also allow stop while loading

* add docs

* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Xuan-Son Nguyen
4cbafad4f0
model: support MiMo-V2-Flash (#18328)
* mimov2: convert ok

* rename mimov2 --> mimo2

* fix conversion

* runnable but incorrect

* use sink

* add_sliding_window_pattern

* add swa and per-layer n_head_kv

* correct params

* somewhat working

* correct gating func

* nits

* mimo2: wire RMS eps + MoE bias + converter guards

* add co-author

Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>

* use add_rope_freq_base_swa

---------

Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com>
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
2025-12-24 23:07:08 +01:00
Concedo
6cc71db85a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/run-org-model.py
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
2025-12-25 00:06:27 +08:00
Concedo
3589a5e136 Merge commit '12ee1763a6' into concedo_experimental
# Conflicts:
#	docs/backend/hexagon/README.md
#	docs/backend/hexagon/developer.md
#	examples/gen-docs/gen-docs.cpp
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	examples/sycl/run-llama2.sh
#	examples/sycl/run-llama3.sh
#	examples/sycl/win-run-llama2.bat
#	examples/sycl/win-run-llama3.bat
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp-utils.h
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/htp-dma.c
#	ggml/src/ggml-hexagon/htp/htp-dma.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-opencl/kernels/transpose.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/snapdragon/adb/run-cli.sh
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2025-12-24 23:57:41 +08:00
Concedo
afe41b6eea Merge branch 'concedo_experimental' of https://github.com/LostRuins/koboldcpp into concedo_experimental 2025-12-24 23:42:52 +08:00
Concedo
d1983959d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	AGENTS.md
#	common/CMakeLists.txt
#	docs/development/parsing.md
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grammar-llguidance.cpp
#	tests/test-tokenizer-0.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-spm.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/cli/cli.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-12-24 23:42:28 +08:00
Aadeshveer Singh
c184284230
fit-params : fix race condition in fit-params output (#18276)
2025-12-24 15:57:38 +01:00
Aman Gupta
c8a2417d7b
CUDA: experimental native mxfp4 support for blackwell (#17906)
* CUDA: experimental native mxfp4 support for blackwell

* optimize load_tiles

* optimize quantize_mxfp4

* cleanup

* first pass review: formatting

* use interleaved layout for mma

* mmq: add assert for size

* use __nv_fp4x4_e2m1

* use iter_k as 512, cleanup

* Use 1200 as blackwell instead of 1000

* address review comments

* mmq: fix stride

* quantize.cu: use reference impl of e8m0 scale

* address review comments

* add 120f-virtual + minor fixes

---------

Co-authored-by: Aman Gupta <aman>
2025-12-24 22:28:26 +08:00
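
One of the bullets mentions a reference implementation of the e8m0 scale; a hedged sketch of that format (exponent-only, bias 127, per the OCP MX spec; the commit's rounding choices may differ from the one assumed here):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// encode a positive block maximum as an e8m0 scale: value = 2^(e - 127)
static uint8_t e8m0_from_float(float x) {
    int e = (int) std::ceil(std::log2(x)) + 127;  // round up so the scale covers x (assumed)
    if (e < 0)   e = 0;
    if (e > 254) e = 254;  // 0xFF encodes NaN in e8m0
    return (uint8_t) e;
}

int main() {
    printf("%u\n", e8m0_from_float(6.0f));  // 2^3 = 8 covers 6.0 -> prints 130
    return 0;
}
```
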
Wagner Bruna
f30da43b7f
sd: get the available schedulers directly from sd.cpp (#1900)
Avoids a hardcoded list on the Python side.
2025-12-24 21:55:24 +08:00
Saba Fallah
54132f1b1f
model : support for LlamaBidirectionalModel architecture (#18220)
* model: llama-embed-nemotron

* minor: python lint

* changed arch-name

* templated llm_build_llama to be used for both llama and llama-embed arch
2025-12-24 14:02:36 +01:00
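
A miniature of the sharing trick in the last bullet (all names hypothetical): one templated builder serves the causal and bidirectional variants.

```cpp
#include <cstdio>

// hypothetical stand-in for the templated llm_build_llama mentioned above
template <bool bidirectional>
static void build_llama_graph() {
    // the embedding variant drops the causal mask; everything else is shared
    printf("building %s attention graph\n", bidirectional ? "bidirectional" : "causal");
}

int main() {
    build_llama_graph<false>();  // llama
    build_llama_graph<true>();   // llama-embed
    return 0;
}
```
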
Jeff Bolz
2a9ea2020c
vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302) 2025-12-24 12:36:34 +01:00
Concedo
26d89bf589 support for downloading AVI from sdui 2025-12-24 18:40:10 +08:00
Wang Weixuan
ce7a6dc0fc
CANN : refactor ACL graph cache (#17752)
Move the graph property checking code into methods of LRU cache.

Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>
2025-12-24 17:50:24 +08:00
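
The refactor's shape, in miniature (a generic sketch; the CANN cache's real types and property set differ): the property check becomes a method on the cache entry, and hits are bumped to the front of the LRU list.

```cpp
#include <cstdio>
#include <list>

struct CachedGraph {
    int key;  // stand-in for the graph's identifying properties
    bool matches(int k) const { return key == k; }  // the "property check" as a method
};

struct GraphLru {
    std::list<CachedGraph> entries;
    CachedGraph * find(int k) {
        for (auto it = entries.begin(); it != entries.end(); ++it) {
            if (it->matches(k)) {
                entries.splice(entries.begin(), entries, it);  // bump hit to front
                return &entries.front();
            }
        }
        return nullptr;  // caller builds and inserts a new graph
    }
};

int main() {
    GraphLru lru;
    lru.entries.push_back({42});
    printf("%s\n", lru.find(42) ? "hit" : "miss");  // prints: hit
    return 0;
}
```
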
Jesse Ikonen
1ce0126b18
docs: Fix typos in SYCL documentation (#18269) 2025-12-24 17:19:47 +08:00
Ruben Ortlam
7f459c98e7
vulkan: use fewer FA rows for small cache runs (#18280) 2025-12-24 08:59:14 +01:00