
10513 commits

Author SHA1 Message Date
Concedo
b30f09db80 autoscroll fixes 2025-11-28 18:31:29 +08:00
Concedo
6aa79513a9 Merge branch 'cuda-fa-vec-fix-overflow-2' into concedo_experimental 2025-11-28 13:27:16 +08:00
Concedo
eda4a312cb Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/common.hpp
#	tests/test-backend-ops.cpp
#	tools/server/README.md
2025-11-28 13:22:02 +08:00
Concedo
e570478275 limit cuda arches + scale tweaks 2025-11-28 13:05:11 +08:00
Concedo
9a46faa1c3 fix for override tensors not passing correctly 2025-11-28 13:03:40 +08:00
Piotr Wilkin (ilintar)
cd0e3a7a3b
SOLVE_TRI CUDA kernel for small matrices (#17457)
2025-11-28 12:15:32 +08:00
Neo Zhang Jianyu
efaaccdd69
refactor pad_reflect_1d to make the UT case pass (#17204)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-11-28 08:50:56 +08:00
Johannes Gäßler
b13fcf85c5 CUDA: no FP16 arithmetic for vector FA kernel 2025-11-27 21:13:46 +01:00
Jeff Bolz
4abef75f2c
vulkan: Implement SOLVE_TRI (#17486)
* vulkan: Implement SOLVE_TRI

* load B matrix through shared memory

* use FLOAT_TYPE
2025-11-27 15:48:00 +01:00
Georgi Gerganov
c386114922
arch : add description about LLM_TENSOR_INFOS (#17550) 2025-11-27 16:34:13 +02:00
Georgi Gerganov
6783b11fb0
models : fix LFM2 tensors (#17548) 2025-11-27 16:04:29 +02:00
matt23654
909072abcf
cuda : fix UMA detection on discrete GPUs. (#17537) 2025-11-27 13:35:35 +02:00
Alberto Cabrera Pérez
cd8370b408
ggml-cpu: aarch64: q4_K repack gemm and gemv implementations (dotprod only) (#17494)
* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-27 13:25:14 +02:00
Eric Curtin
d21a76ac38
devops: Add build-essential to Ubuntu 26.04 image (#17531)
The build no longer passes; it needs more packages.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-27 18:35:47 +08:00
Aleksei Nikiforov
4fcd87cf7c
gguf-py : skip endian-conversion of MXFP4 data (#17523)
* gguf_convert_endian.py: skip MXFP4 data

* Use gguf.constants.GGML_QUANT_SIZES to determine block sizes
2025-11-27 11:35:38 +01:00
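The MXFP4 entry above mentions `gguf.constants.GGML_QUANT_SIZES`, which in gguf-py maps each quantization type to a `(block_size, type_size)` pair. A minimal sketch of how such a table gives a tensor's byte size follows, assuming a hand-copied subset of entries in a local `QUANT_SIZES` dict and a hypothetical `needs_byteswap` helper; it is not the code of gguf_convert_endian.py. An MXFP4 block holds a one-byte shared exponent plus 16 bytes of packed 4-bit values, so it has no multi-byte fields, which is presumably why the conversion can skip it.

```python
# Illustrative (block_size, type_size) table in the spirit of
# gguf.constants.GGML_QUANT_SIZES; only a hand-picked subset of types.
QUANT_SIZES: dict[str, tuple[int, int]] = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q8_0": (32, 34),   # fp16 scale + 32 int8 quants
    "MXFP4": (32, 17),  # 1-byte E8M0 exponent + 16 bytes of 4-bit values
}

# Hypothetical: types whose blocks contain only single-byte fields
# and therefore need no endian conversion.
BYTE_ORIENTED = {"MXFP4"}


def tensor_nbytes(qtype: str, n_elements: int) -> int:
    """Number of bytes occupied by n_elements of the given type."""
    block_size, type_size = QUANT_SIZES[qtype]
    if n_elements % block_size != 0:
        raise ValueError(f"{n_elements} is not a multiple of block size {block_size}")
    return n_elements // block_size * type_size


def needs_byteswap(qtype: str) -> bool:
    """True if the type's blocks contain multi-byte fields to swap."""
    return qtype not in BYTE_ORIENTED


print(tensor_nbytes("Q8_0", 64))   # 68
print(tensor_nbytes("MXFP4", 64))  # 34
print(needs_byteswap("MXFP4"))     # False
```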
Acly
b78db3bd50
vulkan : move contiguous checks to device_supports_op (#17490)
* vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op

* im2col: remove constraints on src0 (kernel input)
2025-11-27 06:54:19 +01:00
Jeff Bolz
142df17c9c
vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (#17514) 2025-11-27 06:32:30 +01:00
Concedo
7527f1eff0 handle media for jinja path (+1 squashed commits)
Squashed commits:

[29d47d6b7] handle media for jinja path
2025-11-27 11:40:08 +08:00
Concedo
782ec5bffe bad identifier name 2025-11-27 11:07:13 +08:00
Concedo
d68f4a5ae5 disable clip fa for now 2025-11-27 10:20:38 +08:00
Concedo
2b00292bfe display path on 404 2025-11-27 10:07:08 +08:00
Xuan-Son Nguyen
e509411cf1
server: enable jinja by default, update docs (#17524)
* server: enable jinja by default, update docs

* fix tests
2025-11-27 01:02:50 +01:00
lhez
7cba58bbea
opencl: add sqr, sqrt, mean and ssm_conv (#17476)
* opencl: add sqr

* opencl: add sqrt

* opencl: add mean

* opencl: add ssm_conv

* opencl: add missing cl_khr_fp16

* opencl: do sqrt in f32 then convert to f16 for better precision
2025-11-26 13:29:58 -08:00
Alberto Cabrera Pérez
5449367b21
Fix chunks being too small with small matrix sizes (#17526) 2025-11-26 13:14:54 -08:00
Han Qingzhe
1d594c295c
clip: (minicpmv) fix resampler kq_scale (#17516)
* debug:"solve minicpmv precision problem"

* “debug minicpmv”

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-11-26 21:44:07 +01:00
Concedo
d7c2f27749 try to fix some fattn inconsistencies 2025-11-27 01:55:26 +08:00
Concedo
c12f9e3b7c bump version 2025-11-27 01:04:09 +08:00
Concedo
6770767d8a allow FA for clip but with wmma disabled for turing on bad sizes 2025-11-27 01:03:29 +08:00
Concedo
e6ad29341b disable FA for clip test 2025-11-27 01:02:19 +08:00
Concedo
4497096cb0 Merge commit '3e18dba9fd' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	scripts/sync_vendor.py
#	tests/test-backend-ops.cpp
2025-11-27 00:07:37 +08:00
Jeff Bolz
eec1e33a9e
vulkan: allow graph_optimize for prompt processing workloads (#17475) 2025-11-26 16:46:33 +01:00
Jeff Bolz
879d673759
vulkan: Implement top-k (#17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
2025-11-26 16:45:43 +01:00
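The top-k approach described above can be sketched on the CPU as follows. This is an illustration in Python, not the Vulkan shader: workgroups become fixed-size chunks of `2**chunk_log2` elements, and the function name and parameters are assumptions. It relies on the fact that every element of the global top-K is also in the top-K of whatever chunk it lands in, so discarding the rest of each chunk is safe.

```python
def top_k(values: list[float], k: int, chunk_log2: int = 7) -> list[float]:
    """Return the k largest values in descending order.

    Each pass splits the candidates into chunks of 2**chunk_log2 elements,
    sorts every chunk and keeps only its k largest entries. Passes repeat
    until at most k candidates remain.
    """
    if k <= 0 or not values:
        return []
    if k == 1:
        # Fast path: a single max reduction instead of sorting.
        return [max(values)]
    chunk = 2 ** chunk_log2
    if chunk <= k:
        raise ValueError("chunk size must exceed k so that each pass shrinks the input")
    candidates = list(values)
    while len(candidates) > k:
        survivors: list[float] = []
        for start in range(0, len(candidates), chunk):
            block = sorted(candidates[start:start + chunk], reverse=True)
            survivors.extend(block[:k])  # discard all but this chunk's top k
        candidates = survivors
    return sorted(candidates, reverse=True)


print(top_k([5, 1, 9, 3, 7], 2, chunk_log2=2))  # [9, 7]
```

Each pass strictly shrinks the candidate set as long as the chunk size exceeds K, because every full chunk loses at least one element. The N-ary search variant added later in the same PR is not modeled here.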
Concedo
5fe1d51c24 fix gpt oss 2025-11-26 23:44:56 +08:00
Wagner Bruna
998dfcd1be
sd: add an API endpoint to list the available schedulers (#1856) 2025-11-26 22:49:36 +08:00
xctan
6ab4e50d9c
ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (#17448)
* ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16

* ggml-cpu : dedup scalar impl

* Update ggml/src/ggml-cpu/vec.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-26 15:33:05 +02:00
Adrien Gallouët
2336cc4784
cmake : use EXCLUDE_FROM_ALL to avoid patch-boringssl.cmake (#17520)
We have to separate the code path starting with CMake 3.28 because
`FetchContent_Populate` is now deprecated and will be completely removed
in a future version.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-26 15:15:21 +02:00
Adrien Gallouët
e6923caaec
ggml : fix ARM feature verification (#17519)
On arm64 with `cmake` version 3.31.6, the final feature verification fails:

    -- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
    -- Performing Test GGML_MACHINE_SUPPORTS_dotprod
    -- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_i8mm
    -- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_sve
    -- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_sme
    -- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
    -- Performing Test GGML_MACHINE_SUPPORTS_nosme
    -- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
    -- Checking for ARM features using flags:
    --   -U__ARM_FEATURE_SME
    --   -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
    -- Performing Test HAVE_DOTPROD
    -- Performing Test HAVE_DOTPROD - Failed
    -- Performing Test HAVE_SVE
    -- Performing Test HAVE_SVE - Failed
    -- Performing Test HAVE_MATMUL_INT8
    -- Performing Test HAVE_MATMUL_INT8 - Failed
    -- Performing Test HAVE_FMA
    -- Performing Test HAVE_FMA - Success
    -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
    -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
    -- Performing Test HAVE_SME
    -- Performing Test HAVE_SME - Failed
    -- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme

We need to explicitly replace `;` with spaces in the list to make
`CMAKE_REQUIRED_FLAGS` work correctly.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-26 15:14:41 +02:00
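CMake represents lists as semicolon-separated strings, while `CMAKE_REQUIRED_FLAGS` is read as a single string of space-separated compiler flags, so a list has to be joined with spaces before it is assigned. In CMake itself this is commonly done with `string(REPLACE ";" " " ...)`; the Python sketch below only illustrates the conversion on the flags from the log above, and `list_to_required_flags` is a hypothetical name.

```python
import shlex

# Flags as accumulated in a CMake list (semicolon-separated), copied from the log above.
ARM_FLAG_LIST = "-U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme"


def list_to_required_flags(cmake_list: str) -> str:
    """Join a CMake list into a space-separated flag string, dropping empty items."""
    return " ".join(item for item in cmake_list.split(";") if item)


flags = list_to_required_flags(ARM_FLAG_LIST)
# Whitespace tokenization now yields two separate flags instead of one
# semicolon-glued argument.
print(len(shlex.split(ARM_FLAG_LIST)), len(shlex.split(flags)))  # 1 2
```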
Concedo
d9b9c54393 added another alias to jinjatools 2025-11-26 18:52:08 +08:00
Jiacheng (Jason) Chen
3e18dba9fd
HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 (#17502)
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4

* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
2025-11-26 11:18:48 +01:00
hipudding
eeb5605de2
CANN: Add MROPE and IMROPE support (#17401)
* CANN: ROPE supports both MROPE and IMROPE.

1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.

Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.

* Resolve review comments
2025-11-26 16:44:19 +08:00
o7si
f3a848a3b1
chore: upgrade cpp-httplib from v0.27.0 to v0.28.0 (#17513) 2025-11-26 09:21:06 +02:00
Jeff Bolz
b3b03a7baf
vulkan: Implement GGML_OP_CUMSUM (#17479) 2025-11-26 07:08:10 +01:00
Concedo
9b6320cd71 adjust launcher scaling behavior 2025-11-25 21:32:03 +08:00
Georgi Gerganov
583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
Aleksei Nikiforov
05872ac885
convert : fix big-endian conversion (#17431)
* Fix convert_hf_to_gguf.py script on s390x

Assume converted model data is originally little-endian.
Byteswap data on s390x after reading it to put values in the correct representation
for any transformation needed, like calculating weight tensors.

Then byteswap data to little-endian before passing it to GGUFWriter while
GGUFWriter will byteswap data back to big endian if big endian output is requested.

`byteswap(inplace=True)` calls don't work with lazy tensor and array wrappers.
Use a copying byteswap to work around this behaviour.

* Make GGUFWriter accept tensors in native endianness instead of little-endian

With this change, if no byteswapping is actually needed, two redundant byteswaps can be omitted on s390x.

* Fix byteswapping in convert_hf_to_gguf.py for remote models
2025-11-25 14:18:16 +01:00
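The endianness scheme described above can be sketched with the standard library's `array` module in place of NumPy arrays or lazy tensors: input is assumed little-endian and swapped to host order after reading, and the writer accepts host-order data and swaps only when the requested output order differs from the host's. `read_le_floats` and `encode_floats` are hypothetical names, and the sketch handles float32 only.

```python
import struct
import sys
from array import array


def read_le_floats(raw: bytes) -> array:
    """Decode little-endian float32 data into host-native values."""
    values = array("f")
    values.frombytes(raw)
    if sys.byteorder == "big":
        values.byteswap()  # bring little-endian input into native order
    return values


def encode_floats(values: array, big_endian_output: bool) -> bytes:
    """Serialize native-order values, swapping only if the target order differs."""
    host_is_big = sys.byteorder == "big"
    if host_is_big == big_endian_output:
        return values.tobytes()  # already in the requested order: no swap at all
    swapped = array(values.typecode, values)  # copy instead of mutating the caller's data
    swapped.byteswap()
    return swapped.tobytes()


raw = struct.pack("<2f", 1.5, -2.0)  # little-endian input, as in a source checkpoint
vals = read_le_floats(raw)
assert encode_floats(vals, big_endian_output=True) == struct.pack(">2f", 1.5, -2.0)
```

Swapping a copy rather than calling `byteswap()` on the caller's array mirrors the commit's note that in-place swaps do not work with lazy wrappers.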
Diego Devesa
55ab25caf5
codeowners : remove slaren (#17492) 2025-11-25 13:00:23 +01:00
TianHao324
064c90d843
CANN: supports out_prod operator for F32 and F16 (#17406)
Co-authored-by: tianhao <tianhao42@huawei.com>
2025-11-25 17:39:06 +08:00
Concedo
724763fdec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	common/common.cpp
#	examples/batched/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/arch-fallback.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/server/CMakeLists.txt
2025-11-25 16:38:07 +08:00
Concedo
df30473716 deduplicate repeated statements in colab, minor refactor 2025-11-25 16:11:18 +08:00
Pascal
b1846f1c8e
webui: add rehype plugin to restore HTML in Markdown table cells (#17477)
* webui: add rehype plugin to restore HTML in Markdown table cells

The remark/rehype pipeline neutralizes inline HTML as literal text
(remarkLiteralHtml) so that XML/HTML snippets in LLM responses display
as-is instead of being rendered. This causes <br> and <ul> markup in
table cells to show as plain text.

This plugin traverses the HAST post-conversion, parses whitelisted HTML
patterns (<br>, <ul><li>) from text nodes, and replaces them with actual
HAST element nodes. For lists, adjacent siblings must be combined first
as the AST fragmentation breaks pattern matching.

Strict validation rejects malformed markup, keeping it as raw text.

* chore: update webui build output
2025-11-25 08:01:02 +01:00