Commit graph

10487 commits

Author SHA1 Message Date
Concedo
7527f1eff0 handle media for jinja path (+1 squashed commits)
Squashed commits:

[29d47d6b7] handle media for jinja path
2025-11-27 11:40:08 +08:00
Concedo
782ec5bffe bad identifier name 2025-11-27 11:07:13 +08:00
Concedo
d68f4a5ae5 disable clip fa for now 2025-11-27 10:20:38 +08:00
Concedo
2b00292bfe display path on 404 2025-11-27 10:07:08 +08:00
Concedo
d7c2f27749 try to fix some fattn inconsistencies 2025-11-27 01:55:26 +08:00
Concedo
c12f9e3b7c bump version 2025-11-27 01:04:09 +08:00
Concedo
6770767d8a allow FA for clip but with wmma disabled for turing on bad sizes 2025-11-27 01:03:29 +08:00
Concedo
e6ad29341b disable FA for clip test 2025-11-27 01:02:19 +08:00
Concedo
4497096cb0 Merge commit '3e18dba9fd' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	scripts/sync_vendor.py
#	tests/test-backend-ops.cpp
2025-11-27 00:07:37 +08:00
Concedo
5fe1d51c24 fix gpt oss 2025-11-26 23:44:56 +08:00
Wagner Bruna
998dfcd1be
sd: add an API endpoint to list the available schedulers (#1856) 2025-11-26 22:49:36 +08:00
Concedo
d9b9c54393 added another alias to jinjatools 2025-11-26 18:52:08 +08:00
Jiacheng (Jason) Chen
3e18dba9fd
HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 (#17502)
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4

* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
2025-11-26 11:18:48 +01:00
hipudding
eeb5605de2
CANN: Add MROPE and IMROPE support (#17401)
Some checks failed
Python check requirements.txt / check-requirements (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python Type-Check / pyright type-check (push) Has been cancelled
* CANN: ROPE supports both MROPE and IMROPE.

1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.

Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.

* Resolve review comments
2025-11-26 16:44:19 +08:00
o7si
f3a848a3b1
chore: upgrade cpp-httplib from v0.27.0 to v0.28.0 (#17513) 2025-11-26 09:21:06 +02:00
Jeff Bolz
b3b03a7baf
vulkan: Implement GGML_OP_CUMSUM (#17479) 2025-11-26 07:08:10 +01:00
Concedo
9b6320cd71 adjust launcher scaling behavior 2025-11-25 21:32:03 +08:00
Georgi Gerganov
583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
Aleksei Nikiforov
05872ac885
convert : fix big-endian conversion (#17431)
* Fix convert_hf_to_gguf.py script on s390x

Assume converted model data is originally little-endian.
Byteswap data on s390x after reading it to put values in correct presentation
for any transformation needed, like calculating weight tensors.

Then byteswap data to little-endian before passing it to GGUFWriter while
GGUFWriter will byteswap data back to big endian if big endian output is requested.

byteswap(inplace=True) calls don't work with lazy tensor and array wrappers.
Use byteswap with copying data to workaround this behaviour.

* Make GGUFWriter accept tensors in native endianness instead of little-endian

With this change if no byteswapping is actually needed, 2 excessive byteswaps can be omitted on s390x

* Fix byteswapping in convert_hf_to_gguf.py for remote models
2025-11-25 14:18:16 +01:00
Diego Devesa
55ab25caf5
codeowners : remove slaren (#17492) 2025-11-25 13:00:23 +01:00
TianHao324
064c90d843
CANN: supports out_prod operator for F32 and F16 (#17406)
Co-authored-by: tianhao <tianhao42@huawei.com>
2025-11-25 17:39:06 +08:00
Concedo
724763fdec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	common/common.cpp
#	examples/batched/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/arch-fallback.h
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/server/CMakeLists.txt
2025-11-25 16:38:07 +08:00
Concedo
df30473716 deduplicate repeated statements in colab, minor refactgor 2025-11-25 16:11:18 +08:00
Pascal
b1846f1c8e
webui: add rehype plugin to restore HTML in Markdown table cells (#17477)
* webui: add rehype plugin to restore HTML in Markdown table cells

The remark/rehype pipeline neutralizes inline HTML as literal text
(remarkLiteralHtml) so that XML/HTML snippets in LLM responses display
as-is instead of being rendered. This causes <br> and <ul> markup in
table cells to show as plain text.

This plugin traverses the HAST post-conversion, parses whitelisted HTML
patterns (<br>, <ul><li>) from text nodes, and replaces them with actual
HAST element nodes. For lists, adjacent siblings must be combined first
as the AST fragmentation breaks pattern matching.

Strict validation rejects malformed markup, keeping it as raw text.

* chore: update webui build output
2025-11-25 08:01:02 +01:00
Jeff Bolz
d414db02d3
vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (#17455) 2025-11-25 07:11:27 +01:00
Aaron Teo
877566d512
llama: introduce support for model-embedded sampling parameters (#17120)
Some checks are pending
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / pyright type-check (push) Waiting to run
2025-11-25 09:56:07 +08:00
Jeff Bolz
3d07caa99b
vulkan: more FA details in vk_perf_logger (#17443) 2025-11-24 22:25:24 +01:00
Daniel Bevenius
134e6940ca
llama : skip output reordering for single token batches (#17466)
This commit adds a check to skip the output reordering logic when
n_outputs == 1. With a single output token, the data is trivially
sorted and the reordering code is currently doing unnecessary work
(resetting and rebuilding output_ids to the same values).

The motivation for this change is improved code clarity and avoiding
confusion when debugging. While the performance impact is probably
negligible, this unnecessary work happens on every decode call in
llama-server when processing batches with single-token outputs.
2025-11-24 21:06:17 +01:00
Jiacheng (Jason) Chen
0543f928a3
HIP: WMMA-MMQ kernels for RDNA 4 (#17156)
* first commit naive test to enable mmq for RDNA4

* adding appropriate WMMA instructions

* git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4

* clean up merge conflicts

* add comments and code clean up

* PR clean up, addressed comments

* enable MMQ fallback on RDNA4

* addressed comments: add guards in load generic, separate wmma branch for use_mmq function

* Revert build-xcframework.sh

* Formating: remove trailing whitespace

* revert CMake files

* clean up after rebase: remove duplicated change, revert cmake files

* clean up after rebase: revert changes from build-xcframework.sh

* clean up: remove extra space line in mma.cuh

* Revert "clean up: remove extra space line in mma.cuh"

This reverts commit b39ed57c4529906466bd0bc7c2a86e08fc2f8bee.
2025-11-24 20:00:10 +01:00
Concedo
bd0d6c2da5 add estopia to colab list, pending minor refactor 2025-11-25 00:18:11 +08:00
Sigbjørn Skjæret
b61de2b2df
convert : allow quantizing lora again (#17453) 2025-11-24 15:50:55 +01:00
henk717
e2232d9ad9
Make colab more user friendly (#1857) 2025-11-24 22:31:38 +08:00
Concedo
9a7f749f7c minor tweak for sd 2025-11-24 22:31:03 +08:00
Xuan-Son Nguyen
b8372eecd9
server: split server.cpp code into server/common/task/queue (#17362)
* add server-task, server-common

* add server-queue

* rm redundant includes

* move enum stop_type to server-task

* server : headers cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-24 14:41:53 +01:00
Daniel Bevenius
6ab8eacddf
examples : add -kvu to batched usage example [no ci] (#17469)
This commit adds the --kv-unified flag to the usage example
in the README.md file for the batched example.

The motivation for this is that without this flag the example will fail
with the following error:
```console
Hello my name is
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
2025-11-24 15:38:45 +02:00
Georgi Gerganov
2d50b9d8cb sync : ggml 2025-11-24 15:26:31 +02:00
Daniel Bevenius
697edfeead ggml : remove dirty flag from version string (ggml/1391)
This commit removes the "-dirty" suffix from the GGML version string.

The motivation for this change is to ensure that the version string
works with different ways of checking out ggml and using it in projects.
By removing the dirty flag from the version string, we avoid potential
artifacts like shared libraries getting a -dirty suffix in their names.

Instead, if the project is built from a dirty git state, the dirty flag
will be appended to the commit hash in the GGML_BUILD_COMMIT variable.
This will enable users to still identify that the build was made from
from a modified/dirty state even though the version might match a "real"
version.

For example, the commit can be produces as follows:
```c++
    printf("commit: %s\n", ggml_commit());
```
Which would print the following for a dirty build:
```console
commit: 781baf2a-dirty
```

Refs: https://github.com/ggml-org/ggml/pull/1363#issuecomment-3569691546
2025-11-24 15:26:31 +02:00
Alberto Cabrera Pérez
dbb852b549
ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (#16739)
* Enabled q4_K_8x8_q8_K path on ARM

* wip: I8mm qs multiplication, pending bias

* cpu : arm : REPACK gemm q4_K8x8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Guard gemm with proper features, improved superblock scale and min calc

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* cpu: arm: Implemented REPACK gemv for Q4_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed completed TODO

* Fixed missing guards when selecting optimal repack type for Q4_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed macro guard for gemv

* Fixed wrong comment in GEMV

* Fixed warning for unused variable

* vdotq_s32 -> ggml_vdotq_s32

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Clang-format issues

* Apply suggestions from code review

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Removed unnecessary GGML_UNUSED

* Fixed guards in q4_k gemm and gemv (repack)

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-11-24 13:08:11 +02:00
ixgbe
5f55c385cb
ggml: add RISC-V cpu-feats (#17461)
* ggml: add RISC-V cpu-feats

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* fix comment[1]

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-24 13:07:14 +02:00
william pan
4902eebe33
models : Added support for RND1 Diffusion Language Model (#17433)
* Converted RND1 model to GGUF weights

* RND1 llama.cpp support v1

* RND1 llama.cpp support v2 non causal bug

* RND1 llama.cpp support v3 doccumentation

* RND1 llama.cpp support v4 clean code

* linting issues

* RND1 pr fixes v1

* RND1 pr fixes v2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Diffusion documentation edits

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-24 14:16:56 +08:00
Max Krasnyansky
923ae3c619
hexagon: add support for ROPE_NEOX (#17458) 2025-11-23 18:55:56 -08:00
Raul Torres
01ad35e6d6
CANN: Define cann_graph_update_required before macro (#17434)
**Description of the problem**

`cann_graph_update_required` is redundantly defined and
initialized as `false` inside two mutually exclusive macro branches.

**Proposed solution**

Define it right before the macro so that it could serve both
branches.
2025-11-24 10:02:52 +08:00
M. Mediouni
fcb013847c
ggml-hexagon: Initial Hexagon v68/v69 support (#17394)
* ggml-hexagon: fix build error with GCC

Add stdexcept include to fix GCC build errors

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

* ggml-hexagon: check VTCM acquire failures

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

* ggml-hexagon: disable destination bypass on older than v73

v68 errors out if having bypass enabled when the VTCM is the destination.

At least on v68 this made things actually work... not a proper fix though, so to look at later...

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

* ggml-hexagon: add initial v68/v69 support

v68 is the Hexagon revision notably used on the Snapdragon 8cx
Gen 3 and the QCM6490.

Also add support for v69.

8MB isn't a supported page size, so relax asked for page size constraint
for HAP_compute_res_attr_set_vtcm_param_v2 to optimal.

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>

---------

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
2025-11-23 16:54:49 -08:00
Wagner Bruna
717d9c6401 sd: sync to master-377-2034588 2025-11-23 19:29:01 -03:00
Wagner Bruna
703bbf67d8 sd: sync to master-371-5498cc0 2025-11-23 19:29:01 -03:00
Wagner Bruna
8ef66e90c1 sd: sync to master-366-f532972 2025-11-23 19:29:01 -03:00
Wagner Bruna
3a7dd1a97f sd: sync to master-358-347710f
Also adapt Koboldcpp LoRA loading function, and add
backend support for lora_apply_mode.
2025-11-23 19:28:54 -03:00
Wagner Bruna
3318b73c94 sd: sync to master-355-694f0d9 2025-11-23 19:28:34 -03:00
nullname
d5bc1ad110
ggml-hexagon: add hex_supported_buffer for better buffer supported check (#17212)
* hexagon: add buffer support checks for hexagon sessions

* refactor: simplify buffer support checks in hexagon operations

* hexagon: update buffer support checks to use tensor structure

* refactor: streamline buffer initialization for DSP queue in hexagon operations

* refactor: simplify buffer initialization in DSP queue for hexagon operations

* refactor: optimize hex_supported_buffer function by fold expression

* wip

* refactor: simplify dspqueue_buffers_init function and its usage in hexagon operations

* fix: improve nan handling at hvx_vec_fast_sigmoid_fp32_guard

* refactor: optimize hvx_vec_inverse_fp32_guard for better nan handling

* refactor: update hvx_vec_fast_sigmoid_fp32_guard to use adjusted exponent limits

* refactor: modify hvx_vec_fast_sigmoid_fp32_guard to accept parameters for improved flexibility

* refactor: update hvx_vec_exp_fp32_guard to accept max_exp and inf parameters to save some instructions

* refactor: move hvx_vec_inverse_fp32_guard implementation to hvx-inverse.c for better perf
2025-11-23 14:26:36 -08:00
Pascal
0c7220db56
webui: minor settings reorganization and add disable autoscroll option (#17452)
* webui: added a dedicated 'Display' settings section that groups visualization options

* webui: added a Display setting to toggle automatic chat scrolling

* chore: update webui build output
2025-11-23 18:42:00 +01:00