Commit graph

7661 commits

Author SHA1 Message Date
Concedo
6494dce405 handle estimation for multipart gguf (+1 squashed commits)
Squashed commits:

[c7b4af92] handle estimation for multipart gguf
2025-04-21 22:07:22 +08:00
Concedo
9cd6a1add2 allow mmproj to be run on cpu 2025-04-21 21:03:10 +08:00
Concedo
f968079290 randomize image names to prevent caching in noscript 2025-04-21 13:24:40 +08:00
Concedo
687bb5375c Merge branch 'upstream' into concedo_experimental 2025-04-20 20:57:11 +08:00
Concedo
2ed6850c0b added override tensor 2025-04-20 20:56:17 +08:00
Jeffrey Morgan
6602304814
llava: fix errors in clip.h on certain compilers (#13030) 2025-04-20 12:15:41 +02:00
Concedo
17360a3b32 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	examples/llava/clip.cpp
2025-04-20 17:59:58 +08:00
Concedo
636b92ec1d updated lite 2025-04-20 17:52:55 +08:00
Jeff Bolz
66168204be
vulkan: support noncontiguous rms_norm (#13031) 2025-04-20 10:50:02 +02:00
Jeffrey Morgan
4ba9d711ba
metal: add neg operator (#13029) 2025-04-20 08:28:40 +03:00
bandoti
00137157fc
Disable CI cross-compile builds (#13022) 2025-04-19 18:05:03 +02:00
Concedo
75dfad2bb0 fixed noscript (+1 squashed commits)
Squashed commits:

[dba28399] fixed noscript
2025-04-19 23:16:08 +08:00
Sigbjørn Skjæret
fb28f4f80e
gguf-py : fix upload python package workflow (#13020) 2025-04-19 16:26:38 +02:00
Concedo
12c2efdadd noscript image gen 2025-04-19 18:56:52 +08:00
Xuan-Son Nguyen
37b9f0d29d
clip : refactor, add image_manipulation and llava_uhd classes (#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
2025-04-19 09:15:45 +02:00
Concedo
95d1aaf4d4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/rpc/rpc-server.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-all.txt
2025-04-19 13:17:13 +08:00
Concedo
305e533dc6 i already knew zenity would cause issues 2025-04-19 13:04:41 +08:00
Concedo
78a910be26 noscript chat mode tweaks 2025-04-19 12:40:13 +08:00
Daniel Tang
6408210082
main : Fix Ctrl+D/newline handling (#12951)
This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).

This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes #12949
2025-04-18 22:02:55 +02:00
Chris Thompson
aff9d107b0
gguf-py : GGUF Editor GUI - Python + Qt6 (#12930) 2025-04-18 20:30:41 +02:00
Xuan-Son Nguyen
35370ba945
server : use std::move whenever possible (#12936)
* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-18 19:58:12 +02:00
Concedo
a5b5d21cca added chat mode to noscript 2025-04-19 00:59:00 +08:00
Concedo
4b0f63ed62 cleanup 2025-04-18 22:57:10 +08:00
Concedo
29b57d2175 updated vulkan to make use of cm2 2025-04-18 22:10:57 +08:00
Akarshan Biswas
8d66005763
SYCL: Refactor and enable FP16 in binary broadcast OPs (#12975)
* SYCL: refactor move to a separate file

* Fix binbcast

* Remove duplicates

* fix include formatting

* fix typo
2025-04-18 15:57:56 +02:00
Xuan-Son Nguyen
b9154ecff9
mtmd : add methods to access mtmd_image_tokens (#12906)
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* fix prompt_modified

* rm redundant data member
2025-04-18 10:04:51 +02:00
Radoslav Gerganov
2db9ba1464
rpc : add RPC_CMD_HELLO (#12955)
Add RPC_CMD_HELLO for getting the version of the protocol implemend by
the server. Follow the semantic versioning rules at https://semver.org

Hopefully this bring better user experience when we make breaking
changes at the protocol level and avoid issues like #12465
2025-04-18 10:13:42 +03:00
Concedo
40adb8af35 update lite, remove aetherroom dead site 2025-04-18 13:23:14 +08:00
Concedo
5d57d62665 add a timeout for zenity check 2025-04-18 13:07:26 +08:00
Concedo
bce519cee7 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	tests/test-backend-ops.cpp
2025-04-18 12:44:20 +08:00
Concedo
1a09d9cf0e increase to 10 save slots 2025-04-18 11:30:32 +08:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Concedo
64dd4f932f update readme 2025-04-17 23:06:07 +08:00
Alan Gray
207c22ec2d
ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970) 2025-04-17 15:19:42 +02:00
hipudding
7a395f67a7
CANN: Add support for async operator submission (#12864)
Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
2025-04-17 20:34:16 +08:00
Mikko Juola
971f245b3b
llama : recognize IBM Granite 3.3 FIM tokens (#12988)
The Granite's FIM tokens are very similar to Qwen's; it's just that
they use underscore instead of a dash. So <fim_middle> for example
instead of <fim-middle>.

Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```
2025-04-17 11:37:05 +03:00
Concedo
c67510718e kv override option (+1 squashed commits)
Squashed commits:

[e615fc01] kv override option
2025-04-17 14:22:30 +08:00
Concedo
cbdee99354 revert sdcpp clip quant changes 2025-04-17 11:11:34 +08:00
kimminsu
12b17501e6
opencl: fix incorrect local_size index in profiling log (#12868) 2025-04-16 14:25:57 -07:00
Jeff Bolz
015022bb53
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
2025-04-16 20:37:25 +02:00
Concedo
b2719268df merge sdcpp fixes 2025-04-17 00:52:49 +08:00
Concedo
06159939d9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	docs/build.md
#	examples/rpc/rpc-server.cpp
#	examples/sycl/build.sh
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	scripts/sync-ggml.last
2025-04-17 00:52:37 +08:00
Concedo
a5c39143a1 enabled coopmat2 for the price of 7mb 2025-04-17 00:21:03 +08:00
Chenguang Li
b43d89e311
CANN: Add 310P operator support check (#12962) 2025-04-16 16:21:05 +08:00
lhez
80f19b4186
opencl: split ggml-opencl.cl into multiple files and cleanup (#12886)
* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>

* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
2025-04-15 12:26:00 -07:00
Concedo
fbf039966c debugmode has debug in cli 2025-04-15 23:42:46 +08:00
Concedo
c168b063e5 cli fix 2025-04-15 23:30:18 +08:00
Georgi Gerganov
f8f820cc4d
metal : add FA-vec kernels for head size 96 (#12952)
ggml-ci
2025-04-15 14:45:05 +03:00
hipudding
54a7272043
CANN: Add x86 build ci (#12950)
* CANN: Add x86 build ci

* CANN: fix code format
2025-04-15 12:08:55 +01:00
David Huang
84778e9770
CUDA/HIP: Share the same unified memory allocation logic. (#12934)
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
2025-04-15 11:20:38 +02:00