Anton Mitkov
2bf9d539dd
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices ( #13973 )
2025-06-25 18:09:55 +02:00
Concedo
969ef81701
updated lite
2025-06-25 20:51:27 +08:00
Concedo
45f0cc7310
hack to fix compilation on avx2 intel macbook
2025-06-25 20:15:20 +08:00
Reithan
54dde5e565
Add memoized cache to llama_grammar_reject_candidates_for_stack ( #1615 )
...
* Add memoized cache to llama_grammar_reject_candidates_for_stack
* make size cutoff more aggressive and move to outer branch
* update comment
* add cache reset whenever grammar is reloaded
* remove explicit reference types for compiler transportability
2025-06-25 19:22:19 +08:00
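The commit above describes memoizing the rejected-candidate computation per grammar stack, with a size cutoff and a cache reset whenever the grammar is reloaded. A minimal sketch of that idea follows; the type and member names are hypothetical stand-ins, not the actual llama.cpp grammar structures.

```cpp
// Hypothetical sketch of the memoization idea: cache the rejection result per
// grammar stack, skip caching overly large stacks, and clear everything when
// the grammar is reloaded. Not the actual implementation.
#include <cstddef>
#include <map>
#include <vector>

using llama_grammar_element_ptr = const void *;           // stand-in for the real element type
using llama_grammar_stack       = std::vector<llama_grammar_element_ptr>;
using llama_grammar_candidates  = std::vector<int>;       // stand-in for candidate tokens

struct grammar_reject_cache {
    // keyed directly on the stack contents; a real implementation might hash instead
    std::map<llama_grammar_stack, llama_grammar_candidates> entries;
    size_t max_stack_size = 8; // size cutoff: only memoize small stacks

    const llama_grammar_candidates * find(const llama_grammar_stack & stack) const {
        auto it = entries.find(stack);
        return it == entries.end() ? nullptr : &it->second;
    }

    void store(const llama_grammar_stack & stack, llama_grammar_candidates rejects) {
        if (stack.size() <= max_stack_size) {
            entries.emplace(stack, std::move(rejects));
        }
    }

    void reset() { entries.clear(); } // called whenever the grammar is reloaded
};
```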
lhez
73e53dc834
opencl: ref count ggml_backend_opencl_context and refactor profiling ( #14254 )
...
* Move profiling info into `ggml_backend_opencl_context`
* Add `enqueue_ndrange_kernel` to launch kernel
2025-06-24 11:46:25 -07:00
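A rough sketch of the reference-counting pattern the commit above applies to the backend context, with profiling data kept inside the context; the names here are illustrative, not the actual ggml OpenCL backend API.

```cpp
// Illustrative sketch: the context is only torn down once the last user releases
// it, and per-kernel profiling records live inside the context itself.
#include <atomic>
#include <vector>

struct profiling_info {
    const char * kernel_name;
    double       elapsed_ms;
};

struct opencl_context_sketch {
    std::atomic<int>            ref_count{1};
    std::vector<profiling_info> profiling_events; // collected when profiling is enabled
};

static void opencl_context_retain(opencl_context_sketch * ctx) {
    ctx->ref_count.fetch_add(1, std::memory_order_relaxed);
}

static void opencl_context_release(opencl_context_sketch * ctx) {
    if (ctx->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete ctx; // last reference gone: release OpenCL resources here
    }
}
```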
Georgi Gerganov
62af464227
batch : fix check for empty sequences in memory ( #14364 )
...
* batch : fix check for empty sequences in memory
ggml-ci
* cont : reuse the var
ggml-ci
2025-06-24 18:26:30 +03:00
Concedo
b884a7f058
try switch back to size max for vulkan
2025-06-24 23:14:24 +08:00
Concedo
ace537d44e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# examples/simple-chat/simple-chat.cpp
# src/llama-quant.cpp
# tools/run/run.cpp
# tools/server/README.md
2025-06-24 23:06:16 +08:00
Mathieu Baudier
c148cf1946
cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION ( #14362 )
2025-06-24 15:05:31 +02:00
Nigel Bosch
1b809cee22
server : move no API key doc to /health ( #14352 )
2025-06-24 10:59:11 +02:00
Sigbjørn Skjæret
abf241045d
main : honor --verbose-prompt on interactive prompts ( #14350 )
2025-06-24 09:31:00 +02:00
Bartowski
901e20bbe5
jinja : Add Mistral-Small-3.2-24B-Instruct-2506.jinja ( #14349 )
...
This will allow the use of tools on the llama-server
2025-06-24 09:17:58 +03:00
uvos
0142961a2e
CUDA/HIP: optimize mmv paths taken for HIP devices ( #14324 )
...
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-06-24 01:12:56 +02:00
bandoti
ce82bd0117
ci: add workflow for relocatable cmake package ( #14346 )
2025-06-23 15:30:51 -03:00
Jeff Bolz
bf2a99e3cb
vulkan: update windows SDK in release.yml ( #14344 )
2025-06-23 15:44:48 +02:00
Molly Sophia
72c6bc3f3d
llama : better rwkv chat template and add missing inputs.use_jinja setting ( #14336 )
...
* llama-cli : add missing `inputs.use_jinja` setting
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama : better legacy chat template for rwkv
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-06-23 19:56:19 +08:00
Johannes Gäßler
defe2158dd
CUDA: mul_mat_v support for batch sizes > 1 ( #14262 )
...
* CUDA: mul_mat_v support for batch sizes > 1
* use 64 bit math for initial offset calculation
2025-06-23 13:11:31 +02:00
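The second bullet above refers to computing the initial offset in 64-bit math. A small sketch of the pattern, assuming illustrative parameter names: casting before multiplying keeps the intermediate products in 64-bit, since with batch sizes > 1 the combined row and batch offsets can exceed INT32_MAX even when each dimension fits in an int.

```cpp
// Sketch (not the actual kernel code) of the 64-bit offset pattern.
#include <cstdint>

static inline int64_t initial_offset(int row, int ncols, int batch_idx, int64_t batch_stride) {
    // (int64_t) row * ncols promotes the whole expression to 64-bit math;
    // writing row * ncols first would multiply in 32-bit and could overflow.
    return (int64_t) row * ncols + (int64_t) batch_idx * batch_stride;
}
```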
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Concedo
8ce56bd547
tts more silence at the end
2025-06-23 17:15:35 +08:00
Jeff Bolz
3a9457df96
vulkan: update windows SDK in CI ( #14334 )
2025-06-23 10:19:24 +02:00
Ed Addario
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) ( #13037 )
2025-06-22 23:16:26 +02:00
Sigbjørn Skjæret
238005c2dc
gguf-py : fix SpecialVocab parsing when post_processor is null ( #14330 )
2025-06-22 19:46:17 +02:00
Ruikai Peng
66aba7aca9
run : avoid double tokenization ( #14327 )
...
* run : avoid double tokenization by adopting common_tokenize heuristic
* build : fix windows gcc and clang warnings
* lint : fixed trailing whitespace
* run : fix is_first flag
2025-06-23 01:28:06 +08:00
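A hedged sketch of the heuristic referenced above: tokenize the turn's text exactly once, and add BOS/special tokens only on the first turn, i.e. when the KV cache is still empty. The helper names below are hypothetical stand-ins for the real tokenizer and cache queries, not llama.cpp's actual API.

```cpp
#include <string>
#include <vector>

using token = int;

// Stub stand-ins (hypothetical) for the real tokenizer and KV-cache query.
static std::vector<token> tokenize_text(const std::string & text, bool add_special, bool /*parse_special*/) {
    std::vector<token> out;
    if (add_special) out.push_back(1);            // pretend token 1 is BOS
    for (char c : text) out.push_back((token) c); // pretend one token per byte
    return out;
}
static int kv_cache_used_cells() { return 0; }    // empty cache -> first turn

static std::vector<token> tokenize_turn(const std::string & text) {
    const bool is_first = kv_cache_used_cells() == 0; // only add BOS on the first turn
    // single tokenization call; the fix avoids tokenizing the same text twice
    return tokenize_text(text, /*add_special=*/is_first, /*parse_special=*/true);
}
```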
Georgi Gerganov
f1f5e82df6
examples : fix is_first logic for tokenization ( #14329 )
...
ggml-ci
2025-06-22 20:10:07 +03:00
Concedo
fcb658453e
remove duplicate bundling for oldpc versions
2025-06-22 23:35:22 +08:00
Concedo
2d822d3059
fixed a typo
2025-06-22 23:28:29 +08:00
Concedo
fb13e3e51b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# src/llama-context.cpp
# tests/test-backend-ops.cpp
2025-06-22 23:26:15 +08:00
Concedo
abc1d8ac25
better way of checking for avx2 support
2025-06-22 22:56:50 +08:00
uvos
af3373f1ad
HIP: enable vec fattn on RDNA4 ( #14323 )
2025-06-22 16:51:23 +02:00
yuiseki
5d5c066de8
mtmd : fix Pixtral OOM with large images by capping image_size to 1024 ( #14326 )
...
Mistral Small 2506 models using Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels due to
exponential memory growth from unlimited image size.
This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.
2025-06-22 14:44:57 +02:00
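One plausible way to apply the 1024x1024 cap described above is to rescale any larger input so its longest side is at most 1024 pixels while preserving the aspect ratio. The sketch below is only an illustration of that idea, not the actual mtmd code.

```cpp
// Illustrative sketch: clamp the longest image side to 1024 pixels before the
// image reaches the vision encoder, keeping the aspect ratio intact.
#include <algorithm>

struct image_size { int width; int height; };

static image_size cap_image_size(image_size in, int max_side = 1024) {
    const int longest = std::max(in.width, in.height);
    if (longest <= max_side) {
        return in; // already within the limit
    }
    const double scale = (double) max_side / longest;
    return {
        std::max(1, (int) (in.width  * scale)),
        std::max(1, (int) (in.height * scale)),
    };
}
```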
Concedo
52dcfe42d6
try auto selecting correct backend while checking intrinsics
2025-06-22 18:16:02 +08:00
Sigbjørn Skjæret
40bfa04c95
common : use std::string_view now that we target c++17 ( #14319 )
2025-06-22 08:37:43 +03:00
Aman Gupta
aa064b2eb7
CUDA: add mean operation ( #14313 )
...
* CUDA: add mean operation
* add back sum_rows_f32_cuda
* Review: early exit if col!=0
2025-06-22 12:39:54 +08:00
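Since the commit above adds a mean operation alongside the restored sum_rows kernel, a CPU reference of the relationship may help: the per-row mean is simply the row sum divided by the number of columns. This is a plain C++ sketch with illustrative names, not the CUDA kernel itself.

```cpp
#include <cstdint>
#include <vector>

// Reference row-sum, mirroring what a sum-rows kernel computes.
static void sum_rows_f32(const float * src, float * dst, int64_t ncols, int64_t nrows) {
    for (int64_t r = 0; r < nrows; ++r) {
        float acc = 0.0f;
        for (int64_t c = 0; c < ncols; ++c) {
            acc += src[r*ncols + c];
        }
        dst[r] = acc;
    }
}

// Mean is built on top of the row sum: mean = row sum / ncols.
static void mean_rows_f32(const float * src, float * dst, int64_t ncols, int64_t nrows) {
    sum_rows_f32(src, dst, ncols, nrows);
    for (int64_t r = 0; r < nrows; ++r) {
        dst[r] /= (float) ncols;
    }
}
```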
Sigbjørn Skjæret
aa0ef5c578
gguf-py : fix Qwen3-Embedding eos token ( #14314 )
2025-06-21 18:12:05 +02:00
Concedo
72d467c6d5
vision is now working in ollama owui
2025-06-21 23:43:43 +08:00
Concedo
6039791adf
minor bugfixes
2025-06-21 18:41:28 +08:00
Concedo
45f589b78d
test gfx1200 again
2025-06-21 17:56:04 +08:00
Markus Tavenrath
bb16041cae
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. ( #13792 )
...
* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
* remove #ifdef for debug utils and add queue marker.
2025-06-21 08:17:12 +02:00
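For reference, labeling a Vulkan object through VK_EXT_debug_utils generally looks like the sketch below: load the extension entry point at runtime and attach a readable name to the pipeline so it shows up in debuggers such as RenderDoc. This is a generic illustration, not the code added in the commit above; error handling is omitted.

```cpp
#include <vulkan/vulkan.h>

static void label_pipeline(VkDevice device, VkPipeline pipeline, const char * name) {
    // VK_EXT_debug_utils functions must be loaded at runtime.
    auto pfnSetName = (PFN_vkSetDebugUtilsObjectNameEXT)
        vkGetDeviceProcAddr(device, "vkSetDebugUtilsObjectNameEXT");
    if (!pfnSetName) {
        return; // extension not available
    }
    VkDebugUtilsObjectNameInfoEXT info = {};
    info.sType        = VK_STRUCTURE_TYPE_DEBUG_UTILS_OBJECT_NAME_INFO_EXT;
    info.objectType   = VK_OBJECT_TYPE_PIPELINE;
    info.objectHandle = (uint64_t) pipeline;
    info.pObjectName  = name; // e.g. the op the compute pipeline implements
    pfnSetName(device, &info);
}
```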
Sigbjørn Skjæret
58cba76a9a
gguf-py : fix TemplateProcessing pair when bos/eos is missing ( #14312 )
2025-06-21 07:33:21 +02:00
Georgi Gerganov
67ae5312e2
metal : fix thread-safety ( #14300 )
...
ggml-ci
2025-06-21 08:04:18 +03:00
Georgi Gerganov
692e3cdd0a
memory : rename interface to llama_memory_context_i ( #14296 )
...
* memory : rename interface to llama_memory_context_i
ggml-ci
* cont : fix comments
* cont : use "mctx" for referencing a memory context
ggml-ci
2025-06-21 08:03:46 +03:00
Daniel Han
b23fa0b3f4
convert : fix Llama 4 conversion ( #14311 )
2025-06-21 06:32:01 +02:00
Concedo
65ff041827
added more perf stats
2025-06-21 12:12:28 +08:00
Concedo
ea21a9d749
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/build.md
# scripts/sync-ggml.last
2025-06-21 10:40:07 +08:00
Wagner Bruna
08adfb53c9
Configurable VAE threshold limit ( #1601 )
...
* add backend support for changing the VAE tiling threshold
* trigger VAE tiling by image area instead of dimensions
I've tested all resolutions with the same 768x768 area (even extremes
like 64x9216), and many below that, using GGML_VULKAN_MEMORY_DEBUG:
all consistently allocate 6656 bytes per image pixel.
As tiling is primarily useful to avoid excessive memory usage, it
seems reasonable to enable VAE tiling based on area rather than
maximum image side.
However, as there is currently no user interface option to change
it back to a lower value, it's best to maintain the default
behavior for now.
* replace the notile option with a configurable threshold
This allows selecting a lower threshold value, reducing the
peak memory usage.
The legacy sdnotile parameter gets automatically converted to
the new parameter, if it's the only one supplied.
* simplify tiling checks, 768 default visible in launcher
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2025-06-21 10:14:57 +08:00
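The area-based check described in the commit above reduces, in essence, to comparing the image area against threshold*threshold pixels (768 by default) rather than comparing each side against a fixed dimension. A minimal sketch, with illustrative names:

```cpp
// Tiling kicks in once the area exceeds threshold*threshold pixels, so 64x9216
// and 768x768 behave the same way, whereas a per-side check would treat them
// differently.
static bool vae_should_tile(int width, int height, int threshold = 768) {
    return (long long) width * height > (long long) threshold * threshold;
}
```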
Concedo
caea52407a
fix photomaker crash
2025-06-21 10:11:39 +08:00
Concedo
684d71e058
add old convert tool
2025-06-21 08:40:04 +08:00
Georgi Gerganov
06cbedfca1
sync : ggml
...
ggml-ci
2025-06-20 21:02:47 +03:00
Acly
b7147673f2
Add ggml_roll
(ggml/1274)
...
* ggml : add ggml_roll
* use set/get_op_params & std::min
2025-06-20 21:02:47 +03:00
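As a reference for the ggml_roll addition above, a roll along one dimension shifts every element by a given offset with wraparound, similar to numpy.roll. The 1-D sketch below only illustrates the semantics and is not the ggml implementation.

```cpp
#include <vector>

static std::vector<float> roll_1d(const std::vector<float> & src, int shift) {
    const int n = (int) src.size();
    std::vector<float> dst(n);
    if (n == 0) return dst;
    for (int i = 0; i < n; ++i) {
        int j = (i + shift) % n;
        if (j < 0) j += n;     // handle negative shifts
        dst[j] = src[i];       // element i moves to position i + shift (mod n)
    }
    return dst;
}
```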
David Chiu
d860dd99a4
docs : fix the link to llama.h ( #14293 )
2025-06-20 19:43:35 +02:00