Concedo
db2e5e43d9
allow whisper interrogate mode for audio files
2025-07-19 16:51:58 +08:00
Concedo
490b13af83
whitespace
2025-07-19 15:08:55 +08:00
Concedo
b0b7a07b34
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# examples/parallel/parallel.cpp
2025-07-18 23:49:45 +08:00
Georgi Gerganov
2adf8d83ac
parallel : add option for different RNG seeds ( #14757 )
ggml-ci
2025-07-18 17:33:41 +03:00
Oliver Simons
021cc28bef
cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs ( #14741 )
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs
Gemma3n uses matrix-matrix addition as part of its input processing, wrongly
triggering CUDA_GRAPH disablement on NVGPUs even when a batch size of 1 is
used.
* Exclude `project_per_layer_input` by matching node names
This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.
* Revert unnecessary formatting changes
2025-07-18 04:35:32 -07:00
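A minimal sketch of the exclusion described in the entry above, assuming ggml's public tensor fields (op, ne, name); the helper name and the exact shape test are illustrative, not the backend's actual eligibility pass:

    #include <cstring>
    #include "ggml.h"

    // Hypothetical helper: should this node disable CUDA graph capture?
    static bool node_disables_cuda_graph(const struct ggml_tensor * node) {
        // a matrix-matrix ADD normally implies batched work (assumed test)
        if (node->op != GGML_OP_ADD || node->ne[1] <= 1) {
            return false;
        }
        // Gemma3n's per-layer input projection matches this shape even at
        // batch size 1, so exclude it by node name instead of disabling graphs
        if (strstr(node->name, "project_per_layer_input") != nullptr) {
            return false;
        }
        return true;
    }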
Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models ( #14753 )
* graph : avoid huge warm-up graphs for MoE models
ggml-ci
* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
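A hypothetical sketch of the node-budget rule from the second bullet above; only the 8x factor comes from the commit, the 1024 floor and helper name are assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // Size the graph's node budget relative to the model instead of using
    // one huge fixed constant (floor value is illustrative).
    static uint32_t graph_max_nodes(size_t n_model_tensors) {
        return std::max<uint32_t>(1024u, 8u * (uint32_t) n_model_tensors);
    }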
Georgi Gerganov
eacdeb5bfc
model : fix build after merge conflict ( #14754 )
2025-07-18 11:53:55 +03:00
lgai-exaone
e0cb5c5cb8
model : add EXAONE 4.0 support ( #14630 )
2025-07-18 10:45:49 +02:00
Aman Gupta
f9a31eea06
CUDA: set_rows + cpy.cu refactor ( #14712 )
2025-07-18 14:54:18 +08:00
Concedo
b8e3280432
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .devops/nix/package.nix
# ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly ( #14629 )
ggml-ci
2025-07-18 08:29:28 +03:00
kallewoof
226624639c
AutoGuess: Move Generic cases to end of file and put Kimi with other ChatML variants ( #1648 )
* AutoGuess: Move Generic cases to end of file and put Kimi with other ChatML variants
* patch Kimi ChatML template
2025-07-18 13:24:21 +08:00
Concedo
b028dd4e84
minor fixes
2025-07-18 13:22:59 +08:00
Nexes the Elder
09651d09ff
graph : Pass the graph placeholder message in debug mode ( #14748 )
Without that condition, this debug log clutters the screen for every batch processed during prompt processing, and for every token generated in Kobold.cpp.
2025-07-18 07:25:54 +03:00
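A minimal sketch of the gating this commit describes, assuming an environment-variable debug level; both names are illustrative:

    #include <cstdio>
    #include <cstdlib>

    // Read the debug level once; print the placeholder message only on request.
    static int graph_debug_level() {
        const char * s = getenv("LLAMA_GRAPH_DEBUG");   // illustrative toggle
        return s != nullptr ? atoi(s) : 0;
    }

    // if (graph_debug_level() > 0) { fprintf(stderr, "graph is a placeholder\n"); }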
Neo Zhang Jianyu
349ea79fce
use the device's max work group size to replace the magic number ( #14732 )
2025-07-18 10:23:14 +08:00
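A minimal sketch of the device query that replaces the magic number, using the standard SYCL 2020 info descriptor:

    #include <sycl/sycl.hpp>

    // Ask the device for its actual limit instead of hard-coding one.
    static size_t device_max_work_group_size(const sycl::queue & q) {
        return q.get_device().get_info<sycl::info::device::max_work_group_size>();
    }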
Piotr Wilkin (ilintar)
670e1360cd
convert : fix Ernie4.5 MoE without shared experts ( #14746 )
2025-07-18 01:17:16 +02:00
Wroclaw
760b4484e3
nix : use optionalAttrs for env mkDerivation attrset argument ( #14726 )
2025-07-17 15:18:16 -07:00
Piotr Wilkin (ilintar)
cb887f1bc1
model: add Ernie 4.5 MoE support ( #14658 )
* Add Ernie4.5 MoE
* Fix Flake errors.
* Properly encode/decode MoE layer step
* Correct tensor mappings (.weight)
* Pass and read n_ff_exp
* n_ff_shexp calculation and further minor changes
* Rope fixes.
* .gitignore fix
* Add uint32 cast for Linux builds
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Further fixes from code review
* Fix trailing whitespace
* Reenable missing experts error
* Code style from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix non-MoE regression
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-17 23:15:32 +02:00
Georgi Gerganov
d6fb3f6b49
kv-cache : fix k-shift for multiple streams ( #14742 )
ggml-ci
2025-07-17 20:52:33 +03:00
Concedo
1ca666f9c1
allow handling multipart files up to 999
2025-07-18 01:18:28 +08:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs ( #14482 )
* llama : reuse compute graphs
ggml-ci
* llama-bench : add graph reuse parameter
ggml-ci
* cont : remove the parameter and the sched resets
ggml-ci
* graph : rename update() to can_reuse()
ggml-ci
* params : remove is_same()
ggml-ci
* graph : set res->params in llm_graph_context constructor
ggml-ci
* graph : avoid set_max_nodes in llm_graph_result
ggml-ci
* kv-cache : reuse llama_context's graph result instance
ggml-ci
* context : reset the previous graph result upon memory updates
ggml-ci
* batch : llama_ubatch now carries its data instead of pointing to balloc
ggml-ci
* merge : fix build
ggml-ci
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
ggml-ci
2025-07-17 19:08:33 +03:00
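A hypothetical sketch of the reuse flow these bullets describe: rebuild the graph only when the cached result's parameters no longer match. All names and the fields compared are illustrative stand-ins, not llama.cpp's actual types:

    // illustrative stand-ins for llm_graph_params / llm_graph_result
    struct graph_params {
        int  n_tokens;
        bool flash_attn;
    };

    struct graph_result {
        graph_params params;   // set in the constructor, per one bullet above
        // renamed from update(): answers "can this result serve the new call?"
        bool can_reuse(const graph_params & p) const {
            return params.n_tokens == p.n_tokens && params.flash_attn == p.flash_attn;
        }
    };

    // usage: if (prev && prev->can_reuse(p)) { /* skip rebuild, refresh inputs */ }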
Concedo
8cf812eddd
updated lite
2025-07-17 20:11:17 +08:00
Concedo
f57018f722
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/build-linux-cross.yml
2025-07-17 18:23:26 +08:00
Concedo
afca31bfbe
handle clean_env for remotetunnel
2025-07-17 18:21:22 +08:00
Tarek Dakhran
086cf81e88
llama : fix parallel processing for lfm2 ( #14705 )
2025-07-17 09:22:11 +02:00
Georgi Gerganov
d9b691081c
kv-cache : opt mask set input ( #14600 )
ggml-ci
2025-07-17 09:49:15 +03:00
Georgi Gerganov
ad57d3edd2
batch : fix uninitialized has_cpl flag ( #14733 )
ggml-ci
2025-07-17 09:45:54 +03:00
Concedo
a417cd87c7
updated lite
2025-07-17 12:09:06 +08:00
Concedo
d4a394ff73
label attached media with ids
2025-07-17 10:04:46 +08:00
Sigbjørn Skjæret
1ba45d4982
ci : disable failing vulkan crossbuilds ( #14723 )
2025-07-16 20:52:08 -03:00
Sigbjørn Skjæret
19e5943d9e
convert : make hf token optional ( #14717 )
* make hf token optional
* fail if we can't get necessary tokenizer config
2025-07-16 23:17:43 +02:00
Diner Burger
496957e1cb
llama : fix parameter order for hybrid memory initialization ( #14725 )
2025-07-16 21:17:25 +02:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Concedo
f0564f9caf
updated lite, added better separators for multimodal chunks (universal)
2025-07-17 00:11:08 +08:00
Reese Levine
21c021745d
ggml: Add initial WebGPU backend ( #14521 )
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults
* Initialize webgpu device
* Making progress on setting up the backend
* Finish more boilerplate/utility functions
* Organize file and work on alloc buffer
* Add webgpu_context to prepare for actually running some shaders
* Work on memset and add shader loading
* Work on memset polyfill
* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support them
* Implement get_tensor and buffer_clear
* Finish rest of setup
* Start work on compute graph
* Basic mat mul working
* Work on emscripten build
* Basic WebGPU backend instructions
* Use EMSCRIPTEN flag
* Work on passing ci, implement 4d tensor multiplication
* Pass thread safety test
* Implement permuting for mul_mat and cpy
* minor cleanups
* Address feedback
* Remove division by type size in cpy op
* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends
* Fix name
* Fix macos dawn prefix path
2025-07-16 18:18:51 +03:00
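A minimal sketch of the set_tensor mapping noted in the bullets above, assuming the C webgpu.h header shipped with Dawn; the wrapper name is hypothetical:

    #include <webgpu/webgpu.h>
    #include <cstddef>
    #include <cstdint>

    // set_tensor becomes a plain queue write into the device-side buffer.
    static void backend_set_tensor(WGPUQueue queue, WGPUBuffer buf,
                                   uint64_t offset, const void * data, size_t size) {
        wgpuQueueWriteBuffer(queue, buf, offset, data, size);
    }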
tempstudio
b0f0ecc3dc
model : support output bias for qwen2 ( #14711 )
Co-authored-by: qwaqrm <qwaqrm@126.com>
2025-07-16 18:02:06 +03:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
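A minimal sketch of the "avoid throwing exceptions across the C boundary" bullet above; the entry-point name is hypothetical:

    #include <cstdint>
    #include <cstdio>
    #include <exception>

    // C ABI entry point: catch internally and return an error code rather
    // than letting a C++ exception unwind into C (or Python) callers.
    extern "C" int32_t llama_example_call(void) {
        try {
            // ... internal C++ implementation that may throw ...
            return 0;
        } catch (const std::exception & e) {
            fprintf(stderr, "error: %s\n", e.what());
            return -1;
        }
    }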
Aman Gupta
ab14019821
Support diffusion models: Add Dream 7B ( #14644 )
* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
2025-07-16 20:03:51 +08:00
Georgi Gerganov
64978340b0
ggml : add asserts ( #14720 )
* ggml : add asserts
ggml-ci
* cont : fix constant type
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-16 14:43:32 +03:00
Georgi Gerganov
6ffd4e9c44
server : pre-calculate EOG logit biases ( #14721 )
ggml-ci
2025-07-16 14:04:12 +03:00
Shunta Saito
e4841d24d3
llama : fix parallel processing for plamo2 ( #14716 )
2025-07-16 12:12:22 +02:00
Georgi Gerganov
538cc77f7f
server : fix handling of the ignore_eos flag ( #14710 )
ggml-ci
2025-07-16 12:13:57 +03:00
Concedo
2a59adce0f
stay on macos 14
2025-07-16 15:47:33 +08:00
Johannes Gäßler
5cae766541
scripts: synthetic prompt mode for server-bench.py ( #14695 )
2025-07-16 09:33:28 +02:00
Sigbjørn Skjæret
4b91d6f71f
convert : only check for tokenizer folder if we need it ( #14704 )
2025-07-16 08:52:04 +02:00
Sigbjørn Skjæret
cf91f217f1
convert : add pre-computed hashes first to prevent order mishaps ( #14701 )
2025-07-16 08:51:12 +02:00
Concedo
cbe9fc87c5
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# src/llama-vocab.cpp
2025-07-16 12:03:54 +08:00
Min-Hua
79e0b68c17
llama: add LLAMA_API to deprecated llama_kv_self_seq_div ( #14708 )
Add LLAMA_API to fix the run-time error with llama-cpp-python in a Windows env:
AttributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?
Although llama_kv_self_seq_div() has been marked deprecated, it is still
necessary to export it to keep llama-cpp-python happy.
Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
2025-07-16 07:00:42 +03:00
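A simplified sketch of the fix above, mirroring llama.h's LLAMA_API/DEPRECATED conventions; the macros below are reduced stand-ins (the real ones also handle dllimport, MSVC's deprecation syntax, and static builds), and the hint string is a placeholder:

    #include <cstdint>

    #if defined(_WIN32)
    #    define LLAMA_API __declspec(dllexport)   // when building the DLL
    #else
    #    define LLAMA_API __attribute__((visibility("default")))
    #endif
    #define DEPRECATED(func, hint) func           // real macro adds an attribute

    struct llama_context;                         // opaque, as in llama.h
    typedef int32_t llama_pos;
    typedef int32_t llama_seq_id;

    // Deprecated, but still annotated LLAMA_API so the symbol stays in the
    // DLL export table and run-time bindings (llama-cpp-python via ctypes)
    // can resolve it:
    LLAMA_API DEPRECATED(void llama_kv_self_seq_div(
                             struct llama_context * ctx,
                             llama_seq_id seq_id,
                             llama_pos p0,
                             llama_pos p1,
                             int d),
                         "deprecation hint");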
Ed Addario
c81f4192f9
gguf-py : dump bpw per layer and model in markdown mode ( #14703 )
2025-07-16 00:04:42 +02:00
Gabriel Larson
4a4f426944
model : add Kimi-K2 support ( #14654 )
* Kimi-K2 conversion
* add Kimi_K2 pre type
* Kimi-K2
* Kimi-K2 unicode
* Kimi-K2
* LLAMA_MAX_EXPERTS 384
* fix vocab iteration
* regex space fix
* add kimi-k2 to pre_computed_hashes
* Updated with kimi-k2 get_vocab_base_pre hash
* fix whitespaces
* fix flake errors
* remove more unicode.cpp whitespaces
* change set_vocab() flow
* add moonshotai-Kimi-K2.jinja to /models/templates/
* update moonshotai-Kimi-K2.jinja
* add kimi-k2 chat template
* add kimi-k2
* update NotImplementedError
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* except Exception
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-15 21:54:22 +02:00