Concedo
a11ab0b08e
reverse clip skip fix as it might be breaking some sdxl models
2025-05-30 10:40:03 +08:00
Concedo
2a309c144d
updated lite
2025-05-30 00:29:46 +08:00
Concedo
c881bb7348
match a few common oai voices
2025-05-29 23:29:17 +08:00
Concedo
e14aec58bc
embeds no offload qkv
2025-05-29 00:28:02 +08:00
Concedo
fcc1b43c06
embeddings change to encode
2025-05-28 23:24:33 +08:00
Concedo
26bf5b446d
fixed thread count <= 0, fixed clip skip <= 0
2025-05-28 00:38:15 +08:00
Concedo
8c701d7ded
Merge commit '72b090da2c' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/function-calling.md
# examples/embedding/embedding.cpp
# examples/retrieval/retrieval.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/Doxyfile
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/binbcast.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/concat.cpp
# ggml/src/ggml-sycl/conv.cpp
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/gla.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-sycl/tsembd.cpp
# ggml/src/ggml-sycl/wkv.cpp
# scripts/compare-commits.sh
# tests/test-chat.cpp
# tests/test-sampling.cpp
2025-05-28 00:28:41 +08:00
Concedo
868cb6aff7
Merge commit 'e121edc432' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# common/CMakeLists.txt
# docs/function-calling.md
# ggml/src/ggml-sycl/binbcast.cpp
# models/templates/README.md
# scripts/tool_bench.py
# src/llama-kv-cache.cpp
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/mtmd/clip.h
# tools/rpc/rpc-server.cpp
# tools/server/README.md
2025-05-28 00:20:45 +08:00
bandoti
72b090da2c
docs: remove link for llama-cli function calling (#13810)
2025-05-27 08:52:40 -03:00
Christian Kastner
7fe03e7446
ggml-cpu: x86 feature detection is specific to x86 (#13811)
2025-05-27 13:18:39 +02:00
Diego Devesa
952f3953c1
ggml : allow CUDA graphs when using pipeline parallelism (#13814)
2025-05-27 13:05:18 +02:00
Georgi Gerganov
81713121ee
kv-cells : track min/max used cells and per-sequence positions (#13808)
...
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
2025-05-27 13:49:41 +03:00
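The per-sequence min/max tracking described in the bullets above can be pictured with a small bookkeeping structure. A minimal sketch, assuming a hypothetical seq_pos_range type (illustrative only, not the actual llama.cpp kv-cells code):

    #include <cstdint>
    #include <limits>
    #include <map>

    // Hypothetical per-sequence position bookkeeping in the spirit of the
    // kv-cells commit above; the real llama.cpp structures differ.
    struct seq_pos_range {
        int64_t min = std::numeric_limits<int64_t>::max();
        int64_t max = std::numeric_limits<int64_t>::min();
    };

    struct kv_cells_stats {
        std::map<int32_t, seq_pos_range> per_seq; // seq_id -> observed position range

        // record that a used cell holds position `pos` for sequence `seq_id`
        void track(int32_t seq_id, int64_t pos) {
            auto & r = per_seq[seq_id];
            if (pos < r.min) r.min = pos;
            if (pos > r.max) r.max = pos;
        }
    };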
Georgi Gerganov
f9cd68398b
sampling : make sure samplers return at least 1 token (#13822)
...
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
2025-05-27 12:07:52 +03:00
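The invariant in these bullets is that a truncation sampler must never filter the candidate list down to zero, even with min_keep == 0. A hedged sketch of a min-p style filter with that guarantee (illustrative only, not the llama.cpp sampler):

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Keep tokens whose probability is at least p_min * max_prob, but never
    // return fewer than one candidate -- the invariant from the commit above.
    std::vector<float> min_p_filter(std::vector<float> probs, float p_min, size_t min_keep) {
        std::sort(probs.begin(), probs.end(), std::greater<float>());
        if (probs.empty()) return probs;
        const float cutoff = probs[0] * p_min;
        size_t keep = 0;
        while (keep < probs.size() && probs[keep] >= cutoff) keep++;
        // clamp: at least one token survives, even when min_keep == 0
        keep = std::max<size_t>(std::max<size_t>(keep, min_keep), 1);
        keep = std::min(keep, probs.size());
        probs.resize(keep);
        return probs;
    }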
Georgi Gerganov
4f81b33e32
llama : validate seq id batch input (#13809)
...
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
2025-05-27 09:40:59 +03:00
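Validation here boils down to rejecting any sequence id outside [0, n_seq_max) before decoding touches per-sequence state. A minimal sketch of such a check (hypothetical helper, not the actual llama.cpp code):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical batch-input check in the spirit of the commit above:
    // reject out-of-range sequence ids before they index internal state.
    bool validate_seq_ids(const int32_t * seq_ids, int n_tokens, int32_t n_seq_max) {
        for (int i = 0; i < n_tokens; ++i) {
            if (seq_ids[i] < 0 || seq_ids[i] >= n_seq_max) {
                fprintf(stderr, "invalid seq_id %d at token %d (n_seq_max = %d)\n",
                        seq_ids[i], i, n_seq_max);
                return false;
            }
        }
        return true;
    }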
Olivier Chafik
cdf94a1802
server: --offline mode (#13804)
...
* server: --offline mode (env: LLAMA_OFFLINE)
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 22:34:27 +01:00
Georgi Gerganov
a26c4cc11e
scripts : add option to compare commits in Debug (#13806)
...
* scripts : add option to compare commits in Debug
* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00
Georgi Gerganov
4265a87b59
cuda : avoid cuGetErrorString (#13791)
...
ggml-ci
2025-05-26 22:14:52 +03:00
Akarshan Biswas
6f180b915c
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
...
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
2025-05-26 21:10:36 +05:30
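The ceil_div in the last bullet is the standard round-up integer division used when sizing kernel grids; for reference (a common idiom, not a quote from the SYCL source):

    // round-up integer division: how many blocks of size b cover a elements
    constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }
    // e.g. ceil_div(10, 4) == 3, since three blocks of 4 cover 10 elements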
Olivier Chafik
03f582ae8f
server: fix streaming crashes (#13786)
...
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
standby24x7
88c125f2ac
examples/training: Fix file name in README (#13803)
...
This patch fixes binary file names in README.md.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2025-05-26 16:55:24 +02:00
Olivier Chafik
d74e94c1b3
server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
...
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik
f13847cfb5
server: fix regression on streamed non-chat completion w/ stops (#13785)
...
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
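"Partial stop words aren't erased, full stops are" is the usual hold-back rule for streaming: if the tail of the generated text matches a prefix of a stop word, withhold just that tail until the match either completes or fails. A hedged sketch of the rule (hypothetical helper, not the server implementation):

    #include <algorithm>
    #include <string>

    // Length of the longest suffix of `text` that is a proper prefix of `stop`,
    // i.e. how many trailing chars to hold back while streaming. Sketch of the
    // hold-back rule described in the commit above, not the server code.
    size_t partial_stop_len(const std::string & text, const std::string & stop) {
        if (stop.empty()) return 0;
        const size_t max_len = std::min(text.size(), stop.size() - 1);
        for (size_t len = max_len; len > 0; --len) {
            if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
                return len; // hold these back; emit the rest
            }
        }
        return 0;
    }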
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts (#13797)
...
ggml-ci
2025-05-26 14:03:54 +03:00
Georgi Gerganov
22229314fc
llama : clarify deprecation message ( #13794 )
2025-05-26 12:57:50 +03:00
Concedo
85fd62c974
try update cmake for rocm hipblas (not sure if working)
2025-05-26 16:32:56 +08:00
Romain Biessy
9012eb9b45
sycl: Add more debug prints (#13640)
2025-05-26 10:28:53 +02:00
Concedo
89a3742ded
skip unquantizable clip layers
2025-05-26 16:02:49 +08:00
henk717
b8883e254a
KoboldCpp.sh updates (#1562)
...
* YR makefile upstream
* Create make_portable_rocm_libs.sh
* update makefile, support llama portable, ditch all unnecessary changes
* Delete make_portable_rocm_libs.sh should not be needed
* koboldcpp.sh updates
* Small rocm fixes
* ROCm is now a cuda version not a command
* Don't commit temp file
* Don't commit temp file
* 1200 has errors, removing it for now
* Only rebuild rocm with rebuild
* Update kcpp-build-release-linux.yaml
* Fix rocm filename
* ROCm Linux CI
* We need more diskspace
* Workaround for lockfile getting stuck
Why do I have to do hacks like this....
* Update kcpp-build-release-linux-rocm.yaml
* Dont apt update rocm
You don't allow us to apt update? Better not break things github!
* Container maybe?
* Turns out we aren't root, so we use sudo
* Cleanup ROCm CI PR
* Build for Runpods GPU
* We also need rocblas
* More cleanup just in case
* Update kcpp-build-release-linux-rocm.yaml
---------
Co-authored-by: LostRuins Concedo <39025047+LostRuins@users.noreply.github.com>
2025-05-26 15:24:49 +08:00
Jeff Bolz
fef693dc6b
vulkan: mark IM2COL as supporting non-contig (#13783)
2025-05-26 06:02:07 +02:00
Bizhao Shi
2d38b6e400
CANN: Add basic support for the Flash Attention kernel (#13627)
...
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
2025-05-26 10:20:18 +08:00
Olivier Chafik
e121edc432
server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
...
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Xuan-Son Nguyen
2f099b510f
webui : bump max upload file size to 500MB (#13779)
2025-05-25 18:02:18 +01:00
Sigbjørn Skjæret
aa50ba462f
tests : improve UGM tokenizer test coverage (#13773)
2025-05-25 16:22:29 +02:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell (#13706)
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Percy Piper
c508256db2
rpc : Fix build on OpenBSD (#13541)
2025-05-25 15:35:53 +03:00
Xuan-Son Nguyen
40aaa8a403
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
...
* mtmd : add Qwen2-Audio support
* small clean up
* update discussion link
* clarify mtmd_get_output_embd
* clarification in multimodal.md
* fix ultravox bug
* ggml_cont
2025-05-25 14:06:32 +02:00
ddpasa
a08c1d2845
docs : add Moondream2 pre-quantized link (#13745)
...
* Multimodal: Added Moondream2 model and fixed ggml.org link
* Apply suggestions from code review
---------
Co-authored-by: name <none@none.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-25 14:04:49 +02:00
Concedo
60268de62c
update targets for rocm
2025-05-25 18:41:15 +08:00
Olivier Chafik
d785f9c1fd
server: fix/test add_generation_prompt (#13770)
...
Co-authored-by: ochafik <ochafik@google.com>
2025-05-25 10:45:49 +01:00
Piotr Jasiukajtis
4032ca4066
llama : add support for Qwen3 MoE tied word embeddings (#13768)
2025-05-25 10:29:43 +02:00
Akarshan Biswas
515fdbf7ed
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
...
Temporarily reverted due to failing fp16 DIV operation
This reverts commit 02cdd2d8b0.
ggml-ci
2025-05-25 10:08:37 +03:00
Olivier Chafik
f5cd27b71d
server: streaming of tool calls and thoughts when --jinja is on (#12379)
...
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
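The "truncated json healing" in the first bullet refers to closing off a JSON prefix so it parses while tokens are still streaming in. A simplified sketch of the idea, assuming only unterminated strings, objects, and arrays need closing (the real common_json handles more cases):

    #include <string>
    #include <vector>

    // Append the closers a truncated JSON prefix is missing so that it parses.
    // Simplified sketch of the healing idea; ignores partial numbers/literals.
    std::string heal_truncated_json(const std::string & s) {
        std::vector<char> closers;
        bool in_str = false, esc = false;
        for (char c : s) {
            if (esc)                { esc = false; continue; }
            if (in_str) {
                if (c == '\\')      esc = true;
                else if (c == '"')  in_str = false;
                continue;
            }
            if      (c == '"')      in_str = true;
            else if (c == '{')      closers.push_back('}');
            else if (c == '[')      closers.push_back(']');
            else if (c == '}' || c == ']') {
                if (!closers.empty()) closers.pop_back();
            }
        }
        std::string out = s;
        if (in_str) out += '"';
        while (!closers.empty()) { out += closers.back(); closers.pop_back(); }
        return out;
    }

For example, healing the prefix {"name": "get_wea yields {"name": "get_wea"}, which a standard JSON parser accepts.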
Diego Devesa
a2d02d5793
releases : bundle llvm omp library in windows release (#13763)
2025-05-25 00:55:16 +02:00
Diego Devesa
17fc817b58
releases : enable openmp in windows cpu backend build (#13756)
2025-05-24 22:27:03 +02:00
Diego Devesa
2bd1b30f69
ggml-cpu : set openmp wait time if not set (#13758)
2025-05-24 22:26:47 +02:00
Concedo
f1422217ce
Merge branch 'upstream' into concedo_experimental
2025-05-25 00:00:08 +08:00
Concedo
bd960a90a6
removed unnecessary function
2025-05-24 23:59:31 +08:00
Concedo
779a41f23e
Merge commit 'c3a2624339' into concedo_experimental
2025-05-24 22:56:02 +08:00
Concedo
f97bbdde00
fix to allow all EOGs to trigger a stop, occam's glm4 fix
2025-05-24 22:55:11 +08:00
0cc4m
259469c4b5
Move GLM4 f32 attention fix to the correct function (#13750)
2025-05-24 16:49:12 +02:00