Concedo
d577187875
update sdui
2025-12-21 20:35:19 +08:00
Concedo
7304640f72
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# docs/android.md
# docs/backend/hexagon/CMakeUserPresets.json
# examples/llama.android/app/src/main/res/layout/activity_main.xml
# examples/llama.android/app/src/main/res/layout/item_message_assistant.xml
# examples/llama.android/app/src/main/res/layout/item_message_user.xml
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/model-conversion/scripts/utils/common.py
# ggml/CMakeLists.txt
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# tests/test-arg-parser.cpp
# tools/server/README.md
2025-12-20 09:32:06 +08:00
Concedo
714ab0682e
Revert "Revert "llama : Async DirectIO model loading on Linux (#18012)""
...
This reverts commit a45fc5ee88.
2025-12-20 09:25:10 +08:00
Julius Tischbein
f99ef53d2a
llama : Changing off_t to size_t for Windows (#18204)
2025-12-19 16:42:46 +02:00
Concedo
a45fc5ee88
Revert "llama : Async DirectIO model loading on Linux (#18012)"
...
This reverts commit 4d4f4cacd1.
2025-12-19 19:06:30 +08:00
Concedo
58eb5573de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/hvx-utils.c
# ggml/src/ggml-hexagon/htp/main.c
# src/llama-model.cpp
# tools/server/README.md
2025-12-19 11:00:43 +08:00
Concedo
e005fc2587
Merge commit '8dcc3662a2' into concedo_experimental
...
Keep changes from https://github.com/ggml-org/llama.cpp/pull/18096 without https://github.com/ggml-org/llama.cpp/pull/14904
Reason is to maintain compatibility with 2023 w64devkit
# Conflicts:
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/speculative/speculative.cpp
# ggml/src/ggml-cpu/arch-fallback.h
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-cpu/repack.h
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/hvx-utils.c
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
2025-12-19 02:11:55 +08:00
Johannes Gäßler
57c1e05643
llama: offload output layer to GPU first (#18148)
2025-12-18 08:12:18 +01:00
Julius Tischbein
4d4f4cacd1
llama : Async DirectIO model loading on Linux (#18012)
...
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported
* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp
* Adding maybe unused keyword for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
2025-12-18 08:27:19 +02:00
Johannes Gäßler
8dcc3662a2
llama-fit-params: fix memory print (#18136)
2025-12-17 21:10:03 +01:00
Georgi Gerganov
4301e27319
common : restore grammar-based rejection sampling (#18137)
...
* common : restart grammar-based rejection sampling
* sampling : allow null samplers
2025-12-17 19:46:00 +02:00
Concedo
1f2c9f6b62
gpt4v not working correctly
2025-12-17 21:02:16 +08:00
Concedo
1daeed5d4d
Merge commit '9963b81f63' into concedo_experimental
...
# Conflicts:
# .github/workflows/server.yml
# SECURITY.md
# docs/backend/SYCL.md
# examples/model-conversion/README.md
# examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
2025-12-17 20:30:34 +08:00
Tarek Dakhran
982060fadc
model: fix LFM2_MOE missing tensors (#18132)
2025-12-17 12:17:11 +01:00
Concedo
c93c4c5505
Merge commit '4a4f7e6550' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# CODEOWNERS
# README.md
# ci/run.sh
# docs/development/HOWTO-add-model.md
# grammars/README.md
# src/llama-context.cpp
# src/llama.cpp
# tools/CMakeLists.txt
# tools/completion/README.md
# tools/llama-bench/README.md
2025-12-17 14:30:39 +08:00
Johannes Gäßler
d0794e89d9
llama-fit-params: force disable mlock (#18103)
2025-12-17 00:50:12 +01:00
Johannes Gäßler
9dcac6cf9f
llama-fit-params: lower ctx size for multi GPU (#18101)
2025-12-17 00:49:34 +01:00
Johannes Gäßler
0e49a7b8b4
llama-fit-params: fix underflow for dense models (#18095)
2025-12-17 00:47:37 +01:00
Xuan-Son Nguyen
ef83fb8601
model: fix LFM2 missing tensors (#18105)
2025-12-16 19:07:43 +01:00
Concedo
050a5b1f52
Merge commit '4aced7a631' into concedo_experimental
...
# Conflicts:
# .devops/cann.Dockerfile
# .devops/cpu.Dockerfile
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/tools.sh
# .devops/vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/release.yml
# .gitignore
# docs/ops.md
# docs/ops/SYCL.csv
# examples/batched/batched.cpp
# examples/eval-callback/eval-callback.cpp
# examples/gen-docs/gen-docs.cpp
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/model-conversion/scripts/causal/compare-logits.py
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/model-conversion/scripts/utils/check-nmse.py
# examples/parallel/parallel.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# examples/training/finetune.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/pad.cpp
# ggml/src/ggml-sycl/ssm_conv.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# pyrightconfig.json
# scripts/sync-ggml.last
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/imatrix.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/clip.cpp
# tools/perplexity/perplexity.cpp
# tools/server/README.md
2025-12-16 23:14:12 +08:00
Johannes Gäßler
ec98e20021
llama: fix early stop in params_fit if ctx is set (#18070)
2025-12-16 14:24:00 +01:00
Xuan-Son Nguyen
7f2b2f3c77
arch: refactor LLM_TENSOR_NAMES (#18051)
...
* arch: refactor LLM_TENSOR_NAMES
* update docs
* typo
* fix LLM_ARCH_NEMOTRON_H_MOE
* show more meaningful error message on missing tensor
* fix and tested LLM_ARCH_NEMOTRON_H_MOE
2025-12-16 13:22:30 +01:00
Piotr Wilkin (ilintar)
a5251ca11d
Optimization: Qwen3 next autoregressive pass (#17996)
...
* It's Qwen3 Next, the lean mean token generation machine!
* Apply patches from thread
* Remove recurrent version, only keep chunked and autoregressive
* Remove unnecessary conts and asserts
* Remove more extra conts and asserts
* Cleanup masking
2025-12-16 11:59:53 +01:00
Xuan-Son Nguyen
3d86c6c2b5
model: support GLM4V vision encoder (#18042)
...
* convert ok
* no deepstack
* less new tensors
* cgraph ok
* add mrope for text model
* faster patch merger
* add GGML_ROPE_TYPE_MRNORM
* add support for metal
* move glm4v to dedicated graph
* convert: add norm_embd
* clip: add debugging fn
* working correctly
* fix style
* use bicubic
* fix mrope metal
* improve cpu
* convert to neox ordering on conversion
* revert backend changes
* force stop if using old weight
* support moe variant
* fix conversion
* fix convert (2)
* Update tools/mtmd/clip-graph.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* process mrope_section on TextModel base class
* resolve conflict merge
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 11:25:26 +01:00
Concedo
e88bf41fdc
Merge commit '12280ae905' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# common/CMakeLists.txt
# docs/docker.md
# examples/model-conversion/scripts/causal/compare-logits.py
# ggml/src/ggml-hexagon/htp/rope-ops.c
# tests/test-backend-ops.cpp
# tests/test-barrier.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2025-12-16 16:29:01 +08:00
Chris Peterson
2aa45ef9e3
llama: Include algorithm header needed for C++23 (#18078)
2025-12-16 09:37:55 +02:00
Georgi Gerganov
c560316440
graph : reuse SSM graphs (#16490)
...
* graph : reuse hybrid graphs
* graph : reuse recurrent graphs
* graph : fix reuse check for recurrent inputs
* memory : move the recurrent state into the memory context
* Revert "memory : move the recurrent state into the memory context"
This reverts commit 00f115fe810815d4a22a6dee0acc346131e970e1.
* cont : fix build
2025-12-16 09:36:21 +02:00
Daniel Bevenius
2995341730
llama : add support for NVIDIA Nemotron 3 Nano (#18058)
...
* llama : add support for NVIDIA Nemotron 3 Nano
This commit adds support for the NVIDIA Nemotron 3 Nano model, enabling
the conversion and running of this model.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 07:19:26 +01:00
HelloKS
9d52f17ae3
model : add KORMo model (#18032)
...
* vocab: add KORMo Tokenizer
* model: add KORMoForCausalLM
* vocab: change pretokenizer to qwen2
* lint: fix unintended line removal
* model: make qwen2 bias tensor optional
* model: use qwen2 architecture for KORMo
2025-12-15 18:51:43 +01:00
ssweens
4529c660c8
kv-cache: Fix state restore fragmented cache (#17982)
...
* kv-cache : fix state restore with fragmented cache (#17527)
Change find_slot to allow non-contiguous allocation during state restore. Fixes 'failed to find available cells in kv cache' error when restoring state to fragmented cache.
* tests : update logic
* cleanup: tightened state_read_meta sig, added is_contiguous case
* fix: state_read_meta arg reorder loose ends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-15 19:28:35 +02:00
Johannes Gäßler
b1f3a6e5db
llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
...
* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layer hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
2025-12-15 09:24:59 +01:00
Xuan-Son Nguyen
0759b09c90
graph: add f_attn_temp_offset (#18025)
2025-12-14 13:05:59 +01:00
Georgi Gerganov
609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
...
* models : fix YaRN regression + consolidate logic
* cont : fix the fix
* cont : remove header
* cont : add header
2025-12-14 08:34:56 +02:00
Jeff Bolz
5266379bca
llama_context: synchronize before reallocating output buffer (#17974)
2025-12-13 09:19:51 -06:00
Georgi Gerganov
7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
...
* models : fix the attn_factor for mistral3 graphs
* cont : rework attn_factor correction logic
* cont : make deepseek2 consistent
* cont : add TODO
* cont : special-case DSv2
* cont : revert Mistral 3 Large changes
* cont : fix DS2 to use the original attn_factor
* cont : minor comments
2025-12-12 17:12:40 +02:00
Concedo
34d243bf3c
Merge commit 'b677721819' into concedo_experimental
...
# Conflicts:
# CONTRIBUTING.md
# common/chat.cpp
# docs/ops.md
# docs/ops/CPU.csv
# docs/ops/CUDA.csv
# docs/ops/OpenCL.csv
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-sycl/softmax.cpp
# grammars/README.md
# src/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-grammar-integration.cpp
# tests/test-grammar-parser.cpp
# tests/test-llama-grammar.cpp
# tools/mtmd/CMakeLists.txt
2025-12-11 23:33:19 +08:00
Concedo
278e45becf
Merge commit '2fa51c19b0' into concedo_experimental
...
# Conflicts:
# .github/actions/windows-setup-cuda/action.yml
# .github/workflows/build-linux-cross.yml
# .github/workflows/release.yml
# README.md
# docs/build-riscv64-spacemit.md
# examples/model-conversion/logits.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# models/templates/Kimi-K2-Instruct.jinja
# models/templates/Kimi-K2-Thinking.jinja
# tests/test-chat.cpp
# tools/server/README.md
2025-12-11 23:04:48 +08:00
Concedo
fd0d0cab03
move pipeline parallelism to a --pipelineparallel launch flag
2025-12-11 21:03:41 +08:00
Georgi Gerganov
d9f8f60618
batch : fix sequence id ownership (#17915)
...
* batch : fix sequence id ownage
* cont : reduce allocations
2025-12-11 14:29:47 +02:00
Georgi Gerganov
4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
...
* ggml : remove GGML_KQ_MASK_PAD constant
* cont : remove comment
2025-12-10 20:53:16 +02:00
Eric Zhang
b677721819
model : Qwen3-Next-80B-A3B has 48 layers (#17898)
...
* model : Qwen3-Next-80B-A3B has 48 layers
* model : Add 80B-A3B type name
2025-12-10 15:22:40 +01:00
Rhys-T
63908b631a
cmake: fix Mach-O current version number (#17877)
...
PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd
is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the
Mach-O 'current version' field's 'micro' part, which only goes up to
255. This just sets the Mach-O current version to 0 to get it building
properly again.
Fixes #17258.
2025-12-09 13:17:41 +02:00
Sigbjørn Skjæret
42b12b5608
model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652)
...
* nit, DeepSeek V1 MoE is 16B
* base type on n_ff_exp instead
2025-12-09 12:15:06 +01:00
Aldehir Rojas
e39502e74b
llama : add token matching support to llama-grammar (#17816)
...
* llama : add token support to llama-grammar
* fix inverse token comment
* refactor trigger_patterns to replay tokens instead of the entire string
* add token documentation
* fix test-llama-grammar
* improve test cases for tokens
2025-12-09 00:32:57 -06:00
philip-essential
1d2a1ab73d
model : support Rnj-1 (#17811)
...
* add support for rnj1
* refactor gemma3 to support rnj-1
* address review comments
2025-12-09 04:49:03 +01:00
Sigbjørn Skjæret
c8554b66e0
graph : use fill instead of scale_bias in grouped expert selection (#17867)
...
* use fill instead of scale_bias in grouped expert selection
* do not explicitly use _inplace
2025-12-08 21:29:59 +01:00
Piotr Wilkin (ilintar)
e4e9c4329c
Make graph_max_nodes vary by ubatch size (#17794)
...
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph
* Update src/llama-context.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add missing const
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
Xuan-Son Nguyen
4d3726278b
model: add llama 4 scaling for mistral-large (deepseek arch) (#17744)
2025-12-07 22:29:54 +01:00
Concedo
17c0c8d55d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/backend/zDNN.md
# docs/build.md
# docs/ops.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# src/llama-quant.cpp
# tests/test-backend-ops.cpp
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
2025-12-07 16:48:38 +08:00
Concedo
7c5d271d6c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# .github/workflows/winget.yml
# CMakeLists.txt
# CODEOWNERS
# CONTRIBUTING.md
# cmake/build-info.cmake
# docs/ops.md
# docs/ops/BLAS.csv
# docs/ops/Metal.csv
# examples/CMakeLists.txt
# examples/save-load-state/save-load-state.cpp
# examples/simple-cmake-pkg/README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
# src/llama-quant.cpp
# tests/test-backend-ops.cpp
# tools/server/CMakeLists.txt
2025-12-07 16:37:32 +08:00