Commit graph

1047 commits

Author SHA1 Message Date
Concedo
acfc1e56d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	tests/test-regex-partial.cpp
2026-01-04 11:14:33 +08:00
Aldehir Rojas
cef1d23c5a
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
* grammar : add support for std::regex_search() with trigger patterns

* common : update hermes2 pro trigger to search instead of match

* common : use regex_search with anchoring for partial matching

* common : adjust regex partial tests to use new pattern

* grammar : check pattern directly instead of adding a type

* common : adjust existing patterns to match new semantics
2026-01-03 16:02:43 -06:00
Georgi Gerganov
c69c7ebc90
graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566) 2026-01-03 23:59:06 +02:00
Georgi Gerganov
a554a1ecc7
context : fix reserve token padding to n_seqs (#18536) 2026-01-03 15:45:34 +02:00
Concedo
e4abf643fa Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	src/CMakeLists.txt
#	src/llama-vocab.cpp
2026-01-03 15:37:30 +08:00
Prabod
5755e52d15
model : Maincoder-1B support (#18534)
* Add Maincoder model support

* Removed SPM model vocabulary setting and MOE related GGUF parameters
Removed trailing spaces from maincoder.cpp

* removed set_vocab

* added new line

* Fix formatting

* Add a new line for PEP8
2026-01-02 20:11:59 +01:00
Georgi Gerganov
af1e8e1a6c
graph : reduce topology branching (#18548) 2026-01-02 19:01:56 +02:00
Georgi Gerganov
d84a6a98be
vocab : reduce debug logs about non-EOG control tokens (#18541)
* vocab : reduce debug logs about non-EOG control tokens

* cont : add comment
2026-01-02 16:17:33 +02:00
Concedo
7e1ae49e7d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/ggml-cuda.cu
#	tests/test-backend-ops.cpp
#	tools/mtmd/CMakeLists.txt
2026-01-02 11:05:20 +08:00
Sigbjørn Skjæret
169ee68ffb
model : remove modern-bert iswa template (#18529)
* remove modern-bert iswa template

* forgotten
2026-01-02 00:06:42 +01:00
tt
ced765be44
model: support youtu-vl model (#18479)
* Support Youtu-VL Model

* merge code

* fix bug

* revert qwen2 code & support rsplit in minja.hpp

* update warm info

* fix annotation

* u

* revert minja.hpp

* fix

* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None

* fix expert_weights_scale

* LGTM after whitespace fixes

* fix

* fix

* fix

* layers to layer_index

* enum fix

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 19:25:54 +01:00
o7si
d0a6a31470
model : add support for JinaBertModel with non-gated ffn (#18475)
* WIP: Initial commit for fixing JinaBert original FF type support

* convert: add jina-v2-de tokenizer variant for German_Semantic_V3

* convert: fix token collision in BERT phantom vocab conversion

* convert: add feed_forward_type metadata

* model: add feed_forward_type metadata for jina-bert-v2

* model: jina-bert-v2 support standard GELU FFN variant

* model: remove ffn_type, detect FFN variant from tensor dimensions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* revert collision fix to be handled in separate PR

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:38:51 +01:00
HelloKS
f4f5019254
model: add Solar Open model (#18511)
* model: add Solar-Open model

* vocab: add solar-open to end eog blacklist

* model: add proper llm type

* chat: basic template for solar open

* typo: fix comment about vocab

* convert: sugested changes

* convert: suggested changes

* chat: change reasoning end tag for solar-open

* llama-chat: add solar-open template
2026-01-01 18:01:43 +01:00
Concedo
54e419f587 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	docs/ops.md
#	docs/ops/Metal.csv
#	ggml/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	grammars/README.md
#	models/templates/llama-cpp-deepseek-r1.jinja
#	scripts/sync-ggml.last
#	tests/test-chat.cpp
2026-01-01 15:34:10 +08:00
Concedo
66ccf8f6b8 Merge commit 'f14f4e421b' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	AGENTS.md
#	CONTRIBUTING.md
#	docs/build.md
#	examples/llama.android/app/build.gradle.kts
#	examples/llama.android/app/src/main/java/com/example/llama/MainActivity.kt
#	examples/llama.android/app/src/main/res/layout/activity_main.xml
#	examples/llama.android/gradle/libs.versions.toml
#	examples/llama.android/lib/src/main/cpp/ai_chat.cpp
#	examples/llama.android/lib/src/main/java/com/arm/aichat/InferenceEngine.kt
#	examples/llama.android/lib/src/main/java/com/arm/aichat/internal/InferenceEngineImpl.kt
#	examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/retrieval/retrieval.cpp
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-cuda/mmq.cu
#	ggml/src/ggml-cuda/mmq.cuh
#	src/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/server/CMakeLists.txt
2026-01-01 15:20:56 +08:00
triplenom
9e10bd2eaf
llama: handle short reads in direct I/O path (#18504) 2026-01-01 10:24:43 +08:00
Daniel Bevenius
ac1d0eb7bf
llama : fix typo in comment in llama-kv-cache.h [no ci] (#18489) 2025-12-30 17:20:14 +01:00
Xuan-Son Nguyen
cd78e57c3a
lora: count lora nodes in graph_max_nodes (#18469)
* lora: count lora nodes in graph_max_nodes

* 3 nodes per weight

* 4 nodes

* keep track n_lora_nodes from llama_model

* fix assert

* rm redundant header

* common: load adapters before context creation

* use 6 nodes
2025-12-30 15:53:12 +01:00
Jay Zenith
c32fa21db8
sampling: reuse token data buffer in llama_sampler_sample (#18365)
* sampling: reuse token data buffer in llama_sampler_sample

* move cur buffer before timing section, after samplers

* minor : fix build

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-30 16:27:49 +02:00
momonga
9c675c7140
model : Plamo3 support (#17304)
* plamo3

* fix plamo3

* clean code

* clean up the code

* fix diff

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* add chat_template if exist

* clean up the code

* fix cpu-backend

* chore: whitespace trim fix + typo fix

* Fix: address review feedback

* restore `FREQ_BASE_SWA` constant

* Fix: address review feedback2

* Fix:typecheck

* Fix: address review feedback3

* final cleanup

---------

Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 17:28:31 +01:00
Concedo
0e26e4d354 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/ISSUE_TEMPLATE/019-bug-misc.yml
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
2025-12-28 23:47:55 +08:00
Concedo
82d562ad7b unstable merge 2025-12-28 23:03:03 +08:00
Johannes Gäßler
f8d561eb87
llama-fit-params: fix step size for last device (#18415) 2025-12-28 10:52:09 +01:00
Johannes Gäßler
a4bf35889e
llama-fit-params: fix overflow check (#18354) 2025-12-27 20:20:45 +01:00
Johannes Gäßler
026d2ad472
llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* enacapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
Johannes Gäßler
a52dc60ba3
llama_fit_params: return enum for fail vs. error (#18374) 2025-12-27 09:59:19 +01:00
Johannes Gäßler
9045c9afe5
llama-fit-params: fix Gemma 3 calculation (#18372) 2025-12-27 09:56:04 +01:00
Xuan-Son Nguyen
4cbafad4f0
model: support MiMo-V2-Flash (#18328)
* mimov2: convert ok

* rename mimov2 --> mimo2

* fix conversion

* runnable not incorrect

* use sink

* add_sliding_window_pattern

* add swa and per-layer n_head_kv

* correct params

* somewhat working

* correct gating func

* nits

* mimo2: wire RMS eps + MoE bias + converter guards

* add co-author

Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>

* use add_rope_freq_base_swa

---------

Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com>
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
2025-12-24 23:07:08 +01:00
Concedo
6cc71db85a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/run-org-model.py
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
2025-12-25 00:06:27 +08:00
Concedo
3589a5e136 Merge commit '12ee1763a6' into concedo_experimental
# Conflicts:
#	docs/backend/hexagon/README.md
#	docs/backend/hexagon/developer.md
#	examples/gen-docs/gen-docs.cpp
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	examples/sycl/run-llama2.sh
#	examples/sycl/run-llama3.sh
#	examples/sycl/win-run-llama2.bat
#	examples/sycl/win-run-llama3.bat
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp-utils.h
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/htp-dma.c
#	ggml/src/ggml-hexagon/htp/htp-dma.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-opencl/kernels/transpose.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/snapdragon/adb/run-cli.sh
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2025-12-24 23:57:41 +08:00
Concedo
d1983959d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	AGENTS.md
#	common/CMakeLists.txt
#	docs/development/parsing.md
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grammar-llguidance.cpp
#	tests/test-tokenizer-0.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-spm.cpp
#	tools/batched-bench/batched-bench.cpp
#	tools/cli/cli.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-12-24 23:42:28 +08:00
Saba Fallah
54132f1b1f
model : support for LlamaBidirectionalModel architecture (#18220)
* model: llama-embed-nemotron

* minor: python lint

* changed arch-name

* templated llm_build_llama to be used for both llama and llama-embed arch
2025-12-24 14:02:36 +01:00
Alessandro98-git
96e33a814e
model : fix div-by-zero for Nemotron V2 (#18309)
* llama-model : fix Nemotron V2 crash by moving MoE parameters calculation

* remove whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-23 03:04:57 +01:00
Ryan Mangeno
dfc959b886
model : Granite Embedding support (#15641)
ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome!

* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only

* conversion now working, hf -> gguf

* working on support, now working on building graph

* some cleanup

* cleanup

* continuing

* correct tensor shape for qkv

* fixed tensor mappings and working on buildin graph

* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this

* cleanup

* cleanup

* cleanup

* more cleanup

* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention  keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more

* added cls token per previous modern bert attempt, still working on checking out the rest

* fixed pre tokenizer and still working through previous pr

* working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer

* fixed pre tokenizer

* working on swa with local and global alternating attention

* some cleanup and now fails on build attn

* starting to work, and some cleanup, currently failing on last layer construction in graph build

* alternating rope implemented and modern bert graph build succeeds

* fixed asser for equal ubatch seq

* cleanup

* added mask check in vocab

* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values

* reuse variable

* removed repeat

* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL

* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...

* more modular hparam setting

* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* removed redundant hparam set

* enums for model sizes

* conversion for modern-bert model supported rather than just granite-small

* Update src/llama-model.cpp

Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>

* Update src/llama-model.cpp

Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>

* fixed ordering of enum for freq_base_swa

* fixed where I added residual, now gives much much better embeddings~

* readded cacheless logic

* removing whitespace

* conversion now working for swa pattern - dense every n layers

* modern bert put into seperate src file

* removing whitespace

* fixed whitespace and newline errors in editorconfig job

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* better naming convention, n_swa_pattern -> swa_period

* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support

* fixing pyright type-check fail

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-saver.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* added descriptions in llama-model

* fixed tensor mappings for conversion

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mapping name for size

* nits

* unused

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
2025-12-23 00:28:19 +01:00
Johannes Gäßler
147a521636
tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
Concedo
d577187875 update sdui 2025-12-21 20:35:19 +08:00
Concedo
7304640f72 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	docs/android.md
#	docs/backend/hexagon/CMakeUserPresets.json
#	examples/llama.android/app/src/main/res/layout/activity_main.xml
#	examples/llama.android/app/src/main/res/layout/item_message_assistant.xml
#	examples/llama.android/app/src/main/res/layout/item_message_user.xml
#	examples/model-conversion/scripts/causal/run-org-model.py
#	examples/model-conversion/scripts/utils/common.py
#	ggml/CMakeLists.txt
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	tests/test-arg-parser.cpp
#	tools/server/README.md
2025-12-20 09:32:06 +08:00
Concedo
714ab0682e Revert "Revert "llama : Async DirectIO model loading on Linux (#18012)""
This reverts commit a45fc5ee88.
2025-12-20 09:25:10 +08:00
Julius Tischbein
f99ef53d2a
llama : Changing off_t to size_t for Windows (#18204) 2025-12-19 16:42:46 +02:00
Concedo
a45fc5ee88 Revert "llama : Async DirectIO model loading on Linux (#18012)"
This reverts commit 4d4f4cacd1.
2025-12-19 19:06:30 +08:00
Concedo
58eb5573de Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/hvx-utils.c
#	ggml/src/ggml-hexagon/htp/main.c
#	src/llama-model.cpp
#	tools/server/README.md
2025-12-19 11:00:43 +08:00
Concedo
e005fc2587 Merge commit '8dcc3662a2' into concedo_experimental
Keep changes from https://github.com/ggml-org/llama.cpp/pull/18096 without https://github.com/ggml-org/llama.cpp/pull/14904
Reason is to maintain compatibility with 2023 w64devkit

# Conflicts:
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/speculative/speculative.cpp
# ggml/src/ggml-cpu/arch-fallback.h
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-cpu/repack.h
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/hvx-utils.c
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
2025-12-19 02:11:55 +08:00
Johannes Gäßler
57c1e05643
llama: offload output layer to GPU first (#18148)
Some checks are pending
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / pyright type-check (push) Waiting to run
2025-12-18 08:12:18 +01:00
Julius Tischbein
4d4f4cacd1
llama : Async DirectIO model loading on Linux (#18012)
* Uncached model read

* Removing additional --mmap arg

* Removing trailing whitespaces

* Adding fallback when O_DIRECT is not supported

* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp

* Adding maybe unused keyword for Mac and Windows.

* File seek aligned

* Removing all branches for direct_io in llama-model-loader.cpp

* Always use alignment from llama_file

* use_mmap=true
2025-12-18 08:27:19 +02:00
Johannes Gäßler
8dcc3662a2
llama-fit-params: fix memory print (#18136) 2025-12-17 21:10:03 +01:00
Georgi Gerganov
4301e27319
common : restore grammar-based rejection sampling (#18137)
* common : restart grammar-based rejection sampling

* sampling : allow null samplers
2025-12-17 19:46:00 +02:00
Concedo
1f2c9f6b62 gpt4v not working correctly 2025-12-17 21:02:16 +08:00
Concedo
1daeed5d4d Merge commit '9963b81f63' into concedo_experimental
# Conflicts:
#	.github/workflows/server.yml
#	SECURITY.md
#	docs/backend/SYCL.md
#	examples/model-conversion/README.md
#	examples/model-conversion/scripts/embedding/compare-embeddings-logits.sh
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
2025-12-17 20:30:34 +08:00
Tarek Dakhran
982060fadc
model: fix LFM2_MOE missing tensors (#18132) 2025-12-17 12:17:11 +01:00
Concedo
c93c4c5505 Merge commit '4a4f7e6550' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	CODEOWNERS
#	README.md
#	ci/run.sh
#	docs/development/HOWTO-add-model.md
#	grammars/README.md
#	src/llama-context.cpp
#	src/llama.cpp
#	tools/CMakeLists.txt
#	tools/completion/README.md
#	tools/llama-bench/README.md
2025-12-17 14:30:39 +08:00