Commit graph

1642 commits

Concedo
1ebadc515e add streaming support for oai tools (+2 squashed commits)
Squashed commit:

[4d080b37] qwen2.5vl surgery script

[4bebe7e5] add streaming support for oai tools
2025-03-31 16:49:15 +08:00
Concedo
911669087a add tentative support for qwen2.5vl vision from HimariO fork 2025-03-29 22:52:43 +08:00
Concedo
396875e1c4 update api docs and lite 2025-03-29 15:39:25 +08:00
Benson Wong
5d01670266
server : include speculative decoding stats when timings_per_token is enabled (#12603)
* Include speculative decoding stats when timings_per_token is true

New fields added to the `timings` object:

  - draft_n           : number of draft tokens generated
  - draft_accepted_n  : number of draft tokens accepted
  - draft_accept_ratio: ratio of accepted/generated

* Remove redundant draft_accept_ratio var

* add draft acceptance rate to server console output
2025-03-28 10:05:44 +02:00
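
A minimal sketch of the new fields in use, assuming a llama-server started with a draft model; the port, model flags, and payload here are illustrative, not taken from the PR:

```console
# ask a running llama-server (started with a draft model, e.g. -md) for
# per-token timings; the port and model flags are illustrative
$ curl http://localhost:8080/completion -H "Content-Type: application/json" \
    -d '{"prompt": "Hello", "n_predict": 32, "timings_per_token": true}'
# the "timings" object in the response now also includes
# draft_n, draft_accepted_n and draft_accept_ratio
```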
Radoslav Gerganov
ef03229ff4
rpc : update README for cache usage (#12620) 2025-03-28 09:44:13 +02:00
Radoslav Gerganov
ab6ab8f809
rpc : send hash when tensor data is above some fixed threshold (#12496)
* rpc : send hash when tensor data is above some fixed threshold

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency
2025-03-28 08:18:04 +02:00
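
A hedged sketch of the resulting workflow; the `-c` cache flag is an assumption, and the rpc README updated in the commit just above is the authority:

```console
# start an RPC worker with local caching (the -c flag is an assumption;
# see the updated rpc README), then point a client at it; large tensors
# are sent as a hash first and served from the cache under
# $HOME/.cache/llama.cpp when possible
$ rpc-server -p 50052 -c
$ llama-cli -m model.gguf --rpc localhost:50052 -p "Hello"
```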
Piotr
2099a9d5db
server : Support listening on a unix socket (#12613)
* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

---------

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-03-27 23:41:04 +01:00
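
On the client side this can be exercised with curl's built-in unix-socket support; the socket path is a placeholder, and how llama-server is told to bind to it is described in the PR rather than shown here:

```console
# the host in the URL is a placeholder; curl routes the request
# through the socket instead
$ curl --unix-socket /tmp/llama.sock http://localhost/v1/models
```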
Ivy233
02082f1519
clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566)
* [Fix] When clip-quantize-cli is built for the CUDA backend, running it makes ggml_fp16_to_fp32 report an error, because it tries to read tensor data that lives in video memory; quantization has to run on the CPU backend.
After the fix, the tool automatically runs on the CPU backend and is no longer bound to CUDA.

* [Fix] Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.
2025-03-26 15:06:04 +01:00
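
A usage sketch of the fixed tool; the positional arguments (input, output, quantization type) and the meaning of the type value are assumptions to verify against the tool's help output:

```console
# quantize an mmproj/CLIP model; after the fix this runs on the CPU
# backend even in a CUDA build (type 2 is assumed to mean Q4_0)
$ llama-llava-clip-quantize-cli mmproj-f16.gguf mmproj-q4_0.gguf 2
```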
Eric Curtin
ef19c71769
run: de-duplicate fmt and format functions and optimize (#11596) 2025-03-25 18:46:11 +01:00
Concedo
ea358369cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/README.md
#	ci/run.sh
#	docs/backend/CUDA-FEDORA.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Marius Gerdes
77f9c6bbe5
server : Add verbose output to OAI compatible chat endpoint. (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
2025-03-23 19:30:26 +01:00
Concedo
7030ebf401 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
Concedo
c1e58419c7 support for voice cloning is done (+2 squashed commits)
Squashed commit:

[e7301628] support for voice cloning is done

[1653c576] wip adding voice cloning
2025-03-21 22:28:59 +08:00
marcoStocchi
ea1518e839
llama-tts : avoid crashes related to bad model file paths (#12482) 2025-03-21 11:12:45 +02:00
Woof Dog
e04643063b
webui : Prevent rerendering on textarea input (#12299)
* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-20 15:57:43 +01:00
Concedo
0c90d2ebcf Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	docs/backend/SYCL.md
#	examples/main/README.md
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Georgi Gerganov
c6af2161b2
speculative : fix seg fault in certain cases (#12454) 2025-03-18 19:35:11 +02:00
Georgi Gerganov
810e0af3f5
server : fix warmup draft cache type (#12446)
ggml-ci
2025-03-18 12:05:42 +02:00
Sigbjørn Skjæret
60c902926c
docs : bring llama-cli conversation/template docs up-to-date (#12426) 2025-03-17 21:14:32 +01:00
Concedo
5d7c5e9e33 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/tts/tts.cpp
2025-03-16 15:42:39 +08:00
marcoStocchi
f4c3dd5daa
llama-tts : add '-o' option (#12398)
* added -o option to specify an output file name

* llama-tts returns ENOENT in case of file write error

note : PR #12042 is closed as superseded by this one.
2025-03-15 17:23:11 +01:00
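
A sketch of the new option; the model, vocoder, and prompt flags are assumptions based on typical llama-tts invocations:

```console
# write the generated audio to a chosen file instead of the default name
$ llama-tts -m outetts.gguf -mv wavtokenizer.gguf -p "Hello world" -o hello.wav
# a failed file write now returns ENOENT instead of crashing
```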
Concedo
67851e5415 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/run/run.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
2025-03-15 19:54:19 +08:00
Concedo
bfc30066c9 fixed a clip processing bug 2025-03-15 17:49:49 +08:00
Eric Curtin
9f2250ba72
Add CLI arg to llama-run to adjust the number of threads used (#12370)
We default to 4; sometimes we want to adjust this manually.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-03-14 16:41:20 +00:00
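
A usage sketch; the exact flag spelling is an assumption to confirm with `llama-run --help`:

```console
# override the default of 4 threads
$ llama-run -t 8 llama3 "Why is the sky blue?"
```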
Victor
add2a3aa5a
server: fix "--grammar-file" parameter (#12285) 2025-03-14 11:21:17 +01:00
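
The repaired parameter in use; the model and grammar file names are placeholders:

```console
# constrain server output with a GBNF grammar loaded from disk
$ llama-server -m model.gguf --grammar-file json.gbnf
```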
Concedo
0db4ae6237 traded my ink for a pen 2025-03-14 11:58:15 +08:00
Georgi Gerganov
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context

ggml-ci

* graph : don't mutate the KV cache during defrag

ggml-ci

* context : reduce virtuals + remove test function

ggml-ci

* context : move interface implementation to source file + factory

ggml-ci

* graph : move KV cache build functions to llama_context impl

ggml-ci

* graph : remove model reference from build_pooling

ggml-ci

* graph : remove llama_model reference

ggml-ci

* kv_cache : provide rope factors

ggml-ci

* graph : rework inputs to use only unique_ptr, remove attn input abstraction

ggml-ci

* context : remove llama_context_i abstraction

ggml-ci

* context : clean-up

ggml-ci

* graph : clean-up

ggml-ci

* llama : remove redundant keywords (struct, enum)

ggml-ci

* model : adapt gemma3

ggml-ci

* graph : restore same attention ops as on master

ggml-ci

* llama : remove TODO + fix indent

ggml-ci
2025-03-13 12:35:44 +02:00
Ishaan Gandhi
2048b5913d
server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338)
* Fix DOS index bug

* Remove new APIs

* remove extra line

* Remove from API

* Add extra newline

* Update examples/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-13 11:10:05 +01:00
Concedo
1ef41c2124 streamline output console log (+1 squashed commit)
Squashed commits:

[ca474bdd] streamline output console log
2025-03-13 15:33:49 +08:00
Concedo
16137f4281 gemma3 now works correctly 2025-03-13 14:34:18 +08:00
Concedo
77debb1b1b gemma3 vision works, but is using more tokens than expected - may need resizing 2025-03-13 00:31:16 +08:00
Daniel Bevenius
80a02aa858
llama.swiftui : fix xcframework dir in README [no ci] (#12353)
This commit fixes the path to the xcframework in the README file, which I
had forgotten to update after renaming the build directory.
2025-03-12 13:45:32 +01:00
Xuan-Son Nguyen
7841fc723e
llama : Add Gemma 3 support (+ experimental vision capability) (#12343)
* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
2025-03-12 09:30:24 +01:00
Xuan-Son Nguyen
96e1280839
clip : bring back GPU support (#12322)
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
2025-03-11 09:20:16 +01:00
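
A hedged sketch of what this re-enables; the llava example flags are assumptions, and the link between `n_gpu_layers` and the clip backend is inferred from the bullet above:

```console
# offload layers to the GPU; per this commit, clip follows n_gpu_layers
$ llama-llava-cli -m model.gguf --mmproj mmproj.gguf --image cat.png \
    -ngl 99 -p "Describe the image"
```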
marcoStocchi
6ef79a67ca
common : refactor '-o' option (#12278)
As discussed in PR 'llama-tts : add -o option' (#12042):

* common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option.

* cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.
2025-03-10 13:34:13 +02:00
Olivier Chafik
be421fc429
tool-call: ensure there's always a non-empty tool call id (#12292) 2025-03-10 09:45:29 +00:00
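
One way to observe the guarantee against a tool-capable server; the payload is standard OpenAI-style, and `jq` is used only to pick out the ids:

```console
# every returned tool call should now carry a non-empty "id"
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Weather in Paris?"}],
         "tools": [{"type": "function", "function": {"name": "get_weather",
           "parameters": {"type": "object",
             "properties": {"city": {"type": "string"}}}}}]}' \
    | jq '.choices[0].message.tool_calls[].id'
```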
Olivier Chafik
2b3a25c212
sampler: fixes trigger tokens + lazy grammars (fix typo: cast from token to string) (#12291)
* Fix typo in lazy grammar handling (fixes trigger tokens)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-10 09:44:42 +00:00
tc-mb
8352cdc87b
llava : fix bug in minicpm-v code (#11513)
* fix bug in minicpm-v code

* update readme of minicpm-v
2025-03-10 10:33:24 +02:00
Concedo
6b7c3ae1d3 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	AUTHORS
#	README.md
#	ci/run.sh
#	docs/build.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-metal/CMakeLists.txt
#	scripts/sync-ggml.last
2025-03-10 10:32:41 +08:00
Georgi Gerganov
7ab364390f
server : infill gen ends on new line (#12254) 2025-03-07 20:54:30 +02:00
Sigbjørn Skjæret
8fad3c7a7c
server : Log original chat template parsing error (#12233) 2025-03-07 11:15:33 +01:00
Concedo
ec43d2b147 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	common/common.cpp
#	examples/embedding/embedding.cpp
#	examples/json_schema_to_grammar.py
#	examples/llama.android/llama/src/main/cpp/llama-android.cpp
#	examples/llama.swiftui/README.md
#	examples/llama.swiftui/llama.swiftui.xcodeproj/project.pbxproj
#	examples/lookahead/lookahead.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	requirements.txt
#	requirements/requirements-all.txt
#	scripts/fetch_server_test_models.py
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
2025-03-06 18:54:58 +08:00
Aaron Teo
e9b2f84f14
llava: add big-endian conversion for image encoder (#12218)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-03-06 09:33:21 +01:00
Han Yin
57b6abf85a
android : fix KV cache log message condition (#12212) 2025-03-06 08:22:49 +02:00
Olivier Chafik
669912d9a5
tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
* sampler: turn lazy grammar trigger words into regexes

* add scripts/tool_bench.sh & .py

* constrain llama JSON output regardless of function name if it matches at the beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-05 13:05:13 +00:00
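
Of the bullets above, the `/apply_template` change is the easiest to poke at directly. A sketch with the parameter spelled as the commit describes it; whether it belongs in the query string or the request body is worth double-checking:

```console
# render the chat template server-side with the generation prompt appended
$ curl 'http://localhost:8080/apply_template?add_generation_prompt=true' \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "hi"}]}'
```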
Clauszy
06a92a193a
server : fix cache reuse logic (#12161)
The first KV shift offsets the positions of all tokens after head_c.
If llama_kv_cache_seq_rm is then called with head_c, it removes tokens that should be kept, because their positions have already been offset; for example, a token the shift moved from position 10 to position 7 now falls inside a removal range computed from the old positions.
2025-03-05 09:25:45 +02:00
Daniel Bevenius
a057897ad4
llama : add xcframework build script (#11996)
* llama : add xcframework build script

This commit adds a script to build an XCFramework for the Apple
iOS, macOS, visionOS, and tvOS platforms.

The generated XCFramework can then be added to a project and used in
the same way as a regular framework. The llama.swiftui example project
has been updated to use the XCFramework and can be started using the
following command:
```console
$ open examples/llama.swiftui/llama.swiftui.xcodeproj/
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10747

* examples : remove llama.cpp (source dir ref) from project.pbxproj

This commit removes the reference to llama.cpp from the project.pbxproj
file since Package.swift has been removed.

* ci : updated build.yml to use build-xcframework.sh

* ci : add xcframework build to github releases

This commit adds the ability to create a GitHub release with the
xcframework build artifact.

* scripts : add apple app validation scripts

This commit adds scripts that can validate the iOS, macOS, tvOS, and
visionOS applications. The scripts create a simple test app project,
copy the llama.xcframework to the test project, build and archive the
app, create an IPA from the archive, and validate the IPA using altool.

The motivation for this is to provide some basic validation and
hopefully avoid having to manually validate apps in Xcode.

* llama : remove Package.swift

This commit removes the Package.swift file, as we are now building an
XCFramework for the project.

* llama : remove Sources and spm-headers directories

* llama : use TargetConditionals.h for visionOS/tvOS
2025-03-05 06:30:31 +01:00
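
The resulting top-level flow, using the script names from this commit; the script's location at the repository root is an assumption based on the CI change:

```console
# build the XCFramework for all Apple platforms, then open the sample app
$ ./build-xcframework.sh
$ open examples/llama.swiftui/llama.swiftui.xcodeproj/
```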
mgroeber9110
5bbe6a9fe9
ggml : portability fixes for VS 2017 (#12150)
* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
Concedo
0cddbe1f0b Merge branch 'upstream' into concedo_experimental 2025-03-05 00:22:06 +08:00
Sigbjørn Skjæret
56d7a9f812
main: allow preloading conversation with -p and add -st / --single-turn (#12145)
* Add chat template formatting to -no-cnv

* only enable prompt formatting if explicitly enabled

* add -st / --single-turn

* add --single-turn and -p in conversation mode

* fix -sys + -p

* reword warning

* small readability change and fix for (long) outdated example usage

* only activate single turn in conversation mode
2025-03-04 12:19:39 -04:00
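
A sketch of the new switches working together, using the flag spellings from the bullets above:

```console
# preload the conversation with -p and exit after a single reply
$ llama-cli -m model.gguf -p "Summarize the plan so far." -st
# the same, with a system prompt
$ llama-cli -m model.gguf -sys "You are terse." -p "Hello" --single-turn
```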