Commit graph

7391 commits

Author SHA1 Message Date
Olivier Chafik
a83f528688
tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)
* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-31 14:15:25 +00:00
Olivier Chafik
b1bcd309fc
fix stop regression (#11543) 2025-01-31 13:48:31 +00:00
Olivier Chafik
5783575c9d
Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533) 2025-01-31 08:24:29 +00:00
Olivier Chafik
4a2b196d03
server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531) 2025-01-31 10:12:40 +02:00
Steve Grubb
1bd3047a93
common: Add missing va_end (#11529)
The va_copy man page states that va_end must be called to revert
whatever the copy did. For some implementations, not calling va_end
has no consequences. For others it could leak memory.
2025-01-31 07:58:55 +02:00
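As an illustration of the va_copy/va_end pairing described in 1bd3047a93, here is a minimal sketch. `format_string` is a hypothetical helper written for this example, not the function touched by the commit; it assumes a printf-style formatter that walks the argument list twice and therefore needs a copy.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not the llama.cpp function): printf-style formatting
// into a std::string. The argument list is traversed twice, so it is copied.
std::string format_string(const char * fmt, ...) {
    va_list args;
    va_list args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args);

    // First pass: measure the required length (excluding the terminator).
    const int size = std::vsnprintf(nullptr, 0, fmt, args);
    if (size < 0) {
        va_end(args_copy); // still required on the error path
        va_end(args);
        return std::string();
    }

    // Second pass: format into the buffer using the untouched copy.
    std::vector<char> buf(static_cast<std::size_t>(size) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, args_copy);

    va_end(args_copy); // pairs with va_copy
    va_end(args);      // pairs with va_start
    return std::string(buf.data(), static_cast<std::size_t>(size));
}
```

Every exit path releases both lists, so the sketch behaves the same on implementations where va_copy allocates and on those where it does not.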
Daniel Bevenius
a2df2787b3
server : update help metrics processing/deferred (#11512)
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.

Currently the returned metrics look like this:
```console
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

With this commit, the metrics will look like this:
```console
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
2025-01-31 06:04:53 +01:00
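The metric layout quoted in a2df2787b3 (a HELP line, a TYPE line, then the sample) can be sketched as follows. `format_gauge` is a hypothetical formatter made up for this example and is not the server's implementation; only the `llamacpp:` prefix and the help strings come from the commit message.

```cpp
#include <string>

// Hypothetical formatter, not the server's code: renders one gauge in the
// Prometheus text exposition format (HELP line, TYPE line, sample line).
std::string format_gauge(const std::string & name, const std::string & help, long value) {
    const std::string full = "llamacpp:" + name;
    return "# HELP " + full + " " + help + "\n" +
           "# TYPE " + full + " gauge\n" +
           full + " " + std::to_string(value) + "\n";
}
```

Calling `format_gauge("requests_deferred", "Number of requests deferred.", 0)` reproduces the three lines shown for that metric in the commit message.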
Olivier Chafik
553f1e46e9
ci: ccache for all github workflows (#11516) 2025-01-30 22:01:06 +00:00
Olivier Chafik
8b576b6c55
Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639)
---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-30 19:13:58 +00:00
uvos
27d135c970 HIP: require at least HIP 5.5 2025-01-30 16:25:44 +01:00
uvos
6af1ca48cb HIP: Prepare reduction operators for wave 64 2025-01-30 16:25:44 +01:00
uvos
c300e68ef4 CUDA/HIP: add warp_size to cuda_device_info 2025-01-30 16:25:44 +01:00
Concedo
7a5499e77b added one more backend for clblast noavx2 and clblast failsafe 2025-01-30 22:47:22 +08:00
Concedo
898856e183 cleaned up unused flags from makefile, updated lite 2025-01-30 19:34:55 +08:00
Concedo
fd84b062f9 allow reuse of clip embds 2025-01-30 19:02:45 +08:00
Olivier Chafik
3d804dec76
sync: minja (#11499) 2025-01-30 10:30:27 +00:00
mgroeber9110
ffd0821c57
vocab : correctly identify LF token for GPT-2 style BPE tokenizer (#11496) 2025-01-30 12:10:59 +02:00
Daniel Bevenius
4314e56c4f
server : use lambda instead of std::bind (#11507)
This commit replaces the two usages of `std::bind` in favor of lambdas for
the callback functions for `callback_new_task` and
`callback_update_slots`.

The motivation for this change is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers). Lambdas
are also more readable (though this is perhaps subjective), and they are
recommended over `std::bind` in modern C++.

Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
2025-01-30 11:05:00 +01:00
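To make the std::bind versus lambda comparison from 4314e56c4f concrete, here is a small sketch. `task_queue`, `context`, and both registration functions are hypothetical stand-ins invented for this example; they are not the types or callbacks in server.cpp.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins for a task queue and its owner; these are not the
// types used in server.cpp.
struct task_queue {
    std::function<void(int)> callback_new_task;
    void on_new_task(std::function<void(int)> cb) { callback_new_task = std::move(cb); }
};

struct context {
    std::string log;
    void process_task(int id) { log += "task " + std::to_string(id) + ";"; }
};

// std::bind style: the placeholder hides the callback's parameter list.
inline void register_with_bind(task_queue & q, context & ctx) {
    q.on_new_task(std::bind(&context::process_task, &ctx, std::placeholders::_1));
}

// Lambda style: the capture and the parameter are spelled out explicitly.
inline void register_with_lambda(task_queue & q, context & ctx) {
    q.on_new_task([&ctx](int id) { ctx.process_task(id); });
}
```

Both registrations behave identically here; the lambda version simply states what is captured and what the callback receives at the point of registration.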
Concedo
ba5e94eed2 Revert "Update requirements.txt - include pyinstaller (#1341)"
This reverts commit c27fcc4d4f.
2025-01-30 17:57:48 +08:00
askmyteapot
c27fcc4d4f
Update requirements.txt - include pyinstaller (#1341) 2025-01-30 17:34:44 +08:00
Isaac McFadyen
496e5bf46b
server : (docs) added response format for /apply-template [no ci] (#11503) 2025-01-30 10:11:53 +01:00
Guspan Tanadi
7919256c57
readme : reference examples relative links (#11505) 2025-01-30 06:58:02 +01:00
Daniel Bevenius
e0449763a4
server : update json snippets in README.md [no ci] (#11492)
This commit updates some of the JSON snippets in the README.md file and
removes the `json` language tag from the code blocks.

The motivation for this change is that invalid JSON in a code snippet is
highlighted in red, which can make it somewhat difficult to read and can
be a little distracting.
2025-01-30 05:48:14 +01:00
Nigel Bosch
eb7cf15a80
server : add /apply-template endpoint for additional use cases of Minja functionality (#11489)
* add /apply-template endpoint to server

* remove unnecessary line

* add /apply-template documentation

* return only "prompt" field in /apply-template

* use suggested idea instead of my overly verbose way
2025-01-29 19:45:44 +01:00
Rémy Oudompheng
66ee4f297c
vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360)
* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-01-29 18:29:39 +01:00
Concedo
f4e2f4b069 disable context shift when using mrope 2025-01-30 00:36:05 +08:00
Concedo
646df4b126 default to autoguess for chat completions adapter 2025-01-30 00:25:13 +08:00
Concedo
70f1d8d746 vision can set max res (+1 squashed commits)
Squashed commits:

[938fc655] vision can set max res
2025-01-30 00:19:49 +08:00
Daniel Bevenius
e51c47b401
server : update auto gen files comments [no ci] (#11484)
* server : update auto gen files comments

This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.

The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269b ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").

* squash! server : update auto gen files comments [no ci]

Move comments about file generation to README.md.

* squash! server : update auto gen files comments [no ci]

Remove the comments in server.cpp that mention that information
can be found in the README.md file.
2025-01-29 16:34:18 +01:00
Jeff Bolz
2711d0215f
vulkan: Catch pipeline creation failure and print an error message (#11436)
* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging
2025-01-29 09:26:50 -06:00
Concedo
2f69432774 makefile indentation fix (+1 squashed commits)
Squashed commits:

[f640eb59] makefile indentation fix
2025-01-29 22:18:54 +08:00
Eric Curtin
f0d4b29edf
Parse https://ollama.com/library/ syntax (#11480)
People search for Ollama models using the web UI. This change
allows one to copy the URL from the browser and use it directly
with llama-run.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-29 11:23:10 +00:00
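One way to picture the URL handling described in f0d4b29edf is a prefix strip. `strip_ollama_library_url` is a hypothetical helper written for this example, not the llama-run code, and the example model name is made up; what llama-run does with the remaining name is not shown in the commit message and is left out here.

```cpp
#include <string>

// Hypothetical helper, not the llama-run code: if the argument is a browser
// URL under https://ollama.com/library/, keep only the part after the prefix.
std::string strip_ollama_library_url(const std::string & model) {
    static const std::string prefix = "https://ollama.com/library/";
    if (model.compare(0, prefix.size(), prefix) == 0) {
        return model.substr(prefix.size());
    }
    return model; // anything else passes through unchanged
}
```

With this sketch, a pasted URL such as `https://ollama.com/library/some-model` and the bare name `some-model` end up as the same string.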
Georgi Gerganov
815857791d
sync : ggml 2025-01-29 11:25:29 +02:00
William Tambellini
1a0e87d291
ggml : add option to not print stack on abort (ggml/1081)
* Add option to not print stack on abort

Add option/envvar to disable stack printing on abort.
Also link some unittests with Threads to fix link errors on
ubuntu/g++11.

* Update ggml/src/ggml.c

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-29 11:24:53 +02:00
issixx
d2e518e9b4
ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065)
Some threads kept looping and failed to terminate properly after an abort during CPU execution.

Co-authored-by: issi <issi@gmail.com>
2025-01-29 11:24:51 +02:00
Daniel Bevenius
b636228c0a
embedding : enable --no-warmup option (#11475)
This commit enables the `--no-warmup` option for the llama-embeddings.

The motivation for this change is to allow the user to disable the
warmup when running the program.
2025-01-29 10:38:54 +02:00
Molly Sophia
325afb370a
llama: fix missing k_cache store for rwkv6qwen2 (#11445)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-01-29 12:07:21 +08:00
Emreerdog
794fe23f29
cmake: add hints for locating ggml on Windows using Llama find-package (#11466) 2025-01-28 19:22:06 -04:00
peidaqi
cf8cc856d7
server : Fixed wrong function name in llamacpp server unit test (#11473)
The test_completion_stream_with_openai_library() function actually runs with stream=False by default, and test_completion_with_openai_library() with stream=True.
2025-01-29 00:03:42 +01:00
Xuan-Son Nguyen
d0c08040b6
ci : fix build CPU arm64 (#11472)
* ci : fix build CPU arm64

* failed, trying ubuntu 22

* vulkan: ubuntu 24

* vulkan : jammy --> noble
2025-01-29 00:02:56 +01:00
uvos
be5ef7963f
HIP: Suppress transformation warning in softmax.cu
Loops with bounds not known at compile time cannot be unrolled.
When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM can't unroll the loops here.
2025-01-28 23:06:32 +01:00
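The situation described in be5ef7963f can be sketched in plain C++ rather than HIP. `row_sum` is a hypothetical function made up for this example, not the softmax kernel; it only mirrors the convention that a template value of 0 means "take the column count at runtime".

```cpp
// Sketch: ncols_template > 0 gives a compile-time loop bound; 0 means
// "use the runtime value", so the trip count is not a constant expression.
template <int ncols_template>
float row_sum(const float * x, int ncols_runtime) {
    const int ncols = ncols_template == 0 ? ncols_runtime : ncols_template;
    float sum = 0.0f;
    // With ncols_template > 0 the compiler knows the bound and can fully
    // unroll this loop; with ncols_template == 0 it cannot, which is the
    // case where an unroll request would trigger the warning.
    for (int i = 0; i < ncols; ++i) {
        sum += x[i];
    }
    return sum;
}
```

`row_sum<4>(x, 0)` has a fixed trip count of 4, while `row_sum<0>(x, n)` depends on `n` and has to stay a regular loop.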
Nikita Sarychev
cae9fb4361
HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (#11080)
This disables the workaround on fixed rocblas versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.
2025-01-28 16:42:20 +01:00
Eric Curtin
7fee2889e6
Add github protocol pulling and http:// (#11465)
Adds both as pulling protocols to llama-run.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-28 14:45:41 +00:00
Nuno
d7d1eccacc
docker: allow installing pip packages system-wide (#11437)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2025-01-28 14:17:25 +00:00
someone13574
4bf3119d61
cmake : don't fail on GGML_CPU=OFF (#11457) 2025-01-28 15:15:34 +01:00
Concedo
558bc5c901 tts can now set a length limit 2025-01-28 22:06:59 +08:00
Nuno
f643120bad
docker: add perplexity and bench commands to full image (#11438)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2025-01-28 10:42:32 +00:00
Concedo
c5d4e07664 Merge commit 'acd38efee3' into concedo_experimental
# Conflicts:
#	.devops/cpu.Dockerfile
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	README.md
#	cmake/llama-config.cmake.in
#	examples/simple-cmake-pkg/.gitignore
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
2025-01-28 18:16:44 +08:00
Akarshan Biswas
6e84b0ab8e
SYCL : SOFTMAX F16 mask support and other fixes (#11261)
Implemented F16 src1 (mask) support in ggml_sycl_op_soft_max(), for which a pragma deprecation warning was added in #5021.
To do this, it had to be decoupled from ggml_sycl_op_flatten, which always assumed src1 to be of fp32 type (many OP functions depend on it).

* SYCL: SOFTMAX F16 mask support and other fixes

* test-backend-ops: Add F16 mask test cases
2025-01-28 09:56:58 +00:00
Concedo
6bf0b2d062 try casting the numeric fields read 2025-01-28 17:43:28 +08:00
Michael Engel
2b8525d5c8
Handle missing model in CLI parameters for llama-run (#11399)
The HTTP client in llama-run only prints an error in case the download of
a resource failed. If the model name in the CLI parameter list is missing,
this causes the application to crash.
In order to prevent this, a check for the required model parameter has been
added and errors for resource downloads get propagated to the caller.

Signed-off-by: Michael Engel <mengel@redhat.com>
2025-01-28 08:32:40 +00:00
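The check described in 2b8525d5c8 can be sketched as a parser that reports a missing model instead of continuing. `parse_result` and `parse_model_arg` are hypothetical names invented for this example, and the rule "first argument not starting with `-` is the model" is an assumption made to keep the sketch short; llama-run's real option handling differs.

```cpp
#include <string>
#include <vector>

// Hypothetical result type and parser; llama-run's real option handling differs.
struct parse_result {
    bool        ok;
    std::string model;
    std::string error;
};

// Treat the first argument that does not start with '-' as the model and
// report a missing model as an error instead of continuing without one.
parse_result parse_model_arg(const std::vector<std::string> & args) {
    for (const std::string & arg : args) {
        if (!arg.empty() && arg[0] != '-') {
            return {true, arg, ""};
        }
    }
    return {false, "", "error: no model specified"};
}
```

A caller would check `ok` and print `error` before exiting with a non-zero status, so the failure is propagated rather than leading to a crash later on. Options that take a separate value would need extra handling that this sketch omits.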