Commit graph

79 commits

Author SHA1 Message Date
Concedo
39fad991cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/main/README.md
#	examples/run/run.cpp
2025-02-14 11:34:29 +08:00
Maxim Evtush
7b891bdc86
fix: typos in documentation files (#11791)
* Update ggml.c

* Update arg.cpp

* Update speculative.h
2025-02-10 23:21:31 +01:00
Concedo
27b9358baf Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/run/run.cpp
#	scripts/sync-ggml.last
2025-02-08 01:31:49 +08:00
Daniel Bevenius
b7552cfcbc
common : add default embeddings presets (#11677)
* common : add default embeddings presets

This commit adds default embeddings presets for the following models:
- bge-small-en-v1.5
- e5-small-v2
- gte-small

These can be used with llama-embedding and llama-server.

For example, with llama-embedding:
```console
./build/bin/llama-embedding --embd-gte-small-default -p "Hello, how are you?"
```

And with llama-server:
```console
./build/bin/llama-server --embd-gte-small-default
```
And the embeddings endpoint can then be called with a POST request:
```console
curl --request POST \
    --url http://localhost:8080/embeddings \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello, how are you?"}'
```

I'm not sure if these are the most common embedding models but hopefully
this can be a good starting point for discussion and further
improvements.

Refs: https://github.com/ggerganov/llama.cpp/issues/10932
2025-02-07 09:15:22 +01:00
Concedo
db6db9dff9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/server.yml
#	AUTHORS
#	CMakeLists.txt
#	Makefile
#	README.md
#	cmake/llama.pc.in
#	common/CMakeLists.txt
#	docs/build.md
#	examples/batched.swift/Sources/main.swift
#	examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
#	examples/llava/CMakeLists.txt
#	examples/llava/clip.h
#	examples/run/run.cpp
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
2025-02-07 00:52:31 +08:00
Radoslav Gerganov
1bef571f6a
arg : list RPC devices first when using --list-devices (#11655)
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
2025-02-04 18:16:20 +02:00
Concedo
f13498df13 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	.devops/vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.github/workflows/server.yml
#	Makefile
#	README.md
#	cmake/llama-config.cmake.in
#	common/CMakeLists.txt
#	examples/gbnf-validator/gbnf-validator.cpp
#	examples/run/run.cpp
#	examples/server/README.md
#	examples/server/tests/README.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tests/test-grammar-integration.cpp
2025-02-01 17:14:59 +08:00
Daniel Bevenius
b636228c0a
embedding : enable --no-warmup option (#11475)
This commit enables the `--no-warmup` option for the llama-embeddings.

The motivation for this change is to allow the user to disable the
warmup when running the the program.
2025-01-29 10:38:54 +02:00
Concedo
bec231422a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	docs/build.md
#	docs/docker.md
#	examples/export-lora/export-lora.cpp
#	examples/main/README.md
#	examples/main/main.cpp
#	examples/run/README.md
#	examples/run/run.cpp
#	examples/server/README.md
#	examples/simple-chat/simple-chat.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
2025-01-25 14:16:50 +08:00
Olivier Chafik
6171c9d258
Add Jinja template support (#11016)
* Copy minja from 58f0ca6dd7

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (https://github.com/google/minja/pull/22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to b8437df626

* Update minja to https://github.com/google/minja/pull/25

* Update minja from https://github.com/google/minja/pull/27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-21 13:18:51 +00:00
Georgi Gerganov
80d0d6b4b7
common : add -hfd option for the draft model (#11318)
* common : add -hfd option for the draft model

* cont : fix env var

* cont : more fixes
2025-01-20 22:29:43 +02:00
LostRuins Concedo
6390a998bf
tts : add guide tokens support (#11186)
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.

* applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start
2025-01-18 12:20:57 +02:00
Concedo
96407502cd Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/llama-bench/llama-bench.cpp
#	examples/llama.android/llama/src/main/cpp/llama-android.cpp
#	examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt
#	src/llama-vocab.cpp
#	tests/test-backend-ops.cpp
2025-01-17 23:13:50 +08:00
Radoslav Gerganov
667d72846c
rpc : early register backend devices (#11262)
Early register RPC devices and do not propagate RPC specifics in the
llama model structures.

ref: #10609
2025-01-17 10:57:09 +02:00
Concedo
11cd7c7bb0 survived the storm, again 2025-01-16 22:25:18 +08:00
Xuan Son Nguyen
84a44815f7
cli : auto activate conversation mode if chat template is available (#11214)
* cli : auto activate conversation mode if chat template is detected

* add warn on bad template

* update readme (writing with the help of chatgpt)

* update readme (2)

* do not activate -cnv for non-instruct models
2025-01-13 20:18:12 +01:00
Xuan Son Nguyen
00b4c3da62
common : support tag-based --hf-repo like on ollama (#11195)
* common : support tag-based hf_repo like on ollama

* fix build

* various fixes

* small fixes

* fix style

* fix windows build?

* move common_get_hf_file to common.cpp

* fix complain with noreturn
2025-01-13 13:56:23 +01:00
Concedo
07173e84a0 add ability to use guide tokens for TTS, ref: https://github.com/ggerganov/llama.cpp/pull/11186 2025-01-11 17:18:21 +08:00
Concedo
1b49dc305f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.github/workflows/editorconfig.yml
#	examples/run/run.cpp
#	examples/server/README.md
#	scripts/sync-ggml.last
2025-01-09 16:50:29 +08:00
Georgi Gerganov
a3c1232c3f
arg : option to exclude arguments from specific examples (#11136)
* arg : option to exclude arguments from specific examples

ggml-ci

* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
Concedo
f9f1585a7f broken merge - kcpp changes will be applied above this commit for better tracking. 2025-01-03 23:49:17 +08:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp (#10902)
* llama : scatter llama.cpp into multiple modules (wip)

* llama : control-vector -> adapter

* llama : arch

* llama : mmap

ggml-ci

* ci : remove BUILD_SHARED_LIBS=OFF

ggml-ci

* llama : arch (cont)

ggml-ci

* llama : chat

ggml-ci

* llama : model

ggml-ci

* llama : hparams

ggml-ci

* llama : adapter

ggml-ci

* examples : fix

ggml-ci

* rebase

ggml-ci

* minor

* llama : kv cache

ggml-ci

* llama : impl

ggml-ci

* llama : batch

ggml-ci

* cont

ggml-ci

* llama : context

ggml-ci

* minor

* llama : context (cont)

ggml-ci

* llama : model loader

ggml-ci

* common : update lora

ggml-ci

* llama : quant

ggml-ci

* llama : quant (cont)

ggml-ci

* minor [no ci]
2025-01-03 10:18:53 +02:00
Concedo
4c56b7cada Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/gbnf-validator/gbnf-validator.cpp
#	examples/llava/clip.cpp
#	examples/run/README.md
#	examples/run/run.cpp
#	examples/server/README.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	src/llama.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-llama-grammar.cpp
2024-12-21 09:41:49 +08:00
Molly Sophia
0a11f8b7b5
convert : fix RWKV v6 model conversion (#10913)
* Enable --no-context-shift for llama-perplexity example

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV 6: Fix error in ggml_cuda_op_bin_bcast

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-20 11:44:58 +02:00
Georgi Gerganov
36319dec5d
tts : small QoL for easy model fetch (#10903) 2024-12-19 17:35:15 +02:00
Concedo
ee486bad3e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	examples/CMakeLists.txt
#	examples/batched/batched.cpp
#	examples/gritlm/gritlm.cpp
#	examples/llama.android/llama/build.gradle.kts
#	examples/main/README.md
#	examples/retrieval/retrieval.cpp
#	examples/server/CMakeLists.txt
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml.c
#	scripts/compare-commits.sh
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tests/test-sampling.cpp
2024-12-19 11:57:43 +08:00
Georgi Gerganov
0bf2d10c55
tts : add OuteTTS support (#10784)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : be explicit about the pooling type in the tests

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* llama : add OuteTTS support (wip)

* wip

* extract features

* first conv

* group norm

* resnet conv

* resnet

* attn

* pos net

* layer norm

* convnext

* head

* hann window

* fix n_embd + remove llama.cpp hacks

* compute hann window

* fft

* spectrum processing

* clean-up

* tts : receive input text and generate codes

* clip : fix new conv name

* tts : minor fix

* tts : add header + minor fixes

ggml-ci

* tts : add matchematical constant

ggml-ci

* tts : fix sampling + cut initial noise

* tts : fixes

* tts : update default samplers

ggml-ci

* tts : text pre-processing

* tts : outetts-voc -> wavtokenizer-dec

* tts : remove hardcoded constants

ggml-ci

* tts : fix tensor shapes

* llama : refactor wavtokenizer tensors

ggml-ci

* cont

ggml-ci

* cont [no ci]

* llama : update WavTokenizer to non-causal attn

* llama : handle no-vocab detokenization

* tts : add Python example for OuteTTS (wip)

* tts : extend python example to generate spectrogram

ggml-ci

* server : fix rebase artifacts

* tts : enable "return_tokens" in Python example

ggml-ci

* tts : minor fixes

* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler (#10803)
* sampling : refactor + optimize penalties sampler

ggml-ci

* common : apply ignore_eos as logit bias

ggml-ci

* batched : remove penalties sampler

* params : allow penalty_last_n == -1 to be equal to context size

ggml-ci

* common : by default, move the penalties at the end of the sampling chain

ggml-ci

* common : ignore all EOG tokens

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* common : move back the penalties at the front of the sampling chain

ggml-ci

* readme : restore hint about --ignore-eos flag [no ci]

* llama : minor

ggml-ci

* webui : update

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
Concedo
ed75f8a741 up to date merge, without vulkan-gen-shaders. They will be built before each release from now on, as they are very large 2024-12-13 17:18:01 +08:00
Concedo
de64b9198c merge checkpoint 2 - functional merge without q4_0_4_4 (need regen shaders) 2024-12-13 17:04:19 +08:00
Concedo
4c4ce5e808 rewritten checkpoint 1 - before coopmat 2024-12-13 16:55:23 +08:00
Xuan Son Nguyen
adffa6ffd5
common : improve -ctv -ctk CLI arguments (#10806)
* common : improve ctv ctk cli argument

* regenerate docs

* even better approach

* use std::vector
2024-12-12 22:53:05 +01:00
Xuan Son Nguyen
9fdb124304
common : add missing env var for speculative (#10801) 2024-12-12 16:57:32 +01:00
Bartowski
ae4b922614
imatrix : Add imatrix to --no-context-shift (#10766)
This allows for setting the --no-context-shift value in llama-imatrix which is required for models like DeepSeek
2024-12-10 18:23:50 +01:00
Yüg
a86ad841f1
server : add flag to disable the web-ui (#10762) (#10751)
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
2024-12-10 18:22:34 +01:00
Xuan Son Nguyen
f162d45a21
common : bring back --no-warmup to server (#10686) 2024-12-06 13:29:05 +01:00
Xuan Son Nguyen
642330ac7c
llama : add enum for built-in chat templates (#10623)
* llama : add enum for supported chat templates

* use "built-in" instead of "supported"

* arg: print list of built-in templates

* fix test

* update server README
2024-12-02 22:10:19 +01:00
Concedo
ec95241e38 temp checkpoint 2024-11-30 11:59:27 +08:00
Johannes Gäßler
890719311b
common: fix warning message when no GPU found (#10564) 2024-11-28 18:15:25 +01:00
Xuan Son Nguyen
9f912511bc
common : fix duplicated file name with hf_repo and hf_file (#10550) 2024-11-27 22:30:52 +01:00
Concedo
ec581b19d8 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/ISSUE_TEMPLATE/019-bug-misc.yml
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	examples/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/llama-bench/llama-bench.cpp
#	examples/server/README.md
#	examples/server/server.cpp
#	examples/simple-chat/simple-chat.cpp
#	examples/simple/simple.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-amx/CMakeLists.txt
#	ggml/src/ggml-blas/CMakeLists.txt
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-kompute/CMakeLists.txt
#	ggml/src/ggml-kompute/ggml-kompute.cpp
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-rpc/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	pocs/CMakeLists.txt
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2024-11-26 17:01:20 +08:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model (#10497)
* llama : accept a list of devices to use to offload a model

* accept `--dev none` to completely disable offloading

* fix dev list with dl backends

* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
Concedo
83350ec314 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/020-enhancement.yml
#	.github/ISSUE_TEMPLATE/030-research.yml
#	.github/ISSUE_TEMPLATE/040-refactor.yml
#	.github/workflows/build.yml
#	Makefile
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	examples/infill/infill.cpp
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup-stats.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	examples/retrieval/retrieval.cpp
#	examples/save-load-state/save-load-state.cpp
#	examples/speculative/speculative.cpp
#	flake.lock
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/kernels/CMakeLists.txt
#	ggml/src/ggml-cann/kernels/dup.cpp
#	ggml/src/ggml-cann/kernels/get_row_f16.cpp
#	ggml/src/ggml-cann/kernels/get_row_f32.cpp
#	ggml/src/ggml-cann/kernels/get_row_q4_0.cpp
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2024-11-25 16:26:08 +08:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example (#10362)
* speculative : refactor and add a simpler example

ggml-ci

* speculative : clean-up and add comments and TODOs [no ci]

* speculative : manage context in common_speculative

ggml-ci

* speculative : simplify

ggml-ci

* speculative : simplify (cont)

ggml-ci

* speculative : add --draft-min CLI arg

* speculative : minor fixup

* make : build fixes

* speculative : do not redraft previous drafts

ggml-ci

* speculative : fix the draft sampling

ggml-ci

* speculative : fix compile warning

* common : refactor args

ggml-ci

* common : change defaults [no ci]

* common : final touches

ggml-ci
2024-11-25 09:58:41 +02:00
Concedo
282a647689 Merge commit '467576b6cc' into concedo_experimental
# Conflicts:
#	.gitignore
#	Makefile
#	README.md
#	common/common.h
#	docs/build.md
#	examples/infill/infill.cpp
#	examples/perplexity/perplexity.cpp
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tests/test-quantize-perf.cpp
2024-11-21 16:05:21 +08:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument (#10339) 2024-11-16 23:00:41 +01:00
Concedo
a46f8acd03 note: also has support for completion tokens count 2024-11-01 00:44:14 +08:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling (#10071)
ggml-ci
2024-10-29 10:42:05 +02:00
wwoodsTM
ff252ea48e
llama : add DRY sampler (#9702)
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy
d80fb71f8b
llama: string_split fix (#10022)
* llama: Refactor string_split to use template specialization,  fixes parsing strings with spaces

* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00