Olivier Chafik
6171c9d258
Add Jinja template support ( #11016 )
...
* Copy minja from 58f0ca6dd7
* Add --jinja and --chat-template-file flags
* Add missing <optional> include
* Avoid print in get_hf_chat_template.py
* No designated initializers yet
* Try and work around msvc++ non-macro max resolution quirk
* Update test_chat_completion.py
* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template
* Refactor test-chat-template
* Test templates w/ minja
* Fix deprecation
* Add --jinja to llama-run
* Update common_chat_format_example to use minja template wrapper
* Test chat_template in e2e test
* Update utils.py
* Update test_chat_completion.py
* Update run.cpp
* Update arg.cpp
* Refactor common_chat_* functions to accept minja template + use_jinja option
* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE
* Revert LLAMA_CHATML_TEMPLATE refactor
* Normalize newlines in test-chat-templates for windows tests
* Forward decl minja::chat_template to avoid eager json dep
* Flush stdout in chat template before potential crash
* Fix copy elision warning
* Rm unused optional include
* Add missing optional include to server.cpp
* Disable jinja test that has a cryptic windows failure
* minja: fix vigogne (https://github.com/google/minja/pull/22 )
* Apply suggestions from code review
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Finish suggested renamings
* Move chat_templates inside server_context + remove mutex
* Update --chat-template-file w/ recent change to --chat-template
* Refactor chat template validation
* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)
* Warn against missing eos / bos tokens when jinja template references them
* rename: common_chat_template[s]
* reinstate assert on chat_templates.template_default
* Update minja to b8437df626
* Update minja to https://github.com/google/minja/pull/25
* Update minja from https://github.com/google/minja/pull/27
* rm unused optional header
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-21 13:18:51 +00:00
Georgi Gerganov
80d0d6b4b7
common : add -hfd option for the draft model ( #11318 )
...
* common : add -hfd option for the draft model
* cont : fix env var
* cont : more fixes
2025-01-20 22:29:43 +02:00
LostRuins Concedo
6390a998bf
tts : add guide tokens support ( #11186 )
...
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.
* applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start
2025-01-18 12:20:57 +02:00
Concedo
96407502cd
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/llama-bench/llama-bench.cpp
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt
# src/llama-vocab.cpp
# tests/test-backend-ops.cpp
2025-01-17 23:13:50 +08:00
Radoslav Gerganov
667d72846c
rpc : early register backend devices ( #11262 )
...
Early register RPC devices and do not propagate RPC specifics in the
llama model structures.
ref: #10609
2025-01-17 10:57:09 +02:00
Concedo
11cd7c7bb0
survived the storm, again
2025-01-16 22:25:18 +08:00
Xuan Son Nguyen
84a44815f7
cli : auto activate conversation mode if chat template is available ( #11214 )
...
* cli : auto activate conversation mode if chat template is detected
* add warn on bad template
* update readme (writing with the help of chatgpt)
* update readme (2)
* do not activate -cnv for non-instruct models
2025-01-13 20:18:12 +01:00
Xuan Son Nguyen
00b4c3da62
common : support tag-based --hf-repo like on ollama ( #11195 )
...
* common : support tag-based hf_repo like on ollama
* fix build
* various fixes
* small fixes
* fix style
* fix windows build?
* move common_get_hf_file to common.cpp
* fix complain with noreturn
2025-01-13 13:56:23 +01:00
Concedo
07173e84a0
add ability to use guide tokens for TTS, ref: https://github.com/ggerganov/llama.cpp/pull/11186
2025-01-11 17:18:21 +08:00
Concedo
1b49dc305f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/docker.yml
# .github/workflows/editorconfig.yml
# examples/run/run.cpp
# examples/server/README.md
# scripts/sync-ggml.last
2025-01-09 16:50:29 +08:00
Georgi Gerganov
a3c1232c3f
arg : option to exclude arguments from specific examples ( #11136 )
...
* arg : option to exclude arguments from specific examples
ggml-ci
* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
Concedo
f9f1585a7f
broken merge - kcpp changes will be applied above this commit for better tracking.
2025-01-03 23:49:17 +08:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp
( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
Concedo
4c56b7cada
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/gbnf-validator/gbnf-validator.cpp
# examples/llava/clip.cpp
# examples/run/README.md
# examples/run/run.cpp
# examples/server/README.md
# ggml/src/ggml-cpu/CMakeLists.txt
# src/llama.cpp
# tests/test-grammar-integration.cpp
# tests/test-llama-grammar.cpp
2024-12-21 09:41:49 +08:00
Molly Sophia
0a11f8b7b5
convert : fix RWKV v6 model conversion ( #10913 )
...
* Enable --no-context-shift for llama-perplexity example
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV 6: Fix error in ggml_cuda_op_bin_bcast
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-20 11:44:58 +02:00
Georgi Gerganov
36319dec5d
tts : small QoL for easy model fetch ( #10903 )
2024-12-19 17:35:15 +02:00
Concedo
ee486bad3e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# examples/CMakeLists.txt
# examples/batched/batched.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/build.gradle.kts
# examples/main/README.md
# examples/retrieval/retrieval.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml.c
# scripts/compare-commits.sh
# scripts/sync-ggml.last
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-sampling.cpp
2024-12-19 11:57:43 +08:00
Georgi Gerganov
0bf2d10c55
tts : add OuteTTS support ( #10784 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : be explicit about the pooling type in the tests
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* llama : add OuteTTS support (wip)
* wip
* extract features
* first conv
* group norm
* resnet conv
* resnet
* attn
* pos net
* layer norm
* convnext
* head
* hann window
* fix n_embd + remove llama.cpp hacks
* compute hann window
* fft
* spectrum processing
* clean-up
* tts : receive input text and generate codes
* clip : fix new conv name
* tts : minor fix
* tts : add header + minor fixes
ggml-ci
* tts : add matchematical constant
ggml-ci
* tts : fix sampling + cut initial noise
* tts : fixes
* tts : update default samplers
ggml-ci
* tts : text pre-processing
* tts : outetts-voc -> wavtokenizer-dec
* tts : remove hardcoded constants
ggml-ci
* tts : fix tensor shapes
* llama : refactor wavtokenizer tensors
ggml-ci
* cont
ggml-ci
* cont [no ci]
* llama : update WavTokenizer to non-causal attn
* llama : handle no-vocab detokenization
* tts : add Python example for OuteTTS (wip)
* tts : extend python example to generate spectrogram
ggml-ci
* server : fix rebase artifacts
* tts : enable "return_tokens" in Python example
ggml-ci
* tts : minor fixes
* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler ( #10803 )
...
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
Concedo
ed75f8a741
up to date merge, without vulkan-gen-shaders. They will be built before each release from now on, as they are very large
2024-12-13 17:18:01 +08:00
Concedo
de64b9198c
merge checkpoint 2 - functional merge without q4_0_4_4 (need regen shaders)
2024-12-13 17:04:19 +08:00
Concedo
4c4ce5e808
rewritten checkpoint 1 - before coopmat
2024-12-13 16:55:23 +08:00
Xuan Son Nguyen
adffa6ffd5
common : improve -ctv -ctk CLI arguments ( #10806 )
...
* common : improve ctv ctk cli argument
* regenerate docs
* even better approach
* use std::vector
2024-12-12 22:53:05 +01:00
Xuan Son Nguyen
9fdb124304
common : add missing env var for speculative ( #10801 )
2024-12-12 16:57:32 +01:00
Bartowski
ae4b922614
imatrix : Add imatrix to --no-context-shift ( #10766 )
...
This allows for setting the --no-context-shift value in llama-imatrix which is required for models like DeepSeek
2024-12-10 18:23:50 +01:00
Yüg
a86ad841f1
server : add flag to disable the web-ui ( #10762 ) ( #10751 )
...
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
2024-12-10 18:22:34 +01:00
Xuan Son Nguyen
f162d45a21
common : bring back --no-warmup to server ( #10686 )
2024-12-06 13:29:05 +01:00
Xuan Son Nguyen
642330ac7c
llama : add enum for built-in chat templates ( #10623 )
...
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
2024-12-02 22:10:19 +01:00
Concedo
ec95241e38
temp checkpoint
2024-11-30 11:59:27 +08:00
Johannes Gäßler
890719311b
common: fix warning message when no GPU found ( #10564 )
2024-11-28 18:15:25 +01:00
Xuan Son Nguyen
9f912511bc
common : fix duplicated file name with hf_repo and hf_file ( #10550 )
2024-11-27 22:30:52 +01:00
Concedo
ec581b19d8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# Package.swift
# examples/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/llama-bench/llama-bench.cpp
# examples/server/README.md
# examples/server/server.cpp
# examples/simple-chat/simple-chat.cpp
# examples/simple/simple.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-amx/CMakeLists.txt
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-kompute/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# pocs/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2024-11-26 17:01:20 +08:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model ( #10497 )
...
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
Concedo
83350ec314
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# .github/workflows/build.yml
# Makefile
# common/CMakeLists.txt
# examples/CMakeLists.txt
# examples/infill/infill.cpp
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/kernels/CMakeLists.txt
# ggml/src/ggml-cann/kernels/dup.cpp
# ggml/src/ggml-cann/kernels/get_row_f16.cpp
# ggml/src/ggml-cann/kernels/get_row_f32.cpp
# ggml/src/ggml-cann/kernels/get_row_q4_0.cpp
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2024-11-25 16:26:08 +08:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
Concedo
282a647689
Merge commit ' 467576b6cc
' into concedo_experimental
...
# Conflicts:
# .gitignore
# Makefile
# README.md
# common/common.h
# docs/build.md
# examples/infill/infill.cpp
# examples/perplexity/perplexity.cpp
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tests/test-quantize-perf.cpp
2024-11-21 16:05:21 +08:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument ( #10339 )
2024-11-16 23:00:41 +01:00
Concedo
a46f8acd03
note: also has support for completion tokens count
2024-11-01 00:44:14 +08:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling ( #10071 )
...
ggml-ci
2024-10-29 10:42:05 +02:00
wwoodsTM
ff252ea48e
llama : add DRY sampler ( #9702 )
...
* sampling : add DRY sampler (post-refactor)
* DRY: Trying to fix coauthors, removed unneeded line
* DRY: Fixed redundant code
* DRY: Fixed crash issue due to DRY being in chain but uninitialized
---------
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy
d80fb71f8b
llama: string_split fix ( #10022 )
...
* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces
* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00
Concedo
94a5a27b85
Alone in the darkness
...
They're coming for you
I know they will try to catch me too
Alone in the darkness
They're calling for you
There's nowhere to run for cover
2024-10-24 22:29:20 +08:00
Daniel Bevenius
674804a996
arg : fix typo in embeddings argument help [no ci] ( #9994 )
...
This commit fixes two typos in the help text for the `--embd-normalize`
and `--embd-separator` arguments. It also updates common.h which contain
the same typo in two comments.
2024-10-22 10:40:02 +03:00
Daniel Bevenius
94008cc760
arg : fix attention non-causal arg value hint ( #9985 )
...
This commit updates the argument value hint for the `--attention`
argument to `non-causal`.
The motivation for this change is that the only values for this argument
are `causal` and `non-causal`.
2024-10-21 21:12:52 +03:00
Concedo
a9dbcdd3ec
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build.md
# examples/infill/infill.cpp
# examples/main/README.md
# examples/server/README.md
# flake.lock
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-sampling.cpp
2024-10-17 16:36:02 +08:00
MaggotHATE
fbc98b748e
sampling : add XTC sampler ( #9742 )
...
* Initial XTC commit
Adds XTC sampler, not activated by default, but recommended settings by default.
* Cleanup
* Simplified chances calculation
To be more inline with the original implementation, chance is calculated once at the beginning.
* First fixes by comments
Still need to look into sorting
* Fixed trailing backspaces
* Fixed RNG to be reproduceable
Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`
Moved from conditions to a simple check at the end.
* Fixed broken randomization
Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm
Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework
1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
2024-10-15 12:54:55 +02:00
Georgi Gerganov
c7181bd294
server : reuse cached context chunks ( #9866 )
...
ggml-ci
2024-10-13 18:52:48 +03:00
Georgi Gerganov
1bde94dd02
server : remove self-extend features ( #9860 )
...
* server : remove self-extend
ggml-ci
* server : fix context limit check to use slot.n_past
ggml-ci
2024-10-12 16:06:31 +03:00
Georgi Gerganov
95c76e8e92
server : remove legacy system_prompt feature ( #9857 )
...
* server : remove legacy system_prompt feature
ggml-ci
* readme : update [no ci]
* server : fix non-transformer logic + remove response from /props
2024-10-12 14:51:54 +03:00
Georgi Gerganov
11ac9800af
llama : improve infill support and special token detection ( #9798 )
...
* llama : improve infill support
ggml-ci
* llama : add more FIM token strings
ggml-ci
* server : update prompt on slot restore (#9800 )
* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00