Concedo
a46f8acd03
note: also has support for completion tokens count
2024-11-01 00:44:14 +08:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling ( #10071 )
...
ggml-ci
2024-10-29 10:42:05 +02:00
Georgi Gerganov
8125e6cbfc
server : don't overfill the batch during infill ( #10018 )
...
ggml-ci
2024-10-28 08:49:32 +02:00
wwoodsTM
ff252ea48e
llama : add DRY sampler ( #9702 )
...
* sampling : add DRY sampler (post-refactor)
* DRY: Trying to fix coauthors, removed unneeded line
* DRY: Fixed redundant code
* DRY: Fixed crash issue due to DRY being in chain but uninitialized
---------
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy
d80fb71f8b
llama: string_split fix ( #10022 )
...
* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces
* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00
Georgi Gerganov
bc5ba007b2
server : check that the prompt fits in the slot's context ( #10030 )
...
ggml-ci
2024-10-25 10:13:46 +03:00
Xuan Son Nguyen
958367bf53
server : refactor slot input data, move tokenizer to HTTP thread ( #10023 )
...
* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere
2024-10-24 21:51:22 +02:00
Concedo
94a5a27b85
Alone in the darkness
...
They're coming for you
I know they will try to catch me too
Alone in the darkness
They're calling for you
There's nowhere to run for cover
2024-10-24 22:29:20 +08:00
wwoodsTM
0a1c750c80
server : samplers accept the prompt correctly ( #10019 )
2024-10-23 22:27:51 +03:00
Xuan Son Nguyen
cda0e4b648
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch ( #9745 )
...
* refactor llama_batch_get_one
* adapt all examples
* fix simple.cpp
* fix llama_bench
* fix
* fix context shifting
* free batch before return
* use common_batch_add, reuse llama_batch in loop
* null terminated seq_id list
* fix save-load-state example
* fix perplexity
* correct token pos in llama_batch_allocr
2024-10-18 23:18:01 +02:00
Georgi Gerganov
8901755ba3
server : add n_indent parameter for line indentation requirement ( #9929 )
...
ggml-ci
2024-10-18 07:32:19 +03:00
Concedo
a9dbcdd3ec
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build.md
# examples/infill/infill.cpp
# examples/main/README.md
# examples/server/README.md
# flake.lock
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-sampling.cpp
2024-10-17 16:36:02 +08:00
Alexey Parfenov
1f66b699c4
server : fix the disappearance of the end of the text ( #9867 )
...
* server: fix the disappearance of the end of the text when streaming with stop strings
* simplify "send text" checks
2024-10-16 11:35:53 +03:00
Georgi Gerganov
223c25a72f
server : improve infill context reuse ( #9894 )
...
ggml-ci
2024-10-15 16:28:55 +03:00
MaggotHATE
fbc98b748e
sampling : add XTC sampler ( #9742 )
...
* Initial XTC commit
Adds XTC sampler, not activated by default, but recommended settings by default.
* Cleanup
* Simplified chances calculation
To be more in line with the original implementation, chance is calculated once at the beginning.
* First fixes by comments
Still need to look into sorting
* Fixed trailing backspaces
* Fixed RNG to be reproducible
Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`
Moved from conditions to a simple check at the end.
* Fixed broken randomization
Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm
Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework
1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
2024-10-15 12:54:55 +02:00
Georgi Gerganov
d4c19c0f5c
server : accept extra_context for the infill endpoint ( #9874 )
...
* server : accept extra_context for the infill endpoint
ggml-ci
* server : update readme [no ci]
* server : use repo-level FIM pattern if possible
ggml-ci
2024-10-13 21:31:35 +03:00
Georgi Gerganov
c7181bd294
server : reuse cached context chunks ( #9866 )
...
ggml-ci
2024-10-13 18:52:48 +03:00
Georgi Gerganov
edc265661c
server : add option to time limit the generation phase ( #9865 )
...
ggml-ci
2024-10-12 16:14:27 +03:00
Georgi Gerganov
1bde94dd02
server : remove self-extend features ( #9860 )
...
* server : remove self-extend
ggml-ci
* server : fix context limit check to use slot.n_past
ggml-ci
2024-10-12 16:06:31 +03:00
Georgi Gerganov
95c76e8e92
server : remove legacy system_prompt feature ( #9857 )
...
* server : remove legacy system_prompt feature
ggml-ci
* readme : update [no ci]
* server : fix non-transformer logic + remove response from /props
2024-10-12 14:51:54 +03:00
Georgi Gerganov
11ac9800af
llama : improve infill support and special token detection ( #9798 )
...
* llama : improve infill support
ggml-ci
* llama : add more FIM token strings
ggml-ci
* server : update prompt on slot restore ( #9800 )
* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00
Concedo
e692a79aab
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# CMakeLists.txt
# CONTRIBUTING.md
# docs/android.md
# docs/docker.md
# examples/embedding/embedding.cpp
# examples/imatrix/imatrix.cpp
# examples/infill/infill.cpp
# examples/llama-bench/llama-bench.cpp
# examples/main/README.md
# examples/parallel/parallel.cpp
# examples/perplexity/perplexity.cpp
# examples/quantize-stats/quantize-stats.cpp
# examples/save-load-state/save-load-state.cpp
# examples/server/README.md
# examples/simple/CMakeLists.txt
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/CMakeLists.txt
# ggml/src/ggml-blas.cpp
# pocs/vdot/q8dot.cpp
# pocs/vdot/vdot.cpp
# scripts/debug-test.sh
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
# tests/test-tokenizer-0.cpp
# tests/test-tokenizer-1-bpe.cpp
# tests/test-tokenizer-1-spm.cpp
2024-10-11 11:59:59 +08:00
Diego Devesa
7eee341bee
common : use common_ prefix for common library functions ( #9805 )
...
* common : use common_ prefix for common library functions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
Xuan Son Nguyen
458367a906
server : better security control for public deployments ( #9776 )
...
* server : more explicit endpoint access settings
* protect /props endpoint
* fix tests
* update server docs
* fix typo
* fix tests
2024-10-08 13:27:04 +02:00
Concedo
da6cf261a8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/close-issue.yml
# .github/workflows/nix-ci-aarch64.yml
# .github/workflows/nix-ci.yml
# README.md
# ci/run.sh
# examples/server/README.md
# ggml/src/ggml-cuda.cu
# ggml/src/ggml-metal.m
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2024-10-05 22:24:08 +08:00
Georgi Gerganov
8c475b97b8
rerank : use [SEP] token instead of [BOS] ( #9737 )
...
* rerank : use [SEP] token instead of [BOS]
ggml-ci
* common : sanity check for non-NULL tokens
ggml-ci
* ci : adjust rank score interval
ggml-ci
* ci : add shebang to run.sh
ggml-ci
2024-10-05 15:55:04 +03:00
Concedo
ce7f9c9a2c
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-rocm.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .github/workflows/build.yml
# .github/workflows/python-type-check.yml
# CMakeLists.txt
# CONTRIBUTING.md
# README.md
# ci/run.sh
# examples/embedding/embedding.cpp
# examples/server/README.md
# flake.lock
# ggml/include/ggml.h
# ggml/src/ggml.c
# requirements/requirements-convert_legacy_llama.txt
# scripts/sync-ggml.last
# src/llama-vocab.cpp
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-grad0.cpp
# tests/test-tokenizer-0.cpp
2024-10-02 01:00:57 +08:00
Georgi Gerganov
f4d2b8846a
llama : add reranking support ( #9510 )
...
* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classification head (wip) [no ci]
* llama : add "rank" pooling type
ggml-ci
* server : add rerank endpoint
ggml-ci
* llama : avoid ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style
ggml-ci
* server : initiate tests for later
ggml-ci
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests
ggml-ci
* add reranking test
* change test data
* Update examples/server/server.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci]
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
Concedo
ea55f69dc1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .dockerignore
# .github/workflows/build.yml
# .github/workflows/docker.yml
# Makefile
# README.md
# examples/infill/infill.cpp
# examples/perplexity/perplexity.cpp
# examples/server/README.md
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/CMakeLists.txt
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-sampling.cpp
2024-09-27 11:21:28 +08:00
Xuan Son Nguyen
afbbfaa537
server : add more env vars, improve gen-docs ( #9635 )
...
* server : add more env vars, improve gen-docs
* update server docs
* LLAMA_ARG_NO_CONTEXT_SHIFT
2024-09-25 14:05:13 +02:00
StrangeBytesDev
0aa15011e3
server : add newline after chat example ( #9616 )
2024-09-24 09:04:39 +03:00
Xuan Son Nguyen
0b3bf966f4
server : add --no-context-shift option ( #9607 )
...
* server : add --no-context-shift option
* small fix
* Update examples/server/tests/features/embeddings.feature
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* tests : minor fix
* revert usage of GGML_ASSERT
* update server documentation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-23 22:23:54 +02:00
Concedo
55a249d222
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/imatrix/imatrix.cpp
# examples/infill/infill.cpp
# examples/perplexity/perplexity.cpp
2024-09-20 18:03:45 +08:00
Georgi Gerganov
6026da52d6
server : clean-up completed tasks from waiting list ( #9531 )
...
ggml-ci
2024-09-19 12:44:53 +03:00
Concedo
29625c3d2e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# Makefile
# README.md
# ci/run.sh
# common/CMakeLists.txt
# common/common.cpp
# docs/backend/SYCL.md
# examples/embedding/embedding.cpp
# examples/imatrix/imatrix.cpp
# examples/infill/infill.cpp
# examples/llama-bench/llama-bench.cpp
# examples/main/README.md
# examples/parallel/parallel.cpp
# examples/perplexity/perplexity.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# examples/server/bench/README.md
# examples/server/tests/README.md
# examples/speculative/speculative.cpp
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# scripts/compare-commits.sh
# scripts/compare-llama-bench.py
# tests/CMakeLists.txt
2024-09-19 14:53:57 +08:00
Eric Zhang
f799155ab8
server : fix OpenSSL build (remove obsolete LOG_INFO) ( #9529 )
2024-09-18 09:28:20 +03:00
Georgi Gerganov
6262d13e0b
common : reimplement logging ( #9418 )
...
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-15 20:46:12 +03:00
Concedo
ab41e324d6
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# examples/server/CMakeLists.txt
# ggml/src/CMakeLists.txt
2024-09-15 19:28:05 +08:00
VoidIsVoid
dcdcee3a74
server: add data: [DONE] to /chat/completions stream response ( #9459 )
2024-09-14 11:36:44 +02:00
Xuan Son Nguyen
feff4aa846
server : add loading html page while model is loading ( #9468 )
...
* Adding loading page for '/' server requests
* set content when model is loading
* removed loading html file
* updated cmakelist
* updated makefile
* cleaned up whitespace
* cleanup for PR removed error
* updated server test to handle 503 HTML
* updated server test to handle 503 HTML
* catch 503 before parsing json
* revert test
* account for both api and web browser requests
* precommit corrections
* eol fix
* revert changes to pre-commit
* removed print statement
* made loading message more descriptive
* also support .html files
---------
Co-authored-by: VJHack <flymyplane21@gmail.com>
Co-authored-by: Vinesh Janarthanan <36610342+VJHack@users.noreply.github.com>
2024-09-13 14:23:11 +02:00
Concedo
e44ddf26ef
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# Makefile
# examples/embedding/embedding.cpp
# examples/imatrix/imatrix.cpp
# examples/llama-bench/llama-bench.cpp
# examples/llava/MobileVLM-README.md
# examples/parallel/parallel.cpp
# examples/perplexity/perplexity.cpp
# examples/quantize/CMakeLists.txt
# examples/server/README.md
# examples/speculative/speculative.cpp
# tests/test-backend-ops.cpp
2024-09-13 16:17:24 +08:00
Mathijs Henquet
78203641fe
server : Add option to return token pieces in /tokenize endpoint ( #9108 )
...
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fixes windows ci?
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-09-12 22:30:11 +02:00
Concedo
13394368b6
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/embedding/embedding.cpp
# examples/infill/infill.cpp
# examples/perplexity/perplexity.cpp
# flake.lock
# src/llama-sampling.cpp
2024-09-11 20:27:53 +08:00
slaren
49006c67b4
llama : move random seed generation to the samplers ( #9398 )
...
* llama_sampler_penalties : clamp penalty_last_n to zero
2024-09-10 18:04:25 +02:00
Concedo
a947558e0e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# common/CMakeLists.txt
# common/common.cpp
# common/common.h
# examples/embedding/embedding.cpp
# examples/imatrix/imatrix.cpp
# examples/infill/infill.cpp
# examples/parallel/parallel.cpp
# examples/perplexity/perplexity.cpp
# examples/rpc/README.md
# examples/save-load-state/save-load-state.cpp
# examples/server/README.md
# examples/speculative/speculative.cpp
# tests/test-sampling.cpp
2024-09-10 16:39:23 +08:00
Xuan Son Nguyen
bfe76d4a17
common : move arg parser code to arg.cpp ( #9388 )
...
* common : move arg parser to arg.cpp
* better categorize args
* add cmake
* missing climits
* missing cstdarg
* common : more explicit includes
* fix build
* refactor gpt_params_parse
* update server readme
* fix test
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-09 23:36:09 +02:00
slaren
5fb5e24811
llama : minor sampling refactor (2) ( #9386 )
2024-09-09 17:10:46 +02:00
Concedo
b63158005f
All samplers moved to kcpp side
2024-09-09 18:14:11 +08:00
Concedo
12fd16bfd4
Merge commit 'df270ef745' into concedo_experimental
...
# Conflicts:
# Makefile
# common/CMakeLists.txt
# common/common.h
# common/sampling.cpp
# common/sampling.h
# examples/infill/infill.cpp
# examples/llama-bench/llama-bench.cpp
# examples/quantize-stats/quantize-stats.cpp
# examples/server/server.cpp
# include/llama.h
# src/llama-sampling.cpp
# src/llama-sampling.h
# src/llama.cpp
# tests/test-grammar-integration.cpp
# tests/test-grammar-parser.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-llama-grammar.cpp
# tests/test-sampling.cpp
2024-09-09 17:10:08 +08:00
Concedo
70cdb55cc9
Merge commit '947538acb8' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakePresets.json
# examples/llama-bench/llama-bench.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2024-09-09 11:26:34 +08:00