Commit graph

1696 commits

Author SHA1 Message Date
Miwa / Ensan
5c9f90cba1
swift : fix prompt tokenization logic (#4321) 2023-12-04 15:43:45 +02:00
Concedo
ac36aee001 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-12-03 21:56:29 +08:00
Concedo
48544cd2ef Revert "Revert "ggml : add ggml_soft_max_ext (#4256)""
This reverts commit a8e66ef31c.
2023-12-03 21:46:50 +08:00
Ed Lee
33e171d1e9
server : fix OpenAI API stop field to be optional (#4299)
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)
2023-12-03 11:10:43 +02:00
Rickard Edén
6949b50df5
py : add grammar to oai like api (#4294) 2023-12-03 11:03:25 +02:00
Concedo
a8e66ef31c Revert "ggml : add ggml_soft_max_ext (#4256)"
This reverts commit ef47ec18da.
2023-12-03 00:42:01 +08:00
Georgi Gerganov
d5a1cbde60
llama : support optional tensors (#4283) 2023-12-01 20:35:47 +02:00
Miwa / Ensan
b220222a64
swift : fix token_to_piece implementation (#4278)
* Fix token_to_piece implementation in Swift

* Fix errors
2023-12-01 20:19:45 +02:00
Concedo
495bb3ab1e Merge branch 'master' into concedo_experimental 2023-12-01 23:48:20 +08:00
Concedo
4f40c226a0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
2023-12-01 23:46:59 +08:00
Georgi Gerganov
ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
* metal : implement soft_max_ext

* cuda : implement soft_max_ext

* ggml : implement soft_max_ext (CPU)

* batched-bench : print threads

ggml-ci

* metal : simplify soft_max encoding

ggml-ci

* cuda : use 512 threads for soft_max instead of 32

* ggml : update soft max cpu

* cuda : do warp-based block reduce

* cuda : increase max block size to 1024

* cuda : fix warp reduction initialization of shared mem

* metal : warp-based reduction for soft max kernel

* metal : warp-based reduce for rms_norm

* metal : simplify soft max kernel

ggml-ci

* alloc : fix build with debug
2023-12-01 10:51:24 +02:00
Ziad Ben Hadj-Alouane
1d144112c0
server : add --log-disable to disable logging to file (#4260)
* * add --log-disable to disable logging to file in the server example

* * typo fix
2023-12-01 00:25:49 +02:00
Ziad Ben Hadj-Alouane
f43f09366d
server : add single-client multi-prompt support (#4232)
* * add multiprompt support

* * cleanup

* * more cleanup

* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests

* * remove all references to mutex_multitasks

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* * change to set

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-12-01 00:25:04 +02:00
John
33c9892af5
llava : ShareGPT4V compatibility (vision encoder only loading) (#4172)
* ShareGPT4 compatibility (vision encoder only loading)

Load only a CLIP vision encoder (as supplied by ShareGPT finetunes)
Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access)
Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them

* Update convert-image-encoder-to-gguf.py
2023-11-30 23:11:14 +01:00
Andrew Godfrey
8efa0f6ebe
main : pass LOG_TEE callback to llama.cpp log (#4033)
* main : Call llama_log_set to use LOG_TEE

* tabs to spaces
2023-11-30 23:56:19 +02:00
Miwa / Ensan
bde629bb53
batched.swift : update README.md (#4214)
docs: update how to run
2023-11-30 23:45:17 +02:00
rhjdvsgsgks
e2bd725f4b
py : fix oai proxy (#3972)
* fix oai proxy

fix generation not stoped while bot stop talking in chat mode

fix possible `slot_id` not exist

response for cors (and pre flight)

* oai proxy: workaround for some client (such as Chatbox)

* use stop as separator to replace hardcoded `\n`
2023-11-30 22:50:40 +02:00
Concedo
e9724cdc9d Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-11-30 14:31:53 +08:00
Georgi Gerganov
1f5cd83275
examples : add readme files 2023-11-29 11:00:17 +02:00
Concedo
581021ab93 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	scripts/build-info.cmake
2023-11-28 20:57:56 +08:00
Bailey Chittle
bb03290c17
examples : iOS example with swift ui (#4159)
* copy to llama.cpp as subdir

* attempt enabling metal, fails

* ggml metal compiles!

* Update README.md

* initial conversion to new format, utf8 errors?

* bug fixes, but now has an invalid memory access :(

* added O3, now has insufficient memory access

* begin sync with master

* update to match latest code, new errors

* fixed it!

* fix for loop conditionals, increase result size

* fix current workflow errors

* attempt a llama.swiftui workflow

* Update .github/workflows/build.yml

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-27 16:56:52 +02:00
Concedo
8acd7be734 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-11-27 14:06:14 +08:00
Georgi Gerganov
3e73d31d9c
lookahead : support -n -1 infinite generation 2023-11-26 21:52:23 +02:00
Georgi Gerganov
922754a8d6
lookahead : add example for lookahead decoding (#4207)
* lookahead : init

* lookahead : generate and store n-grams

* lookahead : use loop instead recursion to generate n-grams

* lookahead : initial working implementation

* lookahead : filter repeating n-grams

* lookahead : use deterministic init

* lookahead : add to Makefile

* lookahead : fix a bug in the seq_id of the lookahead tokens

* lookahead : add comments

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-11-26 20:33:07 +02:00
Georgi Gerganov
af19d35734
server : OAI API compatibility (#4198)
* Add openai-compatible POST /v1/chat/completions API endpoint to server example

* fix code style

* Update server README.md

* Improve server README.md

* Fix server.cpp code style according to review

* server : some style changes

* server : indentation

* server : enable special tokens during tokenization by default

* server : minor code style

* server : change random string generator

* straightforward /v1/models endpoint

---------

Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
2023-11-25 11:29:06 +02:00
eastriver
2568a4bf54
main.swift : fix eos checking (#4197)
llama_token_eos(const struct llama_model *) is currently getting struct llama_context type variable context as a parameter.
2023-11-24 11:25:10 +02:00
Haohui Mai
55978ce09b
Fix incorrect format strings and uninitialized variables. (#4133)
* Fix incorrect format strings and uninitialized variables.

* Address comments

* Add the missing include statement
2023-11-23 22:56:53 +01:00
Georgi Gerganov
6b0a7420d0
llama : KV cache view API + better KV cache management (#4170)
* llama : keep track of used KV cells + better KV cache management

* llama : zero KV cache used upon clear

ggml-ci

* llama : allow exporting a view of the KV cache (#4180)

* Allow exporting a view of the KV cache

* Allow dumping the sequences per cell in common

* Track max contiguous cells value and position as well

* Fix max contiguous empty cells index calculation

Make dump functions deal with lengths or sequences counts > 10 better

* Fix off by one error in dump_kv_cache_view

* Add doc comments for KV cache view functions

Eliminate cell sequence struct; use llama_seq_id directly

Minor cleanups

* common : add -dkvc arg for enabling kv cache dumps

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-23 19:07:56 +02:00
Daniel Bevenius
9d5949f04b
examples : fix typo in parallel example doc comment (#4181)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-11-23 13:34:20 +02:00
Galunid
0b871f1a04
finetune - update readme to mention llama support only (#4148) 2023-11-20 19:30:00 +01:00
Concedo
56a5fa7a60 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-20 22:37:06 +08:00
Seb C
881800d1f0
main : Add ChatML functionality to main example (#4046)
Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com>
2023-11-20 14:56:59 +01:00
Branden Butler
40a34fe8d0
speculative : fix prompt tokenization in speculative example (#4025)
* Support special tokens and not adding BOS to prompt in speculative

* Adapt to new should_add_bos function

* Ensure tgt and dft have same add_bos setting
2023-11-20 11:50:04 +02:00
Georgi Gerganov
dae06c06e5
Revert "finetune : add --n-gpu-layers flag info to --help (#4128)"
This reverts commit 05e8301e45.
2023-11-19 19:16:07 +02:00
Clark Saben
05e8301e45
finetune : add --n-gpu-layers flag info to --help (#4128) 2023-11-19 18:56:38 +02:00
SoftwareRenderer
936c79b227
server : relay error messages (#4131) 2023-11-19 18:54:10 +02:00
Kerfuffle
28a2e6e7d4
tokenize example: Respect normal add BOS token behavior (#4126)
Allow building with Makefile
2023-11-18 14:48:17 -07:00
Concedo
6bf8ee4aea Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml-cuda.cu
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
Georgi Gerganov
5ad387e994
tokenize : fix trailing whitespace 2023-11-17 18:01:38 +02:00
zakkor
2fa02b4b3d
examples : add tokenize (#4039) 2023-11-17 17:36:44 +02:00
Huawei Lin
c7cce1246e
llava : fix compilation warning that fread return value is not used (#4069) 2023-11-17 17:22:56 +02:00
Jiří Podivín
f7d5e97542
py : remove superfluous import statements (#4076)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
2023-11-17 17:20:53 +02:00
Jiří Podivín
ba4cf5c0bf
train : move number of gpu layers argument parsing to common/train.cpp (#4074)
- introduces help entry for the argument
 - cuts '--gpu-layers' form in order to simplify usage and documentation.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
2023-11-17 17:19:16 +02:00
Andrew Godfrey
947f64f163
finetune : zero the loraB initial vectors (#4082)
* finetune : zero the loraB initial vectors

Without this, the first iteration is starting out far from the base model, instead of exactly on it.
Zeroing loraB is what the paper recommends. loralib also zeroes at least one of the init vector pairs
(though it departs from the paper in using a different distribution for the other vector, in some cases).

* tabs to spaces

* Use ggml_set_zero instead of adding a new function
2023-11-17 11:23:11 +01:00
Kerfuffle
91f6499393
Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00
Concedo
35a97e14b2 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	docs/token_generation_performance_tips.md
#	grammars/README.md
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-grad0.cpp
#	tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
M. Yusuf Sarıgöz
bd90eca237
llava : fix regression for square images in #3613 (#4056) 2023-11-13 18:20:52 +03:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
* sync : ggml (backend v2) (wip)

* sync : migrate examples and llama.cpp to dynamic graphs (wip)

* sync : update tests + fix max op params to 64

ggml-ci

* sync : ggml-cuda

ggml-ci

* llama : fix save/load state context size

ggml-ci

* sync : try to fix build on tvOS

* sync : pass custom graph sizes in training examples

* sync : update graph copies to new ggml API

* sync : update sync-ggml.sh with new files

* scripts : fix header in sync script

* train : fix context size calculations

* llama : increase inference graph size up to 4096 nodes

* train : allocate grads for backward graphs

* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Richard Kiss
532dd74e38
Fix some documentation typos/grammar mistakes (#4032)
* typos

* Update examples/parallel/README.md

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-11 23:04:58 -07:00
Alexey Parfenov
d96ca7ded7
server : fix crash when prompt exceeds context size (#3996) 2023-11-10 23:48:21 -06:00