Commit graph

229 commits

Author SHA1 Message Date
Concedo
6d32e7fc8b Merge commit 'a6803cab94' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	Makefile
#	build.zig
#	flake.nix
#	ggml-cuda.cu
#	ggml.h
#	tests/test-grad0.c
#	tests/test-opt.c
2023-07-18 19:12:06 +08:00
Shangning Xu
c48c525f87
examples : fixed path typos in embd-input (#2214) 2023-07-14 21:40:05 +03:00
Howard Su
32c5411631
Revert "Support using mmap when applying LoRA (#2095)" (#2206)
Has perf regression when mlock is used.

This reverts commit 2347463201.
2023-07-13 21:58:25 +08:00
Concedo
5941514e95 Merge commit '5bf2a27718' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	README.md
2023-07-12 13:05:16 +08:00
Spencer Sutton
5bf2a27718
ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178)
* Add ggml changes

* Update train-text-from-scratch for change

* mpi : adapt to new ggml_tensor->src

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 19:31:10 +03:00
Bach Le
c9c74b4e3f
llama : add classifier-free guidance (#2135)
* Initial implementation

* Remove debug print

* Restore signature of llama_init_from_gpt_params

* Free guidance context

* Make freeing of guidance_ctx conditional

* Make Classifier-Free Guidance a sampling function

* Correct typo. CFG already means context-free grammar.

* Record sampling time in llama_sample_classifier_free_guidance

* Shift all values by the max value before applying logsoftmax

* Fix styling based on review
2023-07-11 19:18:43 +03:00
Howard Su
2347463201
Support using mmap when applying LoRA (#2095)
* Support using mmap when applying LoRA

* Fix Linux

* Update comment to reflect the support lora with mmap
2023-07-11 22:37:01 +08:00
Concedo
b0b131499f Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	tests/test-tokenizer-0.cpp
2023-07-11 16:12:15 +08:00
Evan Miller
5656d10599
mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
Concedo
50097e6c7f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	llama.cpp
2023-07-10 20:08:27 +08:00
Nigel Bosch
db4047ad5c
main : escape prompt prefix/suffix (#2151) 2023-07-09 11:56:18 +03:00
Concedo
15576bc865 Merge branch 'kquant_vocab_fix' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	README.md
#	llama.cpp
#	tests/CMakeLists.txt
#	tests/test-grad0.c
#	tests/test-opt.c
2023-07-08 20:43:20 +08:00
Qingyou Meng
1d656d6360
ggml : change ggml_graph_compute() API to not require context (#1999)
* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287

* rewrite: no longer consider backward compitability; plan and make_plan

* minor: rename ctx as plan; const

* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward

* add static ggml_graph_compute_sugar()

* minor: update comments

* reusable buffers

* ggml : more consistent naming + metal fixes

* ggml : fix docs

* tests : disable grad / opt + minor naming changes

* ggml : add ggml_graph_compute_with_ctx()

- backwards compatible API
- deduplicates a lot of copy-paste

* ci : enable test-grad0

* examples : factor out plan allocation into a helper function

* llama : factor out plan stuff into a helper function

* ci : fix env

* llama : fix duplicate symbols + refactor example benchmark

* ggml : remove obsolete assert + refactor n_tasks section

* ggml : fix indentation in switch

* llama : avoid unnecessary bool

* ggml : remove comments from source file and match order in header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 19:24:01 +03:00
Concedo
32102c2064 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-07-07 14:15:39 +08:00
Judd
36680f6e40
convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
tslmy
a17a2683d8
alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
Concedo
220aa707e6 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	pocs/vdot/q8dot.cpp
#	pocs/vdot/vdot.cpp
#	scripts/sync-ggml.sh
#	tests/test-grad0.c
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
2023-07-06 15:40:40 +08:00
Tobias Lütke
31cfbb1013
Expose generation timings from server & update completions.js (#2116)
* use javascript generators as much cleaner API

Also add ways to access completion as promise and EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson
983b555e9d
Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
Stephan Walter
1b107b8550
ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson
8567c76b53
Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
Nigel Bosch
7f0e9a775e
embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
jwj7140
f257fd2550
Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
7ee76e45af
Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

because web browsers send a lot of garbage requests we want the server
to multithread when serving 404s for favicon's etc. To avoid blowing up
llama we just take a mutex when it's invoked.


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
Concedo
69add28324 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
2023-07-04 18:51:42 +08:00
Henri Vasserman
1cf14ccef1
fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
WangHaoranRobin
d7d2e6a0f0
server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
Concedo
b85ea580d3 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-07-02 14:45:25 +08:00
Georgi Gerganov
79f634a19d
embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
Georgi Gerganov
04606a1599
train : fix compile warning 2023-07-01 18:45:44 +03:00
Concedo
67cb0b2760 Merge branch 'master' into concedo_experimental 2023-06-30 23:25:40 +08:00
Howard Su
b8c8dda75f
Use unsigned for random seed (#2006)
* Use unsigned for random seed. Keep -1 as the value to use a time based seed.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-29 06:15:15 -07:00
Concedo
dff5575647 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	Makefile
#	ggml-opencl.cpp
#	llama.cpp
2023-06-29 17:35:28 +08:00
Johannes Gäßler
7f9753fa12
CUDA GPU acceleration for LoRAs + f16 models (#1970) 2023-06-28 18:35:54 +02:00
ningshanwutuobang
cfa0750bc9
llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add READMD for llava.py

* add READMD for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
Concedo
282376c85a Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	tests/test-quantize-perf.cpp
2023-06-27 19:15:27 +08:00
Howard Su
0be54f75a6
baby-llama : fix build after ggml_rope change (#2016) 2023-06-27 08:07:13 +03:00
Georgi Gerganov
181e8d9755
llama : fix rope usage after ChatGLM change 2023-06-27 00:37:33 +03:00
David Yang
eaa6ca5a61
ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)
* Clean up compiler warnings in train-text

Some brackets to disambiguate order of operations

* Increase GGML_MAX_NAME

Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
2023-06-26 22:45:32 +03:00
zrm
b853d45601
ggml : add NUMA support (#1556)
* detect NUMA systems and pin work threads to nodes (linux)

* disable mmap prefetch/readahead for NUMA systems

* avoid sending finalize op to thread pool if it does nothing

* silence robot

* fix args

* make --numa a param

* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

* lower synchronization overhead

* statically allocate

* move numa state to g_state

* add description for --numa

* ggml : minor style changes

* ggml : minor style + try fix sanitizer build

* llama : allow to initialize backend with NUMA support

* llama : avoid ggml include in llama-util.h

* ggml : style / formatting

* ggml : fix handling of ops with n_threads > n_tasks > 1

* server : utilize numa parameter

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26 20:57:59 +03:00
Concedo
e4c9aea840 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-06-26 10:35:47 +08:00
Concedo
d2034ced7b Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	build.zig
#	flake.nix
#	tests/test-grad0.c
#	tests/test-sampling.cpp
#	tests/test-tokenizer-0.cpp
2023-06-25 17:01:15 +08:00
anon998
c2a08f87b8
fix server sampling: top k sampler first (#1977)
Co-authored-by: anon <anon@example.org>
2023-06-25 10:48:36 +02:00
Didzis Gosko
527b6fba1d
llama : make model stateless and context stateful (llama_state) (#1797)
* llama : make model stateless and context stateful

* llama : minor cleanup

* llama : update internal API declaration

* Apply suggestions from code review

fix style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Missing model memory release

* Fix style

* Add deprecated warning for public API function llama_init_from_file

* Update public API use cases: move away from deprecated llama_init_from_file

* Deprecate public API function llama_apply_lora_from_file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 11:47:58 +03:00
Concedo
b4c532e862 Merge branch 'master' into concedo_experimental 2023-06-20 17:26:27 +08:00
Henri Vasserman
20568fe60f
[Fix] Reenable server embedding endpoint (#1937)
* Add back embedding feature

* Update README
2023-06-20 01:12:39 +03:00
Concedo
d0d3c4f32b Merge remote-tracking branch 'origin/master' into concedo_experimental
# Conflicts:
#	README.md
2023-06-18 22:53:10 +08:00
Kawrakow
90cc59d6ab
examples : fix examples/metal (#1920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 10:52:10 +03:00
Concedo
278427d9a4 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
2023-06-18 15:29:44 +08:00
Georgi Gerganov
4f9c43e3bd
minor : warning fixes 2023-06-17 20:24:11 +03:00