Commit graph

506 commits

Author SHA1 Message Date
l3utterfly
415e99fec2
Stream save llama context data to file instead of allocating entire buffer upfront (#2488)
* added stream saving context data to file to avoid allocating unnecessary amounts of memory

* generalised copying state data to file or buffer

* added comments explaining how copy_state_data works

* fixed trailing whitespaces

* fixed save load state example

* updated save load state to use public function in llama.cpp

* - restored breakage of the llama_copy_state_data API
- moved new logic for copying llama state data to internal function

* fixed function declaration order

* restored save load state example

* fixed whitepace

* removed unused llama-util.h include

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* Apply code review suggestions

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-08-04 13:29:52 +02:00
Concedo
e221843147 trying out mmq
Merge branch 'master' into concedo_experimental

# Conflicts:
#	CMakeLists.txt
#	README.md
2023-07-31 22:51:15 +08:00
Concedo
3e370f83ef Warning: Very experimental merge, do not use until confirmed stable. 2023-07-31 22:33:43 +08:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues (#2453) 2023-07-31 15:44:35 +02:00
slaren
9d2382b3e4
Fix Metal backend broken from the allocator changes (#2455)
* fix Metal backend broken from the allocator changes
2023-07-31 11:02:53 +02:00
slaren
a113689571
ggml : add graph tensor allocator (#2411)
* ggml : add graph tensor allocator

* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset

* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
2023-07-30 15:58:01 +02:00
Concedo
cde3760e52 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	ggml.h
#	llama.cpp
2023-07-29 17:47:00 +08:00
eric8607242
ee1b497c98
llama : support more diverse tokenizers? (#2420)
* supporting more diverse tokenizers

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-28 21:10:05 +03:00
Rand Xie
65cdf34bdc
llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433) 2023-07-28 11:42:53 +03:00
Georgi Gerganov
1a941869cb
metal : disable graph concurrency optimization due to bug (#2413) 2023-07-27 11:00:54 +03:00
slaren
5488fb789e
ggml : allocate graphs in a context (#2392)
* ggml : graph allocation in contexts

* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx

* llama.cpp : allocate graph in the context

* add GGML_PAD

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-26 15:56:53 +02:00
Concedo
b184380aae Revert "a better default rms_norm_eps"
This reverts commit 0c26799e77.
2023-07-26 10:23:45 +08:00
Concedo
f53d2aabb4 Merge branch 'master' into concedo_experimental 2023-07-26 10:19:59 +08:00
Kawrakow
eb542d3932
Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 18:35:53 +03:00
Concedo
6a054b80b0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	scripts/build-info.sh
2023-07-25 22:55:55 +08:00
Concedo
0c26799e77 a better default rms_norm_eps 2023-07-25 22:51:01 +08:00
slaren
da1889834a
ggml : improve graph build time via hash table lookup (#2329)
* improve graph build time

* ggml_tensor : use 1 bit per flag

* use a hash table instead
2023-07-25 15:32:20 +03:00
Shouzheng Liu
1aa18ef994
metal : concurrently dispatch commands (#2358)
* metal: concurrently dispatch commands

Function `ggml_metal_graph_find_concurrency` will run and write
commands that can be issued concurrently to metal context `concur_list`
array, when `ggml_metal_graph_compute` is called for the first time.

* metal: don't call find_concurrency automatically.

* metal : code style changes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-25 15:00:19 +03:00
Concedo
3e68cdd26a Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-grad0.c
2023-07-25 18:52:48 +08:00
slaren
41c674161f
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci
2023-07-24 17:57:12 +02:00
Concedo
6d71e100fe buff buffers 2023-07-24 20:33:17 +08:00
Concedo
66328fcd80 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-07-24 15:44:26 +08:00
Concedo
94499dba25 added support for 70b llama 2 2023-07-24 15:20:18 +08:00
Evan Jones
84e09a7d8b
llama : add grammar-based sampling (#1773)
* llama, main : constrain sampling to grammar

* allow loading grammar from file

* fix whitespace errors

* handle & print parser errors

* add comments to grammar syntax and allow newlines where unambiguous

* add missing include

* support alternates in root rule

* fix bugs with empty token and EOS

* adjust JSON grammar

* remove swp file

* rewrite ternary expressions

Co-authored-by: Henri Vasserman <henv@hot.ee>

* use struct for grammar elements and add Unicode support

* add unicode escapes

* add inverse char ranges

* only sample full tokens (no peeking or truncation)

* llama : minor style changes

blindly applied in online editor - hopefully I didn't break something

* update help text

* add warning message if EOS is disabled

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 23:58:10 -04:00
Concedo
910744e2c0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	flake.nix
#	llama.cpp
2023-07-23 22:37:38 +08:00
Georgi Gerganov
e76d630df1
llama : grouped-query attention + LLaMAv2 70B support (#2276)
* CUDA: GQA implementation

* llama : support for GQA and LLaMAv2 70B

ggml-ci

* py : fix hparams parsing (if-else blocks)

ggml-ci

* py : oh boy ..

ggml-ci

* help : fix gqa value for 70B

ggml-ci

---------

Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
Christian Demsar
a940458e48
llama : print max tensor size to stderr (#2336) 2023-07-23 14:56:34 +03:00
Concedo
aa05eadb6f Merge branch 'master' into concedo_experimental
# Conflicts:
#	llama.cpp
2023-07-23 16:22:44 +08:00
Georgi Gerganov
b47b8a9cfe
llama : optimize memory buffers (#2325) 2023-07-22 21:17:57 +03:00
Concedo
3aec3038d4 bump scratch buffers 2023-07-22 18:12:18 +08:00
Concedo
343ae756fa Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
#	ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Georgi Gerganov
513f861953
ggml : fix rope args order + assert (#2054) 2023-07-21 14:51:34 +03:00
Guillaume "Vermeille" Sanchez
ab0e26bdfb
llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) 2023-07-21 13:58:36 +03:00
Georgi Gerganov
ae178ab46b
llama : make tensor_split ptr instead of array (#2272) 2023-07-21 13:10:51 +03:00
Concedo
06c08576f7 Merge remote-tracking branch 'origin/master' into concedo_experimental 2023-07-20 21:02:40 +08:00
Georgi Gerganov
fff0e0eafe llama : fix regression from #2000 - could not load no-mmap models 2023-07-20 13:47:26 +03:00
Concedo
13e34d5058 Merge remote-tracking branch 'origin/master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	flake.nix
#	tests/CMakeLists.txt

update readme and lite
2023-07-19 18:28:29 +08:00
Rinne
294f424554
llama : extend API to get max devices at runtime (#2253) 2023-07-19 10:06:40 +03:00
Concedo
374fffb9c6 Reworking rope WIP 2023-07-19 00:54:41 +08:00
Georgi Gerganov
d01bccde9f
ci : integrate with ggml-org/ci (#2250)
* ci : run ctest

ggml-ci

* ci : add open llama 3B-v2 tests

ggml-ci

* ci : disable wget progress output

ggml-ci

* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations

ggml-ci

* tests : try to fix tail free sampling test

ggml-ci

* ci : add K-quants

ggml-ci

* ci : add short perplexity tests

ggml-ci

* ci : add README.md

* ppl : add --chunks argument to limit max number of chunks

ggml-ci

* ci : update README
2023-07-18 14:24:43 +03:00
Concedo
6d32e7fc8b Merge commit 'a6803cab94' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	Makefile
#	build.zig
#	flake.nix
#	ggml-cuda.cu
#	ggml.h
#	tests/test-grad0.c
#	tests/test-opt.c
2023-07-18 19:12:06 +08:00
Alex Klinkhamer
b7647436cc
llama : fix t_start_sample_us initialization warning (#2238) 2023-07-17 00:01:45 +03:00
Xiao-Yong Jin
6e7cca4047
llama : add custom RoPE (#2054)
* Implement customizable RoPE

The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

with the default matches the original

scale = 1.0
base = 10000

The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.

Recent researches show changing these two parameters extends the context limit with minimal loss.

1. Extending Context to 8K
   kaiokendev
   https://kaiokendev.github.io/til#extending-context-to-8k

2. Extending Context Window of Large Language Models via Positional Interpolation
   Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
   https://arxiv.org/abs/2306.15595

3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
   https://www.reddit.com/user/bloc97
   https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5

* ggml-metal: fix custom rope

* common: fix argument names in help

* llama: increase MEM_REQ_EVAL for MODEL_3B

It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.

* llama: make MEM_REQ_EVAL depend on n_ctx

* server: use proper Content-Type in curl examples

Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded

Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192

With Content-Type: application/json, we can send large json data.

* style : minor fixes, mostly indentations

* ggml : fix asserts

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 13:34:16 +03:00
Bach Le
7513b7b0a1
llama : add functions that work directly on model (#2197)
* Remove vocab reference from context

* Add functions that works directly with model
2023-07-14 21:55:24 +03:00
Concedo
5941514e95 Merge commit '5bf2a27718' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	README.md
2023-07-12 13:05:16 +08:00
Bach Le
c9c74b4e3f
llama : add classifier-free guidance (#2135)
* Initial implementation

* Remove debug print

* Restore signature of llama_init_from_gpt_params

* Free guidance context

* Make freeing of guidance_ctx conditional

* Make Classifier-Free Guidance a sampling function

* Correct typo. CFG already means context-free grammar.

* Record sampling time in llama_sample_classifier_free_guidance

* Shift all values by the max value before applying logsoftmax

* Fix styling based on review
2023-07-11 19:18:43 +03:00
LostRuins
bbef28218f
Possible solution to allow K-quants on models with n_vocab!=32000 (#2148)
* This allows LLAMA models that were previously incompatible with K quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g 32001 which means it's not divisible by 256 or 64. Since the problematic dimensions only apply for `tok_embeddings.weight` and `output.weight` (dimentions 4096 x n_vocab), we can simply quantize these layers to Q8_0 whereas the majority of the hidden layers are still K-quanted since they have compatible dimensions.

* Fix indentation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* As an alternative, to avoid failing on Metal due to lack of Q8_0 support, instead quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16. This results in a net gain of about 55mb for a 7B model compared to previous approach, but should minimize adverse impact to model quality.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 22:01:08 +08:00
Concedo
b0b131499f Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	tests/test-tokenizer-0.cpp
2023-07-11 16:12:15 +08:00
Evan Miller
5656d10599
mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
Concedo
11ebfea8c0 Merge branch 'kquant_vocab_fix' into concedo_experimental 2023-07-10 23:28:48 +08:00