Commit graph

466 commits

Author SHA1 Message Date
Concedo
793cfd136c fixed 70B detection again, try fix horde issues, fixed lite unicode issue, fixed cmake for cuda 2023-08-09 01:05:00 +08:00
Johannes Gäßler
acfc5478ff CUDA: tighter VRAM scratch size for 65b/70b (#2551) 2023-08-08 14:38:16 +02:00
Concedo
3554080502 fixed blasbatchmul multiplier 2023-08-08 00:41:02 +08:00
Concedo
3c7d938d95 update lite, resize scratch buffers for blasbatch 2048 2023-08-08 00:32:51 +08:00
Johannes Gäßler
3d9a551816 Fixed mmap prefetch for GPU offloading (#2529) 2023-08-07 10:09:40 +02:00
Concedo
0e41b94f40 improve detection for 70B. 2023-08-07 10:43:06 +08:00
Concedo
fb44d72a78 Merge remote-tracking branch 'johannes/cuda-fix-mmap-prefetch' into concedo_experimental 2023-08-07 10:17:43 +08:00
JohannesGaessler
d9024df759 Fixed mmap prefetch for GPU offloading 2023-08-06 20:28:16 +02:00
Concedo
d442888626 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-08-06 22:47:33 +08:00
Concedo
bcfdd0e662 fixed bbs -1 and allow bbs = 2048 2023-08-06 17:47:05 +08:00
l3utterfly
415e99fec2 Stream save llama context data to file instead of allocating entire buffer upfront (#2488)
* added stream saving context data to file to avoid allocating unnecessary amounts of memory

* generalised copying state data to file or buffer

* added comments explaining how copy_state_data works

* fixed trailing whitespaces

* fixed save load state example

* updated save load state to use public function in llama.cpp

* - restored breakage of the llama_copy_state_data API
- moved new logic for copying llama state data to internal function

* fixed function declaration order

* restored save load state example

* fixed whitepace

* removed unused llama-util.h include

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* Apply code review suggestions

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-08-04 13:29:52 +02:00
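Note on 415e99fec2: the point of #2488 is to stream state out section by section instead of first materializing the whole serialized state in one allocation. A minimal sketch of that pattern, not the actual llama.cpp code (the helper name and chunk size below are illustrative assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <vector>

// Stream a buffer to disk in fixed-size chunks so peak extra memory stays at
// one chunk rather than the full serialized state size.
static void write_chunked(std::ofstream & out, const uint8_t * data, size_t size,
                          size_t chunk = 1u << 20) {
    for (size_t off = 0; off < size; off += chunk) {
        const size_t n = std::min(chunk, size - off);
        out.write(reinterpret_cast<const char *>(data + off), n);
    }
}

int main() {
    std::vector<uint8_t> kv_section(8u << 20, 0); // stand-in for one state section
    std::ofstream out("state.bin", std::ios::binary);
    write_chunked(out, kv_section.data(), kv_section.size());
    return 0;
}
```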
Concedo
e221843147 trying out mmq
Merge branch 'master' into concedo_experimental

# Conflicts:
#	CMakeLists.txt
#	README.md
2023-07-31 22:51:15 +08:00
Concedo
3e370f83ef Warning: Very experimental merge, do not use until confirmed stable. 2023-07-31 22:33:43 +08:00
Johannes Gäßler
0728c5a8b9 CUDA: mmq CLI option, fixed mmq build issues (#2453) 2023-07-31 15:44:35 +02:00
slaren
9d2382b3e4 Fix Metal backend broken from the allocator changes (#2455)
* fix Metal backend broken from the allocator changes
2023-07-31 11:02:53 +02:00
slaren
a113689571 ggml : add graph tensor allocator (#2411)
* ggml : add graph tensor allocator

* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset

* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
2023-07-30 15:58:01 +02:00
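Note on a113689571: a graph tensor allocator lets intermediate tensors share memory once their last consumer has run. The toy program below illustrates the general last-use reuse idea only; it is not ggml's allocator.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy last-use allocator: tensor i needs size[i] bytes and is last read at
// step last_use[i]; buffers are recycled once their tensor is dead.
int main() {
    const std::vector<size_t> size     = {64, 32, 64, 16};
    const std::vector<int>    last_use = { 1,  2,  3,  3};

    struct Buf { size_t size; int dead_after; };
    std::vector<Buf>    in_use;
    std::vector<size_t> free_list;
    size_t total_allocated = 0;

    for (int step = 0; step < (int) size.size(); ++step) {
        // move buffers whose tensor is no longer needed to the free list
        for (size_t i = 0; i < in_use.size(); ) {
            if (in_use[i].dead_after < step) {
                free_list.push_back(in_use[i].size);
                in_use.erase(in_use.begin() + i);
            } else {
                ++i;
            }
        }
        // reuse the first free buffer that fits, otherwise allocate a new one
        auto it = std::find_if(free_list.begin(), free_list.end(),
                               [&](size_t s) { return s >= size[step]; });
        size_t buf_size;
        if (it != free_list.end()) {
            buf_size = *it;
            free_list.erase(it);
        } else {
            buf_size = size[step];
            total_allocated += buf_size;
        }
        in_use.push_back({buf_size, last_use[step]});
    }
    printf("allocated %zu bytes instead of %zu\n",
           total_allocated, (size_t) (64 + 32 + 64 + 16));
    return 0;
}
```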
Concedo
cde3760e52 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	ggml.h
#	llama.cpp
2023-07-29 17:47:00 +08:00
eric8607242
ee1b497c98 llama : support more diverse tokenizers? (#2420)
* supporting more diverse tokenizers

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-28 21:10:05 +03:00
Rand Xie
65cdf34bdc llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433) 2023-07-28 11:42:53 +03:00
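Note on 65cdf34bdc: with grouped-query attention the K/V projections use fewer heads than Q, so the KV width is smaller than n_embd. A small check with LLaMA-2 70B-like dimensions (8192 embedding width, 64 query heads, 8 KV heads are the publicly documented values; variable names mirror the commit message but the code is purely illustrative):

```cpp
#include <cstdio>

int main() {
    const int n_embd    = 8192; // LLaMA-2 70B embedding width
    const int n_head    = 64;   // query heads
    const int n_head_kv = 8;    // key/value heads (GQA groups)

    const int n_gqa      = n_head / n_head_kv; // 8 query heads share one KV head
    const int n_embd_gqa = n_embd / n_gqa;     // 1024: width of the K and V projections

    printf("n_embd_gqa = %d (vs n_embd = %d)\n", n_embd_gqa, n_embd);
    return 0;
}
```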
Georgi Gerganov
1a941869cb metal : disable graph concurrency optimization due to bug (#2413) 2023-07-27 11:00:54 +03:00
slaren
5488fb789e ggml : allocate graphs in a context (#2392)
* ggml : graph allocation in contexts

* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx

* llama.cpp : allocate graph in the context

* add GGML_PAD

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-26 15:56:53 +02:00
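Note on 5488fb789e: one of the bullets adds GGML_PAD. The literal macro lives in ggml.h; as a generic illustration of what such a padding helper does (rounding a size up to a multiple of an alignment), under the assumption that this is all it does:

```cpp
#include <cassert>
#include <cstddef>

// Round x up to the next multiple of n (illustrative padding helper).
static size_t pad_to(size_t x, size_t n) {
    return ((x + n - 1) / n) * n;
}

int main() {
    assert(pad_to(30, 16) == 32);
    assert(pad_to(32, 16) == 32);
    return 0;
}
```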
Concedo
b184380aae Revert "a better default rms_norm_eps"
This reverts commit 0c26799e77.
2023-07-26 10:23:45 +08:00
Concedo
f53d2aabb4 Merge branch 'master' into concedo_experimental 2023-07-26 10:19:59 +08:00
Kawrakow
eb542d3932 Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 18:35:53 +03:00
Concedo
6a054b80b0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	scripts/build-info.sh
2023-07-25 22:55:55 +08:00
Concedo
0c26799e77 a better default rms_norm_eps 2023-07-25 22:51:01 +08:00
slaren
da1889834a ggml : improve graph build time via hash table lookup (#2329)
* improve graph build time

* ggml_tensor : use 1 bit per flag

* use a hash table instead
2023-07-25 15:32:20 +03:00
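Note on da1889834a: the speedup comes from replacing a linear "is this tensor already in the graph?" scan with a hash lookup while walking the graph. A minimal sketch of that pattern, not ggml's implementation:

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

struct Tensor { std::vector<Tensor *> srcs; };

// Visit each tensor once; the hash set makes the "already visited?" check O(1)
// instead of a linear scan over the nodes collected so far.
static void visit(Tensor * t, std::unordered_set<Tensor *> & seen,
                  std::vector<Tensor *> & order) {
    if (t == nullptr || !seen.insert(t).second) {
        return;
    }
    for (Tensor * s : t->srcs) {
        visit(s, seen, order);
    }
    order.push_back(t); // parents appended after their inputs
}

int main() {
    Tensor a, b, c{{&a, &b}}, d{{&c, &a}};
    std::unordered_set<Tensor *> seen;
    std::vector<Tensor *> order;
    visit(&d, seen, order);
    printf("graph has %zu nodes\n", order.size()); // 4
    return 0;
}
```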
Shouzheng Liu
1aa18ef994 metal : concurrently dispatch commands (#2358)
* metal: concurrently dispatch commands

Function `ggml_metal_graph_find_concurrency` will run and write
commands that can be issued concurrently to metal context `concur_list`
array, when `ggml_metal_graph_compute` is called for the first time.

* metal: don't call find_concurrency automatically.

* metal : code style changes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-25 15:00:19 +03:00
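Note on 1aa18ef994: the message describes ggml_metal_graph_find_concurrency writing commands that can be issued together into a concur_list. A generic sketch of the underlying idea: assign each node of the dependency graph a level such that all of its inputs sit in earlier levels, so nodes sharing a level are independent and can be dispatched concurrently. The Node struct and level pass below are illustrative, not the Metal backend's actual data structures.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node { std::vector<int> deps; }; // indices of input nodes

int main() {
    // node 2 depends on 0 and 1; node 3 depends on 2
    std::vector<Node> graph = {{}, {}, {{0, 1}}, {{2}}};

    // level[i] = 1 + max level of inputs; nodes sharing a level are independent
    std::vector<int> level(graph.size(), 0);
    for (size_t i = 0; i < graph.size(); ++i) { // assumes nodes are in topological order
        for (int d : graph[i].deps) {
            level[i] = std::max(level[i], level[d] + 1);
        }
    }
    for (size_t i = 0; i < graph.size(); ++i) {
        printf("node %zu -> concurrency level %d\n", i, level[i]);
    }
    return 0;
}
```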
Concedo
3e68cdd26a Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-grad0.c
2023-07-25 18:52:48 +08:00
slaren
41c674161f make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci
2023-07-24 17:57:12 +02:00
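Note on 41c674161f: rms_norm_eps is the small constant added inside RMSNorm's denominator so that near-zero activations do not blow up the normalization; this commit makes it configurable instead of hard-coded. A minimal reference computation, independent of the ggml kernels:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// y_i = x_i / sqrt(mean(x^2) + eps) * w_i  -- eps is the rms_norm_eps value
static void rms_norm(const std::vector<float> & x, const std::vector<float> & w,
                     std::vector<float> & y, float eps) {
    double mean_sq = 0.0;
    for (float v : x) {
        mean_sq += (double) v * v;
    }
    mean_sq /= x.size();
    const float scale = 1.0f / std::sqrt((float) mean_sq + eps);
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale * w[i];
    }
}

int main() {
    std::vector<float> x = {0.5f, -1.0f, 2.0f, 0.0f};
    std::vector<float> w(4, 1.0f), y(4);
    rms_norm(x, w, y, 1e-5f); // 1e-5 vs 5e-6 is the kind of choice the new parameter exposes
    printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```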
Concedo
6d71e100fe buff buffers 2023-07-24 20:33:17 +08:00
Concedo
66328fcd80 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-07-24 15:44:26 +08:00
Concedo
94499dba25 added support for 70b llama 2 2023-07-24 15:20:18 +08:00
Evan Jones
84e09a7d8b llama : add grammar-based sampling (#1773)
* llama, main : constrain sampling to grammar

* allow loading grammar from file

* fix whitespace errors

* handle & print parser errors

* add comments to grammar syntax and allow newlines where unambiguous

* add missing include

* support alternates in root rule

* fix bugs with empty token and EOS

* adjust JSON grammar

* remove swp file

* rewrite ternary expressions

Co-authored-by: Henri Vasserman <henv@hot.ee>

* use struct for grammar elements and add Unicode support

* add unicode escapes

* add inverse char ranges

* only sample full tokens (no peeking or truncation)

* llama : minor style changes

blindly applied in online editor - hopefully I didn't break something

* update help text

* add warning message if EOS is disabled

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 23:58:10 -04:00
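Note on 84e09a7d8b: grammar-based sampling constrains which tokens the sampler may pick so that the output always matches a formal grammar. The sketch below is a deliberately simplified stand-in for the real machinery: it masks candidates that a (here hard-coded) acceptance check rejects, then greedily takes the best survivor.

```cpp
#include <cfloat>
#include <cstdio>
#include <string>
#include <vector>

struct Candidate { std::string text; float logit; };

// Trivial stand-in for a grammar: only "yes" or "no" may start the output.
static bool grammar_accepts(const std::string & tok) {
    return tok == "yes" || tok == "no";
}

int main() {
    std::vector<Candidate> cand = {
        {"maybe", 2.0f}, {"yes", 1.2f}, {"no", 0.7f}, {"dunno", 1.9f},
    };

    // Reject tokens the grammar cannot accept, then take the best remaining one.
    int best = -1;
    for (int i = 0; i < (int) cand.size(); ++i) {
        if (!grammar_accepts(cand[i].text)) {
            cand[i].logit = -FLT_MAX; // masked out
        } else if (best < 0 || cand[i].logit > cand[best].logit) {
            best = i;
        }
    }
    printf("sampled: %s\n", best >= 0 ? cand[best].text.c_str() : "<none>");
    return 0;
}
```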
Concedo
910744e2c0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	flake.nix
#	llama.cpp
2023-07-23 22:37:38 +08:00
Georgi Gerganov
e76d630df1 llama : grouped-query attention + LLaMAv2 70B support (#2276)
* CUDA: GQA implementation

* llama : support for GQA and LLaMAv2 70B

ggml-ci

* py : fix hparams parsing (if-else blocks)

ggml-ci

* py : oh boy ..

ggml-ci

* help : fix gqa value for 70B

ggml-ci

---------

Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
Christian Demsar
a940458e48 llama : print max tensor size to stderr (#2336) 2023-07-23 14:56:34 +03:00
Concedo
aa05eadb6f Merge branch 'master' into concedo_experimental
# Conflicts:
#	llama.cpp
2023-07-23 16:22:44 +08:00
Georgi Gerganov
b47b8a9cfe llama : optimize memory buffers (#2325) 2023-07-22 21:17:57 +03:00
Concedo
3aec3038d4 bump scratch buffers 2023-07-22 18:12:18 +08:00
Concedo
343ae756fa Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
#	ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Georgi Gerganov
513f861953 ggml : fix rope args order + assert (#2054) 2023-07-21 14:51:34 +03:00
Guillaume "Vermeille" Sanchez
ab0e26bdfb llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) 2023-07-21 13:58:36 +03:00
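Note on ab0e26bdfb: classifier-free guidance mixes guided and unguided logits, and the removed smooth factor could be folded into the guidance scale, hence "only a reparameterization". A hedged sketch of the basic CFG mix (the exact formula used in llama.cpp may differ):

```cpp
#include <cstdio>
#include <vector>

// guided = uncond + scale * (cond - uncond); scale = 1 reproduces cond exactly.
int main() {
    const std::vector<float> cond   = {1.0f, 0.2f, -0.5f};
    const std::vector<float> uncond = {0.4f, 0.3f, -0.1f};
    const float guidance_scale = 1.5f;

    for (size_t i = 0; i < cond.size(); ++i) {
        const float guided = uncond[i] + guidance_scale * (cond[i] - uncond[i]);
        printf("token %zu: %.3f\n", i, guided);
    }
    return 0;
}
```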
Georgi Gerganov
ae178ab46b llama : make tensor_split ptr instead of array (#2272) 2023-07-21 13:10:51 +03:00
Concedo
06c08576f7 Merge remote-tracking branch 'origin/master' into concedo_experimental 2023-07-20 21:02:40 +08:00
Georgi Gerganov
fff0e0eafe llama : fix regression from #2000 - could not load no-mmap models 2023-07-20 13:47:26 +03:00
Concedo
13e34d5058 Merge remote-tracking branch 'origin/master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	flake.nix
#	tests/CMakeLists.txt

update readme and lite
2023-07-19 18:28:29 +08:00
Rinne
294f424554 llama : extend API to get max devices at runtime (#2253) 2023-07-19 10:06:40 +03:00
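Note on 294f424554: the maximum device count becomes queryable at runtime instead of being a compile-time constant. Assuming the entry point added by #2253 is llama_max_devices() (check llama.h of that revision for the exact signature), usage would look like:

```cpp
#include <cstdio>
#include "llama.h"

int main() {
    // Number of devices this build supports, e.g. for sizing a tensor_split array.
    printf("max devices: %ld\n", (long) llama_max_devices());
    return 0;
}
```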
Concedo
374fffb9c6 Reworking rope WIP 2023-07-19 00:54:41 +08:00
Georgi Gerganov
d01bccde9f ci : integrate with ggml-org/ci (#2250)
* ci : run ctest

ggml-ci

* ci : add open llama 3B-v2 tests

ggml-ci

* ci : disable wget progress output

ggml-ci

* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations

ggml-ci

* tests : try to fix tail free sampling test

ggml-ci

* ci : add K-quants

ggml-ci

* ci : add short perplexity tests

ggml-ci

* ci : add README.md

* ppl : add --chunks argument to limit max number of chunks

ggml-ci

* ci : update README
2023-07-18 14:24:43 +03:00