Commit graph

200 commits

Author SHA1 Message Date
Concedo
6bf8ee4aea Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml-cuda.cu
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
Andrew Godfrey
b83e149ec6
cuda : get_row_rounding F32 (#4095)
* Fix #4017

* Update ggml-cuda.cu

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-11-17 10:01:15 +02:00
Georgi Gerganov
4f447a4833
llama : fix data units (#4101)
* llama : fix data units

ggml-ci

* Revert "llama : fix data units"

This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.

* llama : disambiguate data units

ggml-ci
2023-11-17 10:00:15 +02:00
slaren
1cf2850d52
ggml-cuda : increase max graph size (#4084) 2023-11-15 14:58:13 +02:00
Concedo
35a97e14b2 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	docs/token_generation_performance_tips.md
#	grammars/README.md
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-grad0.cpp
#	tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
Georgi Gerganov
3d68f364f1
ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060)
ggml-ci
2023-11-13 16:55:52 +02:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
* sync : ggml (backend v2) (wip)

* sync : migrate examples and llama.cpp to dynamic graphs (wip)

* sync : update tests + fix max op params to 64

ggml-ci

* sync : ggml-cuda

ggml-ci

* llama : fix save/load state context size

ggml-ci

* sync : try to fix build on tvOS

* sync : pass custom graph sizes in training examples

* sync : update graph copies to new ggml API

* sync : update sync-ggml.sh with new files

* scripts : fix header in sync script

* train : fix context size calculations

* llama : increase inference graph size up to 4096 nodes

* train : allocate grads for backward graphs

* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Kerfuffle
bb50a792ec
Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041)
* Add ReLU and SQR CUDA ops to fix Persimmon offloading

* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
2023-11-13 01:58:15 -07:00
Concedo
f277ed0e8c Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-11-07 15:23:08 +08:00
Meng Zhang
46876d2a2c
cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946)
* prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
2023-11-07 08:49:08 +02:00
Concedo
78ca0667a4 Merge branch 'master' into concedo_experimental 2023-11-06 16:58:58 +08:00
slaren
2833a6f63c
ggml-cuda : fix f16 mul mat (#3961)
* ggml-cuda : fix f16 mul mat

ggml-ci

* silence common.cpp warning (bonus)
2023-11-05 18:45:16 +01:00
Jared Van Bortel
132d25b8a6
cuda : fix disabling device with --tensor-split 1,0 (#3951)
Co-authored-by: slaren <slarengh@gmail.com>
2023-11-05 10:08:57 -05:00
Concedo
a62468ec4c Merge branch 'master' into concedo_experimental
should fix multigpu
2023-11-05 22:14:40 +08:00
slaren
48ade94538
cuda : revert CUDA pool stuff (#3944)
* Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918)"

This reverts commit 629f917cd6.

* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)"

This reverts commit d6069051de.

ggml-ci
2023-11-05 09:12:13 +02:00
Concedo
1e7088a80b autopick cublas in gui if possible, better layer picking logic 2023-11-05 01:35:27 +08:00
Concedo
38471fbe06 tensor core info better printout (+1 squashed commits)
Squashed commits:

[be4ef93f] tensor core info better printout
2023-11-04 08:38:25 +08:00
Concedo
9bc2e35b2e Merge branch 'master' into concedo_experimental 2023-11-03 23:51:32 +08:00
slaren
abb77e7319
ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921) 2023-11-03 12:13:09 +01:00
Concedo
c07c9b857d Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-11-03 11:17:07 +08:00
Kerfuffle
629f917cd6
cuda : add ROCM aliases for CUDA pool stuff (#3918) 2023-11-02 21:58:22 +02:00
Georgi Gerganov
c7743fe1c1
cuda : fix const ptrs warning causing ROCm build issues (#3913) 2023-11-02 20:32:11 +02:00
Oleksii Maryshchenko
d6069051de
cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)
* Using cuda memory pools for async alloc/dealloc.

* If the CUDA device doesn't support memory pools, then use the old implementation.

* Removed redundant cublasSetStream

---------

Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
2023-11-02 19:10:39 +02:00
Concedo
bc4ff72317 not working merge 2023-11-02 17:52:40 +08:00
Georgi Gerganov
4d719a6d4e
cuda : check if this fixes Pascal card regression (#3882) 2023-11-02 08:35:10 +02:00
cebtenzzre
2fffa0d61f
cuda : fix RoPE after #2268 (#3897) 2023-11-02 07:49:44 +02:00
slaren
d480d2c204 ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel

* fix warnings

(cherry picked from commit d02e98cde0)
2023-11-02 11:19:53 +08:00
Concedo
1ab18ecb53 Merge commit 'c43c2da8af' into concedo_experimental
# Conflicts:
#	llama.cpp
2023-11-02 11:17:59 +08:00
slaren
d02e98cde0
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel

* fix warnings
2023-11-01 23:10:09 +01:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
* Add '-ngl' support to finetune.cpp

* Add fprintf in ggml_cuda_op_add

When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora

* Add 'finetune.sh', which currently fails when using GPU

"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"

* tweak finetune.sh

* Suppress some warnings in ggml.c

* Add f16 implementation to ggml_compute_forward_add_f16_f32

* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs

* finetune.sh: Edit comments

* Add "add_f16_f32_f32_cuda"

* Tweak an error message

* finetune.sh: Add an optional LLAMA_MODEL_DIR variable

* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable

* train : minor

* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Concedo
bcb397953f Merge remote-tracking branch 'llama.cpp/try-fix-3869' into concedo_experimental 2023-11-01 18:29:08 +08:00
Georgi Gerganov
22cc9bef09
cuda : check if this fixes Pascal card regression 2023-10-31 20:01:47 +02:00
Concedo
6cf2b4c73b MMQ optimizations (+1 squashed commits)
Squashed commits:

[d87de001] mmq optimization (+1 squashed commits)

Squashed commits:

[f1f67af8] still allow mmq
2023-10-28 17:57:46 +08:00
Concedo
2ea3b567cf Merge: Testing speed of tensor cores vs MMQ 2023-10-28 16:41:42 +08:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip

* cuda : new cublas gemm branch for multi-batch quantized src0

* cuda : add F32 sgemm branch

* cuda : fine-tune >= VOLTA params + use MMQ only for small batches

* cuda : remove duplicated cuBLAS GEMM code

* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros

* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Concedo
5db89b90b7 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	build.zig
#	ggml-opencl.cpp
#	tests/CMakeLists.txt
#	tests/test-double-float.cpp
#	tests/test-sampling.cpp
2023-10-25 23:58:15 +08:00
Georgi Gerganov
6961c4bd0b
batched-bench : print params at start 2023-10-25 10:26:27 +03:00
Georgi Gerganov
b2f7e04bd3
sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
2023-10-24 21:51:20 +03:00
Georgi Gerganov
2b4ea35e56
cuda : add batched cuBLAS GEMM for faster attention (#3749)
* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCM. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-10-24 16:48:37 +03:00
Concedo
a723466d50 Merge branch 'master' into concedo_experimental 2023-10-10 17:21:42 +08:00
Jan Ploski
f5f9121de1
llm : add MPT support (#3417)
* CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545)

* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

* mpt : protect against "clip_qkv": null in mpt-7b

* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

* mpt : standardized all tensor names to follow GGUF spec

* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

* mpt : fixed comment s/gptneox/mpt/

* mpt : remove tabs, trailing whitespace

* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

* comment out n_past instead of marking it unused

* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

* mpt : remove unused tokenizer_json in convert script

* ggml : remove obsolete n_past assert in ggml_alibi

* llama : print clamp_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-10 10:50:23 +03:00
Concedo
f288c6b5e3 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	build.zig
#	scripts/sync-ggml.sh
2023-10-10 00:09:46 +08:00
Georgi Gerganov
db3abcc114
sync : ggml (ggml-backend) (#3548)
* sync : ggml (ggml-backend)

ggml-ci

* zig : add ggml-backend to the build
2023-10-08 20:19:14 +03:00
Concedo
7ab01ee3c6 Merge branch 'master' into concedo_experimental 2023-10-01 10:22:05 +08:00
slaren
f5ef5cfb18
ggml-cuda : perform cublas mat mul of quantized types as f16 (#3412)
* ggml-cuda : perform cublas matrix multiplication of quantized types as fp16

* rename CC_TURING to CC_VOLTA

* disable fp16 mat mul completely with multi GPU
2023-09-30 18:12:57 +02:00
Concedo
5e6450161a functional merge 2023-09-30 12:31:57 +08:00
Concedo
b84e210f0d merge new rope param nonsense 2023-09-30 11:33:30 +08:00
Concedo
033e3bf844 prepare to merge parallel 2023-09-29 10:30:45 +08:00
slaren
16bc66d947
llama.cpp : split llama_context_params into model and context params (#3301)
* llama.cpp : split llama_context_params into model and context params

ggml-ci

* fix metal build

* fix freq_base/scale default to model value

* llama-bench : keep the same model between tests when possible

* move n_threads to llama_context_params, add n_threads_batch

* fix mpi build

* remove kv_size(), cuda scratch fixes

* remove low-vram option

* add n_threads_batch to system info, refactor to get_system_info()

* add documentation about --threads-batch to the READMEs

* llama-bench fix

* main : fix rope freq/scale warning

* llama.cpp : add llama_get_model
common : add llama_tokenize from model

* remove duplicated ctx/model functions

ggml-ci

* cuda : print total VRAM used
2023-09-28 22:42:38 +03:00