Concedo
4c90fdc5cd
Merge remote-tracking branch 'johannes/cuda-fix-output-size' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
2023-08-02 22:37:41 +08:00
JohannesGaessler
1e64d511d5
CUDA: Fix models with output size != 32000
2023-08-02 10:26:53 +02:00
Concedo
e221843147
trying out mmq
...
Merge branch 'master' into concedo_experimental
# Conflicts:
# CMakeLists.txt
# README.md
2023-07-31 22:51:15 +08:00
Concedo
3e370f83ef
Warning: Very experimental merge, do not use until confirmed stable.
2023-07-31 22:33:43 +08:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues ( #2453 )
2023-07-31 15:44:35 +02:00
Johannes Gäßler
1215ed7d5c
CUDA: Implemented row flattening for non-glm RoPE ( #2468 )
2023-07-31 14:32:30 +02:00
Johannes Gäßler
2dbf518911
CUDA: fewer memory bank conflicts for mul_mat_q ( #2458 )
2023-07-31 13:18:51 +02:00
Concedo
82d0695f0f
Merge commit ' 9baf9ef304 ' into concedo_experimental
2023-07-30 18:18:23 +08:00
Johannes Gäßler
11f3ca06b8
CUDA: Quantized matrix-matrix multiplication ( #2160 )
...
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_K
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
Johannes Gäßler
9baf9ef304
CUDA: faster multi GPU synchronization ( #2448 )
2023-07-29 23:04:10 +02:00
Concedo
6a054b80b0
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# scripts/build-info.sh
2023-07-25 22:55:55 +08:00
Concedo
3e68cdd26a
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# tests/test-grad0.c
2023-07-25 18:52:48 +08:00
Kawrakow
129d844c87
Fix Q4_K and Q5_K for QK_K = 64 on CUDA ( #2359 )
...
* Fix Q4_K and Q5_K for QK_K = 64
* Very slightly better Q5_K bit fiddling
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 13:48:04 +03:00
slaren
41c674161f
make rms_norm_eps a parameter ( #2374 )
...
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
2023-07-24 17:57:12 +02:00
Concedo
8a9b40840b
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# tests/test-grad0.c
# tests/test-opt.c
2023-07-24 20:51:28 +08:00
Georgi Gerganov
5b2b2dc6ae
ggml : sync (unary ops refactor, static-correctness) ( #2370 )
...
* ggml : sync (unary ops, tests)
ggml-ci
* tests : remove unnecessary funcs
2023-07-24 14:46:21 +03:00
Concedo
993ba3b026
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-07-24 11:59:00 +08:00
Kawrakow
2f9cf974a0
Some more Q4_K and Q5_K speedup on CUDA ( #2346 )
...
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Sped up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Sped up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080: 9.8 ms/t -> 9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-24 00:19:47 +03:00
Concedo
910744e2c0
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# flake.nix
# llama.cpp
2023-07-23 22:37:38 +08:00
slaren
95a6c595e7
ggml: move op parameters from tensors to ggml_tensor::op_params ( #2333 )
...
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops
2023-07-23 14:36:02 +02:00
Georgi Gerganov
e76d630df1
llama : grouped-query attention + LLaMAv2 70B support ( #2276 )
...
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
---------
Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
Concedo
2e84eac7f6
Merge branch 'master' into concedo_experimental
2023-07-23 16:23:00 +08:00
Concedo
aa05eadb6f
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# llama.cpp
2023-07-23 16:22:44 +08:00
Kawrakow
d2a43664f9
Speed up Q4_K ( #2322 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-23 08:49:20 +03:00
Johannes Gäßler
b9b7d94fc1
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q ( #2313 )
2023-07-22 21:27:34 +02:00
Concedo
343ae756fa
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .gitignore
# CMakeLists.txt
# Makefile
# README.md
# flake.nix
# ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Kawrakow
d924522a46
Custom RoPE + better memory management for CUDA ( #2295 )
...
* Custom RoPE + better memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This seems to be sufficient.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:27:51 +03:00
Georgi Gerganov
ae178ab46b
llama : make tensor_split ptr instead of array ( #2272 )
2023-07-21 13:10:51 +03:00
Concedo
0d7240b320
modified RoPE for CUDA
2023-07-19 14:16:27 +08:00
Concedo
374fffb9c6
Reworking RoPE (WIP)
2023-07-19 00:54:41 +08:00
Concedo
6d32e7fc8b
Merge commit ' a6803cab94 ' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# Makefile
# build.zig
# flake.nix
# ggml-cuda.cu
# ggml.h
# tests/test-grad0.c
# tests/test-opt.c
2023-07-18 19:12:06 +08:00
Jiahao Li
7568d1a2b2
Support dup & cont ops on CUDA ( #2242 )
2023-07-17 20:39:29 +03:00
Bach Le
7cdd30bf1f
cuda : allocate all temporary ggml_tensor_extra_gpu from a fixed-size buffer ( #2220 )
2023-07-14 22:00:58 +03:00
Jiahao Li
206e01de11
cuda : support broadcast add & mul ( #2192 )
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-14 21:38:24 +03:00
Johannes Gäßler
4304bd3cde
CUDA: mul_mat_vec_q kernels for k-quants ( #2203 )
2023-07-14 19:44:08 +02:00
Georgi Gerganov
697966680b
ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope)
2023-07-14 16:36:41 +03:00
Howard Su
ff5d58faec
Fix compile error on Windows CUDA ( #2207 )
2023-07-13 21:58:09 +08:00
Georgi Gerganov
680e6f9177
cuda : add gelu support
2023-07-12 20:32:15 +03:00
Johannes Gäßler
2b5eb72e10
Fixed __dp4a compute capability: 6.0 -> 6.1 ( #2189 )
2023-07-12 10:38:52 +02:00
Georgi Gerganov
f7d278faf3
ggml : revert CUDA broadcast changes from #2183 ( #2191 )
2023-07-12 10:54:19 +03:00
Concedo
5941514e95
Merge commit ' 5bf2a27718 ' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# README.md
2023-07-12 13:05:16 +08:00
Concedo
8f4ed0d18c
fixed CMake, 8-bit MMV should be working now
2023-07-12 11:22:55 +08:00
Sammy
7516488550
fix compilation ( #313 )
2023-07-12 10:44:56 +08:00
Georgi Gerganov
20d7740a9b
ggml : sync (abort callback, mul / add broadcast, fix alibi) ( #2183 )
2023-07-11 22:53:34 +03:00
Spencer Sutton
5bf2a27718
ggml : remove src0 and src1 from ggml_tensor and rename opt to src ( #2178 )
...
* Add ggml changes
* Update train-text-from-scratch for change
* mpi : adapt to new ggml_tensor->src
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 19:31:10 +03:00
Concedo
4be167915a
added linear rope option, added warning for bad samplers
2023-07-11 18:08:19 +08:00
Concedo
50097e6c7f
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# README.md
# llama.cpp
2023-07-10 20:08:27 +08:00
Johannes Gäßler
64639555ff
Fixed OpenLLaMA 3b CUDA mul_mat_vec_q ( #2144 )
2023-07-08 20:01:44 +02:00
Concedo
15576bc865
Merge branch 'kquant_vocab_fix' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# Makefile
# README.md
# llama.cpp
# tests/CMakeLists.txt
# tests/test-grad0.c
# tests/test-opt.c
2023-07-08 20:43:20 +08:00
Johannes Gäßler
061f5f8d21
CUDA: add __restrict__ to mul mat vec kernels ( #2140 )
2023-07-08 00:25:15 +02:00