Commit graph

133 commits

Concedo
eed651494e Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	common/log.h
2023-09-02 11:24:28 +08:00
Engininja2
f04d002844
cuda : vsubss4 for older versions of ROCm/clang (#2942) 2023-09-01 23:33:19 +02:00
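`__vsubss4` is CUDA's packed intrinsic that subtracts four signed 8-bit lanes with saturation; older ROCm/clang toolchains did not provide it, so a fallback is needed. A minimal standalone C++ sketch of the lane-wise behavior (the function name is hypothetical, not the actual ggml-cuda code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Subtract each signed 8-bit lane of b from the matching lane of a,
// clamp to [-128, 127], and repack the lanes into one 32-bit word.
static uint32_t vsubss4_fallback(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = (int8_t)(a >> (8 * i));
        const int8_t bi = (int8_t)(b >> (8 * i));
        int d = (int)ai - (int)bi;              // widen, then subtract
        d = std::min(127, std::max(-128, d));   // saturate to int8 range
        out |= (uint32_t)(uint8_t)(int8_t)d << (8 * i);
    }
    return out;
}

int main() {
    // Every lane is 100 - (-100): the true result 200 saturates to 127 (0x7f).
    printf("%08x\n", vsubss4_fallback(0x64646464u, 0x9c9c9c9cu)); // 7f7f7f7f
}
```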
Concedo
f2c02dd06d Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	tests/test-grad0.cpp
2023-08-30 10:51:28 +08:00
Johannes Gäßler
92b1bbd2ec
CUDA: fix RoPE asserts, block sizes (#2833) 2023-08-28 14:23:55 +03:00
Concedo
4b00916ac7 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.dockerignore
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.lock
#	flake.nix
#	tests/CMakeLists.txt
2023-08-28 14:19:05 +08:00
Georgi Gerganov
eaa13a48ff
falcon : fix CUDA inference by making K and Q contiguous (#2830)
* falcon : fix CUDA inference by making K and Q contiguous

ggml-ci

* cuda : add assert to guard from non-cont ropes
2023-08-27 16:40:48 +03:00
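The contiguity fix above concerns views produced by permute/reshape: such views carry strides rather than densely packed data, and kernels that index rows linearly misread them. `ggml_cont` materializes a view into packed memory; the standalone C++ sketch below models that idea (the `View` struct is illustrative, not ggml's tensor layout):

```cpp
#include <cstdio>
#include <vector>

// A strided "view" of a row-major matrix: transposing swaps the strides
// without moving data, which leaves the view non-contiguous.
struct View {
    const float * data;
    int rows, cols;
    int row_stride, col_stride; // in elements
    float at(int r, int c) const { return data[r * row_stride + c * col_stride]; }
};

// The analogue of ggml_cont(): copy the view into a freshly allocated
// contiguous row-major buffer that kernels can index linearly.
std::vector<float> make_contiguous(const View & v) {
    std::vector<float> out((size_t)v.rows * v.cols);
    for (int r = 0; r < v.rows; ++r)
        for (int c = 0; c < v.cols; ++c)
            out[(size_t)r * v.cols + c] = v.at(r, c);
    return out;
}

int main() {
    const float m[6] = {0, 1, 2, 3, 4, 5};                   // 2x3, row-major
    View t = {m, 3, 2, /*row_stride=*/1, /*col_stride=*/3};  // transposed view
    for (float x : make_contiguous(t)) printf("%g ", x);     // 0 3 1 4 2 5
    printf("\n");
}
```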
Kawrakow
a6d1189fdd
k_quants tuning for Falcon-7b (#2816)
* Make ggml-cuda.cu build with QK_K = 64

Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq, it runs and produces
a meaningful result.

* k_quants tuning for Falcon-7b

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-27 15:19:59 +03:00
Henri Vasserman
6bbc598a63
ROCm Port (#1087)
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a (see the sketch after this entry)
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non-CUDA/HIP

---------

Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25 12:09:42 +03:00
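The `__dp4a` intrinsic flagged above computes a dot product of four packed signed 8-bit lanes plus a 32-bit accumulator, the core operation behind the quantized mat-mul kernels; the port adds an AMD-optimized equivalent. A standalone C++ model of what the intrinsic computes (function name hypothetical):

```cpp
#include <cstdint>
#include <cstdio>

// Model of dp4a: multiply the four signed 8-bit lanes of a and b pairwise,
// sum the products, and add the running accumulator c.
static int dp4a_fallback(uint32_t a, uint32_t b, int c) {
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = (int8_t)(a >> (8 * i));
        const int8_t bi = (int8_t)(b >> (8 * i));
        c += (int)ai * (int)bi;
    }
    return c;
}

int main() {
    // lanes of a: 1,2,3,4; lanes of b: 10,10,10,10 -> 10+20+30+40 = 100
    printf("%d\n", dp4a_fallback(0x04030201u, 0x0a0a0a0au, 0));
}
```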
Georgi Gerganov
3f460a2b72
cuda : add RoPE kernel for mode == 2 (NeoX) (#2760)
* cuda : add RoPE kernel for mode == 2 (NeoX)

* falcon : do not offload the embeddings layer
2023-08-25 11:55:59 +03:00
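Mode == 2 selects the GPT-NeoX rotary variant: the original RoPE rotates adjacent pairs (x[2i], x[2i+1]), while NeoX pairs each element with the one half the rotary width away (x[i], x[i + d/2]). A simplified standalone C++ sketch of the two pairings, under stated assumptions (single position, base 10000, full-width rotation; not the CUDA kernel itself):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Rotate one head vector x (dimension d) at position pos.
// neox == false: rotate adjacent pairs (x[2i], x[2i+1])   -- original RoPE
// neox == true : rotate split pairs   (x[i],  x[i + d/2]) -- NeoX, mode 2
static void rope(std::vector<float> & x, int pos, bool neox, float base = 10000.0f) {
    const int d = (int)x.size();
    for (int i = 0; i < d / 2; ++i) {
        const float theta = pos * std::pow(base, -2.0f * i / d);
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const int i0 = neox ? i         : 2 * i;
        const int i1 = neox ? i + d / 2 : 2 * i + 1;
        const float x0 = x[i0], x1 = x[i1];
        x[i0] = x0 * c - x1 * s;
        x[i1] = x0 * s + x1 * c;
    }
}

int main() {
    std::vector<float> a = {1, 0, 1, 0};
    std::vector<float> b = a;
    rope(a, 1, false); // adjacent pairing
    rope(b, 1, true);  // NeoX pairing
    for (float v : a) printf("%.4f ", v);
    printf("\n");
    for (float v : b) printf("%.4f ", v);
    printf("\n");
}
```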
Concedo
b8372d4466 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	README.md
#	tests/CMakeLists.txt
2023-08-24 15:21:24 +08:00
Georgi Gerganov
cf658adc83
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps

* llama : check if model architecture is known

* llama : refactor llama_model_load_internal()

* gguf : add KV constant maps

* llm : read arch-specific KVs

* convert : add dummy scores + types

* falcon : load tensor data (CPU only)

* llama : fix loading progress bar

* llama : add arch member to llama_model

* falcon : CPU inference working

* falcon : support non-40B models

* falcon : minor

* llama : minor updates

ggml-ci

* convert-falcon-hf-to-gguf.py : fix special token mapping

* llama.cpp : llama default UNK token = id 0

* llama.cpp : fix bpe tokenizer

* llama.cpp : fix the fix of bpe tokenizer

* ggml : pass eps to ggml_norm

* metal : implement RoPE (mode = 2) + avoid ggml_repeat

* ggml : ggml_repeat always creates new tensor

* falcon : copy-paste self-attention from LLaMA

* metal : print extra compute pipeline info

* falcon : minor changes (still chasing the Metal problem)

* llama.cpp : fix linefeed token

* metal : fix GELU kernel numerical stability by using precise::tanh (see the sketch after this entry)

* metal : temporary workaround for the concurrency optimization bug

* falcon : add CUDA offloading (#2739)

* llama : better model naming and size reporting

* llama : prep new tokenizer support

* llama : advanced BPE tokenizer based on ggllm.cpp implementation

* llama : remove obsolete comment

ggml-ci

* common : remove obsolete BPE API + disable test-tokenizer-1

* llama : revert BPE special-case in llama_byte_to_token()

* cuda : add TODOs for RoPE NeoX implementation

* llama : default special tokens based on vocab type

* perplexity : add log for start of tokenization

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00
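The GELU stability item flagged above concerns Metal's fast-math tanh losing accuracy; ggml's GELU uses the standard tanh approximation. A standalone C++ sketch of the formula, with std::tanh standing in for precise::tanh:

```cpp
#include <cmath>
#include <cstdio>

// Tanh-approximated GELU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))).
// The fix swaps Metal's fast tanh for a precise one; std::tanh plays that
// role in this sketch.
static float gelu(float x) {
    const float k = 0.79788456079f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

int main() {
    const float xs[] = {-3.0f, -1.0f, 0.0f, 1.0f, 3.0f};
    for (float x : xs) {
        printf("gelu(%+.1f) = %+.6f\n", x, gelu(x));
    }
}
```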
Concedo
af170fc2db Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	llama.cpp
#	scripts/sync-ggml.sh
#	tests/test-tokenizer-0.cpp
2023-08-23 17:08:09 +08:00
Johannes Gäßler
c63bb1d16a
CUDA: use mul_mat_q kernels by default (#2683) 2023-08-22 22:47:05 +02:00
Jiahao Li
800c9635b4
Fix CUDA softmax by subtracting max value before exp (#2665) 2023-08-22 20:27:06 +02:00
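The softmax fix in #2665 is the standard numerical-stability trick: softmax is invariant under shifting all logits by a constant, and subtracting the row maximum keeps every argument to exp at or below zero, so nothing overflows. A standalone C++ sketch (not the CUDA kernel itself):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable softmax: subtract the max before exponentiating so the
// largest argument to exp is 0, avoiding overflow to inf (and NaN after the
// normalization) for large logits.
std::vector<float> softmax(const std::vector<float> & logits) {
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> out(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - mx);
        sum += out[i];
    }
    for (float & v : out) v /= sum;
    return out;
}

int main() {
    // Naive exp(1000.f) would overflow; the shifted version is fine.
    for (float p : softmax({1000.0f, 999.0f, 998.0f})) printf("%.4f ", p);
    printf("\n"); // ~0.6652 0.2447 0.0900
}
```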
slaren
1123f7fbdf
ggml-cuda : use graph allocator (#2684)
use a different function for no_alloc to avoid breaking backwards compat; fixes LoRA

remove 512 n_batch limit

fixed 2048 batch size

cleanup

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-22 15:25:19 +02:00
Georgi Gerganov
ef3f333d37
ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)
* ggml : sync latest (SAM + SD operators, CUDA alibi)

ggml-ci

* ggml : fix tabs
2023-08-22 14:22:08 +03:00
Concedo
2d17c22437 functional commit before gguf merge 2023-08-22 18:20:06 +08:00
slaren
097e121e2f
llama : add benchmark example (#2626)
* llama : add benchmark example

* add to examples CMakeLists.txt

* fix msvc build

* add missing include

* add Bessel's correction to stdev calculation (see the sketch after this entry)

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve markdown formatting

* add missing include

* print warning if NDEBUG is not defined

* remove n_prompt and n_gen from the matrix, use each value separately instead

* better checks for non-optimized builds

* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call

* fix json formatting

* add sql output

* add basic cpu and gpu info (linux/cuda only)

* markdown: also show values that differ from the default

* markdown: add build id

* cleanup

* improve formatting

* formatting

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-18 12:44:58 +02:00
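Bessel's correction, flagged in the list above, divides the squared deviations by n - 1 rather than n, compensating for the mean being estimated from the same samples. A standalone C++ sketch of the corrected stdev:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Sample standard deviation with Bessel's correction: divide the sum of
// squared deviations by (n - 1) instead of n, since the mean itself was
// estimated from the same n samples.
static double stdev(const std::vector<double> & xs) {
    const size_t n = xs.size();
    if (n < 2) return 0.0;
    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= n;
    double ss = 0.0;
    for (double x : xs) ss += (x - mean) * (x - mean);
    return std::sqrt(ss / (n - 1));
}

int main() {
    // Sum of squared deviations is 32; 32/7 under the corrected denominator.
    printf("%.4f\n", stdev({2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0})); // ~2.1381
}
```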
Concedo
075d079a72 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	ggml-cuda.cu
#	llama-util.h
#	tests/CMakeLists.txt
2023-08-16 10:43:06 +08:00
Johannes Gäßler
1cd06fa25e
CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596) 2023-08-14 10:41:22 +02:00
Johannes Gäßler
f64d44a9b9
CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590) 2023-08-13 00:24:45 +02:00
Concedo
a07e6dd3ad revert cuda changes as they are buggy 2023-08-09 22:36:41 +08:00
Concedo
f8376c7e61 up ver, fixed compile (+1 squashed commit)
Squashed commits:

[ca51aa9e] up ver
2023-08-09 21:31:24 +08:00
Concedo
ba09f1c807 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	ggml-cuda.cu
2023-08-09 21:18:34 +08:00
Johannes Gäßler
25d43e0eb5
CUDA: tuned mul_mat_q kernels (#2546) 2023-08-09 09:42:34 +02:00
Concedo
159ad9269d up ver, set the cuda pool malloc lookahead back to 5% instead of 2% (+1 squashed commits)
Squashed commits:

[e0f65278] up ver, set the cuda pool malloc lookahead back to 5% instead of 2%
2023-08-09 12:06:42 +08:00
Concedo
cae6a847ad cuda free only for non mmq (+2 squashed commits)
Squashed commit:

[3aca763a] only cuda free for non mmq

[e69a8c9f] revert to pool alloc to try again
2023-08-07 17:12:05 +08:00
Concedo
9f16a4c4ef switch to upstream implementation of pool malloc 2023-08-07 15:16:37 +08:00
Concedo
d442888626 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-08-06 22:47:33 +08:00
Johannes Gäßler
f514d1b306
CUDA: faster k-quant mul_mat_q kernels (#2525) 2023-08-05 18:20:44 +02:00
Cebtenzzre
4329d1acb0
CUDA: use min compute capability of GPUs actually used (#2506) 2023-08-04 17:35:22 +02:00
Cebtenzzre
02f9d96a86
CUDA: check if event is NULL before cudaStreamWaitEvent (#2505)
Fixes #2503
2023-08-04 17:34:32 +02:00
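The #2505 fix guards against waiting on an event that was never created. A hedged host-side sketch (the wrapper name is hypothetical; cudaStreamWaitEvent is the real CUDA runtime call):

```cpp
#include <cuda_runtime.h>

// Waiting on a null event is invalid, so only issue the wait when the
// event actually exists; otherwise skip it.
static void wait_if_valid(cudaStream_t stream, cudaEvent_t event) {
    if (event != nullptr) {
        cudaStreamWaitEvent(stream, event, 0);
    }
}
```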
Johannes Gäßler
468ea24fb4
CUDA: faster non k-quant mul_mat_q kernels (#2483) 2023-08-02 18:04:04 +02:00
Johannes Gäßler
4f6b60c776
CUDA: Fix models with output size != 32000 (#2480) 2023-08-02 16:48:10 +02:00
Concedo
4c90fdc5cd Merge remote-tracking branch 'johannes/cuda-fix-output-size' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
2023-08-02 22:37:41 +08:00
JohannesGaessler
1e64d511d5 CUDA: Fix models with output size != 32000 2023-08-02 10:26:53 +02:00
Concedo
e221843147 trying out mmq
Merge branch 'master' into concedo_experimental

# Conflicts:
#	CMakeLists.txt
#	README.md
2023-07-31 22:51:15 +08:00
Concedo
3e370f83ef Warning: Very experimental merge, do not use until confirmed stable. 2023-07-31 22:33:43 +08:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues (#2453) 2023-07-31 15:44:35 +02:00
Johannes Gäßler
1215ed7d5c
CUDA: Implemented row flattening for non-glm RoPE (#2468) 2023-07-31 14:32:30 +02:00
Johannes Gäßler
2dbf518911
CUDA: fewer memory bank conflicts for mul_mat_q (#2458) 2023-07-31 13:18:51 +02:00
Concedo
82d0695f0f Merge commit '9baf9ef304' into concedo_experimental 2023-07-30 18:18:23 +08:00
Johannes Gäßler
11f3ca06b8
CUDA: Quantized matrix matrix multiplication (#2160)
* mmq implementation for non k-quants

* q6_K

* q2_K

* q3_K

* q4_K

* vdr

* q5_K

* faster q8_1 loading

* loop unrolling

* add __restrict__

* q2_K sc_high

* GGML_CUDA_MMQ_Y

* Updated Makefile

* Update Makefile

* DMMV_F16 -> F16

* Updated README, CMakeLists

* Fix CMakeLists.txt

* Fix CMakeLists.txt

* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
Johannes Gäßler
9baf9ef304
CUDA: faster multi GPU synchronization (#2448) 2023-07-29 23:04:10 +02:00
Concedo
6a054b80b0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	scripts/build-info.sh
2023-07-25 22:55:55 +08:00
Concedo
3e68cdd26a Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-grad0.c
2023-07-25 18:52:48 +08:00
Kawrakow
129d844c87
Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)
* Fix Q4_K and Q5_K for QK_K = 64

* Very slightly better Q5_K bit fiddling

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 13:48:04 +03:00
slaren
41c674161f
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci
2023-07-24 17:57:12 +02:00
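RMSNorm scales each element by 1 / sqrt(mean(x²) + eps); making eps a parameter matters because models expect different values (e.g. 1e-6 vs 1e-5 across LLaMA variants), and the commit also exposes it on the command line. A standalone C++ sketch:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// RMSNorm: divide by the root mean square of the elements, with eps added
// inside the sqrt for stability. eps is a parameter here rather than a
// hardcoded constant, mirroring the change described above.
static void rms_norm(std::vector<float> & x, float eps) {
    double ms = 0.0;
    for (float v : x) ms += (double)v * v;
    ms /= x.size();
    const float scale = 1.0f / std::sqrt((float)ms + eps);
    for (float & v : x) v *= scale;
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    rms_norm(x, 1e-5f);
    for (float v : x) printf("%.4f ", v); // ~0.3651 0.7303 1.0954 1.4606
    printf("\n");
}
```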
Concedo
8a9b40840b Merge branch 'master' into concedo_experimental
# Conflicts:
#	tests/test-grad0.c
#	tests/test-opt.c
2023-07-24 20:51:28 +08:00
Georgi Gerganov
5b2b2dc6ae
ggml : sync (unary ops refactor, static-correctness) (#2370)
* ggml : sync (unary ops, tests)

ggml-ci

* tests : remove unnecessary funcs
2023-07-24 14:46:21 +03:00