Concedo
ec21fa7712
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .gitignore
# CMakeLists.txt
# Makefile
# Package.swift
# README.md
# ggml-cuda.cu
# llama.cpp
# llama.h
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
2023-12-08 17:42:26 +08:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
...
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
...
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Concedo
48544cd2ef
Revert "Revert "ggml : add ggml_soft_max_ext ( #4256 )""
...
This reverts commit a8e66ef31c.
2023-12-03 21:46:50 +08:00
Concedo
a8e66ef31c
Revert "ggml : add ggml_soft_max_ext ( #4256 )"
...
This reverts commit ef47ec18da.
2023-12-03 00:42:01 +08:00
Concedo
495bb3ab1e
Merge branch 'master' into concedo_experimental
2023-12-01 23:48:20 +08:00
Georgi Gerganov
ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
...
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
2023-12-01 10:51:24 +02:00
Concedo
8acd7be734
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
2023-11-27 14:06:14 +08:00
slaren
8a052c131e
ggml-cuda : support stablelm rope (#4156)
...
* ggml-cuda : support stablelm rope
* remove unused freq_base kernel parameter
* add n_dims parameter to llm_build_k_shift, default to n_rot via overload
* llama : fix llm_build_k_shift args
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-24 18:04:31 +01:00
Haohui Mai
55978ce09b
Fix incorrect format strings and uninitialized variables. (#4133)
...
* Fix incorrect format strings and uninitialized variables.
* Address comments
* Add the missing include statement
2023-11-23 22:56:53 +01:00
Concedo
56a5fa7a60
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# tests/test-tokenizer-0-falcon.py
# tests/test-tokenizer-0-llama.py
2023-11-20 22:37:06 +08:00
Kerfuffle
2923f17f6f
Clean up ggml-cuda.cu warnings when compiling with clang (for ROCM) (#4124)
...
* ggml-cuda.cu: Clean up warnings when compiling with clang
* ggml-cuda.cu: Move static items into anonymous namespace
* ggml-cuda.cu: Fix use of namespace start macro
* Revert "ggml-cuda.cu: Fix use of namespace start macro"
This reverts commit 26c11490266c096e3e5731e05270a8f73a5b2874.
* Revert "ggml-cuda.cu: Move static items into anonymous namespace"
This reverts commit e29757e0f7535d1ac314300f0324684cc785e06c.
2023-11-18 08:11:18 -07:00
Concedo
6bf8ee4aea
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# ggml-cuda.cu
# tests/test-tokenizer-0-falcon.py
# tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
Andrew Godfrey
b83e149ec6
cuda : get_row_rounding F32 (#4095)
...
* Fix #4017
* Update ggml-cuda.cu
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-11-17 10:01:15 +02:00
Georgi Gerganov
4f447a4833
llama : fix data units (#4101)
...
* llama : fix data units
ggml-ci
* Revert "llama : fix data units"
This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.
* llama : disambiguate data units
ggml-ci
2023-11-17 10:00:15 +02:00
slaren
1cf2850d52
ggml-cuda : increase max graph size (#4084)
2023-11-15 14:58:13 +02:00
Concedo
35a97e14b2
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# docs/token_generation_performance_tips.md
# grammars/README.md
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-grad0.cpp
# tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
Georgi Gerganov
3d68f364f1
ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060)
...
ggml-ci
2023-11-13 16:55:52 +02:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
...
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Kerfuffle
bb50a792ec
Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041)
...
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
2023-11-13 01:58:15 -07:00
Concedo
f277ed0e8c
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
2023-11-07 15:23:08 +08:00
Meng Zhang
46876d2a2c
cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946)
...
* prototyping support for running on CPU with a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
2023-11-07 08:49:08 +02:00
Concedo
78ca0667a4
Merge branch 'master' into concedo_experimental
2023-11-06 16:58:58 +08:00
slaren
2833a6f63c
ggml-cuda : fix f16 mul mat (#3961)
...
* ggml-cuda : fix f16 mul mat
ggml-ci
* silence common.cpp warning (bonus)
2023-11-05 18:45:16 +01:00
Jared Van Bortel
132d25b8a6
cuda : fix disabling device with --tensor-split 1,0 (#3951)
...
Co-authored-by: slaren <slarengh@gmail.com>
2023-11-05 10:08:57 -05:00
Concedo
a62468ec4c
Merge branch 'master' into concedo_experimental
...
should fix multi-GPU
2023-11-05 22:14:40 +08:00
slaren
48ade94538
cuda : revert CUDA pool stuff (#3944)
...
* Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918 )"
This reverts commit 629f917cd6
.
* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903 )"
This reverts commit d6069051de
.
ggml-ci
2023-11-05 09:12:13 +02:00
Concedo
1e7088a80b
autopick cublas in gui if possible, better layer picking logic
2023-11-05 01:35:27 +08:00
Concedo
38471fbe06
tensor core info better printout (+1 squashed commits)
...
Squashed commits:
[be4ef93f] tensor core info better printout
2023-11-04 08:38:25 +08:00
Concedo
9bc2e35b2e
Merge branch 'master' into concedo_experimental
2023-11-03 23:51:32 +08:00
slaren
abb77e7319
ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921)
2023-11-03 12:13:09 +01:00
Concedo
c07c9b857d
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-11-03 11:17:07 +08:00
Kerfuffle
629f917cd6
cuda : add ROCM aliases for CUDA pool stuff (#3918)
2023-11-02 21:58:22 +02:00
Georgi Gerganov
c7743fe1c1
cuda : fix const ptrs warning causing ROCm build issues (#3913)
2023-11-02 20:32:11 +02:00
Oleksii Maryshchenko
d6069051de
cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)
...
* Using CUDA memory pools for async alloc/dealloc.
* If the CUDA device doesn't support memory pools, fall back to the old implementation.
* Removed redundant cublasSetStream
---------
Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
2023-11-02 19:10:39 +02:00
Concedo
bc4ff72317
not working merge
2023-11-02 17:52:40 +08:00
Georgi Gerganov
4d719a6d4e
cuda : check if this fixes Pascal card regression (#3882)
2023-11-02 08:35:10 +02:00
cebtenzzre
2fffa0d61f
cuda : fix RoPE after #2268 (#3897)
2023-11-02 07:49:44 +02:00
slaren
d480d2c204
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
...
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
(cherry picked from commit d02e98cde0)
2023-11-02 11:19:53 +08:00
Concedo
1ab18ecb53
Merge commit 'c43c2da8af' into concedo_experimental
...
# Conflicts:
# llama.cpp
2023-11-02 11:17:59 +08:00
slaren
d02e98cde0
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
...
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
2023-11-01 23:10:09 +01:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
...
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
...
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the README, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Concedo
bcb397953f
Merge remote-tracking branch 'llama.cpp/try-fix-3869' into concedo_experimental
2023-11-01 18:29:08 +08:00
Georgi Gerganov
22cc9bef09
cuda : check if this fixes Pascal card regression
2023-10-31 20:01:47 +02:00
Concedo
6cf2b4c73b
MMQ optimizations (+1 squashed commits)
...
Squashed commits:
[d87de001] mmq optimization (+1 squashed commits)
Squashed commits:
[f1f67af8] still allow mmq
2023-10-28 17:57:46 +08:00
Concedo
2ea3b567cf
Merge: Testing speed of tensor cores vs MMQ
2023-10-28 16:41:42 +08:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance (#3776)
...
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Concedo
5db89b90b7
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .gitignore
# CMakeLists.txt
# Makefile
# README.md
# build.zig
# ggml-opencl.cpp
# tests/CMakeLists.txt
# tests/test-double-float.cpp
# tests/test-sampling.cpp
2023-10-25 23:58:15 +08:00
Georgi Gerganov
6961c4bd0b
batched-bench : print params at start
2023-10-25 10:26:27 +03:00