Concedo
ec21fa7712
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .gitignore
# CMakeLists.txt
# Makefile
# Package.swift
# README.md
# ggml-cuda.cu
# llama.cpp
# llama.h
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
2023-12-08 17:42:26 +08:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
...
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
...
* per-layer KV
* remove unnecessary copies
* less code duplication, offload k and v separately
* llama : offload KV cache per-layer
* llama : offload K shift tensors
* llama : offload for rest of the model arches
* llama : enable offload debug temporarily
* llama : keep the KV related layers on the device
* llama : remove mirrors, perform Device -> Host when partial offload
* common : add command-line arg to disable KV cache offloading
* llama : update session save/load
* llama : support quantum K cache (#4312)
* llama : support quantum K cache (wip)
* metal : add F32 -> Q8_0 copy kernel
* cuda : add F32 -> Q8_0 copy kernel
ggml-ci
* cuda : use mmv kernel for quantum cache ops
* llama : pass KV cache type through API
* llama : fix build
ggml-ci
* metal : add F32 -> Q4_0 copy kernel
* metal : add F32 -> Q4_1 copy kernel
* cuda : wip
* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
* llama-bench : support type_k/type_v
* metal : use mm kernel only for quantum KV cache
* cuda : add comment
* llama : remove memory_f16 and kv_f16 flags
---------
Co-authored-by: slaren <slarengh@gmail.com>
* readme : add API change notice
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Concedo
48544cd2ef
Revert "Revert "ggml : add ggml_soft_max_ext ( #4256 )""
...
This reverts commit a8e66ef31c.
2023-12-03 21:46:50 +08:00
Concedo
a8e66ef31c
Revert "ggml : add ggml_soft_max_ext ( #4256 )"
...
This reverts commit ef47ec18da.
2023-12-03 00:42:01 +08:00
Concedo
495bb3ab1e
Merge branch 'master' into concedo_experimental
2023-12-01 23:48:20 +08:00
Georgi Gerganov
ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
...
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
2023-12-01 10:51:24 +02:00
Concedo
8acd7be734
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
2023-11-27 14:06:14 +08:00
slaren
8a052c131e
ggml-cuda : support stablelm rope (#4156)
...
* ggml-cuda : support stablelm rope
* remove unused freq_base kernel parameter
* add n_dims parameter to llm_build_k_shift, default to n_rot via overload
* llama : fix llm_build_k_shift args
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-24 18:04:31 +01:00
Haohui Mai
55978ce09b
Fix incorrect format strings and uninitialized variables. (#4133)
...
* Fix incorrect format strings and uninitialized variables.
* Address comments
* Add the missing include statement
2023-11-23 22:56:53 +01:00
Concedo
56a5fa7a60
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# tests/test-tokenizer-0-falcon.py
# tests/test-tokenizer-0-llama.py
2023-11-20 22:37:06 +08:00
Kerfuffle
2923f17f6f
Clean up ggml-cuda.cu warnings when compiling with clang (for ROCM) (#4124)
...
* ggml-cuda.cu: Clean up warnings when compiling with clang
* ggml-cuda.cu: Move static items into anonymous namespace
* ggml-cuda.cu: Fix use of namespace start macro
* Revert "ggml-cuda.cu: Fix use of namespace start macro"
This reverts commit 26c11490266c096e3e5731e05270a8f73a5b2874.
* Revert "ggml-cuda.cu: Move static items into anonymous namespace"
This reverts commit e29757e0f7535d1ac314300f0324684cc785e06c.
2023-11-18 08:11:18 -07:00
Concedo
6bf8ee4aea
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# ggml-cuda.cu
# tests/test-tokenizer-0-falcon.py
# tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
Andrew Godfrey
b83e149ec6
cuda : get_row_rounding F32 (#4095)
...
* Fix #4017
* Update ggml-cuda.cu
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-11-17 10:01:15 +02:00
Georgi Gerganov
4f447a4833
llama : fix data units (#4101)
...
* llama : fix data units
ggml-ci
* Revert "llama : fix data units"
This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.
* llama : disambiguate data units
ggml-ci
2023-11-17 10:00:15 +02:00
slaren
1cf2850d52
ggml-cuda : increase max graph size (#4084)
2023-11-15 14:58:13 +02:00
Concedo
35a97e14b2
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# docs/token_generation_performance_tips.md
# grammars/README.md
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-grad0.cpp
# tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
Georgi Gerganov
3d68f364f1
ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060)
...
ggml-ci
2023-11-13 16:55:52 +02:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
...
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Kerfuffle
bb50a792ec
Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041)
...
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
2023-11-13 01:58:15 -07:00
Concedo
f277ed0e8c
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
2023-11-07 15:23:08 +08:00
Meng Zhang
46876d2a2c
cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946)
...
* prototyping support for running on CPU with a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
2023-11-07 08:49:08 +02:00
Concedo
78ca0667a4
Merge branch 'master' into concedo_experimental
2023-11-06 16:58:58 +08:00
slaren
2833a6f63c
ggml-cuda : fix f16 mul mat (#3961)
...
* ggml-cuda : fix f16 mul mat
ggml-ci
* silence common.cpp warning (bonus)
2023-11-05 18:45:16 +01:00
Jared Van Bortel
132d25b8a6
cuda : fix disabling device with --tensor-split 1,0 (#3951)
...
Co-authored-by: slaren <slarengh@gmail.com>
2023-11-05 10:08:57 -05:00
Concedo
a62468ec4c
Merge branch 'master' into concedo_experimental
...
should fix multi-GPU
2023-11-05 22:14:40 +08:00
slaren
48ade94538
cuda : revert CUDA pool stuff (#3944)
...
* Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918 )"
This reverts commit 629f917cd6
.
* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903 )"
This reverts commit d6069051de
.
ggml-ci
2023-11-05 09:12:13 +02:00
Concedo
1e7088a80b
autopick cublas in gui if possible, better layer picking logic
2023-11-05 01:35:27 +08:00
Concedo
38471fbe06
tensor core info better printout (+1 squashed commits)
...
Squashed commits:
[be4ef93f] tensor core info better printout
2023-11-04 08:38:25 +08:00
Concedo
9bc2e35b2e
Merge branch 'master' into concedo_experimental
2023-11-03 23:51:32 +08:00
slaren
abb77e7319
ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921)
2023-11-03 12:13:09 +01:00
Concedo
c07c9b857d
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-11-03 11:17:07 +08:00
Kerfuffle
629f917cd6
cuda : add ROCM aliases for CUDA pool stuff (#3918)
2023-11-02 21:58:22 +02:00
Georgi Gerganov
c7743fe1c1
cuda : fix const ptrs warning causing ROCm build issues (#3913)
2023-11-02 20:32:11 +02:00
Oleksii Maryshchenko
d6069051de
cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)
...
* Using CUDA memory pools for async alloc/dealloc.
* If the CUDA device doesn't support memory pools, fall back to the old implementation.
* Removed redundant cublasSetStream
---------
Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
2023-11-02 19:10:39 +02:00
Concedo
bc4ff72317
not working merge
2023-11-02 17:52:40 +08:00
Georgi Gerganov
4d719a6d4e
cuda : check if this fixes Pascal card regression (#3882)
2023-11-02 08:35:10 +02:00
cebtenzzre
2fffa0d61f
cuda : fix RoPE after #2268 (#3897)
2023-11-02 07:49:44 +02:00
slaren
d480d2c204
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
...
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
(cherry picked from commit d02e98cde0)
2023-11-02 11:19:53 +08:00
Concedo
1ab18ecb53
Merge commit 'c43c2da8af' into concedo_experimental
...
# Conflicts:
# llama.cpp
2023-11-02 11:17:59 +08:00
slaren
d02e98cde0
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
...
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
2023-11-01 23:10:09 +01:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
...
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
...
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the README, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Concedo
bcb397953f
Merge remote-tracking branch 'llama.cpp/try-fix-3869' into concedo_experimental
2023-11-01 18:29:08 +08:00
Georgi Gerganov
22cc9bef09
cuda : check if this fixes Pascal card regression
2023-10-31 20:01:47 +02:00
Concedo
6cf2b4c73b
MMQ optimizations (+1 squashed commits)
...
Squashed commits:
[d87de001] mmq optimization (+1 squashed commits)
Squashed commits:
[f1f67af8] still allow mmq
2023-10-28 17:57:46 +08:00
Concedo
2ea3b567cf
Merge: Testing speed of tensor cores vs MMQ
2023-10-28 16:41:42 +08:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance (#3776)
...
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Concedo
5db89b90b7
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .gitignore
# CMakeLists.txt
# Makefile
# README.md
# build.zig
# ggml-opencl.cpp
# tests/CMakeLists.txt
# tests/test-double-float.cpp
# tests/test-sampling.cpp
2023-10-25 23:58:15 +08:00
Georgi Gerganov
6961c4bd0b
batched-bench : print params at start
2023-10-25 10:26:27 +03:00