koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-11 01:24:36 +00:00

Author	SHA1	Message	Date
Concedo	6bf8ee4aea	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # ggml-cuda.cu # tests/test-tokenizer-0-falcon.py # tests/test-tokenizer-0-llama.py	2023-11-18 11:10:45 +08:00
Andrew Godfrey	b83e149ec6	cuda : get_row_rounding F32 (#4095 ) * Fix #4017 * Update ggml-cuda.cu Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update ggml-cuda.cu Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-11-17 10:01:15 +02:00
Georgi Gerganov	4f447a4833	llama : fix data units (#4101 ) * llama : fix data units ggml-ci * Revert "llama : fix data units" This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4. * llama : disambiguate data units ggml-ci	2023-11-17 10:00:15 +02:00
slaren	1cf2850d52	ggml-cuda : increase max graph size (#4084 )	2023-11-15 14:58:13 +02:00
Concedo	35a97e14b2	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # README.md # docs/token_generation_performance_tips.md # grammars/README.md # scripts/sync-ggml.sh # tests/CMakeLists.txt # tests/test-grad0.cpp # tests/test-opt.cpp	2023-11-15 16:59:53 +08:00
Georgi Gerganov	3d68f364f1	ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060 ) ggml-ci	2023-11-13 16:55:52 +02:00
Georgi Gerganov	4760e7cc0b	sync : ggml (backend v2) (#3912 ) * sync : ggml (backend v2) (wip) * sync : migrate examples and llama.cpp to dynamic graphs (wip) * sync : update tests + fix max op params to 64 ggml-ci * sync : ggml-cuda ggml-ci * llama : fix save/load state context size ggml-ci * sync : try to fix build on tvOS * sync : pass custom graph sizes in training examples * sync : update graph copies to new ggml API * sync : update sync-ggml.sh with new files * scripts : fix header in sync script * train : fix context size calculations * llama : increase inference graph size up to 4096 nodes * train : allocate grads for backward graphs * train : allocate grads for gb_tmp	2023-11-13 14:16:23 +02:00
Kerfuffle	bb50a792ec	Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041 ) * Add ReLU and SQR CUDA ops to fix Persimmon offloading * Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers	2023-11-13 01:58:15 -07:00
Concedo	f277ed0e8c	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile	2023-11-07 15:23:08 +08:00
Meng Zhang	46876d2a2c	cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946 ) * protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build * doc: add comments to ggml_cublas_loaded() * fix defined(...)	2023-11-07 08:49:08 +02:00
Concedo	78ca0667a4	Merge branch 'master' into concedo_experimental	2023-11-06 16:58:58 +08:00
slaren	2833a6f63c	ggml-cuda : fix f16 mul mat (#3961 ) * ggml-cuda : fix f16 mul mat ggml-ci * silence common.cpp warning (bonus)	2023-11-05 18:45:16 +01:00
Jared Van Bortel	132d25b8a6	cuda : fix disabling device with --tensor-split 1,0 (#3951 ) Co-authored-by: slaren <slarengh@gmail.com>	2023-11-05 10:08:57 -05:00
Concedo	a62468ec4c	Merge branch 'master' into concedo_experimental should fix multigpu	2023-11-05 22:14:40 +08:00
slaren	48ade94538	cuda : revert CUDA pool stuff (#3944 ) * Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918)" This reverts commit 629f917cd6. * Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)" This reverts commit d6069051de. ggml-ci	2023-11-05 09:12:13 +02:00
Concedo	1e7088a80b	autopick cublas in gui if possible, better layer picking logic	2023-11-05 01:35:27 +08:00
Concedo	38471fbe06	tensor core info better printout (+1 squashed commits) Squashed commits: [be4ef93f] tensor core info better printout	2023-11-04 08:38:25 +08:00
Concedo	9bc2e35b2e	Merge branch 'master' into concedo_experimental	2023-11-03 23:51:32 +08:00
slaren	abb77e7319	ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921 )	2023-11-03 12:13:09 +01:00
Concedo	c07c9b857d	Merge branch 'master' into concedo_experimental # Conflicts: # README.md	2023-11-03 11:17:07 +08:00
Kerfuffle	629f917cd6	cuda : add ROCM aliases for CUDA pool stuff (#3918 )	2023-11-02 21:58:22 +02:00
Georgi Gerganov	c7743fe1c1	cuda : fix const ptrs warning causing ROCm build issues (#3913 )	2023-11-02 20:32:11 +02:00
Oleksii Maryshchenko	d6069051de	cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903 ) * Using cuda memory pools for async alloc/dealloc. * If cuda device doesnt support memory pool than use old implementation. * Removed redundant cublasSetStream --------- Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>	2023-11-02 19:10:39 +02:00
Concedo	bc4ff72317	not working merge	2023-11-02 17:52:40 +08:00
Georgi Gerganov	4d719a6d4e	cuda : check if this fixes Pascal card regression (#3882 )	2023-11-02 08:35:10 +02:00
cebtenzzre	2fffa0d61f	cuda : fix RoPE after #2268 (#3897 )	2023-11-02 07:49:44 +02:00
slaren	d480d2c204	ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891 ) * ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel * fix warnings (cherry picked from commit d02e98cde0)	2023-11-02 11:19:53 +08:00
Concedo	1ab18ecb53	Merge commit 'c43c2da8af' into concedo_experimental # Conflicts: # llama.cpp	2023-11-02 11:17:59 +08:00
slaren	d02e98cde0	ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891 ) * ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel * fix warnings	2023-11-01 23:10:09 +01:00
cebtenzzre	898aeca90a	llama : implement YaRN RoPE scaling (#2268 ) Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>	2023-11-01 18:04:33 -04:00
Andrew Godfrey	73bdcb395e	finetune : add -ngl parameter (#3762 ) * Add '-ngl' support to finetune.cpp * Add fprintf in ggml_cuda_op_add When I tried CUDA offloading during finetuning following the readme, I got an assert here. This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora * Add 'finetune.sh', which currently fails when using GPU "error: operator (): Finetuning on tensors with type 'f16' is not yet supported" * tweak finetune.sh * Suppress some warnings in ggml.c * Add f16 implementation to ggml_compute_forward_add_f16_f32 * Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs * finetune.sh: Edit comments * Add "add_f16_f32_f32_cuda" * Tweak an error message * finetune.sh: Add an optional LLAMA_MODEL_DIR variable * finetune.sh: Add an optional LLAMA_TRAINING_DIR variable * train : minor * tabs to spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>	2023-11-01 13:49:04 +02:00
Concedo	bcb397953f	Merge remote-tracking branch 'llama.cpp/try-fix-3869' into concedo_experimental	2023-11-01 18:29:08 +08:00
Georgi Gerganov	22cc9bef09	cuda : check if this fixes Pascal card regression	2023-10-31 20:01:47 +02:00
Concedo	6cf2b4c73b	MMQ optimizations (+1 squashed commits) Squashed commits: [d87de001] mmq optimization (+1 squashed commits) Squashed commits: [f1f67af8] still allow mmq	2023-10-28 17:57:46 +08:00
Concedo	2ea3b567cf	Merge: Testing speed of tensor cores vs MMQ	2023-10-28 16:41:42 +08:00
Georgi Gerganov	2f9ec7e271	cuda : improve text-generation and batched decoding performance (#3776 ) * cuda : prints wip * cuda : new cublas gemm branch for multi-batch quantized src0 * cuda : add F32 sgemm branch * cuda : fine-tune >= VOLTA params + use MMQ only for small batches * cuda : remove duplicated cuBLAS GEMM code * cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros * build : add compile option to force use of MMQ kernels	2023-10-27 17:01:23 +03:00
Concedo	5db89b90b7	Merge branch 'master' into concedo_experimental # Conflicts: # .gitignore # CMakeLists.txt # Makefile # README.md # build.zig # ggml-opencl.cpp # tests/CMakeLists.txt # tests/test-double-float.cpp # tests/test-sampling.cpp	2023-10-25 23:58:15 +08:00
Georgi Gerganov	6961c4bd0b	batched-bench : print params at start	2023-10-25 10:26:27 +03:00
Georgi Gerganov	b2f7e04bd3	sync : ggml (conv ops + cuda MSVC fixes) (#3765 ) ggml-ci	2023-10-24 21:51:20 +03:00
Georgi Gerganov	2b4ea35e56	cuda : add batched cuBLAS GEMM for faster attention (#3749 ) * cmake : add helper for faster CUDA builds * batched : add NGL arg * ggml : skip nops in compute_forward * cuda : minor indentation * cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops) * Apply suggestions from code review These changes plus: ```c++ #define cublasGemmBatchedEx hipblasGemmBatchedEx ``` are needed to compile with ROCM. I haven't done performance testing, but it seems to work. I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up. * cuda : add ROCm / hipBLAS cublasGemmBatchedEx define * cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases * cuda : reduce mallocs in cublasGemmBatchedEx branch * cuda : add TODO for calling cublas from kernel + using mem pool --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-10-24 16:48:37 +03:00
Concedo	a723466d50	Merge branch 'master' into concedo_experimental	2023-10-10 17:21:42 +08:00
Jan Ploski	f5f9121de1	llm : add MPT support (#3417 ) * CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545) * mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt * mpt : protect against "clip_qkv": null in mpt-7b * mpt : quick fix to avoid "Strange model" warning when quantizing MPT models * mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?) * mpt : standardized all tensor names to follow GGUF spec * mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code * mpt : fixed comment s/gptneox/mpt/ * mpt : remove tabs, trailing whitespace * mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt * mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252 * comment out n_past instead of marking it unused * mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"] * mpt : remove unused tokenizer_json in convert script * ggml : remove obsolete n_past assert in ggml_alibi * llama : print clam_kqv and max_alibi_bias hparams --------- Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-10 10:50:23 +03:00
Concedo	f288c6b5e3	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # build.zig # scripts/sync-ggml.sh	2023-10-10 00:09:46 +08:00
Georgi Gerganov	db3abcc114	sync : ggml (ggml-backend) (#3548 ) * sync : ggml (ggml-backend) ggml-ci * zig : add ggml-backend to the build	2023-10-08 20:19:14 +03:00
Concedo	7ab01ee3c6	Merge branch 'master' into concedo_experimental	2023-10-01 10:22:05 +08:00
slaren	f5ef5cfb18	ggml-cuda : perform cublas mat mul of quantized types as f16 (#3412 ) * ggml-cuda : perform cublas matrix multiplication of quantized types as fp16 * rename CC_TURING to CC_VOLTA * disable fp16 mat mul completely with multi GPU	2023-09-30 18:12:57 +02:00
Concedo	5e6450161a	functional merge	2023-09-30 12:31:57 +08:00
Concedo	b84e210f0d	merge new rope param nonsense	2023-09-30 11:33:30 +08:00
Concedo	033e3bf844	prepare to merge parallel	2023-09-29 10:30:45 +08:00
slaren	16bc66d947	llama.cpp : split llama_context_params into model and context params (#3301 ) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used	2023-09-28 22:42:38 +03:00

200 commits