Mirror of https://github.com/LostRuins/koboldcpp.git (synced 2025-09-11 17:44:38 +00:00)
Squashed commit of the following:

commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:10:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:09:23 2023 +0800

    added warning message for unsupported K quants

commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date:   Fri Jun 9 04:00:51 2023 -0400

    metal : add GELU implementation (#1770)

    Co-authored-by: Adam Treat <adam@nomic.ai>

commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Fri Jun 9 10:39:59 2023 +0300

    metal : faster q4_0 (#1775)

    * metal : 8% faster q4_0
      Avoid copying into local uchar4 and float4.
    * metal : 17% faster Q4_0
      Use 64 threads in a thread group.

    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:53:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:38:31 2023 +0800

    merged metal patch directly into the file

commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 22:28:21 2023 +0300

    metal : add Q2_K implementation (#1762)

    * metal : add Q2_K implementation
      27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0.
      Memory throughput is ~156 GB/s.
      The access pattern used in the Q2_K CUDA implementation resulted in
      significantly lower performance (~31 ms/token).
    * Fixing merge conflicts

    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 20:48:14 2023 +0300

    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"

    This reverts commit 8432d4d9f7.

commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date:   Fri Jun 9 00:47:56 2023 +0800

    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)

commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:47:36 2023 +0900

    Metal inference enhancement - put hard-wired relative path of the
    ggml-model.model file using a patch file due to lack of NSBundle environment

commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 19:46:22 2023 +0300

    metal : Q6_K implementation (#1752)

    * Metal implementation for Q4_K
      Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my
      30-core M2 Max GPU.
    * Optimizing Q4_K on metal
      The first token always takes longer, I guess because the metal kernel
      is being jit-compiled. So, using n = 128 to measure time.
      At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token
      for Q4_0. Quite a bit better than the initial attempt, but still not
      good enough.
    * Optimizing q4_K metal dot some more
      For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.
    * Fix after merge with master
    * Metal implementation for Q6_K
      Similar to the CUDA implementation. No idea if this is the optimum for
      Metal, but the few alternative variants I tried all had a lower
      performance.
      We get 36.5 ms / token on M2 Max with 30 GPU cores. This corresponds
      to ~200 GB/second throughput.
    * clang-tidy : add config back
    * Much better Q6_K implementation for metal
      28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other
      compute graph operations, we are left with ~19 ms for the matrix
      multiplications. The model is ~5.5 GB, so we are getting
      1000 / 19 * 5.5 = 290 GB/s!

    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:24:22 2023 +0900

    Metal inference enhancement - put hard-wired relative path of the
    ggml-model.model file due to lack of NSBundle environment

commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date:   Thu Jun 8 15:58:53 2023 +0800

    Add llama.cpp docker support for non-latin languages (#1673)

    * Modify Dockerfile default character set to improve compatibility (#1673)

commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date:   Thu Jun 8 00:12:28 2023 -0700

    ggml : fix fprintf warnings (#1720)

commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 10:09:08 2023 +0300

    clang-tidy : restore dot file from accidental deletion

commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 10:08:23 2023 +0300

    metal : add Q4_K implementation (#1733)

    * Metal implementation for Q4_K
      Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my
      30-core M2 Max GPU.
    * Optimizing Q4_K on metal
      The first token always takes longer, I guess because the metal kernel
      is being jit-compiled. So, using n = 128 to measure time.
      At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token
      for Q4_0. Quite a bit better than the initial attempt, but still not
      good enough.
    * Optimizing q4_K metal dot some more
      For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.
    * Fix after merge with master

    ---------
    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date:   Thu Jun 8 08:02:48 2023 +0100

    k-quants : add missing compile definition to CMakeLists (#1748)
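The effective-bandwidth figures quoted in the Q4_K and Q6_K commit messages above come from a simple estimate: bandwidth is roughly the model size divided by the per-token time spent in the matrix multiplications. A minimal C sketch of that arithmetic, using the numbers from the Q6_K message (the program is illustrative only and is not part of the repository):

    /* Illustrative only: reproduces the bandwidth estimate from the Q6_K commit
       message above (28.3 ms/token total, ~9 ms of non-matmul graph work,
       ~5.5 GB of quantized weights for a 7B model). */
    #include <stdio.h>

    int main(void) {
        const double ms_per_token  = 28.3; /* measured total time per token  */
        const double ms_overhead   = 9.0;  /* time spent outside the matmuls */
        const double model_size_gb = 5.5;  /* quantized 7B model size in GB  */

        const double ms_matmul = ms_per_token - ms_overhead;         /* ~19.3 ms  */
        const double gb_per_s  = model_size_gb * 1000.0 / ms_matmul; /* ~285 GB/s */

        printf("effective memory bandwidth: ~%.0f GB/s\n", gb_per_s);
        return 0;
    }

The commit message rounds the matmul time down to 19 ms, which yields the quoted ~290 GB/s.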
parent dee692a63e
commit 4f665cd63d

4 changed files with 620 additions and 36 deletions
@@ -1028,6 +1028,14 @@ static void llama_model_load_internal(
         }
     }
 
+#if defined(GGML_USE_CLBLAST)
+    if (file_version == LLAMA_FILE_VERSION_GGJT_V3) {
+        if (hparams.ftype >= LLAMA_FTYPE_MOSTLY_Q2_K && hparams.ftype <= LLAMA_FTYPE_MOSTLY_Q6_K) {
+            printf("\n===\nK-Quants are currently not supported with CLBlast!!!\nPlease select a q4_0, q4_1, q5_0 or q5_1 format instead!\n=====\n");
+        }
+    }
+#endif
+
     if (vocab_only) {
         return;
     }
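For context on the range check in the hunk above: in the llama.h of this period, the K-quant file types occupy a contiguous block of the llama_ftype enum, so a single pair of comparisons covers Q2_K through Q6_K, including the Q3_K/Q4_K/Q5_K size variants that sit between them. A hedged sketch of that idea as a standalone predicate (the helper name is made up; only the two enum constants come from the actual header):

    /* Sketch only, not code from this commit: detect K-quant file types by relying
       on LLAMA_FTYPE_MOSTLY_Q2_K .. LLAMA_FTYPE_MOSTLY_Q6_K forming a contiguous
       range in the llama_ftype enum. */
    #include <stdbool.h>
    #include "llama.h"

    static bool uses_k_quants(enum llama_ftype ftype) {
        return ftype >= LLAMA_FTYPE_MOSTLY_Q2_K && ftype <= LLAMA_FTYPE_MOSTLY_Q6_K;
    }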