Concedo
cee8042793
integrated new version of clblast kernels as a separate file
2023-05-13 12:53:28 +08:00
Concedo
53e7256a25
should be good to merge; the only thing missing is the new CLBlast quants
2023-05-13 12:07:29 +08:00
Concedo
05cf5f7d6e
partially working, but the blas matmul is broken
2023-05-13 11:35:38 +08:00
Concedo
b335f73a60
BACKWARDS COMPAT QUANT SHIM is ready, but upstream model converter is BORKED. BORK BORK.
2023-05-13 01:30:11 +08:00
Concedo
08810d5fee
interim merge. do not use
2023-05-13 00:33:55 +08:00
Concedo
e9caff1cda
Interim merge. Do not use.
...
Merge branch 'master' into concedo_experimental
# Conflicts:
# README.md
# SHA256SUMS
# examples/quantize/quantize.cpp
# ggml-opencl.c
# ggml.c
# ggml.h
# llama.cpp
# llama.h
2023-05-12 23:20:27 +08:00
Georgi Gerganov
b9fd7eee57
ggml : remove bit shuffling ( #1405 )
...
* ggml : remove Q4_0 bit shuffling (ARM NEON)
* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)
* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)
* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)
* ggml : remove Q5_0 bit shuffling (ARM NEON)
* ggml : 2x faster scalar implementations
* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)
* ggml : simplify scalar dot
* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit
* ggml : fix Q4_1 quantization
* ggml : update cuBLAS + normalize variable names
* ggml : remove Q4_2 mode
* ggml : minor formatting
* ggml : fix Q5_0 quantization
* scripts : add script for measuring the time per token
* AVX implementations (#1370 )
* ggml : uniform 5th bit extraction
* llama : produce error upon loading old model files
* llama : fix model magic/version write
* ggml : speed-up Q5_0 + Q5_1 at 4 threads
* ggml : preserve old Q4 and Q5 formats
* ggml : simplify Q8_1 - no need for low / high sums anymore
* ggml : fix Q8_0 and Q8_1 rounding
* Revert "AVX implementations (#1370 )"
This reverts commit 948d124837f9d287d8490f41338e0e4cceb0814f.
* ggml : fix AVX2 implementation
* sha : update hashes for 7B and 13B
* readme : update timings + remove warning banner
* llama : update v2 PR number to 1405
* ggml : fix WASM comments
* ggml : back to original bit order
* readme : add note that Q4 and Q5 have been changed
* llama : fix return for unknown version
---------
Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
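For reference, the core of #1405: quantized nibbles are now stored in natural order instead of the old interleaved layout, so unpacking needs no cross-lane shuffle. A minimal scalar sketch of the resulting Q4_0 dequant, assuming a simplified block layout (the real ggml struct differs in details such as the scale type):

```c
#include <stdint.h>

#define QK4_0 32

// Simplified Q4_0 block: one scale plus 32 four-bit values, two per byte.
typedef struct {
    float   d;              // scale
    uint8_t qs[QK4_0 / 2];  // packed nibbles
} block_q4_0_t;

// Post-#1405 layout: element j sits in the low nibble of byte j and
// element j + QK4_0/2 in the high nibble -- no bit shuffling on unpack.
static void dequantize_block_q4_0(const block_q4_0_t *b, float *y) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int x0 = (b->qs[j] & 0x0F) - 8; // low nibble
        const int x1 = (b->qs[j] >>   4) - 8; // high nibble
        y[j]             = x0 * b->d;
        y[j + QK4_0 / 2] = x1 * b->d;
    }
}
```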
Concedo
19dbb3b2a5
Merge branch 'master' into concedo_experimental
2023-05-10 18:35:53 +08:00
Sami Farin
9f8dbc4787
use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler ( #1314 )
...
* use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler
Tested with a 13B model.
* use _mm_pause() in busyloop
* use _mm_pause() in busyloop on x86_64 to reduce power consumption
2023-05-09 14:29:20 +02:00
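For context on the change above: a spin-wait loop that polls a shared flag keeps the core's execution pipeline saturated, and the x86 pause instruction hints that the loop is a spin-wait, cutting power and heat without yielding the time slice. A minimal sketch of the pattern (the flag variable is illustrative, not the exact ggml thread-sync code):

```c
#include <immintrin.h>   // _mm_pause(), x86/x86_64 only
#include <stdatomic.h>

// Busy-wait until another thread publishes new work. The pause hint
// throttles speculative execution of the polling loop, which is what
// reportedly ran the 13600K about 10 °C cooler.
static void spin_wait(const atomic_int *flag, int last_seen) {
    while (atomic_load_explicit(flag, memory_order_relaxed) == last_seen) {
        _mm_pause();
    }
}
```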
swittk
1b0fd45465
ggml : Allow usage of CLBlast alongside Accelerate.framework ( #1336 )
...
Minor edit in ggml.c: the original code prevented OpenCL from loading at all when GGML_USE_ACCELERATE was defined.
Minor speedup in prompt eval time.
2023-05-06 23:03:23 -04:00
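A guess at the shape of that edit: if backend initialization sits in an exclusive #elif chain, defining GGML_USE_ACCELERATE silently skips the CLBlast branch; making the guards independent lets both coexist. A hedged sketch, not the literal ggml.c diff:

```c
// Before (exclusive): the CLBlast setup never runs when Accelerate is on.
// #if defined(GGML_USE_ACCELERATE)
//     /* Accelerate setup */
// #elif defined(GGML_USE_CLBLAST)
//     ggml_cl_init();
// #endif

// After (independent): both backends can initialize side by side.
#if defined(GGML_USE_ACCELERATE)
    /* Accelerate setup */
#endif
#if defined(GGML_USE_CLBLAST)
    ggml_cl_init();
#endif
```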
Concedo
39f3d1cf48
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# examples/quantize/quantize.cpp
2023-05-05 21:34:33 +08:00
Ron Jailall
20fbf2a2a0
ggml : change immintrin.h to intrin.h for compatibility ( #1307 )
...
* change immintrin.h to intrin.h for compatibility
Building on Windows 11 ARM throws an error on this line. It seems that intrin.h covers both x86 and ARM.
* conditional def of intrin.h
* fix typo in ggml.c
2023-05-04 18:05:59 +03:00
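The "conditional def" above boils down to picking the intrinsics header per toolchain, since <immintrin.h> is x86-only while MSVC's <intrin.h> covers both x86 and ARM. A sketch of the usual guard (an assumption, not the exact ggml.c lines):

```c
#if defined(_MSC_VER)
    // MSVC: <intrin.h> provides intrinsics for both x86 and ARM targets.
    #include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
    // GCC/Clang on x86: the x86-only header.
    #include <immintrin.h>
#endif
```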
Concedo
e01dc631f7
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-05-04 14:04:41 +08:00
Georgi Gerganov
799fdc1b5d
ggml : vectorize Q8_0 quantization
...
https://github.com/ggerganov/ggml/pull/127#issuecomment-1533648531
2023-05-03 23:24:20 +03:00
Concedo
ede8e4edbb
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
2023-05-03 23:34:50 +08:00
Georgi Gerganov
5d5817ca60
ggml : fix 32-bit ARM
2023-05-02 22:14:50 +03:00
Marvin Gießing
cc0bb7235c
ggml : fix ppc64le build error and make cmake detect Power processors ( #1284 )
...
* Fix ppc64le build issue
* Added support to detect ppc64* processors
2023-05-02 19:42:16 +03:00
slaren
2d099e5193
ggml: add names to tensors ( #1268 )
...
* ggml: add names to tensors
* minor improvements to dot file formatting
2023-05-02 16:03:00 +02:00
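For reference, usage of the new API: naming a tensor makes it identifiable in the dot-file graph dumps that the second bullet improves. A sketch assuming the signature this PR introduces:

```c
#include "ggml.h"

// Tag an intermediate result so ggml_graph_dump_dot() can label the node.
static struct ggml_tensor * named_matmul(struct ggml_context * ctx,
                                         struct ggml_tensor  * w,
                                         struct ggml_tensor  * x) {
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w, x);
    ggml_set_name(cur, "attn_out");  // appears in the exported dot graph
    return cur;
}
```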
Concedo
94827172e0
Merge branch 'master' into concedo
...
# Conflicts:
# CMakeLists.txt
# Makefile
# ggml-cuda.cu
# ggml-cuda.h
2023-05-02 14:38:31 +08:00
slaren
58b367c2d7
cuBLAS: refactor and optimize f16 mat mul performance ( #1259 )
...
* cuBLAS: refactor, convert fp16 to fp32 on device
* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16
* fix build
* cuBLAS: update block_q5_1
2023-05-01 18:11:07 +02:00
Kerfuffle
2bdc09646d
ggml : fix ggml_used_mem() ( #1264 )
2023-05-01 14:56:07 +03:00
Concedo
3de34ee492
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# ggml-opencl.c
2023-05-01 12:03:46 +08:00
Georgi Gerganov
7ff0dcd320
ggml : fix UB (int << 31)
2023-04-30 22:28:51 +03:00
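The fix pattern for this class of UB, for reference: left-shifting 1 (a signed int) into bit 31 overflows the signed type, which C leaves undefined; shifting an unsigned operand is well defined. A minimal example:

```c
#include <stdint.h>

// Undefined behavior: 1 is a signed int, and 1 << 31 overflows it.
// uint32_t mask = 1 << 31;

// Well defined: make the shifted operand unsigned.
uint32_t mask = 1u << 31;
```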
Georgi Gerganov
6bc4400e67
ggml : add Q5 WASM SIMD + GGML_FTYPE
2023-04-30 19:07:43 +03:00
Concedo
3b5df18dbb
temp fix for compilation issues on OSX (M1)
2023-04-30 23:48:46 +08:00
Georgi Gerganov
3e5aa8a1c4
ggml : fix labels for GGML_OP_ALIBI
2023-04-30 10:25:46 +03:00
Concedo
0061b90ec6
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
2023-04-30 10:35:02 +08:00
Georgi Gerganov
c3ca7a5f05
ggml : fix 32-bit ARM NEON
2023-04-29 21:34:23 +03:00
Georgi Gerganov
e8c051611a
ggml : use vzip instead of vuzp for consistency
2023-04-29 21:12:56 +03:00
Georgi Gerganov
0b5a935099
ggml : fix visibility and unused warnings
2023-04-29 19:28:36 +03:00
Georgi Gerganov
ec728e44d7
ggml : fix #if for f32_f32 mul_mat (CLBlast) ( #1229 )
2023-04-29 18:43:42 +03:00
Georgi Gerganov
214b6a3570
ggml : adjust mul_mat_f16 work memory ( #1226 )
...
* llama : minor - remove explicit int64_t cast
* ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS
* ggml : add asserts to guard for incorrect wsize
2023-04-29 18:43:28 +03:00
Concedo
bb282a4ecf
reinstated the q4_3 format for backwards compatibility.
2023-04-29 11:42:04 +08:00
slaren
7fc50c051a
cuBLAS: use host pinned memory and dequantize while copying ( #1207 )
...
* cuBLAS: dequantize simultaneously while copying memory
* cuBLAS: use host pinned memory
* cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory
* cuBLAS: also pin kv cache
* fix rebase
2023-04-29 02:04:18 +02:00
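The mechanism behind #1207, sketched with the CUDA runtime API: copies from pageable host memory must be staged by the driver, while cudaMallocHost-pinned buffers make cudaMemcpyAsync genuinely asynchronous, letting device-side dequantization overlap the transfer. A hedged sketch, not the actual ggml-cuda.cu code (the kernel launcher is a hypothetical placeholder):

```c
#include <string.h>
#include <cuda_runtime.h>

// Hypothetical placeholder for ggml's device-side dequantize kernels.
void launch_dequantize(const void *src, float *dst, int n, cudaStream_t s);

void upload_and_dequantize(const void *quantized, float *dev_out,
                           size_t nbytes, int nelems) {
    void *pinned = NULL;
    cudaMallocHost(&pinned, nbytes);    // page-locked host staging buffer
    memcpy(pinned, quantized, nbytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void *dev_in = NULL;
    cudaMalloc(&dev_in, nbytes);

    // With pinned memory this copy is truly asynchronous; the dequantize
    // kernel queues behind it on the same stream while the CPU moves on
    // to staging the next chunk.
    cudaMemcpyAsync(dev_in, pinned, nbytes, cudaMemcpyHostToDevice, stream);
    launch_dequantize(dev_in, dev_out, nelems, stream);

    cudaStreamSynchronize(stream);
    cudaFree(dev_in);
    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
}
```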
Henri Vasserman
b1ee8f59b4
cuBLAS: non-contiguous tensor support ( #1215 )
...
* Cuda: non-contiguous tensor support
* remove extra stuff
* rename
* fix error
* more fixes, now OpenBLAS and CLBlast build too
* now then?
2023-04-29 01:31:56 +02:00
Stephan Walter
36d19a603b
Remove Q4_3 which is no better than Q5 ( #1218 )
2023-04-28 23:10:43 +00:00
Georgi Gerganov
55390bcaf2
ggml : sync ggml (ggml_alibi)
2023-04-28 20:51:05 +03:00
Georgi Gerganov
11d902364b
ggml : add helper debug printf in soft_max
2023-04-28 17:59:08 +03:00
0cc4m
7296c961d9
ggml : add CLBlast support ( #1164 )
...
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 17:57:16 +03:00
Yann Follet
04aaae1d79
add avx2 for dot_q8_0_q8_0, 2x faster than scalar ( #1211 )
2023-04-28 11:59:48 +00:00
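The core AVX2 trick for a Q8·Q8 dot product, for reference: _mm256_maddubs_epi16 multiplies unsigned by signed bytes, so one operand is made non-negative with _mm256_sign_epi8 and its sign is transferred to the other. A sketch of the pair-sum helper in the style ggml uses (treat it as illustrative):

```c
#include <immintrin.h>

// Multiply the signed int8 lanes of x and y, horizontally add the
// products in pairs twice, and return eight 32-bit sums as floats.
static inline __m256 mul_sum_i8_pairs_float(const __m256i x, const __m256i y) {
    const __m256i ax   = _mm256_sign_epi8(x, x);        // |x|
    const __m256i sy   = _mm256_sign_epi8(y, x);        // y * sign(x)
    const __m256i dot  = _mm256_maddubs_epi16(ax, sy);  // u8*s8 -> s16 pairs
    const __m256i ones = _mm256_set1_epi16(1);
    return _mm256_cvtepi32_ps(_mm256_madd_epi16(ones, dot));
}
```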
Stephan Walter
0b2da20538
ggml : slightly faster AVX2 implementation for Q5 ( #1197 )
2023-04-26 23:26:42 +03:00
Georgi Gerganov
574406dc7e
ggml : add Q5_0 and Q5_1 quantization ( #1187 )
...
* ggml : add Q5_0 quantization (cuBLAS only)
* ggml : fix Q5_0 qh -> uint32_t
* ggml : fix q5_0 histogram stats
* ggml : q5_0 scalar dot product
* ggml : q5_0 ARM NEON dot
* ggml : q5_0 more efficient ARM NEON using uint64_t masks
* ggml : rename Q5_0 -> Q5_1
* ggml : adding Q5_0 mode
* quantize : add Q5_0 and Q5_1 to map
* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195 )
---------
Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-26 23:14:13 +03:00
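How the 5-bit formats pack their data, for reference: the low four bits of each value go into nibbles as in Q4, and the 32 fifth bits of a block are packed into one 32-bit field. A scalar sketch of the Q5_0 unpack, assuming a simplified block layout (the real ggml struct stores the scale as fp16):

```c
#include <stdint.h>
#include <string.h>

#define QK5_0 32

typedef struct {
    float   d;              // scale (fp16 in the real struct)
    uint8_t qh[4];          // the 32 fifth bits, one per element
    uint8_t qs[QK5_0 / 2];  // low four bits, two elements per byte
} block_q5_0_t;

static void dequantize_block_q5_0(const block_q5_0_t *b, float *y) {
    uint32_t qh;
    memcpy(&qh, b->qh, sizeof(qh));
    for (int j = 0; j < QK5_0 / 2; ++j) {
        // Splice each element's fifth bit on top of its nibble.
        const uint8_t xh0 = ((qh >> (j +  0)) << 4) & 0x10;
        const uint8_t xh1 = ((qh >> (j + 12))     ) & 0x10;
        const int x0 = ((b->qs[j] & 0x0F) | xh0) - 16;
        const int x1 = ((b->qs[j] >>   4) | xh1) - 16;
        y[j]             = x0 * b->d;
        y[j + QK5_0 / 2] = x1 * b->d;
    }
}
```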
Georgi Gerganov
7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) ( #1179 )
...
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)
* tests : fix test-quantize-fns
* ggml : finalize Q8_0 implementation
* ggml : use q4_0_q8_0 and q4_2_q8_0
* ggml : fix Q8_0 dot product bug (ARM)
* ggml : Q8_0 unroll x2
* ggml : fix bug - using wrong block type
* ggml : extend quantize_fns_t with "vec_dot_type"
* ggml : fix Q8_0 to use 255 values out of 256
* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
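Q8_0 in brief: one scale per 32 values plus signed bytes, with the "255 values out of 256" fix meaning d = amax/127, so the quantized range is [-127, 127] and -128 never occurs. A reference-style sketch (simplified; the real struct stores the scale as fp16):

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;           // scale (fp16 in the real struct)
    int8_t qs[QK8_0];   // quantized values in [-127, 127]
} block_q8_0_t;

static void quantize_block_q8_0(const float *x, block_q8_0_t *b) {
    float amax = 0.0f;  // largest magnitude in the block
    for (int j = 0; j < QK8_0; ++j) {
        const float v = fabsf(x[j]);
        if (v > amax) amax = v;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    b->d = d;
    for (int j = 0; j < QK8_0; ++j) {
        b->qs[j] = (int8_t)roundf(x[j] * id);  // never reaches -128
    }
}
```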
unbounded
dd0eabc049
ggml : use full range for Q4_0 and Q4_2 quantization ( #729 )
...
* Use full range for q4_0 quantization
By keeping the sign of the highest magnitude, we can make sure the
highest value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.
* Update quantize_row_q4_0 for AVX/AVX2
* Update quantize_row_q4_0 for WASM
Untested
* Update quantize_row_q4_0 for Arm NEON
* Update quantize_row_q4_0 for PowerPC
Untested
* Use full range for q4_2 quantization
2023-04-25 20:20:46 +03:00
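Spelling out the reasoning above: the old quantizer set d = amax/7 and used only [-7, 7]; taking the signed value with the largest magnitude and setting d = max/-8 maps that extreme exactly to -8, the previously unused code point, giving a finer scale at no format cost. A reference-style sketch of the block quantizer (simplified names; nibble ordering follows the later natural layout):

```c
#include <math.h>
#include <stdint.h>

#define QK4_0 32

// Quantize one block to four bits using the full [-8, 7] range.
static void quantize_block_q4_0_full(const float *x, float *d_out, uint8_t *qs) {
    float amax = 0.0f;  // absolute maximum
    float max  = 0.0f;  // signed value with the largest magnitude
    for (int j = 0; j < QK4_0; ++j) {
        if (fabsf(x[j]) > amax) { amax = fabsf(x[j]); max = x[j]; }
    }
    // Dividing by -8 guarantees the extreme value quantizes to exactly
    // -8, the code point that d = amax/7 never used.
    const float d  = max / -8.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    *d_out = d;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        int q0 = (int)(x[j]             * id + 8.5f); // bias to [0, 16), truncate
        int q1 = (int)(x[j + QK4_0 / 2] * id + 8.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        qs[j] = (uint8_t)(q0 | (q1 << 4));
    }
}
```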
xaedes
54bb60e268
ggml : fix bug in ggml_compute_forward_sum_f32 ( #1162 )
...
The sum over all rows is now computed instead of just the last row
2023-04-24 23:02:02 +02:00
Stephan Walter
2ec83428de
Fix build for gcc 8 and test in CI ( #1154 )
2023-04-24 15:38:26 +00:00
Georgi Gerganov
ec9cdb6752
ggml : do not print perf ops that have not been used at all
2023-04-23 18:32:52 +03:00
Georgi Gerganov
e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make"
2023-04-23 18:15:39 +03:00
Stephan Walter
53c8434398
Improve AVX2 for vec_dot_q4_3_q8_0 ( #1138 )
2023-04-23 11:01:03 +00:00
Yishuo Wang
c9e2c26f41
A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 ( #1119 )
2023-04-23 07:57:05 +00:00