116 commits

Author SHA1 Message Date
Concedo
82d74ca1a6 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
2023-04-21 16:24:30 +08:00
Georgi Gerganov
12b5900dbc
ggml : sync ggml (add GPT-NeoX RoPE implementation) 2023-04-20 23:32:59 +03:00
Georgi Gerganov
9ff334f3c9
ggml : fix bug in ggml_compute_forward_dup_f32() 2023-04-20 21:58:38 +03:00
Georgi Gerganov
8a1756abdf
ggml : do not break cuBLAS build (Q4_3 is not yet implemented) 2023-04-20 21:43:50 +03:00
Georgi Gerganov
66aab46079
ggml : fix Q4_3 quantization
Broke it during conflict resolution in last PR
2023-04-20 20:44:05 +03:00
Kawrakow
38de86a711
llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So, I am
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-20 20:42:27 +03:00
Georgi Gerganov
e0305ead3a
ggml : add Q4_3 quantization (#1082) 2023-04-20 20:35:53 +03:00
Concedo
4605074245 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	ggml.c
2023-04-20 17:30:54 +08:00
Stephan Walter
c8c2c52482
AVX2 optimization for vec_dot_q4_2_q8_0 (#1068) 2023-04-20 08:45:41 +02:00
slaren
02d6988121
Improve cuBLAS performance by dequantizing on the GPU (#1065) 2023-04-20 03:14:14 +02:00
Kawrakow
f7d05095b4
Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) as it is
not multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 20:20:14 +02:00
Georgi Gerganov
884e7d7a2b
ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmlaq_n_f32

56 ms/token with Q4_1 !

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-04-19 20:10:08 +03:00
Stephan Walter
f3d4edf504
ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
2023-04-19 19:06:37 +03:00
Concedo
be1222c36e Merged the upstream cuBLAS feature 2023-04-19 20:45:37 +08:00
slaren
8944a13296
Add NVIDIA cuBLAS support (#1044) 2023-04-19 11:22:45 +02:00
Concedo
f662a9a230 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
2023-04-19 16:34:51 +08:00
slaren
6667401238
Multi-threaded ggml_cpy (#1035)
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 00:53:24 +02:00
Georgi Gerganov
77a73403ca
ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
2023-04-18 23:54:57 +03:00
Georgi Gerganov
50a8a2af97
ggml : scratch that - vmlaq_n_f32 is always better
Had a background process that was messing with the timings
2023-04-18 23:11:23 +03:00
Georgi Gerganov
dcdd65e296
ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators 2023-04-18 22:59:17 +03:00
Concedo
ac61e34d5f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
2023-04-18 17:38:10 +08:00
slaren
315a95a4d3
Add LoRA support (#820) 2023-04-17 17:28:55 +02:00
Georgi Gerganov
69b740289f
ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c 2023-04-17 16:16:23 +03:00
Ivan Komarov
f266259ad9
Speed up the AVX-512 implementation of ggml_vec_dot_q4_0() (#933) 2023-04-17 15:10:57 +02:00
Concedo
5a4d1b5d15 Merge branch 'master' into concedo
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-04-16 14:08:23 +08:00
Stephan Walter
2f7c8e014e
Fix potential int8 overflow in non-SIMD vec_dot (#986) 2023-04-15 18:28:56 +00:00
Concedo
3e992eabb4 Merge remote-tracking branch 'occam/clblast-gpu-dequant' into concedo 2023-04-16 00:26:54 +08:00
Stephan Walter
0ad964631f
Refactor ggml.c for future tensor types (#1001) 2023-04-15 16:25:38 +00:00
0cc4m
57d046eeb6 Enable dequantization on GPU for CLBlast 2023-04-15 18:04:24 +02:00
Georgi Gerganov
e95b6554b4
ggml : add Q8_0 quantization for intermediate results (#951)
* ggml : add Q8_0 quantization for intermediate results

* quantize-stats : fix test + add it to Makefile default

* Q8: use int8_t, AVX/AVX2 optimizations

* ggml : fix quantize_row_q8_0() ARM_NEON rounding

* minor : updates after rebase to latest master

* quantize-stats : delete obsolete strings

* ggml : fix q4_1 dot func

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-15 17:53:22 +03:00
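
For readers unfamiliar with block-wise 8-bit quantization, the idea behind the commit above can be sketched roughly as follows. This is an illustration only, not the ggml code: the block size of 32, the struct layout, and the names `BlockQ8`, `quantize_block` and `dequantize_block` are all assumptions made for the example. Each block stores one float scale plus 32 signed 8-bit values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical block size and layout, chosen for illustration only.
constexpr int kBlock = 32;

struct BlockQ8 {
    float       d;           // per-block scale
    std::int8_t qs[kBlock];  // quantized values in [-127, 127]
};

// Quantize one block: scale by the largest magnitude so it maps to +/-127.
BlockQ8 quantize_block(const float * x) {
    assert(x != nullptr);
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    BlockQ8 b;
    b.d = amax / 127.0f;
    // Guard against an all-zero block, where the scale would be zero.
    const float id = (b.d != 0.0f) ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < kBlock; ++i) {
        b.qs[i] = static_cast<std::int8_t>(std::lround(x[i] * id));
    }
    return b;
}

// Reconstruct approximate floats from a quantized block.
void dequantize_block(const BlockQ8 & b, float * y) {
    assert(y != nullptr);
    for (int i = 0; i < kBlock; ++i) {
        y[i] = b.d * static_cast<float>(b.qs[i]);
    }
}
```

With this scheme the round-trip error of each value is bounded by about half the block scale, which is why one scale per small block tends to work better than one scale for a whole tensor.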
Georgi Gerganov
aa485cee33
ggml : use posix_memalign on non-Windows env 2023-04-15 14:25:45 +03:00
Concedo
d00b865eb1 Merge branch 'master' into concedo
# Conflicts:
#	.devops/full.Dockerfile
#	Makefile
#	flake.nix
2023-04-15 11:33:43 +08:00
Pavol Rusnak
c56b715269
Expose type name from ggml (#970)
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
2023-04-14 20:05:37 +02:00
Kerfuffle
c9a59b70a5
ggml : add unary and binary map operations (#874)
* GGML map ops proof of concept.

* Various cleanups.

Add handling for task setting.

Add handling for ggml_compute_backward.

Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32

Fix compiler warnings related to casting function pointers and `void *`

Reorder functions and definitions based on the GGML op number.

Use typedefs for map op function pointer types.

* Fix position of map ops cases in ggml_compute_forward
2023-04-14 17:43:55 +03:00
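
The map operations described in the commit above let a caller apply a custom function to tensor data through a function pointer. A minimal standalone sketch of that pattern is shown here. It does not use ggml: the typedef names, `map_unary`, `map_binary` and the two callbacks are hypothetical, and plain `std::vector<float>` stands in for tensors.

```cpp
#include <cassert>
#include <vector>

// Hypothetical function pointer types for element-wise callbacks.
typedef void (*unary_op_f32_t)(int n, float * dst, const float * src);
typedef void (*binary_op_f32_t)(int n, float * dst, const float * a, const float * b);

// Apply a user-supplied unary callback to a whole array.
std::vector<float> map_unary(const std::vector<float> & src, unary_op_f32_t fun) {
    assert(fun != nullptr);
    std::vector<float> dst(src.size());
    fun(static_cast<int>(src.size()), dst.data(), src.data());
    return dst;
}

// Apply a user-supplied binary callback to two arrays of equal length.
std::vector<float> map_binary(const std::vector<float> & a,
                              const std::vector<float> & b,
                              binary_op_f32_t fun) {
    assert(fun != nullptr);
    assert(a.size() == b.size());
    std::vector<float> dst(a.size());
    fun(static_cast<int>(a.size()), dst.data(), a.data(), b.data());
    return dst;
}

// Example callbacks a caller might provide.
void square_f32(int n, float * dst, const float * src) {
    for (int i = 0; i < n; ++i) dst[i] = src[i] * src[i];
}

void add_f32(int n, float * dst, const float * a, const float * b) {
    for (int i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}
```

Passing the element count and raw pointers to the callback, rather than calling it once per element, keeps the per-call overhead low and leaves the inner loop to the caller.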
Concedo
a819f22cac Merge branch 'master' into concedo
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
2023-04-14 21:40:33 +08:00
Georgi Gerganov
1623a6e9b4
ggml : minor 2023-04-14 13:31:29 +03:00
Georgi Gerganov
c14e0d2f23
ggml : always allocate buffers with size multiple of GGML_MEM_ALIGN 2023-04-14 13:31:15 +03:00
Georgi Gerganov
0f07cacb05
ggml : fix q4_1 dot product types 2023-04-14 09:45:42 +03:00
Howard Su
c5d70f5c9e
ggml : optimize rope function to avoid calling powf in the tight loop (#807) 2023-04-14 09:24:52 +03:00
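
The kind of change this commit title describes can be sketched as follows, assuming the usual rotary embedding angles of the form pos * base^(-i/n_dims) for even i with base 10000. The function names are hypothetical and the code is not taken from ggml. Because successive angles differ by a constant factor, one `powf` call before the loop plus one multiply per step can replace a `powf` call per step.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference version: one pow call per rotated pair.
std::vector<float> rope_angles_powf(int n_dims, int pos) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    std::vector<float> theta(n_dims / 2);
    for (int i = 0; i < n_dims; i += 2) {
        theta[i / 2] = pos * std::pow(10000.0f, -static_cast<float>(i) / n_dims);
    }
    return theta;
}

// Incremental version: a single pow call, then one multiply per pair.
std::vector<float> rope_angles_incremental(int n_dims, int pos) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    std::vector<float> theta(n_dims / 2);
    // Constant ratio between consecutive angles.
    const float theta_scale = std::pow(10000.0f, -2.0f / n_dims);
    float t = static_cast<float>(pos);
    for (int i = 0; i < n_dims; i += 2) {
        theta[i / 2] = t;
        t *= theta_scale;
    }
    return theta;
}
```

The two versions agree up to floating-point rounding, since the incremental form accumulates a small error with each multiply.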
Georgi Gerganov
a3a2a0eda8
ggml : add GGML_DEFAULT_N_THREADS 2023-04-13 18:36:48 +03:00
Georgi Gerganov
d990e3fffc
ggml : speed-up ggml_vec_dot_q4_1() ARM_NEON + 32-bit ARM support (#900)
* ggml : speed-up q4_1 ARM_NEON by ~5%

* ggml : implement vaddvq when missing

* ggml : implement vminvq and vmaxvq when missing

* ggml : implement vzip when missing

* ggml : fix comment

* ggml : try to use correct ifdef
2023-04-13 18:32:36 +03:00
Stephan Walter
6232f2d7fd
ggml : optimize non-SIMD Q4_0 vector dot product (#703) 2023-04-13 17:59:50 +03:00
Pavol Rusnak
6c248707f5
ggml : introduce GGML_ALIGNED_MALLOC/GGML_ALIGNED_FREE macros (#884)
which allows us to use aligned_alloc or _aligned_malloc functions
2023-04-13 17:08:32 +03:00
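
A portable aligned-allocation wrapper of the sort this commit introduces might look roughly like the sketch below. The names `aligned_malloc_sketch`, `aligned_free_sketch` and the 16-byte alignment are assumptions for illustration, not the actual macros. The sketch uses `_aligned_malloc` on Windows and `posix_memalign` elsewhere, the latter matching the later commit in this log that switches non-Windows builds to `posix_memalign`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

// Hypothetical alignment; must be a power of two and a multiple of sizeof(void *).
constexpr std::size_t kAlign = 16;

// Allocate `size` bytes aligned to kAlign, or return nullptr on failure.
inline void * aligned_malloc_sketch(std::size_t size) {
#if defined(_WIN32)
    return _aligned_malloc(size, kAlign);
#else
    void * p = nullptr;
    if (posix_memalign(&p, kAlign, size) != 0) {
        return nullptr;
    }
    return p;
#endif
}

// Release memory obtained from aligned_malloc_sketch.
inline void aligned_free_sketch(void * p) {
#if defined(_WIN32)
    _aligned_free(p);  // memory from _aligned_malloc must not go to free()
#else
    free(p);
#endif
}
```

Wrapping both the allocation and the release matters because, on Windows, memory from `_aligned_malloc` has to be returned with `_aligned_free` rather than `free`.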
Vladimir
8c3ffc2f04
ggml : update cblas_sgemm columns var to be more reasonable (#838) 2023-04-13 16:24:30 +03:00
Concedo
4faae0afa9 Merged upstream, fixed OSX compile errors, integrated noavx2 build into main 2023-04-12 18:08:55 +08:00
Pavol Rusnak
8b679987cd
Fix whitespace, add .editorconfig, add GitHub workflow (#883) 2023-04-11 19:45:44 +00:00
Concedo
9245c7d7d0 Merge branch 'master' into concedo 2023-04-11 23:38:15 +08:00
Concedo
23c675b2e6 integrated optional (experimental) CLBlast support 2023-04-11 23:33:44 +08:00
Stephan Walter
3e6e70d8e8
Add enum llama_ftype, sync ggml_type to model files (#709) 2023-04-11 15:03:51 +00:00
comex
2663d2c678
Windows fixes (#890)
Mostly for msys2 and mingw64 builds, which are different from each other
and different from standard Visual Studio builds.  Isn't Windows fun?

- Define _GNU_SOURCE in more files (it's already used in ggml.c for
  Linux's sake).

- Don't use PrefetchVirtualMemory if not building for Windows 8 or later
  (mingw64 doesn't by default).  But warn the user about this situation
  since it's probably not intended.

- Check for NOMINMAX already being defined, which it is on mingw64.

- Actually use the `increment` variable (bug in my `pizza` PR).

- Suppress unused variable warnings in the fake pthread_create and
  pthread_join implementations for Windows.

- (not Windows-related) Remove mention of `asprintf` from comment;
  `asprintf` is no longer used.

Fixes #871.
2023-04-11 15:19:54 +02:00