Commit graph

670 commits

Author SHA1 Message Date
daboe01
cf267d1c71
make : add train-text-from-scratch (#1850)
* make finetuning example accessible

* fixed: target was on the wrong line

* fixed: name of executable was wrong

* fixed: naming of binary

* fixed: model path was wrong

* fixed clean target

* Update examples/train-text-from-scratch/README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-15 20:42:48 +03:00
sandyiscool
37e257c48e
make : clean *.so files (#1857) 2023-06-15 20:36:06 +03:00
Ycros
b1b8dc32c9
Fix Makefile for CUBLAS. (#241) 2023-06-15 14:46:47 +08:00
Concedo
f83b66606b Merge branch 'concedo' into concedo_experimental 2023-06-14 11:50:24 +08:00
tqcq
ce36167976
Fix the link on the Mac platform OpenCL method (#227)
merging this, please let me know if anything breaks.
2023-06-14 11:41:39 +08:00
Concedo
e4265198ed added cublas back into the makefile as some people requested 2023-06-14 11:34:40 +08:00
Kerfuffle
74d4cfa343
Allow "quantizing" to f16 and f32 (#1787)
* Allow "quantizing" to f16 and f32

Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS

Add brief help to the list of quantization types in the quantize tool

Ignore case for quantization type arguments in the quantize tool
2023-06-13 04:23:23 -06:00
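As a rough illustration of the case-insensitive type matching and brief help listing described in commit 74d4cfa343, the sketch below normalizes the argument before looking it up. It is not the quantize tool's actual code (the tool is written in C++); the table contents, the help strings, and the names `QUANT_TYPES`, `parse_quant_type` and `format_help` are assumptions made for this example.

```python
# Hypothetical table of quantization type names and brief help text.
# The names follow common llama.cpp type labels; the descriptions are illustrative only.
QUANT_TYPES = {
    "Q4_0": "4-bit quantization",
    "Q5_0": "5-bit quantization",
    "Q8_0": "8-bit quantization",
    "F16": "16-bit float (no quantization)",
    "F32": "32-bit float (no quantization)",
}


def parse_quant_type(arg: str) -> str:
    """Return the canonical type name for `arg`, ignoring case."""
    key = arg.strip().upper()
    if key not in QUANT_TYPES:
        allowed = ", ".join(sorted(QUANT_TYPES))
        raise ValueError(f"unknown quantization type {arg!r}; expected one of: {allowed}")
    return key


def format_help() -> str:
    """Build a brief help listing, one type per line."""
    return "\n".join(f"  {name:5s} {text}" for name, text in QUANT_TYPES.items())
```

With this sketch, `parse_quant_type("q4_0")` and `parse_quant_type("Q4_0")` resolve to the same entry, and an unknown name raises an error that lists the accepted types.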
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Concedo
0833845268 merged metal patch directly into the file 2023-06-09 14:38:31 +08:00
Hyun-joo KIM
6fa1613f15
Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment 2023-06-09 01:47:36 +09:00
Concedo
a6a0fa338a cleanup indentation, fixing cublas build 2023-06-08 22:40:53 +08:00
Concedo
a979e71ddc add obj flags to all output make targets 2023-06-08 16:28:26 +08:00
Concedo
49a6be3d87 add llama metal compile flags as an option 2023-06-07 22:29:38 +08:00
Concedo
7b0707ff26 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-06-07 17:06:56 +08:00
Georgi Gerganov
5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default
2023-06-07 10:59:52 +03:00
Concedo
ed603dcafc Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	docs/BLIS.md
#	llama.cpp
#	tests/test-quantize-fns.cpp
2023-06-06 23:12:01 +08:00
Georgi Gerganov
2d43387daf
ggml : fix builds, add ggml-quants-k.o (close #1712, close #1710) 2023-06-06 10:18:03 +03:00
Kawrakow
99009e72f8
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml

I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.

* Adding Q3_K and Q8_K (de)-quantization

* Q3_K now working on CUDA and AVX2/scalar

CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).

* Some improvement for Q3_K on CUDA

It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.

* Some more CUDA optimizations for Q3_K

Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.

* Adding Q4_K - scalar, AVX2, CUDA

Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity) is about the same.

* Adding Q6_K - scalar, AVX2, CUDA

Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).

* Adding Q5_K - scalar, AVX2, CUDA

Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, performance is about the same as Q4_0
for both single token and batch prediction.

* Per convention, all QX_K quantizations use Q5_K for output.weight

* Adding quantization mixes

* Quantization mixes: didn't quite get what I wanted in the last commit

* Q4_K dot product for ARM_NEON

* Q6_K dot product for ARM_NEON

* Q5_K dot product for ARM_NEON

* Adding Q3_K dot for ARM_NEON

It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.

* A very slightly faster ARM_NEON Q3_K dot

* Adding Q2_K - just CUDA for now

Token prediction is pretty good - about 15.5 ms on an RTX 4080.
Perplexity is about the same as Q4_K.

* Adding scalar and AVX2 Q2_K dot

* Adding ARM_NEON Q2_K dot

About the same performance as Q4_K.

* A slightly faster ARM_NEON Q2_K dot

Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.

* Fixed bug in Q2_K CUDA dot product kernel

Strangely enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonsense back.

In any case, Q2_K single token evaluation times on an RTX 4080 in a Ryzen 7950X
box using CUDA with the model fully loaded on the GPU are
  ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for the 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).

* Don't print zeros/NaNs when no count histogram has been collected

* A 10% faster CUDA vector dot kernel for Q3_K

Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems the memory access pattern is more important for
performance than the amount of computation the kernel
does.

* A slightly faster Q4_K AVX2 dot product

For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.

* A slightly faster ARM_NEON Q4_K dot product

* Minor

* Fix quantization error test

We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.

* Fix docker build

I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.

* Added forgotten ggml.o dependence on k_quants.h to the Makefile

* Had unintentionally committed the Makefile with -Ofast enabled

* ggml : rename k_quants -> ggml-quants-k, use lowercase in code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 22:56:18 +03:00
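The "Fix quantization error test" note in commit 99009e72f8 says an RMSE below 0.002 is not a realistic target for 2- and 3-bit variants. The sketch below shows why error grows quickly as the bit width shrinks. It uses a plain round-to-nearest uniform quantizer over a single block, which is an assumption chosen for simplicity and not the k-quants scheme itself; the sample data, block size, and the name `quantize_rmse` are likewise hypothetical.

```python
import math
import random


def quantize_rmse(values, bits):
    """RMSE after round-to-nearest uniform quantization of one block to `bits` bits."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1          # number of steps between the min and max codes
    scale = (hi - lo) / levels        # width of one quantization step
    squared_error = 0.0
    for v in values:
        q = round((v - lo) / scale)   # integer code in [0, levels]
        restored = lo + q * scale     # dequantized value
        squared_error += (v - restored) ** 2
    return math.sqrt(squared_error / len(values))


# Illustrative data only: 256 values drawn uniformly from [-1, 1] with a fixed seed.
random.seed(0)
block = [random.uniform(-1.0, 1.0) for _ in range(256)]

rmse_by_bits = {bits: quantize_rmse(block, bits) for bits in (2, 3, 4, 6, 8)}
for bits, rmse in rmse_by_bits.items():
    print(f"{bits}-bit: rmse = {rmse:.5f}")
```

For a uniform quantizer the error is roughly the step size divided by the square root of 12, so each bit removed about doubles the RMSE. Under these assumptions the 2- and 3-bit errors land far above 0.002.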
Concedo
79df932d0a added dropdown for blasbatch. added capability to build avx clblast but not in default build for now 2023-06-05 22:50:21 +08:00
Georgi Gerganov
ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + support scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
Concedo
c3c05fc33b further cleanup, refactor renamemode to hordeconfig 2023-06-04 11:57:46 +08:00
Concedo
6f82e17b7a added MPT support 2023-06-03 16:14:08 +08:00
Johannes Gäßler
3b126f654f
LLAMA_DEBUG adds debug symbols (#1617) 2023-05-28 21:01:02 +02:00
Kerfuffle
0df7d63e5b
Include server in releases + other build system cleanups (#1610)
Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets built. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases.

Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default)

Fix issue where `vdot` binary wasn't removed when running `make clean`.

Fix compile warnings in `server` example.

Add `.hpp` files to trigger workflow (the server example has one).
2023-05-27 11:04:14 -06:00
Concedo
92a0d77712 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-05-27 17:44:14 +08:00
Johannes Gäßler
1fcdcc28b1
cuda : performance optimizations (#1530)
* xor hack

* block y dim

* loop unrolling

* Fixed cmake LLAMA_CUDA_BY option

* Removed hipblas compatibility code

* Define GGML_CUDA_DMMV_BLOCK_Y if not defined

* Fewer iters, more ops per iter

* Renamed DMMV X/Y compilation options
2023-05-26 00:07:29 +03:00
Concedo
bf482d1786 revert klite newline bug, trying to add win7 support 2023-05-24 22:21:01 +08:00
Concedo
cd4012c3ed minor fixes to debug logging, fixed a typo, added a new failsafe mode 2023-05-23 21:31:42 +08:00
0cc4m
2e6cd4b025
OpenCL Token Generation Acceleration (#1459)
* Move back to C++ for OpenCL

* Refactor OpenCL code to work more like the CUDA code, add missing functions

* Deduplicate dequant kernels

* Add OpenCL compile options

* Use compile args for preprocessing constants

* Restore default platform + device selection by id behavior

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-05-23 00:33:24 +03:00
Concedo
b9f06a7670 mavx only for windows by default, let them eat march native. 2023-05-22 16:48:55 +08:00
Concedo
169a26d15f removed unused build targets 2023-05-22 13:53:10 +08:00
Concedo
587308a202 fixed some build errors on linux, changed icon resolution, added more error printing 2023-05-22 12:18:42 +08:00
Stefan Sydow
7780e4f479
make : .PHONY clean (#1553) 2023-05-21 17:03:44 +03:00
Concedo
c048bcfec4 remove old filever checks (+7 squashed commit)
Squashed commit:

[b72627a] new format not working

[e568870] old ver works

[7053b77] compile errors fixed, fixing linkers

[4ae8889] add new ver

[ff82dfd] file format checks

[25b8aa8] refactoring type names

[931063b] still merging
2023-05-21 00:15:39 +08:00
Zenix
b8ee340abe
feature : support blis and other blas implementation (#1536)
* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 17:58:31 +03:00
Georgi Gerganov
ea600071cb
Revert "feature : add blis and other BLAS implementation support (#1502)"
This reverts commit 07e9ace0f9.
2023-05-20 12:03:48 +03:00
Zenix
07e9ace0f9
feature : add blis and other BLAS implementation support (#1502)
* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 12:02:48 +03:00
Concedo
f561fe5a4a switch back to ofast for c 2023-05-17 10:04:54 +08:00
Concedo
504a2aa874 Merge remote-tracking branch 'fixmake/concedo' into concedo_experimental 2023-05-17 10:01:57 +08:00
horenbergerb
f29c25e7a1 hacky fix for linux cublas build 2023-05-16 12:29:04 -04:00
Concedo
196fbba527 Merge branch 'opencl-dev2' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
2023-05-16 17:04:33 +08:00
sandyiscool
2a5ee023ad
Add alternate include path for openblas (#1476)
In some Linux distributions (Fedora, for example), the include path for openblas is located at '/usr/local/include'
2023-05-16 10:30:15 +02:00
Concedo
e4e6994353 Not working, don't use. testing a merge 2023-05-16 12:33:24 +08:00
0cc4m
c77966524a Refactor OpenCL code to work more like the CUDA code, add missing functions 2023-05-14 17:01:46 +02:00
Concedo
e01e373e63 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml.c
#	llama.cpp
2023-05-14 11:34:41 +08:00
Georgi Gerganov
bda4d7c215 make : fix PERF build with cuBLAS 2023-05-13 17:25:09 +03:00
Concedo
cee8042793 integrated new version of clblast kernels as a separate file 2023-05-13 12:53:28 +08:00
Concedo
08810d5fee interim merge. do not use 2023-05-13 00:33:55 +08:00
Concedo
e9caff1cda Interim merge. Do not use.
Merge branch 'master' into concedo_experimental

# Conflicts:
#	README.md
#	SHA256SUMS
#	examples/quantize/quantize.cpp
#	ggml-opencl.c
#	ggml.c
#	ggml.h
#	llama.cpp
#	llama.h
2023-05-12 23:20:27 +08:00
Concedo
62beded0e7 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	README.md
2023-05-07 19:10:01 +08:00