Commit graph

87 commits

Author SHA1 Message Date
Concedo
93a8e00dfa Merge branch 'master' into concedo
# Conflicts:
#	flake.nix
2023-04-26 18:01:35 +08:00
Georgi Gerganov
7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)

* tests : fix test-quantize-fns

* ggml : finalize Q8_0 implementation

* ggml : use q4_0_q8_0 and q4_2_q8_0

* ggml : fix Q8_0 dot product bug (ARM)

* ggml : Q8_0 unroll x2

* ggml : fix bug - using wrong block type

* ggml : extend quantize_fns_t with "vec_dot_type"

* ggml : fix Q8_0 to use 255 values out of 256

* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
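One reading of "use 255 values out of 256" above is symmetric 8-bit quantization into [-127, 127], which leaves -128 unused. A minimal sketch under that assumption follows; the block size of 32, the struct layout, and every `example_` name are illustrative and not taken from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed block size for illustration.
static const int EXAMPLE_QK = 32;

// Hypothetical block layout: one float scale plus EXAMPLE_QK signed bytes.
struct example_block_q8 {
    float  d;
    int8_t qs[EXAMPLE_QK];
};

// Quantize EXAMPLE_QK floats so that the largest magnitude maps to +/-127.
static example_block_q8 example_quantize_block(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < EXAMPLE_QK; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }

    example_block_q8 block;
    block.d = amax / 127.0f;
    // Guard against an all-zero block, where the scale would be zero.
    const float id = block.d != 0.0f ? 1.0f / block.d : 0.0f;
    for (int i = 0; i < EXAMPLE_QK; ++i) {
        block.qs[i] = (int8_t) std::lround(x[i] * id);
    }
    return block;
}

// Recover an approximate float from one quantized entry.
static float example_dequantize(const example_block_q8 & block, int i) {
    assert(i >= 0 && i < EXAMPLE_QK);
    return block.d * (float) block.qs[i];
}
```

With a symmetric range the same scale serves positive and negative inputs, so no offset needs to be stored per block.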
Concedo
235daf4016 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	README.md
2023-04-25 20:44:22 +08:00
Georgi Gerganov
957c8ae21d
llama : increase scratch buffer size for 65B (ref #1152)
Temporary solution
2023-04-24 18:47:30 +03:00
Concedo
e58f1d1336 Merge branch 'master' into concedo_experimental 2023-04-24 19:43:17 +08:00
Georgi Gerganov
c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API (#1143) 2023-04-24 07:40:02 +03:00
Concedo
8e615c8245 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-04-24 12:20:08 +08:00
Georgi Gerganov
e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make" 2023-04-23 18:15:39 +03:00
Concedo
7c60441d71 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
2023-04-22 23:46:14 +08:00
Stephan Walter
c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122) 2023-04-22 10:54:13 +00:00
Concedo
1b7aa2b815 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
2023-04-22 16:22:08 +08:00
Georgi Gerganov
872c365a91 ggml : fix AVX build + update to new Q8_0 format 2023-04-22 11:08:12 +03:00
Concedo
1ea0e15292 Merge branch 'master' into concedo
# Conflicts:
#	llama.cpp
2023-04-22 16:07:27 +08:00
xaedes
b6e7f9b09e
llama : add api for getting/setting the complete state: rng, logits, embedding and kv_cache (#1105)
* reserve correct size for logits

* add functions to get and set the whole llama state:

including rng, logits, embedding and kv_cache

* remove unused variables

* remove trailing whitespace

* fix comment
2023-04-22 09:21:32 +03:00
Concedo
cee018960e Merge branch 'master' into concedo_experimental 2023-04-22 00:19:50 +08:00
xaedes
8687c1f258
llama : remember and restore kv cache data pointers (#1104)
because their value is stored in buf and overwritten by memcpy
2023-04-21 18:25:21 +03:00
Concedo
82d74ca1a6 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
2023-04-21 16:24:30 +08:00
Georgi Gerganov
d40fded93e
llama : fix comment for "output.weight" tensor 2023-04-21 10:24:02 +03:00
Georgi Gerganov
12b5900dbc
ggml : sync ggml (add GPT-NeoX RoPE implementation) 2023-04-20 23:32:59 +03:00
Kawrakow
38de86a711
llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-20 20:42:27 +03:00
Georgi Gerganov
e0305ead3a
ggml : add Q4_3 quantization (#1082) 2023-04-20 20:35:53 +03:00
Concedo
be1222c36e Merged the upstream cublas feature 2023-04-19 20:45:37 +08:00
slaren
8944a13296
Add NVIDIA cuBLAS support (#1044) 2023-04-19 11:22:45 +02:00
Concedo
f662a9a230 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
2023-04-19 16:34:51 +08:00
Georgi Gerganov
77a73403ca
ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
2023-04-18 23:54:57 +03:00
Concedo
ac61e34d5f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
2023-04-18 17:38:10 +08:00
slaren
315a95a4d3
Add LoRA support (#820) 2023-04-17 17:28:55 +02:00
Arik Poznanski
efd05648c8
llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with initialization on first use. This prevents undefined behavior at program run, for example a crash in a Release build while the Debug build works

* replaced use of auto with exact type to avoid using -std=c++14

* Made the accessor functions for static maps static const
2023-04-17 17:41:53 +03:00
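The initialization-on-first-use idea described in the commit above can be sketched with a function-local static. This is an illustration only: the enum, the table contents, and the `example_` names are hypothetical and are not the identifiers used in the commit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical enum, named for illustration only.
enum example_model_size { EXAMPLE_SIZE_7B, EXAMPLE_SIZE_13B };

// The map is constructed the first time this function is called rather than
// at program start-up, so it cannot be read before it has been built, no
// matter in which order other translation units initialize their statics.
static const std::map<example_model_size, std::string> & example_size_names() {
    static const std::map<example_model_size, std::string> names = {
        { EXAMPLE_SIZE_7B,  "7B"  },
        { EXAMPLE_SIZE_13B, "13B" },
    };
    return names;
}

// Look up a name through the accessor. The iterator type is spelled out
// instead of using auto, in the spirit of the commit's second bullet.
static const std::string & example_size_name(example_model_size size) {
    const std::map<example_model_size, std::string> & names = example_size_names();
    const std::map<example_model_size, std::string>::const_iterator it = names.find(size);
    assert(it != names.end());
    return it->second;
}
```

Since C++11, initialization of a function-local static is also thread-safe, which a hand-written "initialized" flag would not give for free.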
Ivan Komarov
f266259ad9
Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933) 2023-04-17 15:10:57 +02:00
Concedo
96fb12cfa2 Merge branch 'master' into concedo 2023-04-16 21:59:05 +08:00
Georgi Gerganov
3173a62eb9
stdout : vertical align outputs for better readability 2023-04-16 13:59:27 +03:00
nanahi
2d3481c721
Fix msys2 build error and warnings (#1009) 2023-04-16 11:13:42 +02:00
Concedo
d00b865eb1 Merge branch 'master' into concedo
# Conflicts:
#	.devops/full.Dockerfile
#	Makefile
#	flake.nix
2023-04-15 11:33:43 +08:00
Pavol Rusnak
c56b715269
Expose type name from ggml (#970)
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
2023-04-14 20:05:37 +02:00
Concedo
a819f22cac Merge branch 'master' into concedo
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
2023-04-14 21:40:33 +08:00
Georgi Gerganov
9190e8eac8
llama : merge llama_internal.h into llama.h
Hide it behind an #ifdef
2023-04-13 18:04:45 +03:00
Concedo
f4257a8eef Merge branch 'master' into concedo 2023-04-12 23:25:45 +08:00
Concedo
1bd5992da4 clean and refactor handling of flags 2023-04-12 23:25:31 +08:00
Stephan Walter
e7f6997f89
Don't crash on ftype (formerly f16) == 4 (#917) 2023-04-12 15:06:16 +00:00
Concedo
9245c7d7d0 Merge branch 'master' into concedo 2023-04-11 23:38:15 +08:00
Stephan Walter
3e6e70d8e8
Add enum llama_ftype, sync ggml_type to model files (#709) 2023-04-11 15:03:51 +00:00
comex
2663d2c678
Windows fixes (#890)
Mostly for msys2 and mingw64 builds, which are different from each other
and different from standard Visual Studio builds.  Isn't Windows fun?

- Define _GNU_SOURCE in more files (it's already used in ggml.c for
  Linux's sake).

- Don't use PrefetchVirtualMemory if not building for Windows 8 or later
  (mingw64 doesn't by default).  But warn the user about this situation
  since it's probably not intended.

- Check for NOMINMAX already being defined, which it is on mingw64.

- Actually use the `increment` variable (bug in my `pizza` PR).

- Suppress unused variable warnings in the fake pthread_create and
  pthread_join implementations for Windows.

- (not Windows-related) Remove mention of `asprintf` from comment;
  `asprintf` is no longer used.

Fixes #871.
2023-04-11 15:19:54 +02:00
Concedo
f53238f570 Merged the upstream updates for model loading code, and ditched the legacy llama loaders since they were no longer needed. 2023-04-10 12:00:34 +08:00
comex
180b693a47 Print model version.
Also improve model type printing, and fix indentation of an unrelated
switch statement.
2023-04-10 01:10:46 +02:00
comex
f963b63afa Rewrite loading code to try to satisfy everyone:
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on #740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  The exceptions are converted to error codes at the
  API boundary.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
2023-04-10 01:10:46 +02:00
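The last implementation note above, exceptions converted to error codes at the API boundary, might look roughly like the following. Both function names and the integer "handle" are hypothetical stand-ins, not the loader's real interface.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical internal loader: reports every failure by throwing.
static int example_load_internal(const std::string & path) {
    if (path.empty()) {
        throw std::runtime_error("empty model path");
    }
    return 1; // stand-in for a successfully loaded model handle
}

// Hypothetical C-style entry point: no exception escapes to the caller.
// Returns true on success and writes the handle to *out_handle.
static bool example_load(const char * path, int * out_handle) {
    assert(out_handle != nullptr);
    try {
        *out_handle = example_load_internal(path ? path : "");
        return true;
    } catch (const std::exception & err) {
        std::fprintf(stderr, "example_load: %s\n", err.what());
        return false;
    }
}
```

The internal code can then throw from any depth, while callers of the boundary function only ever see a success flag.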
Concedo
d335fae7c4 missed a print statement 2023-04-08 17:59:53 +08:00
Concedo
0b904e12db Merge branch 'master' into concedo
# Conflicts:
#	Makefile
2023-04-08 17:42:09 +08:00
Concedo
d8e37bfe75 new gpt2 format supported 2023-04-08 17:35:36 +08:00
unbounded
62cfc54f77
Add quantize-stats command for testing quantization (#728)
Command that calculates some statistics over the errors introduced by
quantization, like mean square error, max error and some percentile errors for layer
weights. Should be useful for testing quantization improvements.

Exposes some internal state from ggml and llama for testing
2023-04-08 00:09:18 +02:00
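The statistics named in the commit above (mean square error, max error, percentile errors) can be computed along these lines. This is a sketch rather than the quantize-stats code itself: the struct, the function name, the choice of the 95th percentile, and the nearest-rank percentile rule are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration.
struct example_error_stats {
    double mean_square_error;
    double max_error;
    double p95_error; // 95th percentile of the absolute error
};

// Compare original weights with their quantize/dequantize round trip.
static example_error_stats example_compute_error_stats(
        const std::vector<float> & original,
        const std::vector<float> & roundtrip) {
    assert(!original.empty() && original.size() == roundtrip.size());

    std::vector<double> abs_err(original.size());
    double sum_sq = 0.0;
    for (size_t i = 0; i < original.size(); ++i) {
        const double diff = (double) original[i] - (double) roundtrip[i];
        abs_err[i] = std::fabs(diff);
        sum_sq += diff * diff;
    }
    std::sort(abs_err.begin(), abs_err.end());

    // Nearest-rank percentile: the smallest error such that at least 95%
    // of the samples are less than or equal to it.
    const size_t rank = (size_t) std::ceil(0.95 * (double) abs_err.size());

    example_error_stats stats;
    stats.mean_square_error = sum_sq / (double) original.size();
    stats.max_error         = abs_err.back();
    stats.p95_error         = abs_err[rank - 1];
    return stats;
}
```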
Ivan Stepanov
4953e9007f
llama : always sort logits before nucleus sampling (#812)
* Always sort logits before nucleus sampling

* remove second normalization

- fix windows build
- remove normalization since std::discrete_distribution does not require it
2023-04-07 19:02:12 +03:00
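A rough sketch of the two points in the commit above: sort the candidates before taking the nucleus (top-p) prefix, and hand unnormalized weights to `std::discrete_distribution`, which normalizes them itself. The function names and the pair-based candidate representation are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Returns the (weight, token id) pairs kept by top-p filtering, highest
// weight first. Weights are exp(logit - max_logit) and are left unnormalized.
static std::vector<std::pair<double, int>> example_top_p_candidates(
        const std::vector<float> & logits, double top_p) {
    assert(!logits.empty());

    std::vector<std::pair<double, int>> cand;
    cand.reserve(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cand.emplace_back((double) logits[i], (int) i);
    }

    // Sort by logit, descending, so the prefix below is the most likely set.
    std::sort(cand.begin(), cand.end(),
        [](const std::pair<double, int> & a, const std::pair<double, int> & b) {
            return a.first > b.first;
        });

    const double max_logit = cand[0].first;
    double sum = 0.0;
    for (std::pair<double, int> & c : cand) {
        c.first = std::exp(c.first - max_logit);
        sum += c.first;
    }

    // Keep the shortest prefix whose cumulative probability reaches top_p.
    double cumsum = 0.0;
    size_t keep = cand.size();
    for (size_t i = 0; i < cand.size(); ++i) {
        cumsum += cand[i].first / sum;
        if (cumsum >= top_p) {
            keep = i + 1;
            break;
        }
    }
    cand.resize(keep);
    return cand;
}

// Draw one token id from the kept candidates.
static int example_sample_top_p(const std::vector<float> & logits, double top_p, std::mt19937 & rng) {
    const std::vector<std::pair<double, int>> cand = example_top_p_candidates(logits, top_p);

    std::vector<double> weights;
    weights.reserve(cand.size());
    for (const std::pair<double, int> & c : cand) {
        weights.push_back(c.first);
    }

    // No second normalization: discrete_distribution scales the weights.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return cand[dist(rng)].second;
}
```

Without the sort, the cumulative cut-off would depend on token order rather than on probability, so the kept set would not be the smallest one reaching `top_p`.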