Commit graph

121 commits

Author SHA1 Message Date
Concedo
a97f7d5f91 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-cuda.Dockerfile
#	.devops/main-intel.Dockerfile
#	.devops/main-rocm.Dockerfile
#	.devops/main.Dockerfile
#	.devops/server-cuda.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-rocm.Dockerfile
#	.devops/server.Dockerfile
#	.devops/tools.sh
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	llama.cpp
#	requirements.txt
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	requirements/requirements-convert-llama-ggml-to-gguf.txt
#	scripts/check-requirements.sh
#	scripts/compare-llama-bench.py
#	scripts/convert-gg.sh
#	scripts/pod-llama.sh
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-tokenizer-0.sh
#	tests/test-tokenizer-random.py
2024-06-02 12:28:38 +08:00
Georgi Gerganov
0c27e6f62e
ggml : fix loongson compile warnings (#7537)
* ggml : fix loongson compile warnings

ggml-ci

* Fix loongarch quantize test fail.

Fix unexpected error introduced during rebase code.

* tests : disable json test due to lack of python on the CI node

ggml-ci

---------

Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
2024-05-31 14:17:10 +03:00
junchao-loongson
d5c05821f3
ggml : fix loongarch build (O2 issue) (#7636) 2024-05-30 12:30:10 +03:00
Concedo
4ed9ba7352 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.lock
#	tests/test-backend-ops.cpp
2024-05-28 21:57:19 +08:00
Masaya, Kato
faa0e6979a
ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433)
* Add SVE support for q4_0_q8_0 q8_0_q8_0

* remove ifdef
2024-05-25 11:42:31 +03:00
Concedo
653050135b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	tests/test-chat-template.cpp
2024-05-24 16:22:38 +08:00
Georgi Gerganov
1debe72737
ggml : silence UB sanitizer error during iq2_xxs quantization (#0) 2024-05-23 17:25:38 +03:00
Concedo
9282c307ed this commit does not work, just for debugging 2024-05-23 20:13:47 +08:00
Georgi Gerganov
e84b71c2c6
ggml : drop support for QK_K=64 (#7473)
* ggml : drop support for QK_K=64

ggml-ci

* opencl : restore QK_K=256 define
2024-05-23 10:00:21 +03:00
Concedo
52f9911240 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	requirements.txt
#	scripts/LlamaConfig.cmake.in
2024-05-21 19:05:52 +08:00
junchao-loongson
65c58207ec
ggml : add loongarch lsx and lasx support (#6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-05-20 10:19:21 +03:00
slaren
e4e6f67be6
ggml : fix another case of quants nans (#7387) 2024-05-19 17:08:46 +02:00
Concedo
d5d5dda02b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	ggml-cuda.cu
#	tests/test-backend-ops.cpp
2024-05-19 17:55:20 +08:00
slaren
05834841dc
ggml : fix quants nans when all the group weights are very close to zero (#7313) 2024-05-18 02:39:54 +02:00
Concedo
47cbfd6150 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	llama.cpp
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/test-backend-ops.cpp
2024-05-17 22:30:41 +08:00
Herman Semenov
359cbe3f46
ggml-quants, llama : removed excess checks (#7274) 2024-05-17 10:08:49 +03:00
Max Krasnyansky
13ad16af12
Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (#7191)
* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS

* build: add CMake Presets and toolchian files for Windows ARM64

* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings

* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM

* matmul-int8: fixed typos in q8_0_q8_0 matmuls

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 12:47:36 +10:00
Georgi Gerganov
c3c88f296a ggml : try fix ppc64 (whisper/0) 2024-05-14 19:08:09 +03:00
Hong Bo PENG
0d26d8ccd8 ggml : optimize for ppc64le using VSX intrinsics (ggml/784)
* optimize for ppc64le using VSX intrinsics

* 1. code clean up by removing comments about overflow concern.

2. fix typo in suffix of scaling.

* Continue to fix typo in suffix of scaling for QK_K <> 256

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-14 19:08:09 +03:00
Concedo
2ee808a747 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	ci/run.sh
#	llama.cpp
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	requirements.txt
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-1-bpe.cpp
2024-05-14 19:28:47 +08:00
Borislav Stanimirov
ef0d5e3ec9 build: fix and ignore msvc warnings (ggml/805) 2024-05-11 15:38:34 +03:00
Concedo
165a56088b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml.c
#	scripts/compare-llama-bench.py
#	tests/test-backend-ops.cpp
2024-05-08 18:19:28 +08:00
Justine Tunney
3855416027
ggml : introduce bfloat16 support (#6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-08 09:30:09 +03:00
Concedo
17a24d753c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/main-intel.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-vulkan.Dockerfile
#	.github/workflows/bench.yml
#	.github/workflows/build.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/server.yml
#	.gitignore
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	flake.lock
#	llama.cpp
#	models/ggml-vocab-falcon.gguf
#	models/ggml-vocab-llama-spm.gguf
#	models/ggml-vocab-mpt.gguf
#	models/ggml-vocab-stablelm.gguf
#	models/ggml-vocab-starcoder.gguf
#	requirements.txt
#	scripts/check-requirements.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-0-bpe.py
#	tests/test-tokenizer-0-spm.py
#	tests/test-tokenizer-1-spm.cpp
2024-04-30 21:04:17 +08:00
slaren
017e6999b5
add basic tensor data validation function (#6884)
* add basic tensor data validation function

* add --check-tensors command line argument

tensor validation is disabled by default and can be enabled by adding
`--check-tensors` to the command line arguments.

quantize always validates tensors.
2024-04-26 18:39:58 +02:00
Concedo
544c36f751 merge, deprecate openblas 2024-04-26 19:24:59 +08:00
Georgi Gerganov
54770413c4
ggml : fix MIN / MAX macros (#6904)
ggml-ci
2024-04-25 15:12:28 +03:00
Concedo
a681cdd9ef Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/sampling.h
#	llama.h
#	tests/test-chat-template.cpp
2024-04-24 21:29:07 +08:00
Georgi Gerganov
c0d1b3e03e
ggml : move 32-bit arm compat in ggml-impl.h (#6865)
ggml-ci
2024-04-24 12:00:07 +03:00
Concedo
bfbaf0011c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	build.zig
#	tests/test-backend-ops.cpp
2024-04-17 18:22:47 +08:00
Justine Tunney
8cc91dc63c
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00
Concedo
d1bb126605 Merge branch 'upstream' into concedo
# Conflicts:
#	README.md
#	llama.cpp
#	otherarch/sdcpp/SDCPP_LICENSE
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.sh
2024-04-09 17:18:35 +08:00
Carolinabanana
5dc9dd7152
llama : add Command R Plus support (#6491)
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 11:16:13 +03:00
Concedo
81ac0e5656 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/llama-cpp-clblast.srpm.spec
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/llama-cpp.srpm.spec
#	.devops/nix/package.nix
#	.devops/server-cuda.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-rocm.Dockerfile
#	.devops/server-vulkan.Dockerfile
#	.devops/server.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/code-coverage.yml
#	.github/workflows/docker.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/gguf-publish.yml
#	.github/workflows/nix-ci-aarch64.yml
#	.github/workflows/nix-ci.yml
#	.github/workflows/python-check-requirements.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/server.yml
#	.github/workflows/zig-build.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	examples/gguf-split/gguf-split.cpp
#	flake.lock
#	flake.nix
#	llama.cpp
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
2024-04-07 22:07:27 +08:00
Concedo
22f543d09b Merge commit '32c8486e1f' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	build.zig
#	llama.cpp
#	tests/test-backend-ops.cpp
2024-04-07 20:39:17 +08:00
Concedo
9c0fbf9f73 Merge commit 'ad3a0505e3' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/code-coverage.yml
#	.github/workflows/docker.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/nix-ci-aarch64.yml
#	.github/workflows/nix-ci.yml
#	.github/workflows/python-check-requirements.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/server.yml
#	.github/workflows/zig-build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	build.zig
#	common/CMakeLists.txt
#	llama.cpp
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
2024-04-06 18:32:57 +08:00
Kawrakow
cbc8343619
Make IQ1_M work for QK_K = 64 (#6327)
* iq1_m: make it work for QK_K = 64 (WIP)

* iq1_m: make it work for QK_K = 64 (scalar and AVX2)

* iq1_m: QK_K = 64 seems to work on Metal and ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-27 08:44:27 +01:00
Kawrakow
55c1b2a3bb
IQ1_M: 1.75 bpw quantization (#6302)
* iq1_m: basics

* iq1_m: basics-2

* iq1_m: CUDA dequantize works

Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.

* iq1_m: separate shifts for each group of 8 in a block

We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105

Not bad, but slightly higher than
  sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
 PPL = 9.14 for LLaMA-v2-7B
 PPL = 6.63 for LLaMA-v2-13B

* iq1_m: go to 3-bit scales

There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.

We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw

* iq1_m: scalar dot product

* iq1_m: AVX2 dot product

* iq1_m: very slightly faster AVX2 dot product

* iq1_m: ARM_NEON dot product

Works, but very slow (10.5 t/s)

* iq1_m: Metal - dequantize works, dot product does not

* iq1_m: Metal now works

About the same performance as iq1_s.

* iq1_m: minor

* iq1_m: checking pure iq1_m quantization

It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.

* iiq1_m: slightly faster ARM_NEON dot product

10.5 t/s -> 11.65 t/s

* iq1_m: faster ARM_NEON dot product

11.65 t/s -> 14.9 t/s

* iq1_m: another minor ARM_NEON dot product improvement

14.9 -> 15.0 t/s

* iq1_m: small PPL improvement via super-block scale adjustment

After quantizing block scales redo the super-block scale fit.

PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B  ) = 8.1624

* iq1_m: adapt to CUDA refactoring

* iq1_m: remove unused variable

We have progressed to warnings being errors.

* iq1_m: add to backend-ops tests

* iq1_m: fix Windows ARM

* iq1_m: use common definition of iq1m_scale_t

* cuda: assert -> NO_DEVICE_CODE

* iq1_M: PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-26 15:21:27 +01:00
Justine Tunney
7733f0c760
ggml : support AVX512VNNI (#6280)
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some
architectures (e.g. AMD Zen 4).
2024-03-25 07:39:56 +02:00
Kawrakow
cfd3be76e3
ggml : same IQ4_NL quantization for CPU/CUDA/Metal (#6196)
* Make quantize_row_iq4_nl do the same thing is quantization on CUDA

* Make quantize_row_iq4_nl do the same thing is quantization on CUDA

This time for real. backend-ops tests pass.

* Now fix test-quantize-fns

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-21 14:59:38 +02:00
Concedo
ba950716a9 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	build.zig
#	llama.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-llama.cpp
2024-03-13 11:21:58 +08:00
Georgi Gerganov
8030da7afe
ggml : reuse quantum structs across backends (#5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-12 14:27:20 +02:00
Georgi Gerganov
184215e783
ggml : fix UB in IQ2_S and IQ3_S (#6012) 2024-03-12 13:49:55 +02:00
Kawrakow
44ca159faf
1.5 bit: we can do even better (#5999)
* iq1_s: we can do even better

Spent one of the 4 scale bits on a signs of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 17:53:15 +02:00
Concedo
6a32c14e86 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	flake.lock
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/.gitignore
#	tests/test-backend-ops.cpp
2024-03-11 23:00:47 +08:00
Michael Podvitskiy
3202361c5b
ggml, ci : Windows ARM runner and build fixes (#5979)
* windows arm ci

* fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64

* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`

* fix `error C2065: '__fp16': undeclared identifier`
2024-03-11 11:28:51 +02:00
Kawrakow
be858f6205
Better 1.5 bit quantization (#5971)
* Trying blocvks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks alltogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 07:51:49 +01:00
Georgi Gerganov
df4dc3e7cb
ggml : try fix 32-bit arm compat (whisper/1938)
* ggml : try fix 32-bit arm compat

* ggml : fix cont
2024-03-10 20:10:39 +02:00
Georgi Gerganov
8380ecfb21
ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951) 2024-03-09 17:36:20 +02:00
Georgi Gerganov
5b09797321
ggml : remove old quantization functions (#5942)
* ggml : remove old quantization functions

ggml-ci

* ggml : simplify ggml_quantize_chunk

ggml-ci

* ggml : restrict correctness

ggml-ci

* ggml : remove hist data from the quantization API

ggml-ci

* tests : remove hist usage in test-backend-ops

ggml-ci

* vulkan : remove hist and fix typo
2024-03-09 15:53:59 +02:00