Commit graph

195 commits

Author SHA1 Message Date
automaticcat
24a447e20a
ggml : add ggml_cpu_has_avx_vnni() (#4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation upgate

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-30 10:07:48 +02:00
Concedo
cc64f2cad1 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/bug.md
#	Makefile
#	README.md
#	ggml-cuda.cu
#	tests/test-grad0.cpp
2023-12-25 18:47:21 +08:00
slaren
5bf3953d7e
cuda : improve cuda pool efficiency using virtual memory (#4606)
* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-24 14:34:22 +01:00
Concedo
3bca03d26b Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	README.md
#	llama.cpp
2023-12-22 21:39:23 +08:00
bobqianic
0137ef88ea
ggml : extend enum ggml_log_level with GGML_LOG_LEVEL_DEBUG (#4579) 2023-12-22 08:47:01 +02:00
Concedo
230a638512 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	llama.cpp
#	tests/test-grad0.cpp
2023-12-22 14:40:13 +08:00
Georgi Gerganov
afefa319f1
ggml : change ggml_scale to take a float instead of tensor (#4573)
* ggml : change ggml_scale to take a float instead of tensor

* ggml : fix CPU implementation

* tests : fix test-grad0

ggml-ci
2023-12-21 23:20:49 +02:00
slaren
d232aca5a7
llama : initial ggml-backend integration (#4520)
* llama : initial ggml-backend integration

* add ggml-metal

* cuda backend can be used though ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set

* add ggml_backend_buffer_clear
zero-init KV cache buffer

* add ggml_backend_buffer_is_hos, used to avoid copies if possible when accesing tensor data

* disable gpu backends with ngl 0

* more accurate mlock

* unmap offloaded part of the model

* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap

* update quantize and lora

* update session copy/set to use ggml-backend

ggml-ci

* use posix_fadvise instead of posix_fadvise64

* ggml_backend_alloc_ctx_tensors_from_buft : remove old print

* llama_mmap::align_offset : use pointers instead of references for out parameters

* restore progress_callback behavior

* move final progress_callback call to load_all_data

* cuda : fix fprintf format string (minor)

* do not offload scales

* llama_mmap : avoid unmapping the same fragments again in the destructor

* remove unnecessary unmap

* metal : add default log function that prints to stderr, cleanup code

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21 21:07:46 +01:00
Concedo
96c12cf395 Merge branch 'master' into concedo_experimental 2023-12-21 20:03:21 +08:00
Eric Sommerlade
328b83de23
ggml : fixed check for _MSC_VER (#4535)
Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>
2023-12-19 18:17:01 +02:00
Concedo
49a5dfc604 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-12-19 16:07:48 +08:00
Ebey Abraham
b9e74f9bca
llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)
* phi2 implementation

* fix breaking change

* phi-2 : various fixes

* phi-2 : use layer norm eps

* py : whitespaces

* llama : fix meta KV override bug

* convert : phi don't add BOS token

* convert : revert "added_tokens_decoder" change

* phi-2 : scale Q instead of KQ for better precision

* ggml : fix NeoX rope to rotate just first n_dims

* cuda : less diff in the rope_neox kernel

* ggml : add ggml_mul_mat_set_prec

ggml-ci

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision

* cuda : remove oboslete comment

---------

Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-18 19:27:47 +02:00
Concedo
76a3ba42eb Merge branch 'master' into concedo_experimental
# Conflicts:
#	ggml.c
#	ggml.h
#	requirements.txt
#	tests/test-quantize-perf.cpp
2023-12-16 22:58:53 +08:00
slaren
6744dbe924
ggml : use ggml_row_size where possible (#4472)
* ggml : use ggml_row_size where possible

ggml-ci

* ggml : move ggml_nbytes_split to ggml-cuda.cu
2023-12-14 20:05:21 +01:00
slaren
cafcd4f895
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
2023-12-14 16:52:08 +01:00
LostRuins
20a68a7030
ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-14 14:13:33 +02:00
Concedo
53bbd1ee43 Merge branch 'pr_fix_buf_resize_type' into concedo_experimental 2023-12-14 16:45:18 +08:00
Concedo
05f7db4b29 do not cast to size_t, instead just use doubles 2023-12-14 16:43:34 +08:00
Concedo
c88fc19d59 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
2023-12-14 16:32:42 +08:00
Georgi Gerganov
4d98d9a656
sync : ggml (SD ops, tests, kernels) (#4444)
* sync : ggml (SD ops, tests, kernels)

ggml-ci

* cuda : restore im2col

ggml-ci

* metal : fix accuracy of dequantization kernels

ggml-ci

* cuda : restore correct im2col

ggml-ci

* metal : try to fix moe test by reducing expert size

ggml-ci

* cuda : fix bin bcast when src1 and dst have different types

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13 21:54:54 +02:00
Concedo
e447af669c Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-12-13 21:09:47 +08:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00
Concedo
c2c238b4f3 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-grad0.cpp
#	tests/test-quantize-perf.cpp
2023-12-13 14:49:03 +08:00
Taikono-Himazin
41a11aaf99
ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424) 2023-12-12 11:24:32 +02:00
Concedo
ec21fa7712 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	ggml-cuda.cu
#	llama.cpp
#	llama.h
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
2023-12-08 17:42:26 +08:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Concedo
48544cd2ef Revert "Revert "ggml : add ggml_soft_max_ext (#4256)""
This reverts commit a8e66ef31c.
2023-12-03 21:46:50 +08:00
Concedo
a8e66ef31c Revert "ggml : add ggml_soft_max_ext (#4256)"
This reverts commit ef47ec18da.
2023-12-03 00:42:01 +08:00
Concedo
495bb3ab1e Merge branch 'master' into concedo_experimental 2023-12-01 23:48:20 +08:00
Georgi Gerganov
ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
* metal : implement soft_max_ext

* cuda : implement soft_max_ext

* ggml : implement soft_max_ext (CPU)

* batched-bench : print threads

ggml-ci

* metal : simplify soft_max encoding

ggml-ci

* cuda : use 512 threads for soft_max instead of 32

* ggml : update soft max cpu

* cuda : do warp-based block reduce

* cuda : increase max block size to 1024

* cuda : fix warp reduction initialization of shared mem

* metal : warp-based reduction for soft max kernel

* metal : warp-based reduce for rms_norm

* metal : simplify soft max kernel

ggml-ci

* alloc : fix build with debug
2023-12-01 10:51:24 +02:00
Concedo
e9724cdc9d Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-11-30 14:31:53 +08:00
Jared Van Bortel
64e64aa255
ggml : restore abort() in GGML_ASSERT (#4242) 2023-11-28 11:51:11 +02:00
Concedo
6bf8ee4aea Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml-cuda.cu
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
slaren
e85bb1a8e7
llama : add functions to get the model's metadata (#4013)
* llama : add functions to get the model's metadata

* format -> std::to_string

* better documentation
2023-11-17 17:17:37 +02:00
Concedo
35a97e14b2 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	docs/token_generation_performance_tips.md
#	grammars/README.md
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-grad0.cpp
#	tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
Georgi Gerganov
3d68f364f1
ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060)
ggml-ci
2023-11-13 16:55:52 +02:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
* sync : ggml (backend v2) (wip)

* sync : migrate examples and llama.cpp to dynamic graphs (wip)

* sync : update tests + fix max op params to 64

ggml-ci

* sync : ggml-cuda

ggml-ci

* llama : fix save/load state context size

ggml-ci

* sync : try to fix build on tvOS

* sync : pass custom graph sizes in training examples

* sync : update graph copies to new ggml API

* sync : update sync-ggml.sh with new files

* scripts : fix header in sync script

* train : fix context size calculations

* llama : increase inference graph size up to 4096 nodes

* train : allocate grads for backward graphs

* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Concedo
0b13ebed6a Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-11-08 20:54:09 +08:00
xaedes
e9c1cecb9d
ggml : fix backward rope after YaRN (#3974)
* fix backward process of rope

rope backward process was broken after YaRN RoPE (#2268) implementation, due to missing changes in backward functions.

the code for the backward process is nearly identically to the forward process:
the only difference is the sign of the sin-values.

to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:

for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.

* fix finetune rope call to use correct default attn_factor of 1.0f

* remove unused `ggml_rope_xpos_back`

it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.

* fix comments explaining the sinus sign in ggml_forward_rope

* add missing function arguments in declaration

* fix function argument type in declaration
2023-11-07 10:04:51 +02:00
Concedo
bc4ff72317 not working merge 2023-11-02 17:52:40 +08:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Concedo
9342636408 Merge branch 'master' into concedo_experimental
# Conflicts:
#	flake.lock
#	flake.nix
2023-11-01 18:24:36 +08:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code (#3837)
* llama : factor out ggml-alloc from graph graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions (#3848)

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function (#3849)

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
Concedo
cc5b282350 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	build.zig
#	flake.lock
#	flake.nix
#	ggml.c
2023-10-31 20:44:04 +08:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring (#3833)
* ggml : factor all quantization code in ggml-quants

ggml-ci

* ggml-quants : fix Zig and Swift builds + quantize tool

ggml-ci

* quantize : --pure option for disabling k-quant mixtures

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Concedo
5db89b90b7 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	build.zig
#	ggml-opencl.cpp
#	tests/CMakeLists.txt
#	tests/test-double-float.cpp
#	tests/test-sampling.cpp
2023-10-25 23:58:15 +08:00
Georgi Gerganov
b2f7e04bd3
sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
2023-10-24 21:51:20 +03:00
Qin Yue Chen
8cf19d60dc
gguf : support big endian platform (#3552)
* check whether platform is 390x if yes->do not import immintrin.h

* support s390x big endian

* support --bigendian option for s390x
1. verified with baichuan7b-chat with float 16 on s390x
2. verified with baichuan7b-chat
3. verified with chinese-alpaca-2-13b-f16

* update format based on editor-config checker result

* Update convert-baichuan-hf-to-gguf.py

* 1. check in ggml.c if endianess is not match
2. update GGUF version
3. change get_pack_prefix to property
4. update information log

* always use "GGUF" as beginng of GGUF file

* Compare "GGUF" with file header char by char
1.  Set GGUF_MAGIC to "GGUF" string instead of int value
2. Compare "GGUF" char by char to ensure its byte order
3. Move bytes swap code from convert.py to gguf.py write_tensor_data

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-20 14:19:40 +03:00
Concedo
ee0681f0d9 convert some asserts into non-terminating since they are ovezealous 2023-10-15 16:12:20 +08:00
slaren
424b6381c4
ggml : add context enumeration functions (#3605)
finetune : fix assert failure in ggml-alloc
2023-10-13 12:23:10 +02:00