Commit graph

341 commits

Author SHA1 Message Date
Concedo
05f7db4b29 do not cast to size_t, instead just use doubles 2023-12-14 16:43:34 +08:00
Concedo
1ad8f0d80e Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values 2023-12-14 16:00:44 +08:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00
Richard Kiss
9494d7c477
english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
Xiang (Kevin) Li
e18f7345a3
grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396) 2023-12-09 23:29:27 +02:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Marcus Dunn
5f6e0c0dff
grammar : pre-computed pieces + reserve mem + less string copies (#4330)
* reserve space for codepoints

* improvement for the appended 0

* used precomputed token text for grammar sample

* reserve canidates_decoded

* reserve canidates_grammar

* remove candidates_decoded

* Revert "remove candidates_decoded"

This reverts commit 3773328080e6a139ee83198329a13cf4ff61d707.

* changed decode_utf8 to take src by ref
2023-12-05 22:55:12 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model (#4092)
* feat: Allow overriding GGUF metadata when loading model

* Fix the one time GCC is stricter than clang about something

* Step1

* Refactor... basically everything!

* Nuke obsolete GetArrayLen struct

* simplify std::string specialization

* Various cleanups

Add informational output when overrides are applied

Warn user when an override with the wrong type is specified

* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes

* llama : rearrange model params

* Update new GET_KEY call

Add note that metadata KV overrides aren't reflected in initial metadata KV info dump

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05 19:19:18 +02:00
Georgi Gerganov
d7b800b8bc
llama : pad KV cache size (#4280)
* llama : pad KV cache size to 32

* metal : try to improve batched decoding
2023-12-03 10:58:16 +02:00
Georgi Gerganov
5a7d3125e7
llama : avoid using "optional" keyword (#4283) 2023-12-01 20:39:12 +02:00
Georgi Gerganov
d5a1cbde60
llama : support optional tensors (#4283) 2023-12-01 20:35:47 +02:00
CausalLM
03562f3a86
llama : support attention bias on LLaMA architecture (#4283)
* Support attention_bias on LLaMA architecture

QKVO bias, should fix InternLM (https://github.com/ggerganov/llama.cpp/issues/3133) and works for LLaMAfied Qwen models (https://github.com/ggerganov/llama.cpp/pull/3743#issuecomment-1825923608).

* check existence of qkvo bias while loading llama models

Tested on LLaMA2, CUDA and CPU.

* Update llama.cpp
2023-12-01 20:17:06 +02:00
Shijie
37c746d687
llama : add Qwen support (#4281)
* enable qwen to llama.cpp

* llama : do not GPU split bias tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-01 20:16:31 +02:00
Georgi Gerganov
880f57973b
llama : fix integer overflow during quantization (#4284)
happens with multi-threaded quantization of Qwen-72B

ggml-ci
2023-12-01 18:42:11 +02:00
Georgi Gerganov
ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
* metal : implement soft_max_ext

* cuda : implement soft_max_ext

* ggml : implement soft_max_ext (CPU)

* batched-bench : print threads

ggml-ci

* metal : simplify soft_max encoding

ggml-ci

* cuda : use 512 threads for soft_max instead of 32

* ggml : update soft max cpu

* cuda : do warp-based block reduce

* cuda : increase max block size to 1024

* cuda : fix warp reduction initialization of shared mem

* metal : warp-based reduction for soft max kernel

* metal : warp-based reduce for rms_norm

* metal : simplify soft max kernel

ggml-ci

* alloc : fix build with debug
2023-12-01 10:51:24 +02:00
Jared Van Bortel
15f5d96037
build : fix build info generation and cleanup Makefile (#3920)
* cmake : fix joining of REAL_GIT_DIR

* fix includes with help from include-what-you-use

* make : remove unneeded deps and add test-rope target

* fix C includes in C++ source files

* Revert "fix includes with help from include-what-you-use"

This reverts commit 635e9fadfd516d4604a0fecf4a854bfb25ad17ae.
2023-12-01 00:23:08 +02:00
Daniel Bevenius
b18c66ca6e
llama : fix alignment of general.name in print meta (#4254)
* llama: fix alignment of general.name in print meta

This commit fixes the alignment of the general.name field in the
llm_load_print_meta function.

Currently the output looks like this:
```console
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name   = LLaMA v2
```
And with this commit it looks like this:
```console
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name     = LLaMA v2
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama: fix alignment of special tokens

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-11-30 23:43:08 +02:00
tarcey
954e22858c
llama : fix typical sampling (#4261)
Typical sampling was broken because after copying new_candidates into canditates, the "sorted" bool is left at "true", but the new data is no longer sorted according to probability. Patch to set "sorted" to false.

Test: Generating with temp=0.0001 (approx. argmax)  should generate the same sequence at typical>=1.0 and typical=0.9999 (approx. disabled, but enters the typical sampling codepath).
2023-11-30 23:40:23 +02:00
Georgi Gerganov
8406b0924b
ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full offload checks in llama.cpp (#4240)
* ggml : use blas even if src0 is not F32

* llama : use n_threads_batch only when n_tokens >= 32

ggml-ci

* llama : revert n_threads_batch logic

ggml-ci
2023-11-28 10:32:03 +02:00
Marcus Dunn
f837c3a992
llama : grammar reserve space in decode_utf8 (#4210)
* reserve space for codepoints

* improvement for the appended 0
2023-11-25 18:58:23 +02:00
slaren
e9c13ff781
llama : set metal log callback correctly (#4204) 2023-11-24 18:10:01 +01:00
slaren
8a052c131e
ggml-cuda : support stablelm rope (#4156)
* ggml-cuda : support stablelm rope

* remove unused freq_base kernel parameter

* add n_dims parameter to llm_build_k_shift, default to n_rot via overload

* llama : fix llm_build_k_shift args

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-24 18:04:31 +01:00
Georgi Gerganov
6b0a7420d0
llama : KV cache view API + better KV cache management (#4170)
* llama : keep track of used KV cells + better KV cache management

* llama : zero KV cache used upon clear

ggml-ci

* llama : allow exporting a view of the KV cache (#4180)

* Allow exporting a view of the KV cache

* Allow dumping the sequences per cell in common

* Track max contiguous cells value and position as well

* Fix max contiguous empty cells index calculation

Make dump functions deal with lengths or sequences counts > 10 better

* Fix off by one error in dump_kv_cache_view

* Add doc comments for KV cache view functions

Eliminate cell sequence struct; use llama_seq_id directly

Minor cleanups

* common : add -dkvc arg for enabling kv cache dumps

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-23 19:07:56 +02:00
Galunid
8e672efe63
stablelm : simplify + speedup generation (#4153) 2023-11-21 16:22:30 +01:00
slaren
e937066420
gguf-py : export chat templates (#4125)
* gguf-py : export chat templates

* llama.cpp : escape new lines in gguf kv info prints

* gguf-py : bump version

* gguf-py : check chat_template type

* gguf-py : initialize chat_template
2023-11-19 11:10:52 +01:00
slaren
bbecf3f415
llama : increase max nodes (#4115) 2023-11-17 21:39:11 +02:00
slaren
e85bb1a8e7
llama : add functions to get the model's metadata (#4013)
* llama : add functions to get the model's metadata

* format -> std::to_string

* better documentation
2023-11-17 17:17:37 +02:00
Georgi Gerganov
4f447a4833
llama : fix data units (#4101)
* llama : fix data units

ggml-ci

* Revert "llama : fix data units"

This reverts commit f5feac831fe225ed7f3db938d115732a49dccfc4.

* llama : disambiguate data units

ggml-ci
2023-11-17 10:00:15 +02:00
Kerfuffle
91f6499393
Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00
Jared Van Bortel
a6fc554e26
llama : restore prefix space in llama tokenizer (#4081) 2023-11-15 11:34:47 -05:00
Galunid
36eed0c42c
stablelm : StableLM support (#3586)
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
2023-11-14 11:17:12 +01:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
* sync : ggml (backend v2) (wip)

* sync : migrate examples and llama.cpp to dynamic graphs (wip)

* sync : update tests + fix max op params to 64

ggml-ci

* sync : ggml-cuda

ggml-ci

* llama : fix save/load state context size

ggml-ci

* sync : try to fix build on tvOS

* sync : pass custom graph sizes in training examples

* sync : update graph copies to new ggml API

* sync : update sync-ggml.sh with new files

* scripts : fix header in sync script

* train : fix context size calculations

* llama : increase inference graph size up to 4096 nodes

* train : allocate grads for backward graphs

* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Kerfuffle
bb50a792ec
Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041)
* Add ReLU and SQR CUDA ops to fix Persimmon offloading

* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
2023-11-13 01:58:15 -07:00
Galunid
df9d1293de
Unbreak persimmon after #3837 (#4010) 2023-11-10 14:24:54 +01:00
Meng Zhang
46876d2a2c
cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946)
* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
2023-11-07 08:49:08 +02:00
Meng Zhang
3d48f42efc
llama : mark LLM_ARCH_STARCODER as full offload supported (#3945)
as done in https://github.com/ggerganov/llama.cpp/pull/3827
2023-11-05 14:40:08 +02:00
cebtenzzre
3fdbe6b66b
llama : change yarn_ext_factor placeholder to -1 (#3922) 2023-11-03 08:31:58 +02:00
Georgi Gerganov
1efae9b7dc
llm : prevent from 1-D tensors being GPU split (#3697) 2023-11-02 09:54:44 +02:00
cebtenzzre
0eb332a10f
llama : fix llama_context_default_params after #2268 (#3893) 2023-11-01 19:29:14 -04:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Georgi Gerganov
c43c2da8af
llm : fix llm_build_kqv taking unused tensor (benign, #3837) 2023-11-01 23:08:30 +02:00
Georgi Gerganov
523e49b111
llm : fix falcon norm after refactoring (#3837) 2023-11-01 23:00:50 +02:00
Georgi Gerganov
50337961a6
llm : add llm_build_context (#3881)
* llm : add llm_build_context

* llm : deduce norm eps based on type + explict max_alibi_bias, clamp_kqv

* llm : restore the non-graph llm_build_ functional API

ggml-ci

* llm : cleanup + comments
2023-11-01 20:11:02 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
* Add '-ngl' support to finetune.cpp

* Add fprintf in ggml_cuda_op_add

When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora

* Add 'finetune.sh', which currently fails when using GPU

"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"

* tweak finetune.sh

* Suppress some warnings in ggml.c

* Add f16 implementation to ggml_compute_forward_add_f16_f32

* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs

* finetune.sh: Edit comments

* Add "add_f16_f32_f32_cuda"

* Tweak an error message

* finetune.sh: Add an optional LLAMA_MODEL_DIR variable

* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable

* train : minor

* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code (#3837)
* llama : factor out ggml-alloc from graph graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions (#3848)

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function (#3849)

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841)
* Introduce the new Min-P sampler by @kalomaze
   The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861)
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h

ggml-ci

* tests : fix ARM build

* ggml : explicitly initialize deprecated type traits

* ggml : add math.h to ggml-impl.h

* ggml : remove duplicate static assert macros

* ggml : prefix lookup tables with ggml_

ggml-ci

* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matichng any sequence

* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear

Use llama_kv_cache_clear for cache clearing

Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
Georgi Gerganov
71a09da301
llama : fix kv shift bug (#3835)
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring (#3833)
* ggml : factor all quantization code in ggml-quants

ggml-ci

* ggml-quants : fix Zig and Swift builds + quantize tool

ggml-ci

* quantize : --pure option for disabling k-quant mixtures

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00