Commit graph

857 commits

Author SHA1 Message Date
Georgi Gerganov
bb0c139247
llama : check LLAMA_TRACE env for extra logging (#4929)
* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci
2024-01-14 13:26:53 +02:00
Georgi Gerganov
03c5267490
llama : use LLAMA_LOG_ macros for logging 2024-01-14 11:03:19 +02:00
Kawrakow
a128c38de8
Fix ffn_down quantization mix for MoE models (#4927)
* Fix ffn_down quantization mix for MoE models

In #4872 I did not consider the part where every third
tensor is quantized with more bits. Fir MoE this leads to tensors
of the same layer being quantized with different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:53:39 +02:00
Karthik Kumar Viswanathan
ac32902a87
llama : support WinXP build with MinGW 8.1.0 (#3419) 2024-01-14 10:41:44 +02:00
Concedo
dc7bc0cb50 Merge commit '584d674be6' into concedo_experimental
# Conflicts:
#	.github/workflows/nix-flake-update.yml
#	Makefile
#	Package.swift
#	ggml-cuda.cu
#	tests/test-quantize-fns.cpp
2024-01-14 16:29:44 +08:00
Kawrakow
147b17ac94
2-bit quantizations (#4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:45:56 +02:00
Kawrakow
807179ec58
Make Q3_K_S be the same as olf Q3_K_L for Mixtral-8x7B (#4906)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:44:30 +02:00
Georgi Gerganov
4be5ef556d
metal : remove old API (#4919)
ggml-ci
2024-01-13 20:45:45 +02:00
Georgi Gerganov
f172de03f1
llama : fix detokenization of non-special added-tokens (#4916)
Co-authored-by: goerch <jhr.walter@t-online.de>
2024-01-13 18:47:38 +02:00
David Friehs
df845cc982
llama : minimize size used for state save/load (#4820)
* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.

* llama : only save and restore used logits

for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
2024-01-13 18:29:43 +02:00
Georgi Gerganov
15ebe59210
convert : update phi-2 to latest HF repo (#4903)
* convert : update phi-2 to latest HF repo

ggml-ci

* py : try to fix flake stuff
2024-01-13 13:44:37 +02:00
slaren
e7e4df031b
llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
Georgi Gerganov
584d674be6
llama : remove redundant assert for StableLM (#4901) 2024-01-12 20:54:12 +02:00
Georgi Gerganov
3cabe80630
llama : fix typo "imp_embd" -> "inp_embd" 2024-01-12 13:11:15 +02:00
Georgi Gerganov
f445c0e68c
llama : fix llm_build_k_shift to use correct n_rot (#4889)
* llama : fix llm_build_k_shift to use correct n_rot

ggml-ci

* llama : always use hparams.n_rot for ggml_rope_custom

ggml-ci

* convert : fix persimmon conversion to write correct n_rot
2024-01-12 13:01:56 +02:00
Kawrakow
469e75d0a3
llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00
Kawrakow
49662cbed3
ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:39:39 +02:00
pudepiedj
43f76bf1c3
main : print total token count and tokens consumed so far (#4874)
* Token count changes

* Add show token count

* Updating before PR

* Two requested changes

* Move param def posn
2024-01-11 18:14:52 +02:00
Brian
57d016ba2d
llama : add additional suffixes for model params (#4834)
* llm_load_print_meta: Add additional suffixs for model params

* Update llama.cpp model param log

remove unneeded comments and convert from > to >=
2024-01-10 16:09:53 +02:00
Austin
329ff61569
llama : recognize 1B phi models (#4847)
This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.
2024-01-10 15:39:09 +02:00
Concedo
66533c8424 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	Package.swift
#	README.md
#	tests/test-quantize-fns.cpp
2024-01-09 17:48:18 +08:00
Concedo
5cc64ebb52 dynatemp wizard 2024-01-09 15:51:32 +08:00
Kawrakow
dd5ae06405
SOTA 2-bit quants (#4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slighty faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slighty faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-08 16:02:32 +01:00
Georgi Gerganov
b0034d93ce
examples : add passkey test (#3856)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme
2024-01-08 11:14:04 +02:00
Concedo
f04b6e7287 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	CMakeLists.txt
#	README.md
#	ggml-metal.m
#	ggml.c
2024-01-08 14:18:49 +08:00
Georgi Gerganov
9dede37d81
llama : remove unused vars (#4796) 2024-01-07 14:29:36 +02:00
Georgi Gerganov
3c36213df8
llama : remove redundant GQA check (#4796) 2024-01-07 11:21:53 +02:00
Georgi Gerganov
d117d4dc5d
llama : print tensor meta for debugging 2024-01-07 09:51:12 +02:00
Concedo
b614a86dd9 disable print statements for dynatemp 2024-01-06 11:14:58 +08:00
kalomaze
123bff9a0f
Full DynaTemp implementation + UI (#600)
* move Dynatemp changes to new branch

* fix float header

* Properly reintroduce variable expert count

Controllable through experts.txt

* first pass at DynaTemp UI

Checkbox partial implemented, Min and Max Temp implemented

* DynaTemp UI Checkbox

Trigger DynaTemp on checkbox

* DynaTemp UI checkbox edition

Hell Yeah! DynaTemp!

* Remove greedy dynatemp

* Fix race condition caused by debug print

* Fixed broken presets and miro

Fixes broken presets and mirostat

* Remove debug function + HHI temp

Also removed unnecessary softmax double precision

* Fix whitespace (?) for generate function

* epic upstream renaming scheme fix

* fix stupid indents

* Other cleanup

Reintroduce unused rep pen function, move temp functions first before entropy dynamic temp

* Slight indent fix

* revert batch pyinstaller maker to mainline

and also delete experts.txt since adjustable routing is also being removed for the PR

* compact dynatemp into a single value dynatemp_range. This is a float which represents the allowed deviation from the min and max temperature when using dynatemp. Thus, if we want a value of dynatemp_min=0.3, dynatemp_max=0.5, then we would simply set temperature=0.4 and dynatemp_range=0.1. Functionally dynatemp would operate the same, but it would simplify usage and make it a single easy to adjust value.

---------

Co-authored-by: Alexander Abushady <aabushady214@gmail.com>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-01-06 11:13:16 +08:00
Concedo
234f79fe9d Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	ci/run.sh
#	llama.cpp
2024-01-03 22:33:38 +08:00
Georgi Gerganov
540938f890
llama : llama_model_desc print number of experts 2024-01-02 16:26:45 +02:00
Marcus Dunn
0040d42eeb
llama : replace all API facing int's with int32_t (#4577)
* replaced all API facing `int`'s with `int32_t`

* formatting and missed `int` in `llama_token_to_piece`
2024-01-02 16:15:16 +02:00
postmasters
83e633c27e
llama : differentiate the KV dims in the attention (#4657)
* Add n_key_dim and n_value_dim

Some models use values that are not derived from `n_embd`.
Also remove `n_embd_head` and `n_embd_gqa` because it is not clear
which "head" is referred to (key or value).

Fix issue #4648.

* Fix `llm_build_kqv` to use `n_value_gqa`

* Rebase

* Rename variables

* Fix llm_build_kqv to be more generic wrt n_embd_head_k

* Update default values for n_embd_head_k and n_embd_head_v

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix llm_load_tensors: the asserts were not backcompat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-02 13:51:28 +02:00
Concedo
fe7c200610 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-rocm.Dockerfile
#	README.md
#	flake.lock
#	flake.nix
#	ggml-cuda.cu
#	requirements.txt
#	tests/CMakeLists.txt
2023-12-31 00:42:59 +08:00
automaticcat
24a447e20a
ggml : add ggml_cpu_has_avx_vnni() (#4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation upgate

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-30 10:07:48 +02:00
manikbhandari
ea5497df5d
gpt2 : Add gpt2 architecture integration (#4555) 2023-12-28 15:03:57 +01:00
Nam D. Tran
f6793491b5
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)
* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for bot mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* uppdate: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

---------

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-27 17:39:45 +02:00
Concedo
69ab1bf2f8 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-12-27 21:43:46 +08:00
slaren
dc68f0054c
cuda : fix vmm pool with multi GPU (#4620)
* cuda : fix vmm pool with multi GPU

* hip

* use recommended granularity instead of minimum

* better error checking

* fix mixtral

* use cudaMemcpy3DPeerAsync

* use cuda_pool_alloc in ggml_cuda_op_mul_mat

* consolidate error checking in ggml_cuda_set_device

* remove unnecessary inlines

ggml-ci

* style fixes

* only use vmm for the main device

* fix scratch buffer size, re-enable vmm pool for all devices

* remove unnecessary check id != g_main_device
2023-12-26 21:23:59 +01:00
Concedo
cc64f2cad1 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/bug.md
#	Makefile
#	README.md
#	ggml-cuda.cu
#	tests/test-grad0.cpp
2023-12-25 18:47:21 +08:00
Concedo
293395e0f5 Merge commit '708e179e85' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
2023-12-25 16:48:15 +08:00
Shintarou Okada
753be377b6
llama : add PLaMo model (#3557)
* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-24 15:35:49 +02:00
slaren
5bf3953d7e
cuda : improve cuda pool efficiency using virtual memory (#4606)
* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-24 14:34:22 +01:00
slaren
708e179e85
fallback to CPU buffer if host buffer alloc fails (#4610) 2023-12-23 16:10:51 +01:00
Concedo
3bca03d26b Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	README.md
#	llama.cpp
2023-12-22 21:39:23 +08:00
slaren
48b7ff193e
llama : fix platforms without mmap (#4578)
* llama : fix platforms without mmap

* win32 : limit prefetch size to the file size

* fix win32 error clobber, unnecessary std::string in std::runtime_error
2023-12-22 13:12:53 +02:00
Concedo
230a638512 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	llama.cpp
#	tests/test-grad0.cpp
2023-12-22 14:40:13 +08:00
crasm
c7e9701f86
llama : add ability to cancel model loading (#4462)
* llama : Add ability to cancel model load

Updated llama_progress_callback so that if it returns false, the model
loading is aborted.

* llama : Add test for model load cancellation

* Fix bool return in llama_model_load, remove std::ignore use

* Update llama.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Fail test if model file is missing

* Revert "Fail test if model file is missing"

This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d.

* Add test-model-load-cancel to Makefile

* Revert "Revert "Fail test if model file is missing""

This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b.

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* ci : get ci/run.sh working with test-model-load-cancel

* ci : restrict .github/workflows/build.yml ctest to -L main

* update requirements.txt

* Disable test-model-load-cancel in make

* Remove venv before creation

* Restructure requirements.txt

Top-level now imports the specific additional requirements for each
python file. Using `pip install -r requirements.txt` will fail if
versions become mismatched in the per-file requirements.

* Make per-python-script requirements work alone

This doesn't break the main requirements.txt.

* Add comment

* Add convert-persimmon-to-gguf.py to new requirements.txt scheme

* Add check-requirements.sh script and GitHub workflow

* Remove shellcheck installation step from workflow

* Add nocleanup special arg

* Fix merge

see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573

* reset to upstream/master

* Redo changes for cancelling model load

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-12-22 08:19:36 +02:00
Georgi Gerganov
afefa319f1
ggml : change ggml_scale to take a float instead of tensor (#4573)
* ggml : change ggml_scale to take a float instead of tensor

* ggml : fix CPU implementation

* tests : fix test-grad0

ggml-ci
2023-12-21 23:20:49 +02:00