Commit graph

157 commits

Author SHA1 Message Date
Concedo
f73de33f74 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/README.md
#	ci/run.sh
#	flake.lock
#	ggml-metal.m
#	ggml-opencl.cpp
#	ggml-vulkan-shaders.hpp
#	ggml-vulkan.cpp
#	ggml-vulkan.h
#	ggml.c
#	ggml_vk_generate_shaders.py
#	llama.cpp
#	llama.h
#	pocs/vdot/vdot.cpp
#	tests/test-llama-grammar.cpp
#	tests/test-sampling.cpp
2024-01-29 23:12:09 +08:00
0cc4m
2307523d32
ggml : add Vulkan backend (#2059)
* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:03:59 +02:00
Abhilash Majumder
0f648573dd
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration

* update init_cublas

* add debug functio, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summary dpct definition in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add know issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export funciton print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 17:56:23 +02:00
Concedo
ed09a854f0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	ggml-opencl.cpp
#	tests/CMakeLists.txt
2024-01-27 11:45:07 +08:00
l3utterfly
5eaf9964fc
llama : dynamic temperature sampling (#4972)
* implemented dynamic temperature sampling from koboldcpp

* removed trailing whitespace

* removed unused temp parameter in llama_sample_entropy

* exposed exponent_val in dynamic temp sampler

* added debug check for printf statements

* use nullptr in llama_sample_softmax call during llama_sample_entropy

this avoids counting the time taken stats twice

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* return earlier if there is only 1 candiate (i.e. max_entropy == 0)

* reformat 't' case in llama_sample_queue

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* check for one or zero candidates case in llama_sample_entropy

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-01-25 22:06:22 +02:00
Concedo
2a4a7241e6 Merge branch 'vulkan_test' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	llama.cpp
2024-01-25 23:01:44 +08:00
Concedo
7b3866f211 vulkan implementation from occam (early access, squashed) 2024-01-25 18:13:19 +08:00
Concedo
08236ccc97 better abort handling, added support for dynatemp exponent 2024-01-23 16:56:12 +08:00
Concedo
f96f29be7b Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/nix/nixpkgs-instances.nix
#	.devops/nix/package.nix
#	.devops/nix/scope.nix
#	.github/workflows/build.yml
#	.github/workflows/nix-ci.yml
#	CMakeLists.txt
#	flake.nix
#	ggml.c
2024-01-22 22:31:22 +08:00
Kawrakow
66d575c45c
llama : add Q3_K_XS (#5060)
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

* Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with a 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 12:43:33 +02:00
Concedo
1cb8a5e955 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	flake.lock
#	flake.nix
#	ggml-cuda.cu
#	ggml-cuda.h
#	scripts/get-wikitext-2.sh
#	tests/CMakeLists.txt
2024-01-21 14:32:15 +08:00
Concedo
71e9a64171 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/nix-ci.yml
#	CMakeLists.txt
#	Makefile
#	ggml-cuda.cu
#	ggml-opencl.cpp
#	llama.cpp
2024-01-20 23:27:42 +08:00
Georgi Gerganov
44a1a4a41a
backend : add eval callback (#4935)
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo
2024-01-17 18:39:41 +02:00
David Friehs
4483396751
llama : apply classifier-free guidance to logits directly (#4951) 2024-01-15 15:06:52 +02:00
Concedo
dc7bc0cb50 Merge commit '584d674be6' into concedo_experimental
# Conflicts:
#	.github/workflows/nix-flake-update.yml
#	Makefile
#	Package.swift
#	ggml-cuda.cu
#	tests/test-quantize-fns.cpp
2024-01-14 16:29:44 +08:00
Kawrakow
147b17ac94
2-bit quantizations (#4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 09:45:56 +02:00
David Friehs
df845cc982
llama : minimize size used for state save/load (#4820)
* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.

* llama : only save and restore used logits

for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
2024-01-13 18:29:43 +02:00
slaren
e7e4df031b
llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
Kawrakow
469e75d0a3
llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00
Kawrakow
49662cbed3
ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:39:39 +02:00
Concedo
66533c8424 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	Package.swift
#	README.md
#	tests/test-quantize-fns.cpp
2024-01-09 17:48:18 +08:00
Kawrakow
dd5ae06405
SOTA 2-bit quants (#4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slighty faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slighty faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-08 16:02:32 +01:00
Georgi Gerganov
52531fdff8
main : add self-extend support (#4815)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* llama : "self-extend"-like context extension

* passkey : add comment

* main : add Self-Extend support

* llama : add comment about llama_kv_cache_seq_div
2024-01-08 11:18:32 +02:00
Georgi Gerganov
b0034d93ce
examples : add passkey test (#3856)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme
2024-01-08 11:14:04 +02:00
kalomaze
123bff9a0f
Full DynaTemp implementation + UI (#600)
* move Dynatemp changes to new branch

* fix float header

* Properly reintroduce variable expert count

Controllable through experts.txt

* first pass at DynaTemp UI

Checkbox partial implemented, Min and Max Temp implemented

* DynaTemp UI Checkbox

Trigger DynaTemp on checkbox

* DynaTemp UI checkbox edition

Hell Yeah! DynaTemp!

* Remove greedy dynatemp

* Fix race condition caused by debug print

* Fixed broken presets and miro

Fixes broken presets and mirostat

* Remove debug function + HHI temp

Also removed unnecessary softmax double precision

* Fix whitespace (?) for generate function

* epic upstream renaming scheme fix

* fix stupid indents

* Other cleanup

Reintroduce unused rep pen function, move temp functions first before entropy dynamic temp

* Slight indent fix

* revert batch pyinstaller maker to mainline

and also delete experts.txt since adjustable routing is also being removed for the PR

* compact dynatemp into a single value dynatemp_range. This is a float which represents the allowed deviation from the min and max temperature when using dynatemp. Thus, if we want a value of dynatemp_min=0.3, dynatemp_max=0.5, then we would simply set temperature=0.4 and dynatemp_range=0.1. Functionally dynatemp would operate the same, but it would simplify usage and make it a single easy to adjust value.

---------

Co-authored-by: Alexander Abushady <aabushady214@gmail.com>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-01-06 11:13:16 +08:00
Concedo
234f79fe9d Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	ci/run.sh
#	llama.cpp
2024-01-03 22:33:38 +08:00
Concedo
e49d398f73 use same struct size for cuda and non cuda (+1 squashed commits)
Squashed commits:

[6eee8e2f] use same struct size for cuda and non cuda
2024-01-03 16:05:54 +08:00
Marcus Dunn
0040d42eeb
llama : replace all API facing int's with int32_t (#4577)
* replaced all API facing `int`'s with `int32_t`

* formatting and missed `int` in `llama_token_to_piece`
2024-01-02 16:15:16 +02:00
Concedo
3bca03d26b Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	README.md
#	llama.cpp
2023-12-22 21:39:23 +08:00
Concedo
230a638512 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	llama.cpp
#	tests/test-grad0.cpp
2023-12-22 14:40:13 +08:00
crasm
c7e9701f86
llama : add ability to cancel model loading (#4462)
* llama : Add ability to cancel model load

Updated llama_progress_callback so that if it returns false, the model
loading is aborted.

* llama : Add test for model load cancellation

* Fix bool return in llama_model_load, remove std::ignore use

* Update llama.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Fail test if model file is missing

* Revert "Fail test if model file is missing"

This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d.

* Add test-model-load-cancel to Makefile

* Revert "Revert "Fail test if model file is missing""

This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b.

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* ci : get ci/run.sh working with test-model-load-cancel

* ci : restrict .github/workflows/build.yml ctest to -L main

* update requirements.txt

* Disable test-model-load-cancel in make

* Remove venv before creation

* Restructure requirements.txt

Top-level now imports the specific additional requirements for each
python file. Using `pip install -r requirements.txt` will fail if
versions become mismatched in the per-file requirements.

* Make per-python-script requirements work alone

This doesn't break the main requirements.txt.

* Add comment

* Add convert-persimmon-to-gguf.py to new requirements.txt scheme

* Add check-requirements.sh script and GitHub workflow

* Remove shellcheck installation step from workflow

* Add nocleanup special arg

* Fix merge

see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573

* reset to upstream/master

* Redo changes for cancelling model load

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-12-22 08:19:36 +02:00
Marcus Dunn
31f27758fa
llama : allow getting n_batch from llama_context in c api (#4540)
* allowed getting n_batch from llama_context in c api

* changed to use `uint32_t` instead of `int`

* changed to use `uint32_t` instead of `int` in `llama_n_ctx`

* Update llama.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21 21:57:48 +02:00
Concedo
e8cf7f6ed3 Merge remote-tracking branch 'origin/master' into concedo_experimental 2023-12-17 14:37:14 +08:00
slaren
c6c4fc081c
lora : add support for non-llama models (#3333)
* lora : add support for non-llama models

ggml-ci

* avoid leaking ggml_context on failure
cleanup

ggml-ci

* lora : allow 1d tensors

* lora : include embd and output layers in size calculation

* fix style
2023-12-16 18:58:46 +01:00
Concedo
c2c238b4f3 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-grad0.cpp
#	tests/test-quantize-perf.cpp
2023-12-13 14:49:03 +08:00
crasm
6391817cd1
llama : document logits_all deprecation (#4418)
llama_context_params.logits_all is a parameter for controlling
llama_eval. This documents that logits_all should not be used with
llama_decode and llama_batch.
2023-12-12 11:25:57 +02:00
Concedo
ec21fa7712 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	ggml-cuda.cu
#	llama.cpp
#	llama.h
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
2023-12-08 17:42:26 +08:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model (#4092)
* feat: Allow overriding GGUF metadata when loading model

* Fix the one time GCC is stricter than clang about something

* Step1

* Refactor... basically everything!

* Nuke obsolete GetArrayLen struct

* simplify std::string specialization

* Various cleanups

Add informational output when overrides are applied

Warn user when an override with the wrong type is specified

* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes

* llama : rearrange model params

* Update new GET_KEY call

Add note that metadata KV overrides aren't reflected in initial metadata KV info dump

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05 19:19:18 +02:00
Concedo
8acd7be734 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-11-27 14:06:14 +08:00
crasm
3014b5415d
Update docs for yarn_ext_factor <0.0 as unspecified instead of NaN (#4189) 2023-11-25 10:47:07 -05:00
Georgi Gerganov
6b0a7420d0
llama : KV cache view API + better KV cache management (#4170)
* llama : keep track of used KV cells + better KV cache management

* llama : zero KV cache used upon clear

ggml-ci

* llama : allow exporting a view of the KV cache (#4180)

* Allow exporting a view of the KV cache

* Allow dumping the sequences per cell in common

* Track max contiguous cells value and position as well

* Fix max contiguous empty cells index calculation

Make dump functions deal with lengths or sequences counts > 10 better

* Fix off by one error in dump_kv_cache_view

* Add doc comments for KV cache view functions

Eliminate cell sequence struct; use llama_seq_id directly

Minor cleanups

* common : add -dkvc arg for enabling kv cache dumps

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-23 19:07:56 +02:00
Concedo
6bf8ee4aea Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml-cuda.cu
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
slaren
e85bb1a8e7
llama : add functions to get the model's metadata (#4013)
* llama : add functions to get the model's metadata

* format -> std::to_string

* better documentation
2023-11-17 17:17:37 +02:00
Kerfuffle
91f6499393
Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00
Concedo
d7729ac3eb Merge branch 'master' into concedo_experimental 2023-11-03 16:00:05 +08:00
Georgi Gerganov
05816027d6
common : YAYF (yet another YARN fix) (#3925)
ggml-ci
2023-11-03 09:24:00 +02:00
Concedo
bc4ff72317 not working merge 2023-11-02 17:52:40 +08:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Concedo
9342636408 Merge branch 'master' into concedo_experimental
# Conflicts:
#	flake.lock
#	flake.nix
2023-11-01 18:24:36 +08:00