Commit graph

108 commits

Concedo
76a3ba42eb Merge branch 'master' into concedo_experimental
# Conflicts:
#	ggml.c
#	ggml.h
#	requirements.txt
#	tests/test-quantize-perf.cpp
2023-12-16 22:58:53 +08:00
slaren
cafcd4f895
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
2023-12-14 16:52:08 +01:00
Concedo
c88fc19d59 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
2023-12-14 16:32:42 +08:00
Siwen Yu
9fb13f9584
common : add --version option to show build info in CLI (#4433) 2023-12-13 14:50:14 +02:00
Concedo
c2c238b4f3 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-grad0.cpp
#	tests/test-quantize-perf.cpp
2023-12-13 14:49:03 +08:00
Richard Kiss
9494d7c477
english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
Concedo
ec21fa7712 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	ggml-cuda.cu
#	llama.cpp
#	llama.h
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
2023-12-08 17:42:26 +08:00
Georgi Gerganov
bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
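Editorial note: the entry above routes the KV cache data type and offload behaviour through the public context parameters. Below is a minimal sketch of how a caller might request a quantized K cache while keeping KV offload enabled, assuming the `type_k`/`type_v` and `offload_kqv` fields introduced around this change (field names recalled from the llama.h of this period; the model path is a placeholder).

```cpp
// Sketch only: assumes the llama.cpp API of late 2023, including the
// type_k / type_v / offload_kqv context parameters added by this change.
#include "llama.h"

int main() {
    llama_backend_init(false); // numa = false

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.type_k      = GGML_TYPE_Q8_0; // quantized K cache, per the quantum K cache work
    cparams.type_v      = GGML_TYPE_F16;  // V cache left at f16
    cparams.offload_kqv = true;           // keep KV-related layers on the device

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    // ... run inference ...
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```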
Georgi Gerganov
caa9249217
common : fix compile warning 2023-12-06 10:41:03 +02:00
Kerfuffle
5aa365d88f
llama : allow overriding GGUF metadata when loading model (#4092)
* feat: Allow overriding GGUF metadata when loading model

* Fix the one time GCC is stricter than clang about something

* Step1

* Refactor... basically everything!

* Nuke obsolete GetArrayLen struct

* simplify std::string specialization

* Various cleanups

Add informational output when overrides are applied

Warn user when an override with the wrong type is specified

* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes

* llama : rearrange model params

* Update new GET_KEY call

Add note that metadata KV overrides aren't reflected in initial metadata KV info dump

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-05 19:19:18 +02:00
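Editorial note: the override feature above is driven by a command-line argument of the form KEY=TYPE:VALUE (e.g. tokenizer.ggml.add_bos_token=bool:false). The following self-contained sketch parses that format; it is illustrative only and is not the PR's code, which lives in common/common.cpp.

```cpp
// Simplified sketch of parsing a KEY=TYPE:VALUE metadata override string.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

struct kv_override {
    std::string key;
    enum class type { INT, FLOAT, BOOL } tag;
    int64_t i_val = 0;
    double  f_val = 0.0;
    bool    b_val = false;
};

static std::optional<kv_override> parse_kv_override(const std::string & arg) {
    const size_t eq = arg.find('=');
    if (eq == std::string::npos) return std::nullopt;
    const size_t colon = arg.find(':', eq + 1);
    if (colon == std::string::npos) return std::nullopt;

    kv_override ov;
    ov.key = arg.substr(0, eq);
    const std::string type  = arg.substr(eq + 1, colon - eq - 1);
    const std::string value = arg.substr(colon + 1);

    if (type == "int") {
        ov.tag = kv_override::type::INT;
        ov.i_val = std::stoll(value); // throws on malformed input; fine for a sketch
    } else if (type == "float") {
        ov.tag = kv_override::type::FLOAT;
        ov.f_val = std::stod(value);
    } else if (type == "bool") {
        ov.tag = kv_override::type::BOOL;
        if      (value == "true")  ov.b_val = true;
        else if (value == "false") ov.b_val = false;
        else return std::nullopt; // reject anything else, mirroring the bool-parsing fix above
    } else {
        return std::nullopt;      // unknown type: better to warn than silently ignore
    }
    return ov;
}

int main() {
    const auto ov = parse_kv_override("tokenizer.ggml.add_bos_token=bool:false");
    std::cout << (ov ? "parsed key: " + ov->key : std::string("parse error")) << "\n";
}
```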
MaggotHATE
52c8bc3cf3
sampling : custom samplers order (#4285)
* Samplers sequence order w parameter

* Cleaned commented code

* Fixed formatting

* Rewrote with unordered_map

* Revert and rewrite, too many problems and safeguards would be needed

* Fixed code style

* Code style fixes according to review

* More readable samplers input string, fixed help

* Style fix in sampler_queue

* Formatting fixes

* Fixing whitespaces
2023-12-05 12:05:51 +02:00
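Editorial note: the sequence-order parameter above lets the user choose which samplers run and in what order. Here is a self-contained sketch of that dispatch idea; the one-letter codes and the default order shown are assumptions for the example, not necessarily the PR's exact mapping.

```cpp
// Illustrative sketch: apply samplers in a user-specified order.
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct candidates { /* token ids, logits, probabilities ... */ };

void top_k    (candidates &) { std::cout << "top_k\n";       }
void tail_free(candidates &) { std::cout << "tail_free\n";   }
void typical_p(candidates &) { std::cout << "typical_p\n";   }
void top_p    (candidates &) { std::cout << "top_p\n";       }
void min_p    (candidates &) { std::cout << "min_p\n";       }
void temp     (candidates &) { std::cout << "temperature\n"; }

void sampler_queue(const std::string & order, candidates & cands) {
    static const std::map<char, std::function<void(candidates &)>> samplers = {
        {'k', top_k}, {'f', tail_free}, {'y', typical_p},
        {'p', top_p}, {'m', min_p},     {'t', temp},
    };
    for (const char c : order) {
        const auto it = samplers.find(c);
        if (it != samplers.end()) it->second(cands); // unknown letters are ignored
    }
}

int main() {
    candidates cands;
    sampler_queue("kfypmt", cands); // assumed default-style order
}
```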
Ikko Eltociear Ashimine
4fa44e84ad
grammar-parser : fix typo (#4318)
preceeding -> preceding
2023-12-04 09:57:35 +02:00
Concedo
4f40c226a0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
2023-12-01 23:46:59 +08:00
Jared Van Bortel
15f5d96037
build : fix build info generation and cleanup Makefile (#3920)
* cmake : fix joining of REAL_GIT_DIR

* fix includes with help from include-what-you-use

* make : remove unneeded deps and add test-rope target

* fix C includes in C++ source files

* Revert "fix includes with help from include-what-you-use"

This reverts commit 635e9fadfd516d4604a0fecf4a854bfb25ad17ae.
2023-12-01 00:23:08 +02:00
Concedo
581021ab93 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	scripts/build-info.cmake
2023-11-28 20:57:56 +08:00
bandoti
b38a16dfcf
cmake : fix issue with version info not getting baked into LlamaConfig.cmake (#3970)
* Split CPP generation from build-info query

* Remove blank lines

* Add BUILD_SHARED_LIBS option
2023-11-27 21:25:42 +02:00
Concedo
8acd7be734 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
2023-11-27 14:06:14 +08:00
Georgi Gerganov
6b0a7420d0
llama : KV cache view API + better KV cache management (#4170)
* llama : keep track of used KV cells + better KV cache management

* llama : zero KV cache used upon clear

ggml-ci

* llama : allow exporting a view of the KV cache (#4180)

* Allow exporting a view of the KV cache

* Allow dumping the sequences per cell in common

* Track max contiguous cells value and position as well

* Fix max contiguous empty cells index calculation

Make dump functions deal with lengths or sequences counts > 10 better

* Fix off by one error in dump_kv_cache_view

* Add doc comments for KV cache view functions

Eliminate cell sequence struct; use llama_seq_id directly

Minor cleanups

* common : add -dkvc arg for enabling kv cache dumps

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-23 19:07:56 +02:00
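Editorial note: the view added above exposes per-cell occupancy so tools can see how full and how fragmented the cache is. The sketch below shows the kind of bookkeeping such a dump relies on (used-cell count and the largest contiguous run of free cells); it is not the PR's code.

```cpp
// Sketch: count used KV cells and find the largest contiguous run of free cells.
#include <cstdio>
#include <vector>

struct kv_cell  { int seq_count = 0; };  // 0 == free
struct kv_stats { int used; int max_contiguous_free; int max_contiguous_idx; };

kv_stats scan_cells(const std::vector<kv_cell> & cells) {
    kv_stats st{0, 0, -1};
    int run = 0, run_start = -1;
    for (int i = 0; i < (int) cells.size(); ++i) {
        if (cells[i].seq_count == 0) {
            if (run == 0) run_start = i;
            ++run;
            if (run > st.max_contiguous_free) {
                st.max_contiguous_free = run;
                st.max_contiguous_idx  = run_start;
            }
        } else {
            ++st.used;
            run = 0;
        }
    }
    return st;
}

int main() {
    std::vector<kv_cell> cells(8);
    cells[0].seq_count = 1; cells[1].seq_count = 2; cells[5].seq_count = 1;
    const kv_stats st = scan_cells(cells);
    std::printf("used=%d, max contiguous free=%d at %d\n",
                st.used, st.max_contiguous_free, st.max_contiguous_idx);
    // prints: used=3, max contiguous free=3 at 2
}
```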
Concedo
56a5fa7a60 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-20 22:37:06 +08:00
Seb C
881800d1f0
main : Add ChatML functionality to main example (#4046)
Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com>
2023-11-20 14:56:59 +01:00
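Editorial note: ChatML wraps each turn in <|im_start|>ROLE ... <|im_end|> markers. A minimal sketch of assembling such a prompt follows; how the main example applies and tokenizes these markers may differ in detail.

```cpp
// Minimal sketch: assemble a ChatML-formatted prompt from (role, content) turns.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::string build_chatml(const std::vector<std::pair<std::string, std::string>> & turns) {
    std::string prompt;
    for (const auto & [role, content] : turns) {
        prompt += "<|im_start|>" + role + "\n" + content + "<|im_end|>\n";
    }
    prompt += "<|im_start|>assistant\n"; // leave the assistant turn open for generation
    return prompt;
}

int main() {
    std::cout << build_chatml({
        {"system", "You are a helpful assistant."},
        {"user",   "Write a haiku about llamas."},
    });
}
```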
kchro3
262005ad9d
common : comma should be semicolon (#4137) 2023-11-19 18:52:57 +02:00
Concedo
6bf8ee4aea Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml-cuda.cu
#	tests/test-tokenizer-0-falcon.py
#	tests/test-tokenizer-0-llama.py
2023-11-18 11:10:45 +08:00
Jannis Schönleber
9e87ef60e1
common : improve yaml log escaping (#4080)
* logging: improve escaping in yaml output

* logging: include review feedback
2023-11-17 17:24:07 +02:00
Jiří Podivín
ba4cf5c0bf
train : move number of gpu layers argument parsing to common/train.cpp (#4074)
- introduces help entry for the argument
- cuts '--gpu-layers' form in order to simplify usage and documentation.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
2023-11-17 17:19:16 +02:00
Kerfuffle
91f6499393
Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00
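Editorial note: the fix above makes tokenization honor the tokenizer.ggml.add_bos_token metadata value instead of always prepending BOS. A tiny illustrative sketch of that decision (the real check lives in the tokenizer code):

```cpp
// Illustrative only: prepend BOS based on model metadata, not unconditionally.
#include <cstdint>
#include <string>
#include <vector>

using llama_token = int32_t;

std::vector<llama_token> tokenize(const std::string & text, bool metadata_add_bos,
                                  llama_token bos_id) {
    std::vector<llama_token> out;
    if (metadata_add_bos) {       // tokenizer.ggml.add_bos_token from the GGUF metadata
        out.push_back(bos_id);
    }
    // ... append the tokens for `text` here ...
    (void) text;
    return out;
}

int main() {
    const auto toks = tokenize("hello", /*metadata_add_bos=*/false, /*bos_id=*/1);
    return (int) toks.size(); // 0: no BOS was prepended
}
```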
Concedo
35a97e14b2 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	docs/token_generation_performance_tips.md
#	grammars/README.md
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-grad0.cpp
#	tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) (#3912)
* sync : ggml (backend v2) (wip)

* sync : migrate examples and llama.cpp to dynamic graphs (wip)

* sync : update tests + fix max op params to 64

ggml-ci

* sync : ggml-cuda

ggml-ci

* llama : fix save/load state context size

ggml-ci

* sync : try to fix build on tvOS

* sync : pass custom graph sizes in training examples

* sync : update graph copies to new ggml API

* sync : update sync-ggml.sh with new files

* scripts : fix header in sync script

* train : fix context size calculations

* llama : increase inference graph size up to 4096 nodes

* train : allocate grads for backward graphs

* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
Concedo
f277ed0e8c Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-11-07 15:23:08 +08:00
Damian Stewart
381efbf480
llava : expose as a shared library for downstream projects (#3613)
* wip llava python bindings compatibility

* add external llava API

* add base64 in-prompt image support

* wip refactor image loading

* refactor image load out of llava init

* cleanup

* further cleanup; move llava-cli into its own file and rename

* move base64.hpp into common/

* collapse clip and llava libraries

* move llava into its own subdir

* wip

* fix bug where base64 string was not removed from the prompt

* get libllava to output in the right place

* expose llava methods in libllama.dylib

* cleanup memory usage around clip_image_*

* cleanup and refactor *again*

* update headerdoc

* build with cmake, not tested (WIP)

* Editorconfig

* Editorconfig

* Build with make

* Build with make

* Fix cyclical deps on Windows

* attempt to fix build on Windows

* attempt to fix build on Windows

* Upd TODOs

* attempt to fix build on Windows+CUDA

* Revert changes in cmake

* Fix according to review comments

* Support building as a shared library

* address review comments

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-11-07 00:36:23 +03:00
Concedo
78ca0667a4 Merge branch 'master' into concedo_experimental 2023-11-06 16:58:58 +08:00
slaren
2833a6f63c
ggml-cuda : fix f16 mul mat (#3961)
* ggml-cuda : fix f16 mul mat

ggml-ci

* silence common.cpp warning (bonus)
2023-11-05 18:45:16 +01:00
Kerfuffle
d9ccce2e33
Allow common process_escapes to handle \x sequences (#3928)
* Allow common process_escapes to handle \x sequences

* Fix edge case when second hex digit is NUL
2023-11-05 10:06:06 -07:00
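Editorial note: the change above teaches the escape processor to understand \xNN hex sequences, with care taken when the second hex digit is missing. A self-contained sketch of that handling follows; it is not the actual common.cpp code.

```cpp
// Sketch of \x escape handling, including the edge case where the second
// hex digit is absent (e.g. the string ends right after "\xA").
#include <iostream>
#include <string>

static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

std::string process_hex_escapes(const std::string & in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 2 < in.size() && in[i + 1] == 'x') {
            const int hi = hex_val(in[i + 2]);
            if (hi >= 0) {
                int    value    = hi;
                size_t consumed = 2;                   // "x" + first digit
                if (i + 3 < in.size()) {
                    const int lo = hex_val(in[i + 3]); // second digit is optional
                    if (lo >= 0) { value = value * 16 + lo; consumed = 3; }
                }
                out.push_back(static_cast<char>(value));
                i += consumed;
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}

int main() {
    std::cout << process_hex_escapes("\\x41\\x42\\x43") << "\n"; // prints "ABC"
}
```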
Concedo
d7729ac3eb Merge branch 'master' into concedo_experimental 2023-11-03 16:00:05 +08:00
Georgi Gerganov
8f961abdc4
speculative : change default p_accept to 0.5 + CLI args (#3919)
ggml-ci
2023-11-03 09:41:56 +02:00
Georgi Gerganov
05816027d6
common : YAYF (yet another YARN fix) (#3925)
ggml-ci
2023-11-03 09:24:00 +02:00
Concedo
c07c9b857d Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-11-03 11:17:07 +08:00
Andrei
51b2fc11f7
cmake : fix relative path to git submodule index (#3915) 2023-11-02 21:40:31 +02:00
Concedo
bc4ff72317 not working merge 2023-11-02 17:52:40 +08:00
cebtenzzre
b12fa0d1c1
build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist

* cmake : simplify BUILD_INFO target

* cmake : add missing dependencies on BUILD_INFO

* build : link against build info instead of compiling against it

* zig : make build info a .cpp source instead of a header

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>

* cmake : revert change to CMP0115

---------

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00
Concedo
1ab18ecb53 Merge commit 'c43c2da8af' into concedo_experimental
# Conflicts:
#	llama.cpp
2023-11-02 11:17:59 +08:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Georgi Gerganov
ff8f9a88da
common : minor (#3715) 2023-11-01 21:15:55 +02:00
bandoti
0e40806c1c
common : allow caller to handle help/argument exceptions (#3715)
* Allow caller to handle help/argument exceptions

* Prepend newline to usage output

* Add new gpt_params_parse_ex function to hide arg-parse impl

* Fix issue blocking success case

* exit instead of returning false

* Update common/common.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-01 19:42:01 +02:00
staviq
a2758d08e4
log : make generating separate log files optional (#3787)
* impl --log-new, --log-append

* Update common/log.h

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

* Update common/log.h

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

* Apply suggestions from code review

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 16:18:27 +02:00
l3utterfly
e75dfdd31b
sampling : null grammar field after reset (#3885) 2023-11-01 15:40:43 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
* Add '-ngl' support to finetune.cpp

* Add fprintf in ggml_cuda_op_add

When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora

* Add 'finetune.sh', which currently fails when using GPU

"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"

* tweak finetune.sh

* Suppress some warnings in ggml.c

* Add f16 implementation to ggml_compute_forward_add_f16_f32

* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs

* finetune.sh: Edit comments

* Add "add_f16_f32_f32_cuda"

* Tweak an error message

* finetune.sh: Add an optional LLAMA_MODEL_DIR variable

* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable

* train : minor

* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Concedo
ae2cd56de8 kobold integration of min_p sampler (+1 squashed commits)
Squashed commits:

[8ad2e349] kobold integration for min_p sampler
2023-11-01 19:08:45 +08:00
Concedo
9342636408 Merge branch 'master' into concedo_experimental
# Conflicts:
#	flake.lock
#	flake.nix
2023-11-01 18:24:36 +08:00
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841)
* Introduce the new Min-P sampler by @kalomaze
   The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
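Editorial note: as the commit message describes, Min-P keeps only tokens whose probability is at least p times the probability of the most likely token. A self-contained sketch of that filter (illustrative, not the llama.cpp implementation):

```cpp
// Sketch of the Min-P rule: drop tokens whose probability falls below
// p * (probability of the most likely token).
#include <algorithm>
#include <cstdio>
#include <vector>

struct token_prob { int id; float p; };

std::vector<token_prob> min_p_filter(const std::vector<token_prob> & cands, float min_p) {
    if (cands.empty()) return cands;
    const float max_p = std::max_element(cands.begin(), cands.end(),
        [](const token_prob & a, const token_prob & b) { return a.p < b.p; })->p;
    const float threshold = min_p * max_p;

    std::vector<token_prob> kept;
    for (const auto & c : cands) {
        if (c.p >= threshold) kept.push_back(c);
    }
    return kept; // renormalize before sampling from the survivors
}

int main() {
    const std::vector<token_prob> cands = {{0, 0.50f}, {1, 0.30f}, {2, 0.15f}, {3, 0.05f}};
    const auto kept = min_p_filter(cands, 0.20f); // the commit sets the default p to 0.05
    for (const auto & c : kept) std::printf("token %d p=%.2f\n", c.id, c.p);
    // threshold = 0.20 * 0.50 = 0.10 -> tokens 0, 1, 2 survive; token 3 is dropped
}
```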
Concedo
cc5b282350 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	build.zig
#	flake.lock
#	flake.nix
#	ggml.c
2023-10-31 20:44:04 +08:00