Commit graph

227 commits

Author SHA1 Message Date
Concedo
02f92f6ecc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	examples/llama.android/llama/src/main/cpp/CMakeLists.txt
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	grammars/README.md
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	tests/test-chat-template.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-06-30 10:59:42 +08:00
Sigbjørn Skjæret
38373cfbab
Add SPM infill support (#8016)
* add --spm-infill option

* support --spm-infill

* support --spm-infill
2024-06-28 12:53:43 +02:00
Xuan Son Nguyen
16791b8f0b
Add chatml fallback for cpp llama_chat_apply_template (#8160)
* add chatml fallback for cpp `llama_chat_apply_template`

* remove redundant code
2024-06-27 18:14:19 +02:00
jukofyork
97877eb10b
Control vector loading fixes (#8137)
* Fixed leak in llama_control_vector_load_one() and allow llama_control_vector_load() to grow

* refactored `llama_control_vector_load_one()`

* allow multiple directions for same layer in same file

* llama_control_vector_load_one() and llama_control_vector_load() now break on error

* removed unnecessary ggml_free() call
2024-06-27 16:48:07 +02:00
Concedo
f3dfa96dbc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	.github/workflows/docker.yml
#	README.md
#	llama.cpp
#	tests/test-chat-template.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-llama-grammar.cpp
2024-06-26 18:59:10 +08:00
Xuan Son Nguyen
49c03c79cd
cvector: better prompt handling, add "mean vector" method (#8069)
* remove completions file

* fix inverted vector

* add mean method

* code style

* remove inverted pca hotfix
2024-06-25 13:59:54 +02:00
Xuan Son Nguyen
48e6b92cc3
Add chat template support for llama-cli (#8068)
* add chat template support for llama-cli

* add help message

* server: simplify format_chat

* more consistent naming

* improve

* add llama_chat_format_example

* fix server

* code style

* code style

* Update examples/main/main.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
Concedo
151ff95a67 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	ggml-cuda.cu
#	ggml-cuda/common.cuh
2024-06-25 19:25:14 +08:00
Concedo
12dfb92436 Merge commit 'd62e4aaa02' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	common/common.cpp
#	ggml.c
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
2024-06-25 18:27:12 +08:00
HatsuneMikuUwU33
f702a90e24
Update control vector help (#8104) 2024-06-25 10:44:48 +02:00
Yann Follet
646ef4a9cf
embedding : more cli arguments (#7458)
* add parameters for embeddings
--embd-normalize
--embd-output-format
--embd-separator
description in the README.md

* Update README.md

fix tipo

* Trailing whitespace

* fix json generation, use " not '

* fix merge master

* fix code formating
group of parameters // embedding
print usage for embedding parameters

---------

Co-authored-by: Brian <mofosyne@gmail.com>
2024-06-24 08:30:24 +03:00
Xuan Son Nguyen
3e58b0ee35
cvector: fix CI + correct help message (#8064)
* cvector: fix CI + correct help message

* also correct --pca-iter
2024-06-22 18:11:30 +02:00
Concedo
92afdfcae4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/server.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	llama.cpp
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	scripts/sync-ggml.last
#	tests/test-tokenizer-random.py
2024-06-22 01:33:44 +08:00
Douglas Hanley
80ea089d77
llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
2024-06-21 08:38:22 +03:00
Johannes Gäßler
abd894ad96
common: fix warning (#8036)
* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-20 16:40:13 +02:00
Concedo
967c1d8df5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	flake.lock
#	tests/test-backend-ops.cpp
2024-06-17 15:14:47 +08:00
Concedo
ba9ef4d01b fix to allow clblast to work even after blas backend splitoff 2024-06-17 15:02:55 +08:00
Xuan Son Nguyen
0c7b3595b9
Add cvector-generator example (#7514)
* add control-vector-generator

* calc diff

* add comments

* proof-of-concept stdlib implementation

Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.

* param parsing, refactor, comments

Added basic command-line parameters for outfile and one each positive/negative prompt.

Refactored some messy code in PCA computation and GGUF exporting.

Left a bunch of comments regarding further work needed.

* example template completions

Implements an example template set built from the positive/negative prompts like the control vector Python implementation.

* add multi prompts, multi-thread for PCA

* fix mem error

* add debugs

* fix matrix transpose multiplication

you have got to be kidding me

* preliminary template/multiprompt support

model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish

* fix zero output & param parsing, functional templating

fixed a bug where the output file had no tensor data/was all zero

fixed a bug where single hyphen flags were not being correctly parsed

implements creation of templated prompts from input (still need to adapt based on model)

* fix square_diff matmul index range and CRLF->LF line endings

fixed a logic error where square_diff would not multiply all rows

fixed a formatting error where the provided completions.txt had CRLF line endings

* add command-line args for num threads, num completions file lines, always reload model

refactored a few things and did what the commit message says on the tin

* code aestheticization

* fix compiler warnings

* in-series multithreading for prompt embedding?

added commented-out code to attempt to start implementing mutlithreading for embedding in main

* remove unnecessary multithreading

* interim fix memory leak

* translated everything but PCA (I think)

* tentatively translate the rest

* fix ggml errors and make new ones

at least it compiles and runs

* fix cb_eval

* temporary commit while I move dev environments

it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent

* update debug statements

* pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped

* update comments

* (wip) refactor

* clean up PCA ggml implementation

* fix shape of v_diff_original

* add n_batch for pca

* working version

* remember to copy back the last_eigenvector

* fix n_completions

* bring back n_completions

* default n_pca_batch to 20

* fix macos build

* add to makefile all targets

* use ggml_format_name

* add readme

* fix .editorconfig

* use ggml_backend_tensor_copy

* attemp to fix compile problem on mac

* fix compile warn

* reuse allocr

* move param parser to common

* better error handling

* clean up a bit

* add print_usage

* shorten help msg

* beautify help msg

* escape prompt by default

* change compile target to llama-cvector-generator

* typo

* disable GPU for PCA

* code style

---------

Co-authored-by: Christian Zhou-Zheng <christianzhouzheng@gmail.com>
2024-06-15 18:53:40 +02:00
Concedo
562d980140 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-cuda.Dockerfile
#	.devops/main-rocm.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/main.Dockerfile
#	.devops/server-cuda.Dockerfile
#	.devops/server.Dockerfile
#	README.md
#	common/CMakeLists.txt
#	grammars/README.md
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-06-09 17:30:05 +08:00
Olivier Chafik
d4d915d351
url: save -mu downloads to new cache location (#7826)
* url: save -mu download to new cache location

* url: fs_get_cache_file_path util

* url: tweak sig of fs_get_cache_file
2024-06-08 21:21:08 +02:00
sasha0552
7a16ce7db2
server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
Georgi Gerganov
ee459f40f6
server : fix --threads-http arg (#7801) 2024-06-06 19:19:59 +03:00
Georgi Gerganov
f83351f9a6
imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
Concedo
6659742a2d do not merge the removal of opencl 2024-06-05 10:57:52 +08:00
Georgi Gerganov
1442677f92
common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remote --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Georgi Gerganov
554c247caf
ggml : remove OpenCL (#7735)
ggml-ci
2024-06-04 21:23:20 +03:00
Concedo
94753ad103 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
2024-06-03 18:33:23 +08:00
0cc4m
3d7ebf6312
Vulkan Mixture of Experts (MoE) support (#7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
2024-06-03 10:59:14 +02:00
Concedo
4ed9ba7352 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.lock
#	tests/test-backend-ops.cpp
2024-05-28 21:57:19 +08:00
Brian
d298382ad9
main: replace --no-special with --special (#7534)
This also flips the default behavior of the output to not include control token by default.
2024-05-27 00:10:17 +10:00
Justine Tunney
00c6390793
main : don't print special tokens with --grammar (#6923)
* main : don't print special tokens with --grammar

The CLI interface was recently changed to print special control tokens
like the </s> stop message one. This token shouldn't be printed if the
grammar flag was passed, unless the grammar specifies it, because that
breaks shell-scriptability.

* main: use seperate stream for control characters

* main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out

* main: dprintf isn't part of the IEEE POSIX standard. Just use write().

* main: remove --ctrl-token-fd-out in favor for fcntl() based detection

* common.cpp: accidentally removed --interactive-first

* main: only merge stdout and control token if not in conversation or grammar mode

* main: rejig control token descriptor handling

* main: must check pipe status on very top of program

* main: renamed --no-special from  --ctrl-token-no-out and other refactoring

* main: refactor ctrl_token_no_out --> no_special

* llama: rename llama_token_is_control_token() to llama_token_is_control()

* main: remove special token file descriptor feature (#5)

---------

Co-authored-by: Brian <mofosyne@gmail.com>
2024-05-25 19:04:03 +10:00
Masaya, Kato
faa0e6979a
ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433)
* Add SVE support for q4_0_q8_0 q8_0_q8_0

* remove ifdef
2024-05-25 11:42:31 +03:00
Xuan Son Nguyen
902184dd3a
fix missing slash in fs_get_cache_directory() (#7503)
* fix missing slash in fs_get_cache_directory()

* use LOCALAPPDATA for fs_get_cache_directory()

* better code style
2024-05-25 13:30:59 +10:00
Concedo
9282c307ed this commit does not work, just for debugging 2024-05-23 20:13:47 +08:00
Georgi Gerganov
6ff13987ad
common : normalize naming style (#7462)
* common : normalize naming style

ggml-ci

* common : match declaration / definition order

* zig : try to fix build
2024-05-22 20:04:20 +03:00
Concedo
4a3c8c190b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	tests/test-backend-ops.cpp
2024-05-22 15:04:31 +08:00
Amir
11474e756d
examples: cache hf model when --model not provided (#7353)
* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
Concedo
47cbfd6150 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	llama.cpp
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/test-backend-ops.cpp
2024-05-17 22:30:41 +08:00
Herman Semenov
359cbe3f46
ggml-quants, llama : removed excess checks (#7274) 2024-05-17 10:08:49 +03:00
Concedo
2ee808a747 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	ci/run.sh
#	llama.cpp
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	requirements.txt
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-1-bpe.cpp
2024-05-14 19:28:47 +08:00
Radoslav Gerganov
5e31828d3e
ggml : add RPC backend (#6829)
* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
2024-05-14 14:27:19 +03:00
Justine Tunney
4e3880978f
Fix memory bug in grammar parser (#7194)
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug

    ./llamafile -m foo.gguf -p bar --grammar 'root::="'

Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
2024-05-10 21:01:08 +10:00
HanishKVC
f89fe2732c
Main+: optionally allow special tokens from user in interactive mode (#7097)
@hanishkvc added a new `--interactive-specials` flag which would allow for inserting special tokens from user side into the embedding stream.
2024-05-10 20:21:58 +10:00
Concedo
d084f78faa Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	common/common.cpp
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert.txt
#	tests/CMakeLists.txt
#	tests/test-json-schema-to-grammar.cpp
2024-05-09 15:13:34 +08:00
Johannes Gäßler
c12452c7ae
JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143) 2024-05-08 21:53:08 +02:00
Dawid Potocki
83330d8cd6
main : add --conversation / -cnv flag (#7108) 2024-05-08 17:32:32 +03:00
Concedo
6c000cbe7a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.flake8
#	.github/workflows/bench.yml
#	.github/workflows/python-lint.yml
#	.pre-commit-config.yaml
#	Makefile
#	README.md
#	models/ggml-vocab-bert-bge.gguf.inp
#	models/ggml-vocab-bert-bge.gguf.out
#	models/ggml-vocab-deepseek-coder.gguf.inp
#	models/ggml-vocab-deepseek-coder.gguf.out
#	models/ggml-vocab-deepseek-llm.gguf.inp
#	models/ggml-vocab-deepseek-llm.gguf.out
#	models/ggml-vocab-falcon.gguf.inp
#	models/ggml-vocab-falcon.gguf.out
#	models/ggml-vocab-gpt-2.gguf.inp
#	models/ggml-vocab-gpt-2.gguf.out
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	models/ggml-vocab-llama-spm.gguf.inp
#	models/ggml-vocab-llama-spm.gguf.out
#	models/ggml-vocab-mpt.gguf.inp
#	models/ggml-vocab-mpt.gguf.out
#	models/ggml-vocab-phi-3.gguf
#	models/ggml-vocab-phi-3.gguf.inp
#	models/ggml-vocab-phi-3.gguf.out
#	models/ggml-vocab-refact.gguf
#	models/ggml-vocab-starcoder.gguf.inp
#	models/ggml-vocab-starcoder.gguf.out
#	requirements/requirements-convert.txt
#	scripts/compare-llama-bench.py
#	scripts/run-with-preset.py
#	scripts/verify-checksum-models.py
#	tests/CMakeLists.txt
#	tests/test-tokenizer-0.cpp
2024-05-06 18:09:45 +08:00
viric
fcd84a0f5a
Fix Linux /sys cpu path to guess number of cores (#7064) 2024-05-04 15:26:53 +02:00
Concedo
17a24d753c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/main-intel.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-vulkan.Dockerfile
#	.github/workflows/bench.yml
#	.github/workflows/build.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/server.yml
#	.gitignore
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	flake.lock
#	llama.cpp
#	models/ggml-vocab-falcon.gguf
#	models/ggml-vocab-llama-spm.gguf
#	models/ggml-vocab-mpt.gguf
#	models/ggml-vocab-stablelm.gguf
#	models/ggml-vocab-starcoder.gguf
#	requirements.txt
#	scripts/check-requirements.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-0-bpe.py
#	tests/test-tokenizer-0-spm.py
#	tests/test-tokenizer-1-spm.cpp
2024-04-30 21:04:17 +08:00
Georgi Gerganov
9c67c2773d
ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API

* ggml : fix GQA support in ggml_flash_attn_ext

* ggml : online attention (CPU)

* metal : initial implementation

* metal : f16 precision

* metal : reduce branches

* metal : specialize for head size

* wip : 8 rows per simd group

* wip : 4 rows per simd group

* wip : template for rows per warp

* metal : parallelize across KV size

* metal : parallel reduce across heads

* metal : efficient flash_attn_f16 implementation

* metal : avoid redundant loads of the attention

* metal : scale and mask in matrix form

* metal : fix comment

* llama : avoid ggml_cast, use F32 query

* metal : add parallel reduce version (disabled)

* metal : move output into local memory + optimize

- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments

* metal : add tests, fix scaling, support C > 32

* metal : improve precision

* ggml : fix f16 mad

* metal : minor

* metal : support Q > 8

* tests : add ATTN tests

* metal : disable buffer allocation logs

* tests : more

* metal : faster inner loop for C == 32

* metal : fix array initialization

* tests : ifdef

* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext

* ggml : fix ggml_soft_max mask requirement

* cuda : fix soft_max to use correct mask size

* cuda : add flash_attn kernel (wip)

* metal : optimize softmax for C > 32

* metal : optimize softmax

* tests : minor fix

* cuda : avoid zeroing fragments

* tests : update dims

* cuda : fix __hisinf() result check

* cuda : avoid warp_reduce for smax

* cuda : use int instead of int64_t

Noticeably improves performance (thanks to Johannes)

* cuda : make loops use the same loop values

Thanks Johannes again for the tip

* cuda : unroll some of the loops

* cuda : avoid __hisinf branches

* cuda : use half2 in softmax

* cuda : switch to 1 warp for bs > 16

* cuda : speed-up reduce part of the kernel

* cuda : unroll Q*K^T loop

* cuda : fix -INF block check

* cuda : simplify softmax

* cuda : fix matrix names

* cuda : minor

* llama : adapt to F16 KQ_pos

* llama : adapt new models to F16 KQ_mask

* ggml : fix F16 store (ARM NEON)

* llama : fix type of KQ_mask and KQ_pos

* ggml : fix CPU soft_max

* tests : add hs=256

* cuda : fix build

* metal : improve perf via smaller int registers

* cuda : adapt soft_max to F16 mask and pos

* CUDA: faster FlashAttention, kernel for bs == 1

* 16 cols for Phi-2

* no vec for hs, no hs==256 ncols==32 for Volta

* adjust kernel selection logic

* 4 warps, 256 stride for all D

* no ncols == 64

* Multiple parallel blocks for batch size 1

* fix compile warnings

* fix excessive KQ_b loads

* fix cmake build

* fix KV cache padding, NaN from INFINITY (#6438)

* llama : flash_attn cparam + fix defrag

* server: support flash_attn param

* server: bench: enable flash_attn param

* CUDA: refactor host code, dyn. par. blocks

* fix flash_attn_vec_f16 race condition

* flush softmax exp below threshold to 0

* store temp KQ in registers

* Calculate KQ as FP32 if KQV has GGML_PREC_F32

* Add __hgt2_mask implementation for CUDA 11

* fix KQ FP32 precision fpr parallel_blocks > 1

* llama-bench : add -fa,--flash-attn arg

* metal : add BS=1 kernel for flash attention (#6508)

* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel

* metal : use F32 attention accumulators

* batched-bench : add fattn arg

* llama : simplify llama_build_kv_store

ggml-ci

* llama : adapt build_olmo to changes

* ggml : fix arm fp16 store on windows

* metal : clean-up

* metal : clean-up kernel code

* metal : minor

* tests : remove benchmarks

ggml-ci

* ggml : fix avx512 const correctness

ggml-ci

* ggml : fix soft_max with bias on CPU

ggml-ci

* common : print --flash-attn in help

* ggml : fix num dimensions in ggml_flash_attn_ext

* llama : force disable flash attention for incompatible models

* ggml : ggml_soft_max support F16/F32 mask/pos

ggml-ci

* cuda : uint -> uint32_t

* cuda : "constexpr dim3" -> "const dim3"

ggml-ci

* cuda : try to fix __hgt2_mask

ggml-ci

* ggml : add TODO's for F16/F32 mask/pos support in other backends

* llama : replace bool need_kq_pos with use_alibi

* llama : prep ALiBi support for BERT models

ggml-ci

* llama : fix n_batch requirements

ggml-ci

* cont

* server : add help for --flash-attn arg

* llama : disable FA for AMD

* tests : remove TMP_ATTN_BENCH

ggml-ci

* llama : support save/load state with FA enabled

ggml-ci

* ci : add CUDA save-load-state tests

ggml-ci

* llama : llama_kv_cache_clear zeroes data + fix save-load seq

ggml-ci

* llama : fix copy-paste errors, add TODO

* llama : disallow incompatible states

* llama : update llama_state_get_size after v_trans field

* metal : remove tmp log

* llama : add static reminder for llama_state_get_size

* metal : fix max nsg

ggml-ci

* ci : fix arg order

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 12:16:08 +03:00