* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
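As a rough illustration of what a fused "extended" softmax can look like, here is a minimal single-row CPU sketch. It assumes the op computes softmax(scale * x + mask); the function name `soft_max_ext_row` and its signature are hypothetical and not taken from ggml.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: y = softmax(scale * x + mask) over one row.
// An empty `mask` is treated as all zeros. Assumes at least one entry
// stays finite after masking.
std::vector<float> soft_max_ext_row(const std::vector<float> & x,
                                    const std::vector<float> & mask,
                                    float scale) {
    assert(mask.empty() || mask.size() == x.size());
    std::vector<float> y(x.size());
    if (x.empty()) {
        return y;
    }
    // Apply scale and mask, tracking the row maximum for numerical stability.
    float max_val = -INFINITY;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale + (mask.empty() ? 0.0f : mask[i]);
        max_val = std::max(max_val, y[i]);
    }
    // Exponentiate relative to the maximum, then normalize.
    float sum = 0.0f;
    for (float & v : y) {
        v = std::exp(v - max_val);
        sum += v;
    }
    for (float & v : y) {
        v /= sum;
    }
    return y;
}
```

Folding the scale and mask into the same pass avoids materializing the intermediate scaled and masked tensors.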
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32
ggml-ci
* llama : revert n_threads_batch logic
ggml-ci
* Remove logically superfluous assertions and order by dimension
* Use cblas_sgemm() to implement ggml_compute_forward_out_prod()
* Remove ggml_compute_forward_out_prod_use_blas(), fix compiling errors on cmake/zig, remove trailing whitespace
* Add openBLAS support for sgemm() in compute_forward_out_prod()
* fix backward process of rope
rope backward process was broken after YaRN RoPE (#2268) implementation, due to missing changes in backward functions.
the code for the backward process is nearly identical to the forward process:
the only difference is the sign of the sin-values.
to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:
for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.
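The idea of sharing one code path can be sketched on a single rotated pair. This is a simplified illustration only: `rope_rotate_pair` is a hypothetical helper, not the actual `ggml_compute_forward_rope_f32` code, and it ignores YaRN scaling and the other rope parameters.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: rotate one (x0, x1) pair by angle theta.
// With forward == false the sine is negated, which applies the inverse
// rotation, i.e. what the backward pass needs.
void rope_rotate_pair(float & x0, float & x1, float theta, bool forward) {
    assert(std::isfinite(theta));
    const float cos_theta = std::cos(theta);
    const float sin_theta = forward ? std::sin(theta) : -std::sin(theta);
    const float r0 = x0 * cos_theta - x1 * sin_theta;
    const float r1 = x0 * sin_theta + x1 * cos_theta;
    x0 = r0;
    x1 = r1;
}
```

Applying the helper with `forward = true` and then with `forward = false` at the same angle should return the original pair up to rounding.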
* fix finetune rope call to use correct default attn_factor of 1.0f
* remove unused `ggml_rope_xpos_back`
it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.
* fix comments explaining the sign of the sine in ggml_forward_rope
* add missing function arguments in declaration
* fix function argument type in declaration
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCm. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside what the pull changed. This is also my first time creating a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
* check whether platform is s390x; if so, do not include immintrin.h
* support s390x big endian
* support --bigendian option for s390x
1. verified with baichuan7b-chat with float 16 on s390x
2. verified with baichuan7b-chat
3. verified with chinese-alpaca-2-13b-f16
* update format based on editor-config checker result
* Update convert-baichuan-hf-to-gguf.py
* 1. check in ggml.c whether the endianness does not match
2. update GGUF version
3. change get_pack_prefix to property
4. update information log
* always use "GGUF" as beginning of GGUF file
* Compare "GGUF" with file header char by char
1. Set GGUF_MAGIC to "GGUF" string instead of int value
2. Compare "GGUF" char by char to ensure its byte order
3. Move bytes swap code from convert.py to gguf.py write_tensor_data
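A byte-wise magic check does not depend on host endianness, unlike comparing against a 32-bit integer constant. A minimal sketch follows; `has_gguf_magic` is a hypothetical name, not the function used in ggml.c.

```cpp
#include <cstddef>

// Hypothetical helper: true when the first four bytes spell "GGUF".
// Comparing one char at a time gives the same result on little-endian
// and big-endian hosts.
bool has_gguf_magic(const char * header, size_t len) {
    static const char magic[4] = {'G', 'G', 'U', 'F'};
    if (header == nullptr || len < sizeof(magic)) {
        return false;
    }
    for (size_t i = 0; i < sizeof(magic); ++i) {
        if (header[i] != magic[i]) {
            return false;
        }
    }
    return true;
}
```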
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* WIP: start implementing LLaVA
* rm scratch buf for now, will revert after cleanup
* LLaVA image encoder is working. will combine with llama
* Add llava inference code, but it's buggy. debugging
* LLaVA is working e2e, needs to optimize memory allocation + cleanup
* Use ggml_allocr + rm unnecessary code
* fix: crlf -> lf
* fix: new line at EoF
* fix: trailing whitespace
* Add readme
* Update readme
* Some cleanup
* Are you happy editorconfig?
* rm unused batch image preprocessing
* rm unused import
* fix: rm designated initializers
* introduce pad-to-square mode for non-square images
* are you happy editorconfig?
* gitignore /llava
* Handle cases where image file does not exist
* add llava target to Makefile
* add support for 13b model variant
* Maybe seed is unlucky?
* Check if apples are compared to apples
* are you happy editorconfig?
* Use temperature = 0.1 by default
* command line: use gpt_params_parse()
* minor
* handle default n_predict
* fix typo
* llava : code formatting, rename files, fix compile warnings
* do not use Wno-cast-qual for MSVC
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545)
* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt
* mpt : protect against "clip_qkv": null in mpt-7b
* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models
* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
* mpt : standardized all tensor names to follow GGUF spec
* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code
* mpt : fixed comment s/gptneox/mpt/
* mpt : remove tabs, trailing whitespace
* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt
* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252
* comment out n_past instead of marking it unused
* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]
* mpt : remove unused tokenizer_json in convert script
* ggml : remove obsolete n_past assert in ggml_alibi
* llama : print clamp_kqv and max_alibi_bias hparams
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* sync : ggml (conv 1d + 2d updates)
ggml-ci
* ggml : fix UB in q5_0 and q5_1 quantize code
ggml.c:1033:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
ggml.c:1081:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
ggml-ci
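The sanitizer report above comes from shifting a signed `1` into the sign bit, which is undefined in C. A common remedy, shown here as a sketch rather than the exact patch, is to do the shift on an unsigned 32-bit value; `set_mask_bit` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: OR `bit` into position `j` of a 32-bit mask.
// The shift is done on uint32_t, so j == 31 is well defined.
uint32_t set_mask_bit(uint32_t mask, bool bit, unsigned j) {
    assert(j < 32);
    return mask | (static_cast<uint32_t>(bit) << j);
}
```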
* tests : fix UB in test-quantize-perf