Concedo
35a97e14b2
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# docs/token_generation_performance_tips.md
# grammars/README.md
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-grad0.cpp
# tests/test-opt.cpp
2023-11-15 16:59:53 +08:00
Galunid
36eed0c42c
stablelm : StableLM support ( #3586 )
...
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
2023-11-14 11:17:12 +01:00
Georgi Gerganov
4760e7cc0b
sync : ggml (backend v2) ( #3912 )
...
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
2023-11-13 14:16:23 +02:00
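For reference, a minimal sketch of the dynamic-graph allocation this sync migrates to, assuming the `ggml_new_graph_custom` entry point of the updated API (the 4096-node budget is the figure from the commit message above):

```cpp
#include "ggml.h"

// Graphs now carry an explicit node budget instead of a fixed-size array;
// inference graphs reserve up to 4096 nodes, and grads are only allocated
// for training graphs.
static struct ggml_cgraph * new_inference_graph(struct ggml_context * ctx) {
    return ggml_new_graph_custom(ctx, 4096, /*grads =*/ false);
}
```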
Kerfuffle
bb50a792ec
Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading ( #4041 )
...
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
2023-11-13 01:58:15 -07:00
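As a rough illustration (not the ggml-cuda code), the two added ops are simple element-wise kernels; a CPU reference of what they compute:

```cpp
#include <algorithm>
#include <cstddef>

// ReLU: clamp negatives to zero.
void relu_ref(const float * x, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = std::max(x[i], 0.0f);
}

// SQR: element-wise square.
void sqr_ref(const float * x, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = x[i] * x[i];
}
```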
Concedo
a6e6b8b96b
Merge branch 'master' into concedo_experimental
2023-11-10 22:27:11 +08:00
Galunid
df9d1293de
Unbreak persimmon after #3837 ( #4010 )
2023-11-10 14:24:54 +01:00
Concedo
f277ed0e8c
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
2023-11-07 15:23:08 +08:00
Meng Zhang
46876d2a2c
cuda : support running on CPU for GGML_USE_CUBLAS=ON build ( #3946 )
...
* prototyping the idea of running on CPU for a GGML_USE_CUBLAS=ON build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
2023-11-07 08:49:08 +02:00
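A sketch of the fallback pattern this enables, using the `ggml_cublas_loaded()` check the PR introduces (the surrounding guard is assumed for illustration):

```cpp
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

// A CUBLAS build no longer assumes a usable GPU: probe at runtime and fall
// back to the CPU path when initialization found no CUDA device.
bool use_gpu_path(void) {
#ifdef GGML_USE_CUBLAS
    return ggml_cublas_loaded();
#else
    return false;
#endif
}
```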
Concedo
a62468ec4c
Merge branch 'master' into concedo_experimental
...
should fix multigpu
2023-11-05 22:14:40 +08:00
Meng Zhang
3d48f42efc
llama : mark LLM_ARCH_STARCODER as full offload supported ( #3945 )
...
as done in https://github.com/ggerganov/llama.cpp/pull/3827
2023-11-05 14:40:08 +02:00
cebtenzzre
3fdbe6b66b
llama : change yarn_ext_factor placeholder to -1 ( #3922 )
2023-11-03 08:31:58 +02:00
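The idea, as a hedged sketch (the concrete defaults here are assumptions): -1 marks the field as unset, and a real value is substituted once the RoPE scaling type is known:

```cpp
// Resolve the -1 placeholder to a concrete extrapolation-mix factor.
float resolve_yarn_ext_factor(float ext_factor, bool using_yarn) {
    if (ext_factor < 0.0f) {                 // -1.0f == "not set by the user"
        return using_yarn ? 1.0f : 0.0f;     // illustrative defaults
    }
    return ext_factor;
}
```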
Concedo
bc2027b008
Merge remote-tracking branch 'ceb/fix-fast-ext-factor' into concedo_experimental
2023-11-03 11:21:14 +08:00
cebtenzzre
25fef506cf
llama : change yarn_ext_factor placeholder to -1
2023-11-02 21:53:59 -04:00
Concedo
42eabf2f2f
RoPE fixes
2023-11-02 20:41:16 +08:00
Concedo
bc4ff72317
non-working merge
2023-11-02 17:52:40 +08:00
Georgi Gerganov
1efae9b7dc
llm : prevent from 1-D tensors being GPU split ( #3697 )
2023-11-02 09:54:44 +02:00
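Roughly, the guard looks like this (names hypothetical; the real check lives in llama.cpp's tensor-backend selection): only matrices are row-split across GPUs, while 1-D tensors such as biases and norm weights stay whole on one device:

```cpp
enum backend_kind { BACKEND_GPU, BACKEND_GPU_SPLIT };

// Hypothetical sketch: never split a 1-D tensor across devices.
backend_kind pick_backend(int n_dims, bool multi_gpu) {
    if (multi_gpu && n_dims > 1) {
        return BACKEND_GPU_SPLIT;
    }
    return BACKEND_GPU;
}
```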
Concedo
1ab18ecb53
Merge commit 'c43c2da8af' into concedo_experimental
...
# Conflicts:
# llama.cpp
2023-11-02 11:17:59 +08:00
cebtenzzre
0eb332a10f
llama : fix llama_context_default_params after #2268 ( #3893 )
2023-11-01 19:29:14 -04:00
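For context, a typical use of the API being fixed (the fix presumably restores correct defaults for the fields #2268 added; the `n_ctx` override here is illustrative):

```cpp
#include "llama.h"

int main(void) {
    // Start from the defaults and override only what you need.
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 4096;
    (void) params;
    return 0;
}
```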
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling ( #2268 )
...
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Georgi Gerganov
c43c2da8af
llm : fix llm_build_kqv taking unused tensor (benign, #3837 )
2023-11-01 23:08:30 +02:00
Georgi Gerganov
523e49b111
llm : fix falcon norm after refactoring ( #3837 )
2023-11-01 23:00:50 +02:00
Georgi Gerganov
50337961a6
llm : add llm_build_context ( #3881 )
...
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
2023-11-01 20:11:02 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter ( #3762 )
...
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case, because inference later warns that you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00
Concedo
9342636408
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# flake.lock
# flake.nix
2023-11-01 18:24:36 +08:00
Georgi Gerganov
71e3718abd
llama : refactor graph build code ( #3837 )
...
* llama : factor out ggml-alloc from the graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848 )
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849 )
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
2023-11-01 08:04:02 +02:00
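To give a feel for the helper style this refactor introduces, a simplified sketch in the spirit of `llm_build_norm` (signature simplified; the real helpers also thread an offload callback and layer index):

```cpp
#include "ggml.h"

// RMS-norm followed by a learned scale: the pattern repeated across builders.
static struct ggml_tensor * build_rms_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * weight,
        float                 eps) {
    cur = ggml_rms_norm(ctx, cur, eps);
    cur = ggml_mul(ctx, cur, weight);
    return cur;
}
```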
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] ( #3841 )
...
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
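The rule itself is small; a self-contained sketch of min-p filtering as described above (not the llama.cpp implementation, which works on sorted logit candidates):

```cpp
#include <algorithm>
#include <vector>

// Keep tokens whose probability is at least p times that of the top token,
// then renormalize the survivors.
std::vector<float> min_p_filter(std::vector<float> probs, float p = 0.05f) {
    const float threshold = p * *std::max_element(probs.begin(), probs.end());
    float sum = 0.0f;
    for (float & q : probs) {
        if (q < threshold) q = 0.0f;
        sum += q;
    }
    for (float & q : probs) q /= sum; // the top token always survives, so sum > 0
    return probs;
}
```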
Concedo
e62f38abd1
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# tests/test-double-float.cpp
# tests/test-quantize-fns.cpp
2023-10-31 21:09:49 +08:00
Concedo
cc5b282350
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# build.zig
# flake.lock
# flake.nix
# ggml.c
2023-10-31 20:44:04 +08:00
Georgi Gerganov
207b51900e
ggml : move FP16 <-> FP32 code to ggml-impl.h ( #3861 )
...
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
2023-10-30 19:19:15 +02:00
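For background, the table-based conversion ggml uses on targets without native f16 support looks roughly like this (a sketch, not the moved code; only the `ggml_` table-name prefix comes from the commit):

```cpp
#include <cstdint>
#include <cstring>

// One f32 entry per 16-bit pattern (ggml's table is now ggml_table_f32_f16).
static float table_f32_f16[1 << 16];

static float fp16_to_fp32_bits(uint16_t h) {
    const uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if (exp == 0x1F) {                        // inf / NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {                    // normal: rebias the exponent
        bits = sign | ((exp + 112) << 23) | (mant << 13);
    } else if (mant != 0) {                   // subnormal: renormalize
        exp = 113;
        while ((mant & 0x400) == 0) { mant <<= 1; exp--; }
        bits = sign | (exp << 23) | ((mant & 0x3FF) << 13);
    } else {                                  // signed zero
        bits = sign;
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

static void init_f16_table(void) {
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        table_f32_f16[i] = fp16_to_fp32_bits((uint16_t) i);
    }
}
```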
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence ( #3843 )
...
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
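Usage sketch based on the commit description: seq_id -1 now matches any sequence, and explicit clearing replaces the removed tokens_rm call (position values illustrative):

```cpp
#include "llama.h"

void prune_and_reset(llama_context * ctx) {
    // Remove cells at positions >= 100 in every sequence (p1 < 0 = open-ended).
    llama_kv_cache_seq_rm(ctx, /*seq_id =*/ -1, /*p0 =*/ 100, /*p1 =*/ -1);

    // A full reset now goes through the dedicated clear call.
    llama_kv_cache_clear(ctx);
}
```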
Georgi Gerganov
71a09da301
llama : fix kv shift bug ( #3835 )
...
ggml-ci
2023-10-29 18:32:51 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring ( #3833 )
...
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Concedo
338d6c265d
fixes to smartcontextpro
2023-10-29 10:42:37 +08:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when the tensor size is incompatible ( #3747 )
...
* Allow quantizing k-quants to fall back when the tensor size is incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
2023-10-28 14:54:24 +03:00
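The gist, as a hedged sketch (enum and fallback type chosen for illustration): k-quants operate on 256-value super-blocks, so tensors whose row size is not a multiple of 256 get a compatible non-k type instead:

```cpp
#include <cstdint>

enum qtype { Q4_K, Q4_0 };

// Fall back when the row length is incompatible with the k-quant block size;
// the real code also emits a warning.
qtype pick_quant_type(qtype requested, int64_t n_per_row) {
    const int64_t QK_K = 256; // k-quant super-block size
    if (requested == Q4_K && n_per_row % QK_K != 0) {
        return Q4_0;
    }
    return requested;
}
```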
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading ( #3827 )
...
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
2023-10-28 12:06:08 +03:00
Concedo
2ea3b567cf
Merge: Testing speed of tensor cores vs MMQ
2023-10-28 16:41:42 +08:00
Concedo
15f525c580
revamped smart context for llama models
2023-10-28 12:59:08 +08:00
cebtenzzre
6d459cbfbe
llama : correctly report GGUFv3 format ( #3818 )
2023-10-27 17:33:53 -04:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance ( #3776 )
...
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
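The dispatch this introduces, sketched (the batch-size cutoff is an assumption for illustration): MMQ kernels for small batches, the cuBLAS GEMM branch otherwise, with GGML_CUDA_FORCE_MMQ overriding the heuristic:

```cpp
// Choose between the quantized MMQ kernels and the cuBLAS GEMM branch.
bool should_use_mmq(int batch_size) {
#ifdef GGML_CUDA_FORCE_MMQ
    return true;                        // compile-time override from this PR
#else
    const int mmq_max_batch = 32;       // assumed cutoff, for illustration
    return batch_size <= mmq_max_batch;
#endif
}
```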
Concedo
5db89b90b7
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .gitignore
# CMakeLists.txt
# Makefile
# README.md
# build.zig
# ggml-opencl.cpp
# tests/CMakeLists.txt
# tests/test-double-float.cpp
# tests/test-sampling.cpp
2023-10-25 23:58:15 +08:00
Concedo
c9983a72d6
prevent LoRA with CLBlast
2023-10-25 15:18:03 +08:00
Marcus Dunn
5be6c803fa
llama : remove token functions with context args in favor of model ( #3720 )
...
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-23 22:40:03 +03:00
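After this change, token metadata queries take the model rather than a context, e.g.:

```cpp
#include "llama.h"

void dump_special_tokens(const llama_model * model) {
    const llama_token bos = llama_token_bos(model); // was: llama_token_bos(ctx)
    const char * text = llama_token_get_text(model, bos);
    (void) text;
}
```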
goerch
9e70cc0322
Add test for MPT tokenization ( #3728 )
...
* Add test for MPT tokenization
* Revert code motion
* Remove unnecessary restriction in test case
* Clarify logic in conversion
2023-10-22 21:21:42 +02:00
Kerfuffle
a5e7dbd614
llama : validate special token ids are in range when loading GGUF model ( #3635 )
...
* Add validation for special token ids to llama.cpp
Small optimization for llama_byte_to_token SPM mode
* Fix BPE newline check, only I could break something so simple
* Killll meeeeee
* Account for GGUF_KEY_KEY only setting when the key exists
* Minor code cleanups.
* Fix convert.py error msg when added tokens are out of range
* Make gguf SpecialVocab vocab size-aware
Update conversion scripts accordingly
* Avoid a string copy
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-22 21:14:56 +03:00
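The core check is simple; a hypothetical sketch of the validation (names assumed):

```cpp
#include <cstdint>

// A special token id from GGUF metadata must name an existing vocab entry.
bool special_token_id_ok(int32_t id, int32_t n_vocab) {
    return id >= 0 && id < n_vocab;
}
```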
Concedo
cff75061fe
fixed some old models failing due to tokenizer changes, update lite (+1 squashed commit)
...
Squashed commits:
[9dee81ec] fixed some old models failing due to tokenizer changes, update lite tooltip (+3 squashed commits)
Squashed commit:
[5ab95a79] fixes
[a561d5e2] fixed some old models failing due to tokenizer changes
[95e65daf] lite updates
2023-10-22 11:04:59 +08:00
Georgi Gerganov
d1031cf49c
sampling : refactor init to use llama_sampling_params ( #3696 )
...
* sampling : refactor init to use llama_sampling_params
* llama : combine repetition, frequency and presence penalties in 1 call
* examples : remove embd-input and gptneox-wip
* sampling : rename penalty params + reduce size of "prev" vector
* sampling : add llama_sampling_print helper
* sampling : hide prev behind API and apply #3661
ggml-ci
2023-10-20 21:07:23 +03:00
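A usage sketch of the combined penalty call this refactor introduces (parameter values illustrative):

```cpp
#include "llama.h"
#include <vector>

// Repetition, frequency and presence penalties now go through one call.
void apply_penalties(llama_context * ctx, llama_token_data_array * cand,
                     const std::vector<llama_token> & prev) {
    llama_sample_repetition_penalties(ctx, cand,
        prev.data(), prev.size(),
        /*penalty_repeat  =*/ 1.1f,
        /*penalty_freq    =*/ 0.0f,
        /*penalty_present =*/ 0.0f);
}
```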
Herman Semenov
f439e506e8
ggml : fix rope + llama minor optimizations ( #3560 )
...
* Minor fixes and fixed memleak
* Using const auto references in range-based loop C++17
2023-10-20 13:02:12 +03:00
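The loop style referenced in the second bullet, for clarity: iterating by `const auto &` avoids copying each element:

```cpp
#include <string>
#include <vector>

size_t total_len(const std::vector<std::string> & items) {
    size_t n = 0;
    for (const auto & s : items) { // const reference: no per-element copy
        n += s.size();
    }
    return n;
}
```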
Concedo
957e245285
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
2023-10-19 23:32:52 +08:00
Georgi Gerganov
0e89203b51
speculative : add tree-based sampling example ( #3624 )
...
* sampling : one sequence per sampling context
ggml-ci
* speculative : add tree-based sampling support
ggml-ci
* speculative : reuse the n_parallel CLI param
* speculative : refactor sampling
* examples : fix build after sampling refactoring
ggml-ci
* batched : fix n_seq_id
* sampling : fix malloc
ggml-ci
* swift : fix build
ggml-ci
* swift : try to fix build
ggml-ci
* prompts : add assistant.txt
* common : add llama_batch_add() and llama_batch_clear() helpers
* speculative : minor refactor
ggml-ci
* minor : comments + rename
ggml-ci
* speculative : fix off-by-one for n_drafted
* speculative : fix the n_drafted fix + p constants
2023-10-18 16:21:57 +03:00
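Usage sketch of the two new common helpers mentioned above:

```cpp
#include "common.h"
#include <vector>

// Fill a batch token-by-token instead of writing the llama_batch arrays by hand.
void fill_batch(llama_batch & batch, const std::vector<llama_token> & tokens) {
    llama_batch_clear(batch);
    for (size_t i = 0; i < tokens.size(); ++i) {
        // sequence 0; request logits only for the last token
        llama_batch_add(batch, tokens[i], (llama_pos) i, { 0 },
                        i == tokens.size() - 1);
    }
}
```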
Concedo
c1ca1de2ac
fixed support for old Falcon models
2023-10-18 17:20:44 +08:00