* Fixes #7999
The `build_command_r` graph builder was not applying the control vector.
* Fixes qwen2 too
* Fixed all models' control vectors
* Removed duplicate calls to `cb(cur, "l_out", il)`
* Moved the control vector logic to `llama_control_vector::apply_to()` (sketched below)
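Below is a minimal sketch of what such a centralized helper looks like conceptually; the struct layout and method name are illustrative assumptions rather than the actual llama.cpp code, and `ggml_add` is the only real ggml call used.
```cpp
// Illustrative sketch: apply the per-layer control vector in one place instead of
// repeating the logic in every architecture's graph builder.
#include "ggml.h"
#include <vector>

struct control_vector_sketch {
    std::vector<ggml_tensor *> tensors; // one optional direction tensor per layer (may be nullptr)

    ggml_tensor * apply_to(ggml_context * ctx, ggml_tensor * cur, int il) const {
        ggml_tensor * dir = (il >= 0 && il < (int) tensors.size()) ? tensors[il] : nullptr;
        if (dir != nullptr) {
            cur = ggml_add(ctx, cur, dir); // steer the layer output ("l_out")
        }
        return cur;
    }
};
```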
* llama : add T5 model architecture, tensors and model header parameters
* llama : add an implementation of the Unigram tokenizer, with SentencePiece-like text normalization using a precompiled charsmap (segmentation sketched below this entry)
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
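For background on the Unigram tokenizer added above, the following self-contained sketch shows the Viterbi segmentation it is built around. It is illustrative only, not the llama.cpp implementation: the precompiled-charsmap normalization and unknown-token fallback are omitted, and the vocabulary is assumed to be a plain map from piece to log-probability.
```cpp
// Unigram segmentation sketch: choose the split with the highest total log-probability.
#include <cfloat>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> unigram_segment(
        const std::string & text,
        const std::unordered_map<std::string, float> & vocab /* piece -> log prob */) {
    const size_t n = text.size();
    std::vector<float>  best(n + 1, -FLT_MAX); // best score for text[0..end)
    std::vector<size_t> prev(n + 1, 0);        // start index of the last piece in that split
    best[0] = 0.0f;
    for (size_t end = 1; end <= n; ++end) {
        for (size_t start = 0; start < end; ++start) {
            if (best[start] == -FLT_MAX) {
                continue; // prefix not segmentable
            }
            const auto it = vocab.find(text.substr(start, end - start));
            if (it != vocab.end() && best[start] + it->second > best[end]) {
                best[end] = best[start] + it->second;
                prev[end] = start;
            }
        }
    }
    // backtrack; a real tokenizer falls back to byte/unknown pieces if best[n] is still -FLT_MAX
    std::vector<std::string> out;
    for (size_t end = n; end > 0; end = prev[end]) {
        out.insert(out.begin(), text.substr(prev[end], end - prev[end]));
    }
    return out;
}
```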
* llama : return nullptr from llama_grammar_init
This commit updates llama_grammar_init to return nullptr instead of
throwing an exception.
The motivation for this is that the function is declared inside an
extern "C" block and is intended to be usable from C code, which cannot
handle thrown exceptions; letting one escape results in undefined behavior.
On Windows, building with MSVC currently generates the following warning:
```console
C:\llama.cpp\llama.cpp(13998,1): warning C4297: 'llama_grammar_init':
function assumed not to throw an exception but does
C:\llama.cpp\llama.cpp(13998,1): message :
__declspec(nothrow), throw(), noexcept(true), or noexcept was specified
on the function
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! llama : return nullptr from llama_grammar_init
Add checks for nullptr when calling llama_grammar_init (a caller-side sketch follows this entry).
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Clint Herron <hanclinto@gmail.com>
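A caller-side sketch of the resulting pattern, assuming the llama_grammar_init signature exposed in llama.h at the time of this change:
```cpp
// Failure is now reported with nullptr instead of an exception, so callers
// (including C callers) must check the result.
#include "llama.h"
#include <cstdio>

struct llama_grammar * init_grammar_checked(
        const llama_grammar_element ** rules,
        size_t n_rules,
        size_t start_rule_index) {
    struct llama_grammar * grammar = llama_grammar_init(rules, n_rules, start_rule_index);
    if (grammar == nullptr) {
        // previously this failure path threw from inside an extern "C" function,
        // which is undefined behavior when reached from C
        fprintf(stderr, "%s: failed to initialize llama grammar\n", __func__);
    }
    return grammar;
}
```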
* ggml : remove ggml_task_type and GGML_PERF
* check abort_callback on main thread only
* vulkan : remove usage of ggml_compute_params
* remove LLAMA_PERF
* create append_pooling operation; allow specifying attention_type; add last-token pooling; update examples
* find result_norm/result_embd tensors properly; update output allocation logic
* only use embd output for pooling_type NONE
* get rid of old causal_attn accessor
* take out attention_type; add llama_set_embeddings (usage sketched below)
* bypass logits when doing non-NONE pooling
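A hedged usage sketch of the new knobs; the field and enum names follow the public llama.h API introduced with this change, but treat the snippet as illustrative rather than authoritative:
```cpp
// Request pooled embeddings with last-token pooling, and toggle embeddings output
// at runtime via llama_set_embeddings().
#include "llama.h"

struct llama_context * make_embedding_context(struct llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // return embeddings instead of logits
    cparams.pooling_type = LLAMA_POOLING_TYPE_LAST; // new: pool on the last token
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    // the switch can also be flipped later without rebuilding the context:
    llama_set_embeddings(ctx, true);
    return ctx;
}
```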
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B
* fix: QWEN2MOE support for expert_feed_forward_length
Previously, the expert FFN size was taken from n_ff (the intermediate size); it is now read from LLM_KV_EXPERT_FEED_FORWARD_LENGTH.
n_ff_exp and n_ff_shared_exp are now calculated correctly.
* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM
* fix: QWEN2MOE support for expert_feed_forward_length
Previously, the expert FFN size was taken from n_ff (the intermediate size); it is now read from LLM_KV_EXPERT_FEED_FORWARD_LENGTH.
n_ff_exp and n_ff_shexp are now calculated correctly (see the GGUF check sketched below).
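As a quick sanity check of the resulting metadata, the hedged sketch below reads the dedicated key back with the public gguf API; the key name `qwen2moe.expert_feed_forward_length` and the header exposing the gguf functions are assumptions, not taken from this change.
```cpp
// Verify that a converted Qwen2MoE GGUF carries its own expert FFN length
// instead of silently reusing n_ff.
#include "ggml.h" // gguf_* API (header location assumed)
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * gctx = gguf_init_from_file(argv[1], params);
    if (gctx == nullptr) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }
    const int kid = gguf_find_key(gctx, "qwen2moe.expert_feed_forward_length"); // assumed key name
    if (kid >= 0) {
        printf("expert_feed_forward_length = %u\n", gguf_get_val_u32(gctx, kid));
    } else {
        printf("key not found (older conversion?)\n");
    }
    gguf_free(gctx);
    return 0;
}
```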
* Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by more than 3x, while it should be the same (or slightly faster) on any other drive.
* Free resources except for the backend.
* Change assertions to exceptions in llama_file, find the correct CUDA backend when creating CUDA resources, and respect the use_mmap flag again for CUDA.
* Apply suggestions from code review
Co-authored-by: slaren <slarengh@gmail.com>
* Fix editorconfig and unused variable
* Fix issues with Windows build
---------
Co-authored-by: slaren <slarengh@gmail.com>
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use a macro for group_size and remove CUDA-related code
* support for Poro chat pre-tokenizer
* add support for Poro pre-tokenizer
* Update convert-hf-to-gguf-update.py
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Change Poro-34B-chat to poro-chat
* Change Poro-34B-chat to poro-chat
* Update convert-hf-to-gguf-update.py
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse the same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: add changes to handle jina v2 base code
* fix: do not complicate things
* fix: fix the usage of the code model
* fix: fix comments
* fix: fix linting issues
* fix: remove ollama patches
* style : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remove --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip' (semantics sketched after this entry)
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
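To make the new attributes concrete, the self-contained sketch below illustrates the intended whitespace behavior around a special token; it is not the llama.cpp implementation, only the semantics exercised by the tests above.
```cpp
// 'lstrip' consumes whitespace immediately before a special token,
// 'rstrip' consumes whitespace immediately after it.
#include <iostream>
#include <string>

std::string join_around_special(const std::string & left, const std::string & right,
                                bool lstrip, bool rstrip) {
    std::string l = left;
    std::string r = right;
    if (lstrip) { while (!l.empty() && l.back()  == ' ') l.pop_back(); }
    if (rstrip) { while (!r.empty() && r.front() == ' ') r.erase(0, 1); }
    return l + "<SPECIAL>" + r;
}

int main() {
    std::cout << join_around_special("foo  ", "  bar", /*lstrip=*/true,  /*rstrip=*/false) << "\n"; // foo<SPECIAL>  bar
    std::cout << join_around_special("foo  ", "  bar", /*lstrip=*/false, /*rstrip=*/true)  << "\n"; // foo  <SPECIAL>bar
    return 0;
}
```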
* llama : offload to RPC in addition to other backends
* - fix copy_tensor being called on the src buffer instead of the dst buffer
- always initialize views in the view_src buffer
- add RPC backend to Makefile build
- add endpoint to all RPC object names
* add rpc-server to Makefile
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* llama : cache llama_token_to_piece
ggml-ci
* llama : use vectors and avoid has_cache
ggml-ci
* llama : throw on unknown tokenizer types
ggml-ci
* llama : print a log of the total cache size
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
- Fix unicode edge case combinations.
- Split by whitespace in the same pass.
* Discard all tokens when no match is found.
* Add optional MLP bias for Granite models
Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/llama.cpp/issues/7116.
Still needs more changes to fully support Granite (an optional-bias sketch follows this entry).
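A hedged graph-building sketch of what the optional bias amounts to; the function and tensor names are illustrative, while `ggml_mul_mat` and `ggml_add` are the real ggml calls.
```cpp
// The FFN projection adds a bias tensor only when the checkpoint provides one,
// as Granite models do; other LLaMA-style checkpoints leave it null.
#include "ggml.h"

ggml_tensor * ffn_proj_with_optional_bias(
        ggml_context * ctx,
        ggml_tensor  * cur,
        ggml_tensor  * w,    // e.g. the ffn_up weight
        ggml_tensor  * b) {  // matching bias, may be nullptr
    cur = ggml_mul_mat(ctx, w, cur);
    if (b != nullptr) {
        cur = ggml_add(ctx, cur, b);
    }
    return cur;
}
```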
* llama: honor add_space_prefix from the model configuration
Propagate the add_space_prefix setting from the HF model configuration
to the GGUF file and honor it in the gpt2 tokenizer (see the sketch at the end of this entry).
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* llama: add support for small granite models
It works only for the small models, 3b and 8b.
The convert-hf-to-gguf.py script uses the vocabulary size of the
Granite models to detect Granite and set the correct configuration.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
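Finally, a minimal illustrative helper (not the llama.cpp tokenizer code) showing what honoring add_space_prefix means before tokenization:
```cpp
// SentencePiece-style behavior: when add_space_prefix is set, a leading space is
// inserted so the first word tokenizes the same way it would mid-sentence.
#include <string>

std::string maybe_add_space_prefix(const std::string & raw, bool add_space_prefix) {
    if (add_space_prefix && !raw.empty() && raw.front() != ' ') {
        return " " + raw;
    }
    return raw;
}
```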