961 commits

Author SHA1 Message Date
Concedo
92afdfcae4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/server.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	llama.cpp
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	scripts/sync-ggml.last
#	tests/test-tokenizer-random.py
2024-06-22 01:33:44 +08:00
Georgi Gerganov
a927b0f3dd
llama : optimize long word tokenization with WPM (#8034)
ggml-ci
2024-06-21 08:51:28 +03:00
Douglas Hanley
80ea089d77
llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
2024-06-21 08:38:22 +03:00
jaime-m-p
37bef89433
tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18 18:40:52 +02:00
Concedo
c9c050f323 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
2024-06-19 00:33:33 +08:00
Ștefan-Gabriel Muscalu
a94e6ff877
update: support Qwen2-57B-A14B (#7835)
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shared_exp are now properly calculated

* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shexp are now properly calculated
2024-06-17 21:08:46 +02:00
Georgi Gerganov
7c26775adb
llama : disable FA if KV head sizes do not match (#7982) 2024-06-17 19:40:01 +03:00
Frank Mai
c637fcd34d
fix: divide 0 exception in mamba (#7932)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-17 16:11:08 +02:00
Markus Tavenrath
6a2f0b3474
Implement non-mapped async IO for CUDA on Windows. (#7896)
* Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x while it should be the same (or slightly faster) on any other drive.

* Free resources except for backend.

* Change assertions to exceptions in llama_file, find correct cuda backend to create CUDA resources and respect the use_mmap flag again for CUDA.

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* Fix editorconfig and unused variable

* Fix issues with Windows build

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-17 16:10:15 +02:00
Concedo
967c1d8df5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	flake.lock
#	tests/test-backend-ops.cpp
2024-06-17 15:14:47 +08:00
Concedo
ba9ef4d01b fix to allow clblast to work even after blas backend splitoff 2024-06-17 15:02:55 +08:00
Georgi Gerganov
52399254b3
unicode : avoid char32_t (#7957)
ggml-ci
2024-06-16 14:51:40 +03:00
Meng, Hengyu
7b2f4a7d19
[SYCL] remove global variables (#7710)
* separate DPCT helpers outside

* replace global variables with context

* remove useless extra

* update mul_mat condition

* remove duplicate buft initialization

* remove duplicate extra and global work group size

* remove useless backend check

* remove duplicated extras

* use macro for group_size and remove cuda-related
2024-06-15 14:05:10 +08:00
Sigbjørn Skjæret
6fcd1331ef
llama : more checks before assuming FIM tokens (#7644)
* More checks before assuming FIM tokens for Llama arch

* extensive token check
2024-06-14 13:20:04 +03:00
Elaine
41b9260f18
convert : add Poro-34B-chat tokenizer support (#7713)
* support for Poro chat pre-tokenizer

* add support for Poro pre-tokenizer

* Update convert-hf-to-gguf-update.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Change Poro-34B-chat to poro-chat

* Change Poro-34B-chat to poro-chat

* Update convert-hf-to-gguf-update.py

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-14 13:16:49 +03:00
Concedo
49e4c3fd7b adjust lite default port, disable double BOS warning, whisper and SD go quiet when horde mode is set too 2024-06-13 15:10:35 +08:00
slaren
f578b86b21
move BLAS to a separate backend (#6210)
* move BLAS to a separate backend

* rename GGML_USE_OPENBLAS to GGML_USE_BLAS

* alloc : reuse same buffer when the same buffer type is used multiple times

* set number of threads automatically for openblas and blis

* sched : print assignments when GGML_SCHED_DEBUG env variable is set

* sched : allow ops with weights on an incompatible buffer type

This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13 03:11:35 +02:00
Concedo
562d980140 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-cuda.Dockerfile
#	.devops/main-rocm.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/main.Dockerfile
#	.devops/server-cuda.Dockerfile
#	.devops/server.Dockerfile
#	README.md
#	common/CMakeLists.txt
#	grammars/README.md
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-06-09 17:30:05 +08:00
Concedo
02357eadf8 Merge commit '7672adeec7' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	kompute-shaders/op_rope_f16.comp
#	kompute-shaders/op_rope_f32.comp
#	kompute-shaders/rope_common.comp
#	tests/test-backend-ops.cpp
#	tests/test-grad0.cpp
#	tests/test-rope.cpp
2024-06-09 15:35:51 +08:00
slaren
c9ee7118d5
check for nans in imatrix and quantize (#7807)
* imatrix : detect nan/inf values

* quantize : check imatrix for nan/inf values
2024-06-07 09:01:29 +03:00
Clint Herron
ad675e1c67
Added support for . (any character) token in grammar engine. (#6467)
* Added support for . (any character) token in grammar engine.

* Add integration tests for any-character symbol.
2024-06-06 06:08:52 -07:00
Joan Fontanals
f5d7b268ec
llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
Georgi Gerganov
2b3389677a
ggml : refactor rope norm/neox (#7634)
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
2024-06-05 11:29:20 +03:00
Concedo
6659742a2d do not merge the removal of opencl 2024-06-05 10:57:52 +08:00
Concedo
e3e21cc44d Merge commit '0cd6bd3483' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	models/ggml-vocab-phi-3.gguf
#	scripts/compare-commits.sh
#	tests/test-tokenizer-random.py
2024-06-05 10:53:52 +08:00
Georgi Gerganov
1442677f92
common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remove --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Georgi Gerganov
554c247caf
ggml : remove OpenCL (#7735)
ggml-ci
2024-06-04 21:23:20 +03:00
Georgi Gerganov
0cd6bd3483
llama : remove beam search (#7736) 2024-06-04 21:23:05 +03:00
jaime-m-p
3b38d48609
Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
2024-06-04 09:17:17 +02:00
Radoslav Gerganov
bde7cd3cd9
llama : offload to RPC in addition to other backends (#7640)
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-03 20:03:26 +03:00
Concedo
94753ad103 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
2024-06-03 18:33:23 +08:00
0cc4m
3d7ebf6312
Vulkan Mixture of Experts (MoE) support (#7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
2024-06-03 10:59:14 +02:00
zhangkaihuo
6f28a333c1
llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
2024-06-03 10:49:30 +03:00
Concedo
8b29d5f848 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	flake.lock
#	llama.cpp
2024-06-03 14:46:12 +08:00
Georgi Gerganov
549279d804
llama : avoid double token-to-piece cache (#7654)
ggml-ci
2024-06-03 08:34:43 +03:00
Concedo
a97f7d5f91 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-cuda.Dockerfile
#	.devops/main-intel.Dockerfile
#	.devops/main-rocm.Dockerfile
#	.devops/main.Dockerfile
#	.devops/server-cuda.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-rocm.Dockerfile
#	.devops/server.Dockerfile
#	.devops/tools.sh
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	llama.cpp
#	requirements.txt
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	requirements/requirements-convert-llama-ggml-to-gguf.txt
#	scripts/check-requirements.sh
#	scripts/compare-llama-bench.py
#	scripts/convert-gg.sh
#	scripts/pod-llama.sh
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-tokenizer-0.sh
#	tests/test-tokenizer-random.py
2024-06-02 12:28:38 +08:00
Johannes Gäßler
9b596417af
CUDA: quantized KV support for FA vec (#7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-01 08:44:14 +02:00
Georgi Gerganov
5921b8f089
llama : cache llama_token_to_piece (#7587)
* llama : cache llama_token_to_piece

ggml-ci

* llama : use vectors and avoid has_cache

ggml-ci

* llama : throw on unknown tokenizer types

ggml-ci

* llama : print a log of the total cache size
2024-05-31 02:01:41 +10:00
Georgi Gerganov
fb76ec31a9
ggml : fix YARN + add tests + add asserts (#7617)
* tests : add rope tests

ggml-ci

* ggml : fixes (hopefully)

ggml-ci

* tests : add non-cont tests

ggml-ci

* cuda : add asserts for rope/norm + fix DS2

ggml-ci

* ggml : assert contiguousness

* tests : reduce RoPE tests

ggml-ci
2024-05-29 20:17:31 +03:00
jaime-m-p
02c1ecad07
Tokenizer WPM fixes (#7500)
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no match is found.
2024-05-28 21:46:34 +02:00
Giuseppe Scrivano
5442939fcc
llama : support small Granite models (#7481)
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/llama.cpp/issues/7116
Still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
2024-05-28 21:49:49 +03:00
fairydreaming
ee3dff6b8e
Add support for DeepseekV2ForCausalLM (#7519)
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-28 17:07:05 +02:00
Concedo
4ed9ba7352 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.lock
#	tests/test-backend-ops.cpp
2024-05-28 21:57:19 +08:00
Georgi Gerganov
8b99e2aa66
llama : handle unknown utf8 bytes (#7588) 2024-05-28 13:55:35 +03:00
Bartowski
c429b33beb
llama : add Smaug 70B support (#7402) 2024-05-26 15:28:35 +03:00
Justine Tunney
00c6390793
main : don't print special tokens with --grammar (#6923)
* main : don't print special tokens with --grammar

The CLI interface was recently changed to print special control tokens
like the </s> stop message one. This token shouldn't be printed if the
grammar flag was passed, unless the grammar specifies it, because that
breaks shell-scriptability.

* main: use separate stream for control characters

* main: use dprintf and add --ctrl-token-no-out and --ctrl-token-fd-out

* main: dprintf isn't part of the IEEE POSIX standard. Just use write().

* main: remove --ctrl-token-fd-out in favor of fcntl()-based detection

* common.cpp: accidentally removed --interactive-first

* main: only merge stdout and control token if not in conversation or grammar mode

* main: rejig control token descriptor handling

* main: must check pipe status on very top of program

* main: renamed --ctrl-token-no-out to --no-special and other refactoring

* main: refactor ctrl_token_no_out --> no_special

* llama: rename llama_token_is_control_token() to llama_token_is_control()

* main: remove special token file descriptor feature (#5)

---------

Co-authored-by: Brian <mofosyne@gmail.com>
2024-05-25 19:04:03 +10:00
Masaya, Kato
faa0e6979a
ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433)
* Add SVE support for q4_0_q8_0 q8_0_q8_0

* remove ifdef
2024-05-25 11:42:31 +03:00
fairydreaming
fbca2f27fc
Add support for ArcticForCausalLM (#7020)
* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-24 14:31:13 +02:00
Concedo
653050135b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	tests/test-chat-template.cpp
2024-05-24 16:22:38 +08:00
Tristan Druyen
007489e895
Fix phi3 chat template confusion with zephyr (#7449)
* Fix phi3 template matching vs zephyr

* Add regression test for new phi3 chat template

* Implement review suggestions

* Fix phi3 jinja test templates & match by <|end|>

* Apply suggestion

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Add all phi3 template variants in tests

* Remove unneeded message trimming

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Fix tests to not expect trimmed messages

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-05-23 16:15:15 +02:00