Commit graph

4371 commits

Georgi Gerganov
044ec4b2a5
embedding : add EOS token if not present (#899) 2024-03-14 15:14:14 +02:00
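A minimal sketch of what this change amounts to, assuming the llama.h C API of the time (llama_tokenize output and llama_token_eos); not the exact patch:

```cpp
#include <vector>
#include "llama.h"

// Sketch: append EOS to the tokenized input if it is not already the last
// token, so embeddings are computed over a properly terminated sequence.
static void ensure_eos(std::vector<llama_token> & tokens, const llama_model * model) {
    const llama_token eos = llama_token_eos(model);
    if (tokens.empty() || tokens.back() != eos) {
        tokens.push_back(eos);
    }
}
```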
Georgi Gerganov
77178eedc8
gguf-py : fix dtype check (#6045) 2024-03-14 13:32:14 +02:00
Jian Liao
15a333260a
readme : improve readme for Llava-1.6 example (#6044)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14 13:18:23 +02:00
Pierrick Hymbert
43241adf22
server: disable debug release type sanitizer, simplify trigger (#6047)
- increase timeout for server
- do not fail fast
2024-03-14 13:15:39 +02:00
Georgi Gerganov
a44bc969e4
llama : fix typo 2024-03-14 13:13:06 +02:00
Michael Podvitskiy
2c4fb69246
llama : optimize defrag moves + fix fragmentation calculation (#6037)
* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14 12:56:48 +02:00
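A hedged sketch of the kind of quantity the "fragmentation calculation fix" concerns: fragmentation as the fraction of unused cells inside the occupied span of the KV cache. Names and the exact formula are illustrative, not the patched code:

```cpp
// Illustrative only: fragmentation = share of holes inside the used span.
// A defrag pass is worth triggering once this crosses some threshold.
static float kv_fragmentation(int n_used_cells, int n_span_cells) {
    if (n_span_cells <= 0) {
        return 0.0f;
    }
    return 1.0f - (float) n_used_cells / (float) n_span_cells;
}
```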
Ondřej Čertík
3ca23481dd
gguf-py : add support for I8, I16 and I32 (#6045)
* Refactor dtype handling to be extensible

This code behaves the same as before, but is now structured so that more
NumPy dtypes can be added easily.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
2024-03-14 12:40:14 +02:00
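The gguf-py change is Python, but the shape of the refactor translates directly: replace per-dtype branching with a single lookup table, so supporting a new dtype becomes one new entry. A C++ sketch of that pattern (table abbreviated; the integer-type IDs match the designated values in the next commit, as best as can be read from the log):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// One extensible dtype -> tensor-type table instead of if/else chains.
static const std::unordered_map<std::string, uint32_t> DTYPE_TO_GGML = {
    {"float32", 0},   // GGML_TYPE_F32
    {"float16", 1},   // GGML_TYPE_F16
    {"int8",    24},  // GGML_TYPE_I8
    {"int16",   25},  // GGML_TYPE_I16
    {"int32",   26},  // GGML_TYPE_I32
};
```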
Georgi Gerganov
3fe8d7a17f
ggml : designate enum vals for integer types (#6050) 2024-03-14 12:38:37 +02:00
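The point of designating values: these type IDs are serialized into GGUF files, so they must never shift when earlier enum entries are added or removed. A stripped-down sketch of the pattern, with values as designated at the time:

```cpp
// Explicit values pin the on-disk IDs; inserting a new quant type earlier
// in the enum can no longer renumber the integer types.
enum example_ggml_type : int {
    EXAMPLE_TYPE_F32 = 0,
    EXAMPLE_TYPE_F16 = 1,
    // ... quantized types elided ...
    EXAMPLE_TYPE_I8  = 24,
    EXAMPLE_TYPE_I16 = 25,
    EXAMPLE_TYPE_I32 = 26,
};
```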
Georgi Gerganov
68265ebfc6
embedding : print all resulting embeddings (#899) 2024-03-14 12:37:20 +02:00
Georgi Gerganov
381da2d9f0
metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
Georgi Gerganov
0fd6c1f015
embedding : print cosine similarity (#899) 2024-03-14 10:12:29 +02:00
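For reference, the cosine similarity printed here is the usual dot product over the product of norms. A self-contained sketch; the actual helper in the example may differ in details:

```cpp
#include <cmath>
#include <vector>

// cos(a, b) = (a . b) / (|a| * |b|); returns 0 on degenerate input.
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    const double denom = std::sqrt(na) * std::sqrt(nb);
    return denom > 0.0 ? (float) (dot / denom) : 0.0f;
}
```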
Concedo
f3b7651102 added ignoremissing param 2024-03-14 13:46:42 +08:00
Concedo
ec5dea14d7 merged, try to fix metal build 2024-03-14 11:15:50 +08:00
Linwei Wang
19885d205e
readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov
76a936c893
readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Clint Herron
463628372d
grammar : handle missing "root" node (#6004) 2024-03-13 20:10:40 +02:00
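A hedged sketch of the class of fix: a GBNF grammar is only usable if it defines a "root" rule, so the parser should report its absence instead of dereferencing a missing entry. Types and names here are illustrative, not the actual parser:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Returns false (with a diagnostic) rather than crashing when the grammar
// text never defined a "root" symbol.
static bool grammar_has_root(const std::map<std::string, uint32_t> & symbol_ids) {
    if (symbol_ids.find("root") == symbol_ids.end()) {
        fprintf(stderr, "grammar error: undefined 'root' rule\n");
        return false;
    }
    return true;
}
```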
slaren
f30ea47a87
llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increased to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
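The n_batch / n_ubatch split described above is the core of the scheme: a logical batch of up to n_batch tokens is decoded as a series of micro-batches of at most n_ubatch tokens, which is what lets multiple GPUs have different micro-batches in flight. A hedged sketch of the splitting loop only, not the scheduler:

```cpp
#include <algorithm>

// Illustrative: walk a logical batch in n_ubatch-sized micro-batches.
// The real scheduler overlaps these across devices; this shows the split.
static void decode_in_ubatches(int n_tokens, int n_ubatch) {
    for (int i = 0; i < n_tokens; i += n_ubatch) {
        const int n_cur = std::min(n_ubatch, n_tokens - i);
        // submit tokens [i, i + n_cur) for decoding ...
        (void) n_cur;
    }
}
```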
slaren
d8fd0ccf6a
test-backend-ops : skip CPU backend by default (#6028) 2024-03-13 15:58:30 +02:00
Concedo
9f102b9db6 update makefile 2024-03-13 21:53:52 +08:00
AidanBeltonS
b3d978600f
Update get version (#6025) 2024-03-13 18:47:54 +05:30
Xuan Son Nguyen
99b71c068f
Server: Use multi-task for embeddings endpoint (#6001)
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
Concedo
7a2de82c96 updated lite 2024-03-13 18:27:19 +08:00
Concedo
a9435163ab fixed uploading non square images 2024-03-13 14:19:51 +08:00
Concedo
85287c7701 handle uploading non square images 2024-03-13 13:57:14 +08:00
Concedo
47c42fd45c fix for mamba processing 2024-03-13 13:27:46 +08:00
Concedo
ba950716a9 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	build.zig
#	llama.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-llama.cpp
2024-03-13 11:21:58 +08:00
slaren
306d34be7a
ci : remove tidy-review (#6021) 2024-03-12 17:55:19 +02:00
Concedo
edb05e761f Update some prints 2024-03-12 21:40:36 +08:00
Concedo
88705cb89a improve quiet mode for SD 2024-03-12 20:50:39 +08:00
Georgi Gerganov
8030da7afe
ggml : reuse quantum structs across backends (#5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-12 14:27:20 +02:00
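The gist of the reuse: each quantized block layout is defined once in a shared header that every backend includes, instead of being re-declared per backend. As a concrete example, Q4_0's documented layout (one fp16 scale plus 32 4-bit quants) looks roughly like this; the fp16 typedef is a stand-in, since the storage type is backend-dependent:

```cpp
#include <cstdint>

typedef uint16_t ggml_half;    // fp16 storage stand-in

#define QK4_0 32
typedef struct {
    ggml_half d;               // delta / scale
    uint8_t   qs[QK4_0 / 2];   // 32 quants, two 4-bit values per byte
} block_q4_0;
```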
Concedo
60d234550b fix colab 2024-03-12 20:09:49 +08:00
Georgi Gerganov
184215e783
ggml : fix UB in IQ2_S and IQ3_S (#6012) 2024-03-12 13:49:55 +02:00
Concedo
6c6ad93f01 added basic support for password protection (+2 squashed commits)
Squashed commit:

[ff91ca72] added basic support for password protection

[91b0b208] updated docs
2024-03-12 19:47:12 +08:00
Georgi Gerganov
48358b2e5b
sycl : update IQ1_S kernels (WIP - not working!) (#5995)
* sycl : try to fix after IQ1_S changes

* sycl : iq1s_grid -> iq1s_grid_gpu

* sycl : fix grid type
2024-03-12 11:15:05 +02:00
Concedo
a69bc44e7a edit colab (+1 squashed commit)
Squashed commits:

[c7ccb99d] update colab with llava
2024-03-12 15:24:53 +08:00
gliptic
5cdb371731
grammar : fix unnecessarily retained pointer to rules (#6003) 2024-03-11 21:59:03 +02:00
Kawrakow
44ca159faf
1.5 bit: we can do even better (#5999)
* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with the new version.

~10% drop in performance, so this will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-11 17:53:15 +02:00
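A worked sketch of the decode rule the commit describes: one of the four scale bits now selects the sign of a 0.125 shift, so each 1.5-bit quant maps to -1+delta, delta, or 1+delta with delta = +/-0.125, scaled by the block scale. A hypothetical helper; the actual kernels differ per backend:

```cpp
// q is the ternary quant in {-1, 0, +1}; shift_up is the repurposed scale bit.
static inline float iq1s_dequant(int q, bool shift_up, float scale) {
    const float delta = shift_up ? 0.125f : -0.125f;
    return scale * ((float) q + delta);
}
```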
Georgi Gerganov
05b06210c9
llama : more consistent names of count variables (#5994)
* llama : more consistent names of count variables

ggml-ci

* llama : n_parallel -> n_seq_max

* common : fix param name

* examples : fix param name
2024-03-11 17:49:47 +02:00
Georgi Gerganov
83796e62bc
llama : refactor unicode stuff (#5992)
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass cpts as const ref
2024-03-11 17:47:47 +02:00
Concedo
6a32c14e86 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	flake.lock
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/.gitignore
#	tests/test-backend-ops.cpp
2024-03-11 23:00:47 +08:00
Concedo
9229ea664e if no existing filepath, do not use cwd, use last path instead 2024-03-11 22:19:38 +08:00
Stefan Kapusniak
4dd1c2b81a
Improve launcher file dialog initial paths (#740)
- In the launcher, if an existing value is set for a file value (e.g.
Model), use that file's directory as the initial directory when the
file dialog is opened with 'Browse'.
- In the launcher, always set the initial directory for 'Load' to
cwd.
2024-03-11 22:05:46 +08:00
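The launcher itself is Python, but the rule from this commit plus the follow-up above ("if no existing filepath, do not use cwd, use last path instead") is simple to state: prefer the directory of the file already selected, then fall back to the last-used path. A hypothetical C++ rendering of that rule:

```cpp
#include <filesystem>
#include <string>

// Illustrative helper (the real launcher is Python): pick the initial
// directory for a file dialog from the currently-set file value if any,
// otherwise from the last path used.
static std::string pick_initial_dir(const std::string & current_file, const std::string & last_path) {
    namespace fs = std::filesystem;
    if (!current_file.empty() && fs::exists(current_file)) {
        return fs::path(current_file).parent_path().string();
    }
    return last_path;
}
```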
Concedo
95c8090967 updated lite 2024-03-11 21:59:18 +08:00
Concedo
227f59dab6 added a simple program to do quantization for clip models 2024-03-11 21:50:30 +08:00
Jakub N
828defefb6
Update server docker image URLs (#5997) 2024-03-11 14:40:42 +01:00
Concedo
2dc647f892 updated lite (+1 squashed commit)
Squashed commits:

[f33ea44a] updated lite
2024-03-11 20:10:34 +08:00
Concedo
d59ec68753 added interrogate endpoint (+1 squashed commit)
Squashed commits:

[7bf96261] added interrogate endpoint
2024-03-11 18:50:18 +08:00
Xuan Son Nguyen
caa106d4e0
Server: format error to json (#5961)
* server: format error to json

* server: do not crash on grammar error

* fix api key test case

* revert limit max n_predict

* small fix

* correct coding style

* update completion.js

* launch_slot_with_task

* update docs

* update_slots

* update webui

* update readme
2024-03-11 10:56:41 +01:00
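The server already depends on nlohmann::json, so "format error to json" plausibly amounts to a helper along these lines; the exact field names are an assumption, not lifted from the patch:

```cpp
#include <string>
#include <nlohmann/json.hpp>

// Illustrative: wrap an error into a JSON body sent back to clients
// instead of letting the request crash the slot. Field names assumed.
static nlohmann::json format_error(int code, const std::string & type, const std::string & message) {
    return nlohmann::json {
        {"code",    code},
        {"type",    type},
        {"message", message},
    };
}
```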
Concedo
e4946b96ea support llava with gpt4v openai endpoint 2024-03-11 17:36:10 +08:00
Michael Podvitskiy
3202361c5b
ggml, ci : Windows ARM runner and build fixes (#5979)
* windows arm ci

* fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64

* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`

* fix `error C2065: '__fp16': undeclared identifier`
2024-03-11 11:28:51 +02:00
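For the C4146 item above, the standard fix pattern is to avoid applying unary minus to an unsigned operand, e.g. by subtracting from zero, which computes the same modular negation without the warning. A sketch of the pattern, not the exact patch:

```cpp
#include <cstdint>

// MSVC warns (C4146) on `-x` when x is unsigned; `0u - x` yields the same
// well-defined modular negation without triggering the warning.
static inline uint32_t neg_u32(uint32_t x) {
    return 0u - x;
}
```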