Commit graph

405 commits

Author SHA1 Message Date
Concedo
909e4334f9 fixed indentation in makefile 2024-04-08 16:27:13 +08:00
Concedo
87175db07d try to fix cuda build makefile 2024-04-08 16:18:13 +08:00
Concedo
22f543d09b Merge commit '32c8486e1f' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	build.zig
#	llama.cpp
#	tests/test-backend-ops.cpp
2024-04-07 20:39:17 +08:00
Concedo
a530afa1e4 Merge commit '280345968d' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/main-cuda.Dockerfile
#	.devops/nix/package.nix
#	.devops/server-cuda.Dockerfile
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	docs/token_generation_performance_tips.md
#	flake.lock
#	llama.cpp
#	scripts/LlamaConfig.cmake.in
#	scripts/compare-commits.sh
#	scripts/server-llm.sh
#	tests/test-quantize-fns.cpp
2024-04-07 20:27:17 +08:00
Concedo
bec16d182b Merge commit '2f34b865b6' into concedo_experimental
# Conflicts:
#	.clang-tidy
#	CMakeLists.txt
#	Makefile
#	ggml-cuda.cu
2024-04-07 18:30:35 +08:00
Concedo
0061299cce fixed quant tools not compiling, updated docs 2024-04-06 23:11:05 +08:00
Jared Van Bortel
32c8486e1f
wpm : portable unicode tolower (#6305)
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
2024-03-26 17:46:21 -04:00
slaren
280345968d
cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
slaren
ae1f211ce2
cuda : refactor into multiple files (#6269) 2024-03-25 13:50:23 +01:00
Minsoo Cheong
64e7b47c69
examples : add "retrieval" (#6193)
* add `retrieval` example

* add README

* minor fixes

* cast filepos on print

* remove use of variable sized array

* store similarities in separate vector

* print error on insufficient batch size

* fix error message printing

* assign n_batch value to n_ubatch

* fix param definitions

* define retrieval-only parameters in retrieval.cpp

* fix `--context-file` option to be provided multiple times for multiple files

* use vector for `query_emb`

* add usage description in README

* fix merge conflict

* fix usage printing

* remove seed setting

* fix lint

* increase file read buffer size

* retrieval : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-25 09:38:22 +02:00
Pierrick Hymbert
21cad01b6e
split: add gguf-split in the make build target (#6262) 2024-03-23 17:18:13 +01:00
Johannes Gäßler
50ccaf5eac
lookup: complement data from context with general text statistics (#5479)
* lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens
2024-03-23 01:24:36 +01:00
slaren
2f0e81e053
cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208)
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy

* add LLAMA_CUDA_NO_PEER_COPY to HIP build
2024-03-22 14:05:31 +01:00
Olivier Chafik
5b7b0ac8df
json-schema-to-grammar improvements (+ added to server) (#5978)
* json: fix arrays (disallow `[,1]`)

* json: support tuple types (`[number, string]`)

* json: support additionalProperties (`{[k: string]: [string,number][]}`)

* json: support required / optional properties

* json: add support for pattern

* json: resolve $ref (and support https schema urls)

* json: fix $ref resolution

* join: support union types (mostly for nullable types I think)

* json: support allOf + nested anyOf

* json: support any (`{}` or `{type: object}`)

* json: fix merge

* json: temp fix for escapes

* json: spaces in output and unrestricted output spaces

* json: add typings

* json:fix typo

* Create ts-type-to-grammar.sh

* json: fix _format_literal (json.dumps already escapes quotes)

* json: merge lit sequences and handle negatives

{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}

* json: handle pattern repetitions

* Update json-schema-to-grammar.mjs

* Create regex-to-grammar.py

* json: extract repeated regexp patterns to subrule

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* json: handle schema from pydantic Optional fields

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update ts-type-to-grammar.sh

* Update ts-type-to-grammar.sh

* json: simplify nullable fields handling

* json: accept duplicate identical rules

* json: revert space to 1 at most

* json: reuse regexp pattern subrules

* json: handle uuid string format

* json: fix literal escapes

* json: add --allow-fetch

* json: simplify range escapes

* json: support negative ranges in patterns

* Delete commit.txt

* json: custom regex parser, adds dot support & JS-portable

* json: rm trailing spaces

* Update json-schema-to-grammar.mjs

* json: updated server & chat `( cd examples/server && ./deps.sh )`

* json: port fixes from mjs to python

* Update ts-type-to-grammar.sh

* json: support prefixItems alongside array items

* json: add date format + fix uuid

* json: add date, time, date-time formats

* json: preserve order of props from TS defs

* json: port schema converter to C++, wire in ./server

* json: nits

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* json: fix mjs implementation + align outputs

* Update json-schema-to-grammar.mjs.hpp

* json: test C++, JS & Python versions

* json: nits + regen deps

* json: cleanup test

* json: revert from c++17 to 11

* json: nit fixes

* json: dirty include for test

* json: fix zig build

* json: pass static command to std::system in tests (fixed temp files)

* json: fix top-level $refs

* json: don't use c++20 designated initializers

* nit

* json: basic support for reserved names `{number:{number:{root:number}}}`

* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)

* json: re-ran server deps.sh

* json: simplify test

* json: support mix of additional props & required/optional

* json: add tests for some expected failures

* json: fix type=const in c++, add failure expectations for non-str const&enum

* json: test (& simplify output of) empty schema

* json: check parsing in test + fix value & string refs

* json: add server tests for OAI JSON response_format

* json: test/fix top-level anyOf

* json: improve grammar parsing failures

* json: test/fix additional props corner cases

* json: fix string patterns (was missing quotes)

* json: ws nit

* json: fix json handling in server when there's no response_format

* json: catch schema conversion errors in server

* json: don't complain about unknown format type in server if unset

* json: cleaner build of test

* json: create examples/json-schema-pydantic-example.py

* json: fix date pattern

* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common

* json: indent 4 spaces

* json: fix naming of top-level c++ function (+ drop unused one)

* json: avoid using namespace std

* json: fix zig build

* Update server.feature

* json: iostream -> fprintf

* json: space before & refs for consistency

* json: nits
2024-03-21 11:50:43 +00:00
Pierrick Hymbert
d0d5de42e5
gguf-split: split and merge gguf per batch of tensors (#6135)
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-19 12:05:44 +01:00
Pierrick Hymbert
d01b3c4c32
common: llama_load_model_from_url using --model-url (#6098)
* common: llama_load_model_from_url with libcurl dependency

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-17 19:12:37 +01:00
Georgi Gerganov
131b058409
make : ggml-metal.o depends on ggml.h 2024-03-15 11:38:40 +02:00
Georgi Gerganov
381da2d9f0
metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embeded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
Concedo
ec5dea14d7 merged, try to fix metal build 2024-03-14 11:15:50 +08:00
slaren
f30ea47a87
llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-casual attention

* llama : do not limit n_batch to n_ctx with non-casual attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
Concedo
9f102b9db6 update makefile 2024-03-13 21:53:52 +08:00
Concedo
ba950716a9 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	build.zig
#	llama.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-llama.cpp
2024-03-13 11:21:58 +08:00
Georgi Gerganov
83796e62bc
llama : refactor unicode stuff (#5992)
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass as cpts as const ref
2024-03-11 17:47:47 +02:00
Concedo
227f59dab6 added a simple program to do quantization for clip models 2024-03-11 21:50:30 +08:00
DAN™
bcebd7dbf6
llama : add support for GritLM (#5959)
* add gritlm example

* gritlm results match

* tabs to spaces

* comment out debug printing

* rebase to new embed

* gritlm embeddings are back babeee

* add to gitignore

* allow to toggle embedding mode

* Clean-up GritLM sample code.

* Fix types.

* Flush stdout and output ending newline if streaming.

* mostly style fixes; correct KQ_mask comment

* add causal_attn flag to llama_cparams

* gritml : minor

* llama : minor

---------

Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 17:56:30 +02:00
Concedo
c08d7e5042 wip integration of llava 2024-03-10 11:18:47 +08:00
Georgi Gerganov
8a3012a4ad
ggml : add ggml-common.h to deduplicate shared code (#5940)
* ggml : add ggml-common.h to shared code

ggml-ci

* scripts : update sync scripts

* sycl : reuse quantum tables

ggml-ci

* ggml : minor

* ggml : minor

* sycl : try to fix build
2024-03-09 12:47:57 +02:00
Gabe Goodhart
e1fa9569ba
server : add SSL support (#5926)
* add cmake build toggle to enable ssl support in server

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add flags for ssl key/cert files and use SSLServer if set

All SSL setup is hidden behind CPPHTTPLIB_OPENSSL_SUPPORT in the same
way that the base httlib hides the SSL support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Update readme for SSL support in server

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Add LLAMA_SERVER_SSL variable setup to top-level Makefile

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-03-09 11:57:09 +02:00
Georgi Gerganov
2002bc96bf
server : refactor (#5882)
* server : refactoring (wip)

* server : remove llava/clip objects from build

* server : fix empty prompt handling + all slots idle logic

* server : normalize id vars

* server : code style

* server : simplify model chat template validation

* server : code style

* server : minor

* llama : llama_chat_apply_template support null buf

* server : do not process embedding requests when disabled

* server : reorganize structs and enums + naming fixes

* server : merge oai.hpp in utils.hpp

* server : refactor system prompt update at start

* server : disable cached prompts with self-extend

* server : do not process more than n_batch tokens per iter

* server: tests: embeddings use a real embeddings model (#5908)

* server, tests : bump batch to fit 1 embedding prompt

* server: tests: embeddings fix build type Debug is randomly failing (#5911)

* server: tests: embeddings, use different KV Cache size

* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings

* server: tests: embeddings, no need to wait for server idle as it can timout

* server: refactor: clean up http code (#5912)

* server : avoid n_available var

ggml-ci

* server: refactor: better http codes

* server : simplify json parsing + add comment about t_last

* server : rename server structs

* server : allow to override FQDN in tests

ggml-ci

* server : add comments

---------

Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
2024-03-07 11:41:53 +02:00
Concedo
4eb3a95cbb reenable LCM sampler 2024-03-05 17:39:21 +08:00
Concedo
3463688a0e image generation is fully working over api (+1 squashed commits)
Squashed commits:

[c98ab0b4] single image generation is working now
2024-03-01 14:43:44 +08:00
Concedo
5a44d4de2b refactor and clean identifiers for sd, fix cmake 2024-02-29 18:28:45 +08:00
Concedo
f75e479db0 WIP on sdcpp integration 2024-02-29 00:40:07 +08:00
Concedo
8a919daafb add as library to makefile 2024-02-28 17:45:00 +08:00
Concedo
17355faf6e sdcpp is working! 2024-02-28 17:00:18 +08:00
Concedo
26696970ce initial files from sdcpp (not working) 2024-02-28 15:45:13 +08:00
le.chang
cbbd1efa06
Makefile: use variables for cublas (#5689)
* make: use arch variable for cublas

* fix UNAME_M

* check opt first

---------

Co-authored-by: lindeer <le.chang118@gmail.com>
2024-02-27 03:03:06 +01:00
kwin1412
f1a98c5254
make : fix nvcc version is empty (#5713)
fix nvcc version is empty
2024-02-25 18:46:49 +02:00
Concedo
b5ba6c9ece test to see if Ofast for ggml library plus batching adjustments fixes speed regression for ggmlv1 models 2024-02-25 21:14:53 +08:00
Concedo
f3a0e05d91 added noavx2 vulkan 2024-02-22 16:56:25 +08:00
CJ Pais
6560bed3f0
server : support llava 1.6 (#5553)
* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build
2024-02-20 21:07:22 +02:00
slaren
06bf2cf8c4
make : fix debug build with CUDA (#5616) 2024-02-20 20:06:17 +01:00
Haoxiang Fei
8dbbd75754
metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 11:58:36 +02:00
Jared Van Bortel
f24ed14ee0
make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598) 2024-02-19 15:54:12 -05:00
Georgi Gerganov
d0e3ce51f4
ci : enable -Werror for CUDA builds (#5579)
* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci
2024-02-19 14:45:41 +02:00
Georgi Gerganov
68a6b98b3c
make : fix CUDA build (#5580) 2024-02-19 13:41:51 +02:00
Xuan Son Nguyen
11b12de39b
llama : add llama_chat_apply_template() (#5538)
* llama: add llama_chat_apply_template

* test-chat-template: remove dedundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"
2024-02-19 10:23:37 +02:00
Jared Van Bortel
a0c2dad9d4
build : pass all warning flags to nvcc via -Xcompiler (#5570)
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
2024-02-18 16:21:52 -05:00
Ananta Bastola
6e4e973b26
ci : add an option to fail on compile warning (#3952)
* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-17 23:03:14 +02:00
henk717
01b7daf6b7
CUDA 12 support for the Conda Runtime (#680)
* Remove libculibos dependency

This dependency is something that is used to build libcudart which we are also already targeting. The individual file is no longer being distributed with the CUDA12 conda devkit, so we can no longer target it directly. But because all its functionality is inside libcudart we also don't need it.

This commit removes the inclusion so that Koboldcpp can be compiled with CUDA12 as distributed by conda. I have tested this on the 1.57 release on CUDA11.5 and CUDA12.1.

* Cleanup version definitions

The package versions are already controlled by the label, we don't need to define it multiple times for it to work correctly. Removing the separate definitions allows people to easily change which version of CUDA they wish for their system.
2024-02-16 21:32:30 +08:00