Commit graph

317 commits

Author SHA1 Message Date
Concedo
1edf83761a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/bench.yml.disabled
#	Makefile
#	README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-vulkan.cpp
2024-08-17 16:21:14 +08:00
Xuan Son Nguyen
8b3befc0e2
server : refactor middleware and /health endpoint (#9056)
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
Riceball LEE
37501d9c79
server : fix duplicated n_predict key in the generation_settings (#8994) 2024-08-15 10:28:05 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token (#8778) 2024-08-15 10:23:23 +03:00
Jiří Podivín
234b30676a
server : init stop and error fields of the result struct (#9026)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
Concedo
e8de0af3ec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/bench.yml
#	.github/workflows/build.yml
#	.github/workflows/python-check-requirements.yml
#	README.md
#	docs/backend/SYCL.md
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/kompute-shaders/op_rope_f16.comp
#	ggml/src/kompute-shaders/op_rope_f32.comp
#	ggml/src/kompute-shaders/rope_common.comp
2024-08-14 22:25:43 +08:00
compilade
98a532d474
server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
2024-08-14 09:51:02 +03:00
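The fix above hinges on decoding a long prompt in n_batch-sized slices instead of one oversized batch. A minimal sketch of that chunking pattern, assuming the era's four-argument llama_batch_get_one helper; decode_tokens is an illustrative name, not the server's actual symbol:

```cpp
#include <algorithm>
#include <vector>
#include "llama.h"

// Decode a token list in slices of at most n_batch tokens, so prompts
// longer than the batch size never produce an oversized llama_batch.
static bool decode_tokens(struct llama_context * ctx,
                          const std::vector<llama_token> & tokens,
                          int n_batch) {
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        const int n_eval = std::min<int>(n_batch, (int)(tokens.size() - i));
        llama_batch batch = llama_batch_get_one(
            const_cast<llama_token *>(tokens.data() + i), n_eval,
            /*pos_0 =*/ (llama_pos) i, /*seq_id =*/ 0);
        if (llama_decode(ctx, batch) != 0) {
            return false; // decoding failed partway through the prompt
        }
    }
    return true;
}
```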
Georgi Gerganov
5ef07e25ac
server : handle models with missing EOS token (#8997)
ggml-ci
2024-08-12 10:21:50 +03:00
Concedo
bdfe8526b8 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.gitignore
#	CONTRIBUTING.md
#	Makefile
#	examples/llava/CMakeLists.txt
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	src/llama-vocab.cpp
2024-08-10 11:42:32 +08:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings (#8936) 2024-08-09 09:32:02 +03:00
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint

* handle lora_no_apply

* fix build

* update docs

* clean up struct def

* fix build

* add LoRA test

* fix style
2024-08-06 17:33:39 +02:00
Concedo
e1f97f7fb5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/llama-server.Dockerfile
#	README.md
#	flake.lock
#	ggml/src/ggml-vulkan.cpp
#	ggml/src/vulkan-shaders/concat.comp
#	ggml/src/vulkan-shaders/pad.comp
#	ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	src/llama.cpp
#	tests/test-backend-ops.cpp
2024-08-06 16:33:26 +08:00
Liu Jia
0a4ce78681
common : Changed tuple to struct (TODO fix) (#8823)
* common : Changed tuple to struct (TODO fix)

Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>

* delete llama_init_default_params()

* delete the extra whitespace
2024-08-05 18:14:10 +02:00
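A minimal sketch of the replacement described above; the field defaults are an assumption, but the shape mirrors the commit body:

```cpp
// Named struct replacing std::tuple<llama_model *, llama_context *>,
// so call sites read result.model / result.context instead of
// std::get<0> / std::get<1>.
struct llama_init_result {
    struct llama_model   * model   = nullptr;
    struct llama_context * context = nullptr;
};

// Before: std::tie(model, ctx) = llama_init_from_gpt_params(params);
// After:
//   llama_init_result r = llama_init_from_gpt_params(params);
//   if (r.model == nullptr) { /* handle load failure */ }
```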
ardfork
978ba3d83d
Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params

* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
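A hedged sketch of the max_tokens fallback mentioned above, written against nlohmann::json directly rather than the server's own helpers; resolve_n_predict is a hypothetical name:

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Prefer the client-supplied OpenAI-style "max_tokens", otherwise
// fall back to the server's configured default.
static int resolve_n_predict(const json & body, int server_default) {
    if (body.contains("max_tokens") && body["max_tokens"].is_number_integer()) {
        return body["max_tokens"].get<int>();
    }
    return server_default;
}
```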
Concedo
24b9616344 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-cli.Dockerfile
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	CMakeLists.txt
#	CONTRIBUTING.md
#	Makefile
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	requirements.txt
#	src/llama.cpp
#	tests/test-backend-ops.cpp
2024-07-19 14:23:33 +08:00
RunningLeon
3807c3de04
server : respect --special cli arg (#8553) 2024-07-18 11:06:22 +03:00
Concedo
602661ba49 Merge commit 'c917b67f06' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	Makefile
#	ggml/src/ggml-cuda/mmq.cuh
#	tests/test-double-float.cpp
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
2024-07-14 11:38:20 +08:00
Douglas Hanley
c3ebcfa148
server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Concedo
2cad736260 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	.github/labeler.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	README.md
#	ci/run.sh
#	docs/build.md
#	examples/CMakeLists.txt
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	grammars/README.md
#	requirements/requirements-convert_hf_to_gguf.txt
#	requirements/requirements-convert_hf_to_gguf_update.txt
#	scripts/check-requirements.sh
#	scripts/compare-llama-bench.py
#	scripts/gen-unicode-data.py
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tests/test-tokenizer-random.py
2024-07-11 16:36:16 +08:00
Clint Herron
278d0e1846
Initialize default slot sampling parameters from the global context. (#8418) 2024-07-10 20:08:17 -04:00
Clint Herron
a59f8fdc85
Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-09 18:26:40 -04:00
Bjarke Viksøe
cb4d86c4d7
server: Retrieve prompt template in /props (#8337)
* server: Retrieve prompt template in /props

This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change the log level from Error to Warning for the template-mismatch warning.

The front-end stands a better chance of executing the Jinja template correctly than the server, which currently just guesses the format.

Ideally this should have been inside a JSON block that exposes the same key/value pairs as those listed during startup by the "llm_load_print_meta" function.

* Make string buffer dynamic

* Add doc and better string handling

* Using chat_template naming convention

* Use intermediate vector for string assignment
2024-07-07 11:10:38 +02:00
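A sketch (not the exact server code) of reading the chat template from GGUF metadata with a dynamically sized buffer, in the spirit of the "Make string buffer dynamic" step above; it assumes the public llama_model_meta_val_str API:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Query the metadata value's length first, then size the buffer to fit,
// instead of guessing a fixed buffer size.
static std::string get_chat_template(const struct llama_model * model) {
    const char * key = "tokenizer.chat_template";
    const int32_t len = llama_model_meta_val_str(model, key, nullptr, 0);
    if (len < 0) {
        return ""; // model carries no chat template
    }
    std::vector<char> buf(len + 1, 0);
    llama_model_meta_val_str(model, key, buf.data(), buf.size());
    return std::string(buf.data());
}
```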
Concedo
02f92f6ecc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	examples/llama.android/llama/src/main/cpp/CMakeLists.txt
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	grammars/README.md
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	tests/test-chat-template.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-06-30 10:59:42 +08:00
Sigbjørn Skjæret
38373cfbab
Add SPM infill support (#8016)
* add --spm-infill option

* support --spm-infill

* support --spm-infill
2024-06-28 12:53:43 +02:00
Concedo
f3dfa96dbc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	.github/workflows/docker.yml
#	README.md
#	llama.cpp
#	tests/test-chat-template.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-llama-grammar.cpp
2024-06-26 18:59:10 +08:00
Xuan Son Nguyen
48e6b92cc3
Add chat template support for llama-cli (#8068)
* add chat template support for llama-cli

* add help message

* server: simplify format_chat

* more consistent naming

* improve

* add llama_chat_format_example

* fix server

* code style

* code style

* Update examples/main/main.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
Concedo
92afdfcae4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/server.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	llama.cpp
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	scripts/sync-ggml.last
#	tests/test-tokenizer-random.py
2024-06-22 01:33:44 +08:00
sasha0552
ba58993152
server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
Sigbjørn Skjæret
91c188d6c2
Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
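An illustrative sketch of the guard this PR describes, assuming the era's llama_token_middle accessor; maybe_add_fim_middle is a hypothetical helper name:

```cpp
#include <vector>
#include "llama.h"

// Append the FIM middle token only when the model defines one;
// a negative id means the token is absent from the vocab.
static void maybe_add_fim_middle(const struct llama_model * model,
                                 std::vector<llama_token> & tokens) {
    const llama_token middle = llama_token_middle(model);
    if (middle >= 0) {
        tokens.push_back(middle);
    }
}
```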
Concedo
b53e760557 Merge commit '1c641e6aac' into concedo_experimental
# Conflicts:
#	.devops/cloud-v-pipeline
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-cli.Dockerfile
#	.devops/llama-cpp-clblast.srpm.spec
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/llama-cpp.srpm.spec
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	.devops/nix/apps.nix
#	.devops/nix/package.nix
#	.devops/tools.sh
#	.dockerignore
#	.github/ISSUE_TEMPLATE/01-bug-low.yml
#	.github/ISSUE_TEMPLATE/02-bug-medium.yml
#	.github/ISSUE_TEMPLATE/03-bug-high.yml
#	.github/ISSUE_TEMPLATE/04-bug-critical.yml
#	.github/workflows/bench.yml
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	.github/workflows/server.yml
#	.gitignore
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	docs/token_generation_performance_tips.md
#	flake.nix
#	grammars/README.md
#	pocs/vdot/CMakeLists.txt
#	scripts/get-hellaswag.sh
#	scripts/get-wikitext-103.sh
#	scripts/get-wikitext-2.sh
#	scripts/get-winogrande.sh
#	scripts/hf.sh
#	scripts/pod-llama.sh
#	scripts/qnt-all.sh
#	scripts/run-all-ppl.sh
#	scripts/run-with-preset.py
#	scripts/server-llm.sh
#	tests/test-backend-ops.cpp
2024-06-14 18:41:37 +08:00
Concedo
a8db72eca0 Merge commit 'ef52d1d16a' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	README.md
#	flake.lock
#	grammars/README.md
#	grammars/json.gbnf
#	grammars/json_arr.gbnf
#	tests/test-json-schema-to-grammar.cpp
2024-06-13 18:26:45 +08:00
Georgi Gerganov
704a35b183
server : restore numeric prompts (#7883) 2024-06-12 14:42:29 +03:00
Georgi Gerganov
d9da0e4986
server : improve "prompt" handling (#7847) 2024-06-10 14:59:55 +03:00
Concedo
562d980140 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-cuda.Dockerfile
#	.devops/main-rocm.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/main.Dockerfile
#	.devops/server-cuda.Dockerfile
#	.devops/server.Dockerfile
#	README.md
#	common/CMakeLists.txt
#	grammars/README.md
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-06-09 17:30:05 +08:00
sasha0552
7a16ce7db2
server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
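A self-contained sketch of Longest Common Prefix slot selection as described above; Slot and pick_slot are stand-ins for the server's real types, not its actual code:

```cpp
#include <cstddef>
#include <vector>

struct Slot {
    std::vector<int> cache_tokens; // tokens currently in this slot's KV cache
};

// Length of the shared prefix between two token sequences.
static size_t common_prefix_len(const std::vector<int> & a,
                                const std::vector<int> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}

// Prefer the slot whose cached tokens share the longest prefix with the
// incoming prompt, so more of the KV cache can be reused.
static int pick_slot(const std::vector<Slot> & slots,
                     const std::vector<int> & prompt) {
    int    best     = -1;
    size_t best_lcp = 0;
    for (int i = 0; i < (int) slots.size(); i++) {
        const size_t lcp = common_prefix_len(slots[i].cache_tokens, prompt);
        if (lcp > best_lcp) {
            best_lcp = lcp;
            best     = i;
        }
    }
    return best; // -1 -> no cached prefix; fall back to any free slot
}
```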
woodx
a5cabd7649
server : do not get prompt in infill mode (#7286)
* avoid getting the prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
Georgi Gerganov
f83351f9a6
imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
Concedo
6659742a2d do not merge the removal of opencl 2024-06-05 10:57:52 +08:00
Georgi Gerganov
1442677f92
common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remove --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Concedo
a97f7d5f91 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/main-cuda.Dockerfile
#	.devops/main-intel.Dockerfile
#	.devops/main-rocm.Dockerfile
#	.devops/main.Dockerfile
#	.devops/server-cuda.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-rocm.Dockerfile
#	.devops/server.Dockerfile
#	.devops/tools.sh
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	llama.cpp
#	requirements.txt
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	requirements/requirements-convert-llama-ggml-to-gguf.txt
#	scripts/check-requirements.sh
#	scripts/compare-llama-bench.py
#	scripts/convert-gg.sh
#	scripts/pod-llama.sh
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-tokenizer-0.sh
#	tests/test-tokenizer-random.py
2024-06-02 12:28:38 +08:00
Yazan Agha-Schrader
2e666832e6
server : new UI (#7633)
* ic

* migrate my early work

* add the accompanying assets: css, favicon, etc.

* de prompts

* chore: Update HTML meta tags in index.html file

* add api-key css classes

* some necessary fixes

* Add API key CSS classes and update styling in style.css

* clean the code

* move API to the top, rearrange param sliders. update css

* add tooltips to the parameters with comprehensible explanations

* fix FloatField and BoolField tooltips

* fix grammar field width

* use template literals for promptFormats.js

* update const ModelGenerationInfo

* remove ms per token, since it is not relevant for most webui users and use cases

* add phi-3 prompt template

* add phi3 to dropdown

* add css class

* update forgotten css theme

* add user message suffix

* fix chatml & add llama3 format

* fix llama3 prompt template

* more prompt format fixes

* add more common stop tokens

* add missing char

* do not separate with new line or comma

* move prompt style

* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js

* fix toggle state localstorage

* add cmd-r prompt and reduce redundancy

* set default prompt to empty

* move files, clean code

* fix css path

* add a button to the new ui

* move new ui to "/public" due to otherwise problematic CORS behaviour

* include new ui in cpp

* fix wrong link to old ui

* renaming to ensure consistency

* fix typos "prompt-format" -> "prompt-formats"

* use correct indent

* add new ui files to makefile

* fix typo
2024-06-01 22:31:48 +03:00
Concedo
9282c307ed this commit does not work; it is just for debugging 2024-05-23 20:13:47 +08:00
Georgi Gerganov
6ff13987ad
common : normalize naming style (#7462)
* common : normalize naming style

ggml-ci

* common : match declaration / definition order

* zig : try to fix build
2024-05-22 20:04:20 +03:00
Concedo
52f9911240 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	requirements.txt
#	scripts/LlamaConfig.cmake.in
2024-05-21 19:05:52 +08:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
Johannes Gäßler
41858392e1
server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
Concedo
47cbfd6150 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	llama.cpp
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/test-backend-ops.cpp
2024-05-17 22:30:41 +08:00
Radoslav Gerganov
ee94172d33
server : add support for the RPC backend (#7305)
ref: #7292
2024-05-17 10:00:17 +03:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
Concedo
2ee808a747 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	README.md
#	ci/run.sh
#	llama.cpp
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	requirements.txt
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml.last
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-1-bpe.cpp
2024-05-14 19:28:47 +08:00