Concedo
1edf83761a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/bench.yml.disabled
# Makefile
# README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-vulkan.cpp
2024-08-17 16:21:14 +08:00
Xuan Son Nguyen
8b3befc0e2
server : refactor middleware and /health endpoint ( #9056 )
...
* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
Riceball LEE
37501d9c79
server : fix duplicated n_predict key in the generation_settings ( #8994 )
2024-08-15 10:28:05 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token ( #8778 )
2024-08-15 10:23:23 +03:00
Jiří Podivín
234b30676a
server : init stop and error fields of the result struct ( #9026 )
...
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
Concedo
e8de0af3ec
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/bench.yml
# .github/workflows/build.yml
# .github/workflows/python-check-requirements.yml
# README.md
# docs/backend/SYCL.md
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/kompute-shaders/op_rope_f16.comp
# ggml/src/kompute-shaders/op_rope_f32.comp
# ggml/src/kompute-shaders/rope_common.comp
2024-08-14 22:25:43 +08:00
compilade
98a532d474
server : fix segfault on long system prompt ( #8987 )
...
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment
2024-08-14 09:51:02 +03:00
Georgi Gerganov
5ef07e25ac
server : handle models with missing EOS token ( #8997 )
...
ggml-ci
2024-08-12 10:21:50 +03:00
Concedo
bdfe8526b8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .gitignore
# CONTRIBUTING.md
# Makefile
# examples/llava/CMakeLists.txt
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# src/llama-vocab.cpp
2024-08-10 11:42:32 +08:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings ( #8936 )
2024-08-09 09:32:02 +03:00
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) ( #8857 )
...
* server : add lora hotswap endpoint
* handle lora_no_apply
* fix build
* update docs
* clean up struct def
* fix build
* add LoRA test
* fix style
2024-08-06 17:33:39 +02:00
Concedo
e1f97f7fb5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-server.Dockerfile
# README.md
# flake.lock
# ggml/src/ggml-vulkan.cpp
# ggml/src/vulkan-shaders/concat.comp
# ggml/src/vulkan-shaders/pad.comp
# ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-backend-ops.cpp
2024-08-06 16:33:26 +08:00
Liu Jia
0a4ce78681
common : Changed tuple to struct (TODO fix) ( #8823 )
...
* common : Changed tuple to struct (TODO fix)
Use struct `llama_init_result` to replace the previous `std::tuple<struct llama_model *, struct llama_context *>`
* delete llama_init_default_params()
* delete the extra whitespace
2024-08-05 18:14:10 +02:00
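For readers skimming the log, a minimal sketch of the change that commit describes — the anonymous tuple replaced by a named struct. Member names and the commented call site are illustrative assumptions; the real definition in `common` may differ.

```cpp
// Before: callers received an anonymous tuple and had to remember the element order:
//   std::tuple<struct llama_model *, struct llama_context *>
// After: a named struct makes the two handles self-documenting.
struct llama_init_result {
    struct llama_model   * model   = nullptr;
    struct llama_context * context = nullptr;
};

// Hypothetical call site (exact function name and signature may differ):
//   llama_init_result init = llama_init_from_gpt_params(params);
//   if (init.model == nullptr || init.context == nullptr) { /* bail out */ }
```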
ardfork
978ba3d83d
Server: Don't ignore llama.cpp params ( #8754 )
...
* Don't ignore llama.cpp params
* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
Concedo
24b9616344
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/full.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-intel.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cli.Dockerfile
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-intel.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# CMakeLists.txt
# CONTRIBUTING.md
# Makefile
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# requirements.txt
# src/llama.cpp
# tests/test-backend-ops.cpp
2024-07-19 14:23:33 +08:00
RunningLeon
3807c3de04
server : respect --special cli arg ( #8553 )
2024-07-18 11:06:22 +03:00
Concedo
602661ba49
Merge commit 'c917b67f06' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# Makefile
# ggml/src/ggml-cuda/mmq.cuh
# tests/test-double-float.cpp
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
2024-07-14 11:38:20 +08:00
Douglas Hanley
c3ebcfa148
server : ensure batches are either all embed or all completion ( #8420 )
...
* make sure batches are all embed or all non-embed
* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Concedo
2cad736260
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/labeler.yml
# .gitignore
# CMakeLists.txt
# Makefile
# Package.swift
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/check-requirements.sh
# scripts/compare-llama-bench.py
# scripts/gen-unicode-data.py
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-tokenizer-random.py
2024-07-11 16:36:16 +08:00
Clint Herron
278d0e1846
Initialize default slot sampling parameters from the global context. ( #8418 )
2024-07-10 20:08:17 -04:00
Clint Herron
a59f8fdc85
Server: Enable setting default sampling parameters via command-line ( #8402 )
...
* Load server sampling parameters from the server context by default.
* Wordsmithing comment
2024-07-09 18:26:40 -04:00
Bjarke Viksøe
cb4d86c4d7
server: Retrieve prompt template in /props ( #8337 )
...
* server: Retrieve prompt template in /props
This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change the log level from Error to Warning for the template-mismatch warning.
The front end stands a better chance of actually executing the Jinja template format correctly; the server is currently just guessing it.
Ideally this should have been inside a JSON block that exposes the same key/value pairs as those listed during startup in the "llm_load_print_meta" function.
* Make string buffer dynamic
* Add doc and better string handling
* Using chat_template naming convention
* Use intermediate vector for string assignment
2024-07-07 11:10:38 +02:00
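As a rough illustration of what that /props addition amounts to: a sketch only, using nlohmann::json as the llama.cpp server does. The "chat_template" key is the naming the commit settled on; the helper and the other field are assumptions for illustration.

```cpp
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Build a /props-style payload carrying the model's raw Jinja2 chat template.
static json build_props_response(const std::string & chat_template, int n_ctx) {
    return json {
        { "chat_template", chat_template }, // template string read from the model metadata
        { "n_ctx",         n_ctx         }  // example of another startup value a front end may want
    };
}
```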
Concedo
02f92f6ecc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cpp-cuda.srpm.spec
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# README.md
# examples/llama.android/llama/src/main/cpp/CMakeLists.txt
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# tests/test-chat-template.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
2024-06-30 10:59:42 +08:00
Sigbjørn Skjæret
38373cfbab
Add SPM infill support ( #8016 )
...
* add --spm-infill option
* support --spm-infill
* support --spm-infill
2024-06-28 12:53:43 +02:00
Concedo
f3dfa96dbc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# .github/workflows/docker.yml
# README.md
# llama.cpp
# tests/test-chat-template.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-llama-grammar.cpp
2024-06-26 18:59:10 +08:00
Xuan Son Nguyen
48e6b92cc3
Add chat template support for llama-cli ( #8068 )
...
* add chat template support for llama-cli
* add help message
* server: simplify format_chat
* more consistent naming
* improve
* add llama_chat_format_example
* fix server
* code style
* code style
* Update examples/main/main.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
Concedo
92afdfcae4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/server.yml
# .gitignore
# CMakeLists.txt
# Makefile
# README-sycl.md
# README.md
# llama.cpp
# requirements/requirements-convert-hf-to-gguf-update.txt
# requirements/requirements-convert-hf-to-gguf.txt
# requirements/requirements-convert-legacy-llama.txt
# scripts/sync-ggml.last
# tests/test-tokenizer-random.py
2024-06-22 01:33:44 +08:00
sasha0552
ba58993152
server : fix smart slot selection ( #8020 )
2024-06-20 09:57:10 +10:00
Sigbjørn Skjæret
91c188d6c2
Only use FIM middle token if it exists ( #7648 )
...
* Only use FIM middle if it exists
* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
Concedo
b53e760557
Merge commit '1c641e6aac' into concedo_experimental
...
# Conflicts:
# .devops/cloud-v-pipeline
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cli.Dockerfile
# .devops/llama-cpp-clblast.srpm.spec
# .devops/llama-cpp-cuda.srpm.spec
# .devops/llama-cpp.srpm.spec
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# .devops/nix/apps.nix
# .devops/nix/package.nix
# .devops/tools.sh
# .dockerignore
# .github/ISSUE_TEMPLATE/01-bug-low.yml
# .github/ISSUE_TEMPLATE/02-bug-medium.yml
# .github/ISSUE_TEMPLATE/03-bug-high.yml
# .github/ISSUE_TEMPLATE/04-bug-critical.yml
# .github/workflows/bench.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# .github/workflows/server.yml
# .gitignore
# Makefile
# README-sycl.md
# README.md
# ci/run.sh
# docs/token_generation_performance_tips.md
# flake.nix
# grammars/README.md
# pocs/vdot/CMakeLists.txt
# scripts/get-hellaswag.sh
# scripts/get-wikitext-103.sh
# scripts/get-wikitext-2.sh
# scripts/get-winogrande.sh
# scripts/hf.sh
# scripts/pod-llama.sh
# scripts/qnt-all.sh
# scripts/run-all-ppl.sh
# scripts/run-with-preset.py
# scripts/server-llm.sh
# tests/test-backend-ops.cpp
2024-06-14 18:41:37 +08:00
Concedo
a8db72eca0
Merge commit 'ef52d1d16a' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# README.md
# flake.lock
# grammars/README.md
# grammars/json.gbnf
# grammars/json_arr.gbnf
# tests/test-json-schema-to-grammar.cpp
2024-06-13 18:26:45 +08:00
Georgi Gerganov
704a35b183
server : restore numeric prompts ( #7883 )
2024-06-12 14:42:29 +03:00
Georgi Gerganov
d9da0e4986
server : improve "prompt" handling ( #7847 )
2024-06-10 14:59:55 +03:00
Concedo
562d980140
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full.Dockerfile
# .devops/main-cuda.Dockerfile
# .devops/main-rocm.Dockerfile
# .devops/main-vulkan.Dockerfile
# .devops/main.Dockerfile
# .devops/server-cuda.Dockerfile
# .devops/server.Dockerfile
# README.md
# common/CMakeLists.txt
# grammars/README.md
# tests/test-grammar-integration.cpp
# tests/test-grammar-parser.cpp
# tests/test-json-schema-to-grammar.cpp
2024-06-09 17:30:05 +08:00
sasha0552
7a16ce7db2
server : smart slot selection using Longest Common Prefix ( #7728 )
...
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
2024-06-08 10:50:31 +03:00
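The idea behind the Longest Common Prefix selection above is small enough to sketch: prefer the idle slot whose cached tokens share the longest prefix with the incoming prompt, so the most KV-cache work can be reused. This is an illustrative implementation of the technique, not the server's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Length of the common prefix of two token sequences.
static size_t common_prefix_len(const std::vector<llama_token> & a,
                                const std::vector<llama_token> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}

// Return the index of the slot with the longest shared prefix, or -1 if none match.
static int pick_slot(const std::vector<std::vector<llama_token>> & slot_caches,
                     const std::vector<llama_token> & prompt) {
    int    best     = -1;
    size_t best_len = 0;
    for (size_t s = 0; s < slot_caches.size(); s++) {
        const size_t len = common_prefix_len(slot_caches[s], prompt);
        if (len > best_len) {
            best_len = len;
            best     = (int) s;
        }
    }
    return best;
}
```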
woodx
a5cabd7649
server : do not get prompt in infill mode ( #7286 )
...
* avoid getting the prompt in infill mode and embedding mode
* remove embedding mode
* refactor format
---------
Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
Georgi Gerganov
f83351f9a6
imatrix : migrate to gpt_params ( #7771 )
...
* imatrix : migrate to gpt_params
ggml-ci
* imatrix : add --save-frequency cli arg
* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
Concedo
6659742a2d
do not merge the removal of opencl
2024-06-05 10:57:52 +08:00
Georgi Gerganov
1442677f92
common : refactor cli arg parsing ( #7675 )
...
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remove --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Concedo
a97f7d5f91
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/full.Dockerfile
# .devops/main-cuda.Dockerfile
# .devops/main-intel.Dockerfile
# .devops/main-rocm.Dockerfile
# .devops/main.Dockerfile
# .devops/server-cuda.Dockerfile
# .devops/server-intel.Dockerfile
# .devops/server-rocm.Dockerfile
# .devops/server.Dockerfile
# .devops/tools.sh
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# README-sycl.md
# README.md
# ci/run.sh
# llama.cpp
# requirements.txt
# requirements/requirements-convert-hf-to-gguf-update.txt
# requirements/requirements-convert-hf-to-gguf.txt
# requirements/requirements-convert-legacy-llama.txt
# requirements/requirements-convert-llama-ggml-to-gguf.txt
# scripts/check-requirements.sh
# scripts/compare-llama-bench.py
# scripts/convert-gg.sh
# scripts/pod-llama.sh
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-tokenizer-0.sh
# tests/test-tokenizer-random.py
2024-06-02 12:28:38 +08:00
Yazan Agha-Schrader
2e666832e6
server : new UI ( #7633 )
...
* ic
* migrate my early work
* add the accompanying assets: css, favicon, etc.
* de prompts
* chore: Update HTML meta tags in index.html file
* add api-key css classes
* some necessary fixes
* Add API key CSS classes and update styling in style.css
* clean the code
* move API to the top, rearrange param sliders. update css
* add tooltips to the parameters with comprehensible explanations
* fix FloatField and BoolField tooltips
* fix grammar field width
* use template literals for promptFormats.js
* update const ModelGenerationInfo
* remove ms per token, since not relevant for most webui users and use cases
* add phi-3 prompt template
* add phi3 to dropdown
* add css class
* update forgotten css theme
* add user message suffix
* fix chatml & add llama3 format
* fix llama3 prompt template
* more prompt format fixes
* add more common stop tokens
* add missing char
* do not separate with new line or comma
* move prompt style
* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js
* fix toggle state localstorage
* add cmd-r prompt and reduce redundancy
* set default prompt to empty
* move files, clean code
* fix css path
* add a button to the new ui
* move new ui to "/public" due to otherwise problematic CORS behaviour
* include new ui in cpp
* fix wrong link to old ui
* renaming to ensure consistency
* fix typos "prompt-format" -> "prompt-formats"
* use correct indent
* add new ui files to makefile
* fix typo
2024-06-01 22:31:48 +03:00
Concedo
9282c307ed
this commit does not work, just for debugging
2024-05-23 20:13:47 +08:00
Georgi Gerganov
6ff13987ad
common : normalize naming style ( #7462 )
...
* common : normalize naming style
ggml-ci
* common : match declaration / definition order
* zig : try to fix build
2024-05-22 20:04:20 +03:00
Concedo
52f9911240
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# Makefile
# README.md
# requirements.txt
# scripts/LlamaConfig.cmake.in
2024-05-21 19:05:52 +08:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input ( #7389 )
2024-05-20 08:56:05 +03:00
Johannes Gäßler
41858392e1
server: fix seed being reported back ( #7382 )
2024-05-19 17:06:33 +03:00
Concedo
47cbfd6150
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# README.md
# llama.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-backend-ops.cpp
2024-05-17 22:30:41 +08:00
Radoslav Gerganov
ee94172d33
server : add support for the RPC backend ( #7305 )
...
ref: #7292
2024-05-17 10:00:17 +03:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit ( #7264 )
...
* server: free sampling contexts on exit
This cleans up the last leak found by the address sanitizer.
* fix whitespace
* fix whitespace
2024-05-14 16:11:24 +02:00
Concedo
2ee808a747
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# README.md
# ci/run.sh
# llama.cpp
# models/ggml-vocab-llama-bpe.gguf.inp
# models/ggml-vocab-llama-bpe.gguf.out
# requirements.txt
# scripts/compare-llama-bench.py
# scripts/sync-ggml.last
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-grammar-integration.cpp
# tests/test-tokenizer-1-bpe.cpp
2024-05-14 19:28:47 +08:00