Concedo
b2c1ff7a13
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .ecrc
# CMakePresets.json
# ci/run.sh
# docs/backend/SYCL.md
# ggml/src/CMakeLists.txt
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-sampling.cpp
2024-08-27 17:46:40 +08:00
Georgi Gerganov
e5edb210cd
server : update deps ( #9183 )
2024-08-26 12:16:57 +03:00
Xuan Son Nguyen
fc54ef0d1c
server : support reading arguments from environment variables ( #9105 )
...
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
2024-08-21 11:04:34 +02:00
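A rough sketch of how the environment-variable support from the commit above might be exercised: launching llama-server with its configuration taken from the environment instead of CLI flags. The variable names (LLAMA_ARG_MODEL, LLAMA_ARG_CTX_SIZE, LLAMA_ARG_N_GPU_LAYERS), the model path, and the binary location are illustrative assumptions, not taken from this log; check `llama-server --help` for the actual list.

```python
import os
import subprocess

# Hypothetical sketch: pass server settings via environment variables
# rather than CLI flags. The variable names below are assumptions made
# for illustration only.
env = os.environ.copy()
env["LLAMA_ARG_MODEL"] = "models/model.gguf"   # assumed equivalent of --model
env["LLAMA_ARG_CTX_SIZE"] = "4096"             # assumed equivalent of --ctx-size
env["LLAMA_ARG_N_GPU_LAYERS"] = "99"           # assumed equivalent of --n-gpu-layers

# Launch the server with no explicit arguments; settings come from the env.
subprocess.run(["./llama-server"], env=env, check=True)
```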
Concedo
1edf83761a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/bench.yml.disabled
# Makefile
# README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-vulkan.cpp
2024-08-17 16:21:14 +08:00
Xuan Son Nguyen
8b3befc0e2
server : refactor middleware and /health endpoint ( #9056 )
...
* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
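A minimal sketch of how a client might use the endpoints touched by the refactor above: /health as a plain liveness check, and /slots (where "fail_on_no_slot" was moved) for slot availability. The port, the query-parameter spelling, and the response shapes are assumptions for illustration.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

BASE = "http://localhost:8080"  # assumed default host/port

# Liveness: after the refactor, /health is expected to answer as soon as
# the server is up, without any slot-related query parameters.
with urlopen(f"{BASE}/health") as resp:
    print("health:", json.load(resp))

# Slot availability: "fail_on_no_slot" was moved to /slots in this change;
# the exact parameter spelling and response shape are assumptions here.
try:
    with urlopen(f"{BASE}/slots?fail_on_no_slot=1") as resp:
        print("slots:", json.load(resp))
except HTTPError as err:
    print("no free slot:", err.code)
```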
Riceball LEE
37501d9c79
server : fix duplicated n_predict key in the generation_settings ( #8994 )
2024-08-15 10:28:05 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token ( #8778 )
2024-08-15 10:23:23 +03:00
Jiří Podivín
234b30676a
server : init stop and error fields of the result struct ( #9026 )
...
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
Concedo
e8de0af3ec
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/bench.yml
# .github/workflows/build.yml
# .github/workflows/python-check-requirements.yml
# README.md
# docs/backend/SYCL.md
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/kompute-shaders/op_rope_f16.comp
# ggml/src/kompute-shaders/op_rope_f32.comp
# ggml/src/kompute-shaders/rope_common.comp
2024-08-14 22:25:43 +08:00
compilade
98a532d474
server : fix segfault on long system prompt ( #8987 )
...
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment
2024-08-14 09:51:02 +03:00
Georgi Gerganov
5ef07e25ac
server : handle models with missing EOS token ( #8997 )
...
ggml-ci
2024-08-12 10:21:50 +03:00
Concedo
bdfe8526b8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .gitignore
# CONTRIBUTING.md
# Makefile
# examples/llava/CMakeLists.txt
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# src/llama-vocab.cpp
2024-08-10 11:42:32 +08:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings ( #8936 )
2024-08-09 09:32:02 +03:00
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) ( #8857 )
...
* server : add lora hotswap endpoint
* handle lora_no_apply
* fix build
* update docs
* clean up struct def
* fix build
* add LoRA test
* fix style
2024-08-06 17:33:39 +02:00
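A hedged sketch of how the LoRA hotswap endpoint from the commit above might be driven by a client; the endpoint path (/lora-adapters), the HTTP methods, and the request body shape are assumptions based on this log's description, not verified against the server code.

```python
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:8080"  # assumed default host/port

# List currently loaded adapters (assumed GET endpoint).
with urlopen(f"{BASE}/lora-adapters") as resp:
    print("adapters:", json.load(resp))

# Hot-swap: set per-adapter scales without restarting the server.
# The path and payload shape are illustrative assumptions.
payload = json.dumps([{"id": 0, "scale": 1.0}]).encode()
req = Request(f"{BASE}/lora-adapters", data=payload,
              headers={"Content-Type": "application/json"}, method="POST")
with urlopen(req) as resp:
    print("apply:", resp.status)
```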
Concedo
e1f97f7fb5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-server.Dockerfile
# README.md
# flake.lock
# ggml/src/ggml-vulkan.cpp
# ggml/src/vulkan-shaders/concat.comp
# ggml/src/vulkan-shaders/pad.comp
# ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-backend-ops.cpp
2024-08-06 16:33:26 +08:00
Liu Jia
0a4ce78681
common : Changed tuple to struct (TODO fix) ( #8823 )
...
* common : Changed tuple to struct (TODO fix)
Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>
* delete llama_init_default_params()
* delete the extra whitespace
2024-08-05 18:14:10 +02:00
ardfork
978ba3d83d
Server: Don't ignore llama.cpp params ( #8754 )
...
* Don't ignore llama.cpp params
* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
Concedo
101efb66af
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# CMakeLists.txt
# Makefile
2024-08-01 10:54:28 +08:00
Igor Okulist
afbbcf3c04
server : update llama-server embedding flag documentation ( #8779 )
...
Fixes #8763
2024-07-31 19:59:09 -04:00
Concedo
ba5babb876
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/apps.nix
# .devops/tools.sh
# Makefile
# README.md
# docs/backend/SYCL.md
# docs/build.md
# examples/CMakeLists.txt
# ggml/include/ggml.h
# src/llama-vocab.cpp
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-sampling.cpp
2024-07-27 23:15:54 +08:00
Yaiko
01aec4a631
server : add Speech Recognition & Synthesis to UI ( #8679 )
...
* server : add Speech Recognition & Synthesis to UI
* server : add Speech Recognition & Synthesis to UI (fixes)
2024-07-26 00:10:16 +02:00
Ujjawal Panchal
4b0eff3df5
docs : Quantum -> Quantized ( #8666 )
...
* docfix: imatrix readme, quantum models -> quantized models.
* docfix: server readme: quantum models -> quantized models.
2024-07-25 11:13:27 +03:00
Concedo
01d5175654
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# Makefile
# ggml/src/CMakeLists.txt
2024-07-24 16:41:33 +08:00
Vali Malinoiu
b841d07408
server : fix URL.parse in the UI ( #8646 )
2024-07-23 17:37:42 +03:00
Concedo
c81d1623b4
Merge commit '751fcfc6c3' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CONTRIBUTING.md
# README.md
# flake.lock
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
2024-07-23 19:18:05 +08:00
Jan Boon
628154492a
server : update doc to clarify n_keep when there is bos token ( #8619 )
2024-07-22 11:02:09 +03:00
Concedo
24b9616344
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/full.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-intel.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cli.Dockerfile
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-intel.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# CMakeLists.txt
# CONTRIBUTING.md
# Makefile
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# requirements.txt
# src/llama.cpp
# tests/test-backend-ops.cpp
2024-07-19 14:23:33 +08:00
Eric Zhang
0d2c7321e9
server: use relative routes for static files in new UI ( #8552 )
...
* server: public: fix api_url on non-index pages
* server: public: use relative routes for static files in new UI
2024-07-18 12:43:49 +02:00
RunningLeon
3807c3de04
server : respect --special cli arg ( #8553 )
2024-07-18 11:06:22 +03:00
Xuan Son Nguyen
4db8f60fe7
fix ci ( #8494 )
2024-07-15 19:23:10 +02:00
Concedo
e707ab9025
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/development/HOWTO-add-model.md
# docs/development/token_generation_performance_tips.md
# flake.lock
2024-07-16 00:49:34 +08:00
M-A
f17f39ff9c
server: update README.md with llama-server --help output [no ci] ( #8472 )
...
The README.md had stale information. In particular, the --ctx-size
"defaults to 512" confused me and I had to check the code to confirm
this was false. Since the server is evolving rapidly, it's probably
better to keep the source of truth in a single place (the source) and
generate the README.md from it.
Did:
make llama-server
./llama-server --help > t.txt
vimdiff t.txt examples/server/README.md
I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow up could be to
automate this process with a script.
No functional change.
2024-07-15 15:04:56 +03:00
Concedo
602661ba49
Merge commit 'c917b67f06' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# Makefile
# ggml/src/ggml-cuda/mmq.cuh
# tests/test-double-float.cpp
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
2024-07-14 11:38:20 +08:00
Georgi Gerganov
4e24cffd8c
server : handle content array in chat API ( #8449 )
...
* server : handle content array in chat API
* Update examples/server/utils.hpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-12 14:48:15 +03:00
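For reference, a small sketch of the kind of request the change above concerns: an OpenAI-style /v1/chat/completions call where "content" is an array of parts rather than a single string. The payload, port, and the exact set of accepted part types are illustrative assumptions.

```python
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:8080"  # assumed default host/port

# "content" given as an array of text parts instead of a plain string.
body = {
    "messages": [
        {"role": "user",
         "content": [
             {"type": "text", "text": "Hello,"},
             {"type": "text", "text": "what is the capital of France?"},
         ]},
    ],
}
req = Request(f"{BASE}/v1/chat/completions",
              data=json.dumps(body).encode(),
              headers={"Content-Type": "application/json"}, method="POST")
with urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```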
Douglas Hanley
c3ebcfa148
server : ensure batches are either all embed or all completion ( #8420 )
...
* make sure batches are all embed or all non-embed
* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Concedo
2cad736260
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/labeler.yml
# .gitignore
# CMakeLists.txt
# Makefile
# Package.swift
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/check-requirements.sh
# scripts/compare-llama-bench.py
# scripts/gen-unicode-data.py
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-tokenizer-random.py
2024-07-11 16:36:16 +08:00
Clint Herron
278d0e1846
Initialize default slot sampling parameters from the global context. ( #8418 )
2024-07-10 20:08:17 -04:00
Clint Herron
a59f8fdc85
Server: Enable setting default sampling parameters via command-line ( #8402 )
...
* Load server sampling parameters from the server context by default.
* Wordsmithing comment
2024-07-09 18:26:40 -04:00
compilade
3fd62a6b1c
py : type-check all Python scripts with Pyright ( #8341 )
...
* py : type-check all Python scripts with Pyright
* server-tests : use trailing slash in openai base_url
* server-tests : add more type annotations
* server-tests : strip "chat" from base_url in oai_chat_completions
* server-tests : model metadata is a dict
* ci : disable pip cache in type-check workflow
The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.
* py : fix new type errors from master branch
* tests : fix test-tokenizer-random.py
Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.
* ci : only show warnings and errors in python type-check
The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
2024-07-07 15:04:39 -04:00
Bjarke Viksøe
cb4d86c4d7
server: Retrieve prompt template in /props ( #8337 )
...
* server: Retrieve prompt template in /props
This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change the log level from Error to Warning for the template-mismatch warning.
The front-end stands a better chance of actually executing the Jinja template format correctly; the server is currently just guessing it.
Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup in the "llm_load_print_meta" function.
* Make string buffer dynamic
* Add doc and better string handling
* Using chat_template naming convention
* Use intermediate vector for string assignment
2024-07-07 11:10:38 +02:00
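A minimal sketch of a front-end reading the template exposed by the change above. It assumes the /props response carries a "chat_template" field (following the "chat_template naming convention" noted in the commit) and a default port of 8080; treat both as assumptions until checked against the server docs.

```python
import json
from urllib.request import urlopen

# Fetch server properties and read the model's prompt template.
# The "chat_template" key name is an assumption based on the commit notes.
with urlopen("http://localhost:8080/props") as resp:
    props = json.load(resp)

template = props.get("chat_template", "")
print("template preview:", template[:200] or "<empty>")
```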
Concedo
5b605d03ea
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/config.yml
# .gitignore
# CMakeLists.txt
# CONTRIBUTING.md
# Makefile
# README.md
# ci/run.sh
# common/common.h
# examples/main-cmake-pkg/CMakeLists.txt
# ggml/src/CMakeLists.txt
# models/ggml-vocab-bert-bge.gguf.inp
# models/ggml-vocab-bert-bge.gguf.out
# models/ggml-vocab-deepseek-coder.gguf.inp
# models/ggml-vocab-deepseek-coder.gguf.out
# models/ggml-vocab-deepseek-llm.gguf.inp
# models/ggml-vocab-deepseek-llm.gguf.out
# models/ggml-vocab-falcon.gguf.inp
# models/ggml-vocab-falcon.gguf.out
# models/ggml-vocab-gpt-2.gguf.inp
# models/ggml-vocab-gpt-2.gguf.out
# models/ggml-vocab-llama-bpe.gguf.inp
# models/ggml-vocab-llama-bpe.gguf.out
# models/ggml-vocab-llama-spm.gguf.inp
# models/ggml-vocab-llama-spm.gguf.out
# models/ggml-vocab-mpt.gguf.inp
# models/ggml-vocab-mpt.gguf.out
# models/ggml-vocab-phi-3.gguf.inp
# models/ggml-vocab-phi-3.gguf.out
# models/ggml-vocab-starcoder.gguf.inp
# models/ggml-vocab-starcoder.gguf.out
# requirements.txt
# requirements/requirements-convert_legacy_llama.txt
# scripts/check-requirements.sh
# scripts/pod-llama.sh
# src/CMakeLists.txt
# src/llama.cpp
# tests/test-rope.cpp
2024-07-06 00:25:10 +08:00
Pieter Ouwerkerk
5a7447c569
readme : fix minor typos [no ci] ( #8314 )
2024-07-05 09:58:41 +03:00
Clint Herron
07a3fc0608
Removes multiple newlines at the end of files that are breaking the editorconfig step of CI. ( #8258 )
2024-07-02 12:18:10 -04:00
Concedo
02f92f6ecc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cpp-cuda.srpm.spec
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# README.md
# examples/llama.android/llama/src/main/cpp/CMakeLists.txt
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# tests/test-chat-template.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
2024-06-30 10:59:42 +08:00
Concedo
9c10486204
merge the file structure refactor, testing
2024-06-29 12:14:38 +08:00
Sigbjørn Skjæret
38373cfbab
Add SPM infill support ( #8016 )
...
* add --spm-infill option
* support --spm-infill
* support --spm-infill
2024-06-28 12:53:43 +02:00
Olivier Chafik
139cc621e9
json : restore default additionalProperties to false, fix some pattern escapes ( #8180 )
...
* json: expand ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS charset
* json: revert default of additionalProperties to false
* Update README.md
2024-06-28 09:26:45 +01:00
Georgi Gerganov
f3f65429c4
llama : reorganize source code + improve CMake ( #8006 )
...
* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122 )
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-26 18:33:02 +03:00
Concedo
f3dfa96dbc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# .github/workflows/docker.yml
# README.md
# llama.cpp
# tests/test-chat-template.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-llama-grammar.cpp
2024-06-26 18:59:10 +08:00
Olivier Chafik
9b2f16f805
json : better support for "type" unions (e.g. nullable arrays w/ typed items) ( #7863 )
...
* json: better support for "type" arrays (e.g. `{"type": ["array", "null"], "items": {"type": "string"}}`)
* json: add test for type: [array, null] fix
* update tests
2024-06-26 01:46:35 +01:00
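To make the inline example above concrete, here is a small self-contained sketch of the schema shape the change targets (a nullable array of typed items) and which JSON values it admits. The tiny matches() helper is a hand-rolled illustration only, not the grammar-based handling that llama.cpp actually performs.

```python
# The schema from the commit body: a value that is either null or an array
# of strings ("type" given as a union).
schema = {"type": ["array", "null"], "items": {"type": "string"}}

# Instances that should match the schema ...
valid = [None, [], ["a", "b"]]
# ... and ones that should not (wrong container type, wrong item type).
invalid = ["not-an-array", [1, 2, 3]]

def matches(value):
    """Tiny hand-rolled check for this one schema, for illustration only."""
    if value is None:
        return True
    return isinstance(value, list) and all(isinstance(x, str) for x in value)

assert all(matches(v) for v in valid)
assert not any(matches(v) for v in invalid)
print("all checks passed")
```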