Concedo
bdfe8526b8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .gitignore
# CONTRIBUTING.md
# Makefile
# examples/llava/CMakeLists.txt
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# src/llama-vocab.cpp
2024-08-10 11:42:32 +08:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings ( #8936 )
2024-08-09 09:32:02 +03:00
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) ( #8857 )
...
* server : add lora hotswap endpoint
* handle lora_no_apply
* fix build
* update docs
* clean up struct def
* fix build
* add LoRA test
* fix style
2024-08-06 17:33:39 +02:00
Concedo
e1f97f7fb5
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-server.Dockerfile
# README.md
# flake.lock
# ggml/src/ggml-vulkan.cpp
# ggml/src/vulkan-shaders/concat.comp
# ggml/src/vulkan-shaders/pad.comp
# ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-backend-ops.cpp
2024-08-06 16:33:26 +08:00
Liu Jia
0a4ce78681
common : Changed tuple to struct (TODO fix) ( #8823 )
...
* common : Changed tuple to struct (TODO fix)
Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>
* delete llama_init_default_params()
* delete the extra whitespace
2024-08-05 18:14:10 +02:00
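As a minimal sketch of the change described in the commit above (field names assumed, not taken from the source), the struct that replaces the tuple could look like this:

```cpp
// Sketch only: an aggregate replacing
// std::tuple<struct llama_model *, struct llama_context *>
// as the return type of the common init helper (names assumed).
struct llama_init_result {
    struct llama_model   * model   = nullptr;
    struct llama_context * context = nullptr;
};

// Callers then read named members instead of std::get<0>/std::get<1>, e.g.:
//   llama_init_result r = llama_init_from_gpt_params(params); // function name assumed
//   if (r.model == nullptr || r.context == nullptr) { /* handle failure */ }
```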
ardfork
978ba3d83d
Server: Don't ignore llama.cpp params ( #8754 )
...
* Don't ignore llama.cpp params
* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
Concedo
24b9616344
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/full.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-intel.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cli.Dockerfile
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-intel.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# CMakeLists.txt
# CONTRIBUTING.md
# Makefile
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# requirements.txt
# src/llama.cpp
# tests/test-backend-ops.cpp
2024-07-19 14:23:33 +08:00
RunningLeon
3807c3de04
server : respect --special cli arg ( #8553 )
2024-07-18 11:06:22 +03:00
Concedo
602661ba49
Merge commit 'c917b67f06' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# Makefile
# ggml/src/ggml-cuda/mmq.cuh
# tests/test-double-float.cpp
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
2024-07-14 11:38:20 +08:00
Douglas Hanley
c3ebcfa148
server : ensure batches are either all embed or all completion ( #8420 )
...
* make sure batches are all embed or all non-embed
* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Concedo
2cad736260
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/labeler.yml
# .gitignore
# CMakeLists.txt
# Makefile
# Package.swift
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# requirements/requirements-convert_hf_to_gguf.txt
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/check-requirements.sh
# scripts/compare-llama-bench.py
# scripts/gen-unicode-data.py
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-tokenizer-random.py
2024-07-11 16:36:16 +08:00
Clint Herron
278d0e1846
Initialize default slot sampling parameters from the global context. ( #8418 )
2024-07-10 20:08:17 -04:00
Clint Herron
a59f8fdc85
Server: Enable setting default sampling parameters via command-line ( #8402 )
...
* Load server sampling parameters from the server context by default.
* Wordsmithing comment
2024-07-09 18:26:40 -04:00
Bjarke Viksøe
cb4d86c4d7
server: Retrieve prompt template in /props ( #8337 )
...
* server: Retrieve prompt template in /props
This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change the log level from Error to Warning for the template-mismatch warning.
The front-end stands a better chance of actually executing the Jinja template format correctly; the server is currently just guessing it.
Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup in the "llm_load_print_meta" function.
* Make string buffer dynamic
* Add doc and better string handling
* Using chat_template naming convention
* Use intermediate vector for string assignment
2024-07-07 11:10:38 +02:00
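Since the commit above only describes the exposed field, here is a minimal sketch, assuming the template is returned under a "chat_template" key in the /props JSON response (the exact response shape is an assumption, not a documented contract), of how a client might read it with nlohmann::json:

```cpp
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

int main() {
    // Sketch only: a stand-in for the body returned by GET /props.
    std::string body = R"({"chat_template": "{% for message in messages %}...{% endfor %}"})";

    nlohmann::json props = nlohmann::json::parse(body);

    // value() returns the default if the key is absent, so older servers
    // without this field would simply yield an empty string here.
    std::string tmpl = props.value("chat_template", std::string{});
    std::cout << "model chat template: " << tmpl << "\n";
    return 0;
}
```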
Concedo
02f92f6ecc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cpp-cuda.srpm.spec
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# README.md
# examples/llama.android/llama/src/main/cpp/CMakeLists.txt
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# tests/test-chat-template.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
2024-06-30 10:59:42 +08:00
Sigbjørn Skjæret
38373cfbab
Add SPM infill support ( #8016 )
...
* add --spm-infill option
* support --spm-infill
* support --spm-infill
2024-06-28 12:53:43 +02:00
Concedo
f3dfa96dbc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# .github/workflows/docker.yml
# README.md
# llama.cpp
# tests/test-chat-template.cpp
# tests/test-grammar-integration.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-llama-grammar.cpp
2024-06-26 18:59:10 +08:00
Xuan Son Nguyen
48e6b92cc3
Add chat template support for llama-cli ( #8068 )
...
* add chat template support for llama-cli
* add help message
* server: simplify format_chat
* more consistent naming
* improve
* add llama_chat_format_example
* fix server
* code style
* code style
* Update examples/main/main.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
Concedo
92afdfcae4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/server.yml
# .gitignore
# CMakeLists.txt
# Makefile
# README-sycl.md
# README.md
# llama.cpp
# requirements/requirements-convert-hf-to-gguf-update.txt
# requirements/requirements-convert-hf-to-gguf.txt
# requirements/requirements-convert-legacy-llama.txt
# scripts/sync-ggml.last
# tests/test-tokenizer-random.py
2024-06-22 01:33:44 +08:00
sasha0552
ba58993152
server : fix smart slot selection ( #8020 )
2024-06-20 09:57:10 +10:00
Sigbjørn Skjæret
91c188d6c2
Only use FIM middle token if it exists ( #7648 )
...
* Only use FIM middle if it exists
* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
Concedo
b53e760557
Merge commit '1c641e6aac' into concedo_experimental
...
# Conflicts:
# .devops/cloud-v-pipeline
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-rocm.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-cli.Dockerfile
# .devops/llama-cpp-clblast.srpm.spec
# .devops/llama-cpp-cuda.srpm.spec
# .devops/llama-cpp.srpm.spec
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-rocm.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .devops/llama-server.Dockerfile
# .devops/nix/apps.nix
# .devops/nix/package.nix
# .devops/tools.sh
# .dockerignore
# .github/ISSUE_TEMPLATE/01-bug-low.yml
# .github/ISSUE_TEMPLATE/02-bug-medium.yml
# .github/ISSUE_TEMPLATE/03-bug-high.yml
# .github/ISSUE_TEMPLATE/04-bug-critical.yml
# .github/workflows/bench.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# .github/workflows/server.yml
# .gitignore
# Makefile
# README-sycl.md
# README.md
# ci/run.sh
# docs/token_generation_performance_tips.md
# flake.nix
# grammars/README.md
# pocs/vdot/CMakeLists.txt
# scripts/get-hellaswag.sh
# scripts/get-wikitext-103.sh
# scripts/get-wikitext-2.sh
# scripts/get-winogrande.sh
# scripts/hf.sh
# scripts/pod-llama.sh
# scripts/qnt-all.sh
# scripts/run-all-ppl.sh
# scripts/run-with-preset.py
# scripts/server-llm.sh
# tests/test-backend-ops.cpp
2024-06-14 18:41:37 +08:00
Concedo
a8db72eca0
Merge commit 'ef52d1d16a' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# README.md
# flake.lock
# grammars/README.md
# grammars/json.gbnf
# grammars/json_arr.gbnf
# tests/test-json-schema-to-grammar.cpp
2024-06-13 18:26:45 +08:00
Georgi Gerganov
704a35b183
server : restore numeric prompts ( #7883 )
2024-06-12 14:42:29 +03:00
Georgi Gerganov
d9da0e4986
server : improve "prompt" handling ( #7847 )
2024-06-10 14:59:55 +03:00
Concedo
562d980140
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full.Dockerfile
# .devops/main-cuda.Dockerfile
# .devops/main-rocm.Dockerfile
# .devops/main-vulkan.Dockerfile
# .devops/main.Dockerfile
# .devops/server-cuda.Dockerfile
# .devops/server.Dockerfile
# README.md
# common/CMakeLists.txt
# grammars/README.md
# tests/test-grammar-integration.cpp
# tests/test-grammar-parser.cpp
# tests/test-json-schema-to-grammar.cpp
2024-06-09 17:30:05 +08:00
sasha0552
7a16ce7db2
server : smart slot selection using Longest Common Prefix ( #7728 )
...
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
2024-06-08 10:50:31 +03:00
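As a minimal sketch of the technique named in the bullets above (not the server's actual implementation), slot selection by longest common prefix could look like this, with token IDs represented as plain ints:

```cpp
#include <cstddef>
#include <vector>

// Length of the longest common prefix of two token sequences.
static size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// Pick the slot whose cached tokens share the longest prefix with the incoming
// prompt, so the largest part of its KV cache can be reused (illustrative only).
static int pick_slot(const std::vector<std::vector<int>> & slot_cache, const std::vector<int> & prompt) {
    int    best     = -1;
    size_t best_len = 0;
    for (int i = 0; i < (int) slot_cache.size(); i++) {
        size_t len = common_prefix_len(slot_cache[i], prompt);
        if (len > best_len) {
            best_len = len;
            best     = i;
        }
    }
    return best; // -1 means no slot shares a prefix; fall back to any free slot
}
```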
woodx
a5cabd7649
server : do not get prompt in infill mode ( #7286 )
...
* avoid getting the prompt in infill mode and embedding mode
* remove embedding mode
* refactor format
---------
Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
Georgi Gerganov
f83351f9a6
imatrix : migrate to gpt_params ( #7771 )
...
* imatrix : migrate to gpt_params
ggml-ci
* imatrix : add --save-frequency cli arg
* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
Concedo
6659742a2d
do not merge the removal of opencl
2024-06-05 10:57:52 +08:00
Georgi Gerganov
1442677f92
common : refactor cli arg parsing ( #7675 )
...
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remove --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Concedo
a97f7d5f91
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/full-rocm.Dockerfile
# .devops/full.Dockerfile
# .devops/main-cuda.Dockerfile
# .devops/main-intel.Dockerfile
# .devops/main-rocm.Dockerfile
# .devops/main.Dockerfile
# .devops/server-cuda.Dockerfile
# .devops/server-intel.Dockerfile
# .devops/server-rocm.Dockerfile
# .devops/server.Dockerfile
# .devops/tools.sh
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# README-sycl.md
# README.md
# ci/run.sh
# llama.cpp
# requirements.txt
# requirements/requirements-convert-hf-to-gguf-update.txt
# requirements/requirements-convert-hf-to-gguf.txt
# requirements/requirements-convert-legacy-llama.txt
# requirements/requirements-convert-llama-ggml-to-gguf.txt
# scripts/check-requirements.sh
# scripts/compare-llama-bench.py
# scripts/convert-gg.sh
# scripts/pod-llama.sh
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-tokenizer-0.sh
# tests/test-tokenizer-random.py
2024-06-02 12:28:38 +08:00
Yazan Agha-Schrader
2e666832e6
server : new UI ( #7633 )
...
* ic
* migrate my early work
* add the accompanying assets: css, favicon, etc.
* de prompts
* chore: Update HTML meta tags in index.html file
* add api-key css classes
* some necessary fixes
* Add API key CSS classes and update styling in style.css
* clean the code
* move API to the top, rearrange param sliders. update css
* add tooltips to the parameters with comprehensible explanations
* fix FloatField and BoolField tooltips
* fix grammar field width
* use template literals for promptFormats.js
* update const ModelGenerationInfo
* remove ms per token, since not relevant for most webui users and use cases
* add phi-3 prompt template
* add phi3 to dropdown
* add css class
* update forgotten css theme
* add user message suffix
* fix chatml & add llama3 format
* fix llama3 prompt template
* more prompt format fixes
* add more common stop tokens
* add missing char
* do not separate with new line or comma
* move prompt style
* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js
* fix toggle state localstorage
* add cmd-r prompt and reduce redundancy
* set default prompt to empty
* move files, clean code
* fix css path
* add a button to the new ui
* move new ui to "/public" due to otherwise problematic CORS behaviour
* include new ui in cpp
* fix wrong link to old ui
* renaming to ensure consistency
* fix typos "prompt-format" -> "prompt-formats"
* use correct indent
* add new ui files to makefile
* fix typo
2024-06-01 22:31:48 +03:00
Concedo
9282c307ed
this commit does not work, just for debugging
2024-05-23 20:13:47 +08:00
Georgi Gerganov
6ff13987ad
common : normalize naming style ( #7462 )
...
* common : normalize naming style
ggml-ci
* common : match declaration / definition order
* zig : try to fix build
2024-05-22 20:04:20 +03:00
Concedo
52f9911240
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# Makefile
# README.md
# requirements.txt
# scripts/LlamaConfig.cmake.in
2024-05-21 19:05:52 +08:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input ( #7389 )
2024-05-20 08:56:05 +03:00
Johannes Gäßler
41858392e1
server: fix seed being reported back ( #7382 )
2024-05-19 17:06:33 +03:00
Concedo
47cbfd6150
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# README.md
# llama.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-backend-ops.cpp
2024-05-17 22:30:41 +08:00
Radoslav Gerganov
ee94172d33
server : add support for the RPC backend ( #7305 )
...
ref: #7292
2024-05-17 10:00:17 +03:00
Steve Grubb
4f0263633b
server: free sampling contexts on exit ( #7264 )
...
* server: free sampling contexts on exit
This cleans up the last leak found by the address sanitizer.
* fix whitespace
* fix whitespace
2024-05-14 16:11:24 +02:00
Concedo
2ee808a747
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# README.md
# ci/run.sh
# llama.cpp
# models/ggml-vocab-llama-bpe.gguf.inp
# models/ggml-vocab-llama-bpe.gguf.out
# requirements.txt
# scripts/compare-llama-bench.py
# scripts/sync-ggml.last
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-grammar-integration.cpp
# tests/test-tokenizer-1-bpe.cpp
2024-05-14 19:28:47 +08:00
Xuan Son Nguyen
72c177c1f6
fix system prompt handling ( #7153 )
2024-05-11 17:28:10 +02:00
Steve Grubb
988631335a
server : free llama_batch on exit ( #7212 )
...
* [server] Clean up a memory leak on exit
There are a couple of memory leaks on exit of the server, and this one hides the others.
After cleaning this up, you can see leaks in the slots, but that is another patch to be sent after this one.
* make tab into spaces
2024-05-11 11:13:02 +03:00
Johannes Gäßler
5ae3426b0b
server: fix reported top tokens for temperature 0 ( #7203 )
2024-05-11 10:11:28 +02:00
Concedo
d084f78faa
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
# common/common.cpp
# requirements/requirements-convert-hf-to-gguf-update.txt
# requirements/requirements-convert-hf-to-gguf.txt
# requirements/requirements-convert.txt
# tests/CMakeLists.txt
# tests/test-json-schema-to-grammar.cpp
2024-05-09 15:13:34 +08:00
Johannes Gäßler
c12452c7ae
JSON: [key] -> .at(key), assert() -> GGML_ASSERT ( #7143 )
2024-05-08 21:53:08 +02:00
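For illustration of the distinction the commit above targets: nlohmann::json's operator[] silently inserts a null value for a missing key on a mutable object, while .at() throws immediately, and GGML_ASSERT (unlike plain assert()) presumably stays active in release builds. A small sketch with made-up keys:

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    nlohmann::json j = {{"n_predict", 128}};

    // .at() on a missing key throws json::out_of_range instead of quietly
    // inserting a null that may only misbehave further downstream.
    try {
        int n = j.at("n_predict").get<int>();   // present: fine
        int k = j.at("top_k").get<int>();       // missing: throws
        std::cout << n << " " << k << "\n";
    } catch (const nlohmann::json::out_of_range & e) {
        std::cerr << "missing key: " << e.what() << "\n";
    }
    return 0;
}
```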
Johan
911b3900dd
server : add_special option for tokenize endpoint ( #7059 )
2024-05-08 15:27:58 +03:00
Concedo
bc39b4d98a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# README.md
# ci/run.sh
# docs/BLIS.md
# flake.lock
# grammars/README.md
2024-05-08 09:58:23 +08:00
Johannes Gäßler
af0a5b6163
server: fix incorrectly reported token probabilities ( #7125 )
...
* server: normalize token probabilities
* fix temperature == 0.0f
2024-05-07 23:07:58 +02:00