Concedo
dcfa1eca4e
Merge commit '017cc5f446' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# CODEOWNERS
# examples/batched-bench/batched-bench.cpp
# examples/batched/batched.cpp
# examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
# examples/gritlm/gritlm.cpp
# examples/llama-bench/llama-bench.cpp
# examples/passkey/passkey.cpp
# examples/quantize-stats/quantize-stats.cpp
# examples/run/run.cpp
# examples/simple-chat/simple-chat.cpp
# examples/simple/simple.cpp
# examples/tokenize/tokenize.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-autorelease.cpp
# tests/test-model-load-cancel.cpp
# tests/test-tokenizer-0.cpp
# tests/test-tokenizer-1-bpe.cpp
# tests/test-tokenizer-1-spm.cpp
2025-01-08 23:15:21 +08:00
Georgi Gerganov
e6e7c75d94
server : fix extra BOS in infill endpoint ( #11106 )
...
* server : fix extra BOS in infill endpoint
ggml-ci
* server : update infill tests
2025-01-06 15:36:08 +02:00
Georgi Gerganov
727368c60f
llama : use LLAMA_TOKEN_NULL ( #11062 )
...
ggml-ci
2025-01-06 10:52:15 +02:00
Concedo
f9f1585a7f
broken merge - kcpp changes will be applied above this commit for better tracking.
2025-01-03 23:49:17 +08:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp ( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
Concedo
911da8765f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/run/run.cpp
# examples/server/README.md
# examples/server/bench/README.md
# examples/server/tests/README.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-01-03 11:56:20 +08:00
Pierrick Hymbert
2f0ee84b9b
server: bench: minor fixes ( #10765 )
...
* server/bench:
- support openAI streaming standard output with [DONE]\n\n
- export k6 raw results in csv
- fix too many idle tcp connections in tcp_wait
- add metric time to emit first token
* server/bench:
- fix when prometheus not started
- wait for server to be ready before starting bench
2025-01-02 18:06:12 +01:00
Xuan Son Nguyen
0da5d86026
server : allow using LoRA adapters per-request ( #10994 )
...
* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* lora_base
* remove redundant check
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-02 15:05:18 +01:00
Xuan Son Nguyen
45095a61bf
server : clean up built-in template detection ( #11026 )
...
* server : clean up built-in template detection
* fix compilation
* add chat template test
* fix condition
2024-12-31 15:22:01 +01:00
Xuan Son Nguyen
5896c65232
server : add OAI compat for /v1/completions ( #10974 )
...
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs
2024-12-31 12:34:13 +01:00
Isaac McFadyen
f865ea149d
server: added more docs for response_fields field ( #10995 )
2024-12-28 16:09:19 +01:00
Alexey Parfenov
16cdce7b68
server : fix token duplication when streaming with stop strings ( #10997 )
2024-12-28 16:08:54 +01:00
Concedo
7c671f289e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# examples/cvector-generator/mean.hpp
# examples/cvector-generator/pca.hpp
# examples/export-lora/export-lora.cpp
# examples/rpc/rpc-server.cpp
# examples/run/README.md
# examples/run/run.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# scripts/compare-llama-bench.py
# scripts/hf.sh
# tests/test-chat-template.cpp
2024-12-28 12:48:34 +08:00
Reza Kakhki
9ba399dfa7
server : add support for "encoding_format": "base64" to the */embeddings endpoints ( #10967 )
...
* add support for base64
* fix base64 test
* improve test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 21:33:04 +01:00
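The base64 option above can be illustrated with a minimal sketch. It assumes the payload is the raw float32 bytes of the embedding vector in host byte order (little-endian on common hardware), encoded with standard base64. The helper names `base64_encode` and `embedding_to_base64` are hypothetical and are not taken from the server code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 (RFC 4648 alphabet, '=' padding) over a byte buffer.
std::string base64_encode(const unsigned char * data, size_t len) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((len + 2) / 3) * 4);
    for (size_t i = 0; i < len; i += 3) {
        // Pack up to three input bytes into a 24-bit group.
        const uint32_t b0 = data[i];
        const uint32_t b1 = i + 1 < len ? data[i + 1] : 0;
        const uint32_t b2 = i + 2 < len ? data[i + 2] : 0;
        const uint32_t triple = (b0 << 16) | (b1 << 8) | b2;
        // Emit four 6-bit symbols, padding where input bytes were missing.
        out.push_back(tbl[(triple >> 18) & 0x3F]);
        out.push_back(tbl[(triple >> 12) & 0x3F]);
        out.push_back(i + 1 < len ? tbl[(triple >> 6) & 0x3F] : '=');
        out.push_back(i + 2 < len ? tbl[triple & 0x3F] : '=');
    }
    return out;
}

// Assumption: the embedding is serialized as its in-memory float32 bytes.
std::string embedding_to_base64(const std::vector<float> & emb) {
    return base64_encode(reinterpret_cast<const unsigned char *>(emb.data()),
                         emb.size() * sizeof(float));
}
```

A client would reverse the process: decode the base64 string to bytes and reinterpret every four bytes as one float32 value.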
Djip007
2cd43f4900
ggml : more performance with llamafile tinyblas on x86_64 ( #10714 )
...
* more performance with llamafile tinyblas on x86_64.
- add bf16 support
- change dispatch strategy (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispatch and more cache friendly
* tinyblas dynamic dispatching
* sgemm: add M blocks.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
2024-12-24 18:54:49 +01:00
NeverLucky
09fe2e7613
server: allow filtering llama server response fields ( #10940 )
...
* llama_server_response_fields
* llama_server_response_fields_fix_issues
* params fixes
* fix
* clarify docs
* change to "response_fields"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 17:39:49 +01:00
Xuan Son Nguyen
14b699ecde
server : fix missing model id in /model endpoint ( #10957 )
...
* server : fix missing model id in /model endpoint
* fix ci
2024-12-23 12:52:25 +01:00
Xuan Son Nguyen
485dc01214
server : add system_fingerprint to chat/completion ( #10917 )
...
* server : add system_fingerprint to chat/completion
* update README
2024-12-23 12:02:44 +01:00
Concedo
4c56b7cada
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/gbnf-validator/gbnf-validator.cpp
# examples/llava/clip.cpp
# examples/run/README.md
# examples/run/run.cpp
# examples/server/README.md
# ggml/src/ggml-cpu/CMakeLists.txt
# src/llama.cpp
# tests/test-grammar-integration.cpp
# tests/test-llama-grammar.cpp
2024-12-21 09:41:49 +08:00
Xuan Son Nguyen
0ca416c91a
server : (UI) fix copy to clipboard function ( #10916 )
2024-12-20 14:12:06 +01:00
Xuan Son Nguyen
57bb2c40cd
server : fix logprobs, make it OAI-compatible ( #10783 )
...
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before apply softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
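The steps named in this commit (sort, apply softmax, take `std::log`) can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: the struct `cand_prob` and the function `top_logprobs` are hypothetical names, and the input is assumed to be a list of candidate tokens with raw logits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical record for one candidate token.
struct cand_prob {
    int   token;
    float logit;
    float p;        // probability after softmax
    float logprob;  // natural log of p
};

// Sort candidates by logit (descending), apply a numerically stable softmax,
// then keep the n_top most likely entries with their log-probabilities.
std::vector<cand_prob> top_logprobs(std::vector<cand_prob> cands, size_t n_top) {
    if (cands.empty()) {
        return cands;
    }
    std::sort(cands.begin(), cands.end(),
              [](const cand_prob & a, const cand_prob & b) { return a.logit > b.logit; });
    const float max_logit = cands.front().logit;  // largest logit after sorting
    float sum = 0.0f;
    for (auto & c : cands) {
        c.p = std::exp(c.logit - max_logit);  // subtract the max to avoid overflow
        sum += c.p;
    }
    for (auto & c : cands) {
        c.p /= sum;
        c.logprob = std::log(c.p);
    }
    if (cands.size() > n_top) {
        cands.resize(n_top);
    }
    return cands;
}
```

Sorting first means the largest logit is at the front, so the stable softmax and the top-N truncation both fall out of the same ordering.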
Concedo
ee486bad3e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# examples/CMakeLists.txt
# examples/batched/batched.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/build.gradle.kts
# examples/main/README.md
# examples/retrieval/retrieval.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml.c
# scripts/compare-commits.sh
# scripts/sync-ggml.last
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-sampling.cpp
2024-12-19 11:57:43 +08:00
Gaetan Bisson
7bbb5acf12
server: avoid overwriting Authorization header ( #10878 )
...
* server: avoid overwriting Authorization header
If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.
Fixes https://github.com/ggerganov/llama.cpp/issues/10854
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-18 15:00:07 +01:00
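The rule described in this commit (leave the Authorization header as is when no API key is set) can be sketched in a few lines. The helper `build_headers` and the `header_map` alias are hypothetical and written in C++ purely for illustration; they are not the code changed by the commit.

```cpp
#include <map>
#include <string>

using header_map = std::map<std::string, std::string>;

// Hypothetical helper: prepares request headers for a server call.
// The Authorization header is only set when an API key is configured;
// otherwise any existing value (e.g. from an authenticating proxy) is kept.
header_map build_headers(header_map headers, const std::string & api_key) {
    headers["Content-Type"] = "application/json";
    if (!api_key.empty()) {
        headers["Authorization"] = "Bearer " + api_key;
    }
    return headers;
}
```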
Georgi Gerganov
152610eda9
server : output embeddings for all tokens when pooling = none ( #10861 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
Georgi Gerganov
0e70ba686e
server : add "tokens" output ( #10853 )
...
* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return tokens ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 11:05:29 +02:00
Xuan Son Nguyen
46828872c3
server : (embeddings) using same format for "input" and "content" ( #10872 )
...
* server : (embeddings) using same format for "input" and "content"
* fix test case
* handle empty input case
* fix test
2024-12-18 10:55:09 +02:00
krystiancha
05c3a444b8
server : fill usage info in embeddings and rerank responses ( #10852 )
...
* server : fill usage info in embeddings response
* server : fill usage info in reranking response
2024-12-17 18:00:24 +02:00
Xuan Son Nguyen
227d7c5a7f
server : (UI) fix missing async generator on safari ( #10857 )
...
* server : (UI) fix missing async generator on safari
* fix
2024-12-17 09:52:09 +01:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler ( #10803 )
...
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
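Two of the behaviours listed in this commit can be sketched briefly: treating `penalty_last_n == -1` as the context size, and expressing `ignore_eos` as a logit bias on end-of-generation (EOG) tokens. The names `token_bias`, `resolve_penalty_last_n`, and `add_ignore_eos_bias` are hypothetical, and the sketch assumes a bias of negative infinity is what suppresses a token.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical pair of a token id and the bias added to its logit.
struct token_bias {
    int32_t token;
    float   bias;
};

// A window of -1 means "use the whole context".
int32_t resolve_penalty_last_n(int32_t penalty_last_n, int32_t n_ctx) {
    return penalty_last_n == -1 ? n_ctx : penalty_last_n;
}

// Suppress every EOG token by biasing its logit to negative infinity,
// so it can never be sampled.
void add_ignore_eos_bias(std::vector<token_bias> & biases,
                         const std::vector<int32_t> & eog_tokens) {
    for (int32_t tok : eog_tokens) {
        biases.push_back({tok, -INFINITY});
    }
}
```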
Vinesh Janarthanan
5478bbcd17
server: (UI) add syntax highlighting and latex math rendering ( #10808 )
...
* add code highlighting and math formatting
* code cleanup
* build public/index.html
* rebuild public/index.html
* fixed coding style
* fixed coding style
* style fixes
* highlight: smaller bundle size, fix light & dark theme
* remove katex
* add bundle size check
* add more languages
* add php
* reuse some langs
* use gzip
* Revert "remove katex"
This reverts commit c0e5046accd10be3f83018cffdc29a652849fc61.
* use better maintained @vscode/markdown-it-katex
* fix gzip non deterministic
* ability to add a demo conversation for dev
* fix latex rendering
* add comment
* latex codeblock as code
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-15 12:55:54 +01:00
Concedo
f456ed7237
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .devops/tools.sh
# .github/workflows/build.yml
# Makefile
# README.md
# common/CMakeLists.txt
# common/common.h
# examples/llava/CMakeLists.txt
# examples/run/CMakeLists.txt
# examples/run/README.md
# examples/run/run.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# tests/test-backend-ops.cpp
# tests/test-rope.cpp
2024-12-15 15:30:10 +08:00
Michelle Tan
89d604f2c8
server: Fix `has_next_line` in JSON response ( #10818 )
...
* Update server JSON response.
* Add unit test to check `has_new_line` JSON response
* Remove `has_new_line` unit test changes.
* Address code review comment: type check for `has_new_line` in unit test
2024-12-14 23:29:45 +01:00
cduk
56eea0781c
Removes spurious \r in output that causes logging in journalctl to treat lines as binary and therefore hidden by default ( #10771 )
...
Signed-off-by: Charles Darke <s.cduk@toodevious.com>
Co-authored-by: Charles Darke <s.cduk@toodevious.com>
2024-12-13 23:21:49 +01:00
Concedo
ed75f8a741
up to date merge, without vulkan-gen-shaders. They will be built before each release from now on, as they are very large
2024-12-13 17:18:01 +08:00
Concedo
de64b9198c
merge checkpoint 2 - functional merge without q4_0_4_4 (need regen shaders)
2024-12-13 17:04:19 +08:00
Concedo
4c4ce5e808
rewritten checkpoint 1 - before coopmat
2024-12-13 16:55:23 +08:00
Xuan Son Nguyen
adffa6ffd5
common : improve -ctv -ctk CLI arguments ( #10806 )
...
* common : improve ctv ctk cli argument
* regenerate docs
* even better approach
* use std::vector
2024-12-12 22:53:05 +01:00
CentricStorm
5555c0c1f6
docs: update server streaming mode documentation ( #9519 )
...
Provide more documentation for streaming mode.
2024-12-11 23:40:40 +01:00
Xuan Son Nguyen
235f6e14bf
server : (UI) add tok/s, get rid of completion.js ( #10786 )
...
* get rid of completion.js
* extract chat bubble to a component
* add tok/s info
* sync
* fix BASE_URL
* only extract timings when it's enabled
* fix auto scroll
2024-12-11 20:52:14 +01:00
kallewoof
484d2f31ae
bug-fix: snprintf prints NULL in place of the last character ( #10419 )
...
* bug-fix: snprintf prints NULL in place of the last character
We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string.
* add comment about extra null-term byte requirement
2024-12-11 14:48:04 +01:00
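The extra-byte requirement explained in this commit can be shown with a minimal sketch. `std::snprintf` always reserves the last byte of the buffer for the null terminator, so a buffer sized to exactly the text length loses its final character. The helper `format_labeled` is hypothetical and is not the function patched by the commit.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: formats "<label><value>" into a std::string.
std::string format_labeled(const char * label, int value) {
    // First pass: measure the output length, excluding the null terminator.
    const int len = std::snprintf(nullptr, 0, "%s%d", label, value);
    if (len < 0) {
        return std::string();
    }
    // Allocate one extra byte so snprintf can write the last character
    // AND the null terminator.
    std::vector<char> buf(static_cast<size_t>(len) + 1);
    std::snprintf(buf.data(), buf.size(), "%s%d", label, value);
    // Ignore the extra byte when converting to std::string.
    return std::string(buf.data(), static_cast<size_t>(len));
}
```

With a buffer of only `len` bytes, the second call would write `len - 1` characters plus the terminator, which is the truncation the commit title describes.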
CentricStorm
4b4d92b098
docs: fix server documentation formatting ( #10776 )
2024-12-11 11:47:43 +01:00
Yüg
a86ad841f1
server : add flag to disable the web-ui ( #10762 ) ( #10751 )
...
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
2024-12-10 18:22:34 +01:00
Xuan Son Nguyen
ce8784bdb1
server : fix format_infill ( #10724 )
...
* server : fix format_infill
* fix
* rename
* update test
* use another model
* update test
* update test
* test_invalid_input_extra_req
2024-12-08 23:04:29 +01:00
Xuan Son Nguyen
e52522b869
server : bring back info of final chunk in stream mode ( #10722 )
...
* server : bring back info to final chunk in stream mode
* clarify a bit
* trailing space
2024-12-08 20:38:51 +01:00
Xuan Son Nguyen
3573fa8e7b
server : (refactor) no more json in server_task input ( #10691 )
...
* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme
2024-12-07 20:21:09 +01:00
Georgi Gerganov
ce4a7b8493
server : various fixes ( #10704 )
...
* server : various fixes
ggml-ci
* server : show current seed in slot_params
ggml-ci
* fix /slots endpoint
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : reflect endpoint response changes in the readme
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-07 18:02:05 +02:00
Georgi Gerganov
c2a16c0bdb
server : fix free of spec context and batch ( #10651 )
...
ggml-ci
2024-12-07 11:52:44 +02:00
Xuan Son Nguyen
6c5bc0625f
server : (refactoring) do not rely on JSON internally ( #10643 )
...
* server : (refactoring) reduce usage of json internally
* move all response types to struct
* wip [no ci]
* many fixes
* add virtual function
* fix index
* minor style fix
* add std::move
* refactor handle_completions_generic
* add virtual functions
* remove server.hpp
* clarify server_sent_event RFC specs
* apply review comments
* fix model_alias and completion_probabilities
* small clean up
* remove virtual for to_json_oai_compat()
* naming oai_compat --> oaicompat
* fix unwanted recursive call
* update docs
2024-12-06 11:14:32 +01:00
Plamen Minev
7736837d62
fix(server) : not show alert when DONE is received ( #10674 )
2024-12-05 22:36:41 +01:00
Georgi Gerganov
1da7b76569
server : fix speculative decoding with context shift ( #10641 )
...
* server : fix speculative decoding with context shift
ggml-ci
* server : take into account speculative limits
ggml-ci
* server : add tests
2024-12-04 22:38:20 +02:00