Concedo
5329df2bdf
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server.yml
# CMakeLists.txt
# cmake/build-info.cmake
# examples/run/CMakeLists.txt
# examples/run/run.cpp
# examples/simple-chat/simple-chat.cpp
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-sampling.cpp
2025-01-21 00:25:07 +08:00
Georgi Gerganov
92bc493917
tests : increase timeout when sanitizers are enabled ( #11300 )
...
* tests : increase timeout when sanitizers are enabled
* tests : add DEFAULT_HTTP_TIMEOUT
2025-01-19 20:22:30 +02:00
Xuan Son Nguyen
f30f099228
server : implement cancellable request ( #11285 )
...
* server : implement cancellable request
* fix typo
* httplib 0.18.5
* fix i underflow
2025-01-18 14:12:05 +01:00
Concedo
11cd7c7bb0
survived the storm, again
2025-01-16 22:25:18 +08:00
Concedo
2a00ee8fa8
broken commit
2025-01-16 21:41:18 +08:00
ebraminio
c5bf0d1bd7
server : Improve code snippets direction between RTL text ( #11221 )
2025-01-14 11:39:33 +01:00
ebraminio
504af20ee4
server : (UI) Improve messages bubble shape in RTL ( #11220 )
...
I had simply overlooked the message bubble's tail placement for RTL
text, because I use dark mode, where the tail isn't visible. This
fixes it.
2025-01-13 20:23:31 +01:00
ebraminio
437e05f714
server : (UI) Support for RTL text as models input or output ( #11208 )
2025-01-13 14:46:39 +01:00
Georgi Gerganov
afa8a9ec9b
llama : add llama_vocab, functions -> methods, naming ( #11110 )
...
* llama : functions -> methods (#11110 )
* llama : add struct llama_vocab to the API (#11156 )
ggml-ci
* hparams : move vocab params to llama_vocab (#11159 )
ggml-ci
* vocab : more pimpl (#11165 )
ggml-ci
* vocab : minor tokenization optimizations (#11160 )
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* lora : update API names (#11167 )
ggml-ci
* llama : update API names to use correct prefix (#11174 )
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174 )
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174 )
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
Concedo
b154bd3671
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build.md
# docs/development/HOWTO-add-model.md
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
2025-01-10 17:57:38 +08:00
Daniel Bevenius
8eceb888d7
server : add tooltips to settings and themes btn ( #11154 )
...
* server : add tooltips to settings and themes btn
This commit adds tooltips to the settings and themes buttons in the
webui. The tooltip will be displayed below the actual buttons when
hovered over.
The motivation for this change is to clarify the purpose of the themes
button.
* squash! server : add tooltips to settings and themes btn
This commit adds a tooltip to the '...' button when a chat has been
started. The tooltip is "Chat options", which I think could be a good
description as the dropdown contains options to delete or download the
current chat.
* rm tooltip for 3 dots button
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-09 11:28:29 +01:00
Concedo
dcfa1eca4e
Merge commit '017cc5f446' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# CODEOWNERS
# examples/batched-bench/batched-bench.cpp
# examples/batched/batched.cpp
# examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp
# examples/gritlm/gritlm.cpp
# examples/llama-bench/llama-bench.cpp
# examples/passkey/passkey.cpp
# examples/quantize-stats/quantize-stats.cpp
# examples/run/run.cpp
# examples/simple-chat/simple-chat.cpp
# examples/simple/simple.cpp
# examples/tokenize/tokenize.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# scripts/sync-ggml.last
# src/llama.cpp
# tests/test-autorelease.cpp
# tests/test-model-load-cancel.cpp
# tests/test-tokenizer-0.cpp
# tests/test-tokenizer-1-bpe.cpp
# tests/test-tokenizer-1-spm.cpp
2025-01-08 23:15:21 +08:00
Georgi Gerganov
a3c1232c3f
arg : option to exclude arguments from specific examples ( #11136 )
...
* arg : option to exclude arguments from specific examples
ggml-ci
* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
Georgi Gerganov
e6e7c75d94
server : fix extra BOS in infill endpoint ( #11106 )
...
* server : fix extra BOS in infill endpoint
ggml-ci
* server : update infill tests
2025-01-06 15:36:08 +02:00
Georgi Gerganov
727368c60f
llama : use LLAMA_TOKEN_NULL ( #11062 )
...
ggml-ci
2025-01-06 10:52:15 +02:00
Concedo
f9f1585a7f
broken merge - kcpp changes will be applied above this commit for better tracking.
2025-01-03 23:49:17 +08:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp ( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
Concedo
911da8765f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/llama.android/llama/src/main/cpp/llama-android.cpp
# examples/run/run.cpp
# examples/server/README.md
# examples/server/bench/README.md
# examples/server/tests/README.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-01-03 11:56:20 +08:00
Pierrick Hymbert
2f0ee84b9b
server: bench: minor fixes ( #10765 )
...
* server/bench:
- support openAI streaming standard output with [DONE]\n\n
- export k6 raw results in csv
- fix too many tcp idle connection in tcp_wait
- add metric time to emit first token
* server/bench:
- fix when prometheus not started
- wait for server to be ready before starting bench
2025-01-02 18:06:12 +01:00
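Note on the `[DONE]` terminator mentioned in the bench fixes above: a minimal sketch of how a client might consume such a stream. It assumes server-sent-events framing with a `data:` prefix and JSON payloads, as used by OpenAI-style streaming; the function name `iter_stream_events` is hypothetical and is not taken from the bench scripts.

```python
import json
from typing import Iterable, Iterator


def iter_stream_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads until the [DONE] sentinel is seen.

    Assumption: each event arrives as a line of the form "data: <payload>",
    with blank lines separating events.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream marker, stop reading
        yield json.loads(payload)
```

Anything that follows the sentinel is ignored, so a trailing blank line after `[DONE]` is harmless.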
Xuan Son Nguyen
0da5d86026
server : allow using LoRA adapters per-request ( #10994 )
...
* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* lora_base
* remove redundant check
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-02 15:05:18 +01:00
Xuan Son Nguyen
45095a61bf
server : clean up built-in template detection ( #11026 )
...
* server : clean up built-in template detection
* fix compilation
* add chat template test
* fix condition
2024-12-31 15:22:01 +01:00
Xuan Son Nguyen
5896c65232
server : add OAI compat for /v1/completions ( #10974 )
...
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs
2024-12-31 12:34:13 +01:00
Isaac McFadyen
f865ea149d
server: added more docs for response_fields field ( #10995 )
2024-12-28 16:09:19 +01:00
Alexey Parfenov
16cdce7b68
server : fix token duplication when streaming with stop strings ( #10997 )
2024-12-28 16:08:54 +01:00
Concedo
7c671f289e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# examples/cvector-generator/mean.hpp
# examples/cvector-generator/pca.hpp
# examples/export-lora/export-lora.cpp
# examples/rpc/rpc-server.cpp
# examples/run/README.md
# examples/run/run.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# scripts/compare-llama-bench.py
# scripts/hf.sh
# tests/test-chat-template.cpp
2024-12-28 12:48:34 +08:00
Reza Kakhki
9ba399dfa7
server : add support for "encoding_format": "base64" to the */embeddings endpoints ( #10967 )
...
* add support for base64
* fix base64 test
* improve test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 21:33:04 +01:00
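Note on `"encoding_format": "base64"` above: a minimal sketch of decoding such an embedding on the client side. It assumes the payload is a base64 string of packed little-endian 32-bit floats, which is the convention used by the OpenAI embeddings API; the commit message itself does not state the byte layout, and both helper names are hypothetical.

```python
import base64
import struct
from typing import List


def decode_embedding(b64: str) -> List[float]:
    """Decode a base64 string of packed float32 values (assumed little-endian)."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


def encode_embedding(values: List[float]) -> str:
    """Inverse of decode_embedding, handy for building test fixtures."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(packed).decode("ascii")
```

Values that are not exactly representable as float32 will come back slightly rounded after a round trip.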
Djip007
2cd43f4900
ggml : more perfo with llamafile tinyblas on x86_64 ( #10714 )
...
* more perfo with llamafile tinyblas on x86_64.
- add bf16 support
- change dispatch strategy (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispatch and more cache friendly
* tinyblas dynamic dispatching
* sgemm: add M blocks.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
2024-12-24 18:54:49 +01:00
NeverLucky
09fe2e7613
server: allow filtering llama server response fields ( #10940 )
...
* llama_server_response_fields
* llama_server_response_fields_fix_issues
* params fixes
* fix
* clarify docs
* change to "response_fields"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 17:39:49 +01:00
Xuan Son Nguyen
14b699ecde
server : fix missing model id in /model endpoint ( #10957 )
...
* server : fix missing model id in /model endpoint
* fix ci
2024-12-23 12:52:25 +01:00
Xuan Son Nguyen
485dc01214
server : add system_fingerprint to chat/completion ( #10917 )
...
* server : add system_fingerprint to chat/completion
* update README
2024-12-23 12:02:44 +01:00
Concedo
4c56b7cada
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/gbnf-validator/gbnf-validator.cpp
# examples/llava/clip.cpp
# examples/run/README.md
# examples/run/run.cpp
# examples/server/README.md
# ggml/src/ggml-cpu/CMakeLists.txt
# src/llama.cpp
# tests/test-grammar-integration.cpp
# tests/test-llama-grammar.cpp
2024-12-21 09:41:49 +08:00
Xuan Son Nguyen
0ca416c91a
server : (UI) fix copy to clipboard function ( #10916 )
2024-12-20 14:12:06 +01:00
Xuan Son Nguyen
57bb2c40cd
server : fix logprobs, make it OAI-compatible ( #10783 )
...
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before apply softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
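Note on the "sort before apply softmax" and "add std::log" bullets above: a minimal sketch of computing top-n log-probabilities from raw logits. This is an illustration of the general technique only, not the server's implementation; the function name and the dictionary input are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def top_logprobs(logits: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n most likely tokens with their log-probabilities."""
    if not logits:
        return []
    # Sort candidates by logit, highest first.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = ranked[0][1]
    log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for _, v in ranked))
    # log(softmax(x)) = x - log(sum(exp(x)))
    return [(token, value - log_norm) for token, value in ranked[:n]]
```

The normalization runs over all candidates, so the returned values are log-probabilities under the full distribution, not only under the top n.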
Concedo
ee486bad3e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# examples/CMakeLists.txt
# examples/batched/batched.cpp
# examples/gritlm/gritlm.cpp
# examples/llama.android/llama/build.gradle.kts
# examples/main/README.md
# examples/retrieval/retrieval.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml.c
# scripts/compare-commits.sh
# scripts/sync-ggml.last
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
# tests/test-sampling.cpp
2024-12-19 11:57:43 +08:00
Gaetan Bisson
7bbb5acf12
server: avoid overwriting Authorization header ( #10878 )
...
* server: avoid overwriting Authorization header
If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.
Fixes https://github.com/ggerganov/llama.cpp/issues/10854
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-18 15:00:07 +01:00
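Note on the Authorization header fix above: a minimal sketch of the described behaviour, where the header is set only when an API key is configured and is otherwise left as is. The function name and the `Bearer` scheme are assumptions made for illustration; the actual change lives in the server's web UI code.

```python
from typing import Dict, Optional


def build_headers(existing: Dict[str, str], api_key: Optional[str]) -> Dict[str, str]:
    """Return request headers, overriding Authorization only if an API key is set."""
    headers = dict(existing)  # copy so the caller's dict is not mutated
    if api_key:
        # Assumption: the key is sent as a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

With no key, a header injected by an authenticating proxy passes through unchanged.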
Georgi Gerganov
152610eda9
server : output embeddings for all tokens when pooling = none ( #10861 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
Georgi Gerganov
0e70ba686e
server : add "tokens" output ( #10853 )
...
* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return tokens ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 11:05:29 +02:00
Xuan Son Nguyen
46828872c3
server : (embeddings) using same format for "input" and "content" ( #10872 )
...
* server : (embeddings) using same format for "input" and "content"
* fix test case
* handle empty input case
* fix test
2024-12-18 10:55:09 +02:00
krystiancha
05c3a444b8
server : fill usage info in embeddings and rerank responses ( #10852 )
...
* server : fill usage info in embeddings response
* server : fill usage info in reranking response
2024-12-17 18:00:24 +02:00
Xuan Son Nguyen
227d7c5a7f
server : (UI) fix missing async generator on safari ( #10857 )
...
* server : (UI) fix missing async generator on safari
* fix
2024-12-17 09:52:09 +01:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler ( #10803 )
...
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
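Note on "allow penalty_last_n == -1 to be equal to context size" above: a minimal sketch of resolving the effective penalty window. The function name is hypothetical, and passing other values through unchanged is an assumption; the commit message only describes the `-1` case.

```python
def resolve_penalty_last_n(penalty_last_n: int, n_ctx: int) -> int:
    """Map the sentinel -1 to the context size; return other values unchanged."""
    if penalty_last_n == -1:
        return n_ctx
    return penalty_last_n
```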
Vinesh Janarthanan
5478bbcd17
server: (UI) add syntax highlighting and latex math rendering ( #10808 )
...
* add code highlighting and math formatting
* code cleanup
* build public/index.html
* rebuild public/index.html
* fixed coding style
* fixed coding style
* style fixes
* highlight: smaller bundle size, fix light & dark theme
* remove katex
* add bundle size check
* add more languages
* add php
* reuse some langs
* use gzip
* Revert "remove katex"
This reverts commit c0e5046accd10be3f83018cffdc29a652849fc61.
* use better maintained @vscode/markdown-it-katex
* fix gzip non deterministic
* ability to add a demo conversation for dev
* fix latex rendering
* add comment
* latex codeblock as code
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-15 12:55:54 +01:00
Concedo
f456ed7237
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .devops/tools.sh
# .github/workflows/build.yml
# Makefile
# README.md
# common/CMakeLists.txt
# common/common.h
# examples/llava/CMakeLists.txt
# examples/run/CMakeLists.txt
# examples/run/README.md
# examples/run/run.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# tests/test-backend-ops.cpp
# tests/test-rope.cpp
2024-12-15 15:30:10 +08:00
Michelle Tan
89d604f2c8
server: Fix has_next_line in JSON response ( #10818 )
...
* Update server JSON response.
* Add unit test to check `has_new_line` JSON response
* Remove `has_new_line` unit test changes.
* Address code review comment: type check for `has_new_line` in unit test
2024-12-14 23:29:45 +01:00
cduk
56eea0781c
Removes spurious \r in output that causes logging in journalctl to treat lines as binary and therefore hidden by default ( #10771 )
...
Signed-off-by: Charles Darke <s.cduk@toodevious.com>
Co-authored-by: Charles Darke <s.cduk@toodevious.com>
2024-12-13 23:21:49 +01:00
Concedo
ed75f8a741
up to date merge, without vulkan-gen-shaders. They will be built before each release from now on, as they are very large
2024-12-13 17:18:01 +08:00
Concedo
de64b9198c
merge checkpoint 2 - functional merge without q4_0_4_4 (need regen shaders)
2024-12-13 17:04:19 +08:00
Concedo
4c4ce5e808
rewritten checkpoint 1 - before coopmat
2024-12-13 16:55:23 +08:00
Xuan Son Nguyen
adffa6ffd5
common : improve -ctv -ctk CLI arguments ( #10806 )
...
* common : improve ctv ctk cli argument
* regenerate docs
* even better approach
* use std::vector
2024-12-12 22:53:05 +01:00
CentricStorm
5555c0c1f6
docs: update server streaming mode documentation ( #9519 )
...
Provide more documentation for streaming mode.
2024-12-11 23:40:40 +01:00