Concedo
4c4ce5e808
rewritten checkpoint 1 - before coopmat
2024-12-13 16:55:23 +08:00
Xuan Son Nguyen
6c5bc0625f
server : (refactoring) do not rely on JSON internally ( #10643 )
...
* server : (refactoring) reduce usage of json internally
* move all response types to struct
* wip [no ci]
* many fixes
* add virtual function
* fix index
* minor style fix
* add std::move
* refactor handle_completions_generic
* add virtual functions
* remove server.hpp
* clarify server_sent_event RFC specs
* apply review comments
* fix model_alias and completion_probabilities
* small clean up
* remove virtual for to_json_oai_compat()
* naming oai_compat --> oaicompat
* fix unwanted recursive call
* update docs
2024-12-06 11:14:32 +01:00
Plamen Minev
7736837d62
fix(server) : not show alert when DONE is received ( #10674 )
2024-12-05 22:36:41 +01:00
Georgi Gerganov
1da7b76569
server : fix speculative decoding with context shift ( #10641 )
...
* server : fix speculative decoding with context shift
ggml-ci
* server : take into account speculative limits
ggml-ci
* server : add tests
2024-12-04 22:38:20 +02:00
Xuan Son Nguyen
91c36c269b
server : (web ui) Various improvements, now use vite as bundler ( #10599 )
...
* hide buttons in dropdown menu
* use npm as deps manager and vite as bundler
* fix build
* fix build (2)
* fix responsive on mobile
* fix more problems on mobile
* sync build
* (test) add CI step for verifying build
* fix ci
* force rebuild .hpp files
* cmake: clean up generated files pre build
2024-12-03 19:38:44 +01:00
Nikolaos Pothitos
82bca2257b
readme : add option, update default value, fix formatting ( #10271 )
...
* readme : document --no-display-prompt
* readme : update default prompt context size
* readme : remove unnecessary indentation
Indenting a line with four spaces makes Markdown treat that section as
plain text.
* readme : indent commands under bullets
* readme : indent commands in lettered list
2024-12-03 12:50:08 +02:00
Georgi Gerganov
70b98fadbc
server : fix default draft model parameters ( #10586 )
...
* server : force F16 KV cache for the draft model
ggml-ci
* server : fix draft params
ggml-ci
* server : various params fixes
ggml-ci
2024-12-03 11:20:00 +02:00
Xuan Son Nguyen
642330ac7c
llama : add enum for built-in chat templates ( #10623 )
...
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
2024-12-02 22:10:19 +01:00
Georgi Gerganov
8648c52101
make : deprecate ( #10514 )
...
* make : deprecate
ggml-ci
* ci : disable Makefile builds
ggml-ci
* docs : remove make references [no ci]
* ci : disable swift build
ggml-ci
* docs : remove obsolete make references, scripts, examples
ggml-ci
* basic fix for compare-commits.sh
* update build.md
* more build.md updates
* more build.md updates
* more build.md updates
* Update Makefile
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-12-02 21:22:53 +02:00
haopeng
64ed2091b2
server: Add "tokens per second" information in the backend ( #10548 )
...
* add cmake rvv support
* add timings
* remove space
* update readme
* fix
* fix code
* remove empty line
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-02 14:45:54 +01:00
alek3y
86dc11c5bc
server : bind to any port when specified ( #10590 )
2024-12-01 13:33:12 +02:00
Concedo
697ca70115
temp checkpoint
2024-11-30 12:13:20 +08:00
Concedo
ec95241e38
temp checkpoint
2024-11-30 11:59:27 +08:00
Concedo
0c8939be19
temp checkpoint
2024-11-30 11:57:28 +08:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend ( #10570 )
...
* ggml : move AMX to the CPU backend
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Xuan Son Nguyen
b782e5c7d4
server : add more test cases ( #10569 )
...
* server : add split model test
* add test speculative
* add invalid cases
2024-11-29 21:48:56 +01:00
Xuan Son Nguyen
6c59567689
server : (tests) don't use thread for capturing stdout/stderr, bump openai client library ( #10568 )
...
* server : (tests) don't use thread for capturing stdout/stderr
* test: bump openai to 1.55.2
* bump openai to 1.55.3
2024-11-28 19:17:49 +01:00
Xuan Son Nguyen
9f912511bc
common : fix duplicated file name with hf_repo and hf_file ( #10550 )
2024-11-27 22:30:52 +01:00
Xuan Son Nguyen
45abe0f74e
server : replace behave with pytest ( #10416 )
...
* server : replace behave with pytest
* fix test on windows
* misc
* add more tests
* more tests
* styling
* log less, fix embd test
* added all sequential tests
* fix coding style
* fix save slot test
* add parallel completion test
* fix parallel test
* remove feature files
* update test docs
* no cache_prompt for some tests
* add test_cache_vs_nocache_prompt
2024-11-26 16:20:18 +01:00
Concedo
288cd4cf10
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# .github/workflows/nix-ci.yml
# .github/workflows/python-lint.yml
# CMakeLists.txt
# common/CMakeLists.txt
# examples/CMakeLists.txt
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# src/CMakeLists.txt
2024-11-26 21:04:24 +08:00
Georgi Gerganov
84e1c33cde
server : fix parallel speculative decoding ( #10513 )
...
ggml-ci
2024-11-26 13:36:40 +02:00
Concedo
ec581b19d8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# Package.swift
# examples/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/llama-bench/llama-bench.cpp
# examples/server/README.md
# examples/server/server.cpp
# examples/simple-chat/simple-chat.cpp
# examples/simple/simple.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-amx/CMakeLists.txt
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-kompute/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# pocs/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2024-11-26 17:01:20 +08:00
Georgi Gerganov
47f931c8f9
server : enable cache_prompt by default ( #10501 )
...
ggml-ci
2024-11-25 21:50:07 +02:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model ( #10497 )
...
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
brucepro
a9a678a6b2
Add download chat feature to server chat ( #10481 )
...
* Add download chat feature to server chat
Add a download feature next to the delete chat feature in the server vue chat interface.
* code style
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-25 17:11:55 +01:00
Georgi Gerganov
9ca2e67762
server : add speculative decoding support ( #10455 )
...
* server : add speculative decoding support
ggml-ci
* server : add helper function slot.can_speculate()
ggml-ci
2024-11-25 16:31:38 +02:00
Concedo
83350ec314
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# .github/workflows/build.yml
# Makefile
# common/CMakeLists.txt
# examples/CMakeLists.txt
# examples/infill/infill.cpp
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/kernels/CMakeLists.txt
# ggml/src/ggml-cann/kernels/dup.cpp
# ggml/src/ggml-cann/kernels/get_row_f16.cpp
# ggml/src/ggml-cann/kernels/get_row_f32.cpp
# ggml/src/ggml-cann/kernels/get_row_q4_0.cpp
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2024-11-25 16:26:08 +08:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
Concedo
282a647689
Merge commit ' 467576b6cc
' into concedo_experimental
...
# Conflicts:
# .gitignore
# Makefile
# README.md
# common/common.h
# docs/build.md
# examples/infill/infill.cpp
# examples/perplexity/perplexity.cpp
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tests/test-quantize-perf.cpp
2024-11-21 16:05:21 +08:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument ( #10339 )
2024-11-16 23:00:41 +01:00
MaggotHATE
bcdb7a2386
server: (web UI) Add samplers sequence customization ( #10255 )
...
* Samplers sequence: simplified and input field.
* Removed unused function
* Modify and use `settings-modal-short-input`
* rename "name" --> "label"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-16 14:26:54 +01:00
Concedo
590553ef07
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/llama-cli-intel.Dockerfile
# .devops/llama-server-intel.Dockerfile
# .github/workflows/build.yml
# CMakePresets.json
# Makefile
# docs/backend/SYCL.md
# docs/build.md
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# scripts/compare-llama-bench.py
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
2024-11-16 17:20:14 +08:00
Xuan Son Nguyen
9901068ac7
server : (web UI) add copy button for code block, fix api key ( #10242 )
...
* server : (web ui) add copy btn for code blocks
* fix problem with api key
* use settings-modal-short-input component
* always show copy btn for code snippet
2024-11-15 10:48:49 +01:00
Concedo
df080b074d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/server/README.md
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/CMakeLists.txt
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2024-11-14 21:40:52 +08:00
Alexey Parfenov
ff7fb670d0
server : add missing docs ( #10269 )
2024-11-13 13:16:30 +02:00
Jhen-Jie Hong
0e712a5acb
server : fix incorrect res in validate_model_chat_template ( #10272 )
...
* server : fix validate_model_chat_template
* server : fix chat res
2024-11-13 13:15:23 +02:00
Georgi Gerganov
b141e5f6ef
server : enable KV cache defrag by default ( #10233 )
...
ggml-ci
2024-11-11 08:38:43 +02:00
MaggotHATE
505f33274d
server : (web UI) Add back sampler settings ( #10239 )
...
* Add back samplers to server
* Added tooltips with basic information
* Fixed stretching of input fields.
* use component for settings input, move help msg to tooltips
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
Concedo
a244b1ffd2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# Makefile
# Package.swift
# ci/run.sh
# docs/backend/SYCL.md
# examples/llama-bench/llama-bench.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/run-json-schema-to-grammar.mjs
# tests/test-backend-ops.cpp
2024-11-09 13:36:47 +08:00
Xuan Son Nguyen
76c6e7f105
server : minor UI fix ( #10207 )
2024-11-07 18:44:38 -04:00
Xuan Son Nguyen
a71d81cf8c
server : revamp chat UI with vuejs and daisyui ( #10175 )
...
* server : simple chat UI with vuejs and daisyui
* move old files to legacy folder
* embed deps into binary
* basic markdown support
* add conversation history, save to localStorage
* fix bg-base classes
* save theme preferences
* fix tests
* regenerate, edit, copy buttons
* small fixes
* docs: how to use legacy ui
* better error handling
* make CORS preflight more explicit
* add GET method for CORS
* fix tests
* clean up a bit
* better auto scroll
* small fixes
* use collapse-arrow
* fix closeAndSaveConfigDialog
* small fix
* remove console.log
* fix style for <pre> element
* lighter bubble color (less distract when reading)
2024-11-07 17:31:10 -04:00
Concedo
628dcd640e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# examples/server/README.md
2024-11-06 23:13:00 +08:00
Georgi Gerganov
b11f9ba9b8
server : remove hack for extra parallel slot ( #10187 )
...
ggml-ci
2024-11-06 13:29:01 +02:00
Xuan Son Nguyen
9e0ecfb697
server : clarify /slots endpoint, add is_processing ( #10162 )
...
* server : clarify /slots endpoint, add is_processing
* fix tests
2024-11-04 16:33:29 +01:00
Concedo
bb13925f39
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakePresets.json
# Makefile
# Package.swift
# ci/run.sh
# common/CMakeLists.txt
# examples/CMakeLists.txt
# flake.lock
# ggml/src/CMakeLists.txt
# ggml/src/ggml-backend.cpp
# ggml/src/ggml.c
# pocs/vdot/q8dot.cpp
# pocs/vdot/vdot.cpp
# tests/test-backend-ops.cpp
# tests/test-grad0.cpp
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
# tests/test-rope.cpp
2024-11-04 16:54:53 +08:00
sasha0552
42cadc74bd
server : fix slot selection by lru ( #10126 )
...
* server : fix slot selection by lru, migrate lcs to `size_t`
* minor debug log fix
2024-11-02 18:34:56 +02:00
Georgi Gerganov
45950415ed
server : fix endpoint checks ( #10135 )
...
ggml-ci
2024-11-02 18:34:00 +02:00
Concedo
bc30ebd044
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# Makefile
# README.md
# examples/CMakeLists.txt
# examples/main/README.md
# ggml/src/CMakeLists.txt
# ggml/src/kompute-shaders/common.comp
# scripts/sync-ggml.last
# src/llama.cpp
2024-11-02 21:57:29 +08:00
sasha0552
d865d1478c
server : fix smart selection of available slot ( #10120 )
...
* Fix smart selection of available slot
* minor fix
* replace vectors of tokens with shorthands
2024-11-01 14:33:14 +01:00
Concedo
a46f8acd03
note: also has support for completion tokens count
2024-11-01 00:44:14 +08:00