Georgi Gerganov
152610eda9
server : output embeddings for all tokens when pooling = none ( #10861 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler ( #10803 )
...
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
Concedo
f456ed7237
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .devops/tools.sh
# .github/workflows/build.yml
# Makefile
# README.md
# common/CMakeLists.txt
# common/common.h
# examples/llava/CMakeLists.txt
# examples/run/CMakeLists.txt
# examples/run/README.md
# examples/run/run.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# tests/test-backend-ops.cpp
# tests/test-rope.cpp
2024-12-15 15:30:10 +08:00
Eric Curtin
c27ac678dd
Opt class for positional argument handling ( #10508 )
...
Added support for positional arguments `model` and `prompt`. Added
functionality to download via strings like:
llama-run llama3
llama-run ollama://granite-code
llama-run ollama://granite-code:8b
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
llama-run https://example.com/some-file1.gguf
llama-run some-file2.gguf
llama-run file://some-file3.gguf
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-13 19:34:25 +01:00
Concedo
ed75f8a741
up to date merge, without vulkan-gen-shaders. They will be built before each release from now on, as they are very large
2024-12-13 17:18:01 +08:00
Concedo
de64b9198c
merge checkpoint 2 - functional merge without q4_0_4_4 (need regen shaders)
2024-12-13 17:04:19 +08:00
Concedo
4c4ce5e808
rewritten checkpoint 1 - before coopmat
2024-12-13 16:55:23 +08:00
Xuan Son Nguyen
adffa6ffd5
common : improve -ctv -ctk CLI arguments ( #10806 )
...
* common : improve ctv ctk cli argument
* regenerate docs
* even better approach
* use std::vector
2024-12-12 22:53:05 +01:00
Xuan Son Nguyen
9fdb124304
common : add missing env var for speculative ( #10801 )
2024-12-12 16:57:32 +01:00
Bartowski
ae4b922614
imatrix : Add imatrix to --no-context-shift ( #10766 )
...
This allows for setting the --no-context-shift value in llama-imatrix which is required for models like DeepSeek
2024-12-10 18:23:50 +01:00
Yüg
a86ad841f1
server : add flag to disable the web-ui ( #10762 ) ( #10751 )
...
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
2024-12-10 18:22:34 +01:00
Georgi Gerganov
c2a16c0bdb
server : fix free of spec context and batch ( #10651 )
...
ggml-ci
2024-12-07 11:52:44 +02:00
Xuan Son Nguyen
f162d45a21
common : bring back --no-warmup to server ( #10686 )
2024-12-06 13:29:05 +01:00
Xuan Son Nguyen
6c5bc0625f
server : (refactoring) do not rely on JSON internally ( #10643 )
...
* server : (refactoring) reduce usage of json internally
* move all response types to struct
* wip [no ci]
* many fixes
* add virtual function
* fix index
* minor style fix
* add std::move
* refactor handle_completions_generic
* add virtual functions
* remove server.hpp
* clarify server_sent_event RFC specs
* apply review comments
* fix model_alias and completion_probabilities
* small clean up
* remove virtual for to_json_oai_compat()
* naming oai_compat --> oaicompat
* fix unwanted recursive call
* update docs
2024-12-06 11:14:32 +01:00
Xuan Son Nguyen
642330ac7c
llama : add enum for built-in chat templates ( #10623 )
...
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
2024-12-02 22:10:19 +01:00
haopeng
64ed2091b2
server: Add "tokens per second" information in the backend ( #10548 )
...
* add cmake rvv support
* add timings
* remove space
* update readme
* fix
* fix code
* remove empty line
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-02 14:45:54 +01:00
Concedo
557bcaf86e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .clang-tidy
# .github/workflows/build.yml
# Makefile
# Package.swift
# common/CMakeLists.txt
# examples/batched-bench/CMakeLists.txt
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/cvector-generator/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/export-lora/CMakeLists.txt
# examples/gbnf-validator/CMakeLists.txt
# examples/gguf-split/CMakeLists.txt
# examples/gguf/CMakeLists.txt
# examples/gritlm/CMakeLists.txt
# examples/imatrix/CMakeLists.txt
# examples/infill/CMakeLists.txt
# examples/llama-bench/CMakeLists.txt
# examples/llava/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/main-cmake-pkg/CMakeLists.txt
# examples/main/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/perplexity/CMakeLists.txt
# examples/quantize-stats/CMakeLists.txt
# examples/quantize/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/run/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/server/CMakeLists.txt
# examples/simple-chat/CMakeLists.txt
# examples/simple/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/tokenize/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-backend.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# src/unicode.cpp
# tests/test-sampling.cpp
2024-11-30 12:24:51 +08:00
Concedo
ec95241e38
temp checkpoint
2024-11-30 11:59:27 +08:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend ( #10570 )
...
* ggml : move AMX to the CPU backend
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Johannes Gäßler
890719311b
common: fix warning message when no GPU found ( #10564 )
2024-11-28 18:15:25 +01:00
Xuan Son Nguyen
9f912511bc
common : fix duplicated file name with hf_repo and hf_file ( #10550 )
2024-11-27 22:30:52 +01:00
Georgi Gerganov
ab96610b1e
cmake : enable warnings in llama ( #10474 )
...
* cmake : enable warnings in llama
ggml-ci
* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS
* cmake : get_flags -> ggml_get_flags
* speculative-simple : fix warnings
* cmake : reuse ggml_get_flags
ggml-ci
* speculative-simple : fix compile warning
ggml-ci
2024-11-26 14:18:08 +02:00
Concedo
ec581b19d8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# Package.swift
# examples/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/llama-bench/llama-bench.cpp
# examples/server/README.md
# examples/server/server.cpp
# examples/simple-chat/simple-chat.cpp
# examples/simple/simple.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-amx/CMakeLists.txt
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-kompute/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# pocs/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2024-11-26 17:01:20 +08:00
Georgi Gerganov
9fd8c2687f
server : add more information about error ( #10455 )
2024-11-25 22:28:59 +02:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model ( #10497 )
...
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends ( #10469 )
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Concedo
83350ec314
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# .github/workflows/build.yml
# Makefile
# common/CMakeLists.txt
# examples/CMakeLists.txt
# examples/infill/infill.cpp
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/kernels/CMakeLists.txt
# ggml/src/ggml-cann/kernels/dup.cpp
# ggml/src/ggml-cann/kernels/get_row_f16.cpp
# ggml/src/ggml-cann/kernels/get_row_f32.cpp
# ggml/src/ggml-cann/kernels/get_row_q4_0.cpp
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2024-11-25 16:26:08 +08:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
Concedo
091a432cf6
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/full-cuda.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/llama-cli-cuda.Dockerfile
# .devops/llama-cli-intel.Dockerfile
# .devops/llama-cli-musa.Dockerfile
# .devops/llama-cli-vulkan.Dockerfile
# .devops/llama-server-cuda.Dockerfile
# .devops/llama-server-intel.Dockerfile
# .devops/llama-server-musa.Dockerfile
# .devops/llama-server-vulkan.Dockerfile
# .gitignore
# CMakeLists.txt
# Makefile
# cmake/llama-config.cmake.in
# docs/backend/SYCL.md
# docs/build.md
# examples/llama-bench/llama-bench.cpp
# flake.lock
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-backend.cpp
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2024-11-21 16:26:24 +08:00
Concedo
282a647689
Merge commit ' 467576b6cc
' into concedo_experimental
...
# Conflicts:
# .gitignore
# Makefile
# README.md
# common/common.h
# docs/build.md
# examples/infill/infill.cpp
# examples/perplexity/perplexity.cpp
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.sh
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tests/test-quantize-perf.cpp
2024-11-21 16:05:21 +08:00
Georgi Gerganov
8e752a777b
llama : add check for KV cache shifts ( #10401 )
...
ggml-ci
2024-11-19 13:29:26 +02:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument ( #10339 )
2024-11-16 23:00:41 +01:00
Concedo
70aee82552
attempts a backflip, but does he stick the landing?
2024-11-16 17:05:45 +08:00
Diego Devesa
ae8de6d50a
ggml : build backends as libraries ( #10256 )
...
* ggml : build backends as libraries
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Concedo
df080b074d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# examples/server/README.md
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/CMakeLists.txt
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2024-11-14 21:40:52 +08:00
Georgi Gerganov
b141e5f6ef
server : enable KV cache defrag by default ( #10233 )
...
ggml-ci
2024-11-11 08:38:43 +02:00
Concedo
a244b1ffd2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# Makefile
# Package.swift
# ci/run.sh
# docs/backend/SYCL.md
# examples/llama-bench/llama-bench.cpp
# examples/server/CMakeLists.txt
# examples/server/README.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# grammars/README.md
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/run-json-schema-to-grammar.mjs
# tests/test-backend-ops.cpp
2024-11-09 13:36:47 +08:00
Georgi Gerganov
5c333e0140
metal : add BF16 support ( #8439 )
...
* ggml : add initial BF16 support
ggml-ci
* metal : add mul_mat_id BF16 support
ggml-ci
* metal : check for bfloat support on the Metal device
ggml-ci
* metal : better var names [no ci]
* metal : do not build bfloat kernels when not supported
ggml-ci
* metal : try to fix BF16 support check
ggml-ci
* metal : this should correctly check bfloat support
2024-11-06 19:53:51 +02:00
Concedo
bb13925f39
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakePresets.json
# Makefile
# Package.swift
# ci/run.sh
# common/CMakeLists.txt
# examples/CMakeLists.txt
# flake.lock
# ggml/src/CMakeLists.txt
# ggml/src/ggml-backend.cpp
# ggml/src/ggml.c
# pocs/vdot/q8dot.cpp
# pocs/vdot/vdot.cpp
# tests/test-backend-ops.cpp
# tests/test-grad0.cpp
# tests/test-quantize-fns.cpp
# tests/test-quantize-perf.cpp
# tests/test-rope.cpp
2024-11-04 16:54:53 +08:00
Diego Devesa
9f40989351
ggml : move CPU backend to a separate file ( #10144 )
2024-11-03 19:34:08 +01:00
Georgi Gerganov
1926d6e39d
llama : adjust default context size + print warnings ( #10136 )
...
* llama : adjust default context size + print warnings
ggml-ci
* ggml-ci : add missing gpu-layers + adjust context sizes
2024-11-02 15:18:56 +02:00
Concedo
a46f8acd03
note: also has support for completion tokens count
2024-11-01 00:44:14 +08:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling ( #10071 )
...
ggml-ci
2024-10-29 10:42:05 +02:00
wwoodsTM
ff252ea48e
llama : add DRY sampler ( #9702 )
...
* sampling : add DRY sampler (post-refactor)
* DRY: Trying to fix coauthors, removed unneeded line
* DRY: Fixed redundant code
* DRY: Fixed crash issue due to DRY being in chain but uninitialized
---------
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy
d80fb71f8b
llama: string_split fix ( #10022 )
...
* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces
* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00
Concedo
94a5a27b85
Alone in the darkness
...
They're coming for you
I know they will try to catch me too
Alone in the darkness
They're calling for you
There's nowhere to run for cover
2024-10-24 22:29:20 +08:00
Daniel Bevenius
674804a996
arg : fix typo in embeddings argument help [no ci] ( #9994 )
...
This commit fixes two typos in the help text for the `--embd-normalize`
and `--embd-separator` arguments. It also updates common.h which contain
the same typo in two comments.
2024-10-22 10:40:02 +03:00
Daniel Bevenius
94008cc760
arg : fix attention non-causal arg value hint ( #9985 )
...
This commit updates the argument value hint for the `--attention`
argument to `non-causal`.
The motivation for this change is that the only values for this argument
are `causal` and `non-causal`.
2024-10-21 21:12:52 +03:00
Georgi Gerganov
f594bc80ba
ggml : add asserts for type conversion in fattn kernels ( #9971 )
...
ggml-ci
2024-10-21 16:20:46 +03:00
Georgi Gerganov
55e47786e3
llama : default sampling changes + greedy update ( #9897 )
...
* llama : deprecate softmax sampler + fix dist sampler
ggml-ci
* tests : replace macros with functions
ggml-ci
* sampling : change temperature sampler logic
For t <= 0.0f, keep the max logit intact and set the rest to -inf
* cont : no need for special "greedy" logic
top-k == 1 is the same
* tests : init prob correctly
* llama : handle temp <= 0.0 in the temp_ext sampler too
ggml-ci
* cont : avoid extra loop in temperature sampler for sub-zero temp
ggml-ci
2024-10-21 09:46:40 +03:00