Commit graph

368 commits

Author SHA1 Message Date
Concedo
091a432cf6 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/llama-cli-cann.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-cli-musa.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.devops/llama-server-musa.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	cmake/llama-config.cmake.in
#	docs/backend/SYCL.md
#	docs/build.md
#	examples/llama-bench/llama-bench.cpp
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
#	ggml/src/ggml-blas/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2024-11-21 16:26:24 +08:00
Concedo
282a647689 Merge commit '467576b6cc' into concedo_experimental
# Conflicts:
#	.gitignore
#	Makefile
#	README.md
#	common/common.h
#	docs/build.md
#	examples/infill/infill.cpp
#	examples/perplexity/perplexity.cpp
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tests/test-quantize-perf.cpp
2024-11-21 16:05:21 +08:00
Georgi Gerganov
8e752a777b
llama : add check for KV cache shifts (#10401)
ggml-ci
2024-11-19 13:29:26 +02:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument (#10339) 2024-11-16 23:00:41 +01:00
Concedo
70aee82552 attempts a backflip, but does he stick the landing? 2024-11-16 17:05:45 +08:00
Diego Devesa
ae8de6d50a
ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Concedo
a244b1ffd2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	Package.swift
#	ci/run.sh
#	docs/backend/SYCL.md
#	examples/llama-bench/llama-bench.cpp
#	examples/server/CMakeLists.txt
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	grammars/README.md
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/run-json-schema-to-grammar.mjs
#	tests/test-backend-ops.cpp
2024-11-09 13:36:47 +08:00
Georgi Gerganov
5c333e0140
metal : add BF16 support (#8439)
* ggml : add initial BF16 support

ggml-ci

* metal : add mul_mat_id BF16 support

ggml-ci

* metal : check for bfloat support on the Metal device

ggml-ci

* metal : better var names [no ci]

* metal : do not build bfloat kernels when not supported

ggml-ci

* metal : try to fix BF16 support check

ggml-ci

* metal : this should correctly check bfloat support
2024-11-06 19:53:51 +02:00
Concedo
bb13925f39 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakePresets.json
#	Makefile
#	Package.swift
#	ci/run.sh
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	flake.lock
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
#	ggml/src/ggml.c
#	pocs/vdot/q8dot.cpp
#	pocs/vdot/vdot.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grad0.cpp
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-rope.cpp
2024-11-04 16:54:53 +08:00
Diego Devesa
9f40989351
ggml : move CPU backend to a separate file (#10144) 2024-11-03 19:34:08 +01:00
Concedo
a46f8acd03 note: also has support for completion tokens count 2024-11-01 00:44:14 +08:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling (#10071)
ggml-ci
2024-10-29 10:42:05 +02:00
wwoodsTM
ff252ea48e
llama : add DRY sampler (#9702)
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy
d80fb71f8b
llama: string_split fix (#10022)
* llama: Refactor string_split to use template specialization,  fixes parsing strings with spaces

* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00
Concedo
94a5a27b85 Alone in the darkness
They're coming for you
I know they will try to catch me too
Alone in the darkness
They're calling for you
There's nowhere to run for cover
2024-10-24 22:29:20 +08:00
Georgi Gerganov
f594bc80ba
ggml : add asserts for type conversion in fattn kernels (#9971)
ggml-ci
2024-10-21 16:20:46 +03:00
Xuan Son Nguyen
cda0e4b648
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
* refactor llama_batch_get_one

* adapt all examples

* fix simple.cpp

* fix llama_bench

* fix

* fix context shifting

* free batch before return

* use common_batch_add, reuse llama_batch in loop

* null terminated seq_id list

* fix save-load-state example

* fix perplexity

* correct token pos in llama_batch_allocr
2024-10-18 23:18:01 +02:00
Concedo
a9dbcdd3ec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	docs/build.md
#	examples/infill/infill.cpp
#	examples/main/README.md
#	examples/server/README.md
#	flake.lock
#	scripts/sync-ggml.last
#	src/llama.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-sampling.cpp
2024-10-17 16:36:02 +08:00
MaggotHATE
fbc98b748e
sampling : add XTC sampler (#9742)
* Initial XTC commit

Adds XTC sampler, not activated by default, but recommended settings by default.

* Cleanup

* Simplified chances calculation

To be more inline with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproduceable 

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
2024-10-15 12:54:55 +02:00
Georgi Gerganov
11ac9800af
llama : improve infill support and special token detection (#9798)
* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (#9800)

* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00
Concedo
e692a79aab Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	CONTRIBUTING.md
#	docs/android.md
#	docs/docker.md
#	examples/embedding/embedding.cpp
#	examples/imatrix/imatrix.cpp
#	examples/infill/infill.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/main/README.md
#	examples/parallel/parallel.cpp
#	examples/perplexity/perplexity.cpp
#	examples/quantize-stats/quantize-stats.cpp
#	examples/save-load-state/save-load-state.cpp
#	examples/server/README.md
#	examples/simple/CMakeLists.txt
#	examples/speculative/speculative.cpp
#	flake.lock
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-blas.cpp
#	pocs/vdot/q8dot.cpp
#	pocs/vdot/vdot.cpp
#	scripts/debug-test.sh
#	scripts/sync-ggml.last
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-tokenizer-0.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-spm.cpp
2024-10-11 11:59:59 +08:00
Diego Devesa
7eee341bee
common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
Concedo
da6cf261a8 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/nix-ci-aarch64.yml
#	.github/workflows/nix-ci.yml
#	README.md
#	ci/run.sh
#	examples/server/README.md
#	ggml/src/ggml-cuda.cu
#	ggml/src/ggml-metal.m
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2024-10-05 22:24:08 +08:00
Georgi Gerganov
8c475b97b8
rerank : use [SEP] token instead of [BOS] (#9737)
* rerank : use [SEP] token instead of [BOS]

ggml-ci

* common : sanity check for non-NULL tokens

ggml-ci

* ci : adjust rank score interval

ggml-ci

* ci : add shebang to run.sh

ggml-ci
2024-10-05 15:55:04 +03:00
Concedo
ce7f9c9a2c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-rocm.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.github/workflows/build.yml
#	.github/workflows/python-type-check.yml
#	CMakeLists.txt
#	CONTRIBUTING.md
#	README.md
#	ci/run.sh
#	examples/embedding/embedding.cpp
#	examples/server/README.md
#	flake.lock
#	ggml/include/ggml.h
#	ggml/src/ggml.c
#	requirements/requirements-convert_legacy_llama.txt
#	scripts/sync-ggml.last
#	src/llama-vocab.cpp
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grad0.cpp
#	tests/test-tokenizer-0.cpp
2024-10-02 01:00:57 +08:00
matiaslin
faac0bae26
common : ensure llama_batch size does not exceed max size (#9668)
A crash was observed when the number of tokens added to a batch exceeds
llama_batch size. An assertion in llama_batch_add was added to protect
against llama_batch size overflow.
2024-09-29 15:25:00 +03:00
Georgi Gerganov
f4d2b8846a
llama : add reranking support (#9510)
* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classigication head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : aboud ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
Concedo
29625c3d2e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	common/CMakeLists.txt
#	common/common.cpp
#	docs/backend/SYCL.md
#	examples/embedding/embedding.cpp
#	examples/imatrix/imatrix.cpp
#	examples/infill/infill.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/main/README.md
#	examples/parallel/parallel.cpp
#	examples/perplexity/perplexity.cpp
#	examples/server/CMakeLists.txt
#	examples/server/README.md
#	examples/server/bench/README.md
#	examples/server/tests/README.md
#	examples/speculative/speculative.cpp
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	grammars/README.md
#	scripts/compare-commits.sh
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
2024-09-19 14:53:57 +08:00
Georgi Gerganov
6262d13e0b
common : reimplement logging (#9418)
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-15 20:46:12 +03:00
Concedo
e44ddf26ef Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	examples/embedding/embedding.cpp
#	examples/imatrix/imatrix.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/llava/MobileVLM-README.md
#	examples/parallel/parallel.cpp
#	examples/perplexity/perplexity.cpp
#	examples/quantize/CMakeLists.txt
#	examples/server/README.md
#	examples/speculative/speculative.cpp
#	tests/test-backend-ops.cpp
2024-09-13 16:17:24 +08:00
Georgi Gerganov
0abc6a2c25
llama : llama_perf + option to disable timings during decode (#9355)
* llama : llama_perf + option to disable timings during decode

ggml-ci

* common : add llama_arg

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* perf : separate functions in the API

ggml-ci

* perf : safer pointer handling + naming update

ggml-ci

* minor : better local var name

* perf : abort on invalid sampler pointer

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-13 09:53:38 +03:00
Ahmad Tameem
2b00fa7997
riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)
- Added ggml_cpu_has_riscv_v() in GGML to print system info in log
- Modified Makefile to only use flag when cross compiling for RISC-V
2024-09-12 14:24:31 +03:00
Concedo
3ba348c4ab Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
2024-09-12 11:18:13 +08:00
Concedo
13394368b6 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/embedding/embedding.cpp
#	examples/infill/infill.cpp
#	examples/perplexity/perplexity.cpp
#	flake.lock
#	src/llama-sampling.cpp
2024-09-11 20:27:53 +08:00
Farbod Bijary
67155ab7f5
feat: Implements retrying logic for downloading models using --model-url flag (#9255)
* feat: Implements retrying logic for downloading models using --model-url flag

* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* apply comments

* implements a retry function to avoid duplication

* fix editorconfig

* change function name

---------

Co-authored-by: farbod <farbod.bjary82@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-09-11 11:22:37 +02:00
Xuan Son Nguyen
6cd4e03444
arg : bring back missing ifdef (#9411)
* arg : bring back missing ifdef

* replace with llama_supports_gpu_offload
2024-09-10 22:41:29 +02:00
Concedo
a947558e0e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	common/CMakeLists.txt
#	common/common.cpp
#	common/common.h
#	examples/embedding/embedding.cpp
#	examples/imatrix/imatrix.cpp
#	examples/infill/infill.cpp
#	examples/parallel/parallel.cpp
#	examples/perplexity/perplexity.cpp
#	examples/rpc/README.md
#	examples/save-load-state/save-load-state.cpp
#	examples/server/README.md
#	examples/speculative/speculative.cpp
#	tests/test-sampling.cpp
2024-09-10 16:39:23 +08:00
Xuan Son Nguyen
bfe76d4a17
common : move arg parser code to arg.cpp (#9388)
* common : move arg parser to arg.cpp

* better categorize args

* add cmake

* missing climits

* missing cstdarg

* common : more explicit includes

* fix build

* refactor gpt_params_parse

* update server readme

* fix test

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-09 23:36:09 +02:00
Concedo
b63158005f All samplers moved to kcpp side 2024-09-09 18:14:11 +08:00
Concedo
12fd16bfd4 Merge commit 'df270ef745' into concedo_experimental
# Conflicts:
#	Makefile
#	common/CMakeLists.txt
#	common/common.h
#	common/sampling.cpp
#	common/sampling.h
#	examples/infill/infill.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/quantize-stats/quantize-stats.cpp
#	examples/server/server.cpp
#	include/llama.h
#	src/llama-sampling.cpp
#	src/llama-sampling.h
#	src/llama.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-llama-grammar.cpp
#	tests/test-sampling.cpp
2024-09-09 17:10:08 +08:00
Concedo
70cdb55cc9 Merge commit '947538acb8' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakePresets.json
#	examples/llama-bench/llama-bench.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2024-09-09 11:26:34 +08:00
Xuan Son Nguyen
3f7ccfd649
common : bring back missing args, add env var duplication check (#9375)
* common : bring back missing args

* move duplication check to test-arg-parser

* add check for duplicated env var

* correct default values
2024-09-08 18:08:55 +02:00
slaren
a249843d89
common : restore --n-gpu-layers (#9371) 2024-09-08 16:44:42 +02:00
Xuan Son Nguyen
00b02bb249
imatrix : fix arg parser for imatrix (#9366)
* imatrix : fix arg parser

* beautify printing first arg
2024-09-08 12:12:17 +02:00
Georgi Gerganov
faf69d4237
llama : sanitize invalid tokens (#9357)
* common : do not add null tokens during warmup

ggml-ci

* llama : check that the input tokens are valid

ggml-ci

* tests : fix batch size of bert model

ggml-ci
2024-09-08 00:33:13 +03:00
Xuan Son Nguyen
1b9ae5189c
common : refactor arg parser (#9308)
* (wip) argparser v3

* migrated

* add test

* handle env

* fix linux build

* add export-docs example

* fix build (2)

* skip build test-arg-parser on windows

* update server docs

* bring back missing --alias

* bring back --n-predict

* clarify test-arg-parser

* small correction

* add comments

* fix args with 2 values

* refine example-specific args

* no more lamba capture

Co-authored-by: slaren@users.noreply.github.com

* params.sparams

* optimize more

* export-docs --> gen-docs
2024-09-07 20:43:51 +02:00
Georgi Gerganov
df270ef745
llama : refactor sampling v2 (#9294)
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
2024-09-07 15:16:19 +03:00
Aarni Koskela
815b1fb20a
batched-bench : add --output-format jsonl option (#9293)
`--output-format` is modeled after `llama-bench`'s options
2024-09-06 17:59:58 +02:00
Concedo
73dca7e5bc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/nix/devshells.nix
#	.devops/nix/nixpkgs-instances.nix
#	.devops/nix/package.nix
#	.devops/nix/scope.nix
#	README.md
#	docs/docker.md
#	examples/llama-bench/llama-bench.cpp
#	flake.lock
#	flake.nix
#	grammars/README.md
#	src/llama.cpp
2024-09-06 01:07:31 +08:00
Radoslav Gerganov
82e3b03c11
rpc : make RPC servers come first in the device list (#9296)
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds
2024-09-04 11:08:32 +03:00