Zonghang Li
1ea2d61a97
speedup: add arg --keep-out-in-cuda to run the output layer on CUDA
2025-06-28 10:58:18 +04:00
Li, Zonghang
80e5b71b48
fix compute buffer estimate: tested on metal
2025-06-20 13:43:55 +04:00
Zonghang Li
dd589561b4
improve the computing buffer estimate
2025-06-19 08:02:43 +00:00
DeEMO
6ff38b2a0c
add args: data-port and signal-port
2025-06-17 12:00:04 +08:00
Lizonghang
500e066a2f
fix batch decoding and dynamic batching
2025-06-06 16:53:22 +04:00
Lizonghang
c54a6a0132
fix context shifting
2025-05-19 16:58:35 +04:00
Zonghang Li
bcfdace59b
add args -k and --force
2025-03-11 20:44:36 +04:00
Lizonghang
c84f9d29fe
use arg prefetch and remove arg unload
2025-02-12 17:04:41 +04:00
Lizonghang
ac5d63b09e
add explanation for why the output layer weights should be kept in metal shared memory
2025-01-25 23:51:16 +04:00
Lizonghang
1ca9a43bd1
keep the output layer weights in shared memory by default
2025-01-25 23:31:43 +04:00
Lizonghang
1c0087e919
rename arg --keep-inp-out-in-metal to --keep-out-in-metal
2025-01-23 23:17:06 +04:00
Lizonghang
78a544d716
add metal mem limit
2025-01-23 16:08:52 +04:00
Zonghang Li
33429ec4e1
add option --keep-inp-out-in-metal
2025-01-22 11:25:09 +04:00
Lizonghang
facb4ea736
add option --keep-inp-out-in-metal and fix bugs in unmap
2025-01-22 11:15:19 +04:00
Zonghang Li
46e99218b4
add arg --cuda-mem
2025-01-16 09:15:34 +04:00
Lizonghang
76a7fc7527
support different window sizes
2024-10-26 12:34:14 +04:00
Lizonghang
c97ea10617
add mmap prefetch and unloading
2024-10-25 16:33:56 +04:00
Lizonghang
2a01ff5fb1
init
2024-10-23 09:42:32 +04:00
Daniel Kleine
133c7b46b3
Fixed RNG seed docs ( #9723 )
* Update README.md
fixed RNG seed info
* changed print format to unsigned
2024-10-04 10:54:44 +02:00
Georgi Gerganov
f4d2b8846a
llama : add reranking support ( #9510 )
* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classification head (wip) [no ci]
* llama : add "rank" pooling type
ggml-ci
* server : add rerank endpoint
ggml-ci
* llama : avoid ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style
ggml-ci
* server : initiate tests for later
ggml-ci
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests
ggml-ci
* add reranking test
* change test data
* Update examples/server/server.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci]
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
Xuan Son Nguyen
afbbfaa537
server : add more env vars, improve gen-docs ( #9635 )
* server : add more env vars, improve gen-docs
* update server docs
* LLAMA_ARG_NO_CONTEXT_SHIFT
2024-09-25 14:05:13 +02:00
Xuan Son Nguyen
0b3bf966f4
server : add --no-context-shift option ( #9607 )
* server : add --no-context-shift option
* small fix
* Update examples/server/tests/features/embeddings.feature
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* tests : minor fix
* revert usage of GGML_ASSERT
* update server documentation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-23 22:23:54 +02:00
Bert Wagner
8b836ae731
arg : add env variable for parallel ( #9513 )
* add env variable for parallel
* Update README.md with env: LLAMA_ARG_N_PARALLEL
2024-09-17 16:35:38 +03:00
Vinesh Janarthanan
441b72b91f
main : option to disable context shift ( #9484 )
* added cli arg to disable context shift
* reverted precommit
* updated README.md for main
* white space
* allow disabling context shift in the server
* Update common/arg.cpp
no-context-shift only works for main example
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* added server example to --no-context-shift args
* removed server changes
* white space
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-16 09:20:01 +03:00
Georgi Gerganov
6262d13e0b
common : reimplement logging ( #9418 )
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-15 20:46:12 +03:00
Georgi Gerganov
0abc6a2c25
llama : llama_perf + option to disable timings during decode ( #9355 )
* llama : llama_perf + option to disable timings during decode
ggml-ci
* common : add llama_arg
* Update src/llama.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* perf : separate functions in the API
ggml-ci
* perf : safer pointer handling + naming update
ggml-ci
* minor : better local var name
* perf : abort on invalid sampler pointer
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-13 09:53:38 +03:00
Xuan Son Nguyen
6cd4e03444
arg : bring back missing ifdef ( #9411 )
* arg : bring back missing ifdef
* replace with llama_supports_gpu_offload
2024-09-10 22:41:29 +02:00
matteo
8d300bd35f
enable --special arg for llama-server ( #9419 )
Co-authored-by: matteo serva <matteo.serva@gmail.com>
2024-09-10 22:40:59 +02:00
slaren
49006c67b4
llama : move random seed generation to the samplers ( #9398 )
* llama_sampler_penalties : clamp penalty_last_n to zero
2024-09-10 18:04:25 +02:00
Xuan Son Nguyen
bfe76d4a17
common : move arg parser code to arg.cpp ( #9388 )
* common : move arg parser to arg.cpp
* better categorize args
* add cmake
* missing climits
* missing cstdarg
* common : more explicit includes
* fix build
* refactor gpt_params_parse
* update server readme
* fix test
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-09 23:36:09 +02:00