* Changes to server to allow metadata override
* documentation
* flake.nix: expose full scope in legacyPackages
* flake.nix: rocm not yet supported on aarch64, so hide the output
* flake.nix: expose checks
* workflows: nix-ci: init; build flake outputs
* workflows: nix-ci: add a job for eval
* workflows: weekly `nix flake update`
* workflows: nix-flakestry: drop tag filters
...and add a job for flakehub.com
* workflows: nix-ci: add a qemu job for jetsons
* flake.nix: suggest the binary caches
* flake.lock: update
to a commit recently cached by nixpkgs-cuda-ci
---------
Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available cpu. Having the explicit notify() / wait() code
also helps aid in the readability of the server code.
See mozilla-Ocho/llamafile@711344b
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.
The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.
See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* * add multiprompt support
* * cleanup
* * more cleanup
* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* * remove all references to mutex_multitasks
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* * change to set
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint
---------
Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
* Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841
* Use spaces instead of tabs
* Update index.html.hpp after running deps.sh
* Fix test - fix line ending
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>