Commit graph

4371 commits

Author SHA1 Message Date
compilade
de9692a7d2
llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head, whereas
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
2024-03-03 10:41:55 +02:00
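A minimal sketch of the cell_max idea, with illustrative field names rather than the exact llama.cpp internals: scan backwards for the last occupied cell, so holes left by a fragmented cache below kv_self.head do not corrupt the saved range.

```cpp
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t pos = -1; // -1 marks an empty cell (hypothetical layout)
};

// Returns one past the index of the last occupied cell, or 0 if none are
// used. Unlike kv_self.head, this stays correct when the cache has holes,
// and it can return 1 when only cell 0 is in use.
static uint32_t kv_cache_cell_max(const std::vector<kv_cell> & cells) {
    for (uint32_t i = (uint32_t) cells.size(); i > 0; --i) {
        if (cells[i - 1].pos >= 0) {
            return i;
        }
    }
    return 0;
}
```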
Pierrick Hymbert
e6029348e8
ci : schedule slow server tests only on Release or on demand (#5839) 2024-03-03 10:35:23 +02:00
Pierrick Hymbert
8ef969afce
server : init HTTP request thread pool with --parallel if set (#5836) 2024-03-03 09:48:36 +02:00
Concedo
0c59c1ed90 allow specifying width and height 2024-03-03 15:44:15 +08:00
Georgi Gerganov
fa974646e1
flake.lock: Update (#5842)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-02 20:11:31 -08:00
Pierrick Hymbert
9731134296
server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test
2024-03-02 22:00:14 +01:00
Michael Podvitskiy
4a6e2d6142
llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-02 21:52:25 +02:00
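A hedged sketch of how such an abort callback is typically wired; the callback shape below matches ggml's bool-returning convention, while the registration call appears only as an illustrative comment.

```cpp
#include <atomic>

static std::atomic<bool> g_interrupt{false};

// Return true to stop graph computation at the next checkpoint;
// `data` is the user pointer passed in at registration time.
static bool should_abort(void * /*data*/) {
    return g_interrupt.load();
}

// Illustrative wiring: register the callback on the context, e.g.
//   llama_set_abort_callback(ctx, should_abort, nullptr);
// then another thread or a signal handler sets g_interrupt to cancel
// an in-flight decode instead of waiting for it to finish.
```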
Georgi Gerganov
494c870326
ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
2024-03-02 20:00:49 +02:00
Jared Van Bortel
4d4d2366fc
convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821) 2024-03-02 12:27:26 -05:00
Jared Van Bortel
c7a0ad8ec9
convert-hf : make model class definitions self-contained (#5825) 2024-03-02 12:21:47 -05:00
Kawrakow
bbde6eb256
ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product

On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs. 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor-sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096, LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-02 17:00:51 +02:00
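For reference, the xor-sub trick mentioned above conditionally negates lanes without a multiply: with a mask m that is 0 (keep) or -1 (negate) per lane, (x ^ m) - m yields x or -x. A scalar sketch of both variants; the actual kernels use AVX2/NEON intrinsics (vmulq_s8 multiplies by ±1 directly).

```cpp
#include <cstdint>

// xor-sub: x ^ -1 == ~x, and ~x + 1 == -x in two's complement,
// so (x ^ m) - m negates exactly the lanes where m == -1.
static inline int8_t negate_xor_sub(int8_t x, int8_t m) {
    return (int8_t) ((x ^ m) - m);
}

// Multiply form: s is +1 or -1; this is what vmulq_s8 vectorizes on NEON.
static inline int8_t negate_mul(int8_t x, int8_t s) {
    return (int8_t) (x * s);
}
```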
Georgi Gerganov
ef2cd694c4
scripts : add pod-llama.sh 2024-03-02 16:54:20 +02:00
Concedo
fa1d8b8d95 updated lite 2024-03-02 22:33:02 +08:00
Xuan Son Nguyen
6c32d8c7ad
llama : refactor internal quantization functions (#5830) 2024-03-02 16:19:09 +02:00
compilade
802da0091b
llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of `operator[]`,
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-02 15:42:56 +02:00
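A sketch of the const-map pattern behind this fix, with abbreviated names: `operator[]` is non-const because it inserts missing keys, so a const map must use `at`, and giving the unknown architecture an explicit "(unknown)" entry keeps `at` from throwing.

```cpp
#include <map>
#include <string>

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_UNKNOWN };

// A const map cannot grow spurious elements on lookup, so every key that
// may be queried must be present up front, including the unknown arch.
static const std::map<llm_arch, std::string> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA,   "llama"     },
    { LLM_ARCH_UNKNOWN, "(unknown)" },
};

static const std::string & llm_arch_name(llm_arch arch) {
    return LLM_ARCH_NAMES.at(arch); // operator[] does not exist on const maps
}
```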
Neo Zhang Jianyu
715641391d
Support multiple GPUs (split mode) on SYCL backend (#5806)
* support multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support two new OPs, disable the -sm=row feature for now, fix unit test

* update news

* fix merge error

* update according to review comments
2024-03-02 19:49:30 +08:00
Concedo
4c0beef598 updated lite, added personal notes 2024-03-02 18:53:15 +08:00
Concedo
fda905a36a fixed unable to load config 2024-03-02 18:08:45 +08:00
crasm
9bf297a02b
workflows : remove nocleanup arg for check-requirements.sh (#5826)
Reduces peak tmpfs usage and should prevent the check from failing due to
running out of space.

Fixes the 'No space left on device' issue mentioned in #5703.
2024-03-02 00:11:06 -05:00
Concedo
e1b213ae96 increase steps limit 2024-03-02 12:08:19 +08:00
Concedo
e53d21d748 sanitize SD prompt to avoid segfault 2024-03-02 12:05:59 +08:00
Concedo
59c5448ac8 fixed colab (+1 squashed commits)
Squashed commits:

[1d1c686f] updated colab and docs
2024-03-02 10:09:07 +08:00
Tushar
cb5e8f7fc4
build(nix): Introduce flake.formatter for nix fmt (#5687)
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
2024-03-01 15:18:26 -08:00
nold
da3b9ba2b7
convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792) 2024-03-01 16:51:12 -05:00
Sourab Mangrulkar
c29af7e225
llama : add StarCoder2 support (#5795)
* Add support for starcoder2

* handle rope type

* skip serializing rope freq and rotary embeddings

* resolve comments

* Update llama.cpp

* remove redundant changes

* handle `rope-theta`

* llama : change starcoder2 rope type

* address comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-01 21:30:46 +02:00
Concedo
0978134f65 fix macos tunnel 2024-03-02 02:03:13 +08:00
Georgi Gerganov
38d16b1426
server : remove api_like_OAI.py proxy script (#5808) 2024-03-01 20:00:58 +02:00
ddpasa
c2224f003b
ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813) 2024-03-01 18:00:00 +01:00
Concedo
2d9a90b652 try to fix ci compile errors (+1 squashed commits)
Squashed commits:

[d0d49663] fixed log multiline (+1 squashed commits)

Squashed commits:

[81a8befe] try to fix linux build error (+1 squashed commits)

Squashed commits:

[22850dda] try to fix build (+1 squashed commits)

Squashed commits:

[b8294611] missing type
2024-03-01 23:38:15 +08:00
kunal-vaishnavi
e743386728
gemma : fix bfloat16 -> float16 conversion issue (#5810) 2024-03-01 16:08:08 +02:00
Miwa / Ensan
f49a535686
common : fix flag --logits-all to --all-logits (#5805) 2024-03-01 15:48:56 +02:00
Pierrick Hymbert
3ab8b3a92e
llama : cleanup unused mmq flags (#5772)
* cleanup unused --no-mul-mat-q, -nommq, -mmq, --mul-mat-q, mul_mat_q

* remove: mul_mat_q in compare llama bench and usage

* update llama-bench

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-01 13:39:06 +02:00
Concedo
040de7d899 try add tunnels for macos 2024-03-01 17:52:09 +08:00
Concedo
55af5446ad Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	llama.cpp
#	scripts/sync-ggml.last
2024-03-01 17:41:37 +08:00
Douglas Hanley
9600d59e01
unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* don't construct a new locale every time
2024-03-01 11:15:36 +02:00
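A sketch of the multimap shape, with an illustrative entry: NFD can decompose one codepoint into several, and `equal_range` recovers the whole sequence without the single giant-initializer map that was slow to compile.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// One key, several values: U+00E9 ('é') decomposes to U+0065 ('e')
// followed by U+0301 (combining acute accent).
static const std::multimap<uint32_t, uint32_t> nfd_map = {
    { 0x00E9, 0x0065 },
    { 0x00E9, 0x0301 },
};

static std::vector<uint32_t> nfd_decompose(uint32_t cp) {
    auto range = nfd_map.equal_range(cp);
    if (range.first == range.second) {
        return { cp }; // no decomposition: codepoint maps to itself
    }
    std::vector<uint32_t> out;
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```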
Pierrick Hymbert
5cb02b4a01
server: allow overriding the server thread pool with --threads-http (#5794) 2024-03-01 10:08:08 +01:00
Eve
6ea0f010ff
ci : add Ubuntu 22 Vulkan CI run (#5789) 2024-03-01 10:54:53 +02:00
Concedo
e5861e993d fix benchmark 2024-03-01 16:54:25 +08:00
Concedo
80011ed8aa KCPP SD: add warning and step restriction, updated lite, handle quant mode 2024-03-01 16:41:19 +08:00
Georgi Gerganov
f105471ef6
server : fix newlines in help (#5785) 2024-03-01 09:59:43 +02:00
AidanBeltonS
38d1521608
[SYCL] Use batched mul_mat pathway (#5591)
* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-03-01 13:06:47 +05:30
Concedo
3463688a0e image generation is fully working over api (+1 squashed commits)
Squashed commits:

[c98ab0b4] single image generation is working now
2024-03-01 14:43:44 +08:00
Xuan Son Nguyen
052051d8ae
Server: normalize naming (#5779)
* server: normalize naming

* fix spacing
2024-02-29 21:42:11 +01:00
Concedo
e8f4d7b3da added model and config endpoints for sdcpp, added more samplers. speed is still not good 2024-02-29 22:56:09 +08:00
bebopkim
257015bb94
Resolve Metal compilation errors for sdcpp (#720) 2024-02-29 20:15:45 +08:00
Concedo
5a44d4de2b refactor and clean identifiers for sd, fix cmake 2024-02-29 18:28:45 +08:00
Concedo
66134bb36e ui for loading SD models done 2024-02-29 17:08:22 +08:00
Marcus Dunn
d5ab29757e
llama : constified llama_set_state_data's src (#5774) 2024-02-29 10:17:23 +02:00
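The change is visible in the signature alone; a before/after sketch (declaration abbreviated from llama.h):

```cpp
#include <cstddef>
#include <cstdint>

struct llama_context;

// Before: the source buffer was needlessly mutable.
// size_t llama_set_state_data(struct llama_context * ctx, uint8_t * src);

// After: restoring state only reads from src, so it is now const.
size_t llama_set_state_data(struct llama_context * ctx, const uint8_t * src);
```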
Concedo
524ba12abd refactor - do not use a copy buffer to store generation outputs, instead return a cpp allocated ptr 2024-02-29 14:02:20 +08:00
Georgi Gerganov
87c91c0766
ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)
ggml-ci
2024-02-28 21:44:21 +02:00