Concedo
c9eb2ba1c5
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
# ggml-opencl.c
2023-05-13 15:51:05 +08:00
Concedo
b6594ab91e
do not show tokenizer warning
2023-05-13 15:48:17 +08:00
Georgi Gerganov
738ace394a
llama : free ggml context in set / copy state data (close #1425)
2023-05-13 09:08:52 +03:00
Concedo
05cf5f7d6e
partially working, but the BLAS matmul is broken
2023-05-13 11:35:38 +08:00
Concedo
e9caff1cda
Interim merge. Do not use.
...
Merge branch 'master' into concedo_experimental
# Conflicts:
# README.md
# SHA256SUMS
# examples/quantize/quantize.cpp
# ggml-opencl.c
# ggml.c
# ggml.h
# llama.cpp
# llama.h
2023-05-12 23:20:27 +08:00
Georgi Gerganov
b9fd7eee57
ggml : remove bit shuffling (#1405)
...
* ggml : remove Q4_0 bit shuffling (ARM NEON)
* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)
* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)
* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)
* ggml : remove Q5_0 bit shuffling (ARM NEON)
* ggml : 2x faster scalar implementations
* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)
* ggml : simplify scalar dot
* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit
* ggml : fix Q4_1 quantization
* ggml : update cuBLAS + normalize variable names
* ggml : remove Q4_2 mode
* ggml : minor formatting
* ggml : fix Q5_0 quantization
* scripts : add script for measuring the time per token
* AVX implementations (#1370)
* ggml : uniform 5th bit extraction
* llama : produce error upon loading old model files
* llama : fix model magic/version write
* ggml : speed-up Q5_0 + Q5_1 at 4 threads
* ggml : preserve old Q4 and Q5 formats
* ggml : simplify Q8_1 - no need for low / high sums anymore
* ggml : fix Q8_0 and Q8_1 rounding
* Revert "AVX implementations (#1370)"
This reverts commit 948d124837f9d287d8490f41338e0e4cceb0814f.
* ggml : fix AVX2 implementation
* sha : update hashes for 7B and 13B
* readme : update timings + remove warning banner
* llama : update v2 PR number to 1405
* ggml : fix WASM comments
* ggml : back to original bit order
* readme : add note that Q4 and Q5 have been changed
* llama : fix return for unknown version
---------
Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
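For reference, the shuffle removal means quants are now packed in plain linear order. A minimal sketch of the resulting Q4_0 packing (assuming the post-#1405 layout; ggml's real block_q4_0 stores the scale as fp16, and quantize_block_q4_0 here is an illustrative helper, not the ggml function):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK4_0 = 32;

struct block_q4_0 {
    float   d;              // scale (ggml uses fp16 here)
    uint8_t qs[QK4_0 / 2];  // 32 x 4-bit quants, two per byte
};

// Pack a 32-float block: byte j holds element j in the low nibble and
// element j + 16 in the high nibble -- no interleaved bit order to undo.
void quantize_block_q4_0(const float * x, block_q4_0 * y) {
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < QK4_0; ++j) {
        if (std::fabs(x[j]) > amax) { amax = std::fabs(x[j]); max = x[j]; }
    }
    const float d  = max / -8.0f;                  // map [-max, max] onto [0, 15]
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = d;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const uint8_t lo = (uint8_t) std::min(15.0f, x[j]             * id + 8.5f);
        const uint8_t hi = (uint8_t) std::min(15.0f, x[j + QK4_0 / 2] * id + 8.5f);
        y->qs[j] = lo | (hi << 4);
    }
}
```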
Concedo
54194911ac
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-05-09 16:50:43 +08:00
Pavol Rusnak
003ba2fb43
llama : fix hparams shadow (#1367)
...
fixes #1363
2023-05-08 17:48:21 +03:00
Georgi Gerganov
f9a6364912
llama : require first token to be BOS (#1303)
...
* llama : require first token to be BOS
* scripts : add ppl-run-all.sh
* perplexity : add BOS for each chunk
* readme : update perplexity values after BOS fix
* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
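A caller-side sketch of the new requirement, assuming the llama_tokenize() signature of this period (the add_bos flag); tokenize_with_bos is an illustrative helper:

```cpp
#include <vector>
#include "llama.h"

// Tokenize a prompt so that tokens[0] is BOS, as evaluation now requires.
std::vector<llama_token> tokenize_with_bos(llama_context * ctx, const char * text) {
    std::vector<llama_token> tokens(1024);
    // add_bos = true prepends the BOS token before the text tokens
    const int n = llama_tokenize(ctx, text, tokens.data(), (int) tokens.size(), /*add_bos=*/true);
    tokens.resize(n > 0 ? n : 0);
    return tokens;
}
```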
Concedo
62beded0e7
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# Makefile
# README.md
2023-05-07 19:10:01 +08:00
Jed Fox
3924088512
Remove default arguments from sampling functions (#1343)
2023-05-06 17:01:47 -04:00
Concedo
a48dddab86
slightly bump the RAM up to support Chinese Alpaca
2023-05-06 11:48:22 +08:00
Concedo
ede8e4edbb
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
2023-05-03 23:34:50 +08:00
Evan Jones
e216aa0463
llama : only copy used KV cache in get / set state (#1272)
...
* llama : only copy used KV cache in get / set state
* switch to ggml for copying k, v
* avoid designated initializers
2023-05-02 22:26:13 -04:00
Georgi Gerganov
0e6cbff1b7
llama : fix compile warnings
2023-05-02 23:09:08 +03:00
Robert Brisita
2bb992f034
llama : allow 0 as a seed number (#1275)
2023-05-02 19:23:44 +03:00
slaren
2d099e5193
ggml: add names to tensors (#1268)
...
* ggml: add names to tensors
* minor improvements to dot file formatting
2023-05-02 16:03:00 +02:00
Concedo
94827172e0
Merge branch 'master' into concedo
...
# Conflicts:
# CMakeLists.txt
# Makefile
# ggml-cuda.cu
# ggml-cuda.h
2023-05-02 14:38:31 +08:00
Georgi Gerganov
70269cae37
llama : fix session load / save (#1263)
2023-05-01 14:54:59 +03:00
slaren
b925f1f1b0
cuBLAS: fall back to pageable memory if pinned alloc fails (#1233)
...
* cuBLAS: fall back to pageable memory if pinned alloc fails
* cuBLAS: do not use pinned memory if env variable GGML_CUDA_NO_PINNED is set
2023-05-01 13:32:22 +02:00
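The fallback described above, as a rough standalone sketch (host_alloc/host_free are illustrative names, not the ggml-cuda buffer-pool code; the CUDA runtime calls and the GGML_CUDA_NO_PINNED variable are as named in the commit):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Try page-locked (pinned) memory first; fall back to pageable memory if
// the pinned allocation fails or GGML_CUDA_NO_PINNED is set.
void * host_alloc(size_t size, bool * is_pinned) {
    *is_pinned = false;
    if (getenv("GGML_CUDA_NO_PINNED") == nullptr) {
        void * ptr = nullptr;
        if (cudaMallocHost(&ptr, size) == cudaSuccess) {
            *is_pinned = true;
            return ptr;
        }
        fprintf(stderr, "warning: pinned alloc of %zu bytes failed, using pageable memory\n", size);
        cudaGetLastError(); // clear the sticky error before continuing
    }
    return malloc(size);
}

void host_free(void * ptr, bool is_pinned) {
    if (is_pinned) { cudaFreeHost(ptr); } else { free(ptr); }
}
```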
Alex Klinkhamer
90b19bd6ee
llama : let context be const when accessing const data (#1261)
2023-05-01 10:24:20 +03:00
Concedo
0061b90ec6
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
2023-04-30 10:35:02 +08:00
Georgi Gerganov
214b6a3570
ggml : adjust mul_mat_f16 work memory (#1226)
...
* llama : minor - remove explicit int64_t cast
* ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS
* ggml : add asserts to guard for incorrect wsize
2023-04-29 18:43:28 +03:00
Georgi Gerganov
84ca9c2ecf
examples : fix save-load-state + rename llama-util.h
2023-04-29 13:48:11 +03:00
Concedo
da0c34b028
Merge branch 'master' into concedo_experimental
2023-04-29 18:27:06 +08:00
Ivan Stepanov
dd7eff57d8
llama : new sampling algorithms (#1126)
...
* Sample interface, new samplers.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
Ignore EOS fix: -inf should be used.
* mirostat
* Added --logit-bias and --no-penalize-nl, removed std::span
* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
* Save and load example adjust
* Tests
* Windows build fix
* Windows test fix
2023-04-29 08:34:41 +03:00
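A sketch of the candidates-array flow this interface introduced (sample_next is an illustrative helper; the llama_sample_* names follow the commit, though exact parameter lists may differ slightly):

```cpp
#include <vector>
#include "llama.h"

llama_token sample_next(llama_context * ctx) {
    const int     n_vocab = llama_n_vocab(ctx);
    const float * logits  = llama_get_logits(ctx);

    // one entry per vocab token; each sampler filters/reweights the array in place
    std::vector<llama_token_data> cand(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        cand[i] = llama_token_data{ i, logits[i], 0.0f };
    }
    llama_token_data_array arr = { cand.data(), cand.size(), /*sorted=*/false };

    llama_sample_top_k(ctx, &arr, 40, 1);      // keep the 40 most likely tokens
    llama_sample_typical(ctx, &arr, 0.95f, 1); // locally typical sampling
    llama_sample_temperature(ctx, &arr, 0.8f); // soften/sharpen the distribution
    return llama_sample_token(ctx, &arr);      // draw from what remains
}
```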
Concedo
bb282a4ecf
reinstated the q4_3 format for backwards compatibility.
2023-04-29 11:42:04 +08:00
Concedo
0fc1772a8f
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
# ggml.c
2023-04-29 11:14:05 +08:00
slaren
7fc50c051a
cuBLAS: use host pinned memory and dequantize while copying (#1207)
...
* cuBLAS: dequantize simultaneously while copying memory
* cuBLAS: use host pinned memory
* cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory
* cuBLAS: also pin kv cache
* fix rebase
2023-04-29 02:04:18 +02:00
Stephan Walter
36d19a603b
Remove Q4_3, which is no better than Q5 (#1218)
2023-04-28 23:10:43 +00:00
Evan Jones
1481a9cf25
llama : add session file format and saved sessions in main (#1169)
2023-04-28 18:59:37 +03:00
0cc4m
7296c961d9
ggml : add CLBlast support (#1164)
...
* Allow use of OpenCL GPU-based BLAS using CLBlast instead of OpenBLAS for context processing
* Improve CLBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of CLBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 17:57:16 +03:00
Concedo
95bbd46019
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# README.md
2023-04-27 16:12:00 +08:00
Georgi Gerganov
574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
...
* ggml : add Q5_0 quantization (cuBLAS only)
* ggml : fix Q5_0 qh -> uint32_t
* ggml : fix q5_0 histogram stats
* ggml : q5_0 scalar dot product
* ggml : q5_0 ARM NEON dot
* ggml : q5_0 more efficient ARM NEON using uint64_t masks
* ggml : rename Q5_0 -> Q5_1
* ggml : adding Q5_0 mode
* quantize : add Q5_0 and Q5_1 to map
* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)
---------
Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-26 23:14:13 +03:00
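The bullets above converge on a block layout along these lines (a plausible reading of the final format, with qh as a packed 32-bit word of 5th bits; dequant_q5_0_at is an illustrative helper, not ggml code):

```cpp
#include <cstdint>
#include <cstring>

constexpr int QK5_0 = 32;

struct block_q5_0 {
    uint16_t d;             // fp16 scale
    uint8_t  qh[4];         // 5th bit of each of the 32 quants, one bit per value
    uint8_t  qs[QK5_0 / 2]; // low 4 bits, two values per byte
};

// Reassemble quant j: low nibble from qs, 5th bit from qh.
inline int dequant_q5_0_at(const block_q5_0 & b, int j) {
    uint32_t qh;
    std::memcpy(&qh, b.qh, sizeof(qh));
    const uint8_t lo  = (j < QK5_0 / 2) ? (b.qs[j] & 0x0F) : (b.qs[j - QK5_0 / 2] >> 4);
    const uint8_t bit = (qh >> j) & 1u;
    return (int)(lo | (bit << 4)) - 16; // 5-bit value centered at zero
}
```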
Ásgeir Bjarni Ingvarsson
87a6f846d3
Allow setting the rng seed after initialization (#1184)
...
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
2023-04-26 22:08:43 +02:00
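In caller terms, the use case looks roughly like this (restore_with_new_seed is an illustrative helper; llama_set_state_data and the llama_set_rng_seed added here are the real entry points, though the exact signatures of this period may differ slightly):

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

// Restore a full snapshot (rng, logits, embeddings, kv cache), then
// override just the rng so generation diverges from the snapshot.
void restore_with_new_seed(llama_context * ctx, std::vector<uint8_t> & saved, int new_seed) {
    llama_set_state_data(ctx, saved.data());
    llama_set_rng_seed(ctx, new_seed);
}
```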
Concedo
93a8e00dfa
Merge branch 'master' into concedo
...
# Conflicts:
# flake.nix
2023-04-26 18:01:35 +08:00
Georgi Gerganov
7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
...
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)
* tests : fix test-quantize-fns
* ggml : finalize Q8_0 implementation
* ggml : use q4_0_q8_0 and q4_2_q8_0
* ggml : fix Q8_0 dot product bug (ARM)
* ggml : Q8_0 unroll x2
* ggml : fix bug - using wrong block type
* ggml : extend quantize_fns_t with "vec_dot_type"
* ggml : fix Q8_0 to use 255 values out of 256
* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
Concedo
235daf4016
Merge branch 'master' into concedo
...
# Conflicts:
# .github/workflows/build.yml
# README.md
2023-04-25 20:44:22 +08:00
Georgi Gerganov
957c8ae21d
llama : increase scratch buffer size for 65B (ref #1152)
...
Temporary solution
2023-04-24 18:47:30 +03:00
Concedo
e58f1d1336
Merge branch 'master' into concedo_experimental
2023-04-24 19:43:17 +08:00
Georgi Gerganov
c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API (#1143)
2023-04-24 07:40:02 +03:00
Concedo
8e615c8245
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-04-24 12:20:08 +08:00
Georgi Gerganov
e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make"
2023-04-23 18:15:39 +03:00
Concedo
7c60441d71
Merge branch 'master' into concedo
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
2023-04-22 23:46:14 +08:00
Stephan Walter
c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122)
2023-04-22 10:54:13 +00:00
Concedo
1b7aa2b815
Merge branch 'master' into concedo
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# Makefile
2023-04-22 16:22:08 +08:00
Georgi Gerganov
872c365a91
ggml : fix AVX build + update to new Q8_0 format
2023-04-22 11:08:12 +03:00
Concedo
1ea0e15292
Merge branch 'master' into concedo
...
# Conflicts:
# llama.cpp
2023-04-22 16:07:27 +08:00
xaedes
b6e7f9b09e
llama : add API for getting/setting the complete state: rng, logits, embedding and kv_cache (#1105)
...
* reserve correct size for logits
* add functions to get and set the whole llama state:
including rng, logits, embedding and kv_cache
* remove unused variables
* remove trailing whitespace
* fix comment
2023-04-22 09:21:32 +03:00
Concedo
cee018960e
Merge branch 'master' into concedo_experimental
2023-04-22 00:19:50 +08:00