* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing
* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers
* Finish merge of ClBlast support
* Move CLBlast implementation to separate file
Add buffer reuse code (adapted from slaren's cuda implementation)
* Add q4_2 and q4_3 CLBlast support, improve code
* Double CLBlast speed by disabling OpenBLAS thread workaround
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
* Fix device selection env variable names
* Fix cast in opencl kernels
* Add CLBlast to CMakeLists.txt
* Replace buffer pool with static buffers a, b, qb, c
Fix compile warnings
* Fix typos, use GGML_TYPE defines, improve code
* Improve btype dequant kernel selection code, add error if type is unsupported
* Improve code quality
* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read
Co-authored-by: Henri Vasserman <henv@hot.ee>
* Use c compiler for opencl files
* Simplify code, fix include
* First check error, then release event
* Make globals static, fix indentation
* Rename dequant kernels file to conform with other file names
* Fix import cl file name
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
* reserve correct size for logits
* add functions to get and set the whole llama state:
including rng, logits, embedding and kv_cache
* remove unused variables
* remove trailing whitespace
* fix comment
* Multi-threading quantization.
Not much gain for simple quantizations, bit it will be important
for quantizations that require more CPU cycles.
* Multi-threading for quantize-stats
It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.
* Reviewer comments
* Avoiding compiler confusion
After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC starting to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.
* Still fighting with lambda captures in MSVC
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Replaced static initialization of complex objects with a initialization on first use. This prevents an undefined behavior on program run, for example, crash in Release build, works in Debug build
* replaced use of auto with exact type to avoid using -std=c++14
* Made the assessors functions for static maps be static const
Mostly for msys2 and mingw64 builds, which are different from each other
and different from standard Visual Studio builds. Isn't Windows fun?
- Define _GNU_SOURCE in more files (it's already used in ggml.c for
Linux's sake).
- Don't use PrefetchVirtualMemory if not building for Windows 8 or later
(mingw64 doesn't by default). But warn the user about this situation
since it's probably not intended.
- Check for NOMINMAX already being defined, which it is on mingw64.
- Actually use the `increment` variable (bug in my `pizza` PR).
- Suppress unused variable warnings in the fake pthread_create and
pthread_join implementations for Windows.
- (not Windows-related) Remove mention of `asprintf` from comment;
`asprintf` is no longer used.
Fixes#871.