* metal: matrix-matrix multiplication kernel
This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.
* metal: fix performance degradation from gqa
Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.
* metal: fix bugs for GQA and perplexity test.
I mixed up ne02 and nb02 in previous commit.
* add log_callback to llama_context_params for custom logging.
* Fix macro expansion on gcc
* Add struct llama_state for global variables and move log_callback there
* Turn log level into enum and some minor changes.
* Remove model_for_logging parameter (not needed anymore)
* Convert remaining fprintf(stderr, ...) calls to use new macros.
* Fix enum and initialize g_state
* Fix log calls after merge
* Fix missing static
* Add back all the new lines in the logging strings
* Add comment for llama_log_callback and replace remaining printf calls
---------
Co-authored-by: grahameth <->
Co-authored-by: Helmut <helmut.buhler@inf.h-brs.de>
* added stream saving context data to file to avoid allocating unnecessary amounts of memory
* generalised copying state data to file or buffer
* added comments explaining how copy_state_data works
* fixed trailing whitespaces
* fixed save load state example
* updated save load state to use public function in llama.cpp
* - restored breakage of the llama_copy_state_data API
- moved new logic for copying llama state data to internal function
* fixed function declaration order
* restored save load state example
* fixed whitepace
* removed unused llama-util.h include
* Apply suggestions from code review
Co-authored-by: slaren <slarengh@gmail.com>
* Apply code review suggestions
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* ggml : add graph tensor allocator
* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset
* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
* ggml : graph allocation in contexts
* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx
* llama.cpp : allocate graph in the context
* add GGML_PAD
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal: concurrently dispatch commands
Function `ggml_metal_graph_find_concurrency` will run and write
commands that can be issued concurrently to metal context `concur_list`
array, when `ggml_metal_graph_compute` is called for the first time.
* metal: don't call find_concurrency automatically.
* metal : code style changes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
* llama, main : constrain sampling to grammar
* allow loading grammar from file
* fix whitespace errors
* handle & print parser errors
* add comments to grammar syntax and allow newlines where unambiguous
* add missing include
* support alternates in root rule
* fix bugs with empty token and EOS
* adjust JSON grammar
* remove swp file
* rewrite ternary expressions
Co-authored-by: Henri Vasserman <henv@hot.ee>
* use struct for grammar elements and add Unicode support
* add unicode escapes
* add inverse char ranges
* only sample full tokens (no peeking or truncation)
* llama : minor style changes
blindly applied in online editor - hopefully I didn't break something
* update help text
* add warning message if EOS is disabled
---------
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>