* ggml : add metal backend registry / device
ggml-ci
* metal : fix names [no ci]
* metal : global registry and device instances
ggml-ci
* cont : alternative initialization of global objects
ggml-ci
* llama : adapt to backend changes
ggml-ci
* fixes
* metal : fix indent
* metal : fix build when MTLGPUFamilyApple3 is not available
ggml-ci
* fix merge
* metal : avoid unnecessary singleton accesses
ggml-ci
* metal : minor fix [no ci]
* metal : g_state -> g_ggml_ctx_dev_main [no ci]
* metal : avoid reference of device context in the backend context
ggml-ci
* metal : minor [no ci]
* metal : fix maxTransferRate check
* metal : remove transfer rate stuff
---------
Co-authored-by: slaren <slarengh@gmail.com>
* rerank : use [SEP] token instead of [BOS]
ggml-ci
* common : sanity check for non-NULL tokens
ggml-ci
* ci : adjust rank score interval
ggml-ci
* ci : add shebang to run.sh
ggml-ci
* Add scaffolding for ggml logging macros
* Metal backend now uses GGML logging
* Cuda backend now uses GGML logging
* Cann backend now uses GGML logging
* Add enum tag to parameters
* Use C memory allocation funcs
* Fix compile error
* Use GGML_LOG instead of GGML_PRINT
* Rename llama_state to llama_logger_state
* Prevent null format string
* Fix whitespace
* Remove log callbacks from ggml backends
* Remove cuda log statement
* feat(gguf-py): Add granitemoe architecture
This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(convert_hf_to_gguf): Add GraniteMoeModel
GraniteMoe has the same configuration deltas as Granite
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(granitemoe convert): Split the double-sized input layer into gate and up
After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.
Branch: GraniteMoE
Co-Authored-By: alex.brooks@ibm.com
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(granitemoe): Implement granitemoe
GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Typo fix in docstring
Co-Authored-By: ggerganov@gmail.com
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(conversion): Simplify tensor name mapping in conversion
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Remove unused tensor name mappings
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Sanity check on merged FFN tensor sizes
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow "output" layer in granite moe architecture (convert and cpp)
Branch: GraniteMoE
Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(granite): Add missing 'output' tensor for Granite
This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>