koboldcpp/otherarch/llama_v2.cpp
YellowRoseCx cf5d918073
Koboldcpp-ROCm Port (#399)
* koboldcpp-ROCm Port

commit 3416c986d9d9a31c3cdefd7e7bd4d9438d72ba35
Merge: 5eb17f0 4c4e435
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Aug 25 13:46:56 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 5eb17f02c8638e003bb91bddf95ccf54d2ad0c12
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Aug 25 13:38:21 2023 -0500

    ROCm Port update

    * use hipblas based on cublas
    * Update Makefile for the Cuda kernels
    * Expand arch list and make it overrideable
    * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
    * add hipBLAS to README
    * new build arg LLAMA_CUDA_MMQ_Y
    * fix half2 decomposition
    * Add intrinsics polyfills for AMD
    * AMD assembly optimized __dp4a
    * Allow overriding CC_TURING
    * use "ROCm" instead of "CUDA"
    * ignore all build dirs
    * Add Dockerfiles
    * fix llama-bench
    * fix -nommq help for non CUDA/HIP

    ---------

    Co-Authored-By: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
    Co-Authored-By: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-Authored-By: funnbot <22226942+funnbot@users.noreply.github.com>
    Co-Authored-By: Engininja2 <139037756+Engininja2@users.noreply.github.com>
    Co-Authored-By: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
    Co-Authored-By: jammm <2500920+jammm@users.noreply.github.com>
    Co-Authored-By: jdecourval <7315817+jdecourval@users.noreply.github.com>

commit b34f4bd2724733e188ec4f6074042f66a5ed28c9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Aug 19 17:12:52 2023 -0500

    Update README.md

commit 7d1196108ad330b32845546fb3472c2172a0b6b8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Aug 14 23:03:12 2023 -0500

    remove force DMMV

commit cd61aa0d9e16627935c7978adf488a679ddfa745
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Aug 12 17:24:31 2023 -0500

    restore main_gpu parameter

commit 4a042f326830271a4c31104051b7b08e08ac234e
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Aug 12 10:51:46 2023 +0300

    gfx1100 support

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
    Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>

commit 8913bc6fea97d3cb860937b0461f455c6abe3ea1
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 10:16:02 2023 +0300

    Allow overriding CC_TURING

commit e77a4c37a756c002e97173f4122e088fb304e18a
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 10:00:07 2023 +0300

    Merge 'origin/master' into hipblas

commit cc4c4e355cd553b1557d5fba2562e824db93f9b4
Author: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Date:   Fri Aug 11 09:43:14 2023 +0300

    New __dp4a assembly

    Now compatible with gfx900 and faster as well.

commit 1a03b709848ce68d5bf5966237756167e2cac540
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 09:30:28 2023 +0300

    Undo mess

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>

commit 4366ff9ba1b1f12e494118ef9b5198479022fcc5
Author: DannyDaemonic <DannyDaemonic@gmail.com>
Date:   Thu Aug 10 13:11:36 2023 -0700

    Handle `ENABLE_VIRTUAL_TERMINAL_PROCESSING` more gracefully on earlier versions of Windows.

commit 811ff855a24323cafddc95c1b8aca711fef05f76
Author: Christian Demsar <crasm@git.vczf.us>
Date:   Thu Aug 10 10:28:27 2023 -0400

    Add --n-predict -2 for stopping generation on full context (#2565)

commit 37c9717aaa6815b6a5be21aaab970212f20fe6bf
Author: Martin Krasser <krasserm@googlemail.com>
Date:   Thu Aug 10 12:16:38 2023 +0200

    Fix grammar-based sampling issue in server (#2566)

commit d18ecd5b9e5dde58ae08a3eef1637406159ddaca
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Aug 10 13:19:41 2023 -0500

    make mmq gen faster for amd

commit 243894a952147a4fac5b6aee748861a0df6cc2c6
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Aug 10 12:14:40 2023 +0300

    ws fix

commit ac2f14da445ea87d73539adbd29d19ff2c9eba58
Author: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Date:   Thu Aug 10 12:11:27 2023 +0300

    AMD assembly optimized __dp4a

    Doesn't seem to work for gfx900, so commented out.

commit 9dba0c985f140ddded8cbb671f139e81fff82eed
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Aug 10 12:09:28 2023 +0300

    Fix merge

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

commit f570b5cb1070591527a82d94bba408927b37778d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 22:11:20 2023 -0500

    Revert "revert cuda changes as they are bugggy"

    This reverts commit 1541bf879772aeeed8ff646bfc52185c2a88b79b.

commit 1541bf879772aeeed8ff646bfc52185c2a88b79b
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Wed Aug 9 22:36:41 2023 +0800

    revert cuda changes as they are bugggy

commit bacc20203efb1839aa313858a04d75255bb4b7f4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 20:37:17 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit b7cb4cfd109986bd66e8fd382d1e2516eaddfebb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 20:00:52 2023 -0500

    additional fixes

commit fadae727baa3735ad3e0667384d6e05ca056b3ef
Merge: 518eb2a 8f8ab6c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:45:50 2023 -0500

    Merge branch 'hipblas' into develop4Main

commit 518eb2af9225f8300a108c4244c7eb0a2217c3bc
Merge: bda0215 cae6a84
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:32:10 2023 -0500

    Merge remote-tracking branch 'upstream/concedo' into develop2Main

commit bda0215b413bafc49890aa23fc35f96a191fb3e0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:17:54 2023 -0500

    update makefile to multisystem path

commit 8f8ab6c4c049df501e9a5ed8fef3aa0fc0691421
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:05:03 2023 -0500

    hipLDFLAG Path change Unix to multisystem in Makefile

    Changed the hardcoded Linux hipBLAS LD path from -L/opt/rocm/lib to the defined ROCM_PATH variable, so the link flags stay flexible for ROCm installs on non-Linux systems.

commit 610ba4cfc460ed65c4adc32d3365a216690384d5
Merge: 4024f91 25d43e0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 23:54:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 4024f91a665d83b6de8658d45ec9d004c5d90c79
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 01:56:44 2023 +0300

    Add intrinsics polyfills for AMD

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
    Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>

commit ab6212864ce8e9af200bcedb3e0126ee49aa8d0a
Merge: d91456a f5bfea0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 00:37:01 2023 +0300

    Merge 'origin/master' into hipblas

commit ee9fa2aca4f2e6645b99702935b34a5f8ec8f05d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 2 01:53:58 2023 -0500

    Update Makefile

commit d91456aaf138566fa0aa3d507964049c8a09499b
Author: ardfork <134447697+ardfork@users.noreply.github.com>
Date:   Mon Jul 31 20:35:00 2023 +0300

    fix half2 decomposition

commit c1cb70d64d307d3fd9b7b9f61bb574e36520499a
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 31 19:56:44 2023 +0300

    new build arg LLAMA_CUDA_MMQ_Y

commit c1664a00ae98059df863a88cbcb13eeca3025742
Merge: 4336231 0728c5a
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 31 19:32:27 2023 +0300

    Merge 'origin/master' into hipblas

commit 848558d7d95a5036ac057efdefa9b2a2e6fb61b7
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 30 20:02:52 2023 -0500

    import vars logic fix

commit b650b849d52aac65364558521f76e75ded7ea590
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 30 00:21:36 2023 -0500

    Update easy_KCPP-ROCm_install.sh

commit 8573a67a29e813d82e7f032912a8c221cd199505
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 21:31:12 2023 -0500

    remove duplicate code and fix typo

    remove duplicate tooltip

commit 430986e3f68f599fd7a11ea4b2b8e45ef33da643
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 21:07:34 2023 -0500

    hide "missing" if all are built

    Move tooltip functions to the helper functions section. Hides the string "Missing: ..." when all backends are available (guarded by if len(runopts)==6).

commit dd0db7265dbc0b0699ca861291006808b662b0e4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 20:52:31 2023 -0500

    hide "missing" if all are built

    Move tooltip functions to the helper functions section. Hides the string "Missing: ..." when all backends are available.

commit 43fffb66d8a30cbd776c3682f8a104c3644206b1
Merge: 0ed65a4 b40550c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 19:13:15 2023 -0500

    Merge branch 'concedo'

commit 0ed65a44a5fdb529611730f276a4b910cbf70ae0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 18:34:21 2023 -0500

    Hide unavailable backends & Add tooltip over backend count

    Hides unavailable backends from the user. If the program is launched without any backends built, it shows an error message stating that no backends were found and that they should be built with the 'make' command.

    Add tooltip when hovering over backend count label

    Hovering over the new label that shows the backend count explains what the numbers mean and shows which backends are not available or not built.

commit 2a263983ab35024a95c411995963182ada06ed6f
Merge: cee2e9d 31486eb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 15:16:33 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 4336231a32a0c6168da5d79801752289622e9e58
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 18:35:56 2023 +0300

    add hipBLAS to README

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>

commit f8e3fc6c746b37d69656fb5ae6af8e411d85dbca
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 14:16:46 2023 +0300

    rocblas init stuff

commit d2ade639f4339e786311effb3eafca8bfc360d56
Merge: cde52d6 8a88e58
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 12:59:48 2023 +0300

    Merge 'origin/master' into hipblas

commit cee2e9d76740fd8e8f50b612078f3e7658460f29
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 26 23:36:55 2023 -0500

    Only Show Available Backends in GUI

    Hides unavailable backends from the user. If the program is launched without any backends built, it shows an error message stating that no backends were found and that they should be built with the 'make' command.

commit 78636109fc2ded79ee3e9a44d2e3c2d63a8de70e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 26 13:27:22 2023 -0500

    Update easy_KCPP-ROCm_install.sh

commit 731cd6e2ab9bb722e211142bb633e7018ccdb31b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jul 25 22:39:50 2023 -0500

    Create easy_rocm_install.sh

commit f154685bbdc79b5ace752fbc179e32f2f7806bdb
Merge: cbdc1f3 94e0a06
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jul 25 22:25:10 2023 -0500

    Merge branch 'concedo_experimentalMAIN'

commit cbdc1f3fb91969e79bc8640e0cebfc3247e200df
Merge: 5b838d4 9731682
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 16:53:21 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit cde52d6a63f13f46d6403cc2957f4b4c34ddf4e2
Merge: 8e8054a 84e09a7
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 24 12:22:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 8e8054ad83e794b261914ad4f337d43e2c76882d
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 24 12:20:49 2023 +0300

    Add rocblas to build files

commit 1f6294dc4473701b5be791d47e4b3733f95dbc0a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:52:01 2023 -0500

    Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)

    * initialize rocblas

commit 5b838d47874536ebffc2f6cb25877e0476a9402d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:10:35 2023 -0500

    amd multigpu full layer offload w/o vram scratch

commit 9bfb2fdd68000670bda85c4e9748d72f5af09764
Merge: b379f9d 66328fc
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:07:44 2023 -0500

    Merge branch 'concedo_experimental'

commit b379f9d6fac570c220c928ff5f4ba4ed1ca7c051
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:07:00 2023 -0500

    Revert "amd multigpu full layer offload w/o vram scratch"

    This reverts commit 9adfc8e33f7116d6ae2e0992920733f783b70d08.

commit 9adfc8e33f7116d6ae2e0992920733f783b70d08
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 02:56:40 2023 -0500

    amd multigpu full layer offload w/o vram scratch

commit 05c792e622a1d9838f9343e04f79ddf2bb63ae96
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 00:18:48 2023 -0500

    initialize rocblas

commit ade68d09d7b63d3344e18b6193043b378671eb12
Merge: 521ad6b 56995ca
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 23 20:25:05 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 521ad6b5cb2a107ad7b972025aeb0f353e0cac67
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jul 20 21:42:33 2023 -0500

    lazy import_var error handling for saves

commit 9553e52e7e4eabe46312729f6c4effeef6390df7
Merge: cac6650 f036109
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jul 20 19:59:41 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit cac6650754502208abfead61ba169fefc5ae84ac
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 17 23:05:02 2023 -0500

    Makefile fix! Allows hip/clblast build together

commit 3db70b5f0a1a4a1207041ddc5f2c5e25306bad4d
Merge: 2ec4466 7568d1a
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 18 01:54:17 2023 +0300

    Merge 'origin/master' into hipblas

commit f208670ffb6cdbb1e225adfb2fd80a67a6dc5055
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 14 02:56:03 2023 -0500

    improve error handling with gpu names

commit 860e73845f61fe0afb6a26cc8054d8be1f9e3669
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 14 00:33:03 2023 -0500

    Show GPU names in GUI, Only show GPUs that exist

    Replaced the GPU selector's preset "1,2,3" and "1,2,3,all" options with a function that grabs the GPU names and uses those names as the values for the selector boxes.

commit 2ec4466db54fd2f42f2ab7713cc1061e0cf59bf3
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jul 13 13:44:02 2023 +0300

    Update build flags.

    GGML_CUDA_DMMV_Y is now GGML_CUDA_MMV_Y
    so update your build instructions.

    GGML_CUDA_FORCE_DMMV is always enabled.

    ---------

    Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>

commit cd36b185ff6de91abbfd1b80366dd79a1303a878
Merge: afcb8fe 1cbf561
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jul 13 13:03:01 2023 +0300

    Merge 'origin/master' into hipblas

commit ac7ebc3ac1deedfbc2940443b26774f1b4c85fae
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 18:32:18 2023 -0500

    add hipBLAS name scheme to GUI and update README

commit 7f85cc5ac30f2f300ca817a489ef209c995c634b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 17:35:54 2023 -0500

    update makefile and ggml.c

commit 6ca3499275ba168320424f06ab3301ec329a6a83
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 15:43:45 2023 -0500

    ggml.c fix

commit 770e674aa5b2a1a9ffff2888a12e27b04ccfc7ef
Merge: 2b289cd 5941514
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 15:24:36 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 2b289cde558310c6c67dfc8d508c04e634595716
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:30:00 2023 -0500

    Update c-cpp.yml

commit 5dae95a9bb486c7f720789dffde1cfb470bffce0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:28:51 2023 -0500

    Update c-cpp.yml

commit b37cd738c84debb53b149f5a9fb73de958f263fd
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:27:04 2023 -0500

    Create c-cpp.yml to test Actions

commit afcb8fe0c4f5e918422ea41d08824653d58575ed
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 18:09:27 2023 +0300

    Add new config option

commit 8c2c4978a32d671253809d8f0f09d98af2dd18ab
Merge: e610466 2347463
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 17:53:54 2023 +0300

    Merge 'origin/master' into hipblas

commit e610466307abc8f8bae641682ab3f91dbc33930e
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 17:53:14 2023 +0300

    Expand arch list and make it overrideable

commit 80e4e548bfbace2a966a58cb57dd1720ad7216b2
Merge: 7735c5a 1d16309
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 10 02:09:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 8432e9d5dc8d080535243467f8d380271e8d9489
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:55:30 2023 -0500

    Update Makefile

commit b58c1893fa839c0f35df96f6a8b026a7f2576762
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:20:00 2023 -0500

    Add multi-gpu CuBLAS support to new GUI

commit 0c1c71b9927127b45030fe88283dfbdd23853d34
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 07:56:57 2023 -0500

    Update Makefile

commit f864f60cd8e563e2594cee5a7da7e9aebed494f9
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sat Jul 8 00:25:15 2023 +0200

    CUDA: add __restrict__ to mul mat vec kernels (#2140)

commit 4539bc2761a7a23b588b5420b9d3fd1962ff63e5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 01:36:14 2023 -0500

    update makefile for changes

commit 912e31ec523eac9ef308f0d28bc2d93aab7c3ecb
Merge: 74e2703 ddaa4f2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 7 23:15:37 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 74e2703ac3b1557f107e540657d0919db115f913
Merge: cf65429 f9108ba
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 5 15:16:49 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit 7735c5a9af58f6713b54fd5a4b6463f3b116d44d
Merge: c3e3733 7ee76e4
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 4 17:09:16 2023 +0300

    Merge 'origin/master' into hipblas

commit cf65429c3832d32a8c17c7ed5ab47066d7511fbe
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:56:40 2023 -0500

    print cuda or opencl based on what's used

commit 72c16d2310b2e4c44018e2084aeb79e68c0b8709
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:45:39 2023 -0500

    Revert "fix my mistake that broke other arches"

    This reverts commit 777aed5e69e240a54e7d3da962d8520855f072b9.

commit 777aed5e69e240a54e7d3da962d8520855f072b9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 15:53:32 2023 -0500

    fix my mistake that broke other arches

commit 27780a987a8dabb18689038c0397e16f2f219c7e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:03:27 2023 -0500

    rocm fixes

commit f52c7d439770c1ea0bebc1f895b74d6aeea5f0a6
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:02:58 2023 -0500

    Revert "rocm fixes"

    This reverts commit 2fe9927353a1e53353623f850d3d534da88f5154.

commit 2fe9927353a1e53353623f850d3d534da88f5154
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:58:21 2023 -0500

    rocm fixes

commit efe7560c83a497f5e750bbe27922babd4233bda9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:43 2023 -0500

    Revert "move HIPBLAS definitions into ggml-cuda.h"

    This reverts commit bf49a93d63f833b7871ba6e60f8fe207562678ee.

commit 4fc0181e44685019dcd309d4bb345cac7a5fef87
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:36 2023 -0500

    Revert "move hipblas definitions to header files"

    This reverts commit 2741ffb70464a71fd138484de4b41da05622e027.

commit 89eb576f2771bd81a3a6274348b47535dfdd5f63
Merge: 2741ffb 3d2907d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 14:44:13 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit c3e3733c61f7705ea00fd593ee94527da8c12f1b
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jul 2 15:51:31 2023 +0300

    ROCm fixes

commit 15db19ae7b70d2a6350063e633b898a89ad78cbc
Merge: 04419f1 46088f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jul 2 15:39:57 2023 +0300

    Merge 'origin/master' into hipblas

commit 2741ffb70464a71fd138484de4b41da05622e027
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 17:07:42 2023 -0500

    move hipblas definitions to header files

commit bf49a93d63f833b7871ba6e60f8fe207562678ee
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 16:38:50 2023 -0500

    move HIPBLAS definitions into ggml-cuda.h

commit 540f4e05f4e95378f46a83e2919d3962c0ef9eac
Merge: 2c3b46f eda663f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 14:58:32 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 2c3b46f8a80ca9d94b2d3d06e1af6b6f7b791914
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:43:43 2023 -0500

    changes to fix build

commit c9e1103da0d72fd39a36391ac4b5d941a133598a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:20:07 2023 -0500

    Update ggml_v2-cuda-legacy.cu for ROCM

commit b858fc5db80ed545a6fbeae3d551bddb47955598
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 17:49:39 2023 -0500

    changes to work with upstream

commit 69a0c2534bb8825f4009760b12d9bd44d108c6ed
Merge: 096f0b0 1347d3a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 16:59:06 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 04419f18947e7b0dc43c07869eac3965f22b34cf
Merge: bb16eff d3494bb
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 28 23:30:10 2023 +0300

    Merge 'origin/master' into hipblas

commit bb16effc750e2706050f5d4ec89cecc42cc13882
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:10 2023 -0500

    headers fix; add kquants_iter for hipblas and add gfx803 (#1)

    * kquants_iter for hipblas and add gfx803
    * Update CMakeLists.txt with hipblas kquants_iter and DMMV_F16
    * remove dmmv_f16 for now

commit 096f0b055e11b7d930842f86146d0e5013c5dce6
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:02 2023 -0500

    revert unnecessary hipblas conditionals

commit d81e81adffd6eb59e280ae1885864bb5fbd9bba6
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 14:48:23 2023 -0500

    Update Makefile hipblas nvcc correction

commit c8ae94524a8bd7dca891b6b711cb5598a30fcf74
Merge: c1e5c83 0be54f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 27 10:50:37 2023 +0300

    Merge 'origin/master' into hipblas

commit 2579ecf8db9569d7756161f05ce7b0f5f23174b0
Merge: abed427 d2034ce
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 25 17:50:04 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit c1e5c8345eca45563d382d9417b84ed5f0ab77ff
Merge: 35a6031 447ccbe
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 25 21:40:05 2023 +0300

    Merge 'origin/master' into hipblas

commit 35a603161a17ddeb6128e9d4718b8fab5e34b558
Merge: df7346c 66a2555
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 25 10:57:48 2023 +0300

    Merge 'origin/master' into hipblas

commit abed427b6f370698fe8e8409e7980f238aad03ef
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 19:16:30 2023 -0500

    reorganize If statements to include proper headers

commit 06c3bf03b92c2e00fc4bcd27f0c34f32c58b19a9
Merge: ea6d320 8342fe8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 16:57:20 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit ea6d3208dcdc0b05e2c164dde8ee0bfc6a02ad09
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jun 23 01:53:28 2023 -0500

    Update README.md

commit 4d56ad8158595d1e835cb379939dc5526deb39e2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 16:19:43 2023 -0500

    Update README.md

commit 21f930872b6e232679fe02eac9e429367365c6af
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 15:42:05 2023 -0500

    kquants_iter for hipblas and add gfx803

commit df7346ccd52bc0368eeeb878e31a284e01eac61a
Merge: 5dd2fbe 7487137
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jun 22 20:51:09 2023 +0300

    Merge 'origin/master' into hipblas

commit b6ff89066bbf2de23dab90bc8bbf9f63d8d1e070
Merge: eb094f0 e6ddb15
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 12:42:09 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit eb094f043f9b0b94e7db028ca36e96ce479b0369
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 23:59:18 2023 -0500

    lowvram parameter description

commit 3a5dfeb568d543376910180caa9a99b081fef9d4
Merge: 665cc11 b1f00fa
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 16:53:03 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit 665cc1136b188e7ff5c1aa1359118c999ff6d162
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 01:13:19 2023 -0500

    add lowvram parameter

commit 222cbbb141f7ce79884cafb6bcebd860ae27cc04
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 19:03:28 2023 -0500

    add additional hipblas conditions for cublas

commit e1f958124ec99525cb58d8c534f9d1789377544e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 16:51:59 2023 -0500

    Add hip def for cuda v2

commit 3bff5c0f0defd9d49b770c5ce107c71e5cba8003
Merge: a7e74b3 266d47a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 13:38:06 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit a7e74b39fe5eedf85d955fe5ea5f4c546322a9b0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:04:18 2023 -0500

    Update README.md

commit 5e99b3cb72d83f45b3f7904ffb8f242e743a142c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:03:42 2023 -0500

    Update Makefile

commit 9190b17432ebdc489ab05b71df6c3b8d5e7f5895
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 21:47:10 2023 -0500

    Update README.md

commit 5dd2fbe6ea87f78e38d888844a3820302a297048
Merge: 67e229b 20568fe
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 20 01:23:12 2023 +0300

    Merge 'origin/master' into hipblas

commit 2780ea292b1e9c6ead274de3afb34337716be08f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 15:48:00 2023 -0500

    Update Makefile

commit 04a3e64807a92c2e105af92f16dd6db2ea024d39
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:33:39 2023 -0500

    remove extra line

commit cccbca9dea3780e797a3b4972ba211e0c762fdc1
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:17 2023 -0500

    attempt adding ROCM hipblas

commit a44a1d4b90ed11d83d622eb976a945ff26a8974e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:01 2023 -0500

    attempt adding ROCM hipblas

commit b08818416972f83349bc4d6479bccc55ee31436d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:30:54 2023 -0500

    attempt adding ROCM hipblas

commit 67e229b7ca0a51f367c1e1495a15c261d0893d25
Merge: 6f7c156 b241649
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 18 00:36:54 2023 +0300

    Merge 'origin/master' into hipblas

commit 6f7c15637a8ed60d5d5dade24aaf63a296bc32a6
Merge: 61df8e9 fc45a81
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jun 17 16:53:22 2023 +0300

    Merge 'origin/master' into hipblas

commit 61df8e92179b84af9041e53f61d0194dfd791de0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 14 22:46:10 2023 +0300

    add cudaMemset

commit a836529996845343dfb96becb4fd48e3f55da55c
Merge: 85f902d 254a7a7
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 14 22:41:55 2023 +0300

    Merge 'origin/master' into hipblas

commit 85f902d5c44cee18858812212dad850b9409c7f9
Merge: 4362e80 b50b570
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jun 8 10:50:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 4362e805a4b0bd80d0cff0e3d8d0b1162cc8043c
Merge: fa5b3d7 17366df
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 23:14:40 2023 +0300

    Merge 'origin/master' into hipblas

commit fa5b3d7365266a9903450c1105551ffec7f51d92
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:47:00 2023 +0300

    fix makefile.

commit 1ba4ce4ad792f9672eecc37bf982386d3a007914
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:41:08 2023 +0300

    Revert "warp size fixes"

    It seems like 32 is faster for me, at least, and it won't cause so many conflicts.

    This reverts commit 5d6eb72164e5ae000d07dd725e635faa7a2f723d.

commit 5d6eb72164e5ae000d07dd725e635faa7a2f723d
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:32:41 2023 +0300

    warp size fixes

commit 33091a9bd3bb3ecf59b0f5535b084f443f6a20b6
Merge: 9fdaa1d 2d43387
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 16:19:23 2023 +0300

    Merge 'origin/master' into hipblas

commit 9fdaa1d2501a2c4a030af6d34e97b2e4766b27c4
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 27 19:17:53 2023 +0300

    Add more defs

    For forward compatibility #1607

commit a4648c1e7c70b4985393ec0851403ef7fb8d1ffc
Merge: 4c8b3fb 0ecb1bb
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 27 18:22:39 2023 +0300

    Merge 'origin/master' into hipblas

commit 4c8b3fb1071dff0cd0c4b4f96e506294ba6473f4
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 01:08:53 2023 +0300

    add configurable vars

commit 30d921af3e0b21f511652c98448ccb631434d0d4
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 01:03:56 2023 +0300

    and makefile

commit a593a4f6c24389528a5eed8e6dc86eb06ced38b8
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 00:55:28 2023 +0300

    Add missing parameters

commit 174bf6a86d045a30b1253cbe3cc773808b202186
Merge: f80ce7a 1fcdcc2
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 00:44:23 2023 +0300

    Merge 'origin/master' into hipblas

commit f80ce7a4e00b33adf6b13d231689dbf3a33ec475
Merge: 600ace3 ac7876a
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 25 00:02:50 2023 +0300

    Merge branch 'origin/master' into hipblas

commit 600ace39c8f1d311b8f3c49003f5a6448a44b18e
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 23:42:20 2023 +0300

    update warp size

commit b19fefef943d974db2eda8a8908e67e1d08e317c
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 23:28:08 2023 +0300

    Forwardcompat

commit c66115b833178ea3711543ddbbd4eb2b21ab523e
Merge: a0b2d5f b8ee340
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 18:29:31 2023 +0300

    Merge 'origin/master' into hipblas

commit a0b2d5f291
Merge: 8bab456 2a5ee02
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue May 16 17:08:29 2023 +0300

    Merge 'origin/master' into hipblas

commit 8bab45611e
Merge: 2956630 b5c9295
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon May 15 00:01:12 2023 +0300

    Merge 'origin/master' into hipblas

commit 2956630a3d
Merge: 0fe6384 f048af0
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 13 13:12:52 2023 +0300

    Merge 'origin/master' into hipblas

commit 0fe6384755
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 12 17:22:11 2023 +0300

    fix makefile

commit 605560d9ec
Merge: 127f68e 089b1c9
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 12 16:12:53 2023 +0300

    Merge 'origin/master' into hipblas

commit 127f68eb5a
Merge: 070cbcc b608b55
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 11 20:21:27 2023 +0300

    Merge 'origin/master' into hipblas

commit 070cbcc1bd
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 18:10:56 2023 +0300

    occupancy function

commit a3296d50aa
Merge: 0aefa6a e129551
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 18:06:04 2023 +0300

    Merge 'origin/master' into hipblas

commit 0aefa6ab71
Merge: baeb482 1b0fd45
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 12:24:41 2023 +0300

    Merge 'origin/master' into hipblas

commit baeb482a94
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 12:24:12 2023 +0300

    Revert to default copy

commit 289073a532
Merge: 1107194 173d0e6
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 6 19:59:41 2023 +0300

    Merge 'origin/master' into hipblas

commit 1107194e6b
Merge: 04c0d48 a3b85b2
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 6 00:38:20 2023 +0300

    Merge 'origin/master' into hipblas

commit 04c0d480d7
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 4 12:31:16 2023 +0300

    Move all HIP stuff to ggml-cuda.cu

commit d83cfbad0c
Merge: b67cc50 799fdc1
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 4 11:31:16 2023 +0300

    Merge 'origin/master' into hipblas

commit b67cc50dad
Merge: fcbc262 e216aa0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed May 3 15:04:51 2023 +0300

    Merge 'origin/master' into hipblas

commit fcbc262eb9
Merge: c73def1 f4cef87
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon May 1 22:45:29 2023 +0300

    Merge 'origin/master' into hipblas

commit c73def129a
Merge: d8ea75e f0d70f1
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Apr 30 18:40:42 2023 +0300

    Merge 'origin/master' into hipblas

commit d8ea75e952
Merge: d194586 334637e
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Apr 29 11:25:51 2023 +0300

    Merge 'origin/master' into hipblas

commit d194586f65
Merge: 2ab9d11 7f15c5c
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 23:03:52 2023 +0300

    Merge 'origin/master' into hipblas

commit 2ab9d11f37
Merge: 3b4a531 04aaae1
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 16:30:05 2023 +0300

    Merge 'origin/master' into hipblas

commit 3b4a53138f
Merge: a1caa48 0b2da20
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 10:08:41 2023 +0300

    Merge 'origin/master' into hipblas

commit a1caa48611
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 10:08:21 2023 +0300

    add more cuda defines

    This is so 'slaren/cuda-f16f32' would merge.

commit ecc056519f
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 01:58:27 2023 +0300

    only .cu file needs to be compiled as device

commit ef51e9ecac
Merge: d571d16 4afcc37
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Apr 26 12:46:26 2023 +0300

    Merge branch 'ggerganov:master' into hipblas

commit d571d1629f
Merge: 608aa33 dd0eabc
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Apr 25 21:15:33 2023 +0300

    Merge 'origin/master' into hipblas

commit 608aa33d9f
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Apr 25 21:15:04 2023 +0300

    change default GPU arch to match CMake

commit 3a004b2a01
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Apr 24 02:24:54 2023 +0300

    add rpath

commit db7a01297e
Merge: 3677235 284685f
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Apr 23 21:49:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 367723544c
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Apr 22 23:28:00 2023 +0300

    More build file changes

commit d3e1984ce0
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 21 03:32:06 2023 +0300

    add rpath

commit 0e005f7793
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 21 02:13:00 2023 +0300

    Build file changes

    HIP Clang is no longer required; the CMake scripts will configure the
    needed compiler, which can be the system clang++. Other code can still
    use GCC, but CMake will force clang to do the linking.

commit 54a63c10e8
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Apr 20 22:19:22 2023 +0300

    Update Makefile for the Cuda kernels

commit 0fd8363adc
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Apr 20 02:04:00 2023 +0300

    use hipblas based on cublas

* Merge Fixes

* readme merge fix

* remove old ggmlv2 changes

* bring ggml v2_cuda up to date with AMD changes

* Revert ggml v2_cuda changes because they weren't needed

This reverts commit 3385dd4240.

* avoid launching subprocesses to get device names for now, but other than that seems to be working

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2023-08-28 17:05:06 +08:00


// Defines fileno on msys:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#include <cstdint>
#include <cstdio>
#endif
#include "llama_v2-util.h"
#include "llama_v2.h"
#include "ggml_v2.h"
#ifdef GGML_USE_CUBLAS
#include "ggml_v2-cuda.h"
#endif
#if defined(GGML_USE_CLBLAST)
#include "ggml_v2-opencl.h"
#endif
#include <array>
#include <ctime>
#include <cinttypes>
#include <fstream>
#include <random>
#include <map>
#include <unordered_map>
#include <queue>
#include <cassert>
#include <cstring>
#include <climits>
#include <memory>
#include <algorithm>
#include <initializer_list>
#include <thread>
#include <atomic>
#include <mutex>
#include <sstream>
#include <numeric>
#define LLAMA_V2_USE_SCRATCH
#define LLAMA_V2_MAX_SCRATCH_BUFFERS 16
// available llama models
enum e_model2 {
    MODEL_UNKNOWN_2,
    MODEL_7B_2,
    MODEL_13B_2,
    MODEL_30B_2,
    MODEL_65B_2,
};
static const size_t MB_2 = 1024*1024;
// computed for n_ctx == 2048
// TODO: dynamically determine these sizes
// needs modifications in ggml
static const std::map<e_model2, size_t> & MEM_REQ_SCRATCH0_2()
{
    static std::map<e_model2, size_t> k_sizes = {
        { MODEL_UNKNOWN_2, 512ull * MB_2 },
        { MODEL_7B_2, 512ull * MB_2 },
        { MODEL_13B_2, 512ull * MB_2 },
        { MODEL_30B_2, 640ull * MB_2 },
        { MODEL_65B_2, 1024ull * MB_2 },
    };
    return k_sizes;
}
static const std::map<e_model2, size_t> & MEM_REQ_SCRATCH1_2()
{
    static std::map<e_model2, size_t> k_sizes = {
        { MODEL_UNKNOWN_2, 512ull * MB_2 },
        { MODEL_7B_2, 512ull * MB_2 },
        { MODEL_13B_2, 512ull * MB_2 },
        { MODEL_30B_2, 640ull * MB_2 },
        { MODEL_65B_2, 1024ull * MB_2 },
    };
    return k_sizes;
}
// 2*n_embd*n_ctx*n_layer*sizeof(float16)
static const std::map<e_model2, size_t> & MEM_REQ_KV_SELF_2()
{
    static std::map<e_model2, size_t> k_sizes = {
        { MODEL_UNKNOWN_2, 1026ull * MB_2 },
        { MODEL_7B_2, 1026ull * MB_2 },
        { MODEL_13B_2, 1608ull * MB_2 },
        { MODEL_30B_2, 3124ull * MB_2 },
        { MODEL_65B_2, 5120ull * MB_2 },
    };
    return k_sizes;
}
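// Sanity check on the table above (illustrative): with the default 7B hparams
// below (n_embd = 4096, n_layer = 32) at n_ctx = 2048 and an f16 cache,
// 2 * 4096 * 2048 * 32 * 2 bytes = 1024 MiB, which the 1026 MB entry covers
// with a small margin for object overhead.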
// this is mostly needed for temporary mul_mat buffers to dequantize the data
// not actually needed if BLAS is disabled
static const std::map<e_model2, size_t> & MEM_REQ_EVAL_2()
{
    static std::map<e_model2, size_t> k_sizes = {
        { MODEL_UNKNOWN_2, 800ull * MB_2 },
        { MODEL_7B_2, 800ull * MB_2 },
        { MODEL_13B_2, 1024ull * MB_2 },
        { MODEL_30B_2, 1280ull * MB_2 },
        { MODEL_65B_2, 1536ull * MB_2 },
    };
    return k_sizes;
}
// default hparams (LLaMA 7B)
struct llama_v2_hparams {
    uint32_t n_vocab = 32000;
    uint32_t n_ctx = 512; // this is provided as user input?
    uint32_t n_embd = 4096;
    uint32_t n_mult = 256;
    uint32_t n_head = 32;
    uint32_t n_layer = 32;
    uint32_t n_rot = 64;
    enum llama_v2_ftype ftype = LLAMA_V2_FTYPE_MOSTLY_F16;

    bool operator!=(const llama_v2_hparams & other) const {
        return memcmp(this, &other, sizeof(llama_v2_hparams));
    }
};
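// Note (assumption based on llama.cpp loaders of this vintage): the feed-forward
// width is not serialized; it is derived from n_embd and n_mult as
// n_ff = ((2*(4*n_embd)/3 + n_mult - 1)/n_mult)*n_mult.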
struct llama_v2_layer {
    // normalization
    struct ggml_v2_tensor * attention_norm;

    // attention
    struct ggml_v2_tensor * wq;
    struct ggml_v2_tensor * wk;
    struct ggml_v2_tensor * wv;
    struct ggml_v2_tensor * wo;

    // normalization
    struct ggml_v2_tensor * ffn_norm;

    // ff
    struct ggml_v2_tensor * w1;
    struct ggml_v2_tensor * w2;
    struct ggml_v2_tensor * w3;
};
struct llama_v2_kv_cache {
    struct ggml_v2_tensor * k;
    struct ggml_v2_tensor * v;

    struct ggml_v2_context * ctx = NULL;

    llama_v2_ctx_buffer buf;

    int n; // number of tokens currently in the cache

    ~llama_v2_kv_cache() {
        if (ctx) {
            ggml_v2_free(ctx);
        }
    }
};
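// As the MEM_REQ_KV_SELF_2 comment above implies, k and v each hold
// n_embd * n_ctx * n_layer elements (f16 by default): one entry per layer
// per context position.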
struct llama_v2_model {
    e_model2 type = MODEL_UNKNOWN_2;

    llama_v2_hparams hparams;

    struct ggml_v2_tensor * tok_embeddings;
    struct ggml_v2_tensor * norm;
    struct ggml_v2_tensor * output;

    std::vector<llama_v2_layer> layers;

    // context
    struct ggml_v2_context * ctx = NULL;

    // key + value cache for the self attention
    // TODO: move to llama_v2_state
    struct llama_v2_kv_cache kv_self;

    // the model memory buffer
    llama_v2_ctx_buffer buf;

    // model memory mapped file
    std::unique_ptr<llama_v2_mmap> mapping;

    // objects representing data potentially being locked in memory
    llama_v2_mlock mlock_buf;
    llama_v2_mlock mlock_mmap;

    // for quantize-stats only
    std::vector<std::pair<std::string, struct ggml_v2_tensor *>> tensors_by_name;

    ~llama_v2_model() {
        if (ctx) {
            ggml_v2_free(ctx);
        }
    }
};
struct llama_v2_vocab {
    using id = int32_t;
    using token = std::string;

    struct token_score {
        token tok;
        float score;
    };

    std::unordered_map<token, id> token_to_id;
    std::vector<token_score> id_to_token;
};
struct llama_v2_context {
    std::mt19937 rng;

    int64_t t_load_us = 0;
    int64_t t_start_us = 0;
    bool has_evaluated_once = false;

    int64_t t_sample_us = 0;
    int64_t t_eval_us = 0;
    int64_t t_p_eval_us = 0;

    int32_t n_sample = 0; // number of tokens sampled
    int32_t n_eval = 0; // number of eval calls
    int32_t n_p_eval = 0; // number of tokens in eval calls for the prompt (with batch size > 1)

    llama_v2_model model;
    llama_v2_vocab vocab;

    size_t mem_per_token = 0;

    // decode output (2-dimensional array: [n_tokens][n_vocab])
    std::vector<float> logits;
    bool logits_all = false;

    // input embedding (1-dimensional array: [n_embd])
    std::vector<float> embedding;

    // memory buffers used to evaluate the model
    // TODO: move in llama_v2_state
    llama_v2_ctx_buffer buf_compute;
    llama_v2_ctx_buffer buf_scratch[LLAMA_V2_MAX_SCRATCH_BUFFERS];
    int buf_last = 0;
    size_t buf_max_size[LLAMA_V2_MAX_SCRATCH_BUFFERS] = { 0 };

    void use_buf(struct ggml_v2_context * ctx, int i) {
#if defined(LLAMA_V2_USE_SCRATCH)
        size_t last_size = 0;
        if (i == -1) {
            last_size = ggml_v2_set_scratch(ctx, { 0, 0, nullptr, });
        } else {
            auto & buf = buf_scratch[i];
            last_size = ggml_v2_set_scratch(ctx, { 0, buf.size, buf.addr, });
        }
        if (buf_last >= 0) {
            buf_max_size[buf_last] = std::max(buf_max_size[buf_last], last_size);
        }
        buf_last = i;
#else
        (void) i;
        (void) ctx;
#endif
    }
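    // Illustrative call pattern from the eval path (a sketch; lctx/ctx0 are the
    // conventional local names, not defined here):
    //   lctx.use_buf(ctx0, 0);   // route intermediate tensors into scratch buffer 0
    //   ...                      // build part of the graph
    //   lctx.use_buf(ctx0, 1);   // switch to scratch buffer 1
    //   lctx.use_buf(ctx0, -1);  // -1 restores the default (non-scratch) allocator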
    size_t get_buf_max_mem(int i) const {
#if defined(LLAMA_V2_USE_SCRATCH)
        return buf_max_size[i];
#else
        (void) i;
        return 0;
#endif
    }
};
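// Overflow-checked size arithmetic used by the tensor-size calculations below:
// checked_mul2 verifies the product divides back to its operand (e.g. with
// 32-bit operands, 70000 * 70000 wraps and fails the ret / a != b check), and
// checked_div2 rejects any division that leaves a remainder.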
template <typename T>
static T checked_mul2(T a, T b) {
    T ret = a * b;
    if (a != 0 && ret / a != b) {
        throw format_old("overflow multiplying %llu * %llu",
                         (unsigned long long) a, (unsigned long long) b);
    }
    return ret;
}

static size_t checked_div2(size_t a, size_t b) {
    if (b == 0 || a % b != 0) {
        throw format_old("error dividing %zu / %zu", a, b);
    }
    return a / b;
}
static std::string llama_v2_format_tensor_shape(const std::vector<uint32_t> & ne) {
    char buf[256];
    snprintf(buf, sizeof(buf), "%5u", ne.at(0));
    for (size_t i = 1; i < ne.size(); i++) {
        snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), " x %5u", ne.at(i));
    }
    return buf;
}
static size_t llama_v2_calc_tensor_size(const std::vector<uint32_t> & ne, enum ggml_v2_type type) {
    size_t size = ggml_v2_type_size(type);
    for (uint32_t dim : ne) {
        size = checked_mul2<size_t>(size, dim);
    }
    return size / ggml_v2_blck_size(type);
}
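// Example (illustrative): an f16 tensor with ne = {4096, 4096} has
// ggml_v2_type_size == 2 and block size 1, so this returns
// 2 * 4096 * 4096 bytes = 32 MiB; quantized types divide by their block size.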
struct llama_v2_load_tensor_shard {
    std::vector<uint32_t> ne;
    size_t size;
    enum ggml_v2_type type;
    size_t file_idx;
    size_t file_off;

    void calc_size() {
        size = llama_v2_calc_tensor_size(ne, type);
    }
};
enum llama_v2_split_type {
    SPLIT_NONE_2,
    SPLIT_BY_COLUMNS_2,
    SPLIT_BY_ROWS_2
};
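// How multi-part (pre-GGJT) checkpoints are stitched back together: column-split
// tensors are concatenated along ne[0], row-split tensors along ne[1]. For
// example, two column shards of {2048, 4096} combine into {4096, 4096};
// see calc_ne() below.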
struct llama_v2_load_tensor {
    std::vector<llama_v2_load_tensor_shard> shards;

    std::string name;
    enum ggml_v2_type type = GGML_V2_TYPE_F32;
    llama_v2_split_type split_type = SPLIT_NONE_2;
    std::vector<uint32_t> ne;
    size_t size;
    struct ggml_v2_tensor * ggml_v2_tensor = NULL;
    uint8_t * data;

    llama_v2_load_tensor(const std::string & name) : name(name) {}

    void calc_all() {
        calc_type();
        calc_split_type();
        calc_ne();
        calc_size();
    }
    void calc_type() {
        const auto & first_shard = shards.at(0);
        for (const auto & shard : shards) {
            if (shard.type != first_shard.type) {
                throw format_old("inconsistent tensor shard type in '%s'", name.c_str());
            }
        }
        type = first_shard.type;
    }
    void calc_split_type() {
        if (shards.at(0).ne.size() == 1 || // 1D tensors are just duplicated in every file
            shards.size() == 1) { // only one file?
            split_type = SPLIT_NONE_2;
        } else if (name.find("tok_embeddings.") == 0 ||
                   name.find(".attention.wo.weight") != std::string::npos ||
                   name.find(".feed_forward.w2.weight") != std::string::npos) {
            split_type = SPLIT_BY_COLUMNS_2;
        } else {
            split_type = SPLIT_BY_ROWS_2;
        }
    }
    void calc_ne() {
        const auto & first_shard = shards.at(0);
        for (const auto & shard : shards) {
            if (shard.ne != first_shard.ne) {
                throw format_old("inconsistent tensor shard shape in '%s': first was %s, other was %s",
                                 name.c_str(), llama_v2_format_tensor_shape(first_shard.ne).c_str(), llama_v2_format_tensor_shape(shard.ne).c_str());
            }
        }
        ne = first_shard.ne;
        LLAMA_V2_ASSERT(shards.size() <= UINT32_MAX);
        uint32_t n_shards = (uint32_t) shards.size();
        switch (split_type) {
            case SPLIT_NONE_2:
                ne = first_shard.ne;
                break;
            case SPLIT_BY_COLUMNS_2:
                ne = {checked_mul2<uint32_t>(first_shard.ne[0], n_shards),
                      first_shard.ne[1]};
                break;
            case SPLIT_BY_ROWS_2:
                ne = {first_shard.ne[0],
                      checked_mul2<uint32_t>(first_shard.ne[1], n_shards)};
                break;
        }
    }
    void calc_size() {
        size = llama_v2_calc_tensor_size(ne, type);
    }
};
struct llama_v2_load_tensors_map {
    // tensors is kept in a separate vector to preserve file order
    std::vector<llama_v2_load_tensor> tensors;
    std::unordered_map<std::string, size_t> name_to_idx;
};
enum llama_v2_file_version {
    LLAMA_V2_FILE_VERSION_GGML,
    LLAMA_V2_FILE_VERSION_GGMF_V1, // added version field and scores in vocab
    LLAMA_V2_FILE_VERSION_GGJT_V1, // added padding
    LLAMA_V2_FILE_VERSION_GGJT_V2, // changed quantization format
    LLAMA_V2_FILE_VERSION_GGJT_V3, // changed Q4 and Q8 quantization format
};
struct llama_v2_file_loader {
    llama_v2_file file;
    llama_v2_file_version file_version;
    llama_v2_hparams hparams;
    llama_v2_vocab vocab;

    llama_v2_file_loader(const char * fname, size_t file_idx, llama_v2_load_tensors_map & tensors_map)
        : file(fname, "rb") {
        fprintf(stderr, "llama.cpp: loading model from %s\n", fname);
        read_magic();
        read_hparams();
        read_vocab();
        read_tensor_metadata(file_idx, tensors_map);
    }
    void read_magic() {
        uint32_t magic = file.read_u32();
        uint32_t version = 0;
        if (magic != 'ggml') {
            version = file.read_u32();
        }
        if (magic == 'ggml' && version == 0) {
            file_version = LLAMA_V2_FILE_VERSION_GGML;
        } else if (magic == 'ggmf' && version == 1) {
            file_version = LLAMA_V2_FILE_VERSION_GGMF_V1;
        } else if (magic == 'ggjt' && version == 1) {
            file_version = LLAMA_V2_FILE_VERSION_GGJT_V1;
        } else if (magic == 'ggjt' && version == 2) {
            file_version = LLAMA_V2_FILE_VERSION_GGJT_V2;
        } else if (magic == 'ggjt' && version == 3) {
            file_version = LLAMA_V2_FILE_VERSION_GGJT_V3;
        } else {
            throw format_old("unknown (magic, version) combination: %08x, %08x; is this really a GGML file?",
                             magic, version);
        }
    }
    void read_hparams() {
        hparams.n_vocab = file.read_u32();
        hparams.n_embd = file.read_u32();
        hparams.n_mult = file.read_u32();
        hparams.n_head = file.read_u32();
        hparams.n_layer = file.read_u32();
        hparams.n_rot = file.read_u32();
        hparams.ftype = (enum llama_v2_ftype) file.read_u32();
    }
    void read_vocab() {
        vocab.id_to_token.resize(hparams.n_vocab);
        int32_t vocabloops = hparams.n_vocab;
        if(vocabloops==32001 && file_version == LLAMA_V2_FILE_VERSION_GGML)
        {
            printf("---\n!! WARNING: Model appears to be GPT4ALL v1 model, triggering compatibility fix !!\n---\n");
            vocabloops -= 1;
        }
        for (uint32_t i = 0; i < vocabloops; i++) {
            uint32_t len = file.read_u32();
            std::string word = file.read_string(len);
            float score = 0.0f;
            if (file_version >= LLAMA_V2_FILE_VERSION_GGMF_V1) {
                file.read_raw(&score, sizeof(score));
            }
            vocab.token_to_id[word] = i;
            auto & tok_score = vocab.id_to_token[i];
            tok_score.tok = std::move(word);
            tok_score.score = score;
        }
    }
    void read_tensor_metadata(size_t file_idx, llama_v2_load_tensors_map & tensors_map) {
        while (file.tell() < file.size) {
            llama_v2_load_tensor_shard shard;
            uint32_t n_dims = file.read_u32();
            uint32_t name_len = file.read_u32();
            shard.type = (enum ggml_v2_type) file.read_u32();
            shard.ne.resize(n_dims);
            file.read_raw(shard.ne.data(), sizeof(shard.ne[0]) * n_dims);
            std::string name = file.read_string(name_len);
            if (n_dims < 1 || n_dims > 2) {
                throw format_old("llama.cpp: tensor '%s' should not be %u-dimensional", name.c_str(), n_dims);
            }
            switch (shard.type) {
                case GGML_V2_TYPE_F32:
                case GGML_V2_TYPE_F16:
                case GGML_V2_TYPE_Q4_0:
                case GGML_V2_TYPE_Q4_1:
                case GGML_V2_TYPE_Q4_2:
                case GGML_V2_TYPE_Q4_3:
                case GGML_V2_TYPE_Q5_0:
                case GGML_V2_TYPE_Q5_1:
                case GGML_V2_TYPE_Q8_0:
                    break;
                default: {
                    throw format_old("unrecognized tensor type %u\n", shard.type);
                }
            }
            if (file_version >= LLAMA_V2_FILE_VERSION_GGJT_V1) {
                // skip to the next multiple of 32 bytes
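                // (-file.tell() & 31) is the padding needed to reach the next
                // 32-byte boundary, e.g. at offset 100 this seeks forward 28
                // bytes to offset 128. GGJT files pad tensor data this way
                // (see "added padding" above) so it can be mmapped directly.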
                file.seek(-file.tell() & 31, SEEK_CUR);
            }
            shard.file_idx = file_idx;
            shard.file_off = file.tell();
            shard.calc_size();
            file.seek(shard.size, SEEK_CUR);
            auto it = tensors_map.name_to_idx.find(name);
            size_t idx;
            if (it != tensors_map.name_to_idx.end()) {
                idx = it->second;
            } else {
                tensors_map.tensors.emplace_back(name);
                idx = tensors_map.tensors.size() - 1;
                tensors_map.name_to_idx.emplace(name, idx);
            }
            tensors_map.tensors.at(idx).shards.push_back(shard);
        }
    }
};
struct llama_v2_file_saver {
    llama_v2_file file;
    llama_v2_file_loader * any_file_loader;

    llama_v2_file_saver(const char * fname, llama_v2_file_loader * any_file_loader, enum llama_v2_ftype new_ftype)
        : file(fname, "wb"), any_file_loader(any_file_loader) {
        fprintf(stderr, "llama.cpp: saving model to %s\n", fname);
        write_magic();
        write_hparams(new_ftype);
        write_vocab();
    }
    void write_magic() {
        file.write_u32(LLAMA_V2_FILE_MAGIC); // magic
        file.write_u32(LLAMA_V2_FILE_VERSION); // version
    }
    void write_hparams(enum llama_v2_ftype new_ftype) {
        const llama_v2_hparams & hparams = any_file_loader->hparams;
        file.write_u32(hparams.n_vocab);
        file.write_u32(hparams.n_embd);
        file.write_u32(hparams.n_mult);
        file.write_u32(hparams.n_head);
        file.write_u32(hparams.n_layer);
        file.write_u32(hparams.n_rot);
        file.write_u32(new_ftype);
    }
    void write_vocab() {
        if (any_file_loader->file_version == LLAMA_V2_FILE_VERSION_GGML) {
            fprintf(stderr, "llama.cpp: WARNING: input is an old file that doesn't have scores; will add dummy scores\n");
        }
        uint32_t n_vocab = any_file_loader->hparams.n_vocab;
        for (uint32_t i = 0; i < n_vocab; i++) {
            const auto & token_score = any_file_loader->vocab.id_to_token.at(i);
            file.write_u32((uint32_t) token_score.tok.size());
            file.write_raw(token_score.tok.data(), token_score.tok.size());
            file.write_raw(&token_score.score, sizeof(token_score.score));
        }
    }
    void write_tensor(llama_v2_load_tensor & tensor, enum ggml_v2_type new_type, const void * new_data, size_t new_size) {
        switch (new_type) {
            case GGML_V2_TYPE_F32:
            case GGML_V2_TYPE_F16:
            case GGML_V2_TYPE_Q4_0:
            case GGML_V2_TYPE_Q4_1:
            case GGML_V2_TYPE_Q4_2:
            case GGML_V2_TYPE_Q4_3:
            case GGML_V2_TYPE_Q5_0:
            case GGML_V2_TYPE_Q5_1:
            case GGML_V2_TYPE_Q8_0:
                break;
            default: LLAMA_V2_ASSERT(false);
        }
        file.write_u32((uint32_t) tensor.ne.size());
        file.write_u32((uint32_t) tensor.name.size());
        file.write_u32(new_type);
        file.write_raw(tensor.ne.data(), sizeof(tensor.ne[0]) * tensor.ne.size());
        file.write_raw(tensor.name.data(), tensor.name.size());
        file.seek(-file.tell() & 31, SEEK_CUR);
        LLAMA_V2_ASSERT(new_size == llama_v2_calc_tensor_size(tensor.ne, new_type));
        file.write_raw(new_data, new_size);
    }
};
struct llama_v2_model_loader {
    std::vector<std::unique_ptr<llama_v2_file_loader>> file_loaders;
    llama_v2_load_tensors_map tensors_map;
    bool use_mmap;
    size_t num_ggml_v2_tensors_created = 0;
    struct ggml_v2_context * ggml_v2_ctx = NULL;
    std::unique_ptr<llama_v2_mmap> mapping;

    llama_v2_model_loader(const std::string & fname_base, bool use_mmap, bool vocab_only) {
        auto * first_file = new llama_v2_file_loader(fname_base.c_str(), 0, tensors_map);
        file_loaders.emplace_back(first_file);
        uint32_t n_parts = vocab_only ? 1 : guess_n_parts();
        for (uint32_t i = 1; i < n_parts; i++) {
            std::string fname = fname_base + "." + std::to_string(i);
            auto * ith_file = new llama_v2_file_loader(fname.c_str(), i, tensors_map);
            file_loaders.emplace_back(ith_file);
            if (ith_file->hparams != first_file->hparams) {
                throw format_old("llama.cpp: hparams inconsistent between files");
            }
        }
        if (!llama_v2_mmap::SUPPORTED) {
            use_mmap = false;
        }
        if (use_mmap && alignment_prevents_mmap()) {
            fprintf(stderr, "llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this\n");
            use_mmap = false;
        }
        this->use_mmap = use_mmap;
        for (llama_v2_load_tensor & lt : tensors_map.tensors) {
            lt.calc_all();
        }
    }
    bool alignment_prevents_mmap() {
        for (const llama_v2_load_tensor & lt : tensors_map.tensors) {
            for (const llama_v2_load_tensor_shard & shard : lt.shards) {
                if (shard.file_off & 3) {
                    return true;
                }
            }
        }
        return false;
    }
    uint32_t guess_n_parts() const {
        auto it = tensors_map.name_to_idx.find("tok_embeddings.weight");
        if (it == tensors_map.name_to_idx.end()) {
            throw std::string("missing tok_embeddings.weight");
        }
        const llama_v2_load_tensor & lt = tensors_map.tensors.at(it->second);
        return file_loaders.at(0)->hparams.n_embd / lt.shards.at(0).ne.at(0);
    }
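    // Example (illustrative): tok_embeddings.weight is column-split, so a 7B
    // model saved in two parts stores shards with ne[0] == 2048 while
    // n_embd == 4096, giving 4096 / 2048 = 2 parts; a single-file model gives 1.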
    void calc_sizes(size_t * ctx_size_p, size_t * mmapped_size_p) const {
        *ctx_size_p = *mmapped_size_p = 0;
        for (const llama_v2_load_tensor & lt : tensors_map.tensors) {
            *ctx_size_p += sizeof(struct ggml_v2_tensor) + GGML_V2_OBJECT_SIZE;
            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size;
        }
    }
    struct ggml_v2_tensor * get_tensor(const std::string & name, const std::vector<uint32_t> & ne) {
        auto it = tensors_map.name_to_idx.find(name);
        if (it == tensors_map.name_to_idx.end()) {
            throw format_old("llama.cpp: tensor '%s' is missing from model", name.c_str());
        }
        llama_v2_load_tensor & lt = tensors_map.tensors.at(it->second);
        if (lt.ne != ne) {
            throw format_old("llama.cpp: tensor '%s' has wrong shape; expected %s, got %s",
                             name.c_str(), llama_v2_format_tensor_shape(ne).c_str(), llama_v2_format_tensor_shape(lt.ne).c_str());
        }
        return get_tensor_for(lt);
    }
    struct ggml_v2_tensor * get_tensor_for(llama_v2_load_tensor & lt) {
        struct ggml_v2_tensor * tensor;
        if (lt.ne.size() == 2) {
            tensor = ggml_v2_new_tensor_2d(ggml_v2_ctx, lt.type, lt.ne.at(0), lt.ne.at(1));
        } else {
            LLAMA_V2_ASSERT(lt.ne.size() == 1);
            tensor = ggml_v2_new_tensor_1d(ggml_v2_ctx, lt.type, lt.ne.at(0));
        }
        ggml_v2_set_name(tensor, lt.name.c_str());
        LLAMA_V2_ASSERT(lt.ggml_v2_tensor == NULL); // if this fails, we called get_tensor twice on the same tensor
        lt.ggml_v2_tensor = tensor;
        num_ggml_v2_tensors_created++;
        return tensor;
    }
    void done_getting_tensors() const {
        if (num_ggml_v2_tensors_created != tensors_map.tensors.size()) {
            throw std::string("llama.cpp: file contained more tensors than expected");
        }
    }
void load_all_data(llama_v2_progress_callback progress_callback, void * progress_callback_user_data, llama_v2_mlock * lmlock) {
size_t data_size = 0;
for (const llama_v2_load_tensor & lt : tensors_map.tensors) {
data_size += lt.size;
}
if (use_mmap) {
mapping.reset(new llama_v2_mmap(&file_loaders.at(0)->file));
if (!lmlock) {
// Don't call the callback since the actual loading will be lazy
// and we can't measure it.
progress_callback = NULL;
}
if (lmlock) {
lmlock->init(mapping->addr);
}
}
size_t done_size = 0;
for (llama_v2_load_tensor & lt : tensors_map.tensors) {
if (progress_callback) {
progress_callback((float) done_size / data_size, progress_callback_user_data);
}
LLAMA_V2_ASSERT(lt.ggml_v2_tensor); // unused tensors should have been caught by load_data already
lt.data = (uint8_t *) lt.ggml_v2_tensor->data;
load_data_for(lt);
lt.ggml_v2_tensor->data = lt.data;
done_size += lt.size;
if (use_mmap && lmlock) {
lmlock->grow_to(done_size);
}
}
if (progress_callback) {
progress_callback(1.0f, progress_callback_user_data);
}
}
void load_data_for(llama_v2_load_tensor & lt) {
if (use_mmap) {
LLAMA_V2_ASSERT(lt.shards.size() == 1);
lt.data = (uint8_t *) mapping->addr + lt.shards.at(0).file_off;
} else if (lt.split_type == SPLIT_NONE_2) {
llama_v2_file & file = file_loaders.at(lt.shards.at(0).file_idx)->file;
file.seek(lt.shards.at(0).file_off, SEEK_SET);
file.read_raw(lt.data, lt.size);
} else if (lt.split_type == SPLIT_BY_ROWS_2) {
size_t offset = 0;
for (llama_v2_load_tensor_shard & shard : lt.shards) {
llama_v2_file & file = file_loaders.at(shard.file_idx)->file;
file.seek(shard.file_off, SEEK_SET);
file.read_raw(lt.data + offset, shard.size);
offset += shard.size;
}
LLAMA_V2_ASSERT(offset == lt.size);
} else if (lt.split_type == SPLIT_BY_COLUMNS_2) {
// Let's load the data into temporary buffers to ensure the OS performs large loads.
std::vector<llama_v2_buffer> tmp_bufs(lt.shards.size());
for (size_t i = 0; i < lt.shards.size(); i++) {
llama_v2_load_tensor_shard & shard = lt.shards.at(i);
llama_v2_file & file = file_loaders.at(shard.file_idx)->file;
file.seek(shard.file_off, SEEK_SET);
tmp_bufs.at(i).resize(shard.size);
file.read_raw(tmp_bufs.at(i).addr, shard.size);
}
// Then interleave the shards: the tensor was split by columns, so each output row is built
// by concatenating that row's slice from every shard in order.
size_t num_rows = lt.ne.at(1);
size_t per_shard_row_size = lt.shards.at(0).size / num_rows;
size_t out_offset = 0;
for (size_t row = 0; row < num_rows; row++) {
for (llama_v2_buffer & tmp_buf : tmp_bufs) {
memcpy(lt.data + out_offset,
tmp_buf.addr + row * per_shard_row_size,
per_shard_row_size);
out_offset += per_shard_row_size;
}
}
LLAMA_V2_ASSERT(out_offset == lt.size);
}
if (0) {
print_checksum(lt);
}
}
static void print_checksum(llama_v2_load_tensor & lt) {
uint32_t sum = 0;
for (size_t i = 0; i < lt.size; i++) {
uint8_t byte = lt.data[i];
sum = byte + (sum << 6) + (sum << 16) - sum; // sdbm hash
}
fprintf(stderr, "%s checksum: %#08x (%s, size %zu)\n", lt.name.c_str(), sum,
llama_v2_format_tensor_shape(lt.ne).c_str(), lt.size);
}
};
//
// kv cache
//
static bool kv_cache_init(
const struct llama_v2_hparams & hparams,
struct llama_v2_kv_cache & cache,
ggml_v2_type wtype,
int n_ctx) {
const int n_embd = hparams.n_embd;
const int n_layer = hparams.n_layer;
const int64_t n_mem = n_layer*n_ctx;
const int64_t n_elements = n_embd*n_mem;
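// one K and one V tensor, each with n_layer*n_ctx*n_embd elements, plus 2*MB_2 of headroom
// for ggml object overhead; e.g. a 7B model (n_embd = 4096, n_layer = 32) at n_ctx = 512
// with F16 storage needs roughly 2*32*512*4096*2 bytes = 256 MB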
cache.buf.resize(2u*n_elements*ggml_v2_type_size(wtype) + 2u*MB_2);
struct ggml_v2_init_params params;
params.mem_size = cache.buf.size;
params.mem_buffer = cache.buf.addr;
params.no_alloc = false;
cache.ctx = ggml_v2_init(params);
if (!cache.ctx) {
fprintf(stderr, "%s: failed to allocate memory for kv cache\n", __func__);
return false;
}
cache.k = ggml_v2_new_tensor_1d(cache.ctx, wtype, n_elements);
cache.v = ggml_v2_new_tensor_1d(cache.ctx, wtype, n_elements);
ggml_v2_set_name(cache.k, "cache_k");
ggml_v2_set_name(cache.v, "cache_v");
return true;
}
struct llama_v2_context_params llama_v2_context_default_params() {
struct llama_v2_context_params result = {
/*.n_ctx =*/ 512,
/*.gpu_layers =*/ 0,
/*.seed =*/ -1,
/*.f16_kv =*/ true,
/*.logits_all =*/ false,
/*.vocab_only =*/ false,
/*.use_mmap =*/ true,
/*.use_mlock =*/ false,
/*.embedding =*/ false,
/*.progress_callback =*/ nullptr,
/*.progress_callback_user_data =*/ nullptr,
};
return result;
}
bool llama_v2_mmap_supported() {
return llama_v2_mmap::SUPPORTED;
}
bool llama_v2_mlock_supported() {
return llama_v2_mlock::SUPPORTED;
}
//
// model loading
//
static const char *llama_v2_file_version_name(llama_v2_file_version version) {
switch (version) {
case LLAMA_V2_FILE_VERSION_GGML: return "'ggml' (old version with low tokenizer quality and no mmap support)";
case LLAMA_V2_FILE_VERSION_GGMF_V1: return "ggmf v1 (old version with no mmap support)";
case LLAMA_V2_FILE_VERSION_GGJT_V1: return "ggjt v1 (pre #1405)";
case LLAMA_V2_FILE_VERSION_GGJT_V2: return "ggjt v2 (pre #1508)";
case LLAMA_V2_FILE_VERSION_GGJT_V3: return "ggjt v3 (latest)";
}
return "unknown";
}
static const char *llama_v2_ftype_name(enum llama_v2_ftype ftype) {
switch (ftype) {
case LLAMA_V2_FTYPE_ALL_F32: return "all F32";
case LLAMA_V2_FTYPE_MOSTLY_F16: return "mostly F16";
case LLAMA_V2_FTYPE_MOSTLY_Q4_0: return "mostly Q4_0";
case LLAMA_V2_FTYPE_MOSTLY_Q4_1: return "mostly Q4_1";
case LLAMA_V2_FTYPE_MOSTLY_Q4_1_SOME_F16:
return "mostly Q4_1, some F16";
case LLAMA_V2_FTYPE_MOSTLY_Q4_2: return "mostly Q4_2";
case LLAMA_V2_FTYPE_MOSTLY_Q4_3: return "mostly Q4_3";
case LLAMA_V2_FTYPE_MOSTLY_Q5_0: return "mostly Q5_0";
case LLAMA_V2_FTYPE_MOSTLY_Q5_1: return "mostly Q5_1";
case LLAMA_V2_FTYPE_MOSTLY_Q8_0: return "mostly Q8_0";
default: return "unknown, may not work";
}
}
static const char *llama_v2_model_type_name(e_model2 type) {
switch (type) {
case MODEL_7B_2: return "7B";
case MODEL_13B_2: return "13B";
case MODEL_30B_2: return "30B";
case MODEL_65B_2: return "65B";
default:
printf("\nWARNING: NON-STANDARD LLAMA FILE DETECTED. DEFAULT TO 7B SIZE.\n");
return "UNKNOWN";
}
}
static void llama_v2_model_load_internal(
const std::string & fname,
llama_v2_context & lctx,
int n_ctx,
int n_gpu_layers,
ggml_v2_type memory_type,
bool use_mmap,
bool use_mlock,
bool vocab_only,
llama_v2_progress_callback progress_callback,
void * progress_callback_user_data) {
lctx.t_start_us = ggml_v2_time_us();
std::unique_ptr<llama_v2_model_loader> ml(new llama_v2_model_loader(fname, use_mmap, vocab_only));
lctx.vocab = std::move(ml->file_loaders.at(0)->vocab);
auto & model = lctx.model;
model.hparams = ml->file_loaders.at(0)->hparams;
llama_v2_file_version file_version = ml->file_loaders.at(0)->file_version;
auto & hparams = model.hparams;
uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;
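// n_ff rounds 2/3 of 4*n_embd up to a multiple of n_mult (the hidden size of the SwiGLU FFN);
// e.g. for 7B: n_embd = 4096, n_mult = 256 -> 2*16384/3 = 10922 -> rounded up to 11008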
{
switch (hparams.n_layer) {
case 32: model.type = e_model2::MODEL_7B_2; break;
case 40: model.type = e_model2::MODEL_13B_2; break;
case 60: model.type = e_model2::MODEL_30B_2; break;
case 80: model.type = e_model2::MODEL_65B_2; break;
default: model.type = e_model2::MODEL_UNKNOWN_2; break;
}
hparams.n_ctx = n_ctx;
}
{
fprintf(stderr, "%s: format = %s\n", __func__, llama_v2_file_version_name(file_version));
fprintf(stderr, "%s: n_vocab = %u\n", __func__, hparams.n_vocab);
fprintf(stderr, "%s: n_ctx = %u\n", __func__, hparams.n_ctx);
fprintf(stderr, "%s: n_embd = %u\n", __func__, hparams.n_embd);
fprintf(stderr, "%s: n_mult = %u\n", __func__, hparams.n_mult);
fprintf(stderr, "%s: n_head = %u\n", __func__, hparams.n_head);
fprintf(stderr, "%s: n_layer = %u\n", __func__, hparams.n_layer);
fprintf(stderr, "%s: n_rot = %u\n", __func__, hparams.n_rot);
fprintf(stderr, "%s: ftype = %u (%s)\n", __func__, hparams.ftype, llama_v2_ftype_name(hparams.ftype));
fprintf(stderr, "%s: n_ff = %u\n", __func__, n_ff);
fprintf(stderr, "%s: n_parts = %zu\n", __func__, ml->file_loaders.size());
fprintf(stderr, "%s: model size = %s\n", __func__, llama_v2_model_type_name(model.type));
}
if (file_version < LLAMA_V2_FILE_VERSION_GGJT_V2) {
if (hparams.ftype != LLAMA_V2_FTYPE_ALL_F32 &&
hparams.ftype != LLAMA_V2_FTYPE_MOSTLY_F16 &&
hparams.ftype != LLAMA_V2_FTYPE_MOSTLY_Q8_0) {
printf("\nLegacy LLAMA GGJT v1 compatability changes triggered.\n");
}
}
if (file_version < LLAMA_V2_FILE_VERSION_GGJT_V3) {
if (hparams.ftype == LLAMA_V2_FTYPE_MOSTLY_Q4_0 ||
hparams.ftype == LLAMA_V2_FTYPE_MOSTLY_Q4_1 ||
hparams.ftype == LLAMA_V2_FTYPE_MOSTLY_Q8_0) {
printf("\nLegacy LLAMA GGJT v2 compatability changes triggered.\n");
}
}
if (vocab_only) {
return;
}
auto & ctx = model.ctx;
size_t ctx_size;
size_t mmapped_size;
ml->calc_sizes(&ctx_size, &mmapped_size);
fprintf(stderr, "%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size/1024.0/1024.0);
// print memory requirements
{
const size_t scale = memory_type == GGML_V2_TYPE_F32 ? 2 : 1;
// this is the total memory required to run the inference
const size_t mem_required =
ctx_size +
mmapped_size +
MEM_REQ_SCRATCH0_2().at(model.type) +
MEM_REQ_SCRATCH1_2().at(model.type) +
MEM_REQ_EVAL_2().at(model.type);
// this is the memory required by one llama_v2_state
const size_t mem_required_state =
scale*MEM_REQ_KV_SELF_2().at(model.type);
fprintf(stderr, "%s: mem required = %7.2f MB (+ %7.2f MB per state)\n", __func__,
mem_required / 1024.0 / 1024.0, mem_required_state / 1024.0 / 1024.0);
}
// create the ggml context
{
lctx.model.buf.resize(ctx_size);
if (use_mlock) {
lctx.model.mlock_buf.init(lctx.model.buf.addr);
lctx.model.mlock_buf.grow_to(lctx.model.buf.size);
}
struct ggml_v2_init_params params = {
/*.mem_size =*/ lctx.model.buf.size,
/*.mem_buffer =*/ lctx.model.buf.addr,
/*.no_alloc =*/ ml->use_mmap,
};
model.ctx = ggml_v2_init(params);
if (!model.ctx) {
throw format_old("ggml_v2_init() failed");
}
}
// prepare memory for the weights
{
const uint32_t n_embd = hparams.n_embd;
const uint32_t n_layer = hparams.n_layer;
const uint32_t n_vocab = hparams.n_vocab;
ml->ggml_v2_ctx = ctx;
model.tok_embeddings = ml->get_tensor("tok_embeddings.weight", {n_embd, n_vocab});
model.norm = ml->get_tensor("norm.weight", {n_embd});
model.output = ml->get_tensor("output.weight", {n_embd, n_vocab});
model.layers.resize(n_layer);
for (uint32_t i = 0; i < n_layer; ++i) {
auto & layer = model.layers[i];
std::string layers_i = "layers." + std::to_string(i);
layer.attention_norm = ml->get_tensor(layers_i + ".attention_norm.weight", {n_embd});
layer.wq = ml->get_tensor(layers_i + ".attention.wq.weight", {n_embd, n_embd});
layer.wk = ml->get_tensor(layers_i + ".attention.wk.weight", {n_embd, n_embd});
layer.wv = ml->get_tensor(layers_i + ".attention.wv.weight", {n_embd, n_embd});
layer.wo = ml->get_tensor(layers_i + ".attention.wo.weight", {n_embd, n_embd});
layer.ffn_norm = ml->get_tensor(layers_i + ".ffn_norm.weight", {n_embd});
layer.w1 = ml->get_tensor(layers_i + ".feed_forward.w1.weight", {n_embd, n_ff});
layer.w2 = ml->get_tensor(layers_i + ".feed_forward.w2.weight", { n_ff, n_embd});
layer.w3 = ml->get_tensor(layers_i + ".feed_forward.w3.weight", {n_embd, n_ff});
}
}
ml->done_getting_tensors();
// populate `tensors_by_name`
for (llama_v2_load_tensor & lt : ml->tensors_map.tensors) {
model.tensors_by_name.emplace_back(lt.name, lt.ggml_v2_tensor);
}
ml->load_all_data(progress_callback, progress_callback_user_data, use_mlock ? &lctx.model.mlock_mmap : NULL);
model.mapping = std::move(ml->mapping);
#if defined(GGML_USE_CUBLAS)
{
const int n_gpu = std::min(n_gpu_layers, int(hparams.n_layer));
if(GetQuantsUnshuffled())
{
fprintf(stderr, "%s: [old cublas] offloading %d layers to GPU\n", __func__, n_gpu);
size_t vram_total = 0;
for (int i = 0; i < n_gpu; ++i) {
const auto & layer = model.layers[i];
ggml_v2_cuda_transform_tensor(layer.wq); vram_total += ggml_v2_nbytes(layer.wq);
ggml_v2_cuda_transform_tensor(layer.wk); vram_total += ggml_v2_nbytes(layer.wk);
ggml_v2_cuda_transform_tensor(layer.wv); vram_total += ggml_v2_nbytes(layer.wv);
ggml_v2_cuda_transform_tensor(layer.wo); vram_total += ggml_v2_nbytes(layer.wo);
ggml_v2_cuda_transform_tensor(layer.w1); vram_total += ggml_v2_nbytes(layer.w1);
ggml_v2_cuda_transform_tensor(layer.w2); vram_total += ggml_v2_nbytes(layer.w2);
ggml_v2_cuda_transform_tensor(layer.w3); vram_total += ggml_v2_nbytes(layer.w3);
}
if (n_gpu_layers > (int) hparams.n_layer) {
fprintf(stderr, "%s: [old cublas] offloading output layer to GPU\n", __func__);
ggml_v2_cuda_transform_tensor(model.output); vram_total += ggml_v2_nbytes(model.output);
}
fprintf(stderr, "%s: [old cublas] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
}
else
{
if(n_gpu>0)
{
printf("\n[WARNING: Old format does not support GPU offloading! It will be deactivated!]\n");
}
}
}
#elif defined(GGML_USE_CLBLAST)
{
const int n_gpu = std::min(n_gpu_layers, int(hparams.n_layer));
if(GetQuantsUnshuffled())
{
fprintf(stderr, "%s: [opencl] offloading %d layers to GPU\n", __func__, n_gpu);
size_t vram_total = 0;
for (int i = 0; i < n_gpu; ++i) {
const auto & layer = model.layers[i];
ggml_v2_cl_transform_tensor(layer.wq); vram_total += ggml_v2_nbytes(layer.wq);
ggml_v2_cl_transform_tensor(layer.wk); vram_total += ggml_v2_nbytes(layer.wk);
ggml_v2_cl_transform_tensor(layer.wv); vram_total += ggml_v2_nbytes(layer.wv);
ggml_v2_cl_transform_tensor(layer.wo); vram_total += ggml_v2_nbytes(layer.wo);
ggml_v2_cl_transform_tensor(layer.w1); vram_total += ggml_v2_nbytes(layer.w1);
ggml_v2_cl_transform_tensor(layer.w2); vram_total += ggml_v2_nbytes(layer.w2);
ggml_v2_cl_transform_tensor(layer.w3); vram_total += ggml_v2_nbytes(layer.w3);
}
if (n_gpu_layers > (int) hparams.n_layer) {
fprintf(stderr, "%s: [opencl] offloading output layer to GPU\n", __func__);
ggml_v2_cl_transform_tensor(model.output); vram_total += ggml_v2_nbytes(model.output);
}
fprintf(stderr, "%s: [opencl] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
}
else
{
if(n_gpu>0)
{
printf("\n[WARNING: Old format does not support GPU offloading! It will be deactivated!]\n");
}
}
}
#else
(void) n_gpu_layers;
#endif
// loading time will be recalculated after the first eval, so
// we take page faults deferred by mmap() into consideration
lctx.t_load_us = ggml_v2_time_us() - lctx.t_start_us;
}
static bool llama_v2_model_load(
const std::string & fname,
llama_v2_context & lctx,
int n_ctx,
int n_gpu_layers,
ggml_v2_type memory_type,
bool use_mmap,
bool use_mlock,
bool vocab_only,
llama_v2_progress_callback progress_callback,
void *progress_callback_user_data) {
try {
llama_v2_model_load_internal(fname, lctx, n_ctx, n_gpu_layers, memory_type, use_mmap, use_mlock,
vocab_only, progress_callback, progress_callback_user_data);
return true;
} catch (const std::string & err) {
fprintf(stderr, "error loading model: %s\n", err.c_str());
return false;
}
}
// evaluate the transformer
//
// - lctx: llama context
// - tokens: new batch of tokens to process
// - n_past: the context size so far
// - n_threads: number of threads to use
//
static bool llama_v2_eval_internal(
llama_v2_context & lctx,
const llama_v2_token * tokens,
const int n_tokens,
const int n_past,
const int n_threads) {
// enforce that the first token is BOS (not needed, messes with my context manip code)
//if (n_past == 0 && tokens[0] != llama_v2_token_bos()) {
//fprintf(stderr, "%s: first token must be BOS\n", __func__);
// return false; //never fail. Not even in the face of Armageddon.
//}
const int64_t t_start_us = ggml_v2_time_us();
const int N = n_tokens;
const auto & model = lctx.model;
const auto & hparams = model.hparams;
const auto & kv_self = model.kv_self;
LLAMA_V2_ASSERT(!!kv_self.ctx);
const int n_embd = hparams.n_embd;
const int n_layer = hparams.n_layer;
const int n_ctx = hparams.n_ctx;
const int n_head = hparams.n_head;
const int n_vocab = hparams.n_vocab;
const int n_rot = hparams.n_embd/hparams.n_head;
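// n_rot is the per-head dimension (n_embd/n_head, e.g. 4096/32 = 128 for 7B); RoPE below is
// applied across this full head dimension of Q and K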
auto & mem_per_token = lctx.mem_per_token;
auto & buf_compute = lctx.buf_compute;
struct ggml_v2_init_params params = {
/*.mem_size =*/ buf_compute.size,
/*.mem_buffer =*/ buf_compute.addr,
/*.no_alloc =*/ false,
};
struct ggml_v2_context * ctx0 = ggml_v2_init(params);
// for big prompts, if BLAS is enabled, it is better to use only one thread
// otherwise, the threads spin-wait on the BLAS calls and degrade performance
ggml_v2_cgraph gf = {};
gf.n_threads = N >= 32 && ggml_v2_cpu_has_blas() && !ggml_v2_cpu_has_gpublas() ? 1 : n_threads;
struct ggml_v2_tensor * embd = ggml_v2_new_tensor_1d(ctx0, GGML_V2_TYPE_I32, N);
ggml_v2_set_name(embd, "embd");
memcpy(embd->data, tokens, N*ggml_v2_element_size(embd));
struct ggml_v2_tensor * inpL = ggml_v2_get_rows(ctx0, model.tok_embeddings, embd);
for (int il = 0; il < n_layer; ++il) {
struct ggml_v2_tensor * inpSA = inpL;
struct ggml_v2_tensor * cur;
lctx.use_buf(ctx0, 0);
// norm
{
cur = ggml_v2_rms_norm(ctx0, inpL);
// cur = attention_norm*cur
cur = ggml_v2_mul(ctx0,
ggml_v2_repeat(ctx0, model.layers[il].attention_norm, cur),
cur);
}
// self-attention
{
// compute Q and K and RoPE them
struct ggml_v2_tensor * Qcur = ggml_v2_rope_inplace(ctx0, ggml_v2_reshape_3d(ctx0, ggml_v2_mul_mat(ctx0, model.layers[il].wq, cur), n_embd/n_head, n_head, N), n_past, n_rot, 0);
struct ggml_v2_tensor * Kcur = ggml_v2_rope_inplace(ctx0, ggml_v2_reshape_3d(ctx0, ggml_v2_mul_mat(ctx0, model.layers[il].wk, cur), n_embd/n_head, n_head, N), n_past, n_rot, 0);
ggml_v2_set_name(Qcur, "Qcur");
ggml_v2_set_name(Kcur, "Kcur");
// store key and value to memory
{
// compute the transposed [N, n_embd] V matrix
struct ggml_v2_tensor * Vcur = ggml_v2_transpose(ctx0, ggml_v2_reshape_2d(ctx0, ggml_v2_mul_mat(ctx0, model.layers[il].wv, cur), n_embd, N));
struct ggml_v2_tensor * k = ggml_v2_view_1d(ctx0, kv_self.k, N*n_embd, (ggml_v2_element_size(kv_self.k)*n_embd)*(il*n_ctx + n_past));
struct ggml_v2_tensor * v = ggml_v2_view_2d(ctx0, kv_self.v, N, n_embd,
( n_ctx)*ggml_v2_element_size(kv_self.v),
(il*n_ctx)*ggml_v2_element_size(kv_self.v)*n_embd + n_past*ggml_v2_element_size(kv_self.v));
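// cache layout: kv_self.k is a flat buffer ordered [layer][position][embd], so the write
// offset for layer il at position n_past is (il*n_ctx + n_past)*n_embd elements; kv_self.v is
// stored transposed ([layer][embd][position]) so each embedding row is contiguous across
// positions, which keeps the (n_past + N)-wide reads of V further down cheap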
// important: storing RoPE-ed version of K in the KV cache!
ggml_v2_build_forward_expand(&gf, ggml_v2_cpy(ctx0, Kcur, k));
ggml_v2_build_forward_expand(&gf, ggml_v2_cpy(ctx0, Vcur, v));
}
struct ggml_v2_tensor * Q =
ggml_v2_permute(ctx0,
Qcur,
0, 2, 1, 3);
ggml_v2_set_name(Q, "Q");
struct ggml_v2_tensor * K =
ggml_v2_permute(ctx0,
ggml_v2_reshape_3d(ctx0,
ggml_v2_view_1d(ctx0, kv_self.k, (n_past + N)*n_embd, il*n_ctx*ggml_v2_element_size(kv_self.k)*n_embd),
n_embd/n_head, n_head, n_past + N),
0, 2, 1, 3);
ggml_v2_set_name(K, "K");
// K * Q
struct ggml_v2_tensor * KQ = ggml_v2_mul_mat(ctx0, K, Q);
ggml_v2_set_name(KQ, "KQ");
// KQ_scaled = KQ / sqrt(n_embd/n_head)
struct ggml_v2_tensor * KQ_scale = ggml_v2_new_f32(ctx0, 1.0f/sqrtf(float(n_embd)/n_head));
ggml_v2_set_name(KQ_scale, "1/sqrt(n_embd/n_head)");
// KQ_scaled shape [n_past + N, N, n_head, 1]
struct ggml_v2_tensor * KQ_scaled = ggml_v2_scale_inplace(ctx0, KQ, KQ_scale);
ggml_v2_set_name(KQ_scaled, "KQ_scaled");
// KQ_masked = mask_past(KQ_scaled)
struct ggml_v2_tensor * KQ_masked = ggml_v2_diag_mask_inf_inplace(ctx0, KQ_scaled, n_past);
ggml_v2_set_name(KQ_masked, "KQ_masked");
// KQ = soft_max(KQ_masked)
struct ggml_v2_tensor * KQ_soft_max = ggml_v2_soft_max_inplace(ctx0, KQ_masked);
ggml_v2_set_name(KQ_soft_max, "KQ_soft_max");
// split cached V into n_head heads
struct ggml_v2_tensor * V =
ggml_v2_view_3d(ctx0, kv_self.v,
n_past + N, n_embd/n_head, n_head,
n_ctx*ggml_v2_element_size(kv_self.v),
n_ctx*ggml_v2_element_size(kv_self.v)*n_embd/n_head,
il*n_ctx*ggml_v2_element_size(kv_self.v)*n_embd);
ggml_v2_set_name(V, "V");
#if 1
struct ggml_v2_tensor * KQV = ggml_v2_mul_mat(ctx0, V, KQ_soft_max);
ggml_v2_set_name(KQV, "KQV");
#else
// make V contiguous in memory to speed up the matmul; however, we waste time on the copy
// on M1 this is faster for the perplexity computation, but ~5% slower for the single-token generation
// is there a better way?
struct ggml_v2_tensor * V_cont = ggml_v2_cpy(ctx0, V, ggml_v2_new_tensor_3d(ctx0, kv_self.v->type, n_past + N, n_embd/n_head, n_head));
struct ggml_v2_tensor * KQV = ggml_v2_mul_mat(ctx0, V_cont, KQ_soft_max);
#endif
// KQV_merged = KQV.permute(0, 2, 1, 3)
struct ggml_v2_tensor * KQV_merged = ggml_v2_permute(ctx0, KQV, 0, 2, 1, 3);
ggml_v2_set_name(KQV_merged, "KQV_merged");
// cur = KQV_merged.contiguous().view(n_embd, N)
cur = ggml_v2_cpy(ctx0,
KQV_merged,
ggml_v2_new_tensor_2d(ctx0, GGML_V2_TYPE_F32, n_embd, N));
ggml_v2_set_name(cur, "KQV_merged_contiguous");
// projection (no bias)
cur = ggml_v2_mul_mat(ctx0,
model.layers[il].wo,
cur);
}
lctx.use_buf(ctx0, 1);
struct ggml_v2_tensor * inpFF = ggml_v2_add(ctx0, cur, inpSA);
// feed-forward network
{
// norm
{
cur = ggml_v2_rms_norm(ctx0, inpFF);
// cur = ffn_norm*cur
cur = ggml_v2_mul(ctx0,
ggml_v2_repeat(ctx0, model.layers[il].ffn_norm, cur),
cur);
}
struct ggml_v2_tensor * tmp = ggml_v2_mul_mat(ctx0,
model.layers[il].w3,
cur);
cur = ggml_v2_mul_mat(ctx0,
model.layers[il].w1,
cur);
// SILU activation
cur = ggml_v2_silu(ctx0, cur);
cur = ggml_v2_mul(ctx0, cur, tmp);
cur = ggml_v2_mul_mat(ctx0,
model.layers[il].w2,
cur);
}
cur = ggml_v2_add(ctx0, cur, inpFF);
// input for next layer
inpL = cur;
}
lctx.use_buf(ctx0, 0);
// used at the end to optionally extract the embeddings
struct ggml_v2_tensor * embeddings = NULL;
// norm
{
inpL = ggml_v2_rms_norm(ctx0, inpL);
// inpL = norm*inpL
inpL = ggml_v2_mul(ctx0,
ggml_v2_repeat(ctx0, model.norm, inpL),
inpL);
embeddings = inpL;
}
// lm_head
inpL = ggml_v2_mul_mat(ctx0, model.output, inpL);
lctx.use_buf(ctx0, -1);
// logits -> probs
//inpL = ggml_v2_soft_max_inplace(ctx0, inpL);
// run the computation
ggml_v2_build_forward_expand(&gf, inpL);
ggml_v2_graph_compute (ctx0, &gf);
#ifdef GGML_V2_PERF
// print timing information per ggml operation (for debugging purposes)
// requires GGML_V2_PERF to be defined
ggml_v2_graph_print(&gf);
#endif
// plot the computation graph in dot format (for debugging purposes)
//if (n_past%100 == 0) {
// ggml_v2_graph_dump_dot(&gf, NULL, "llama.dot");
//}
//embd_w.resize(n_vocab*N);
//memcpy(embd_w.data(), ggml_v2_get_data(inpL), sizeof(float)*n_vocab*N);
// update kv token count
lctx.model.kv_self.n = n_past + N;
// extract logits
{
auto & logits_out = lctx.logits;
if (lctx.logits_all) {
logits_out.resize(n_vocab * N);
memcpy(logits_out.data(), (float *) ggml_v2_get_data(inpL), sizeof(float)*n_vocab*N);
} else {
// return result for just the last token
logits_out.resize(n_vocab);
memcpy(logits_out.data(), (float *) ggml_v2_get_data(inpL) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
}
}
// extract embeddings
if (!lctx.embedding.empty()) {
auto & embedding_out = lctx.embedding;
embedding_out.resize(n_embd);
memcpy(embedding_out.data(), (float *) ggml_v2_get_data(embeddings) + (n_embd*(N - 1)), sizeof(float)*n_embd);
}
if (mem_per_token == 0) {
mem_per_token = ggml_v2_used_mem(ctx0)/N;
}
#if 0
printf("\n%s: used_mem = %.3f MB, scratch -- %.3f MB %.3f MB\n", __func__,
ggml_v2_used_mem(ctx0)/1024.0/1024.0,
lctx.get_buf_max_mem(0)/1024.0/1024.0,
lctx.get_buf_max_mem(1)/1024.0/1024.0);
#endif
ggml_v2_free(ctx0);
// measure the performance only for the single-token evals
if (N == 1) {
lctx.t_eval_us += ggml_v2_time_us() - t_start_us;
lctx.n_eval++;
}
else if (N > 1) {
lctx.t_p_eval_us += ggml_v2_time_us() - t_start_us;
lctx.n_p_eval += N;
}
return true;
}
//
// tokenizer
//
static size_t utf8_len2(char src) {
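// the high nibble of the lead byte determines the UTF-8 sequence length: 0xxx -> 1 (ASCII),
// 110x -> 2, 1110 -> 3, 11110 -> 4; continuation bytes (10xx) also map to 1 so malformed
// input is consumed one byte at a time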
const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
uint8_t highbits = static_cast<uint8_t>(src) >> 4;
return lookup[highbits];
}
struct llama_v2_sp_symbol {
using index = int;
index prev;
index next;
const char * text;
size_t n;
};
static_assert(std::is_trivially_copyable<llama_v2_sp_symbol>::value, "llama_v2_sp_symbol is not trivially copyable");
struct llama_v2_sp_bigram {
struct comparator {
bool operator()(llama_v2_sp_bigram & l, llama_v2_sp_bigram & r) {
return (l.score < r.score) || (l.score == r.score && l.left > r.left);
}
};
using queue_storage = std::vector<llama_v2_sp_bigram>;
using queue = std::priority_queue<llama_v2_sp_bigram, queue_storage, comparator>;
llama_v2_sp_symbol::index left;
llama_v2_sp_symbol::index right;
float score;
size_t size;
};
// original implementation:
// https://github.com/ggerganov/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4
struct llama_v2_tokenizer {
llama_v2_tokenizer(const llama_v2_vocab & vocab): vocab_(vocab) {}
void tokenize(const std::string & text, std::vector<llama_v2_vocab::id> & output) {
// split string into utf8 chars
int index = 0;
size_t offs = 0;
while (offs < text.size()) {
llama_v2_sp_symbol sym;
size_t char_len = std::min(text.size() - offs, utf8_len2(text[offs]));
sym.text = text.c_str() + offs;
sym.n = char_len;
offs += char_len;
sym.prev = index - 1;
sym.next = offs == text.size() ? -1 : index + 1;
index++;
symbols_.emplace_back(sym);
}
// seed the work queue with all possible 2-character tokens.
for (size_t i = 1; i < symbols_.size(); ++i) {
try_add_bigram(i - 1, i);
}
// keep substituting the highest frequency pairs for as long as we can.
while (!work_queue_.empty()) {
auto bigram = work_queue_.top();
work_queue_.pop();
auto & left_sym = symbols_[bigram.left];
auto & right_sym = symbols_[bigram.right];
// if one of the symbols already got merged, skip it.
if (left_sym.n == 0 || right_sym.n == 0 ||
left_sym.n + right_sym.n != bigram.size) {
continue;
}
// merge the right sym into the left one
left_sym.n += right_sym.n;
right_sym.n = 0;
//printf("left = '%*s' size = %zu\n", (int) left_sym.n, left_sym.text, bigram.size);
// remove the right sym from the chain
left_sym.next = right_sym.next;
if (right_sym.next >= 0) {
symbols_[right_sym.next].prev = bigram.left;
}
// find more substitutions
try_add_bigram(left_sym.prev, bigram.left);
try_add_bigram(bigram.left, left_sym.next);
}
for (int i = 0; i != -1; i = symbols_[i].next) {
auto & symbol = symbols_[i];
auto token = vocab_.token_to_id.find(std::string(symbol.text, symbol.n));
if (token == vocab_.token_to_id.end()) {
// output any symbols that did not form tokens as bytes.
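// byte fallback: raw byte b is emitted as token id b + 3, since the first three ids of the
// LLaMA SentencePiece vocab appear to be reserved for the special <unk>/<s>/</s> tokens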
for (int j = 0; j < (int) symbol.n; ++j) {
llama_v2_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
output.push_back(token_id);
}
} else {
output.push_back((*token).second);
}
}
}
private:
void try_add_bigram(int left, int right) {
if (left == -1 || right == -1) {
return;
}
const std::string text = std::string(symbols_[left].text, symbols_[left].n + symbols_[right].n);
auto token = vocab_.token_to_id.find(text);
if (token == vocab_.token_to_id.end()) {
return;
}
if (static_cast<size_t>((*token).second) >= vocab_.id_to_token.size()) {
return;
}
const auto &tok_score = vocab_.id_to_token[(*token).second];
llama_v2_sp_bigram bigram;
bigram.left = left;
bigram.right = right;
bigram.score = tok_score.score;
bigram.size = text.size();
work_queue_.push(bigram);
}
const llama_v2_vocab & vocab_;
std::vector<llama_v2_sp_symbol> symbols_;
llama_v2_sp_bigram::queue work_queue_;
};
static std::vector<llama_v2_vocab::id> llama_v2_tokenize(const llama_v2_vocab & vocab, const std::string & text, bool bos) {
llama_v2_tokenizer tokenizer(vocab);
std::vector<llama_v2_vocab::id> output;
if (text.empty()) {
return output;
}
if (bos) {
output.push_back(llama_v2_token_bos());
}
tokenizer.tokenize(text, output);
return output;
}
//
// sampling
//
void llama_v2_sample_softmax(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates) {
assert(candidates->size > 0);
const int64_t t_start_sample_us = ggml_v2_time_us();
// Sort the logits in descending order
if (!candidates->sorted) {
std::sort(candidates->data, candidates->data + candidates->size, [](const llama_v2_token_data & a, const llama_v2_token_data & b) {
return a.logit > b.logit;
});
candidates->sorted = true;
}
float max_l = candidates->data[0].logit;
float cum_sum = 0.0f;
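// subtracting the max logit before expf() keeps the exponentials in a safe numeric range;
// softmax is invariant to this shift, so the resulting probabilities are unchanged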
for (size_t i = 0; i < candidates->size; ++i) {
float p = expf(candidates->data[i].logit - max_l);
candidates->data[i].p = p;
cum_sum += p;
}
for (size_t i = 0; i < candidates->size; ++i) {
candidates->data[i].p /= cum_sum;
}
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_top_k(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, int k, size_t min_keep) {
const int64_t t_start_sample_us = ggml_v2_time_us();
k = std::max(k, (int) min_keep);
k = std::min(k, (int) candidates->size);
// Sort scores in descending order
if (!candidates->sorted) {
auto comp = [](const llama_v2_token_data & a, const llama_v2_token_data & b) {
return a.logit > b.logit;
};
if (k == (int) candidates->size) {
std::sort(candidates->data, candidates->data + candidates->size, comp);
} else {
std::partial_sort(candidates->data, candidates->data + k, candidates->data + candidates->size, comp);
}
candidates->sorted = true;
}
candidates->size = k;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_top_p(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, float p, size_t min_keep) {
if (p >= 1.0f) {
return;
}
const int64_t t_start_sample_us = ggml_v2_time_us();
llama_v2_sample_softmax(ctx, candidates);
// Compute the cumulative probabilities
float cum_sum = 0.0f;
size_t last_idx = candidates->size;
for (size_t i = 0; i < candidates->size; ++i) {
cum_sum += candidates->data[i].p;
// Stop once the running sum exceeds p, provided at least min_keep tokens have been kept
if (cum_sum > p && i >= min_keep) {
last_idx = i;
break;
}
}
// Resize the output vector to keep only the top-p tokens
candidates->size = last_idx;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_tail_free(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, float z, size_t min_keep) {
if (z >= 1.0f || candidates->size <= 2) {
return;
}
const int64_t t_start_sample_us = ggml_v2_time_us();
llama_v2_sample_softmax(nullptr, candidates);
// Compute the first and second derivatives
std::vector<float> first_derivatives(candidates->size - 1);
std::vector<float> second_derivatives(candidates->size - 2);
for (size_t i = 0; i < first_derivatives.size(); ++i) {
first_derivatives[i] = candidates->data[i].p - candidates->data[i + 1].p;
}
for (size_t i = 0; i < second_derivatives.size(); ++i) {
second_derivatives[i] = first_derivatives[i] - first_derivatives[i + 1];
}
// Calculate absolute value of second derivatives
for (size_t i = 0; i < second_derivatives.size(); ++i) {
second_derivatives[i] = abs(second_derivatives[i]);
}
// Normalize the second derivatives
float second_derivatives_sum = std::accumulate(second_derivatives.begin(), second_derivatives.end(), 0.0f);
for (float & value : second_derivatives) {
value /= second_derivatives_sum;
}
float cum_sum = 0.0f;
size_t last_idx = candidates->size;
for (size_t i = 0; i < second_derivatives.size(); ++i) {
cum_sum += second_derivatives[i];
// Stop once the running sum exceeds z, provided at least min_keep tokens have been kept
if (cum_sum > z && i >= min_keep) {
last_idx = i;
break;
}
}
// Resize the output vector to keep only the tokens above the tail location
candidates->size = last_idx;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_typical(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, float p, size_t min_keep) {
// Reference implementation:
// https://github.com/huggingface/transformers/compare/main...cimeister:typical-sampling:typical-pr
if (p >= 1.0f) {
return;
}
const int64_t t_start_sample_us = ggml_v2_time_us();
// Compute the softmax of logits and calculate entropy
llama_v2_sample_softmax(nullptr, candidates);
float entropy = 0.0f;
for (size_t i = 0; i < candidates->size; ++i) {
entropy += -candidates->data[i].p * logf(candidates->data[i].p);
}
// Compute the absolute difference between negative log probability and entropy for each candidate
std::vector<float> shifted_scores;
for (size_t i = 0; i < candidates->size; ++i) {
float shifted_score = fabsf(-logf(candidates->data[i].p) - entropy);
shifted_scores.push_back(shifted_score);
}
// Sort tokens based on the shifted_scores and their corresponding indices
std::vector<size_t> indices(candidates->size);
std::iota(indices.begin(), indices.end(), 0);
std::sort(indices.begin(), indices.end(), [&](size_t a, size_t b) {
return shifted_scores[a] < shifted_scores[b];
});
// Compute the cumulative probabilities
float cum_sum = 0.0f;
size_t last_idx = indices.size();
for (size_t i = 0; i < indices.size(); ++i) {
size_t idx = indices[i];
cum_sum += candidates->data[idx].p;
// Stop once the cumulative probability exceeds p, provided at least min_keep tokens have been kept
if (cum_sum > p && i >= min_keep - 1) {
last_idx = i + 1;
break;
}
}
// Resize the output vector to keep only the locally typical tokens
std::vector<llama_v2_token_data> new_candidates;
for (size_t i = 0; i < last_idx; ++i) {
size_t idx = indices[i];
new_candidates.push_back(candidates->data[idx]);
}
// Replace the data in candidates with the new_candidates data
std::copy(new_candidates.begin(), new_candidates.end(), candidates->data);
candidates->size = new_candidates.size();
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_temperature(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates_p, float temp) {
const int64_t t_start_sample_us = ggml_v2_time_us();
for (size_t i = 0; i < candidates_p->size; ++i) {
candidates_p->data[i].logit /= temp;
}
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_repetition_penalty(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, const llama_v2_token * last_tokens, size_t last_tokens_size, float penalty) {
if (last_tokens_size == 0 || penalty == 1.0f) {
return;
}
const int64_t t_start_sample_us = ggml_v2_time_us();
for (size_t i = 0; i < candidates->size; ++i) {
const auto * token_iter = std::find(last_tokens, last_tokens + last_tokens_size, candidates->data[i].id);
if (token_iter == last_tokens + last_tokens_size) {
continue;
}
// The paper that introduced this technique only divides by the penalty, but that would make tokens with negative logits more likely, which is obviously wrong.
// The common fix is to multiply negative logits by the penalty instead of dividing.
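// e.g. with penalty = 1.2 a logit of 2.0 becomes 2.0/1.2 ~= 1.67 and a logit of -2.0 becomes
// -2.0*1.2 = -2.4, so a repeated token is pushed down in both cases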
if (candidates->data[i].logit <= 0) {
candidates->data[i].logit *= penalty;
} else {
candidates->data[i].logit /= penalty;
}
}
candidates->sorted = false;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
void llama_v2_sample_frequency_and_presence_penalties(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, const llama_v2_token * last_tokens_p, size_t last_tokens_size, float alpha_frequency, float alpha_presence) {
if (last_tokens_size == 0 || (alpha_frequency == 0.0f && alpha_presence == 0.0f)) {
return;
}
const int64_t t_start_sample_us = ggml_v2_time_us();
// Create a frequency map to count occurrences of each token in last_tokens
std::unordered_map<llama_v2_token, int> token_count;
for (size_t i = 0; i < last_tokens_size; ++i) {
token_count[last_tokens_p[i]]++;
}
// Apply frequency and presence penalties to the candidates
for (size_t i = 0; i < candidates->size; ++i) {
auto token_iter = token_count.find(candidates->data[i].id);
if (token_iter == token_count.end()) {
continue;
}
int count = token_iter->second;
candidates->data[i].logit -= float(count) * alpha_frequency + float(count > 0) * alpha_presence;
}
candidates->sorted = false;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
}
llama_v2_token llama_v2_sample_token_mirostat(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, float tau, float eta, int m, float * mu) {
assert(ctx);
auto N = float(llama_v2_n_vocab(ctx));
int64_t t_start_sample_us;
t_start_sample_us = ggml_v2_time_us();
llama_v2_sample_softmax(nullptr, candidates);
// Estimate s_hat using the most probable m tokens
float s_hat = 0.0;
float sum_ti_bi = 0.0;
float sum_ti_sq = 0.0;
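// least-squares fit of the Zipf exponent over the top m tokens: with t_i = log((i+2)/(i+1))
// and b_i = log(p_i / p_{i+1}), s_hat = sum(t_i*b_i) / sum(t_i^2), as used by the Mirostat
// algorithm to relate rank to probability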
for (size_t i = 0; i < size_t(m - 1) && i < candidates->size - 1; ++i) {
float t_i = logf(float(i + 2) / float(i + 1));
float b_i = logf(candidates->data[i].p / candidates->data[i + 1].p);
sum_ti_bi += t_i * b_i;
sum_ti_sq += t_i * t_i;
}
s_hat = sum_ti_bi / sum_ti_sq;
// Compute k from the estimated s_hat and target surprise value
float epsilon_hat = s_hat - 1;
float k = powf((epsilon_hat * powf(2, *mu)) / (1 - powf(N, -epsilon_hat)), 1 / s_hat);
// Sample the next word X using top-k sampling
llama_v2_sample_top_k(nullptr, candidates, int(k), 1);
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
llama_v2_token X = llama_v2_sample_token(ctx, candidates);
t_start_sample_us = ggml_v2_time_us();
// Compute error as the difference between observed surprise and target surprise value
size_t X_idx = std::distance(candidates->data, std::find_if(candidates->data, candidates->data + candidates->size, [&](const llama_v2_token_data & candidate) {
return candidate.id == X;
}));
float observed_surprise = -log2f(candidates->data[X_idx].p);
float e = observed_surprise - tau;
// Update mu using the learning rate and error
*mu = *mu - eta * e;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
ctx->n_sample++;
}
return X;
}
llama_v2_token llama_v2_sample_token_mirostat_v2(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates, float tau, float eta, float * mu) {
assert(ctx);
int64_t t_start_sample_us;
t_start_sample_us = ggml_v2_time_us();
llama_v2_sample_softmax(ctx, candidates);
// Truncate the words with surprise values greater than mu
candidates->size = std::distance(candidates->data, std::find_if(candidates->data, candidates->data + candidates->size, [&](const llama_v2_token_data & candidate) {
return -log2f(candidate.p) > *mu;
}));
// Normalize the probabilities of the remaining words
llama_v2_sample_softmax(ctx, candidates);
// Sample the next word X from the remaining words
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
llama_v2_token X = llama_v2_sample_token(ctx, candidates);
t_start_sample_us = ggml_v2_time_us();
// Compute error as the difference between observed surprise and target surprise value
size_t X_idx = std::distance(candidates->data, std::find_if(candidates->data, candidates->data + candidates->size, [&](const llama_v2_token_data & candidate) {
return candidate.id == X;
}));
float observed_surprise = -log2f(candidates->data[X_idx].p);
float e = observed_surprise - tau;
// Update mu using the learning rate and error
*mu = *mu - eta * e;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
}
return X;
}
llama_v2_token llama_v2_sample_token_greedy(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates) {
const int64_t t_start_sample_us = ggml_v2_time_us();
// Find max element
auto * max_iter = std::max_element(candidates->data, candidates->data + candidates->size, [](const llama_v2_token_data & a, const llama_v2_token_data & b) {
return a.logit < b.logit;
});
llama_v2_token result = max_iter->id;
if (ctx) {
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
ctx->n_sample++;
}
return result;
}
llama_v2_token llama_v2_sample_token(struct llama_v2_context * ctx, llama_v2_token_data_array * candidates) {
assert(ctx);
const int64_t t_start_sample_us = ggml_v2_time_us();
llama_v2_sample_softmax(nullptr, candidates);
std::vector<float> probs;
probs.reserve(candidates->size);
for (size_t i = 0; i < candidates->size; ++i) {
probs.push_back(candidates->data[i].p);
}
std::discrete_distribution<> dist(probs.begin(), probs.end());
auto & rng = ctx->rng;
int idx = dist(rng);
llama_v2_token result = candidates->data[idx].id;
ctx->t_sample_us += ggml_v2_time_us() - t_start_sample_us;
ctx->n_sample++;
return result;
}
//
// quantization
//
static void llama_v2_model_quantize_internal(const std::string & fname_inp, const std::string & fname_out, enum llama_v2_ftype ftype, int nthread) {
ggml_v2_type quantized_type;
switch (ftype) {
case LLAMA_V2_FTYPE_MOSTLY_Q4_0: quantized_type = GGML_V2_TYPE_Q4_0; break;
case LLAMA_V2_FTYPE_MOSTLY_Q4_1: quantized_type = GGML_V2_TYPE_Q4_1; break;
case LLAMA_V2_FTYPE_MOSTLY_Q4_2: quantized_type = GGML_V2_TYPE_Q4_2; break;
case LLAMA_V2_FTYPE_MOSTLY_Q4_3: quantized_type = GGML_V2_TYPE_Q4_3; break;
case LLAMA_V2_FTYPE_MOSTLY_Q5_0: quantized_type = GGML_V2_TYPE_Q5_0; break;
case LLAMA_V2_FTYPE_MOSTLY_Q5_1: quantized_type = GGML_V2_TYPE_Q5_1; break;
case LLAMA_V2_FTYPE_MOSTLY_Q8_0: quantized_type = GGML_V2_TYPE_Q8_0; break;
default: throw format_old("invalid output file type %d\n", ftype);
};
if (nthread <= 0) {
nthread = std::thread::hardware_concurrency();
}
std::unique_ptr<llama_v2_model_loader> model_loader(new llama_v2_model_loader(fname_inp, /*use_mmap*/ false,
/*vocab_only*/ false));
llama_v2_file_saver file_saver(fname_out.c_str(), model_loader->file_loaders.at(0).get(), ftype);
size_t total_size_org = 0;
size_t total_size_new = 0;
std::vector<int64_t> hist_all(1 << 4, 0);
std::vector<std::thread> workers;
std::mutex mutex;
size_t idx = 0;
for (llama_v2_load_tensor & tensor : model_loader->tensors_map.tensors) {
llama_v2_buffer read_data;
read_data.resize(tensor.size);
tensor.data = read_data.addr;
model_loader->load_data_for(tensor);
printf("[%4zu/%4zu] %36s - %16s, type = %6s, ",
++idx, model_loader->tensors_map.tensors.size(),
tensor.name.c_str(), llama_v2_format_tensor_shape(tensor.ne).c_str(),
ggml_v2_type_name(tensor.type));
// This used to be a regex, but <regex> has an extreme cost to compile times.
bool quantize = tensor.name.rfind("weight") == tensor.name.size() - 6; // ends with 'weight'?
// quantize only 2D tensors
quantize &= (tensor.ne.size() == 2);
// uncomment this to keep the output layer in FP16
//if (tensor.name == "output.weight") {
// quantize = false;
//}
enum ggml_v2_type new_type;
void * new_data;
size_t new_size;
llama_v2_buffer work;
if (!quantize) {
new_type = tensor.type;
new_data = tensor.data;
new_size = tensor.size;
printf("size = %8.3f MB\n", tensor.size/1024.0/1024.0);
} else {
new_type = quantized_type;
float * f32_data;
size_t nelements = tensor.ne.at(0) * tensor.ne.at(1);
llama_v2_buffer f32_conv_buf;
if (tensor.type == GGML_V2_TYPE_F32) {
f32_data = (float *) tensor.data;
} else if (tensor.type == GGML_V2_TYPE_F16) {
f32_conv_buf.resize(nelements * sizeof(float));
f32_data = (float *) f32_conv_buf.addr;
const auto * f16_data = (const ggml_v2_fp16_t *) tensor.data;
for (size_t i = 0; i < nelements; i++) {
f32_data[i] = ggml_v2_fp16_to_fp32(f16_data[i]);
}
} else {
throw format_old("type %s unsupported for integer quantization", ggml_v2_type_name(tensor.type));
}
printf("quantizing .. ");
fflush(stdout);
work.resize(nelements * 4); // upper bound on size
new_data = work.addr;
std::vector<int64_t> hist_cur(1 << 4, 0);
int chunk_size = 32 * 512;
const int nchunk = (nelements + chunk_size - 1)/chunk_size;
const int nthread_use = nthread > 1 ? std::max(1, std::min(nthread, nchunk)) : 1;
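// work is handed out in chunks of 32*512 = 16384 elements: each worker pulls the next chunk
// offset from the shared counter under the mutex, quantizes it, and accumulates a private
// histogram and byte count that are folded into hist_cur/new_size once the counter is
// exhausted, keeping lock contention low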
if (nthread_use < 2) {
new_size = ggml_v2_quantize_chunk(new_type, f32_data, new_data, 0, nelements, hist_cur.data());
} else {
size_t counter = 0;
new_size = 0;
auto compute = [&mutex, &counter, &hist_cur, &new_size, new_type, f32_data, new_data, nelements, chunk_size] () {
std::vector<int64_t> local_hist;
size_t local_size = 0;
while (true) {
std::unique_lock<std::mutex> lock(mutex);
size_t first = counter; counter += chunk_size;
if (first >= nelements) {
if (!local_hist.empty()) {
for (int j=0; j<int(local_hist.size()); ++j) {
hist_cur[j] += local_hist[j];
}
new_size += local_size;
}
break;
}
lock.unlock();
size_t last = std::min(nelements, first + chunk_size);
if (local_hist.empty()) {
local_hist.resize(hist_cur.size(), 0);
}
local_size += ggml_v2_quantize_chunk(new_type, f32_data, new_data, first, last - first, local_hist.data());
}
};
if ((int) workers.size() < nthread_use - 1) {
workers.resize(nthread_use - 1);
}
for (int it = 0; it < nthread_use - 1; ++it) {
workers[it] = std::thread(compute);
}
compute();
for (int it = 0; it < nthread_use - 1; ++it) {
workers[it].join();
}
}
printf("size = %8.2f MB -> %8.2f MB | hist: ", tensor.size/1024.0/1024.0, new_size/1024.0/1024.0);
for (size_t i = 0; i < hist_cur.size(); i++) {
hist_all[i] += hist_cur[i];
}
for (size_t i = 0; i < hist_cur.size(); i++) {
printf("%5.3f ", hist_cur[i] / float(nelements));
}
printf("\n");
}
total_size_org += tensor.size;
total_size_new += new_size;
file_saver.write_tensor(tensor, new_type, new_data, new_size);
}
printf("%s: model size = %8.2f MB\n", __func__, total_size_org/1024.0/1024.0);
printf("%s: quant size = %8.2f MB\n", __func__, total_size_new/1024.0/1024.0);
{
int64_t sum_all = 0;
for (size_t i = 0; i < hist_all.size(); i++) {
sum_all += hist_all[i];
}
printf("%s: hist: ", __func__);
for (size_t i = 0; i < hist_all.size(); i++) {
printf("%5.3f ", hist_all[i] / float(sum_all));
}
printf("\n");
}
}
//
// interface implementation
//
struct llama_v2_context * llama_v2_init_from_file(
const char * path_model,
struct llama_v2_context_params params) {
ggml_v2_time_init();
llama_v2_context * ctx = new llama_v2_context;
if (params.seed < 0 || params.seed==0xFFFFFFFF) {
params.seed = time(NULL);
}
unsigned cur_percentage = 0;
if (params.progress_callback == NULL) {
params.progress_callback_user_data = &cur_percentage;
params.progress_callback = [](float progress, void * ctx) {
unsigned * cur_percentage_p = (unsigned *) ctx;
unsigned percentage = (unsigned) (100 * progress);
while (percentage > *cur_percentage_p) {
++*cur_percentage_p;
fprintf(stderr, ".");
fflush(stderr);
if (percentage >= 100) {
fprintf(stderr, "\n");
}
}
};
}
ctx->rng = std::mt19937(params.seed);
ctx->logits_all = params.logits_all;
ggml_v2_type memory_type = params.f16_kv ? GGML_V2_TYPE_F16 : GGML_V2_TYPE_F32;
if (!llama_v2_model_load(path_model, *ctx, params.n_ctx, params.n_gpu_layers, memory_type,
params.use_mmap, params.use_mlock, params.vocab_only,
params.progress_callback, params.progress_callback_user_data)) {
fprintf(stderr, "%s: failed to load model\n", __func__);
llama_v2_free(ctx);
return nullptr;
}
// reserve memory for context buffers
if (!params.vocab_only) {
if (!kv_cache_init(ctx->model.hparams, ctx->model.kv_self, memory_type, ctx->model.hparams.n_ctx)) {
fprintf(stderr, "%s: kv_cache_init() failed for self-attention cache\n", __func__);
llama_v2_free(ctx);
return nullptr;
}
{
const size_t memory_size = ggml_v2_nbytes(ctx->model.kv_self.k) + ggml_v2_nbytes(ctx->model.kv_self.v);
fprintf(stderr, "%s: kv self size = %7.2f MB\n", __func__, memory_size / 1024.0 / 1024.0);
}
const auto & hparams = ctx->model.hparams;
// resized during inference
if (params.logits_all) {
ctx->logits.reserve(hparams.n_ctx*hparams.n_vocab);
} else {
ctx->logits.reserve(hparams.n_vocab);
}
if (params.embedding){
ctx->embedding.resize(hparams.n_embd);
}
ctx->buf_compute.resize(MEM_REQ_EVAL_2().at(ctx->model.type));
ctx->buf_scratch[0].resize(MEM_REQ_SCRATCH0_2().at(ctx->model.type));
ctx->buf_scratch[1].resize(MEM_REQ_SCRATCH1_2().at(ctx->model.type));
}
return ctx;
}
void llama_v2_free(struct llama_v2_context * ctx) {
delete ctx;
}
int llama_v2_model_quantize(
const char * fname_inp,
const char * fname_out,
enum llama_v2_ftype ftype,
int nthread) {
try {
llama_v2_model_quantize_internal(fname_inp, fname_out, ftype, nthread);
return 0;
} catch (const std::string & err) {
fprintf(stderr, "%s: failed to quantize: %s\n", __func__, err.c_str());
return 1;
}
}
int llama_v2_apply_lora_from_file_internal(struct llama_v2_context * ctx, const char * path_lora, const char * path_base_model, int n_threads) {
fprintf(stderr, "%s: applying lora adapter from '%s' - please wait ...\n", __func__, path_lora);
auto & model = ctx->model;
const int64_t t_start_lora_us = ggml_v2_time_us();
auto fin = std::ifstream(path_lora, std::ios::binary);
if (!fin) {
fprintf(stderr, "%s: failed to open '%s'\n", __func__, path_lora);
return 1;
}
// verify magic and version
{
uint32_t magic;
fin.read((char *) &magic, sizeof(magic));
if (magic != 'ggla') {
fprintf(stderr, "%s: bad file magic\n", __func__);
return 1;
}
uint32_t format_version;
fin.read((char *) &format_version, sizeof(format_version));
if (format_version != 1) {
fprintf(stderr, "%s: unsupported file version\n", __func__ );
return 1;
}
}
int32_t lora_r;
int32_t lora_alpha;
fin.read((char *) &lora_r, sizeof(lora_r));
fin.read((char *) &lora_alpha, sizeof(lora_alpha));
float scaling = (float)lora_alpha / (float)lora_r;
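// standard LoRA update applied below: w <- w + scaling * B*A, where A and B are the low-rank
// .loraA/.loraB factors and scaling = alpha/r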
fprintf(stderr, "%s: r = %d, alpha = %d, scaling = %.2f\n", __func__, lora_r, lora_alpha, scaling);
// create a temporary ggml context to store the lora tensors
// todo: calculate size from biggest possible tensor
std::vector<uint8_t> lora_buf(1024ull * 1024ull * 1024ull);
struct ggml_v2_init_params params;
params.mem_size = lora_buf.size();
params.mem_buffer = lora_buf.data();
params.no_alloc = false;
ggml_v2_context * lora_ctx = ggml_v2_init(params);
std::unordered_map<std::string, struct ggml_v2_tensor *> lora_tensors;
// create a name -> tensor map of the model to accelerate lookups
std::unordered_map<std::string, struct ggml_v2_tensor*> model_tensors;
for (auto & kv: model.tensors_by_name) {
model_tensors.insert(kv);
}
// load base model
std::unique_ptr<llama_v2_model_loader> model_loader;
ggml_v2_context * base_ctx = NULL;
llama_v2_buffer base_buf;
if (path_base_model) {
fprintf(stderr, "%s: loading base model from '%s'\n", __func__, path_base_model);
model_loader.reset(new llama_v2_model_loader(path_base_model, /*use_mmap*/ true, /*vocab_only*/ false));
size_t ctx_size;
size_t mmapped_size;
model_loader->calc_sizes(&ctx_size, &mmapped_size);
base_buf.resize(ctx_size);
ggml_v2_init_params base_params;
base_params.mem_size = base_buf.size;
base_params.mem_buffer = base_buf.addr;
base_params.no_alloc = model_loader->use_mmap;
base_ctx = ggml_v2_init(base_params);
model_loader->ggml_v2_ctx = base_ctx;
// maybe this should be in llama_v2_model_loader
if (model_loader->use_mmap) {
model_loader->mapping.reset(new llama_v2_mmap(&model_loader->file_loaders.at(0)->file, /* prefetch */ false));
}
}
// read tensors and apply
bool warned = false;
int n_tensors = 0;
while (true) {
int32_t n_dims;
int32_t length;
int32_t ftype;
fin.read(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
fin.read(reinterpret_cast<char *>(&length), sizeof(length));
fin.read(reinterpret_cast<char *>(&ftype), sizeof(ftype));
if (fin.eof()) {
break;
}
int32_t ne[2] = { 1, 1 };
for (int i = 0; i < n_dims; ++i) {
fin.read(reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
}
std::string name;
{
char buf[1024];
fin.read(buf, length);
name = std::string(buf, length);
}
// check for lora suffix and get the type of tensor
const std::string lora_suffix = ".lora";
size_t pos = name.rfind(lora_suffix);
if (pos == std::string::npos) {
fprintf(stderr, "%s: error: '%s' is not a lora tensor\n", __func__, name.c_str());
return 1;
}
std::string lora_type = name.substr(pos + lora_suffix.length());
std::string base_name = name;
base_name.erase(pos);
// fprintf(stderr, "%s: %s => %s (lora type %s) ", __func__, name.c_str(),base_name.c_str(), lora_type.c_str());
if (model_tensors.find(base_name) == model_tensors.end()) {
fprintf(stderr, "%s: unknown tensor '%s' in lora adapter\n", __func__, name.data());
return 1;
}
// create ggml tensor
ggml_v2_type wtype;
switch (ftype) {
case 0: wtype = GGML_V2_TYPE_F32; break;
case 1: wtype = GGML_V2_TYPE_F16; break;
default:
{
fprintf(stderr, "%s: invalid tensor data type '%d'\n",
__func__, ftype);
return 1;
}
}
ggml_v2_tensor* lora_tensor;
if (n_dims == 2) {
lora_tensor = ggml_v2_new_tensor_2d(lora_ctx, wtype, ne[0], ne[1]);
}
else {
fprintf(stderr, "%s: unsupported tensor dimension %d\n", __func__, n_dims);
return 1;
}
// load tensor data
size_t offset = fin.tellg();
size_t tensor_data_size = ggml_v2_nbytes(lora_tensor);
offset = (offset + 31) & -32;
fin.seekg(offset);
fin.read((char*)lora_tensor->data, tensor_data_size);
lora_tensors[name] = lora_tensor;
// check if we have both A and B tensors and apply
if (lora_tensors.find(base_name + ".loraA") != lora_tensors.end() &&
lora_tensors.find(base_name + ".loraB") != lora_tensors.end()) {
ggml_v2_tensor * dest_t = model_tensors[base_name];
ggml_v2_tensor * base_t;
if (model_loader) {
// load from base model
if (model_loader->tensors_map.name_to_idx.find(base_name) == model_loader->tensors_map.name_to_idx.end()) {
fprintf(stderr, "%s: error: tensor '%s' not found in base model\n", __func__, base_name.c_str());
return 1;
}
size_t idx = model_loader->tensors_map.name_to_idx[base_name];
llama_v2_load_tensor & lt = model_loader->tensors_map.tensors[idx];
base_t = model_loader->get_tensor(base_name, { (uint32_t)dest_t->ne[0], (uint32_t)dest_t->ne[1] });
lt.data = (uint8_t *) lt.ggml_v2_tensor->data;
model_loader->load_data_for(lt);
lt.ggml_v2_tensor->data = lt.data;
}
else {
base_t = dest_t;
}
if (ggml_v2_is_quantized(base_t->type)) {
if (!warned) {
fprintf(stderr, "%s: warning: using a lora adapter with a quantized model may result in poor quality, "
"use a f16 or f32 base model with --lora-base\n", __func__);
warned = true;
}
}
ggml_v2_tensor * loraA = lora_tensors[base_name + ".loraA"];
ggml_v2_tensor * loraB = lora_tensors[base_name + ".loraB"];
if (base_t->ne[0] != loraA->ne[1] || base_t->ne[1] != loraB->ne[1]) {
fprintf(stderr, "%s: incompatible tensor dimensions (%" PRId64 " and %" PRId64 ");"
" are you sure that this adapter is for this model?\n", __func__, base_t->ne[0], loraA->ne[1]);
return 1;
}
// w = w + BA*s
ggml_v2_tensor * BA = ggml_v2_mul_mat(lora_ctx, loraA, loraB);
if (scaling != 1.0f) {
ggml_v2_tensor * scale_tensor = ggml_v2_new_f32(lora_ctx, scaling);
BA = ggml_v2_scale_inplace(lora_ctx, BA, scale_tensor);
}
ggml_v2_tensor * r;
if (base_t == dest_t) {
r = ggml_v2_add_inplace(lora_ctx, dest_t, BA);
}
else {
r = ggml_v2_add(lora_ctx, base_t, BA);
r = ggml_v2_cpy(lora_ctx, r, dest_t);
}
struct ggml_v2_cgraph gf = ggml_v2_build_forward(r);
gf.n_threads = n_threads;
ggml_v2_graph_compute(lora_ctx, &gf);
// we won't need these tensors again; reset the context to save memory
ggml_v2_free(lora_ctx);
lora_ctx = ggml_v2_init(params);
lora_tensors.clear();
n_tensors++;
if (n_tensors % 4 == 0) {
fprintf(stderr, ".");
}
}
}
// TODO: this should be in a destructor; as written it will leak on failure
ggml_v2_free(lora_ctx);
if (base_ctx) {
ggml_v2_free(base_ctx);
}
const int64_t t_lora_us = ggml_v2_time_us() - t_start_lora_us;
fprintf(stderr, " done (%.2f ms)\n", t_lora_us / 1000.0);
return 0;
}
int llama_v2_apply_lora_from_file(struct llama_v2_context * ctx, const char * path_lora, const char * path_base_model, int n_threads) {
try {
return llama_v2_apply_lora_from_file_internal(ctx, path_lora, path_base_model, n_threads);
} catch (const std::string & err) {
fprintf(stderr, "%s: failed to apply lora adapter: %s\n", __func__, err.c_str());
return 1;
}
}
int llama_v2_get_kv_cache_token_count(const struct llama_v2_context * ctx) {
return ctx->model.kv_self.n;
}
#define LLAMA_V2_MAX_RNG_STATE (64*1024)
void llama_v2_set_rng_seed(struct llama_v2_context * ctx, int seed) {
if (seed < 0 || seed==0xFFFFFFFF) {
seed = time(NULL);
}
ctx->rng.seed(seed);
}
// Returns the *maximum* size of the state
size_t llama_v2_get_state_size(const struct llama_v2_context * ctx) {
// we don't know the size of the rng state until we actually serialize it, so reserve more than enough memory for its serialized state.
// for reference, std::mt19937(1337) serializes to 6701 bytes.
const size_t s_rng_size = sizeof(size_t);
const size_t s_rng = LLAMA_V2_MAX_RNG_STATE;
const size_t s_logits_capacity = sizeof(size_t);
const size_t s_logits_size = sizeof(size_t);
const size_t s_logits = ctx->logits.capacity() * sizeof(float);
const size_t s_embedding_size = sizeof(size_t);
const size_t s_embedding = ctx->embedding.size() * sizeof(float);
const size_t s_kv_size = sizeof(size_t);
const size_t s_kv_ntok = sizeof(int);
const size_t s_kv = ctx->model.kv_self.buf.size;
const size_t s_total = (
+ s_rng_size
+ s_rng
+ s_logits_capacity
+ s_logits_size
+ s_logits
+ s_embedding_size
+ s_embedding
+ s_kv_size
+ s_kv_ntok
+ s_kv
);
return s_total;
}
// Copies the state to the specified destination address
size_t llama_v2_copy_state_data(struct llama_v2_context * ctx, uint8_t * dst) {
uint8_t * out = dst;
// copy rng
{
std::stringstream rng_ss;
rng_ss << ctx->rng;
const size_t rng_size = rng_ss.str().size();
char rng_buf[LLAMA_V2_MAX_RNG_STATE];
memset(&rng_buf[0], 0, LLAMA_V2_MAX_RNG_STATE);
memcpy(&rng_buf[0], rng_ss.str().data(), rng_ss.str().size());
memcpy(out, &rng_size, sizeof(rng_size)); out += sizeof(rng_size);
memcpy(out, &rng_buf[0], LLAMA_V2_MAX_RNG_STATE); out += LLAMA_V2_MAX_RNG_STATE;
}
// copy logits
{
const size_t logits_cap = ctx->logits.capacity();
const size_t logits_size = ctx->logits.size();
memcpy(out, &logits_cap, sizeof(logits_cap)); out += sizeof(logits_cap);
memcpy(out, &logits_size, sizeof(logits_size)); out += sizeof(logits_size);
if (logits_size) {
memcpy(out, ctx->logits.data(), logits_size * sizeof(float));
}
out += logits_cap * sizeof(float);
}
// copy embeddings
{
const size_t embedding_size = ctx->embedding.size();
memcpy(out, &embedding_size, sizeof(embedding_size)); out += sizeof(embedding_size);
if (embedding_size) {
memcpy(out, ctx->embedding.data(), embedding_size * sizeof(float));
out += embedding_size * sizeof(float);
}
}
// copy kv cache
{
const auto & kv_self = ctx->model.kv_self;
const auto & hparams = ctx->model.hparams;
const int n_layer = hparams.n_layer;
const int n_embd = hparams.n_embd;
const int n_ctx = hparams.n_ctx;
const size_t kv_size = kv_self.buf.size;
const int kv_ntok = llama_v2_get_kv_cache_token_count(ctx);
memcpy(out, &kv_size, sizeof(kv_size)); out += sizeof(kv_size);
memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);
if (kv_size) {
const size_t elt_size = ggml_v2_element_size(kv_self.k);
char buffer[4096];
ggml_v2_context * cpy_ctx = ggml_v2_init({ sizeof(buffer), buffer, /* no_alloc */ true });
ggml_v2_cgraph gf{};
gf.n_threads = 1;
ggml_v2_tensor * kout3d = ggml_v2_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
kout3d->data = out;
out += ggml_v2_nbytes(kout3d);
ggml_v2_tensor * vout3d = ggml_v2_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
vout3d->data = out;
out += ggml_v2_nbytes(vout3d);
ggml_v2_tensor * k3d = ggml_v2_view_3d(cpy_ctx, kv_self.k,
n_embd, kv_ntok, n_layer,
elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
ggml_v2_tensor * v3d = ggml_v2_view_3d(cpy_ctx, kv_self.v,
kv_ntok, n_embd, n_layer,
elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);
ggml_v2_build_forward_expand(&gf, ggml_v2_cpy(cpy_ctx, k3d, kout3d));
ggml_v2_build_forward_expand(&gf, ggml_v2_cpy(cpy_ctx, v3d, vout3d));
ggml_v2_graph_compute(cpy_ctx, &gf);
ggml_v2_free(cpy_ctx);
}
}
const size_t written = out - dst;
const size_t max_size = llama_v2_get_state_size(ctx);
LLAMA_V2_ASSERT(written <= max_size);
return written;
}
// Sets the state, reading from the specified source address
size_t llama_v2_set_state_data(struct llama_v2_context * ctx, const uint8_t * src) {
const uint8_t * inp = src;
// set rng
{
size_t rng_size;
char rng_buf[LLAMA_V2_MAX_RNG_STATE];
memcpy(&rng_size, inp, sizeof(rng_size)); inp += sizeof(rng_size);
memcpy(&rng_buf[0], inp, LLAMA_V2_MAX_RNG_STATE); inp += LLAMA_V2_MAX_RNG_STATE;
std::stringstream rng_ss;
rng_ss.str(std::string(&rng_buf[0], rng_size));
rng_ss >> ctx->rng;
LLAMA_V2_ASSERT(rng_ss.fail() == false);
}
// set logits
{
size_t logits_cap;
size_t logits_size;
memcpy(&logits_cap, inp, sizeof(logits_cap)); inp += sizeof(logits_cap);
memcpy(&logits_size, inp, sizeof(logits_size)); inp += sizeof(logits_size);
LLAMA_V2_ASSERT(ctx->logits.capacity() == logits_cap);
if (logits_size) {
ctx->logits.resize(logits_size);
memcpy(ctx->logits.data(), inp, logits_size * sizeof(float));
}
inp += logits_cap * sizeof(float);
}
// set embeddings
{
size_t embedding_size;
memcpy(&embedding_size, inp, sizeof(embedding_size)); inp += sizeof(embedding_size);
LLAMA_V2_ASSERT(ctx->embedding.capacity() == embedding_size);
if (embedding_size) {
memcpy(ctx->embedding.data(), inp, embedding_size * sizeof(float));
inp += embedding_size * sizeof(float);
}
}
// set kv cache
{
const auto & kv_self = ctx->model.kv_self;
const auto & hparams = ctx->model.hparams;
const int n_layer = hparams.n_layer;
const int n_embd = hparams.n_embd;
const int n_ctx = hparams.n_ctx;
size_t kv_size;
int kv_ntok;
memcpy(&kv_size, inp, sizeof(kv_size)); inp += sizeof(kv_size);
memcpy(&kv_ntok, inp, sizeof(kv_ntok)); inp += sizeof(kv_ntok);
if (kv_size) {
LLAMA_V2_ASSERT(kv_self.buf.size == kv_size);
const size_t elt_size = ggml_v2_element_size(kv_self.k);
char buffer[4096];
ggml_v2_context * cpy_ctx = ggml_v2_init({ sizeof(buffer), buffer, /* no_alloc */ true });
ggml_v2_cgraph gf{};
gf.n_threads = 1;
ggml_v2_tensor * kin3d = ggml_v2_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
kin3d->data = (void *) inp;
inp += ggml_v2_nbytes(kin3d);
ggml_v2_tensor * vin3d = ggml_v2_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
vin3d->data = (void *) inp;
inp += ggml_v2_nbytes(vin3d);
ggml_v2_tensor * k3d = ggml_v2_view_3d(cpy_ctx, kv_self.k,
n_embd, kv_ntok, n_layer,
elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
ggml_v2_tensor * v3d = ggml_v2_view_3d(cpy_ctx, kv_self.v,
kv_ntok, n_embd, n_layer,
elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);
ggml_v2_build_forward_expand(&gf, ggml_v2_cpy(cpy_ctx, kin3d, k3d));
ggml_v2_build_forward_expand(&gf, ggml_v2_cpy(cpy_ctx, vin3d, v3d));
ggml_v2_graph_compute(cpy_ctx, &gf);
ggml_v2_free(cpy_ctx);
}
ctx->model.kv_self.n = kv_ntok;
}
const size_t nread = inp - src;
const size_t max_size = llama_v2_get_state_size(ctx);
LLAMA_V2_ASSERT(nread <= max_size);
return nread;
}
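// Hedged example (illustrative only): the two functions above, together with
// llama_v2_get_state_size, give a simple in-memory snapshot/restore. The vector name is a
// placeholder; the copy returns how many bytes were actually used, which may be less than
// the reserved maximum, so persist only that prefix when writing to disk.
//
//     std::vector<uint8_t> state(llama_v2_get_state_size(ctx));
//     const size_t written = llama_v2_copy_state_data(ctx, state.data());
//     // ... later, on a context created from the same model with the same hparams ...
//     llama_v2_set_state_data(ctx, state.data());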
bool llama_v2_load_session_file(struct llama_v2_context * ctx, const char * path_session, llama_v2_token * tokens_out, size_t n_token_capacity, size_t * n_token_count_out) {
llama_v2_file file(path_session, "rb");
// sanity checks
{
const uint32_t magic = file.read_u32();
const uint32_t version = file.read_u32();
if (magic != LLAMA_V2_SESSION_MAGIC || version != LLAMA_V2_SESSION_VERSION) {
fprintf(stderr, "%s : unknown (magic, version) for session file: %08x, %08x\n", __func__, magic, version);
return false;
}
llama_v2_hparams session_hparams;
file.read_raw(&session_hparams, sizeof(llama_v2_hparams));
if (session_hparams != ctx->model.hparams) {
fprintf(stderr, "%s : model hparams didn't match from session file!\n", __func__);
return false;
}
}
// load the prompt
{
const uint32_t n_token_count = file.read_u32();
if (n_token_count > n_token_capacity) {
fprintf(stderr, "%s : token count in session file exceeded capacity! %u > %zu\n", __func__, n_token_count, n_token_capacity);
return false;
}
file.read_raw(tokens_out, sizeof(llama_v2_token) * n_token_count);
*n_token_count_out = n_token_count;
}
// restore the context state
{
const size_t n_state_size_cur = file.size - file.tell();
const size_t n_state_size_max = llama_v2_get_state_size(ctx);
if (n_state_size_cur > n_state_size_max) {
fprintf(stderr, "%s : the state size in session file is too big! max %zu, got %zu\n", __func__, n_state_size_max, n_state_size_cur);
return false;
}
std::vector<uint8_t> state_data(n_state_size_max);
file.read_raw(state_data.data(), n_state_size_cur);
llama_v2_set_state_data(ctx, state_data.data());
}
return true;
}
bool llama_v2_save_session_file(struct llama_v2_context * ctx, const char * path_session, const llama_v2_token * tokens, size_t n_token_count) {
llama_v2_file file(path_session, "wb");
file.write_u32(LLAMA_V2_SESSION_MAGIC);
file.write_u32(LLAMA_V2_SESSION_VERSION);
file.write_raw(&ctx->model.hparams, sizeof(llama_v2_hparams));
// save the prompt
file.write_u32((uint32_t) n_token_count);
file.write_raw(tokens, sizeof(llama_v2_token) * n_token_count);
// save the context state
{
const size_t n_state_size_max = llama_v2_get_state_size(ctx);
std::vector<uint8_t> state_data(n_state_size_max);
const size_t n_state_size_cur = llama_v2_copy_state_data(ctx, state_data.data());
file.write_raw(state_data.data(), n_state_size_cur);
}
return true;
}
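// On-disk session layout, as written above and checked by llama_v2_load_session_file:
//   u32 magic   (LLAMA_V2_SESSION_MAGIC)
//   u32 version (LLAMA_V2_SESSION_VERSION)
//   llama_v2_hparams as a raw struct (must equal the loaded model's hparams)
//   u32 n_token_count, followed by n_token_count llama_v2_token entries
//   context state blob as produced by llama_v2_copy_state_data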
int llama_v2_eval(
struct llama_v2_context * ctx,
const llama_v2_token * tokens,
int n_tokens,
int n_past,
int n_threads) {
if (!llama_v2_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
fprintf(stderr, "%s: failed to eval\n", __func__);
return 1;
}
// get a more accurate load time upon first eval
// TODO: fix this
if (!ctx->has_evaluated_once) {
ctx->t_load_us = ggml_v2_time_us() - ctx->t_start_us;
ctx->has_evaluated_once = true;
}
return 0;
}
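// Note: n_past is the number of tokens already present in the context's KV cache;
// the resulting logits are exposed afterwards through llama_v2_get_logits.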
int llama_v2_tokenize(
struct llama_v2_context * ctx,
const char * text,
llama_v2_token * tokens,
int n_max_tokens,
bool add_bos) {
auto res = llama_v2_tokenize(ctx->vocab, text, add_bos);
if (n_max_tokens < (int) res.size()) {
fprintf(stderr, "%s: too many tokens\n", __func__);
return -((int) res.size());
}
for (size_t i = 0; i < res.size(); i++) {
tokens[i] = res[i];
}
return res.size();
}
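// Hedged usage sketch (names are placeholders): on overflow the function returns the
// negative of the required token count, so a caller can size the buffer in two passes
// and then hand the tokens to llama_v2_eval.
//
//     std::string prompt = "...";
//     std::vector<llama_v2_token> toks(prompt.size() + 1); // n_tokens <= n_chars (+ bos)
//     int n = llama_v2_tokenize(ctx, prompt.c_str(), toks.data(), (int) toks.size(), true);
//     if (n < 0) {
//         toks.resize((size_t) -n);
//         n = llama_v2_tokenize(ctx, prompt.c_str(), toks.data(), (int) toks.size(), true);
//     }
//     toks.resize((size_t) n);
//     llama_v2_eval(ctx, toks.data(), (int) toks.size(), /*n_past*/ 0, /*n_threads*/ 4);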
int llama_v2_n_vocab(const struct llama_v2_context * ctx) {
return ctx->vocab.id_to_token.size();
}
int llama_v2_n_ctx(const struct llama_v2_context * ctx) {
return ctx->model.hparams.n_ctx;
}
int llama_v2_n_embd(const struct llama_v2_context * ctx) {
return ctx->model.hparams.n_embd;
}
float * llama_v2_get_logits(struct llama_v2_context * ctx) {
return ctx->logits.data();
}
float * llama_v2_get_embeddings(struct llama_v2_context * ctx) {
return ctx->embedding.data();
}
const char * llama_v2_token_to_str(const struct llama_v2_context * ctx, llama_v2_token token) {
if (token >= llama_v2_n_vocab(ctx)) {
return nullptr;
}
return ctx->vocab.id_to_token[token].tok.c_str();
}
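// Hard-coded special token ids of the original LLaMA SentencePiece vocabulary:
// 1 = BOS, 2 = EOS, 13 = the newline token.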
llama_v2_token llama_v2_token_bos() {
return 1;
}
llama_v2_token llama_v2_token_eos() {
return 2;
}
llama_v2_token llama_v2_token_nl() {
return 13;
}
void llama_v2_print_timings(struct llama_v2_context * ctx) {
const int64_t t_end_us = ggml_v2_time_us();
const int32_t n_sample = std::max(1, ctx->n_sample);
const int32_t n_eval = std::max(1, ctx->n_eval);
const int32_t n_p_eval = std::max(1, ctx->n_p_eval);
fprintf(stderr, "\n");
fprintf(stderr, "%s: load time = %8.2f ms\n", __func__, ctx->t_load_us / 1000.0);
fprintf(stderr, "%s: sample time = %8.2f ms / %5d runs (%8.2f ms per token)\n", __func__, 1e-3 * ctx->t_sample_us, n_sample, 1e-3 * ctx->t_sample_us / n_sample);
fprintf(stderr, "%s: prompt eval time = %8.2f ms / %5d tokens (%8.2f ms per token)\n", __func__, 1e-3 * ctx->t_p_eval_us, n_p_eval, 1e-3 * ctx->t_p_eval_us / n_p_eval);
fprintf(stderr, "%s: eval time = %8.2f ms / %5d runs (%8.2f ms per token)\n", __func__, 1e-3 * ctx->t_eval_us, n_eval, 1e-3 * ctx->t_eval_us / n_eval);
fprintf(stderr, "%s: total time = %8.2f ms\n", __func__, (t_end_us - ctx->t_start_us)/1000.0);
}
void llama_v2_reset_timings(struct llama_v2_context * ctx) {
ctx->t_start_us = ggml_v2_time_us();
ctx->t_sample_us = ctx->n_sample = 0;
ctx->t_eval_us = ctx->n_eval = 0;
ctx->t_p_eval_us = ctx->n_p_eval = 0;
}
const char * llama_v2_print_system_info(void) {
static std::string s;
s = "";
s += "AVX = " + std::to_string(ggml_v2_cpu_has_avx()) + " | ";
s += "AVX2 = " + std::to_string(ggml_v2_cpu_has_avx2()) + " | ";
s += "AVX512 = " + std::to_string(ggml_v2_cpu_has_avx512()) + " | ";
s += "AVX512_VBMI = " + std::to_string(ggml_v2_cpu_has_avx512_vbmi()) + " | ";
s += "AVX512_VNNI = " + std::to_string(ggml_v2_cpu_has_avx512_vnni()) + " | ";
s += "FMA = " + std::to_string(ggml_v2_cpu_has_fma()) + " | ";
s += "NEON = " + std::to_string(ggml_v2_cpu_has_neon()) + " | ";
s += "ARM_FMA = " + std::to_string(ggml_v2_cpu_has_arm_fma()) + " | ";
s += "F16C = " + std::to_string(ggml_v2_cpu_has_f16c()) + " | ";
s += "FP16_VA = " + std::to_string(ggml_v2_cpu_has_fp16_va()) + " | ";
s += "WASM_SIMD = " + std::to_string(ggml_v2_cpu_has_wasm_simd()) + " | ";
s += "BLAS = " + std::to_string(ggml_v2_cpu_has_blas()) + " | ";
s += "SSE3 = " + std::to_string(ggml_v2_cpu_has_sse3()) + " | ";
s += "VSX = " + std::to_string(ggml_v2_cpu_has_vsx()) + " | ";
return s.c_str();
}
// For internal test use
std::vector<std::pair<std::string, struct ggml_v2_tensor *>>& llama_v2_internal_get_tensor_map(struct llama_v2_context * ctx) {
return ctx->model.tensors_by_name;
}
// TODO: Calculate this constant from the vocabulary
#define MAX_TOKEN_LEN 18
// SentencePiece implementation after https://guillaume-be.github.io/2020-05-30/sentence_piece
std::vector<llama_v2_token> legacy_llama_v2_tokenize(const llama_v2_vocab & vocab, const std::string & text, bool bos) {
std::vector<llama_v2_token> res;
std::vector<int> score;
std::vector<llama_v2_token> prev;
int len = text.length();
score.resize(len + 1);
prev.resize(len + 1);
// Forward pass
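// (dynamic programming: score[j] holds the best sum of squared token lengths over any
//  segmentation of text[0..j), and prev[j] the id of the last token of that segmentation,
//  so longer vocabulary pieces are preferred)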
for (int i = 0; i < len; i++) {
int max_len = std::min(len - i, MAX_TOKEN_LEN);
for (int sub_len = 1; sub_len <= max_len; sub_len++) {
auto sub = text.substr(i, sub_len);
auto token = vocab.token_to_id.find(sub);
if (token != vocab.token_to_id.end()) {
int token_score = sub.length() * sub.length();
int local_score = score[i] + token_score;
int next = i + sub_len;
if (score[next] < local_score) {
score[next] = local_score;
prev[next] = (*token).second;
}
}
}
}
// Backward pass
int i = len;
while (i > 0) {
llama_v2_token token_id = prev[i];
if (token_id == 0) {
// TODO: Return error or something more meaningful
printf("failed to tokenize string!\n");
break;
}
res.push_back(token_id);
auto token = vocab.id_to_token[token_id].tok;
i -= token.length();
}
if (bos) {
res.push_back(1); // TODO: replace with vocab.bos
}
// Pieces were collected in reverse order, so reverse them
std::reverse(res.begin(), res.end());
return res;
}
int legacy_llama_v2_tokenize(
struct llama_v2_context * ctx,
const char * text,
llama_v2_token * tokens,
int n_max_tokens,
bool add_bos) {
auto res = legacy_llama_v2_tokenize(ctx->vocab, text, add_bos);
if (n_max_tokens < (int) res.size()) {
fprintf(stderr, "%s: too many tokens\n", __func__);
return -((int) res.size());
}
for (size_t i = 0; i < res.size(); i++) {
tokens[i] = res[i];
}
return res.size();
}
std::vector<llama_v2_token> legacy_llama_v2_tokenize(struct llama_v2_context * ctx, const std::string & text, bool add_bos) {
std::vector<llama_v2_token> res(8096);
int n = legacy_llama_v2_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
res.resize(n);
return res;
}
std::vector<llama_v2_token> llama_v2_tokenize(struct llama_v2_context * ctx, const std::string & text, bool add_bos) {
// initialize to the number of prompt chars, since n_tokens <= n_prompt_chars
std::vector<llama_v2_token> res(text.size() + (int) add_bos);
const int n = llama_v2_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
assert(n >= 0);
res.resize(n);
return res;
}