Commit graph

123 commits

Author SHA1 Message Date
drbh
ee77efea2a
test : add simple grammar parsing tests (#2594)
* adds simple grammar parsing tests

* adds cassert header
2023-08-13 17:00:48 +03:00
Concedo
9483288e03 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-08-12 16:04:11 +08:00
byte-6174
b19edd54d5
Adding support for llama2.c models (#2559) 2023-08-12 01:17:25 +02:00
Concedo
dae9dffa6a rename koboldcpp.dll to koboldcpp_default.dll 2023-08-11 14:54:27 +08:00
Concedo
45456fa6ca switch noavx2 to not use openblas, as it has incompatible instructions 2023-07-30 16:47:33 +08:00
Concedo
343ae756fa Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
#	ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Georgi Gerganov
3973b25a64
gitignore : fix final newline 2023-07-21 14:42:41 +03:00
Jose Maldonado
73643f5fb1
gitignore : changes for Poetry users + chat examples (#2284)
A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thx TheBloke for models)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:53:27 +03:00
Jiří Podivín
54e3bc76fe
make : add new target for test binaries (#2244)
Programs in the tests directory are now build with target tests
and placed in the same location.

* clean target was expanded to remove new binaries

* test target binaries are listed in a variable

* Locations of binaries were added to the .gitignore

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:09:16 +03:00
Concedo
374fffb9c6 Reworking rope WIP 2023-07-19 00:54:41 +08:00
Georgi Gerganov
d01bccde9f
ci : integrate with ggml-org/ci (#2250)
* ci : run ctest

ggml-ci

* ci : add open llama 3B-v2 tests

ggml-ci

* ci : disable wget progress output

ggml-ci

* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations

ggml-ci

* tests : try to fix tail free sampling test

ggml-ci

* ci : add K-quants

ggml-ci

* ci : add short perplexity tests

ggml-ci

* ci : add README.md

* ppl : add --chunks argument to limit max number of chunks

ggml-ci

* ci : update README
2023-07-18 14:24:43 +03:00
Concedo
b0b131499f Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	tests/test-tokenizer-0.cpp
2023-07-11 16:12:15 +08:00
Evan Miller
5656d10599
mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
Concedo
523fc3be52 fixed rwkv, standardized new ctx usage 2023-07-10 20:05:53 +08:00
Concedo
dff5575647 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	Makefile
#	ggml-opencl.cpp
#	llama.cpp
2023-06-29 17:35:28 +08:00
ningshanwutuobang
cfa0750bc9
llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add READMD for llava.py

* add READMD for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
Concedo
278427d9a4 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
2023-06-18 15:29:44 +08:00
Georgi Gerganov
051e1b0e6a
llama : fix kv_cache n init (close #1903) 2023-06-17 19:31:20 +03:00
Concedo
9f8e2f8a18 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	pocs/vdot/vdot.cpp
#	scripts/verify-checksum-models.py
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-sampling.cpp
#	tests/test-tokenizer-0.cpp
2023-06-17 20:02:32 +08:00
Randall Fitzgerald
794db3e7b9
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.

Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.

This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.

Summary of the changes:

- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict 
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables

---------

Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 14:53:04 +03:00
Concedo
7ef8d740b9 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-06-16 16:37:14 +08:00
Borislav Stanimirov
602c748863
gitignore : add several entries specific to Visual Studio (#1888) 2023-06-16 09:58:11 +03:00
daboe01
cf267d1c71
make : add train-text-from-scratch (#1850)
* make finetuning example accessible

* fixed: targed was in wrong line

* fixed: name of executable was wrong

* fixed: naming of binary

* fixed: model path was wrong

* fixed clean target

* Update examples/train-text-from-scratch/README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-15 20:42:48 +03:00
Concedo
0833845268 merged metal patch directly into the file 2023-06-09 14:38:31 +08:00
Hyun-joo KIM
6fa1613f15
Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment 2023-06-09 01:47:36 +09:00
Concedo
ed603dcafc Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	docs/BLIS.md
#	llama.cpp
#	tests/test-quantize-fns.cpp
2023-06-06 23:12:01 +08:00
Georgi Gerganov
2d43387daf
ggml : fix builds, add ggml-quants-k.o (close #1712, close #1710) 2023-06-06 10:18:03 +03:00
Georgi Gerganov
7ad7750c5c
gitignore : add .clang-tidy 2023-06-06 09:55:25 +03:00
Georgi Gerganov
ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + suport scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
Concedo
20803c221e cleaning up some old junk 2023-06-04 11:05:46 +08:00
Concedo
cd4012c3ed minor fixes to debug logging, fixed a typo, added a new failsafe mode 2023-05-23 21:31:42 +08:00
Concedo
587308a202 fixed some build errors on linux, changed icon resolution, added more error printing 2023-05-22 12:18:42 +08:00
Concedo
e01e373e63 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	ggml.c
#	llama.cpp
2023-05-14 11:34:41 +08:00
Georgi Gerganov
0cd22e190a
llama : fix various warnings 2023-05-13 11:23:15 +03:00
Concedo
e9caff1cda Interim merge. Do not use.
Merge branch 'master' into concedo_experimental

# Conflicts:
#	README.md
#	SHA256SUMS
#	examples/quantize/quantize.cpp
#	ggml-opencl.c
#	ggml.c
#	ggml.h
#	llama.cpp
#	llama.h
2023-05-12 23:20:27 +08:00
Georgi Gerganov
b9fd7eee57
ggml : remove bit shuffling (#1405)
* ggml : remove Q4_0 bit shufling (ARM NEON)

* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)

* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)

* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)

* ggml : remove Q5_0 bit shuffling (ARM NEON)

* ggml : 2x faster scalar implementations

* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)

* ggml : simplify scalar dot

* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit

* ggml : fix Q4_1 quantization

* ggml : update cuBLAS + normalize variable names

* ggml : remove Q4_2 mode

* ggml : minor formatting

* ggml : fix Q5_0 quantization

* scripts : add script for measuring the time per token

* AVX implementations (#1370)

* ggml : uniform 5th bit extraction

* llama : produce error upon loading old model files

* llama : fix model magic/version write

* ggml : speed-up Q5_0 + Q5_1 at 4 threads

* ggml : preserve old Q4 and Q5 formats

* ggml : simplify Q8_1 - no need for low / high sums anymore

* ggml : fix Q8_0 and Q8_1 rounding

* Revert "AVX implementations (#1370)"

This reverts commit 948d124837f9d287d8490f41338e0e4cceb0814f.

* ggml : fix AVX2 implementation

* sha : update hashes for 7B and 13B

* readme : update timings + remove warning banner

* llama : update v2 PR number to 1405

* ggml : fix WASM comments

* ggml : back to original bit order

* readme : add note that Q4 and Q5 have been changed

* llama : fix return for unknown version

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
Concedo
54194911ac Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-05-09 16:50:43 +08:00
Georgi Gerganov
f9a6364912
llama : require first token to be BOS (#1303)
* llama : require first token to be BOS

* scripts : add ppl-run-all.sh

* perplexity : add BOS for each chunk

* readme : update perplexity values after BOS fix

* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
Concedo
62beded0e7 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	README.md
2023-05-07 19:10:01 +08:00
Jed Fox
3924088512
Remove default arguments from sampling functions (#1343) 2023-05-06 17:01:47 -04:00
Concedo
966cd2ce91 Merge remote-tracking branch 'temp/concedo' into concedo_experimental
# Conflicts:
#	koboldcpp.py
2023-05-02 22:43:34 +08:00
Sergey Kucher
069b3d4c37 Adds --mlock argument 2023-05-02 16:19:37 +03:00
Concedo
94827172e0 Merge branch 'master' into concedo
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	ggml-cuda.cu
#	ggml-cuda.h
2023-05-02 14:38:31 +08:00
DannyDaemonic
f4cef87edf
Add git-based build information for better issue tracking (#1232)
* Add git-based build information for better issue tracking

* macOS fix

* "build (hash)" and "CMAKE_SOURCE_DIR" changes

* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages

* Fix conditional dependency on missing target

* Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile

* 4 space indenting for cmake, attempt to clean up my mess in Makefile

* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it
2023-05-01 18:23:47 +02:00
Concedo
3de34ee492 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	ggml-opencl.c
2023-05-01 12:03:46 +08:00
Stephan Walter
f0d70f147d
Various fixes to mat_mul benchmark (#1253) 2023-04-30 12:32:37 +00:00
Concedo
0fc1772a8f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	ggml.c
2023-04-29 11:14:05 +08:00
CRD716
5fba3c016b
examples : add Jeopardy example (#1168)
* Basic Setup

* Prevent Results.txt from coming up

* Prefixes, Line separators, etc

* editorcheck

* introduction to give more consistent results

* Basic graph thing

* Grading, ready for testing!

* Y'all ready to get funky?

* fix column removal stuff

* missed a few
2023-04-28 19:13:33 +03:00
Concedo
95bbd46019 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	README.md
2023-04-27 16:12:00 +08:00
Georgi Gerganov
574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
* ggml : add Q5_0 quantization (cuBLAS only)

* ggml : fix Q5_0 qh -> uint32_t

* ggml : fix q5_0 histogram stats

* ggml : q5_0 scalar dot product

* ggml : q5_0 ARM NEON dot

* ggml : q5_0 more efficient ARM NEON using uint64_t masks

* ggml : rename Q5_0 -> Q5_1

* ggml : adding Q5_0 mode

* quantize : add Q5_0 and Q5_1 to map

* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-26 23:14:13 +03:00