koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-10 04:00:53 +00:00

Author	SHA1	Message	Date
Concedo	235daf4016	Merge branch 'master' into concedo # Conflicts: # .github/workflows/build.yml # README.md	2023-04-25 20:44:22 +08:00
slaren	e4cf982e0d	Fix cuda compilation (#1128) * Fix: Issue with CUBLAS compilation error due to missing -fPIC flag --------- Co-authored-by: B1gM8c <89020353+B1gM8c@users.noreply.github.com>	2023-04-24 17:29:58 +02:00
Concedo	59fb174678	fixed compile errors, made mmap automatic when lora is selected, added updated quantizers and quantization handling for gpt neox gpt 2 and gptj	2023-04-24 23:20:06 +08:00
Concedo	8e615c8245	Merge branch 'master' into concedo_experimental # Conflicts: # README.md	2023-04-24 12:20:08 +08:00
Georgi Gerganov	e4422e299c	ggml : better PERF prints + support "LLAMA_PERF=1 make"	2023-04-23 18:15:39 +03:00
Concedo	1b7aa2b815	Merge branch 'master' into concedo # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile	2023-04-22 16:22:08 +08:00
Georgi Gerganov	872c365a91	ggml : fix AVX build + update to new Q8_0 format	2023-04-22 11:08:12 +03:00
Concedo	7b3d04e5d4	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt	2023-04-22 10:58:16 +08:00
Concedo	4fa3dfe8bc	just doesn't work properly on windows. will leave it as a manual flag for others	2023-04-22 10:57:38 +08:00
slaren	50cb666b8a	Improve cuBLAS performance by using a memory pool (#1094) * Improve cuBLAS performance by using a memory pool * Move cuda specific definitions to ggml-cuda.h/cu * Add CXX flags to nvcc * Change memory pool synchronization mechanism to a spin lock General code cleanup	2023-04-21 21:59:17 +02:00
Concedo	68898046c2	accidentally added the binaries onto repo again.	2023-04-22 00:41:19 +08:00
Concedo	f555db44ec	adding the libraries for cublas first. but i cannot get the kernel to work yet	2023-04-21 23:24:09 +08:00
Concedo	794a38a2e8	Revert "cublas is not feasible at this time. removed for now" This reverts commit `3687db7cf7`.	2023-04-21 21:02:40 +08:00
Concedo	5160053e51	merged llama adapter into the rest of the gpt adapters	2023-04-21 17:47:48 +08:00
Concedo	82d74ca1a6	Merge branch 'master' into concedo # Conflicts: # .github/workflows/build.yml	2023-04-21 16:24:30 +08:00
Concedo	3687db7cf7	cublas is not feasible at this time. removed for now	2023-04-21 16:14:23 +08:00
slaren	2005469ea1	Add Q4_3 support to cuBLAS (#1086)	2023-04-20 20:49:53 +02:00
Concedo	07bb31b034	wip dont use	2023-04-21 00:35:54 +08:00
Concedo	7ba36c2c6c	trying to put out penguin based fires. sorry for inconvenience	2023-04-20 23:15:07 +08:00
源文雨	5addcb120c	fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080)	2023-04-20 15:28:43 +02:00
Concedo	4605074245	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # ggml.c	2023-04-20 17:30:54 +08:00
Concedo	0b08ec7c5d	forgot to remove this	2023-04-20 16:28:47 +08:00
Concedo	346cd68903	make linux and OSX build process equal to windows. Now it will build all applicable libraries, for a full build do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`	2023-04-20 15:53:55 +08:00
slaren	02d6988121	Improve cuBLAS performance by dequantizing on the GPU (#1065)	2023-04-20 03:14:14 +02:00
Stephan Walter	f3d4edf504	ggml : Q4 cleanup - remove 4-bit dot product code (#1061) * Q4 cleanup * Remove unused AVX512 Q4_0 code	2023-04-19 19:06:37 +03:00
Concedo	be1222c36e	Merged the upstream cublas feature,	2023-04-19 20:45:37 +08:00
slaren	8944a13296	Add NVIDIA cuBLAS support (#1044)	2023-04-19 11:22:45 +02:00
Concedo	f662a9a230	Merge branch 'master' into concedo # Conflicts: # .github/workflows/build.yml # .github/workflows/docker.yml # CMakeLists.txt # Makefile # README.md	2023-04-19 16:34:51 +08:00
Kawrakow	5ecff35151	Adding a simple program to measure speed of dot products (#1041) On my Mac, the direct Q4_1 product is marginally slower (~69 vs ~55 us for Q4_0). The SIMD-ified ggml version is now almost 2X slower (~121 us). On a Ryzen 7950X CPU, the direct product for Q4_1 quantization is faster than the AVX2 implementation (~60 vs ~62 us). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-04-18 19:00:14 +00:00
Concedo	ea01771dd5	rwkv is done	2023-04-18 20:55:01 +08:00
Concedo	763ad172c0	arranged files, updated kobold lite, modified makefile for extra link args on linux, started RWKV implementation	2023-04-17 17:31:45 +08:00
Concedo	6548d3b3fb	Added prints for stopping sequences, made makefile 1% friendlier to arch linux users	2023-04-16 20:43:17 +08:00
Georgi Gerganov	e95b6554b4	ggml : add Q8_0 quantization for intermediate results (#951) * ggml : add Q8_0 quantization for intermediate results * quantize-stats : fix test + add it to Makefile default * Q8: use int8_t, AVX/AVX2 optimizations * ggml : fix quantize_row_q8_0() ARM_NEON rounding * minor : updates after rebase to latest master * quantize-stats : delete obsolete strings * ggml : fix q4_1 dot func --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-15 17:53:22 +03:00
Concedo	d00b865eb1	Merge branch 'master' into concedo # Conflicts: # .devops/full.Dockerfile # Makefile # flake.nix	2023-04-15 11:33:43 +08:00
Stephan Walter	93265e988a	make : fix dependencies, use auto variables (#983)	2023-04-14 22:39:48 +03:00
Concedo	932d981222	more make targets	2023-04-14 21:54:18 +08:00
Concedo	a819f22cac	Merge branch 'master' into concedo # Conflicts: # CMakeLists.txt # Makefile # README.md # flake.nix	2023-04-14 21:40:33 +08:00
Georgi Gerganov	9190e8eac8	llama : merge llama_internal.h into llama.h Hide it behind an #ifdef	2023-04-13 18:04:45 +03:00
CRD716	8cda5c981d	fix whitespace (#944)	2023-04-13 16:03:57 +02:00
SebastianApel	95ea26f6e9	benchmark : add tool for timing q4_0 matrix multiplication (#653) * Initial version of q4_0 matrix multiplication benchmark * Bugfix: Added dependency to ggml.o to benchmark * Reviewer requests: added parameter for threads, switched to ggml_time_us() * Reviewer input: removed rtsc, use epsilon for check * Review comment: Removed set_locale * Feature: Param for numer of iterations, Bugfix for use of parameter threads * Reviewer suggestion: Moved to examples * Reviewer feedback: Updated clean: and benchmark: sections --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-13 15:46:23 +03:00
Concedo	c1b75f38d0	try to fix noavx2 for really old devices by	2023-04-13 14:36:00 +08:00
Concedo	5c22f7e4c4	reduce batch sizes and skip all intrinsic flags except AVX when building in compatibility mode.	2023-04-13 11:32:05 +08:00
Concedo	1bd5992da4	clean and refactor handling of flags	2023-04-12 23:25:31 +08:00
Concedo	636f8e5a8e	updated the quantize files and makefile	2023-04-12 21:40:25 +08:00
Concedo	4faae0afa9	Merged upstream, fixed OSX compile errors, integrated noavx2 build into main	2023-04-12 18:08:55 +08:00
Concedo	23c675b2e6	integrated optional (experimentl) CLBlast support	2023-04-11 23:33:44 +08:00
0cc4m	c3db99ea32	Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing	2023-04-10 18:20:40 +02:00
Concedo	f53238f570	Merged the upstream updates for model loading code, and ditched the legacy llama loaders since they were no longer needed.	2023-04-10 12:00:34 +08:00
comex	f963b63afa	Rewrite loading code to try to satisfy everyone: - Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.) - Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported). - Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`. - Improve validation and error checking. - Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front). - Support VirtualLock on Windows (using the same `--mlock` option as on Unix). - Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) - To help implement this, move mlock support from ggml to the loading code. - madvise/PrefetchVirtualMemory support (based on #740) - Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap). - Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way'). Implementation notes: I tried to factor the code into more discrete pieces than before. Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly: - Destructors to make it easier to ensure everything gets cleaned up. - Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. The exceptions are converted to error codes at the API boundary.) Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)	2023-04-10 01:10:46 +02:00
Concedo	0b904e12db	Merge branch 'master' into concedo # Conflicts: # Makefile	2023-04-08 17:42:09 +08:00

100 commits