Commit graph

6136 commits

Author SHA1 Message Date
Concedo
8db8154a25 Merge branch 'concedo_experimental' of https://github.com/LostRuins/koboldcpp into concedo_experimental 2024-11-19 18:09:29 +08:00
Concedo
14cbd07eaa more wip multiplayer 2024-11-19 18:09:26 +08:00
pandora
a548108dd2
Create Mistral-V7.json (#1224) 2024-11-19 10:45:50 +08:00
Concedo
ee586b9a9d fixed vulkan 2024-11-19 01:26:31 +08:00
Concedo
d5feaa8a3d fixed old mixtral models, but at what cost? was it worth it? 2024-11-19 01:01:25 +08:00
GPTLocalhost (Word Add-in)
aacb6c3a70
Add GPTLocalhost as third-party resource (#1221) 2024-11-18 10:17:06 +08:00
Concedo
39124828ab wip multiplayer 2024-11-17 23:29:25 +08:00
Concedo
e7897f3257 update docs 2024-11-17 11:43:49 +08:00
Concedo
d6932bbff8 test fix linux build 2024-11-17 02:43:42 +08:00
Concedo
e1f0b0bedd try fix macos build (+1 squashed commits)
Squashed commits:

[ae66dddfd] try fix macos build
2024-11-17 02:37:08 +08:00
Concedo
f6e9d11636 try with 2 parallel jobs 2024-11-17 01:46:41 +08:00
Concedo
952328fdc8 try fix cuda build 2024-11-17 01:41:52 +08:00
Concedo
9acfe96c77 fix cuda build 2024-11-16 21:58:22 +08:00
Concedo
a8694698fd accept gguf text encoders for sd 2024-11-16 17:23:02 +08:00
Concedo
590553ef07 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.github/workflows/build.yml
#	CMakePresets.json
#	Makefile
#	docs/backend/SYCL.md
#	docs/build.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
2024-11-16 17:20:14 +08:00
Concedo
70aee82552 attempts a backflip, but does he stick the landing? 2024-11-16 17:05:45 +08:00
Georgi Gerganov
f245cc28d4
scripts : fix missing key in compare-llama-bench.py (#10332) 2024-11-16 10:32:50 +02:00
Jeff Bolz
772703c8ff
vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
2024-11-16 07:26:57 +01:00
Concedo
a5f8e596d3 unset sc if ff off 2024-11-16 10:52:33 +08:00
FirstTimeEZ
dd3a6ce9f8
vulkan : add cmake preset debug/release (#10306) 2024-11-16 02:59:33 +01:00
Dan Johansson
1e58ee1318
ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324) 2024-11-16 01:53:37 +01:00
FirstTimeEZ
89e4caaaf0
llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
2024-11-16 01:42:13 +01:00
Srihari-mcw
74d73dc85c
Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) 2024-11-15 22:27:00 +01:00
Johannes Gäßler
4047be74da
scripts: update compare-llama-bench.py (#10319) 2024-11-15 21:19:03 +01:00
slaren
883d206fbd ggml : fix some build issues 2024-11-15 21:45:32 +02:00
Georgi Gerganov
09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
2024-11-15 15:44:06 +02:00
thewh1teagle
3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
Georgi Gerganov
cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
Eve
18429220bd
AVX BF16 and single scale quant optimizations (#10212)
* use 128 bit loads (i've tried 256->128 to death and its slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kep for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
2024-11-15 12:47:58 +01:00
R0CKSTAR
f0204a0ec7
ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-11-15 12:47:25 +01:00
Romain Biessy
57f8355b29
sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
Xuan Son Nguyen
9901068ac7
server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
2024-11-15 10:48:49 +01:00
Chenguang Li
231f9360d9
cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
Georgi Gerganov
4802ad350b
scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
Concedo
fedc3874bd try fix build inconsistency 2024-11-15 14:12:53 +08:00
Concedo
d595a80abc update prints 2024-11-15 14:10:02 +08:00
Romain Biessy
5a54af4d4f
sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692d61360b46954a1c7f780bd2e569b73.
2024-11-15 11:09:12 +08:00
Charles Xu
1607a5e5b0
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-11-15 01:28:50 +01:00
Diego Devesa
ae8de6d50a
ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Concedo
df080b074d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/server/README.md
#	examples/speculative/speculative.cpp
#	flake.lock
#	ggml/src/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2024-11-14 21:40:52 +08:00
Johannes Gäßler
4a8ccb37ad
CUDA: no -sm row for very small matrices (#10185) 2024-11-14 13:00:15 +01:00
Georgi Gerganov
2a82891a85
speculative : fix out-of-bounds access (#10289) 2024-11-14 11:44:15 +02:00
Concedo
bfa118ee45 fix llava segfault 2024-11-14 14:16:39 +08:00
Concedo
4b96c3bba8 try new batch api (not actually batching) 2024-11-14 13:47:26 +08:00
Jeff Bolz
af148c9386
vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 are the same dimensions and additional modulus
for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
2024-11-14 06:22:55 +01:00
Jeff Bolz
66798e42fb
vulkan: Use macros to make the mat mul pipeline creation more concise (#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a
pipeline. This isn't really used yet.
2024-11-13 21:59:47 +01:00
Michael Podvitskiy
fb4a0ec083
llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-13 20:00:35 +02:00
Concedo
8a7d53d838 declutter sdcpp adapter removed useless string literals 2024-11-14 00:27:39 +08:00
Georgi Gerganov
5ea926dad7
sync : ggml 2024-11-13 18:11:54 +02:00
Small Grass Forest
1ee9eea094
docs : update bindings list (#10261)
Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
2024-11-13 13:17:10 +02:00