Commit graph

7620 commits

Author SHA1 Message Date
Georgi Gerganov
6acce39710
readme : update the usage section with examples (#10596)
* readme : update the usage section with examples

* readme : more examples
2024-12-01 11:25:17 +02:00
Concedo
2ba5949054 updated sdcpp, also set euler as default sampler 2024-12-01 17:00:20 +08:00
Concedo
e93c2427b4 allow incompatible vocab in debugmode 2024-12-01 14:11:03 +08:00
Concedo
42228b9746 warning when selecting non gguf models 2024-12-01 13:35:51 +08:00
Wang Qin
43957ef203
build: update Makefile comments for C++ version change (#10598) 2024-12-01 04:19:44 +01:00
Concedo
d5e732f3ab updated lite 2024-12-01 01:49:09 +08:00
Concedo
b7cd210cd2 more linting with Ruff (+1 squashed commits)
Squashed commits:

[43802cfe2] Applied default Ruff linting
2024-12-01 01:23:13 +08:00
Adrien Gallouët
0c39f44d70
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (#10567)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-11-30 09:13:18 -08:00
Concedo
409e393d10 fixed critical bug in image model loader 2024-11-30 23:28:24 +08:00
Concedo
153da19274 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
2024-11-30 16:59:25 +08:00
Concedo
0028e71993 special handling to resolve incomplete utf8 token sequences in qwen 2024-11-30 16:54:01 +08:00
Concedo
32ac3153e4 default speculative set to 8. added more adapter fields 2024-11-30 16:18:27 +08:00
Georgi Gerganov
3e0ba0e604
readme : remove old badge 2024-11-30 10:09:21 +02:00
Georgi Gerganov
abadba05be
readme : refresh (#10587)
* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
2024-11-30 09:47:07 +02:00
Eve
0533e7fb38
vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536)
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)

* force 16 sequential threads per block

* make 16 subgroup size a constant
2024-11-30 08:00:02 +01:00
Concedo
5353bfa983 updated lite 2024-11-30 12:26:20 +08:00
Concedo
557bcaf86e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.clang-tidy
#	.github/workflows/build.yml
#	Makefile
#	Package.swift
#	common/CMakeLists.txt
#	examples/batched-bench/CMakeLists.txt
#	examples/batched/CMakeLists.txt
#	examples/convert-llama2c-to-ggml/CMakeLists.txt
#	examples/cvector-generator/CMakeLists.txt
#	examples/embedding/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/export-lora/CMakeLists.txt
#	examples/gbnf-validator/CMakeLists.txt
#	examples/gguf-split/CMakeLists.txt
#	examples/gguf/CMakeLists.txt
#	examples/gritlm/CMakeLists.txt
#	examples/imatrix/CMakeLists.txt
#	examples/infill/CMakeLists.txt
#	examples/llama-bench/CMakeLists.txt
#	examples/llava/CMakeLists.txt
#	examples/lookahead/CMakeLists.txt
#	examples/lookup/CMakeLists.txt
#	examples/main-cmake-pkg/CMakeLists.txt
#	examples/main/CMakeLists.txt
#	examples/parallel/CMakeLists.txt
#	examples/passkey/CMakeLists.txt
#	examples/perplexity/CMakeLists.txt
#	examples/quantize-stats/CMakeLists.txt
#	examples/quantize/CMakeLists.txt
#	examples/retrieval/CMakeLists.txt
#	examples/run/CMakeLists.txt
#	examples/save-load-state/CMakeLists.txt
#	examples/server/CMakeLists.txt
#	examples/simple-chat/CMakeLists.txt
#	examples/simple/CMakeLists.txt
#	examples/speculative-simple/CMakeLists.txt
#	examples/speculative/CMakeLists.txt
#	examples/tokenize/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	pocs/vdot/CMakeLists.txt
#	src/CMakeLists.txt
#	src/unicode.cpp
#	tests/test-sampling.cpp
2024-11-30 12:24:51 +08:00
Concedo
697ca70115 temp checkpoint 2024-11-30 12:13:20 +08:00
Concedo
ec95241e38 temp checkpoint 2024-11-30 11:59:27 +08:00
Concedo
0c8939be19 temp checkpoint 2024-11-30 11:57:28 +08:00
Concedo
e0c59486ee default to 12 tokens drafted 2024-11-30 11:52:07 +08:00
Concedo
b21d0fe3ac customizable speculative size 2024-11-30 11:28:19 +08:00
Concedo
f75bbb945f speculative decoding initial impl completed (+6 squashed commit)
Squashed commit:

[0a6306ca0] draft wip dont use (will be squashed)

[a758a1c9c] wip dont use (will be squashed)

[e1994d3ce] wip dont use

[f59690d68] wip

[77228147d] wip on spec decoding. dont use yet

[2445bca54] wip adding speculative decoding (+1 squashed commits)

Squashed commits:

[50e341bb7] wip adding speculative decoding
2024-11-30 10:41:10 +08:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend (#10570)
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Xuan Son Nguyen
b782e5c7d4
server : add more test cases (#10569)
* server : add split model test

* add test speculative

* add invalid cases
2024-11-29 21:48:56 +01:00
Robert Collins
3a8e9af402
imatrix : support combine-only (#10492)
* imatrix-combine-only idea

* ensured that behavior consistent with log
2024-11-29 19:21:37 +02:00
Diego Devesa
a3a3048e7a
cleanup UI link list (#10577)
* cleanup UI link list

* sort list alphabetically

* add missing licenses
2024-11-29 17:45:08 +01:00
Georgi Gerganov
f0678c5ff4
ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggml-ci
2024-11-29 16:25:39 +02:00
Shupei Fan
4b3242bbea
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580) 2024-11-29 14:49:02 +01:00
Alberto Cabrera Pérez
0f77aae560
sycl : offload of get_rows set to 0 (#10432) 2024-11-29 20:38:45 +08:00
Alberto Cabrera Pérez
266b8519ee
sycl : Reroute permuted mul_mats through oneMKL (#10408)
This PR fixes the failing MUL_MAT tests for the sycl backend.
2024-11-29 09:49:43 +00:00
Chenguang Li
938f608742
CANN: RoPE operator optimization (#10563)
* [cann] RoPE operator optimization

* [CANN]Code Formatting

---------

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-29 14:46:55 +08:00
Jeff Bolz
f095a649ec
vulkan: get the first command buffer submitted sooner (#10499)
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
2024-11-29 07:18:02 +01:00
Ting Lou
678d7994f4
llava: return false instead of exit (#10546) 2024-11-29 01:09:46 +01:00
Georgi Gerganov
dc22344088
ggml : remove redundant copyright notice + update authors 2024-11-28 20:46:40 +02:00
Georgi Gerganov
4c0a95b107
llama : add missing model types 2024-11-28 20:45:07 +02:00
Xuan Son Nguyen
6c59567689
server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
* server : (tests) don't use thread for capturing stdout/stderr

* test: bump openai to 1.55.2

* bump openai to 1.55.3
2024-11-28 19:17:49 +01:00
Johannes Gäßler
890719311b
common: fix warning message when no GPU found (#10564) 2024-11-28 18:15:25 +01:00
Random Fly
7281cf13ad
docs: fix outdated usage of llama-simple (#10565) 2024-11-28 16:03:11 +01:00
Diego Devesa
e90688edd0
ci : fix tag name in cuda and hip releases (#10566) 2024-11-28 15:58:54 +01:00
Georgi Gerganov
76b27d29c2
ggml : fix row condition for i8mm kernels (#10561)
ggml-ci
2024-11-28 14:56:37 +02:00
Georgi Gerganov
eea986f215
cmake : fix ARM feature detection (#10543)
ggml-ci
2024-11-28 14:56:23 +02:00
Shupei Fan
c202cef168
ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
* ggml-cpu: support IQ4_NL_4_4 by runtime repack

* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-11-28 13:52:03 +01:00
Sergio López
2025fa67e9
kompute : improve backend to pass test_backend_ops (#10542)
* kompute: op_unary: reject unsupported parameters

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: softmax: implement ALiBi support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: rope: implement neox and phi3 support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_q4_k permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_[q4_0|q4_1|q8_0] permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_f16 permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_q6_k permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

---------

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-28 12:51:38 +01:00
Ruixin Huang
c6bc73951e
CANN: Update cann.md to display correctly in CLion (#10538) 2024-11-28 15:27:11 +08:00
leo-pony
605fa66c50
CANN: Fix SOC_TYPE compile bug (#10519)
* CANN: Fix the bug build fail on Ascend310P under two cases:
1) Manual specify SOC_TYPE
2) Under some unusual compile environment

* Update the cann backend News content: Support F16 and F32 data type model for Ascend 310P NPU.

* fix CANN  compile fail bug: the assert in ascend kernel function doesn't supportted on some CANN version
2024-11-28 15:25:24 +08:00
Chenguang Li
b7420131bf
CANN: ROPE operator optimization (#10540)
* [cann] ROPE operator optimization

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-28 14:24:46 +08:00
Xuan Son Nguyen
9f912511bc
common : fix duplicated file name with hf_repo and hf_file (#10550) 2024-11-27 22:30:52 +01:00
uvos
3ad5451f3b
Add some minimal optimizations for CDNA (#10498)
* Add some minimal optimizations for CDNA

* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-11-27 17:10:08 +01:00
Diego Devesa
46c69e0e75
ci : faster CUDA toolkit installation method and use ccache (#10537)
* ci : faster CUDA toolkit installation method and use ccache

* remove fetch-depth

* only pack CUDA runtime on master
2024-11-27 11:03:25 +01:00