Commit graph

260 commits

Author SHA1 Message Date
Concedo
557bcaf86e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.clang-tidy
#	.github/workflows/build.yml
#	Makefile
#	Package.swift
#	common/CMakeLists.txt
#	examples/batched-bench/CMakeLists.txt
#	examples/batched/CMakeLists.txt
#	examples/convert-llama2c-to-ggml/CMakeLists.txt
#	examples/cvector-generator/CMakeLists.txt
#	examples/embedding/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/export-lora/CMakeLists.txt
#	examples/gbnf-validator/CMakeLists.txt
#	examples/gguf-split/CMakeLists.txt
#	examples/gguf/CMakeLists.txt
#	examples/gritlm/CMakeLists.txt
#	examples/imatrix/CMakeLists.txt
#	examples/infill/CMakeLists.txt
#	examples/llama-bench/CMakeLists.txt
#	examples/llava/CMakeLists.txt
#	examples/lookahead/CMakeLists.txt
#	examples/lookup/CMakeLists.txt
#	examples/main-cmake-pkg/CMakeLists.txt
#	examples/main/CMakeLists.txt
#	examples/parallel/CMakeLists.txt
#	examples/passkey/CMakeLists.txt
#	examples/perplexity/CMakeLists.txt
#	examples/quantize-stats/CMakeLists.txt
#	examples/quantize/CMakeLists.txt
#	examples/retrieval/CMakeLists.txt
#	examples/run/CMakeLists.txt
#	examples/save-load-state/CMakeLists.txt
#	examples/server/CMakeLists.txt
#	examples/simple-chat/CMakeLists.txt
#	examples/simple/CMakeLists.txt
#	examples/speculative-simple/CMakeLists.txt
#	examples/speculative/CMakeLists.txt
#	examples/tokenize/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	pocs/vdot/CMakeLists.txt
#	src/CMakeLists.txt
#	src/unicode.cpp
#	tests/test-sampling.cpp
2024-11-30 12:24:51 +08:00
Concedo
ec95241e38 temp checkpoint 2024-11-30 11:59:27 +08:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend (#10570)
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Georgi Gerganov
4c0a95b107
llama : add missing model types 2024-11-28 20:45:07 +02:00
Georgi Gerganov
ab96610b1e
cmake : enable warnings in llama (#10474)
* cmake : enable warnings in llama

ggml-ci

* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS

* cmake : get_flags -> ggml_get_flags

* speculative-simple : fix warnings

* cmake : reuse ggml_get_flags

ggml-ci

* speculative-simple : fix compile warning

ggml-ci
2024-11-26 14:18:08 +02:00
Concedo
ec581b19d8 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/ISSUE_TEMPLATE/019-bug-misc.yml
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakeLists.txt
#	Makefile
#	Package.swift
#	examples/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/llama-bench/llama-bench.cpp
#	examples/server/README.md
#	examples/server/server.cpp
#	examples/simple-chat/simple-chat.cpp
#	examples/simple/simple.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-amx/CMakeLists.txt
#	ggml/src/ggml-blas/CMakeLists.txt
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-kompute/CMakeLists.txt
#	ggml/src/ggml-kompute/ggml-kompute.cpp
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-rpc/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	pocs/CMakeLists.txt
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2024-11-26 17:01:20 +08:00
Shane A
80acb7b430
Rename Olmo1124 to Olmo2 (#10500) 2024-11-25 19:36:09 +01:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model (#10497)
* llama : accept a list of devices to use to offload a model

* accept `--dev none` to completely disable offloading

* fix dev list with dl backends

* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends (#10469)
* ggml : add support for dynamic loading of backends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Concedo
83350ec314 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/020-enhancement.yml
#	.github/ISSUE_TEMPLATE/030-research.yml
#	.github/ISSUE_TEMPLATE/040-refactor.yml
#	.github/workflows/build.yml
#	Makefile
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	examples/infill/infill.cpp
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup-stats.cpp
#	examples/lookup/lookup.cpp
#	examples/parallel/parallel.cpp
#	examples/retrieval/retrieval.cpp
#	examples/save-load-state/save-load-state.cpp
#	examples/speculative/speculative.cpp
#	flake.lock
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/kernels/CMakeLists.txt
#	ggml/src/ggml-cann/kernels/dup.cpp
#	ggml/src/ggml-cann/kernels/get_row_f16.cpp
#	ggml/src/ggml-cann/kernels/get_row_f32.cpp
#	ggml/src/ggml-cann/kernels/get_row_q4_0.cpp
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2024-11-25 16:26:08 +08:00
Diego Devesa
dc39012cba
llama : fix op mul check with command-r-plus (#10476) 2024-11-24 16:10:26 +01:00
Concedo
116879144c better error messages 2024-11-23 18:55:01 +08:00
Concedo
091a432cf6 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/llama-cli-cann.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-cli-musa.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.devops/llama-server-musa.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	cmake/llama-config.cmake.in
#	docs/backend/SYCL.md
#	docs/build.md
#	examples/llama-bench/llama-bench.cpp
#	flake.lock
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
#	ggml/src/ggml-blas/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2024-11-21 16:26:24 +08:00
Georgi Gerganov
1bb30bf28c
llama : handle KV shift for recurrent models (#10402)
ggml-ci
2024-11-21 10:22:47 +02:00
Concedo
282a647689 Merge commit '467576b6cc' into concedo_experimental
# Conflicts:
#	.gitignore
#	Makefile
#	README.md
#	common/common.h
#	docs/build.md
#	examples/infill/infill.cpp
#	examples/perplexity/perplexity.cpp
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tests/test-quantize-perf.cpp
2024-11-21 16:05:21 +08:00
Georgi Gerganov
8e752a777b
llama : add check for KV cache shifts (#10401)
ggml-ci
2024-11-19 13:29:26 +02:00
Shane A
a88ad007de
llama : add OLMo November 2024 support (#10394)
* Add OLMo November 2024 constants

* Add OLMo November 2024 converter

* Add loading of OLMo November 2024 tensors and hyper parameters

* Add building of OLMo November 2024 model
2024-11-19 11:04:08 +02:00
Concedo
d5feaa8a3d fixed old mixtral models, but at what cost? was it worth it? 2024-11-19 01:01:25 +08:00
Diego Devesa
be5caccef9
llama : only use default buffer types for the KV cache (#10358) 2024-11-17 12:25:45 +01:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument (#10339) 2024-11-16 23:00:41 +01:00
Concedo
590553ef07 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.github/workflows/build.yml
#	CMakePresets.json
#	Makefile
#	docs/backend/SYCL.md
#	docs/build.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	scripts/compare-llama-bench.py
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
2024-11-16 17:20:14 +08:00
Concedo
70aee82552 attempts a backflip, but does he stick the landing? 2024-11-16 17:05:45 +08:00
FirstTimeEZ
89e4caaaf0
llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
2024-11-16 01:42:13 +01:00
Charles Xu
1607a5e5b0
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-11-15 01:28:50 +01:00
Diego Devesa
ae8de6d50a
ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Concedo
df080b074d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/server/README.md
#	examples/speculative/speculative.cpp
#	flake.lock
#	ggml/src/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2024-11-14 21:40:52 +08:00
Michael Podvitskiy
fb4a0ec083
llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-13 20:00:35 +02:00
Georgi Gerganov
f018acba22
llama : fix Qwen model type strings 2024-11-09 11:26:34 +02:00
Concedo
a244b1ffd2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	Package.swift
#	ci/run.sh
#	docs/backend/SYCL.md
#	examples/llama-bench/llama-bench.cpp
#	examples/server/CMakeLists.txt
#	examples/server/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	grammars/README.md
#	scripts/sync-ggml-am.sh
#	scripts/sync-ggml.last
#	scripts/sync-ggml.sh
#	tests/run-json-schema-to-grammar.mjs
#	tests/test-backend-ops.cpp
2024-11-09 13:36:47 +08:00
wwoodsTM
5107e8cea3
DRY: Fixes clone functionality (#10192) 2024-11-07 16:20:25 +01:00
Zhiyuan Li
3bcd40b3c5
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133)
* rwkv6: rename to wkv6

* rwkv6: support avx2 avx512 armv8 armv9

* rwkv6: update cuda file name

* rwkv6: rename params

* wkv on sycl

* sycl: add some ops

* sycl: Enhance OP support judgment

* wkv6: drop armv9 and tranfer to GGML style

ggml-ci

* sync : ggml

* update the function to use appropriate types

* fix define error

* Update ggml/src/ggml-cpu.c

* add appropriate asserts

* move element-wise functions outside

* put the declaration outside the loop

* rewrite to be more inline with the common pattern for distributing threads

* use recommended way GGML_TENSOR_LOCALS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-11-07 15:19:10 +08:00
Concedo
628dcd640e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	examples/server/README.md
2024-11-06 23:13:00 +08:00
Diego Devesa
94d8cb8be1
metal : fix from ptr buffer name (#10189) 2024-11-06 12:10:07 +01:00
Gabe Goodhart
b8deef0ec0
llama : add <|tool_call|> formatting to Granite template (#10177)
Branch: GraniteToolCallTemplate

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-05 14:23:04 +02:00
Concedo
bb13925f39 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakePresets.json
#	Makefile
#	Package.swift
#	ci/run.sh
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	flake.lock
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-backend.cpp
#	ggml/src/ggml.c
#	pocs/vdot/q8dot.cpp
#	pocs/vdot/vdot.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grad0.cpp
#	tests/test-quantize-fns.cpp
#	tests/test-quantize-perf.cpp
#	tests/test-rope.cpp
2024-11-04 16:54:53 +08:00
Concedo
c7e351bf41 add exception for ibm granite, then keep using f16 kq mul for HIPBLAS only for now pending ROCM investigation re https://github.com/ggerganov/llama.cpp/pull/10015 2024-11-04 15:47:13 +08:00
Diego Devesa
9f40989351
ggml : move CPU backend to a separate file (#10144) 2024-11-03 19:34:08 +01:00
Concedo
33721615b5 fixed build issues 2024-11-03 11:01:51 +08:00
Concedo
bc30ebd044 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	examples/CMakeLists.txt
#	examples/main/README.md
#	ggml/src/CMakeLists.txt
#	ggml/src/kompute-shaders/common.comp
#	scripts/sync-ggml.last
#	src/llama.cpp
2024-11-02 21:57:29 +08:00
Concedo
223c5f0844 clblast survived 2024-11-02 21:51:38 +08:00
Georgi Gerganov
1926d6e39d
llama : adjust default context size + print warnings (#10136)
* llama : adjust default context size + print warnings

ggml-ci

* ggml-ci : add missing gpu-layers + adjust context sizes
2024-11-02 15:18:56 +02:00
Concedo
3072db6895 remove annoying eog prints 2024-11-02 12:44:33 +08:00
Diego Devesa
e991e3127f
llama : use smart pointers for ggml resources (#10117) 2024-11-01 23:48:26 +01:00
Diego Devesa
85679d37f3
llama : improve output buffer type selection (#10098) 2024-11-01 00:49:53 +01:00
Diego Devesa
1e9f94994e
quantize : fix --keep-split (#10114) 2024-11-01 00:45:34 +01:00
Diego Devesa
c02e5ab2a6
llama : fix buffer checks for mamba and rwk (#10111)
* llama : fix buffer checks for mamba and rwk

* llama : fix missing worst case flag during reserve

* cuda : fix supports_op for norm

* disable sched SET_CAUSE
2024-10-31 22:54:23 +01:00
Zhenwei Jin
ab3d71f97f
loader: refactor tensor weights storage (#9935)
* loader: refactor tensor weights storage

* use sorted map, sort weights by layer

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-10-31 19:50:39 +01:00
Concedo
a46f8acd03 note: also has support for completion tokens count 2024-11-01 00:44:14 +08:00
Diego Devesa
dea5e86051
ggml : check tensor name lengths in gguf files (#10100) 2024-10-31 11:40:59 +01:00
Diego Devesa
c5b0f4b5d9
llama : refactor model loader with backend registry (#10026) 2024-10-30 02:01:23 +01:00