Concedo
f75bbb945f
speculative decoding initial impl completed (+6 squashed commit)
...
Squashed commit:
[0a6306ca0] draft wip dont use (will be squashed)
[a758a1c9c] wip dont use (will be squashed)
[e1994d3ce] wip dont use
[f59690d68] wip
[77228147d] wip on spec decoding. dont use yet
[2445bca54] wip adding speculative decoding (+1 squashed commits)
Squashed commits:
[50e341bb7] wip adding speculative decoding
2024-11-30 10:41:10 +08:00
Concedo
b9e99c69e8
fixed build
2024-11-26 22:06:55 +08:00
Concedo
9708abf282
avoid confusing users.
2024-11-26 21:28:20 +08:00
Concedo
fdfb9b33d7
updated lite
2024-11-26 21:05:28 +08:00
Concedo
288cd4cf10
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# .github/workflows/nix-ci.yml
# .github/workflows/python-lint.yml
# CMakeLists.txt
# common/CMakeLists.txt
# examples/CMakeLists.txt
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# src/CMakeLists.txt
2024-11-26 21:04:24 +08:00
Georgi Gerganov
ab96610b1e
cmake : enable warnings in llama ( #10474 )
...
* cmake : enable warnings in llama
ggml-ci
* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS
* cmake : get_flags -> ggml_get_flags
* speculative-simple : fix warnings
* cmake : reuse ggml_get_flags
ggml-ci
* speculative-simple : fix compile warning
ggml-ci
2024-11-26 14:18:08 +02:00
Diego Devesa
7db3846a94
ci : publish the docker images created during scheduled runs ( #10515 )
2024-11-26 13:05:20 +01:00
Diego Devesa
c6807b3f28
ci : add ubuntu cuda build, build with one arch on windows ( #10456 )
2024-11-26 13:05:07 +01:00
Charles Xu
25669aa92c
ggml-cpu: cmake add arm64 cpu feature check for macos ( #10487 )
...
* ggml-cpu: cmake add arm64 cpu feature check for macos
* use vmmlaq_s32 for compile option i8mm check
2024-11-26 13:37:05 +02:00
Georgi Gerganov
84e1c33cde
server : fix parallel speculative decoding ( #10513 )
...
ggml-ci
2024-11-26 13:36:40 +02:00
Georgi Gerganov
811872a59d
speculative : simplify the implementation ( #10504 )
...
ggml-ci
2024-11-26 12:29:38 +02:00
Shanshan Shen
9a4b79bcfa
CANN: Improve the Inferencing Performance for Ascend NPU Device ( #10454 )
...
* improve inferencing performance for ascend npu.
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
* some modification after review
* some modifications after review
* restore some modifications
* restore some modifications
---------
Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
2024-11-26 18:08:37 +08:00
Chenguang Li
7066b4cce2
CANN: RoPE and CANCAT operator optimization ( #10488 )
...
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-26 17:31:05 +08:00
Concedo
ec581b19d8
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/workflows/build.yml
# .github/workflows/docker.yml
# CMakeLists.txt
# Makefile
# Package.swift
# examples/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/llama-bench/llama-bench.cpp
# examples/server/README.md
# examples/server/server.cpp
# examples/simple-chat/simple-chat.cpp
# examples/simple/simple.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-amx/CMakeLists.txt
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-kompute/CMakeLists.txt
# ggml/src/ggml-kompute/ggml-kompute.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# pocs/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2024-11-26 17:01:20 +08:00
Concedo
5b35790343
move nix example to a standalone file (+1 squashed commits)
...
Squashed commits:
[fef38bab5] move nix example to a standalone file
2024-11-26 16:32:26 +08:00
DontEatOreo
d82d7b60e6
Refactor Nix section ( #1235 )
...
* README.md: remove `nix{3-run,shell}` example
* README.md: better re-word NVIDIA CUDA section for nix
* README.md: remove unneed section in nix
* README.md: add section to open issue regarding nix version on nixpkgs
* README.md: add clarification on how to add KoboldCpp to `home-manager`
* README.md: add example in nix section
2024-11-26 16:18:45 +08:00
kallewoof
fd320f6682
/props endpoint: provide context size through default_generation_settings ( #1237 )
2024-11-26 16:15:27 +08:00
Junil Kim
0eb4e12bee
vulkan: Fix a vulkan-shaders-gen arugment parsing error ( #10484 )
...
The vulkan-shaders-gen was not parsing the --no-clean argument correctly.
Because the previous code was parsing the arguments which have a value only
and the --no-clean argument does not have a value, it was not being parsed
correctly. This commit can now correctly parse arguments that don't have values.
2024-11-26 01:47:20 +00:00
Eric Curtin
0cc63754b8
Introduce llama-run ( #10291 )
...
It's like simple-chat but it uses smart pointers to avoid manual
memory cleanups. Less memory leaks in the code now. Avoid printing
multiple dots. Split code into smaller functions. Uses no exception
handling.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-11-25 22:56:24 +01:00
Diego Devesa
50d5cecbda
ci : build docker images only once daily ( #10503 )
2024-11-25 22:05:39 +01:00
Georgi Gerganov
9fd8c2687f
server : add more information about error ( #10455 )
2024-11-25 22:28:59 +02:00
Georgi Gerganov
47f931c8f9
server : enable cache_prompt by default ( #10501 )
...
ggml-ci
2024-11-25 21:50:07 +02:00
Georgi Gerganov
106964e3d2
metal : enable mat-vec kernels for bs <= 4 ( #10491 )
2024-11-25 21:49:31 +02:00
Shane A
80acb7b430
Rename Olmo1124 to Olmo2 ( #10500 )
2024-11-25 19:36:09 +01:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model ( #10497 )
...
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
Johannes Gäßler
1f922254f0
Github: update issue templates [no ci] ( #10489 )
2024-11-25 19:18:37 +01:00
brucepro
a9a678a6b2
Add download chat feature to server chat ( #10481 )
...
* Add download chat feature to server chat
Add a download feature next to the delete chat feature in the server vue chat interface.
* code style
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-25 17:11:55 +01:00
Concedo
6bd686c8a8
fix pyinstaller for windows
2024-11-25 23:20:23 +08:00
Georgi Gerganov
9ca2e67762
server : add speculative decoding support ( #10455 )
...
* server : add speculative decoding support
ggml-ci
* server : add helper function slot.can_speculate()
ggml-ci
2024-11-25 16:31:38 +02:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends ( #10469 )
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Georgi Gerganov
f6d12e7df8
tests : fix compile warning
2024-11-25 15:17:32 +02:00
Georgi Gerganov
b756441104
metal : minor code formatting
2024-11-25 15:08:04 +02:00
Neo Zhang Jianyu
5a8987793f
[SYCL] Fix building Win package for oneAPI 2025.0 update ( #10483 )
...
* fix build package for 2025.0
* debug
* debug
* fix
* rm debug
---------
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-25 17:31:10 +08:00
Concedo
83350ec314
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# .github/workflows/build.yml
# Makefile
# common/CMakeLists.txt
# examples/CMakeLists.txt
# examples/infill/infill.cpp
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/parallel/parallel.cpp
# examples/retrieval/retrieval.cpp
# examples/save-load-state/save-load-state.cpp
# examples/speculative/speculative.cpp
# flake.lock
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/kernels/CMakeLists.txt
# ggml/src/ggml-cann/kernels/dup.cpp
# ggml/src/ggml-cann/kernels/get_row_f16.cpp
# ggml/src/ggml-cann/kernels/get_row_f32.cpp
# ggml/src/ggml-cann/kernels/get_row_q4_0.cpp
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2024-11-25 16:26:08 +08:00
Concedo
a7f161da95
updated lite
2024-11-25 16:15:59 +08:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
Georgi Gerganov
cce5a90075
flake.lock: Update ( #10470 )
...
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/5e4fbfb6b3de1aa2872b76d49fafc942626e2add?narHash=sha256-OZiZ3m8SCMfh3B6bfGC/Bm4x3qc1m2SVEAlkV6iY7Yg%3D' (2024-11-15)
→ 'github:NixOS/nixpkgs/23e89b7da85c3640bbc2173fe04f4bd114342367?narHash=sha256-y/MEyuJ5oBWrWAic/14LaIr/u5E0wRVzyYsouYY3W6w%3D' (2024-11-19)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-24 08:03:25 -08:00
Diego Devesa
dc39012cba
llama : fix op mul check with command-r-plus ( #10476 )
2024-11-24 16:10:26 +01:00
Gabe Goodhart
9336db462c
convert : XLMRoberta Type Vocab Size ( #10458 )
...
This matches the key in common bert-based embedding models and may have a
value other than 1 in it.
Branch: XLMRobertaTypeVocabSize
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-24 11:02:34 +02:00
Concedo
1e0792a3ef
comfyui emulation also done
2024-11-24 15:39:03 +08:00
Concedo
9bd27323e7
emulate comfyui txt2img
2024-11-24 11:28:12 +08:00
momonga
96fa2c5e2d
fix gguf-py: Conversion error when multiple licenses are configured ( #9807 )
...
* fix general.license list to str
* fix join license list
---------
Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
2024-11-24 01:09:22 +01:00
Concedo
bf28d956ae
ollama chat api done
2024-11-24 00:10:15 +08:00
Concedo
62dde8cfb2
ollama sync completions mostly working. stupid api.
2024-11-23 23:31:37 +08:00
Concedo
2c1a06a07d
wip ollama emulation, added detokenize endpoint
2024-11-23 22:48:03 +08:00
Diego Devesa
55ed008b2d
ggml : do not use ARM features not included in the build ( #10457 )
2024-11-23 14:41:12 +01:00
Concedo
c0da7e4dcf
multiplayer activity tracking
2024-11-23 19:59:55 +08:00
Concedo
116879144c
better error messages
2024-11-23 18:55:01 +08:00
Concedo
fd073fc904
try fix ci builds
2024-11-23 18:37:09 +08:00
Concedo
afc575fbd8
cleanup, try to add version tagging
2024-11-23 12:59:06 +08:00