Commit graph

6004 commits

Author SHA1 Message Date
Georgi Gerganov
d197545530
llama : bump max layers from 256 to 512 (#8530)
* llama : bump max layers from 256 to 512

* llama : replace asserts with exceptions
2024-07-19 16:50:47 +03:00
Georgi Gerganov
be0cfb4175
readme : fix server badge 2024-07-19 14:34:55 +03:00
Clint Herron
b57eb9ca4f
ggml : add friendlier error message to fopen errors (#8575)
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
2024-07-19 14:05:45 +03:00
Frank Mai
f299aa98ec
fix: typo of chatglm4 chat tmpl (#8586)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-07-19 11:44:41 +02:00
Concedo
1a23d49c32 serve tags endpoint 2024-07-19 16:08:54 +08:00
Brian
3d0e4367d9
convert-*.py: add general.name kv override (#8571) 2024-07-19 17:51:51 +10:00
Concedo
24b9616344 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-cli.Dockerfile
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	CMakeLists.txt
#	CONTRIBUTING.md
#	Makefile
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	requirements.txt
#	src/llama.cpp
#	tests/test-backend-ops.cpp
2024-07-19 14:23:33 +08:00
Johannes Gäßler
a15ef8f8a0
CUDA: fix partial offloading for ne0 % 256 != 0 (#8572) 2024-07-18 23:48:47 +02:00
Concedo
a998588f3a improved estimation 2024-07-19 00:20:11 +08:00
65a
705b7ecf60
cmake : install all ggml public headers (#8480)
Co-authored-by: 65a <65a@65a.invalid>
2024-07-18 17:47:12 +03:00
Concedo
caab9cb8ae fixed unwanted removal 2024-07-18 22:27:22 +08:00
BBC-Esq
621801da0e
Streamline misc (#1007)
* fix typo and streamline a little

* streamline togglehorde

* oops
2024-07-18 22:25:38 +08:00
Concedo
8b0a9f7e56 remove keys, use tuple 2024-07-18 22:11:13 +08:00
BBC-Esq
7de1ebf897
Streamline with dictionaries (#1005)
* dictionary #1

* dictionary #2
2024-07-18 22:05:30 +08:00
BBC-Esq
ce971a0f3d
Streamline with fstrings (#1006)
* fstring #1

* fstring #2
2024-07-18 21:48:46 +08:00
Eric Zhang
0d2c7321e9
server: use relative routes for static files in new UI (#8552)
* server: public: fix api_url on non-index pages

* server: public: use relative routes for static files in new UI
2024-07-18 12:43:49 +02:00
Brian
672a6f1018
convert-*.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499)
The main change is that the default output filename will take this form

{name}{parameters}{finetune}{version}{encoding}{kind}

In addition, this adds and removes some entries in the KV store and introduces a metadata class with automatic heuristics to derive some values from the model card content

* No Change:
  - Internal GGUF Spec
    - `general.architecture`
    - `general.quantization_version`
    - `general.alignment`
    - `general.file_type`
  - General Model Details
    - `general.name`
    - `general.author`
    - `general.version`
    - `general.description`
  - Licensing details
    - `general.license`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.url`
  - Model Source during conversion
    - `general.source.url`

* Removed:
  - Model Source during conversion
    - `general.source.huggingface.repository`

* Added:
  - General Model Details
    - `general.organization`
    - `general.finetune`
    - `general.basename`
    - `general.quantized_by`
    - `general.size_label`
  - Licensing details
    - `general.license.name`
    - `general.license.link`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.doi`
    - `general.uuid`
    - `general.repo_url`
  - Model Source during conversion
    - `general.source.doi`
    - `general.source.uuid`
    - `general.source.repo_url`
  - Base Model Source
    - `general.base_model.count`
    - `general.base_model.{id}.name`
    - `general.base_model.{id}.author`
    - `general.base_model.{id}.version`
    - `general.base_model.{id}.organization`
    - `general.base_model.{id}.url` (Model Website/Paper)
    - `general.base_model.{id}.doi`
    - `general.base_model.{id}.uuid`
    - `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...))
  - Array based KV stores
    - `general.tags`
    - `general.languages`
    - `general.datasets`

---------

Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-18 20:40:15 +10:00
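For illustration, here is a minimal Python sketch of how a filename following the convention above could be assembled. The helper name, the dash separator, and the field handling are assumptions made for this example, not the actual convert-*.py implementation:

```python
# Illustrative sketch only: builds a filename in the spirit of
# {name}{parameters}{finetune}{version}{encoding}{kind}.
# The separator and field handling are assumptions, not the behaviour
# of the real convert-*.py scripts.
def make_output_filename(name, size_label=None, finetune=None,
                         version=None, encoding=None, kind="gguf"):
    parts = [name.strip().replace(" ", "-")]
    for field in (size_label, finetune, version, encoding):
        if field:
            parts.append(str(field).strip().replace(" ", "-"))
    return "-".join(parts) + "." + kind

# make_output_filename("Mixtral", "8x7B", "Instruct", "v0.1", "Q4_K_M")
# -> "Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf"
```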
RunningLeon
3807c3de04
server : respect --special cli arg (#8553) 2024-07-18 11:06:22 +03:00
Concedo
6080fa38ce updated lite 2024-07-18 15:55:45 +08:00
Concedo
90c1bbbcb9 more url download support 2024-07-18 11:56:05 +08:00
Johannes Gäßler
e02b597be3
lookup: fibonacci hashing, fix crashes (#8548) 2024-07-17 23:35:44 +02:00
Al Mochkin
b3283448ce
build : Fix docker build warnings (#8535) (#8537) 2024-07-17 20:21:55 +02:00
Concedo
ad86b1aeb8 Implemented Kcpp Launch Templates (+1 squashed commit)
Squashed commits:

[5ea4c1de] wip integrating skcpps templates (+1 squashed commits)

Squashed commits:

[737daa7f] skcpps wip
2024-07-18 00:22:59 +08:00
Brian
30f80ca0bc
CONTRIBUTING.md : remove mention of noci (#8541) 2024-07-17 17:57:06 +03:00
Concedo
8ccc0144d2 ability to set -1 as gpulayers and determine at runtime (+1 squashed commit)
Squashed commits:

[594263c3] ability to set -1 as gpulayers and determine at runtime
2024-07-17 20:31:19 +08:00
hipudding
1bdd8ae19f
[CANN] Add Ascend NPU backend (#6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-07-17 14:23:50 +03:00
Concedo
869e30a6a0 Updated CLInfo from https://github.com/Oblomov/clinfo
https://ci.appveyor.com/api/projects/oblomov/clinfo/artifacts/clinfo.exe?job=platform%3a+x64
2024-07-17 19:20:17 +08:00
Concedo
6c883a4803 dummy skcpps format 2024-07-17 18:35:27 +08:00
Concedo
eca7521c13 allowed embedded chat adapters 2024-07-17 18:08:43 +08:00
Masaya, Kato
da3913d8f9
batched: fix n_predict parameter (#8527) 2024-07-17 10:34:28 +03:00
Georgi Gerganov
d65a8361fe
llama : disable context-shift for DeepSeek v2 (#8501) 2024-07-17 10:32:59 +03:00
Concedo
5988243aee fix wrong order, fix llava debug mode failure 2024-07-17 15:30:19 +08:00
Johannes Gäßler
5e116e8dd5
make/cmake: add missing force MMQ/cuBLAS for HIP (#8515) 2024-07-16 21:20:59 +02:00
Concedo
e99fa531a2 reorder items 2024-07-17 00:28:48 +08:00
Concedo
d775a419b2 updated lite with chat inject, added layer detect, added more console logging 2024-07-16 23:10:15 +08:00
Brian
1666f92dcd
gguf-hash : update clib.json to point to original xxhash repo (#8491)
* Update clib.json to point to Cyan4973 original xxhash

Convinced Cyan4973 to add clib.json directly to his repo, so the clib package can now point directly to it. Previously it pointed to my fork, which carried the clib.json package metadata

https://github.com/Cyan4973/xxHash/pull/954

* gguf-hash: readme update to point to Cyan4973 xxHash repo [no ci]
2024-07-16 10:14:16 +03:00
Steve Bonds
37b12f92ab
export-lora : handle help argument (#8497)
The --help option on export-lora isn't accepted as valid. The help still gets displayed by default, but the script exits with an error message and nonzero status.
2024-07-16 10:04:45 +03:00
Georgi Gerganov
0efec57787
llama : valign + remove unused ftype (#8502) 2024-07-16 10:00:30 +03:00
compilade
7acfd4e8d5
convert_hf : faster lazy safetensors (#8482)
* convert_hf : faster lazy safetensors

This makes '--dry-run' much, much faster.

* convert_hf : fix memory leak in lazy MoE conversion

The '_lazy' queue was sometimes self-referential,
which caused reference cycles of objects old enough
to avoid garbage collection until potential memory exhaustion.
2024-07-15 23:13:10 -04:00
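To make the failure mode above concrete, here is a generic Python sketch (not the actual convert_hf code; the class names are invented for this example). A queue that is strongly referenced by the objects it holds forms a reference cycle, so those objects are never freed by reference counting alone and survive until the cycle collector gets to them; a weak back-reference breaks the cycle:

```python
# Generic illustration of a self-referential queue, not the convert_hf code.
import weakref

class LazyTensor:
    def __init__(self, name, queue):
        self.name = name
        # A strong back-reference (self.queue = queue) would create the cycle
        # queue -> tensor -> queue; a weak reference avoids it.
        self.queue = weakref.ref(queue)

class LazyQueue:
    def __init__(self):
        self.pending = []

    def push(self, name):
        self.pending.append(LazyTensor(name, self))

q = LazyQueue()
q.push("blk.0.attn_q.weight")
print(q.pending[0].queue() is q)  # True: the queue is reachable, but only weakly
```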
Xuan Son Nguyen
97bdd26eee
Refactor lora adapter support (#8332)
* lora: load to device buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-15 20:50:47 +02:00
Xuan Son Nguyen
4db8f60fe7
fix ci (#8494) 2024-07-15 19:23:10 +02:00
Concedo
a441c27cb5 fixed broken link 2024-07-16 01:00:16 +08:00
Concedo
e707ab9025 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/development/HOWTO-add-model.md
#	docs/development/token_generation_performance_tips.md
#	flake.lock
2024-07-16 00:49:34 +08:00
Concedo
516fd35e93 error popups on python exits 2024-07-16 00:46:32 +08:00
Concedo
8412946b9f fix oldcpu build avx1 2024-07-15 23:42:22 +08:00
Concedo
21179d675b try ci for avx1, up ver (+2 squashed commits)
Squashed commit:

[74150175] up version

[97b6163c] try ci for avx1 linux
2024-07-15 23:07:07 +08:00
Daniel Bevenius
8fac431b06
ggml : suppress unknown pragma 'GCC' on windows (#8460)
This commit adds a macro guard around the GCC pragma to avoid the following
warning on Windows:

```console
C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068:
unknown pragma 'GCC' [C:\lama.cpp\build\ggml\src\ggml.vcxproj]
```
2024-07-15 15:48:17 +03:00
M-A
f17f39ff9c
server: update README.md with llama-server --help output [no ci] (#8472)
The README.md had stale information. In particular, the --ctx-size
"defaults to 512" confused me and I had to check the code to confirm
this was false. Since the server is evolving rapidly, it's probably
better to keep the source of truth in a single place (the source) and
generate the README.md based on that.

Did:

    make llama-server
    ./llama-server --help > t.txt
    vimdiff t.txt examples/server/README.md

I copied the content inside a backquote block. I would have preferred
proper text, but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow-up could be to
automate this process with a script.

No functional change.
2024-07-15 15:04:56 +03:00
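A minimal Python sketch of the suggested follow-up; the HELP BEGIN/END markers (and the assumption that the README carries them) are invented for this example and are not part of the repository:

```python
# Hypothetical follow-up: refresh the help section of
# examples/server/README.md from `./llama-server --help`.
# The HELP BEGIN/END markers are an assumption made for this sketch.
import re
import subprocess
from pathlib import Path

readme = Path("examples/server/README.md")
help_text = subprocess.run(
    ["./llama-server", "--help"], capture_output=True, text=True, check=True
).stdout

fence = "`" * 3  # built dynamically to avoid nesting a literal fence here
block = f"<!-- HELP BEGIN -->\n{fence}\n{help_text.rstrip()}\n{fence}\n<!-- HELP END -->"

updated = re.sub(
    r"<!-- HELP BEGIN -->.*?<!-- HELP END -->",
    lambda _: block,  # callable replacement keeps backslashes in the help text literal
    readme.read_text(),
    flags=re.DOTALL,
)
readme.write_text(updated)
```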
Georgi Gerganov
9104bc20ed
common : add --no-cont-batching arg (#6358) 2024-07-15 14:54:58 +03:00
NikolaiLyssogor
fc690b018e
docs: fix links in development docs [no ci] (#8481)
Fixes a few links within the repo that were broken in the reorganization of the documentation in #8325.
2024-07-15 14:46:39 +03:00