Mirror of https://github.com/kvcache-ai/ktransformers.git
Synced 2025-09-10 23:34:35 +00:00
Merge branch 'main' into main
Commit ca1dc1e7d1
94 changed files with 4366 additions and 703 deletions
19  .devcontainer/Dockerfile  Normal file
@@ -0,0 +1,19 @@
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel as compile_server
WORKDIR /workspace
ENV CUDA_HOME /usr/local/cuda
RUN <<EOF
apt update -y && apt install -y --no-install-recommends \
    git \
    wget \
    vim \
    gcc \
    g++ \
    cmake &&
rm -rf /var/lib/apt/lists/* &&
pip install --upgrade pip &&
pip install ninja pyproject numpy cpufeature &&
pip install flash-attn &&
cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /opt/conda/lib/
EOF
# Set the default shell to bash
CMD ["/bin/bash"]
34  .devcontainer/devcontainer.json  Normal file
@@ -0,0 +1,34 @@
{
    "name": "Ktrans Dev Container",
    "privileged": true,
    "build": {
        "dockerfile": "Dockerfile",
        "context": "..",
        "args": {
            "http_proxy": "${env:http_proxy}",
            "https_proxy": "${env:https_proxy}",
        }
    },
    "runArgs": [
        "--network=host",
        "--gpus",
        "all"
        // "--gpu all"
    ],
    "workspaceFolder": "/workspace",
    "workspaceMount": "source=${localWorkspaceFolder},target=/workspace,type=bind,consistency=cached",
    "mounts": [
        "source=/mnt/data,target=/mnt/incontainer,type=bind,consistency=cached"
    ],
    "customizations": {
        "vscode": {
            "extensions": [
            ],
            "settings": {
                "terminal.integrated.shell.linux": "/bin/bash",
                "cmake.configureOnOpen": true,
                "cmake.generator": "Ninja"
            }
        }
    }
}
39  .github/ISSUE_TEMPLATE/-bug-.yaml  vendored  Normal file
@@ -0,0 +1,39 @@
name: 🐞 Bug report
description: Create a report to help us reproduce and fix the bug
title: "[Bug] "
labels: ['Bug']

body:
- type: checkboxes
  attributes:
    label: Checklist
    options:
      - label: 1. I have searched related issues but cannot get the expected help.
      - label: 2. The bug has not been fixed in the latest version.
      - label: 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
      - label: 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
      - label: 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
- type: textarea
  attributes:
    label: Describe the bug
    description: A clear and concise description of what the bug is.
  validations:
    required: true
- type: textarea
  attributes:
    label: Reproduction
    description: |
      What command or script did you run? Which **model** are you using?
    placeholder: |
      A placeholder for the command.
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment
    description: |
      Please provide the necessary environment information here (e.g. OS/GPU/CPU). Otherwise the issue will be closed.
    placeholder: Environment here.
  validations:
    required: true
39  .github/ISSUE_TEMPLATE/-bug2-.yaml  vendored  Normal file
@@ -0,0 +1,39 @@
name: 🐞 BUG报告
description: 创建报告以帮助我们复现并修复BUG
title: "[Bug] "
labels: ['Bug']

body:
- type: checkboxes
  attributes:
    label: 检查清单
    options:
      - label: 1. 我已经搜索过相关问题,但未能获得预期的帮助
      - label: 2. 该问题在最新版本中尚未修复
      - label: 3. 请注意,如果您提交的BUG相关 issue 缺少对应环境信息和最小可复现示例,我们将难以复现和定位问题,降低获得反馈的可能性
      - label: 4. 如果您提出的不是bug而是问题,请在讨论区发起讨论 https://github.com/kvcache-ai/ktransformers/discussions。否则该 issue 将被关闭
      - label: 5. 为方便社区交流,我将使用中文/英文或附上中文/英文翻译(如使用其他语言)。未附带翻译的非中文/英语内容可能会被关闭
- type: textarea
  attributes:
    label: 问题描述
    description: 清晰简洁地描述BUG是什么
  validations:
    required: true
- type: textarea
  attributes:
    label: 复现步骤
    description: |
      你运行了什么命令或脚本?使用的是哪个**模型**?
    placeholder: |
      在此处填写命令
  validations:
    required: true
- type: textarea
  attributes:
    label: 环境信息
    description: |
      请提供必要的环境信息(如操作系统/GPU/CPU),否则该 issue 将被关闭
    placeholder: 在此处填写环境信息
  validations:
    required: true
23  .github/ISSUE_TEMPLATE/-feature-.yaml  vendored  Normal file
@@ -0,0 +1,23 @@
name: 🚀 Feature request
description: Suggest an idea for this project
title: "[Feature] "

body:
- type: checkboxes
  attributes:
    label: Checklist
    options:
      - label: 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
      - label: 2. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-English/Chinese content without translation may be closed.
- type: textarea
  attributes:
    label: Motivation
    description: |
      A clear and concise description of the motivation for the feature.
  validations:
    required: true
- type: textarea
  attributes:
    label: Related resources
    description: |
      If there is an official code release or third-party implementations, please also provide that information here, which would be very helpful.
23  .github/ISSUE_TEMPLATE/-feature2-.yaml  vendored  Normal file
@@ -0,0 +1,23 @@
name: 🚀 新功能请求
description: 为项目提出新功能建议
title: "[Feature] "

body:
- type: checkboxes
  attributes:
    label: 检查清单
    options:
      - label: 1. 如果您提出的不是新功能而是问题,请在讨论区发起讨论 https://github.com/kvcache-ai/ktransformers/discussions。否则该 issue 将被关闭
      - label: 2. 为方便社区交流,我将使用中文/英文或附上英文/中文翻译(如使用其他语言)。未附带翻译的非英文/中文内容可能会被关闭
- type: textarea
  attributes:
    label: 需求背景
    description: |
      清晰简洁地描述该功能的背景需求
  validations:
    required: true
- type: textarea
  attributes:
    label: 相关资源
    description: |
      如果有官方代码实现或第三方实现,请在此提供相关信息,这将非常有帮助
98  .github/workflows/docker-image.yml  vendored  Normal file
@@ -0,0 +1,98 @@
name: DockerHub CI

on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      choose:
        description: 'Will you push the image to DockerHub? 0 for No, 1 for Yes'
        required: true
        default: '0'
        type: string

  # push:
  #   branches:
  #     - main

env:
  DOCKERHUB_REPO: ${{ secrets.DOCKERHUB_USERNAME }}/ktransformers

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        run: |
          if [ -f docker-compose.test.yml ]; then
            docker-compose --file docker-compose.test.yml build
            docker-compose --file docker-compose.test.yml run sut
          else
            docker build . --file Dockerfile
          fi

  docker_task:
    needs: test
    name: ${{ matrix.instruct }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          # for amd64
          - {instruct: "FANCY", platform: "linux/amd64"}
          - {instruct: "AVX512", platform: "linux/amd64"}
          - {instruct: "AVX2", platform: "linux/amd64"}
          - {instruct: "NATIVE", platform: "linux/amd64"}
          # for arm64
          - {instruct: "NATIVE", platform: "linux/arm64"}

    steps:
      - name: Move Docker data directory
        run: |
          sudo systemctl stop docker
          sudo mkdir -p /mnt/docker
          sudo rsync -avz /var/lib/docker/ /mnt/docker
          sudo rm -rf /var/lib/docker
          sudo ln -s /mnt/docker /var/lib/docker
          sudo systemctl start docker

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and push for amd64
        if: matrix.platform == 'linux/amd64'
        uses: docker/build-push-action@v6
        with:
          push: true
          platforms: |
            linux/amd64
          tags: |
            ${{ env.DOCKERHUB_REPO }}:latest-${{ matrix.instruct }}
            ${{ env.DOCKERHUB_REPO }}:${{ github.event.release.tag_name }}-${{ matrix.instruct }}
          build-args: |
            CPU_INSTRUCT=${{ matrix.instruct }}

      - name: Build and push for arm64
        if: matrix.platform == 'linux/arm64'
        uses: docker/build-push-action@v6
        with:
          push: true
          platforms: |
            linux/arm64
          tags: |
            ${{ env.DOCKERHUB_REPO }}:latest-${{ matrix.instruct }}
            ${{ env.DOCKERHUB_REPO }}:${{ github.event.release.tag_name }}-${{ matrix.instruct }}
          build-args: |
            CPU_INSTRUCT=${{ matrix.instruct }}
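The workflow above publishes one image per `CPU_INSTRUCT` value. As a quick way to decide which tag fits a given Linux host, one can inspect the CPU flags; a minimal sketch (the AVX512 > AVX2 > NATIVE preference order and the `<dockerhub_user>` placeholder are illustrative assumptions, not something the workflow itself encodes):

```python
# Minimal sketch: pick a CPU_INSTRUCT image tag based on /proc/cpuinfo flags (Linux only).
from pathlib import Path

def detect_cpu_instruct() -> str:
    flags = set()
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
    if "avx512f" in flags:
        return "AVX512"
    if "avx2" in flags:
        return "AVX2"
    return "NATIVE"  # fallback: use the generic image

if __name__ == "__main__":
    instruct = detect_cpu_instruct()
    print(f"docker pull <dockerhub_user>/ktransformers:latest-{instruct}")
```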
7  .gitignore  vendored
@@ -19,7 +19,10 @@ ktransformers/server/local_store/
 ktransformers/server_test1.db
 *.patch
 img/
-tmp1.txt
-test_65_300_1536.txt
+tmp*.txt
 test.txt
 book
+ktransformers/tests/chat_txt.txt
+mmlu_result*
+ktransformers/ktransformers_ext/cuda_musa/
+test_prompt.txt
@@ -10,7 +10,8 @@ EOF

-FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel as compile_server
+FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel as compile_server
+ARG CPU_INSTRUCT=NATIVE
 WORKDIR /workspace
 ENV CUDA_HOME /usr/local/cuda
 COPY --from=web_compile /home/ktransformers /workspace/ktransformers
@@ -26,10 +27,12 @@ rm -rf /var/lib/apt/lists/* &&
 cd ktransformers &&
 git submodule init &&
 git submodule update &&
+pip install --upgrade pip &&
 pip install ninja pyproject numpy cpufeature &&
 pip install flash-attn &&
-CPU_INSTRUCT=NATIVE KTRANSFORMERS_FORCE_BUILD=TRUE TORCH_CUDA_ARCH_LIST="8.0;8.6;8.7;8.9;9.0+PTX" pip install . --no-build-isolation --verbose &&
+CPU_INSTRUCT=${CPU_INSTRUCT} KTRANSFORMERS_FORCE_BUILD=TRUE TORCH_CUDA_ARCH_LIST="8.0;8.6;8.7;8.9;9.0+PTX" pip install . --no-build-isolation --verbose &&
-pip cache purge
+pip cache purge &&
+cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /opt/conda/lib/
 EOF

 ENTRYPOINT ["tail", "-f", "/dev/null"]
6  Makefile
@@ -18,4 +18,8 @@ dev_install:
 	echo "Installing ktransformers"
 	KTRANSFORMERS_FORCE_BUILD=TRUE pip install -e . -v --no-build-isolation
 	echo "Installation completed successfully"
+
+install_numa:
+	USE_NUMA=1 make dev_install
+
+install_no_numa:
+	env -u USE_NUMA make dev_install
@@ -23,7 +23,8 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 <h2 id="Updates">🔥 Updates</h2>

-* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%) (Up to 16 Tokens/s), update docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
+* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
+* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. For detailed show case and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
 * **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -103,7 +104,7 @@ Getting started with KTransformers is simple! Follow the steps below to set up a
 ### 📥 Installation

-To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/).
+To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).

 <h2 id="tutorial">📃 Brief Injection Tutorial</h2>
@@ -125,7 +126,7 @@ To utilize the provided kernels, users only need to create a YAML-based injectio
 ```python
 with torch.device("meta"):
     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
 ...
 generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
 ```
16  README_ZH.md
@@ -21,6 +21,8 @@ KTransformers 是一个以 Python 为中心的灵活框架,其核心是可扩
 <h2 id="Updates">🔥 更新</h2>

+* **2025 年 2 月 15 日**:为 DeepSeek-V3/R1 支持 [FP8 GPU 内核](./doc/en/fp8_kernel.md);支持更长的上下文([教程](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context))。
+* **2025 年 2 月 15 日**:长上下文(从 4K 到 8K,24GB VRAM)& 稍快的速度(+15%,最快 16 Tokens/s),文档请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md) 和 [在线指南](https://kvcache-ai.github.io/ktransformers/)。
 * **2025 年 2 月 10 日**:支持 Deepseek-R1 和 V3 在单个(24GB VRAM)/多 GPU 和 382G DRAM 上运行,速度提升高达 3~28 倍。详细教程请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md)。
 * **2024 年 8 月 28 日**:支持 InternLM2.5-7B-Chat-1M 模型下的 1M 上下文,使用 24GB 的 VRAM 和 150GB 的 DRAM。详细教程请参见 [这里](./doc/en/long_context_tutorial.md)。
 * **2024 年 8 月 28 日**:将 DeepseekV2 所需的 VRAM 从 21G 降低到 11G。
@@ -67,11 +69,11 @@ https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
 </p>

-<h3>在仅 24GB VRAM 的桌面上进行 1M 上下文本地推理</h3>
-<p align="center">
+<!-- <h3>在仅 24GB VRAM 的桌面上进行 1M 上下文本地推理</h3>
+<p align="center"> -->

-https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
+<!-- https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12 -->
+<!--
 * **1M 上下文 InternLM 2.5 7B**:以全 bf16 精度运行,使用 24GB VRAM 和 150GB DRAM,可在本地桌面设置中实现。在 1M "针在干草堆中" 测试中达到 92.88% 的成功率,在 128K NIAH 测试中达到 100%。

 <p align="center">
@@ -88,7 +90,7 @@ https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
 * **增强的速度**:使用稀疏注意力,通过 llamafile 内核实现 1M 上下文生成 16.91 tokens/s 的速度。这种方法比 llama.cpp 的全注意力方法快 10 倍以上。

-* **灵活的稀疏注意力框架**:提供了一个灵活的块稀疏注意力框架,用于 CPU 卸载解码。与 SnapKV、Quest 和 InfLLm 兼容。更多信息请参见 [这里](./doc/en/long_context_introduction.md)。
+* **灵活的稀疏注意力框架**:提供了一个灵活的块稀疏注意力框架,用于 CPU 卸载解码。与 SnapKV、Quest 和 InfLLm 兼容。更多信息请参见 [这里](./doc/en/long_context_introduction.md)。 -->

 <strong>更多高级功能即将推出,敬请期待!</strong>
@@ -115,7 +117,7 @@ KTransformers 的核心是一个用户友好的、基于模板的注入框架。
 ```python
 with torch.device("meta"):
     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
 ...
 generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
 ```
@@ -150,7 +152,7 @@ YAML 文件中的每个规则都有两部分:`match` 和 `replace`。`match`
 <h2 id="ack">致谢和贡献者</h2>

-KTransformer 的开发基于 Transformers 提供的灵活和多功能框架。我们还受益于 GGUF/GGML、Llamafile 和 Marlin 等高级内核。我们计划通过向上游贡献我们的修改来回馈社区。
+KTransformer 的开发基于 Transformers 提供的灵活和多功能框架。我们还受益于 GGUF/GGML、Llamafile、Marlin、sglang 和 flashinfer 等高级内核。我们计划通过向上游贡献我们的修改来回馈社区。

 KTransformer 由清华大学 <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> 小组的成员以及 <a href="http://approaching.ai/">Approaching.AI</a> 的成员积极维护和开发。我们欢迎新的贡献者加入我们,使 KTransformer 更快、更易于使用。
BIN  WeChatGroup.png
Binary file not shown. Size before: 829 KiB; size after: 258 KiB.
@@ -22,6 +22,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 <h2 id="Updates">🔥 Updates</h2>

+* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. The detailed tutorial is [here](./en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./en/long_context_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
@@ -5,11 +5,12 @@
 - [Installation Guide](en/install.md)

 # Tutorial
-- [Deepseek-R1/V3 Show Case](en/DeepseekR1_V3_tutorial.md)
+- [Deepseek-R1/V3 Show Case/Tutorial](en/DeepseekR1_V3_tutorial.md)
 - [Why KTransformers So Fast](en/deepseek-v2-injection.md)
 - [Injection Tutorial](en/injection_tutorial.md)
 - [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
-# Server (Temporary Deprecated)
+- [Use FP8 GPU Kernel](en/fp8_kernel.md)
+# Server
 - [Server](en/api/server/server.md)
 - [Website](en/api/server/website.md)
 - [Tabby](en/api/server/tabby.md)
@@ -19,4 +20,6 @@
 # FAQ
 - [FAQ](en/FAQ.md)
 # V3 Reproduction
 - [Success List](en/V3-success.md)
+# Benchmark
+- [Benchmark](en/benchmark.md)
@@ -16,6 +16,9 @@
 - [Memory consumptions:](#memory-consumptions)
 - [Benchmark results](#benchmark-results-2)
 - [How to Run](#how-to-run)
+  - [V0.2.2 longer context \& FP8 kernel](#v022-longer-context--fp8-kernel)
+    - [longer context](#longer-context)
+    - [FP8 kernel](#fp8-kernel)
 - [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
 - [Single socket version (32 cores)](#single-socket-version-32-cores)
 - [Dual socket version (64 cores)](#dual-socket-version-64-cores)
@@ -90,7 +93,7 @@ Integrated the highly efficient Triton MLA Kernel from the fantastic sglang proj
 "6 experts" case is part of V0.3's preview

-| Prompt | hi (2) | 1K (969) | 2K (1930) | 4K (3846) | llama.cpp (8 experts) |
+| Prompt | hi (2) | 1K (969) | 2K (1930) | 4K (3846) | 8K (7678) |
 | --- | --- | --- | --- | --- | --- |
 | Output length | 10tokens | 300tokens | 300tokens | 300tokens | 300tokens |
 | **6 experts V0.2.0** | | | | | |
@@ -154,6 +157,37 @@ the output quality doesn't change. But the speed of decoding and prefill
 is speed up which is inspiring. So our showcase makes use of this finding*

 ## How to Run
+### V0.2.2 longer context & FP8 kernel
+#### longer context
+To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
+
+Note: The latest MLA kernel in FlashInfer still has a few minor issues. They are continuously fixing them on the main branch. If you are using FlashInfer, please install it from the main source code.
+
+If you want to use long context (longer than 20K) for prefill, enable matrix-absorption MLA during the prefill phase, which will significantly reduce the size of the kv cache. Modify the yaml file like this:
+```
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      absorb_for_prefill: True # change this to True to enable long context (prefill may be slower)
+```
+
+If the VRAM is still insufficient, try reducing the `chunk_prefill_size` parameter (default is 8192) to further decrease the intermediate results during chunk prefill.
+
+#### FP8 kernel
+
+The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following works:
+- **FP8 GPU Kernel Integration**: FP8 linear layer acceleration kernels integrated in KTransformers
+- **Hybrid Quantization Architecture**:
+  - Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy)
+  - Experts modules retain GGML quantization (GGUF format, reside in CPU to save GPU memory)
+
+So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
+
+The detailed guide is [here](./fp8_kernel.md).
+
 ### V0.2 & V0.2.1 Showcase
 #### Single socket version (32 cores)
 Our local_chat test command is:
@@ -171,7 +205,7 @@ Attention! If you are testing R1 and it may skip thinking. So you can add arg: `
 #### Dual socket version (64 cores)

-Make suer before you install (use install.sh or `make dev_install`), setting the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall it with this env var set). You may check the doc [here](./install.md) for install details. <br>
+Make sure that before you install (using install.sh or `make dev_install`) you set the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall it with this env var set). You may check the doc [here](./install.md) for install details. <br>

 Test Command:
 ``` shell
@@ -226,6 +260,7 @@ Intel is currently the only CPU vendor that supports AMX-like instructions, whic
 ### Easier
 * Official Docker images to simplify installation
 * Fix the server integration for web API access
+* Fix the local chat only accepting a single-line prompt (currently \n begins generating the prompt)
 * Support for more quantization types, including the highly requested dynamic quantization from unsloth

 Stay tuned for more updates!
@@ -25,7 +25,7 @@ from https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
 1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value.
    2. server: Increase the `--cache_lens` to a larger value.
 2. Move more weights to the GPU.
-   Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml
+   Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml
 ```yaml
 - match:
     name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
@@ -39,11 +39,13 @@ from https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
 You can modify the layer range as you want, e.g. change `name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$"` to move more weights to the GPU (a quick way to check which modules a pattern matches is sketched below).

 > Note: The first matched rule in the yaml will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid.
+> Note: Currently, executing experts on the GPU will conflict with CUDA Graph. Without CUDA Graph, there will be a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization.
+> Note: KExpertsTorch is untested.
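As referenced above, a minimal way to sanity-check which module names a `match.name` regular expression will select is to test it against generated layer names. The sketch below is illustrative only; the pattern (layers 4 through 12 written as an explicit alternation) and the layer count are assumptions for the example, not values taken from a shipped rule file:

```python
import re

# Illustrative only: check which expert modules a match-rule pattern would select.
pattern = re.compile(r"^model\.layers\.([4-9]|1[0-2])\.mlp\.experts$")

layer_names = [f"model.layers.{i}.mlp.experts" for i in range(61)]
matched = [name for name in layer_names if pattern.match(name)]

print(f"{len(matched)} modules matched:")
for name in matched:
    print(" ", name)
```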
 ### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?

-Use the `--optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` to load the two optimized rule yaml file. You may also use it as an example to write your own 4/8 gpu optimized rule yaml file.
+Use `--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` to load the two-GPU optimized rule yaml file. You may also use it as an example to write your own 4/8-GPU optimized rule yaml file.

 > Note: The ktransformers multi-GPU strategy is pipeline parallelism, which does not speed up the model's inference. It is only for distributing the model's weights.

@@ -53,7 +55,7 @@ You have to set `--cpu_infer` to the number of cores you want to use. The more c

 ### Q: My DeepSeek-R1 model is not thinking.

-According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think true`.
+According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think True`.

 ### Q: Loading gguf error

@@ -61,9 +63,37 @@ Make sure you:
 1. Have the `gguf` file in the `--gguf_path` directory.
 2. The directory only contains gguf files from one model. If you have multiple models, you need to separate them into different directories.
 3. The folder name itself should not end with `.gguf`, e.g. `Deep-gguf` is correct, `Deep.gguf` is wrong.
+4. The file itself is not corrupted; you can verify this by checking that the sha256sum matches the one from huggingface, modelscope, or hf-mirror (see the sketch below).
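A minimal sketch of the checksum verification mentioned in point 4; the file path and expected digest below are placeholders to be replaced with your own values:

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB GGUF files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholders: use your local file and the checksum published on the model page.
gguf_file = Path("<your-model>.gguf")
expected = "<sha256 from huggingface / modelscope / hf-mirror>"

actual = sha256sum(gguf_file)
print("OK" if actual == expected else f"MISMATCH: {actual}")
```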
 ### Q: Version `GLIBCXX_3.4.30' not found
 The detailed error:
 >ImportError: /mnt/data/miniconda3/envs/xxx/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/xxx/xxx/ktransformers/./cpuinfer_ext.cpython-312-x86_64-linux-gnu.so)

-It may be because your conda env does not have this version. You can first exit your conda env with `conda deactivate`, use `whereis libstdc++.so.6` to find the path, then re-enter your conda env and copy the .so with `cp <path of outer libstdc++> <path of your conda env libstdc++>`.
+Running `conda install -c conda-forge libstdcxx-ng` can solve the problem.

+### Q: When running the bfloat16 moe model, the data shows NaN
+The detailed error:
+```shell
+Traceback (most recent call last):
+  File "/root/ktransformers/ktransformers/local_chat.py", line 183, in <module>
+    fire.Fire(local_chat)
+  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 135, in Fire
+    component_trace = _Fire(component, args, parsed_flag_args, context, name)
+  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 468, in _Fire
+    component, remaining_args = _CallAndUpdateTrace(
+  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
+    component = fn(*varargs, **kwargs)
+  File "/root/ktransformers/ktransformers/local_chat.py", line 177, in local_chat
+    generated = prefill_and_generate(
+  File "/root/ktransformers/ktransformers/util/utils.py", line 204, in prefill_and_generate
+    next_token = decode_one_tokens(cuda_graph_runner, next_token.unsqueeze(0), position_ids, cache_position, past_key_values, use_cuda_graph).to(torch_device)
+  File "/root/ktransformers/ktransformers/util/utils.py", line 128, in decode_one_tokens
+    next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
+RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
+```
+
+**SOLUTION**: The issue of running ktransformers on Ubuntu 22.04 is caused by the system's g++ version being too old, whose pre-defined macros do not include avx_bf16. We have tested and confirmed that it works with g++ 11.4 on Ubuntu 22.04.
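Related to the solution above, a small sketch for checking whether the active g++ is new enough; the "at least 11" threshold mirrors the g++ 11.4 version mentioned and should be treated as a rule of thumb rather than an exact requirement:

```python
import re
import shutil
import subprocess

# Rule-of-thumb check based on the FAQ entry above: g++ 11.x is known to work,
# while very old toolchains lack the bf16-related predefined macros.
gxx = shutil.which("g++")
if gxx is None:
    print("g++ not found; install build-essential first.")
else:
    out = subprocess.run([gxx, "-dumpfullversion"], capture_output=True, text=True).stdout.strip()
    major = int(re.match(r"(\d+)", out).group(1))
    print(f"g++ {out}: {'looks OK' if major >= 11 else 'probably too old for bf16 CPU kernels'}")
```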
+### Q: Using fp8 prefill is very slow.
+
+The FP8 kernel is built by JIT, so the first run will be slow. The subsequent runs will be faster.
@@ -8,6 +8,20 @@ This document provides the necessary steps to set up and run the web service for
 Before you can compile the web code, make sure you have installed [Node.js](https://nodejs.org) version 18.3 or higher.

+Note: The version of Node.js in the Ubuntu or Debian GNU/Linux software repositories is too old and causes compilation errors. Users can instead install Node.js from the NodeSource repository, provided they uninstall the outdated version first.
+
+```bash
+# sudo apt-get remove nodejs npm -y && sudo apt-get autoremove -y
+sudo apt-get update -y && sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
+curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/nodesource.gpg
+sudo chmod 644 /usr/share/keyrings/nodesource.gpg
+echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/nodesource.gpg] https://deb.nodesource.com/node_23.x nodistro main" | sudo tee /etc/apt/sources.list.d/nodesource.list
+sudo apt-get update -y
+sudo apt-get install nodejs -y
+```
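To confirm that the installed Node.js meets the 18.3 requirement before compiling, a quick check can help; a minimal sketch that only parses the `node --version` output:

```python
import shutil
import subprocess

# Quick check that the installed Node.js satisfies the >= 18.3 requirement above.
node = shutil.which("node")
if node is None:
    raise SystemExit("node not found on PATH")

version = subprocess.run([node, "--version"], capture_output=True, text=True).stdout.strip()  # e.g. "v23.1.0"
major, minor = (int(x) for x in version.lstrip("v").split(".")[:2])
ok = (major, minor) >= (18, 3)
print(f"Node.js {version}: {'OK' if ok else 'too old, upgrade before compiling the website'}")
```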
 Once npm is installed, navigate to the `ktransformers/website` directory:

 ```bash
59  doc/en/benchmark.md  Normal file
@@ -0,0 +1,59 @@
## Benchmark

To conduct a quick and convenient check, we have employed a simple Python script available [here](https://github.com/kvcache-ai/ktransformers/tree/main/ktransformers/tests) to assess the precision of our **[ktransformers](https://github.com/kvcache-ai/ktransformers)** project. For this evaluation, we utilized the same dataset, shuffled in a consistent manner and limited to the first 1,000 data points, to test our implementation across a variety of CPU kernels, MLA kernels, and quantization formats.

We selected the DeepSeek-V3 model in its bf16, int8, and q4km versions for this test. The MMLU dataset, which can be found [here](https://huggingface.co/datasets/cais/mmlu), was used (we selected all datasets and shuffled them with a fixed random seed).

**!!! However, we skipped the few-shot part and only chose the first 1,000 data points for a quick check.** Please note that this approach may produce numbers that are not consistent with the technical report of DeepSeek-V3. Tests of R1 and further tests are ongoing.

To verify our results, we chose the [cloud service platform](https://cloud.siliconflow.cn/models) as the baseline. All tests were conducted using the same script and datasets, allowing us to make a preliminary assessment of our project's precision.

We set the argument `temperature=0.6`, and to simplify the test process, we skipped the few-shot part and used the following prompt: `There is a single choice question. Answer the question by replying A, B, C, D. No other answers are accepted. Just the letter. \nQuestion: {question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer: '`. For more details, please refer to the [script](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/tests/mmlu_test.py).
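For illustration, a minimal sketch of how such a single-choice prompt can be assembled and the model's reply reduced to a letter; this is not the repository's test script, and the sample question is made up:

```python
# Illustrative sketch of the single-choice prompting scheme described above.
PROMPT_TEMPLATE = (
    "There is a single choice question. Answer the question by replying A, B, C, D. "
    "No other answers are accepted. Just the letter. "
    "\nQuestion: {question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer: "
)

def build_prompt(question: str, options: list[str]) -> str:
    a, b, c, d = options
    return PROMPT_TEMPLATE.format(question=question, option_a=a, option_b=b, option_c=c, option_d=d)

def extract_choice(reply: str) -> str | None:
    # Take the first A/B/C/D that appears in the model's reply.
    for ch in reply.strip().upper():
        if ch in "ABCD":
            return ch
    return None

prompt = build_prompt("Which planet is known as the Red Planet?", ["Venus", "Mars", "Jupiter", "Mercury"])
print(prompt)
print(extract_choice(" B. Mars"))  # -> "B"
```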
Given that we have only tested 1,000 cases, which provides only a preliminary judgment, some fluctuation in the results is reasonable. We selected all datasets and shuffled them with a fixed random seed to ensure consistency.

## Some Details

- The bf16 model of DeepSeek-V3 is available [here](https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main) (you may convert it to gguf with llama.cpp). The q4km model can be found [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).

- The optimization YAML files are located [here](https://github.com/kvcache-ai/ktransformers/tree/main/ktransformers/optimize/optimize_rules). For the GEMM kernel, you can change `KLinearMarlin` to `KLinearTorch`.

- To switch the MLA kernel from Triton to Torch, you can check and modify [this file](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/attention.py), specifically by using the `forward_windows` method.

- When attempting to conduct the bf16 test (both CPU weight and GPU weight), you may encounter issues stemming from older versions of g++ and as, particularly when using Ubuntu 20 or earlier versions. To facilitate a smoother experience and enable you to reproduce our results, we have provided a development container. This container offers a pre-configured environment tailored for this purpose. However, please note that the container does not have the ktrans package installed; you may still need to manually install certain packages to ensure everything runs smoothly.

- You may configure the model mount dir in `devcontainer/devcontainer.json`; check the `"mounts":` config.

## The Result Table

| DataSet | CPU Weight Format | CPU Kernel | GPU Weight Format | GEMM Kernel | MLA Kernel | [Siliconflow](https://cloud.siliconflow.cn/models) | Ktrans Point |
| ------- | ----------------- | ---------- | ----------------- | ----------- | ---------- | -------------------------------------------------- | ------------ |
| MMLU (shuffle 1k) | | | | | | | |
| 1 | bf16 | cpuinfer | bf16 | torch | torch | 81.6 | 81.9 |
| 2 | q8_0 | cpuinfer | bf16 | torch | torch | 81.6 | 83.1 |
| 3 | q4km | cpuinfer | bf16 | torch | triton | 81.6 | 81.4 |
| 4 | q4km | cpuinfer | q4km->marlin 8 | marlin | triton | 81.6 | 81.1 |
| 5 | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 81.6 | 81 |
| 6 | q4km | cpuinfer | fp8 | fp8gemm | triton | 81.6 | 81.5 |
| MMLU-pro | | | | | | | |
| 1 | q4km | cpuinfer | fp8 | fp8gemm | triton | 57.7 | 57.6 |
| 2 | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 57.7 | 57.5 |
| HumanEval | tbd | tbd | tbd | tbd | tbd | tbd | tbd |
| GSM8K | tbd | tbd | tbd | tbd | tbd | tbd | tbd |

**The details for each case are listed below**:

By default, the MLA kernel uses Triton on Linux and Torch on Windows. But we need to test Torch on Linux, so we manually modify the [file](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/attention.py#L592): just remove all the if branches and force it to use `self.forward_windows`.

- MMLU test
  1. [v3-chat yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): change all `KLinearMarlin` to `KLinearTorch` (just find all the usages in this file). The source weight comes from [there](https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16) (you need to use llama.cpp to convert it to gguf).
  2. [v3-chat yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): you need to modify the code to separately load the cpu's expert weight. We leave this as comments in these places: [1](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L122), [2](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L136), [3](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L137) (note in 3, change the path to your local weight file path). The weight file for q8_0 is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q8_0).
  3. [v3-chat yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): you need to modify the code to separately load the cpu's expert weight. We leave this as comments in these places: [1](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L122), [2](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L136), [3](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L137) (note in 3, change the path to your local weight file path). The weight file for q4km is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
  4. [v3-chat yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): you don't need to change the source code as they both use q4km. But note the yaml file [here](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml#L29) and [here](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml#L18): below these lines you need to add `num_bits: 8` (in other words, add this kwarg to everything that uses `KLinearMarlin`). The weight file for q4km is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
  5. [v3-chat yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): no need to change the yaml, just use the default. The weight file for q4km is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
  6. You should check the [doc](./fp8_kernel.md) to learn how to test this case. This is a mixed-tensor case.
- MMLU-pro test
  1. You should check the [doc](./fp8_kernel.md) to learn how to test this case. This is a mixed-tensor case.
  2. [v3-chat yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): no need to change the yaml, just use the default. The weight file for q4km is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
76  doc/en/fp8_kernel.md  Normal file
@@ -0,0 +1,76 @@
# FP8 Linear Kernel for DeepSeek-V3/R1

## Overview
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following works:
- **FP8 GPU Kernel Integration**: FP8 linear layer acceleration kernels integrated in KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy)
  - Experts modules retain GGML quantization (GGUF format, reside in CPU to save GPU memory)

So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.

## Key Features

✅ Hybrid Precision Architecture (FP8 + GGML)<br>
✅ Memory Optimization (~19GB VRAM usage)

## Quick Start
### Using Pre-Merged Weights

Pre-merged weights are available on Hugging Face:<br>
[KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-V3)<br>
[KVCache-ai/DeepSeek-R1-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-R1)

> Please confirm the weights are fully uploaded before downloading. The large file size may extend the Hugging Face upload time.

Download the pre-merged weights:
```shell
pip install -U huggingface_hub

# Optional: Use the HF mirror for faster downloads in some regions.
# export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid --local-dir <local_dir>
```
### Using merge scripts
If you have local DeepSeek-R1/V3 fp8 safetensors and gguf weights (e.g. q4km), you can merge them using the following script.

```shell
python merge_tensors/merge_safetensor_gguf.py \
  --safetensor_path <fp8_safetensor_path> \
  --gguf_path <gguf_folder_path> \
  --output_path <merged_output_path>
```

* `--safetensor_path`: input path of the safetensor file ([Download](https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main)).
* `--gguf_path`: input path of the gguf folder ([Download](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M)).
* `--output_path`: output path of the merged file.

### Execution Notes

Launch local_chat.py with custom quantized experts:
```shell
python ktransformers/local_chat.py \
  --model_path deepseek-ai/DeepSeek-V3 \
  --gguf_path <merged_weights_folder> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml \
  --cpu_infer <cpu_cores + 1>
```

## Notes

⚠️ Hardware Requirements<br>
* Recommended minimum of 19GB available VRAM for the FP8 kernel.
* Requires a GPU with FP8 support (e.g., 4090).

⏳ First-Run Optimization<br>
JIT compilation causes a longer initial execution (subsequent runs retain the optimized speed).

🔄 Temporary Interface<br>
The current weight-loading implementation is provisional and will be refined in future versions.

📁 Path Specification<br>
Despite the hybrid quantization, merged weights are stored as .safetensors; pass the containing folder path to `--gguf_path`.
@@ -59,6 +59,7 @@ Supported operators and their corresponding classes are as follows:
 | Linear | KTransformersLinear | KLinearMarlin | Marlin as backend |
 | | | KLinearTorch | pytorch as backend |
 | | | KLinearCPUInfer | llamafile as backend |
+| | | KLinearFP8 | Triton fp8_gemm kernel. Requires a GPU able to calculate fp8 data |
 | experts | KTransformersExperts | KExpertsTorch | pytorch as backend |
 | | | KExpertsMarlin | Marlin as backend |
 | | | KExpertsCPU | llamafile as backend |
@@ -11,31 +11,50 @@ Some preparation:

 ```sh
 # Adding CUDA to PATH
-export PATH=/usr/local/cuda/bin:$PATH
-export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
-export CUDA_PATH=/usr/local/cuda
+if [ -d "/usr/local/cuda/bin" ]; then
+    export PATH=$PATH:/usr/local/cuda/bin
+fi
+
+if [ -d "/usr/local/cuda/lib64" ]; then
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
+    # Or you can add it to /etc/ld.so.conf and run ldconfig as root:
+    # echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
+    # sudo ldconfig
+fi
+
+if [ -d "/usr/local/cuda" ]; then
+    export CUDA_PATH=$CUDA_PATH:/usr/local/cuda
+fi
 ```

-- Linux-x86_64 with gcc, g++ and cmake
+- Linux-x86_64 with gcc, g++ and cmake (using Ubuntu as an example)

 ```sh
 sudo apt-get update
-sudo apt-get install gcc g++ cmake ninja-build
+sudo apt-get install build-essential cmake ninja-build
 ```

-- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
+- We recommend using [Miniconda3](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) or [Anaconda3](https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program. Assuming your Anaconda installation directory is `~/anaconda3`, you should ensure that the version identifiers of the GNU C++ standard library used by Anaconda include `GLIBCXX-3.4.32`.

 ```sh
 conda create --name ktransformers python=3.11
 conda activate ktransformers # you may need to run 'conda init' and reopen shell first
+
+conda install -c conda-forge libstdcxx-ng # Anaconda provides a package called `libstdcxx-ng` that includes a newer version of `libstdc++`, which can be installed via conda-forge.
+
+strings ~/anaconda3/envs/ktransformers-0.3/lib/libstdc++.so.6 | grep GLIBCXX
 ```

-- Make sure that PyTorch, packaging, ninja is installed
+- Make sure that PyTorch, packaging and ninja are installed. You can also [install previous versions of PyTorch](https://pytorch.org/get-started/previous-versions/).

 ```
-pip install torch packaging ninja cpufeature numpy
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
+pip3 install packaging ninja cpufeature numpy
 ```

+- At the same time, you should download and install the corresponding version of flash-attention from https://github.com/Dao-AILab/flash-attention/releases.
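After the preparation steps above, it can be worth confirming that the installed PyTorch actually sees CUDA before building ktransformers; a minimal sketch:

```python
import torch

# Quick sanity check that the PyTorch wheel installed above was built with CUDA
# support and can see at least one GPU.
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```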
 ## Installation

 <!-- 1. ~~Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)~~
@@ -62,7 +81,7 @@ Some preparation:
 git submodule update
 ```

-- [Optional] If you want to run with website, please [compile the website](./doc/en/api/server/website.md) before execute ```bash install.sh```
+- [Optional] If you want to run with the website, please [compile the website](./api/server/website.md) before executing ```bash install.sh```

 - For Linux
   - For simple install:
@@ -84,7 +103,7 @@ Some preparation:
 install.bat
 ```

-* If you are developer, you can make use of the makefile to compile and format the code. <br> the detailed usage of makefile is [here](./doc/en/makefile_usage.md)
+* If you are a developer, you can make use of the makefile to compile and format the code. <br> The detailed usage of the makefile is [here](./makefile_usage.md)

 <h3>Local Chat</h3>
 We provide a simple command-line local chat Python script that you can run for testing.
@@ -102,7 +121,7 @@ We provide a simple command-line local chat Python script that you can run for t
 mkdir DeepSeek-V2-Lite-Chat-GGUF
 cd DeepSeek-V2-Lite-Chat-GGUF

-wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
+wget https://huggingface.co/mradermacher/DeepSeek-V2-Lite-GGUF/resolve/main/DeepSeek-V2-Lite.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf

 cd .. # Move to repo's root dir
|
||||||
|
@ -122,7 +141,7 @@ It features the following arguments:
|
||||||
|
|
||||||
- `--gguf_path` (required): Path of a directory containing GGUF files which could that can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contains GGUF of current model, which means you need one separate directory for each model.
|
- `--gguf_path` (required): Path of a directory containing GGUF files which could that can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contains GGUF of current model, which means you need one separate directory for each model.
|
||||||
|
|
||||||
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
|
- `--optimize_config_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
|
||||||
|
|
||||||
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
|
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
|
||||||
|
|
||||||
|
@ -235,7 +254,7 @@ Be aware that you need to be subject to their corresponding model licenses when
|
||||||
<!-- pin block for jump -->
|
<!-- pin block for jump -->
|
||||||
<span id='id_666'>
|
<span id='id_666'>
|
||||||
|
|
||||||
<h3>RESTful API and Web UI (deprected) </h3>
|
<h3>RESTful API and Web UI </h3>
|
||||||
|
|
||||||
|
|
||||||
Start without website:
|
Start without website:
|
||||||
|
|
|
@@ -160,9 +160,14 @@ DeepSeek's MLA operator is compute-intensive. While running it entirely on the CPU is feasible

 5. Why Intel CPUs?
    Intel is currently the only CPU vendor that supports AMX-like instructions, which delivers significantly better performance than AVX-only alternatives.

 ## FAQ
 ### R1 does not return its thinking process
 Note! When testing R1, the model may skip the thinking step. In that case, add the argument `--force_think true`. Details are in the [FAQ](./FAQ.md) section. <br>

+## Issues
+* Fix the server integration so that web API access is supported
+* Fix local chat accepting only single-line prompts (currently, entering a newline (\n) immediately starts generation)

 ### More FAQ
 [See details](./FAQ.md)
@@ -2,6 +2,8 @@
 set -e

 # clear build dirs
+rm -rf build
+rm -rf *.egg-info
 rm -rf ktransformers/ktransformers_ext/build
 rm -rf ktransformers/ktransformers_ext/cuda/build
 rm -rf ktransformers/ktransformers_ext/cuda/dist
@@ -8,4 +8,4 @@ Version      : 1.0.0
 LastEditors  : chenxl
 LastEditTime : 2025-02-15 03:53:02
 '''
-__version__ = "0.2.1"
+__version__ = "0.2.2rc1"
@@ -30,6 +30,8 @@ if (NOT MSVC)
     option(LLAMA_F16C "llama: enable F16C" OFF)
 endif()
 option(LLAMA_AVX512_FANCY_SIMD "llama: enable AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-VNNI" OFF)
+option(KTRANSFORMERS_USE_CUDA "ktransformers: use CUDA" OFF)
+option(KTRANSFORMERS_USE_MUSA "ktransformers: use MUSA" OFF)

 # Architecture specific
 # TODO: probably these flags need to be tweaked on some architectures

@@ -207,9 +209,33 @@ add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/llama.cpp ${CMAKE
 include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../../third_party)
 if (WIN32)
     include_directories("$ENV{CUDA_PATH}/include")
+    add_compile_definitions(KTRANSFORMERS_USE_CUDA=1)
 elseif (UNIX)
-    find_package(CUDA REQUIRED)
-    include_directories("${CUDA_INCLUDE_DIRS}")
+    if (KTRANSFORMERS_USE_CUDA)
+        find_package(CUDA REQUIRED)
+        include_directories("${CUDA_INCLUDE_DIRS}")
+        add_compile_definitions(KTRANSFORMERS_USE_CUDA=1)
+    endif()
+
+    if (KTRANSFORMERS_USE_MUSA)
+        if (NOT EXISTS $ENV{MUSA_PATH})
+            if (NOT EXISTS /opt/musa)
+                set(MUSA_PATH /usr/local/musa)
+            else()
+                set(MUSA_PATH /opt/musa)
+            endif()
+        else()
+            set(MUSA_PATH $ENV{MUSA_PATH})
+        endif()
+
+        list(APPEND CMAKE_MODULE_PATH "${MUSA_PATH}/cmake")
+
+        find_package(MUSAToolkit)
+        if (MUSAToolkit_FOUND)
+            message(STATUS "MUSA Toolkit found")
+            add_compile_definitions(KTRANSFORMERS_USE_MUSA=1)
+        endif()
+    endif()
 endif()

 aux_source_directory(${CMAKE_CURRENT_SOURCE_DIR} SOURCE_DIR1)

@@ -225,10 +251,15 @@ target_link_libraries(${PROJECT_NAME} PRIVATE llama)
 if(WIN32)
     target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_PATH}/lib/x64/cudart.lib")#CUDA::cudart
 elseif(UNIX)
-    if(NOT DEFINED ENV{CUDA_HOME} OR "$ENV{CUDA_HOME}" STREQUAL "")
-        set(ENV{CUDA_HOME} "/usr/local/cuda")
+    if(KTRANSFORMERS_USE_CUDA)
+        if(NOT DEFINED ENV{CUDA_HOME} OR "$ENV{CUDA_HOME}" STREQUAL "")
+            set(ENV{CUDA_HOME} "/usr/local/cuda")
+        endif()
+        target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_HOME}/lib64/libcudart.so")
+    endif()
+    if(KTRANSFORMERS_USE_MUSA)
+        target_link_libraries(${PROJECT_NAME} PRIVATE MUSA::musart)
     endif()
-    target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_HOME}/lib64/libcudart.so")
 endif()

 # Define the USE_NUMA option
@@ -54,7 +54,12 @@ void Backend::do_work_stealing_job(int task_num,
     init_func_ = init_func;
     compute_func_ = compute_func;
     finalize_func_ = finalize_func;
+#ifdef USE_NUMA
+    // numa node location will be calculated based on the number of threads
+    thread_num_ = max_thread_num_;
+#else
     thread_num_ = std::min(max_thread_num_, task_num);
+#endif
     int base = task_num / thread_num_;
     int remain = task_num % thread_num_;
     thread_state_[0].end = base + (0 < remain);

@@ -146,4 +151,4 @@ void Backend::worker_thread(int thread_id) {
             return;
         }
     }
 }
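For readers skimming the hunk above: only the choice of `thread_num_` changes (all threads under `USE_NUMA`, `min(max_thread_num_, task_num)` otherwise); the `base`/`remain` split of tasks across threads is untouched. A standalone sketch of that split, with names of our own choosing:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Sketch of the even task split used above: the first `remain` threads
// receive one extra task, the rest receive `base` tasks.
std::vector<std::pair<int, int>> split_tasks(int task_num, int thread_num) {
    int base = task_num / thread_num;
    int remain = task_num % thread_num;
    std::vector<std::pair<int, int>> ranges;
    int begin = 0;
    for (int t = 0; t < thread_num; ++t) {
        int len = base + (t < remain ? 1 : 0);
        ranges.emplace_back(begin, begin + len);  // half-open [begin, end)
        begin += len;
    }
    return ranges;
}

int main() {
    for (auto [b, e] : split_tasks(10, 4))
        std::printf("[%d, %d)\n", b, e);  // -> [0,3) [3,6) [6,8) [8,10)
}
```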
@@ -17,7 +17,11 @@
 #include <queue>
 #include <thread>
 #include <vector>
-#include "cuda_runtime.h"
+#ifdef KTRANSFORMERS_USE_CUDA
+#include "vendors/cuda.h"
+#elif KTRANSFORMERS_USE_MUSA
+#include "vendors/musa.h"
+#endif

 #include "backend.h"
 #include "task_queue.h"
3 ktransformers/ktransformers_ext/cpu_backend/vendors/README.md vendored Normal file
@@ -0,0 +1,3 @@
+## TODO
+
+This directory can be removed after updating the version of `llama.cpp`.

3 ktransformers/ktransformers_ext/cpu_backend/vendors/cuda.h vendored Normal file
@@ -0,0 +1,3 @@
+#pragma once
+
+#include <cuda_runtime.h>

9 ktransformers/ktransformers_ext/cpu_backend/vendors/musa.h vendored Normal file
@@ -0,0 +1,9 @@
+#pragma once
+
+#include <musa_runtime.h>
+#include <musa_bf16.h>
+
+#define cudaLaunchHostFunc musaLaunchHostFunc
+#define cudaStream_t musaStream_t
+#define cudaHostFn_t musaHostFn_t
+#define nv_bfloat16 mt_bfloat16
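The new `vendors/` headers above exist so that backend code can keep using the CUDA spellings while MUSA builds retarget them through the `#define` aliases. A minimal sketch (ours, under the assumption that the aliases are used exactly as listed) of how one helper compiles against either toolkit:

```cpp
// Sketch only: shows how the aliases in vendors/musa.h let shared code
// keep the CUDA spellings (cudaStream_t, cudaHostFn_t, cudaLaunchHostFunc).
#ifdef KTRANSFORMERS_USE_CUDA
#include "vendors/cuda.h"   // pulls in <cuda_runtime.h>
#elif KTRANSFORMERS_USE_MUSA
#include "vendors/musa.h"   // #defines cudaStream_t -> musaStream_t, etc.
#endif

// Enqueue a host callback on whatever stream type the active vendor provides.
// On MUSA builds this expands to musaLaunchHostFunc(musaStream_t, ...).
static void enqueue_notify(cudaStream_t stream, cudaHostFn_t fn, void* arg) {
    (void)cudaLaunchHostFunc(stream, fn, arg);
}
```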
@@ -1,15 +1,15 @@
 /**
  * @Description  :
- * @Author       : Azure-Tang
+ * @Author       : Azure-Tang, Boxin Zhang
  * @Date         : 2024-07-25 13:38:30
- * @Version      : 1.0.0
- * @LastEditors  : kkk1nak0
- * @LastEditTime : 2024-08-12 03:05:04
- * @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
+ * @Version      : 0.2.2
+ * @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
 **/

 #include "custom_gguf/ops.h"
+#ifdef KTRANSFORMERS_USE_CUDA
 #include "gptq_marlin/ops.h"
+#endif
 // Python bindings
 #include <pybind11/pybind11.h>
 #include <pybind11/stl.h>

@@ -19,22 +19,53 @@
 // namespace py = pybind11;

 PYBIND11_MODULE(KTransformersOps, m) {
-    m.def("dequantize_q8_0", &dequantize_q8_0, "Function to dequantize q8_0 data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q6_k", &dequantize_q6_k, "Function to dequantize q6_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q5_k", &dequantize_q5_k, "Function to dequantize q5_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q4_k", &dequantize_q4_k, "Function to dequantize q4_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q3_k", &dequantize_q3_k, "Function to dequantize q3_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q2_k", &dequantize_q2_k, "Function to dequantize q2_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_iq4_xs", &dequantize_iq4_xs, "Function to dequantize iq4_xs data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("gptq_marlin_gemm", &gptq_marlin_gemm, "Function to perform GEMM using Marlin quantization.",
-          py::arg("a"), py::arg("b_q_weight"), py::arg("b_scales"), py::arg("g_idx"),
-          py::arg("perm"), py::arg("workspace"), py::arg("num_bits"), py::arg("size_m"),
-          py::arg("size_n"), py::arg("size_k"), py::arg("is_k_full"));
+    m.def("dequantize_q8_0", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_q8_0((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize q8_0 data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+    m.def("dequantize_q6_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_q6_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize q6_k data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+    m.def("dequantize_q5_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_q5_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize q5_k data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+    m.def("dequantize_q4_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_q4_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize q4_k data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+    m.def("dequantize_q3_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_q3_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize q3_k data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+    m.def("dequantize_q2_k", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_q2_k((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize q2_k data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+    m.def("dequantize_iq4_xs", [](const intptr_t data, int num_bytes, int blk_size, const int ele_per_blk, torch::Device device, py::object target_dtype) {
+        torch::Dtype dtype = torch::python::detail::py_object_to_dtype(target_dtype);
+        return dequantize_iq4_xs((int8_t*)data, num_bytes, blk_size, ele_per_blk, device, dtype);
+    }, "Function to dequantize iq4_xs data.",
+    py::arg("data"), py::arg("num_bytes"), py::arg("blk_size"), py::arg("ele_per_blk"), py::arg("device"), py::arg("target_dtype"));
+
+#ifdef KTRANSFORMERS_USE_CUDA
+    m.def("gptq_marlin_gemm", &gptq_marlin_gemm, "Function to perform GEMM using Marlin quantization.",
+          py::arg("a"), py::arg("b_q_weight"), py::arg("b_scales"), py::arg("g_idx"),
+          py::arg("perm"), py::arg("workspace"), py::arg("num_bits"), py::arg("size_m"),
+          py::arg("size_n"), py::arg("size_k"), py::arg("is_k_full"));
+#endif
 }
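The rebinding above moves from tensor-in/tensor-out to a raw-pointer calling convention (`data`, `num_bytes`, `blk_size`, `ele_per_blk`, `device`, `target_dtype`). A rough C++-side sketch of calling the new `dequantize_q8_0` declared in `custom_gguf/ops.h`; the 34-byte/32-weight block geometry is taken from the q8_0 kernels later in this diff, and the wrapper function is our own:

```cpp
// Sketch, not repository code: exercises the new dequantize_q8_0 signature.
#include <torch/torch.h>
#include <cstdint>
#include <vector>

// Declaration matching the updated host function in this commit.
torch::Tensor dequantize_q8_0(const int8_t* data, int num_bytes, int blk_size,
                              int ele_per_blk, torch::Device device,
                              torch::Dtype target_dtype);

torch::Tensor dequant_q8_0_buffer(const std::vector<int8_t>& raw) {
    const int blk_size = 34;     // bytes per q8_0 block: fp16 scale + 32 int8 weights
    const int ele_per_blk = 32;  // dequantized weights produced per block
    return dequantize_q8_0(raw.data(), static_cast<int>(raw.size()), blk_size,
                           ele_per_blk, torch::Device(torch::kCUDA, 0),
                           torch::kBFloat16);
}
```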
@@ -1,35 +0,0 @@
-#include "ops.h"
-// Python bindings
-#include <pybind11/pybind11.h>
-#include <pybind11/stl.h>
-#include <torch/library.h>
-#include <torch/extension.h>
-#include <torch/torch.h>
-// namespace py = pybind11;
-
-int test(){
-    return 5;
-}
-
-torch::Tensor dequantize_q6_k(torch::Tensor data, int blk_size, torch::Device device);
-torch::Tensor dequantize_q5_k(torch::Tensor data, int blk_size, torch::Device device);
-torch::Tensor dequantize_q2_k(torch::Tensor data, int blk_size, torch::Device device);
-
-PYBIND11_MODULE(cudaops, m) {
-    m.def("dequantize_q8_0", &dequantize_q8_0, "Function to dequantize q8_0 data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q6_k", &dequantize_q6_k, "Function to dequantize q6_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q5_k", &dequantize_q5_k, "Function to dequantize q5_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q4_k", &dequantize_q4_k, "Function to dequantize q4_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q3_k", &dequantize_q3_k, "Function to dequantize q3_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_q2_k", &dequantize_q2_k, "Function to dequantize q2_k data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("dequantize_iq4_xs", &dequantize_iq4_xs, "Function to dequantize iq4_xs data.",
-          py::arg("data"), py::arg("blk_size"), py::arg("device"));
-    m.def("test", &test, "Function to test.");
-
-}
@@ -2,26 +2,55 @@
 /**
  * @Description  :
  * @Author       : Azure-Tang, Boxin Zhang
  * @Date         : 2024-07-25 13:38:30
- * @Version      : 1.0.0
- * @LastEditors  : kkk1nak0
- * @LastEditTime : 2024-08-12 04:18:04
+ * @Version      : 0.2.2
  * Adapted from https://github.com/ggerganov/ggml/blob/fca1caafea7de9fbd7efc733b9818f9cf2da3050/src/ggml-quants.c
  * Copyright (c) 2023-2024 The ggml authors
  * Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
 */
 #include <cuda_runtime.h>
+#include <cuda_bf16.h>
+#include <cuda_fp16.h>
 #include <torch/library.h>
 #include <torch/extension.h>
 #include <torch/torch.h>
 #include <cstdint>
 #include <c10/cuda/CUDAGuard.h>

-__global__ void dequantize_q8_0_kernel(float* output, const float* scales, const int8_t* qs, int num_blocks, int blk_size) {
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (auto block_id=global_idx; block_id<num_blocks;block_id+=blockDim.x * gridDim.x){
-        for(int i=0;i<blk_size;i++){
-            float scale = scales[block_id];
-            output[block_id * blk_size + i] = scale * qs[block_id * blk_size + i];
+__global__ void dequantize_q8_0_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id = global_idx; block_id < num_blocks; block_id += blockDim.x * gridDim.x){
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);
+        const int8_t* cur_block = data + block_id * blk_size;
+        float scale = __half2float(*((half*)cur_block));
+        cur_block += 2;
+        for (int i = 0; i < ele_per_blk; i++){
+            output_blk[i] = scale * cur_block[i];
+        }
+    }
+}
+
+__global__ void dequantize_q8_0_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id = global_idx; block_id < num_blocks; block_id += blockDim.x * gridDim.x) {
+        __half* __restrict__ output_blk = (__half*)(output + block_id * ele_per_blk);
+        const int8_t* cur_block = data + block_id * blk_size;
+        float scale = __half2float(*((half*)cur_block));
+        cur_block += 2;
+        for (int i = 0; i < ele_per_blk; i++) {
+            output_blk[i] = __float2half(scale * cur_block[i]);
+        }
+    }
+}
+
+__global__ void dequantize_q8_0_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id = global_idx; block_id < num_blocks; block_id += blockDim.x * gridDim.x) {
+        nv_bfloat16* __restrict__ output_blk = (nv_bfloat16*)(output + block_id * ele_per_blk);
+        const int8_t* cur_block = data + block_id * blk_size;
+        float scale = __half2float(*((half*)cur_block));
+        cur_block += 2;
+        for (int i = 0; i < ele_per_blk; i++) {
+            output_blk[i] = __float2bfloat16(scale * cur_block[i]);
         }
     }
 }
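These kernels make the distinction explicit between `blk_size` (bytes per stored block, 34 for q8_0: a 2-byte fp16 scale plus 32 int8 weights) and `ele_per_blk` (weights produced per block, 32). A plain CPU sketch of the same per-block math, using a standalone fp16 decoder in place of `__half2float`:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Minimal IEEE fp16 -> fp32 conversion (stand-in for CUDA's __half2float).
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) & 1u;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mant = (uint32_t)h & 0x3FFu;
    float value;
    if (exp == 0)       value = std::ldexp((float)mant, -24);                    // subnormal / zero
    else if (exp == 31) value = mant ? NAN : INFINITY;                           // inf / nan
    else                value = std::ldexp((float)(mant | 0x400u), (int)exp - 25);
    return sign ? -value : value;
}

// One q8_0 block is 34 bytes: a 2-byte fp16 scale followed by 32 int8 weights.
static void dequantize_q8_0_block_ref(const int8_t* block, float* out,
                                      int blk_size /*= 34*/, int ele_per_blk /*= 32*/) {
    uint16_t raw_scale;
    std::memcpy(&raw_scale, block, sizeof(raw_scale));
    const float scale = half_to_float(raw_scale);
    const int8_t* qs = block + 2;      // weights start right after the scale
    for (int i = 0; i < ele_per_blk; ++i)
        out[i] = scale * (float)qs[i];
    (void)blk_size;                    // byte stride used when walking consecutive blocks
}
```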
@@ -36,13 +65,13 @@ __device__ void get_scale_min_k4(int j, const uint8_t * q, uint8_t * __restrict_
     }
 }

-__global__ void dequantize_q2_k_kernel(int8_t* data, float* output, int blk_size, int num_blocks) {
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (auto block_id=global_idx; block_id<num_blocks; block_id+= blockDim.x * gridDim.x){
-        float* __restrict__ output_blk = (float*)(output + block_id * 256);
+__global__ void dequantize_q2_k_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id=global_idx; block_id<num_blocks; block_id+= blockDim.x * gridDim.x){
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);

-        const float d = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size + 80)));
-        const float min = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size + 82)));
+        const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 80)));
+        const float min = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 82)));

         const uint8_t * __restrict__ q = (uint8_t*)(data + block_id * blk_size + 16);
@@ -70,17 +99,85 @@ __global__ void dequantize_q2_k_kernel(int8_t* data, float* output, int blk_size
     }
 }

+__global__ void dequantize_q2_k_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id=global_idx; block_id<num_blocks; block_id+= blockDim.x * gridDim.x){
+        __half* __restrict__ output_blk = (__half*)(output + block_id * ele_per_blk);
+
+        const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 80)));
+        const float min = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 82)));
+
+        const uint8_t * __restrict__ q = (uint8_t*)(data + block_id * blk_size + 16);
+
+        int is = 0;
+        float dl, ml;
+
+        for (int n = 0; n < 256; n += 128) {
+            int shift = 0;
+            for (int j = 0; j < 4; ++j) {
+                uint8_t* scales = (uint8_t*)(data + block_id * blk_size + (is++));
+                uint8_t sc = *scales;
+                dl = d * (sc & 0xF); ml = min * (sc >> 4);
+                for (int l = 0; l < 16; ++l) *output_blk++ = __float2half(dl * ((int8_t)((q[l] >> shift) & 3)) - ml);
+
+                scales = (uint8_t*)(data + block_id * blk_size + (is++));
+                sc = *scales;
+
+                dl = d * (sc & 0xF); ml = min * (sc >> 4);
+                for (int l = 0; l < 16; ++l) *output_blk++ = __float2half(dl * ((int8_t)((q[l+16] >> shift) & 3)) - ml);
+
+                shift += 2;
+            }
+            q += 32;
+        }
+    }
+}
+
+__global__ void dequantize_q2_k_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q2_k_fp16_kernel above, writing nv_bfloat16 values via __float2bfloat16())
+}
+
-__global__ void dequantize_q3_k_kernel(int8_t* data, float* output, int blk_size, int num_blocks) {
-
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+__global__ void dequantize_q3_k_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
     const uint32_t kmask1 = 0x03030303;
     const uint32_t kmask2 = 0x0f0f0f0f;
-    for (auto block_id=global_idx; block_id<num_blocks; block_id+= blockDim.x * gridDim.x){
-        float* __restrict__ output_blk = (float*)(output + block_id * 256);
+    for (long long block_id=global_idx; block_id<num_blocks; block_id+= blockDim.x * gridDim.x){
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);

         uint32_t aux[4];
         const int8_t * scales = (const int8_t*)aux;
-        const float d_all = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size + 108)));
+        const float d_all = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 108)));

         const uint8_t * __restrict__ q = (uint8_t*)(data + block_id * blk_size + 32);
         const uint8_t * __restrict__ hm = (uint8_t*)(data + block_id * blk_size + 0);
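All of the q2_k variants above decode sub-block scales the same way: each scale byte packs a 4-bit scale in its low nibble and a 4-bit minimum in its high nibble, which are multiplied by the block-wide `d` and `min`. A small host-side sketch of just that step, with helper names of our own choosing:

```cpp
#include <cstdint>
#include <utility>

// Decode one packed q2_k scale byte into the (dl, ml) pair used above:
// dl scales the 2-bit quants, ml is subtracted as an offset.
static std::pair<float, float> decode_q2k_scale(uint8_t sc, float d, float min) {
    const float dl = d * (sc & 0xF);   // low nibble: sub-block scale
    const float ml = min * (sc >> 4);  // high nibble: sub-block minimum
    return {dl, ml};
}

// A dequantized weight then comes out as w = dl * q2 - ml, with q2 in [0, 3],
// matching the `dl * ((q[l] >> shift) & 3) - ml` lines in the kernels.
static float dequant_q2k_value(uint8_t packed, int shift, float dl, float ml) {
    const int q2 = (packed >> shift) & 3;  // pick one 2-bit quant
    return dl * q2 - ml;
}
```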
@@ -126,19 +223,131 @@ __global__ void dequantize_q3_k_kernel(int8_t* data, float* output, int blk_size
     }
 }

+__global__ void dequantize_q3_k_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q3_k_fp32_kernel above, writing __half values via __float2half())
+}
+
+__global__ void dequantize_q3_k_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q3_k_fp32_kernel above, writing nv_bfloat16 values via __float2bfloat16())
+}
+
-__global__ void dequantize_q4_k_kernel(int8_t* data, float* output, int blk_size, int num_blocks) {
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (auto block_id=global_idx; block_id<num_blocks;block_id+=blockDim.x * gridDim.x){
-        float* __restrict__ output_blk = (float*)(output + block_id * 256);
+__global__ void dequantize_q4_k_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id=global_idx; block_id<num_blocks; block_id+=blockDim.x * gridDim.x){
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);
         // const uint8_t * q = data[i].qs;
         const uint8_t * q = (uint8_t*)(data + block_id * 144 + 16);

-        const float d = __half2float(*(reinterpret_cast<half*>(data + block_id * 144 + 0)));
-        const float min = __half2float(*(reinterpret_cast<half*>(data + block_id * 144 + 2)));
+        const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * 144 + 0)));
+        const float min = __half2float(*(reinterpret_cast<const half*>(data + block_id * 144 + 2)));
         int is = 0;
         uint8_t sc, m;
-        for (int j = 0; j < blk_size; j += 64) {
+        for (int j = 0; j < ele_per_blk; j += 64) {
             uint8_t* scales = (uint8_t*)(data + block_id * 144 + 4);
             get_scale_min_k4(is + 0, scales, &sc, &m);
             const float d1 = d * sc; const float m1 = min * m;
@@ -151,13 +360,61 @@ __global__ void dequantize_q4_k_kernel(int8_t* data, float* output, int blk_size
     }
 }

+__global__ void dequantize_q4_k_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q4_k_fp32_kernel above, writing __half values via __float2half())
+}
+
+__global__ void dequantize_q4_k_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q4_k_fp32_kernel above, writing nv_bfloat16 values via __float2bfloat16())
+}
+
-__global__ void dequantize_q5_k_kernel(int8_t* data, float* output, int blk_size, int num_blocks) {
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (auto block_id=global_idx; block_id<num_blocks; block_id+= blockDim.x * gridDim.x){
-        float* __restrict__ output_blk = (float*)(output + block_id * 256);
-
-        const float d = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size + 0)));
-        const float min = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size + 2)));
+__global__ void dequantize_q5_k_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id = global_idx; block_id < num_blocks; block_id += blockDim.x * gridDim.x){
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);
+
+        const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 0)));
+        const float min = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 2)));

         const uint8_t * __restrict__ qh = (uint8_t*)(data + block_id * blk_size + 16);
         const uint8_t * __restrict__ ql = (uint8_t*)(data + block_id * blk_size + 48);
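The q4_k kernels above address a 144-byte super-block purely through raw offsets (0, 2, 4, 16). Written out once as a struct, the layout implied by those offsets looks as follows; this is our own sketch for auditing the arithmetic, not a header from the repository:

```cpp
#include <cstdint>

// Layout implied by the q4_k kernels above: a 144-byte super-block holding
// 256 4-bit weights in 8 sub-blocks of 32.
struct Q4KBlockLayout {
    uint16_t d;           // offset 0:  fp16 super-block scale
    uint16_t dmin;        // offset 2:  fp16 super-block minimum
    uint8_t  scales[12];  // offset 4:  packed 6-bit sub-block scales/mins (decoded by get_scale_min_k4)
    uint8_t  qs[128];     // offset 16: 256 weights, two 4-bit values per byte
};

static_assert(sizeof(Q4KBlockLayout) == 144, "q4_k super-block must be 144 bytes");
```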
@@ -180,46 +437,165 @@ __global__ void dequantize_q5_k_kernel(int8_t* data, float* output, int blk_size
     }
 }

+__global__ void dequantize_q5_k_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q5_k_fp32_kernel above, writing __half values via __float2half())
+}
+
+__global__ void dequantize_q5_k_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q5_k_fp32_kernel above, writing nv_bfloat16 values via __float2bfloat16())
+}
+
-__global__ void dequantize_q6_k_kernel(int8_t* data, float* output, int blk_size, int num_blocks) {
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (auto block_id=global_idx; block_id<num_blocks;block_id+=blockDim.x * gridDim.x){
-        float* __restrict__ output_blk = (float*)(output + block_id * 256);
-        const float d = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size + 208)));
+__global__ void dequantize_q6_k_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id=global_idx; block_id<num_blocks;block_id+=blockDim.x * gridDim.x){
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);
+        const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size + 208)));

         const uint8_t * __restrict__ ql = (uint8_t*)(data + block_id * blk_size);
         const uint8_t * __restrict__ qh = (uint8_t*)(data + block_id * blk_size + 128);
         const int8_t * __restrict__ sc = (int8_t*)(data + block_id * blk_size + 192);


-        //if (blk_size == 256){
-        for (int n = 0; n < blk_size; n += 128) {
+        for (int n = 0; n < ele_per_blk; n += 128) {
             for (int l = 0; l < 32; ++l) {
                 int is = l/16;
                 const int8_t q1 = (int8_t)((ql[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
                 const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
                 const int8_t q3 = (int8_t)((ql[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
                 const int8_t q4 = (int8_t)((ql[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;
                 output_blk[l + 0] = d * sc[is + 0] * q1;
                 output_blk[l + 32] = d * sc[is + 2] * q2;
                 output_blk[l + 64] = d * sc[is + 4] * q3;
                 output_blk[l + 96] = d * sc[is + 6] * q4;
             }
             output_blk += 128;
             ql += 64;
             qh += 32;
             sc += 8;
         }
     }
 }

+__global__ void dequantize_q6_k_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q6_k_fp32_kernel above, writing __half values via __float2half())
+}
+
+__global__ void dequantize_q6_k_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    // (body identical to dequantize_q6_k_fp32_kernel above, writing nv_bfloat16 values via __float2bfloat16())
+}
+
 static constexpr __device__ int8_t kvalues_iq4nl[16] = {-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

-__global__ void dequantize_iq4_xs_kernel(int8_t* data, float* output, int blk_size, int num_blocks) {
-    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
-    for (auto block_id=global_idx; block_id<num_blocks; block_id+=blockDim.x * gridDim.x) {
-        float* __restrict__ output_blk = (float*)(output + block_id * 256);
-        const float d = __half2float(*(reinterpret_cast<half*>(data + block_id * blk_size)));
-        const uint16_t scales_h = *(reinterpret_cast<uint16_t*>(data + block_id * blk_size + 2));
+__global__ void dequantize_iq4_xs_fp32_kernel(const int8_t* data, float* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
+    long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
+    for (long long block_id=global_idx; block_id<num_blocks; block_id+=blockDim.x * gridDim.x) {
+        float* __restrict__ output_blk = (float*)(output + block_id * ele_per_blk);
+        const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size)));
+        const uint16_t scales_h = *(reinterpret_cast<const uint16_t*>(data + block_id * blk_size + 2));
         const uint8_t* scales_l = (uint8_t*)(data + block_id * blk_size + 2 + 2);
         const uint8_t* qs = (uint8_t*)(data + block_id * blk_size + 2 + 2 + 4);
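The q6_k kernels above rebuild each 6-bit weight from a 4-bit low part in `ql` and a 2-bit high part in `qh`, then re-center it by subtracting 32. A host-side sketch of that reconstruction for a single value, with helper names of our own choosing:

```cpp
#include <cstdint>

// Rebuild one signed q6_k quant from its split storage, as done above:
// low 4 bits come from ql, the top 2 bits from qh, then subtract 32 so the
// result lies in [-32, 31].
static int8_t reconstruct_q6k(uint8_t ql_nibble, uint8_t qh_bits) {
    return (int8_t)((ql_nibble & 0xF) | ((qh_bits & 3) << 4)) - 32;
}

// The dequantized weight is then w = d * sc[sub_block] * q, matching the
// `d * sc[is + k] * qk` lines in the kernels.
static float dequant_q6k_value(float d, int8_t sub_scale, uint8_t ql_nibble, uint8_t qh_bits) {
    return d * sub_scale * reconstruct_q6k(ql_nibble, qh_bits);
}
```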
@ -236,152 +612,267 @@ __global__ void dequantize_iq4_xs_kernel(int8_t* data, float* output, int blk_si
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
torch::Tensor dequantize_q8_0(torch::Tensor data, int blk_size, torch::Device device) {
|
__global__ void dequantize_iq4_xs_fp16_kernel(const int8_t* data, __half* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
|
||||||
int num_blocks = data.numel() / blk_size;
|
long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
|
||||||
|
for (long long block_id=global_idx; block_id<num_blocks; block_id+=blockDim.x * gridDim.x) {
|
||||||
|
__half* __restrict__ output_blk = (__half*)(output + block_id * ele_per_blk);
|
||||||
|
const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size)));
|
||||||
|
const uint16_t scales_h = *(reinterpret_cast<const uint16_t*>(data + block_id * blk_size + 2));
|
||||||
|
const uint8_t* scales_l = (uint8_t*)(data + block_id * blk_size + 2 + 2);
|
||||||
|
const uint8_t* qs = (uint8_t*)(data + block_id * blk_size + 2 + 2 + 4);
|
||||||
|
|
||||||
|
for (int ib = 0; ib < 8; ++ib) {
|
||||||
|
const int ls = ((scales_l[ib / 2] >> 4 * (ib % 2)) & 0xf) | (((scales_h >> 2 * ib) & 3) << 4);
|
||||||
|
const float dl = d * (ls - 32);
|
||||||
|
for (int j = 0; j < 16; ++j) {
|
||||||
|
output_blk[j + 0] = __float2half(dl * kvalues_iq4nl[qs[j] & 0xf]);
|
||||||
|
output_blk[j + 16] = __float2half(dl * kvalues_iq4nl[qs[j] >> 4]);
|
||||||
|
}
|
||||||
|
output_blk += 32;
|
||||||
|
qs += 16;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
__global__ void dequantize_iq4_xs_bf16_kernel(const int8_t* data, nv_bfloat16* output, const int blk_size, const int ele_per_blk, const int num_blocks) {
|
||||||
|
long long global_idx = blockIdx.x * blockDim.x + threadIdx.x;
|
||||||
|
for (long long block_id=global_idx; block_id<num_blocks; block_id+=blockDim.x * gridDim.x) {
|
||||||
|
nv_bfloat16* __restrict__ output_blk = (nv_bfloat16*)(output + block_id * ele_per_blk);
|
||||||
|
const float d = __half2float(*(reinterpret_cast<const half*>(data + block_id * blk_size)));
|
||||||
|
const uint16_t scales_h = *(reinterpret_cast<const uint16_t*>(data + block_id * blk_size + 2));
|
||||||
|
const uint8_t* scales_l = (uint8_t*)(data + block_id * blk_size + 2 + 2);
|
||||||
|
const uint8_t* qs = (uint8_t*)(data + block_id * blk_size + 2 + 2 + 4);
|
||||||
|
|
||||||
|
for (int ib = 0; ib < 8; ++ib) {
|
||||||
|
const int ls = ((scales_l[ib / 2] >> 4 * (ib % 2)) & 0xf) | (((scales_h >> 2 * ib) & 3) << 4);
|
||||||
|
const float dl = d * (ls - 32);
|
||||||
|
for (int j = 0; j < 16; ++j) {
|
||||||
|
output_blk[j + 0] = __float2bfloat16(dl * kvalues_iq4nl[qs[j] & 0xf]);
|
||||||
|
output_blk[j + 16] = __float2bfloat16(dl * kvalues_iq4nl[qs[j] >> 4]);
|
||||||
|
}
|
||||||
|
output_blk += 32;
|
||||||
|
qs += 16;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
torch::Tensor dequantize_q8_0(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    int num_blocks = num_bytes / blk_size;
    const at::cuda::OptionalCUDAGuard device_guard(device);

    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({ num_bytes }, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({ num_blocks, 32 }, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_q8_0_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_q8_0_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_q8_0_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

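As a cross-check for the Q8_0 path, here is a short CPU reference (a sketch, not repository code) assuming the standard GGUF Q8_0 layout of one fp16 scale followed by 32 signed 8-bit values, i.e. 34 bytes per block of 32 elements.

# Hedged reference sketch: dequantize Q8_0 blocks with NumPy.
import numpy as np

def dequant_q8_0(raw: bytes, blk_size: int = 34, ele_per_blk: int = 32) -> np.ndarray:
    num_blocks = len(raw) // blk_size
    out = np.empty((num_blocks, ele_per_blk), dtype=np.float32)
    for b in range(num_blocks):
        blk = raw[b * blk_size:(b + 1) * blk_size]
        d = np.frombuffer(blk, dtype=np.float16, count=1)[0].astype(np.float32)  # per-block scale
        qs = np.frombuffer(blk, dtype=np.int8, count=ele_per_blk, offset=2)      # 32 int8 weights
        out[b] = d * qs.astype(np.float32)
    return out
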
torch::Tensor dequantize_q6_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    // data.numel%blk_size should be 0, else raise err
    int num_blocks = num_bytes / blk_size;

    const at::cuda::OptionalCUDAGuard device_guard(device);
    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({num_bytes}, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({num_blocks, 256}, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_q6_k_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_q6_k_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_q6_k_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

torch::Tensor dequantize_q5_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    int num_blocks = num_bytes / blk_size;
    const at::cuda::OptionalCUDAGuard device_guard(device);

    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({num_bytes}, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({num_blocks, 256}, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_q5_k_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_q5_k_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_q5_k_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

torch::Tensor dequantize_q4_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    // data.numel%blk_size should be 0, else raise err
    int num_blocks = num_bytes / blk_size;
    const at::cuda::OptionalCUDAGuard device_guard(device);

    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({num_bytes}, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({num_blocks, 256}, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_q4_k_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_q4_k_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_q4_k_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

torch::Tensor dequantize_q3_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    int num_blocks = num_bytes / blk_size;
    const at::cuda::OptionalCUDAGuard device_guard(device);

    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({num_bytes}, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({num_blocks, 256}, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_q3_k_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_q3_k_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_q3_k_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

torch::Tensor dequantize_q2_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    int num_blocks = num_bytes / blk_size;
    const at::cuda::OptionalCUDAGuard device_guard(device);

    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({num_bytes}, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({num_blocks, 256}, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_q2_k_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_q2_k_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_q2_k_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

torch::Tensor dequantize_iq4_xs(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype) {
    int num_blocks = num_bytes / blk_size;
    const at::cuda::OptionalCUDAGuard device_guard(device);

    auto options = torch::TensorOptions().dtype(torch::kInt8).device(device).memory_format(torch::MemoryFormat::Contiguous);
    auto data_gpu = torch::empty({num_bytes}, options);

    cudaMemcpy(data_gpu.data_ptr<int8_t>(), data, num_bytes, cudaMemcpyHostToDevice);
    //data_gpu.copy_(data, false);

    // Create output tensor
    auto output = torch::zeros({num_blocks, 256}, torch::dtype(target_dtype).device(device));

    switch (target_dtype) {
        case torch::kFloat16:
            dequantize_iq4_xs_fp16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (__half*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kBFloat16:
            dequantize_iq4_xs_bf16_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), (nv_bfloat16*)output.data_ptr(), blk_size, ele_per_blk, num_blocks);
            break;
        case torch::kFloat32:
            dequantize_iq4_xs_fp32_kernel<<<512, 256>>>(data_gpu.data_ptr<int8_t>(), output.data_ptr<float>(), blk_size, ele_per_blk, num_blocks);
            break;
        default:
            printf("target type not support\n");
            exit(0);
    }
    cudaDeviceSynchronize();
    return output;
}

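All of the host wrappers above follow the same pattern: copy num_bytes of raw GGUF blocks to the GPU, allocate a [num_blocks, ele_per_blk] output in the requested dtype, dispatch to the fp16/bf16/fp32 kernel, then synchronize. The block geometry they rely on is summarized below as a sketch; the byte counts are taken from the usual GGML block definitions rather than from this commit, so treat them as assumptions.

# Assumed (blk_size bytes, ele_per_blk elements) per GGML quant type handled above.
BLOCK_GEOMETRY = {
    "Q8_0":   (34, 32),    # fp16 d + 32 * int8
    "Q6_K":   (210, 256),  # ql[128] + qh[64] + scales[16] + fp16 d
    "Q5_K":   (176, 256),  # fp16 d, dmin + scales[12] + qh[32] + qs[128]
    "Q4_K":   (144, 256),  # fp16 d, dmin + scales[12] + qs[128]
    "Q3_K":   (110, 256),  # hmask[32] + qs[64] + scales[12] + fp16 d
    "Q2_K":   (84, 256),   # scales[16] + qs[64] + fp16 d, dmin
    "IQ4_XS": (136, 256),  # fp16 d + scales_h + scales_l[4] + qs[128]
}
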
@@ -1,11 +1,11 @@
/**
 * @Description  :
 * @Author       : Azure-Tang
 * @Date         : 2024-07-22 09:27:55
 * @Version      : 1.0.0
 * @LastEditors  : kkk1nak0
 * @LastEditTime : 2024-08-12 03:48:46
 * @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
 **/
#pragma once

@@ -13,10 +13,10 @@
#include <torch/extension.h>
#include <torch/torch.h>

torch::Tensor dequantize_q8_0(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q6_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q5_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q4_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q3_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_q2_k(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);
torch::Tensor dequantize_iq4_xs(const int8_t* data, const int num_bytes, const int blk_size, const int ele_per_blk, const torch::Device device, const torch::Dtype target_dtype);

16
ktransformers/ktransformers_ext/cuda/test_dequant.py
Normal file

@@ -0,0 +1,16 @@
import os
import sys
sys.path.insert(0,"/home/zbx/ktransformers")
from ktransformers.util.custom_gguf import GGUFLoader
import torch

gguf_loader_1 = GGUFLoader("/mnt/data/model/DeepseekV3-q4km-gguf")
gguf_loader_2 = GGUFLoader("/mnt/data/chenht/model/gguf_for_ktransformers/DeepSeek-V3-bf16/")

torch.set_default_dtype(torch.bfloat16)

tensor_1 = gguf_loader_1.load_gguf_tensor("blk.0.attn_kv_a_mqa.weight", "cuda")
tensor_2 = gguf_loader_2.load_gguf_tensor("blk.0.attn_kv_a_mqa.weight", "cuda")

print(tensor_1[0, -64:])
print(tensor_2[0, -64:])

@@ -90,7 +90,7 @@ def marlin_quantize(
    assert group_size <= size_k

    # Quantize (and apply act_order if provided)
    q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
                                                act_order)

    # For act_order, sort the "weights" and "g_idx" so that group ids are
@@ -107,7 +107,7 @@ def marlin_quantize(
                                      marlin_scale_perm_single[num_bits])

    # Create result
    res_list = [marlin_q_w, marlin_s, g_idx, sort_indices, rand_perm]
    for i in range(len(res_list)):
        res_list[i] = res_list[i].to(w.device)

@@ -11,8 +11,7 @@ def get_pack_factor(num_bits):
    return 32 // num_bits


def permute_rows(q_w: torch.Tensor, group_size: int):

    orig_device = q_w.device
    k_size, _ = q_w.shape
@@ -26,10 +25,8 @@ def permute_rows(q_w: torch.Tensor, group_size: int):

    g_idx = g_idx[rand_perm].contiguous()
    q_w = q_w[rand_perm, :].contiguous()

    return (
        q_w.to(device=orig_device),
        g_idx.to(device=orig_device),
        rand_perm.to(device=orig_device),
@@ -69,9 +66,6 @@ def quantize_weights(w: torch.Tensor, num_bits: int, group_size: int,
    q_w += half_q_val
    q_w = torch.clamp(q_w, 0, max_q_val)

    # Restore original shapes
    if group_size < size_k:
@@ -82,7 +76,6 @@ def quantize_weights(w: torch.Tensor, num_bits: int, group_size: int,
            return w

    q_w = reshape_w(q_w)

    s = s.reshape((-1, size_n)).contiguous()
@@ -95,10 +88,9 @@ def quantize_weights(w: torch.Tensor, num_bits: int, group_size: int,
    ), "For act_order, groupsize = {} must be less than size_k = {}".format(
        group_size, size_k)

        q_w, g_idx, rand_perm = permute_rows(q_w, group_size)

    return (
        q_w.to(device=orig_device),
        s.to(device=orig_device),
        g_idx.to(device=orig_device),

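The hunks above drop the dequantized reference tensor w_ref from quantize_weights and permute_rows, so callers now receive only the quantized weights, scales, and permutation metadata. For orientation, here is a minimal standalone sketch of the group-wise symmetric quantization idea these helpers implement; it is an illustration under simplified assumptions, not the repository's implementation.

# Hypothetical sketch of group-wise symmetric quantization without a w_ref output.
import torch

def groupwise_symmetric_quant(w: torch.Tensor, num_bits: int, group_size: int):
    """Quantize (k, n) weights in groups of `group_size` rows; returns q_w and scales only."""
    k, n = w.shape
    assert k % group_size == 0
    max_q = 2 ** num_bits - 1
    half_q = (max_q + 1) // 2
    wg = w.reshape(-1, group_size, n)                        # (k // group_size, group_size, n)
    s = wg.abs().amax(dim=1, keepdim=True) / half_q          # one scale per (group, column)
    q = torch.clamp(torch.round(wg / s) + half_q, 0, max_q)  # shift into the unsigned range
    return q.reshape(k, n).to(torch.int32), s.reshape(-1, n)
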
@@ -10,6 +10,8 @@
#include "kvcache.h"

#include <chrono>

void KVCache::attention_kvhead_(const uint16_t *q_in_data, ggml_fp16_t *output,
                                float *attn_lse, int batch_size,
                                Backend *backend) {
@@ -9,6 +9,9 @@
 **/

#include "kvcache.h"

#include <chrono>

void KVCache::load_kvcache(std::string tensor_file_path, Backend *backend) {
    // Timer start
    auto start = std::chrono::high_resolution_clock::now();
@@ -10,6 +10,8 @@
#include "kvcache.h"

#include <chrono>

void KVCache::get_anchor_one_block(ggml_fp16_t *anchor, int layer_id,
                                   int block_idx, Backend *backend) {
    // Timer start
@@ -10,6 +10,8 @@
#include "kvcache.h"

#include <chrono>

std::string ggml_type_to_string(ggml_type type) {
    switch (type) {
    case GGML_TYPE_F32:

193
ktransformers/ktransformers_ext/triton/fp8gemm.py
Normal file

@@ -0,0 +1,193 @@
# Adopted from https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/inference/kernel.py
from typing import Tuple

import torch
import triton
import triton.language as tl
from triton import Config


@triton.jit
def act_quant_kernel(x_ptr, y_ptr, s_ptr, BLOCK_SIZE: tl.constexpr):
    """
    Quantizes the input tensor `x_ptr` and stores the result in `y_ptr` and the scaling factor in `s_ptr`.

    Args:
        x_ptr (triton.Pointer): Pointer to the input tensor.
        y_ptr (triton.Pointer): Pointer to the output tensor where quantized values will be stored.
        s_ptr (triton.Pointer): Pointer to the output tensor where scaling factors will be stored.
        BLOCK_SIZE (tl.constexpr): The size of the block to be processed by each program instance.

    Returns:
        None
    """
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offs).to(tl.float32)
    s = tl.max(tl.abs(x)) / 448.
    y = x / s
    y = y.to(y_ptr.dtype.element_ty)
    tl.store(y_ptr + offs, y)
    tl.store(s_ptr + pid, s)


def act_quant(x: torch.Tensor, block_size: int = 128) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Quantizes the input tensor `x` using block-wise quantization.

    Args:
        x (torch.Tensor): The input tensor to be quantized. Must be contiguous and its last dimension size must be divisible by `block_size`.
        block_size (int, optional): The size of the blocks to be used for quantization. Default is 128.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: A tuple containing:
            - The quantized tensor with dtype `torch.float8_e4m3fn`.
            - A tensor of scaling factors with dtype `torch.float32`.
    """
    assert x.is_contiguous(), 'Input tensor must be contiguous'
    assert x.size(-1) % block_size == 0, f'Last dimension size must be divisible by block_size (block_size={block_size})'
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    s = x.new_empty(*x.size()[:-1], x.size(-1) // block_size, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']), )
    act_quant_kernel[grid](x, y, s, BLOCK_SIZE=block_size)
    return y, s


@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    """
    Dequantizes weights using the provided scaling factors and stores the result.

    Args:
        x_ptr (tl.pointer): Pointer to the quantized weights.
        s_ptr (tl.pointer): Pointer to the scaling factors.
        y_ptr (tl.pointer): Pointer to the output buffer for dequantized weights.
        M (int): Number of rows in the weight matrix.
        N (int): Number of columns in the weight matrix.
        BLOCK_SIZE (tl.constexpr): Size of the block for tiling.

    Returns:
        None
    """
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    n = tl.cdiv(N, BLOCK_SIZE)
    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs = offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    s = tl.load(s_ptr + pid_m * n + pid_n)
    y = x * s
    tl.store(y_ptr + offs, y, mask=mask)


def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    """
    Dequantizes the given weight tensor using the provided scale tensor.

    Args:
        x (torch.Tensor): The quantized weight tensor of shape (M, N).
        s (torch.Tensor): The scale tensor of shape (M, N).
        block_size (int, optional): The block size to use for dequantization. Defaults to 128.

    Returns:
        torch.Tensor: The dequantized weight tensor of the same shape as `x`.

    Raises:
        AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
    """
    assert x.is_contiguous() and s.is_contiguous(), 'Input tensors must be contiguous'
    assert x.dim() == 2 and s.dim() == 2, 'Input tensors must have 2 dimensions'
    M, N = x.size()
    y = torch.empty_like(x, dtype=torch.get_default_dtype())
    grid = lambda meta: (triton.cdiv(M, meta['BLOCK_SIZE']), triton.cdiv(N, meta['BLOCK_SIZE']))
    with torch.cuda.device(x.device):
        weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
    return y


fp8_gemm_configs = [
    Config({'BLOCK_SIZE_M': block_m, 'BLOCK_SIZE_N': block_n, 'BLOCK_SIZE_K': 128}, num_stages=num_stages, num_warps=8)
    for block_m in [16, 32, 64] for block_n in [32, 64, 128] for num_stages in [3, 4, 5, 6]
]

@triton.autotune(configs=fp8_gemm_configs, key=['N', 'K'])
@triton.jit
def fp8_gemm_kernel(a_ptr, b_ptr, c_ptr,
                    a_s_ptr, b_s_ptr,
                    M, N: tl.constexpr, K: tl.constexpr,
                    BLOCK_SIZE_M: tl.constexpr,
                    BLOCK_SIZE_N: tl.constexpr,
                    BLOCK_SIZE_K: tl.constexpr):
    """
    Performs a matrix multiplication operation on FP8 matrices with scaling factors.

    Args:
        a_ptr (tl.tensor): Pointer to the first input matrix A.
        b_ptr (tl.tensor): Pointer to the second input matrix B.
        c_ptr (tl.tensor): Pointer to the output matrix C.
        a_s_ptr (tl.tensor): Pointer to the scaling factors for matrix A.
        b_s_ptr (tl.tensor): Pointer to the scaling factors for matrix B.
        M (int): Number of rows in matrix A and C.
        N (tl.constexpr): Number of columns in matrix B and C.
        K (tl.constexpr): Number of columns in matrix A and rows in matrix B.
        BLOCK_SIZE_M (tl.constexpr): Block size for the M dimension.
        BLOCK_SIZE_N (tl.constexpr): Block size for the N dimension.
        BLOCK_SIZE_K (tl.constexpr): Block size for the K dimension.

    Returns:
        None
    """
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    k = tl.cdiv(K, BLOCK_SIZE_K)
    offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :]
    b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None]
    a_s_ptrs = a_s_ptr + offs_m * k
    b_s_ptrs = b_s_ptr + (offs_n // BLOCK_SIZE_K) * k

    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    for i in range(k):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - i * BLOCK_SIZE_K, other=0.0)
        a_s = tl.load(a_s_ptrs)
        b_s = tl.load(b_s_ptrs)
        accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]
        a_ptrs += BLOCK_SIZE_K
        b_ptrs += BLOCK_SIZE_K
        a_s_ptrs += 1
        b_s_ptrs += 1
    c = accumulator.to(c_ptr.dtype.element_ty)
    offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, c, mask=mask)


def fp8_gemm(a: torch.Tensor, a_s: torch.Tensor, b: torch.Tensor, b_s: torch.Tensor):
    """
    Perform a matrix multiplication using FP8 precision.

    Args:
        a (torch.Tensor): The first input matrix, must be contiguous.
        a_s (torch.Tensor): The scaling factor for the first input matrix, must be contiguous.
        b (torch.Tensor): The second input matrix, must be contiguous.
        b_s (torch.Tensor): The scaling factor for the second input matrix, must be contiguous.

    Returns:
        torch.Tensor: The result of the matrix multiplication.
    """
    assert a.is_contiguous() and b.is_contiguous(), 'Input tensors must be contiguous'
    assert a_s.is_contiguous() and b_s.is_contiguous(), 'Scaling factor tensors must be contiguous'
    K = a.size(-1)
    M = a.numel() // K
    N = b.size(0)
    c = a.new_empty(*a.size()[:-1], N, dtype=torch.get_default_dtype())
    grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']), triton.cdiv(N, META['BLOCK_SIZE_N']))
    fp8_gemm_kernel[grid](a, b, c, a_s, b_s, M, N, K)
    return c

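A brief usage sketch for the new Triton file above. It assumes a CUDA GPU and Triton build with FP8 support, that the module is importable from its file location, and that weight scales are laid out as one value per 128x128 tile (the layout the fp8_gemm_kernel indexing implies); the weight quantization here is done by hand purely for the demo.

# Hypothetical usage of the fp8gemm helpers; not part of the commit.
import torch
from ktransformers.ktransformers_ext.triton.fp8gemm import act_quant, weight_dequant, fp8_gemm  # assumed import path

torch.set_default_dtype(torch.bfloat16)
M, K, N, B = 4, 1024, 2048, 128                     # B = quantization block size

x = torch.randn(M, K, device="cuda")
w = torch.randn(N, K, device="cuda")

# Activations: one scale per 128-element block along the last dimension.
x_q, x_s = act_quant(x, block_size=B)               # x_q: float8_e4m3fn, x_s: (M, K // B)

# Weights: one scale per 128x128 tile, quantized by hand for this demo.
w_tiles = w.view(N // B, B, K // B, B).float()
w_s = w_tiles.abs().amax(dim=(1, 3)) / 448.0        # (N // B, K // B) tile scales
w_q = (w_tiles / w_s[:, None, :, None]).to(torch.float8_e4m3fn).view(N, K)

y = fp8_gemm(x_q, x_s, w_q, w_s)                    # (M, N), accumulated in fp32, returned in bf16
w_ref = weight_dequant(w_q, w_s)                    # approximate reconstruction of w
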
@@ -28,8 +28,9 @@ from ktransformers.models.modeling_qwen2_moe import Qwen2MoeForCausalLM
from ktransformers.models.modeling_deepseek_v3 import DeepseekV3ForCausalLM
from ktransformers.models.modeling_llama import LlamaForCausalLM
from ktransformers.models.modeling_mixtral import MixtralForCausalLM
from ktransformers.util.utils import prefill_and_generate, get_compute_capability
from ktransformers.server.config.config import Config
from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled

custom_models = {
    "DeepseekV2ForCausalLM": DeepseekV2ForCausalLM,
@@ -53,7 +54,7 @@ default_optimize_rules = {

def local_chat(
    model_path: str | None = None,
    optimize_config_path: str = None,
    gguf_path: str | None = None,
    max_new_tokens: int = 300,
    cpu_infer: int = Config().cpu_infer,
@@ -61,9 +62,9 @@ def local_chat(
    prompt_file : str | None = None,
    mode: str = "normal",
    force_think: bool = False,
    chunk_prefill_size: int = 8192
):

    torch.set_grad_enabled(False)

    Config().cpu_infer = cpu_infer
@@ -94,12 +95,12 @@ def local_chat(
        config, trust_remote_code=True, attn_implementation="flash_attention_2"
    )

    if optimize_config_path is None:
        if config.architectures[0] in default_optimize_rules:
            print("using default_optimize_rule for", config.architectures[0])
            optimize_config_path = default_optimize_rules[config.architectures[0]]
        else:
            optimize_config_path = input(
                "please input the path of your rule file(yaml file containing optimize rules):"
            )

@@ -107,18 +108,18 @@ def local_chat(
        gguf_path = input(
            "please input the path of your gguf file(gguf file in the dir containing input gguf file must all belong to current model):"
        )
    optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)

    try:
        model.generation_config = GenerationConfig.from_pretrained(model_path)
    except Exception as e:
        print(f"generation config can't auto create, make default. Message: {e}")
        gen_config = GenerationConfig(
            temperature=0.6,
            top_p=0.95,
            do_sample=True
        )
        model.generation_config = gen_config
    # model.generation_config = GenerationConfig.from_pretrained(model_path)
    if model.generation_config.pad_token_id is None:
        model.generation_config.pad_token_id = model.generation_config.eos_token_id
@@ -167,13 +168,17 @@ def local_chat(
    if mode == 'long_context':
        assert Config().long_context_config['max_seq_len'] > input_tensor.shape[1] + max_new_tokens, \
            "please change max_seq_len in ~/.ktransformers/config.yaml"

    if system != "Windows" and (config.architectures[0] == "DeepseekV2ForCausalLM" or config.architectures[0] == "DeepseekV3ForCausalLM") and flashinfer_enabled and get_compute_capability() >= 8:
        generated = prefill_and_generate(
            model, tokenizer, input_tensor.cuda(), max_new_tokens, use_cuda_graph, mode = mode, force_think = force_think, chunk_prefill_size = chunk_prefill_size,
            use_flashinfer_mla = True, num_heads = config.num_attention_heads, head_dim_ckv = config.kv_lora_rank, head_dim_kpe = config.qk_rope_head_dim, q_head_dim = config.qk_rope_head_dim + config.qk_nope_head_dim
        )
    else:
        generated = prefill_and_generate(
            model, tokenizer, input_tensor.cuda(), max_new_tokens, use_cuda_graph, mode = mode, force_think = force_think, chunk_prefill_size = chunk_prefill_size,
        )


if __name__ == "__main__":
    fire.Fire(local_chat)

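A hedged example of driving the updated entry point with the renamed optimize_config_path argument and the new chunk_prefill_size knob; the paths are placeholders and the direct-call form is equivalent to the fire-based CLI in __main__.

# Hypothetical invocation; paths are placeholders, not values from this commit.
from ktransformers.local_chat import local_chat

local_chat(
    model_path="deepseek-ai/DeepSeek-V3",      # placeholder HF config path
    gguf_path="/path/to/DeepSeek-V3-GGUF",     # placeholder GGUF directory
    optimize_config_path=None,                 # fall back to the default rule for the architecture
    max_new_tokens=300,
    chunk_prefill_size=8192,                   # new: prefill is processed in chunks of this many tokens
)
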
@@ -138,8 +138,6 @@ class StaticCache(transformers.StaticCache):
        page_idx = cache_position // self.page_size
        page_offset = cache_position % self.page_size
        # key shape (self.max_pages, self.page_size, 1, config.kv_lora_rank + config.qk_rope_head_dim)
        k_out[page_idx, page_offset, :, :self.kv_lora_rank] = key_states
        k_out[page_idx, page_offset, :, self.kv_lora_rank:] = value_states
        return k_out, self.page_table_list[layer_idx]
@@ -172,8 +170,21 @@ class StaticCache(transformers.StaticCache):
        for layer_idx in range(len(self.key_cache)):
            # In-place ops prevent breaking the static address
            self.key_cache[layer_idx].zero_()
            if self.value_cache[layer_idx] is not None:
                self.value_cache[layer_idx].zero_()
            self.past_tokens[layer_idx] = 0

    def remove_suffix(self, start_pos):
        for layer_idx in range(len(self.key_cache)):
            # In-place ops prevent breaking the static address
            if self.is_MLA:
                k_cache = self.key_cache[layer_idx]
                k_cache.view(-1, k_cache.shape[-1])[start_pos:].zero_()
            else:
                self.key_cache[layer_idx][..., start_pos:, :].zero_()
                self.value_cache[layer_idx][..., start_pos:, :].zero_()
            self.past_tokens[layer_idx] = start_pos

    def get_max_cache_shape(self) -> Tuple[int, int, int, int]:
        """Returns the maximum shape of the cache."""
        return self.max_cache_len

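The remove_suffix method added above zeroes everything from start_pos onward so a rolled-back or speculative suffix can be dropped without reallocating the static buffers. A small standalone illustration of the paged indexing it relies on (made-up numbers, not the class itself):

# Illustration only: how absolute token positions map into a paged KV buffer.
import torch

page_size = 256
cache_position = torch.tensor([1000, 1001, 1002])    # absolute token positions being written

page_idx = cache_position // page_size               # page each token lands in  -> [3, 3, 3]
page_offset = cache_position % page_size             # slot inside that page     -> [232, 233, 234]

# A (max_pages, page_size, heads, dim) buffer is then addressed as
# k_cache[page_idx, page_offset, :, :], which is what update() does above;
# remove_suffix() flattens the pages to (tokens, dim) and zeroes rows start_pos and later.
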
@@ -1742,8 +1742,7 @@ class DeepseekV2ForCausalLM(DeepseekV2PreTrainedModel):
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states[:,-1:,:]).float()

        loss = None
        if labels is not None:
@@ -1699,7 +1699,7 @@ class DeepseekV3ForCausalLM(DeepseekV3PreTrainedModel):
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states[:,-1:,:])
        logits = logits.float()

        loss = None

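Both hunks above stop projecting the full hidden-state sequence through lm_head; during generation only the final position is needed to sample the next token. A tiny illustration of the saving, with scaled-down made-up sizes:

# Illustration only; sizes are arbitrary, not model values.
import torch

bsz, seq_len, hidden, vocab = 1, 512, 1024, 32000
hidden_states = torch.randn(bsz, seq_len, hidden)
lm_head = torch.nn.Linear(hidden, vocab, bias=False)

logits_all = lm_head(hidden_states)               # (1, 512, 32000): every position projected
logits_last = lm_head(hidden_states[:, -1:, :])   # (1, 1, 32000): what the new code computes
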
@@ -42,7 +42,7 @@ class RotaryEmbedding(BaseInjectedModule, DeepseekV2RotaryEmbedding):
        **kwargs,
    ):
        BaseInjectedModule.__init__(
            self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
        )
        self.orig_module.__init__(
            orig_module.dim, orig_module.max_position_embeddings, orig_module.base
@@ -72,7 +72,7 @@ class RotaryEmbeddingV3(BaseInjectedModule):
        **kwargs,
    ):
        BaseInjectedModule.__init__(
            self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
        )
        self.generate_device = generate_device
        self.prefill_device = prefill_device
@@ -122,7 +122,7 @@ class RotaryEmbeddingV2(BaseInjectedModule, LlamaRotaryEmbedding):
        **kwargs,
    ):
        BaseInjectedModule.__init__(
            self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
        )
        self.orig_module.__init__(
            orig_module.dim,
@@ -160,7 +160,7 @@ class YarnRotaryEmbedding(BaseInjectedModule, DeepseekV2YarnRotaryEmbedding):
        **kwargs,
    ):
        BaseInjectedModule.__init__(
            self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
        )
        self.orig_module.__init__(
            orig_module.dim,
@@ -204,7 +204,7 @@ class YarnRotaryEmbedding(BaseInjectedModule, DeepseekV2YarnRotaryEmbedding):
    #     **kwargs,
    # ):
    #     BaseInjectedModule.__init__(
    #         self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
    #     )
    #     self.generate_device = generate_device
    #     self.prefill_device = prefill_device
@@ -230,7 +230,7 @@ class YarnRotaryEmbeddingV3(BaseInjectedModule):
        **kwargs,
    ):
        BaseInjectedModule.__init__(
            self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
        )
        self.generate_device = generate_device
        self.prefill_device = prefill_device
@@ -332,11 +332,12 @@ class DynamicNTKScalingRotaryEmbedding(
        gguf_loader: GGUFLoader,
        config: PretrainedConfig,
        orig_module: nn.Module,
        prefill_device: str = "cuda",
        generate_device: str = "cuda",
        **kwargs,
    ):
        BaseInjectedModule.__init__(
            self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs
        )
        self.orig_module.__init__(
            orig_module.dim,

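Every constructor in this file now forwards both prefill_device and generate_device to BaseInjectedModule.__init__. A schematic of the two-device pattern, heavily simplified and hypothetical; the real base class lives in ktransformers.operators.base_operator and is not reproduced here.

# Sketch of the two-device injection pattern; not the actual BaseInjectedModule.
class InjectedModuleSketch:
    def __init__(self, orig_module, prefill_device="cuda", generate_device="cuda"):
        self.orig_module = orig_module
        self.prefill_device = prefill_device      # device used for long prompt (prefill) passes
        self.generate_device = generate_device    # device used for token-by-token decoding

    def device_for(self, q_len: int) -> str:
        # Prefill batches many tokens at once; decode handles one token at a time.
        return self.prefill_device if q_len > 1 else self.generate_device
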
@@ -16,12 +16,17 @@ from ktransformers.models.modeling_deepseek import DeepseekV2Attention, apply_ro
from typing import Optional, Tuple
from ktransformers.operators.base_operator import BaseInjectedModule
from ktransformers.util.custom_gguf import GGUFLoader
from ktransformers.util.utils import get_compute_capability
import logging
from transformers.configuration_utils import PretrainedConfig
from transformers.cache_utils import Cache
from flash_attn import flash_attn_func
from ktransformers.operators.triton_attention import decode_attention_fwd_grouped
import os
from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled
if flashinfer_enabled:
    from ktransformers.operators.flashinfer_wrapper import MLAWrapperSingleton, attention_ref

logger = logging.getLogger("attention")

# Copied from transformers.models.llama.modeling_llama.rotate_half

@@ -41,29 +46,25 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
                 gguf_loader : GGUFLoader,
                 config: PretrainedConfig,
                 orig_module: nn.Module,
                 prefill_device: str = "cuda",
                 generate_device: str = "cuda",
                 chunck_size: int = 1000,
                 absorb_for_prefill: bool = False,
                 **kwargs):
        BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs)
        self.orig_module.__init__(orig_module.config,
                                  orig_module.layer_idx)
        self.chunck_size = chunck_size # TODO, generate chunck_size automatically.
        self.mla_wrapper = None
        self.absorb_for_prefill = absorb_for_prefill

    def get_absorbed(self) -> Tuple[torch.Tensor, torch.Tensor]:
        if not (hasattr(self, 'q_absorb') and hasattr(self, 'out_absorb')):
            kv_b_proj = self.kv_b_proj.weight.view(self.num_heads, -1, self.kv_lora_rank)
            self.q_absorb = kv_b_proj[:, :self.qk_nope_head_dim, :].view(self.num_heads, self.qk_nope_head_dim, self.kv_lora_rank)
            self.out_absorb = kv_b_proj[:, self.qk_nope_head_dim:, :].view(self.num_heads, self.v_head_dim, self.kv_lora_rank)

        return self.q_absorb, self.out_absorb

    def forward_chunck(
        self,

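The rewritten get_absorbed keeps the two halves of kv_b_proj as plain (num_heads, dim, kv_lora_rank) tensors instead of materializing nn.Linear modules. A sketch of why this "absorption" works for MLA decoding, using made-up sizes purely to show the shapes:

# Illustration of the absorption algebra; random tensors, arbitrary sizes.
import torch

num_heads, qk_nope, v_dim, kv_rank, q_len, kv_len = 8, 128, 128, 512, 1, 64

q_nope = torch.randn(num_heads, q_len, qk_nope)           # per-head query (no-RoPE part)
compressed_kv = torch.randn(kv_len, kv_rank)              # shared latent KV cache
q_absorb = torch.randn(num_heads, qk_nope, kv_rank)       # W_UK slice of kv_b_proj
out_absorb = torch.randn(num_heads, v_dim, kv_rank)       # W_UV slice of kv_b_proj

# Instead of expanding compressed_kv into per-head K/V of width qk_nope / v_dim,
# fold W_UK into the query and W_UV into the attention output:
q_latent = torch.matmul(q_nope, q_absorb)                 # (heads, q_len, kv_rank)
scores = torch.matmul(q_latent, compressed_kv.T)          # (heads, q_len, kv_len)
probs = scores.softmax(dim=-1)                            # softmax scaling omitted in this sketch
ctx_latent = torch.matmul(probs, compressed_kv)           # (heads, q_len, kv_rank)
attn_out = torch.matmul(ctx_latent, out_absorb.mT)        # (heads, q_len, v_dim)
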
@ -99,7 +100,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
|
||||||
if past_key_value is not None:
|
if past_key_value is not None:
|
||||||
if self.layer_idx is None:
|
if self.layer_idx is None:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
|
f"The cache structure has changed since transformer version v4.36. If you are using {self.__class__.__name__} "
|
||||||
"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
|
"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
|
||||||
"with a layer index."
|
"with a layer index."
|
||||||
)
|
)
|
||||||
|
@ -123,8 +124,6 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
|
||||||
# compressed_kv [pages, page_size, 1, self.kv_lora_rank]
|
# compressed_kv [pages, page_size, 1, self.kv_lora_rank]
|
||||||
|
|
||||||
q_absorb, out_absorb = self.get_absorbed()
|
q_absorb, out_absorb = self.get_absorbed()
|
||||||
if hasattr(self.orig_module, 'kv_b_proj'):
|
|
||||||
del self.orig_module.kv_b_proj
|
|
||||||
|
|
||||||
# q_nope [bsz, self.num_heads, q_len, self.qk_nope_head_dim]
|
# q_nope [bsz, self.num_heads, q_len, self.qk_nope_head_dim]
|
||||||
# q_pe [bsz, self.num_heads, q_len, self.qk_rope_head_dim]
|
# q_pe [bsz, self.num_heads, q_len, self.qk_rope_head_dim]
|
||||||
|
@ -139,6 +138,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
|
||||||
#print(compressed_kv.shape)
|
#print(compressed_kv.shape)
|
||||||
|
|
||||||
attn_weights = (torch.matmul(q_pe, k_pe.mT) + torch.matmul(q_nope, compressed_kv.mT)) * self.softmax_scale
|
attn_weights = (torch.matmul(q_pe, k_pe.mT) + torch.matmul(q_nope, compressed_kv.mT)) * self.softmax_scale
|
||||||
|
|
||||||
#attn_weights [bsz, self.num_heads, q_len, kv_seq_len]
|
#attn_weights [bsz, self.num_heads, q_len, kv_seq_len]
|
||||||
compressed_kv = compressed_kv.squeeze(1)
|
compressed_kv = compressed_kv.squeeze(1)
|
||||||
"""
|
"""
|
||||||
|
@ -166,8 +166,9 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
|
||||||
attn_weights = nn.functional.dropout(
|
attn_weights = nn.functional.dropout(
|
||||||
attn_weights, p=self.attention_dropout, training=self.training
|
attn_weights, p=self.attention_dropout, training=self.training
|
||||||
)
|
)
|
||||||
|
|
||||||
attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)
|
attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)
|
||||||
|
|
||||||
attn_output = torch.matmul(attn_output, out_absorb.mT)
|
attn_output = torch.matmul(attn_output, out_absorb.mT)
|
||||||
|
|
||||||
if attn_output.size() != (bsz, self.num_heads, q_len, self.v_head_dim):
|
if attn_output.size() != (bsz, self.num_heads, q_len, self.v_head_dim):
|
||||||
|
@ -177,14 +178,14 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
|
||||||
)
|
)
|
||||||
|
|
||||||
attn_output = attn_output.transpose(1, 2).contiguous()
|
attn_output = attn_output.transpose(1, 2).contiguous()
|
||||||
|
|
||||||
attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
|
attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
|
||||||
|
|
||||||
attn_output = self.o_proj(attn_output)
|
attn_output = self.o_proj(attn_output)
|
||||||
|
|
||||||
return attn_output, None, past_key_value
|
return attn_output, None, past_key_value
|
||||||
|
|
||||||
def forward_linux(
|
def forward_linux_triton(
|
||||||
self,
|
self,
|
||||||
hidden_states: torch.Tensor,
|
hidden_states: torch.Tensor,
|
||||||
attention_mask: Optional[torch.Tensor] = None,
|
attention_mask: Optional[torch.Tensor] = None,
|
||||||
|
@@ -214,6 +215,16 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 compressed_kv = self.kv_a_layernorm(compressed_kv)
 k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim)
 compressed_kv = compressed_kv.view(bsz, q_len, 1, self.kv_lora_rank)

+kv_seq_len = q_len
+if past_key_value is not None:
+if self.layer_idx is None:
+raise ValueError(
+f"The cache structure has changed since transformer version v4.36. If you are using {self.__class__.__name__} "
+"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+"with a layer index."
+)
+kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)

 cos, sin = self.rotary_emb(q_pe, position_ids)
 q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, unsqueeze_dim=2)
@@ -234,7 +245,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 q_nope = q_nope.transpose(1, 2) # q_len is 1, no GPU overhead, same below
 q_nope = torch.matmul(q_nope, q_absorb) # batched MM
 q_nope = q_nope.transpose(1, 2)
-assert q_nope.is_contiguous()
+#assert q_nope.is_contiguous()

 # q_nope [bsz, q_len, self.num_heads, self.kv_lora_rank]
 # q_pe [bsz, q_len, self.num_heads, self.qk_rope_head_dim]
@@ -265,7 +276,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 # use triton attention kernel adapted from vLLM and SGLang for MQA
 decode_attention_fwd_grouped(query_states, compressed_kv_with_k_pe, compressed_kv, attn_output,
 page_table,
-position_ids.squeeze(0).to(torch.int32), attn_logits,
+position_ids.squeeze(0).to(torch.int32)+1, attn_logits,
 4, #num_kv_splits # follow vLLM, fix it TODO
 self.softmax_scale,
 past_key_value.page_size)
@@ -274,6 +285,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 # out_absorb [self.num_heads, self.v_head_dim, self.kv_lora_rank]
 attn_output = attn_output.transpose(1, 2)
 attn_output = torch.matmul(attn_output, out_absorb.mT)
+attn_output = attn_output.transpose(1, 2)

 attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
 attn_output = self.o_proj(attn_output)
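The two hunks above keep the "absorbed" MLA decode path: queries are projected into the kv_lora_rank latent space with q_absorb before attention, and the latent output is mapped back to value space with out_absorb afterwards. This works because the decompression matrices can be folded into the query and output sides by associativity. A minimal self-contained sketch of that identity for the no-RoPE half, with made-up sizes and names (not the module's real tensors or softmax scaling):

    import torch

    h, d_nope, d_v, r, L = 2, 8, 8, 16, 5          # heads, q dim, v dim, kv_lora_rank, cached tokens
    q = torch.randn(h, 1, d_nope)                   # one decode query per head
    kv_b = torch.randn(h, d_nope + d_v, r)          # stand-in for the kv_b_proj weight
    q_absorb, out_absorb = kv_b[:, :d_nope, :], kv_b[:, d_nope:, :]
    c = torch.randn(L, r)                           # compressed (latent) KV cache

    # Naive path: decompress per-head keys/values from the latent cache, then attend.
    k = torch.einsum('lr,hdr->hld', c, q_absorb)
    v = torch.einsum('lr,hvr->hlv', c, out_absorb)
    w = torch.softmax(q @ k.transpose(1, 2), dim=-1)
    out_naive = w @ v

    # Absorbed path: fold the projections into q and the output, attend in latent space.
    q_lat = q @ q_absorb                            # [h, 1, r]
    w_lat = torch.softmax(q_lat @ c.T, dim=-1)
    out_absorbed = (w_lat @ c) @ out_absorb.mT

    torch.testing.assert_close(out_naive, out_absorbed, rtol=1e-4, atol=1e-4)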
@@ -285,26 +297,202 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
 k_pe.squeeze(0)
 compressed_kv.squeeze(0)
-past_key_value.update(compressed_kv, k_pe, self.layer_idx, cache_kwargs)
-k_pe.unsqueeze(0)
-compressed_kv.unsqueeze(0)
-k_pe = k_pe[:, :q_len]
-compressed_kv = compressed_kv[:, :q_len]
+compressed_kv_with_k_pe, _ = past_key_value.update(compressed_kv, k_pe, self.layer_idx, cache_kwargs)
+compressed_kv, k_pe = torch.split(
+compressed_kv_with_k_pe, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
+)
+k_pe = k_pe.view(bsz, -1, self.qk_rope_head_dim)
+k_pe = k_pe[:, :kv_seq_len]
+compressed_kv = compressed_kv.view(bsz, -1, self.kv_lora_rank)
+compressed_kv = compressed_kv[:, :kv_seq_len]
 kv = (
 self.kv_b_proj(compressed_kv)
-.view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
+.view(bsz, kv_seq_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
 )
 k_nope, value_states = torch.split(kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
 query_states = k_pe.new_empty(bsz, q_len, self.num_heads, self.q_head_dim)
 query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
 query_states[:, :, :, self.qk_nope_head_dim :] = q_pe

-key_states = k_pe.new_empty(bsz, q_len, self.num_heads, self.q_head_dim)
+key_states = k_pe.new_empty(bsz, kv_seq_len, self.num_heads, self.q_head_dim)
 key_states[:, :, :, :self.qk_nope_head_dim] = k_nope
-key_states[:, :, :, self.qk_nope_head_dim:] = k_pe
+key_states[:, :, :, self.qk_nope_head_dim:] = k_pe.view(bsz, kv_seq_len, 1, -1)

-value_states = value_states.view(bsz, q_len, self.num_heads, self.v_head_dim)
+value_states = value_states.view(bsz, kv_seq_len, self.num_heads, self.v_head_dim)
+value_states_padded = torch.nn.functional.pad(value_states, [0, query_states.shape[-1] - value_states.shape[-1]], value=0)
+
+attn_output = flash_attn_func(
+query_states,
+key_states,
+value_states_padded,
+softmax_scale=self.softmax_scale,
+causal=True,
+)
+
+if self.q_head_dim != self.v_head_dim:
+attn_output = attn_output[:, :, :, : self.v_head_dim]
+
+attn_output = attn_output.reshape(
+bsz, q_len, self.num_heads * self.v_head_dim
+).contiguous()
+attn_output = self.o_proj(attn_output)
+return attn_output, None, past_key_value
+
+def forward_linux_flashinfer(
+self,
+hidden_states: torch.Tensor,
+attention_mask: Optional[torch.Tensor] = None,
+position_ids: Optional[torch.Tensor] = None,
+past_key_value: Optional[Cache] = None,
+output_attentions: bool = False,
+use_cache: bool = False,
+cache_position: Optional[torch.Tensor] = None,
+**kwargs,
+) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+
+bsz, q_len, _ = hidden_states.size()
+
+if self.q_lora_rank is None:
+q = self.q_proj(hidden_states)
+else:
+q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
+q = q.view(bsz, q_len, self.num_heads, self.q_head_dim)
+q_nope, q_pe = torch.split(
+q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1
+)
+
+compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
+compressed_kv, k_pe = torch.split(
+compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
+)
+compressed_kv = self.kv_a_layernorm(compressed_kv)
+k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim)
+compressed_kv = compressed_kv.view(bsz, q_len, 1, self.kv_lora_rank)
+
+kv_seq_len = q_len
+if past_key_value is not None:
+if self.layer_idx is None:
+raise ValueError(
+f"The cache structure has changed since version transformer verision v4.36. If you are using {self.__class__.__name__} "
+"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+"with a layer index."
+)
+kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+
+cos, sin = self.rotary_emb(q_pe, position_ids)
+q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, unsqueeze_dim=2)
+# q_pe [bsz, q_len, self.num_heads, self.qk_rope_head_dim] k_pe [bsz, q_len, 1, self.qk_rope_head_dim]
+
+# decode
+if q_len == 1 or self.absorb_for_prefill:
+if past_key_value is not None:
+cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
+compressed_kv_with_k_pe, page_table = past_key_value.update(compressed_kv, k_pe, self.layer_idx, cache_kwargs)
+compressed_kv = compressed_kv_with_k_pe [:, :, :, :self.kv_lora_rank].view(-1, past_key_value.page_size, self.kv_lora_rank)
+k_pe = compressed_kv_with_k_pe [:, :, :, self.kv_lora_rank:].view(-1, past_key_value.page_size, self.qk_rope_head_dim)
+# k_pe [max_pages, page_size, self.qk_rope_head_dim]
+# compressed_kv [max_pages, page_size, self.kv_lora_rank]
+
+# q_nope [bsz, q_len, self.num_heads, self.qk_nope_head_dim]
+# q_absorb [self.num_heads, self.qk_nope_head_dim, self.kv_lora_rank]
+q_absorb, out_absorb = self.get_absorbed()
+q_nope = q_nope.transpose(1, 2) # q_len is 1, no GPU overhead, same below
+q_nope = torch.matmul(q_nope, q_absorb) # batched MM
+q_nope = q_nope.transpose(1, 2)
+q_nope = q_nope.contiguous()
+#assert q_nope.is_contiguous()
+
+# q_nope [bsz, q_len, self.num_heads, self.kv_lora_rank]
+# q_pe [bsz, q_len, self.num_heads, self.qk_rope_head_dim]
+q_nope.squeeze_(0)
+q_pe.squeeze_(0)
+
+# flash attn doesn't support head_dim bigger than 256, use flashinfer
+if self.mla_wrapper is None:
+self.mla_wrapper = MLAWrapperSingleton.get_instance(self.device, 1, past_key_value.max_pages, use_cuda_graph = True)
+if self.mla_wrapper.need_plan:
+self.mla_wrapper.need_plan = False
+if q_len == 1:
+self.mla_wrapper.plan(None,None,None,
+position_ids.squeeze(1)+1,
+self.num_heads,
+self.kv_lora_rank,
+self.qk_rope_head_dim,
+past_key_value.page_size,
+self.softmax_scale,
+q_nope.dtype,
+compressed_kv.dtype)
+else:
+qo_indptr = torch.tensor([0, q_len], dtype=torch.int32, device=self.device)
+kv_len_arr = torch.tensor([position_ids[0, -1].item()+1], dtype=torch.int32, device=self.device)
+self.mla_wrapper.plan(qo_indptr,None,None,
+kv_len_arr,
+self.num_heads,
+self.kv_lora_rank,
+self.qk_rope_head_dim,
+past_key_value.page_size,
+self.softmax_scale,
+q_nope.dtype,
+compressed_kv.dtype)
+attn_output = self.mla_wrapper.run(q_nope, q_pe, compressed_kv, k_pe).view(bsz, q_len, self.num_heads, self.kv_lora_rank)
+"""
+k = (
+torch.cat([compressed_kv, k_pe], dim=-1)
+.view(-1, 1, 512 + 64)
+.repeat_interleave(self.num_heads, dim=1)
+)
+v = compressed_kv.view(-1, 1, 512).repeat_interleave(self.num_heads, dim=1)
+lens = position_ids.item() + 1
+#print("lens", lens)
+attn_ref, lse_ref = attention_ref(
+1,
+torch.cat([q_nope, q_pe], dim=-1),
+k[:lens],
+v[:lens],
+False,
+self.softmax_scale
+)
+attn_output = attn_ref.view(bsz, q_len, self.num_heads, self.kv_lora_rank)
+"""
+
+# mla_wrapper run output: [tokens, self.num_heads, self.kv_lora_rank]
+# attn_output [bsz, q_len, self.num_heads, self.kv_lora_rank]
+# out_absorb [self.num_heads, self.v_head_dim, self.kv_lora_rank]
+attn_output = attn_output.transpose(1, 2) # [bsz, self.num_heads, q_len, self.kv_lora_rank]
+attn_output = torch.matmul(attn_output, out_absorb.mT) # [bsz, self.num_heads, q_len, self.v_head_dim]
+attn_output = attn_output.transpose(1, 2).contiguous() # [bsz, q_len, self.num_heads, self.kv_lora_rank]
+
+attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim) # [bsz, q_len, self.num_heads * self.v_head_dim]
+attn_output = self.o_proj(attn_output)
+
+return attn_output, None, past_key_value
+else:
+if past_key_value is not None:
+cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
+k_pe.squeeze(0)
+compressed_kv.squeeze(0)
+compressed_kv_with_k_pe, _ = past_key_value.update(compressed_kv, k_pe, self.layer_idx, cache_kwargs)
+compressed_kv, k_pe = torch.split(
+compressed_kv_with_k_pe, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
+)
+k_pe = k_pe.view(bsz, -1, self.qk_rope_head_dim)
+k_pe = k_pe[:, :kv_seq_len]
+compressed_kv = compressed_kv.view(bsz, -1, self.kv_lora_rank)
+compressed_kv = compressed_kv[:, :kv_seq_len]
+kv = (
+self.kv_b_proj(compressed_kv)
+.view(bsz, kv_seq_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
+)
+k_nope, value_states = torch.split(kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
+query_states = k_pe.new_empty(bsz, q_len, self.num_heads, self.q_head_dim)
+query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
+query_states[:, :, :, self.qk_nope_head_dim :] = q_pe
+
+key_states = k_pe.new_empty(bsz, kv_seq_len, self.num_heads, self.q_head_dim)
+key_states[:, :, :, :self.qk_nope_head_dim] = k_nope
+key_states[:, :, :, self.qk_nope_head_dim:] = k_pe.view(bsz, kv_seq_len, 1, -1)
+
+value_states = value_states.view(bsz, kv_seq_len, self.num_heads, self.v_head_dim)
 value_states_padded = torch.nn.functional.pad(value_states, [0, query_states.shape[-1] - value_states.shape[-1]], value=0)

 attn_output = flash_attn_func(
@@ -401,7 +589,8 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 cache_position: Optional[torch.LongTensor] = None,
 **kwargs,
 ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
-if os.name == 'nt':
+if os.name == 'nt' or get_compute_capability()<8:
+print("for Windows or GPU before ampere, use forward_windows")
 return self.forward_windows(
 hidden_states,
 attention_mask,
@@ -413,16 +602,28 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
 **kwargs,
 )
 else:
-return self.forward_linux(
-hidden_states,
-attention_mask,
-position_ids,
-past_key_value,
-output_attentions,
-use_cache,
-cache_position,
-**kwargs,
-)
+if flashinfer_enabled:
+return self.forward_linux_flashinfer(
+hidden_states,
+attention_mask,
+position_ids,
+past_key_value,
+output_attentions,
+use_cache,
+cache_position,
+**kwargs,
+)
+else:
+return self.forward_linux_triton(
+hidden_states,
+attention_mask,
+position_ids,
+past_key_value,
+output_attentions,
+use_cache,
+cache_position,
+**kwargs,
+)


 class KLlamaAttention(BaseInjectedModule):
@@ -433,9 +634,10 @@ class KLlamaAttention(BaseInjectedModule):
 gguf_loader : GGUFLoader,
 config: PretrainedConfig,
 orig_module: nn.Module,
-device: str = "cuda",
+prefill_device: str = "cuda",
+generate_device: str = "cuda",
 **kwargs):
-BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, device, **kwargs)
+BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs)
 self.orig_module.__init__(orig_module.config,
 orig_module.layer_idx)
 def apply_rotary_pos_emb(self, q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
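Taken together, the forward hunks above leave three code paths, selected at runtime. A condensed sketch of the selection order (illustrative only; get_compute_capability is the repo helper referenced in the diff, and the stand-in below uses torch's device query instead):

    import os
    import torch

    def pick_forward_path(flashinfer_enabled: bool) -> str:
        # Stand-in for ktransformers' get_compute_capability(); 8 means Ampere or newer.
        cc = torch.cuda.get_device_properties(0).major if torch.cuda.is_available() else 0
        if os.name == 'nt' or cc < 8:
            return "forward_windows"            # Windows or pre-Ampere GPU fallback
        if flashinfer_enabled:
            return "forward_linux_flashinfer"   # MLA decode via the flashinfer wrapper
        return "forward_linux_triton"           # Triton MQA kernel adapted from vLLM/SGLang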
@@ -16,14 +16,17 @@ class BaseInjectedModule(nn.Module):
 gguf_loader : GGUFLoader,
 config: PretrainedConfig,
 orig_module: nn.Module,
-device: str = "cuda",
+prefill_device: str = "cuda",
+generate_device: str = "cuda",
 **kwargs):
 nn.Module.__init__(self)
 nn.Module.__setattr__(self, "orig_module", orig_module)
 object.__setattr__(self, "key", key)
 object.__setattr__(self, "gguf_loader", gguf_loader)
 object.__setattr__(self, "config", config)
-object.__setattr__(self, "device", device)
+object.__setattr__(self, "prefill_device", prefill_device)
+object.__setattr__(self, "generate_device", generate_device)
+object.__setattr__(self, "device", generate_device)

 def __getattr__(self, name: str) -> Any:
 # __getattr__ in nn.Module doesn't call super().__getattribute__ when name is not in nn.Module.__dict__,
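With this change BaseInjectedModule records both devices and keeps self.device pointing at the generate device. A hedged sketch of how a subclass is now expected to call it, modeled on the KLlamaAttention, KTransformersExperts and KMoEGate hunks in this commit (MyInjectedOp is hypothetical):

    import torch.nn as nn

    class MyInjectedOp(BaseInjectedModule):
        def __init__(self, key, gguf_loader, config, orig_module: nn.Module,
                     prefill_device: str = "cuda", generate_device: str = "cuda", **kwargs):
            # Pass both devices through; the base class stores prefill_device and
            # generate_device and aliases self.device to generate_device.
            BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module,
                                        prefill_device, generate_device, **kwargs)
            self.prefill_device = prefill_device
            self.generate_device = generate_device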
@@ -18,6 +18,7 @@ import torch.nn.functional as F
 import torch
 import sys, os
 from ktransformers.operators.base_operator import BaseInjectedModule
+from tqdm import tqdm

 sys.path.append(os.path.join(os.path.dirname(__file__), "..", "ktransformers_ext", "build"))
 sys.path.append(os.path.join(os.path.dirname(__file__), "..", "ktransformers_ext", "build", "Release"))
@@ -118,6 +119,7 @@ class KExpertsCPU(KExpertsBase):
 output_cpu:Tensor = None
 output_gpu_map:dict = {} # Manage output tensor buffer on different gpu
 #stream_map:dict = {} # Manage cuda stream on different gpu
+#gguf_loader:GGUFLoader = None
 CPU_INFER = CPUInfer(Config().cpu_infer)
 def __init__(
 self,
@@ -131,6 +133,9 @@ class KExpertsCPU(KExpertsBase):
 **kwargs
 ):
 super().__init__(key, gguf_loader, config, orig_module, device, **kwargs)
+#if KExpertsCPU.gguf_loader is None:
+# KExpertsCPU.gguf_loader = GGUFLoader("/mnt/data/model/DeepseekV3-q4km-gguf")
+self.gguf_loader = gguf_loader
 assert device.lower() == "cpu", "KExpertsCPU can only be loaded on CPU"
 self.n_routed_experts = n_routed_experts
 self.out_device = out_device
@@ -154,7 +159,7 @@ class KExpertsCPU(KExpertsBase):
 down_ptr = ctypes.addressof(
 ctypes.cast(self.down.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents
 )
-# print(self.gate_qtype, self.up_qtype, self.down_qtype)
+#print(self.gate_type, self.up_type, self.down_type)
 n_routed_experts = self.n_routed_experts
 # n_routed_experts = len(self.orig_module)
 moe_config = MOEConfig(
@@ -225,6 +230,7 @@ class KExpertsCPU(KExpertsBase):
 return

 def load_weights(self, override_key: str | None = None, device: str = "cpu"):
+# TODO: support Bias
 res = {}
 if override_key is not None:
 keys = override_key
@@ -239,7 +245,16 @@ class KExpertsCPU(KExpertsBase):
 down_type = None

 for key in keys:
-if key + ".ffn_gate_exps.weight" in self.gguf_loader.tensor_info:
+if self.gguf_loader.safetensor_loader is not None:
+# using a temp ugly way to temprary load the tensor
+gate = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_gate_exps.weight").numpy()
+up = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_up_exps.weight").numpy()
+down = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_down_exps.weight").numpy()
+gate_type = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_gate_exps.ggml_type").item()
+up_type = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_up_exps.ggml_type").item()
+down_type = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_down_exps.ggml_type").item()
+
+elif key + ".ffn_gate_exps.weight" in self.gguf_loader.tensor_info:
 gate = self.gguf_loader.get_mmap_tensor(key + ".ffn_gate_exps.weight")
 up = self.gguf_loader.get_mmap_tensor(key + ".ffn_up_exps.weight")
 down = self.gguf_loader.get_mmap_tensor(key + ".ffn_down_exps.weight")
@@ -288,6 +303,8 @@ class KExpertsMarlin(KExpertsBase):
 self.act_fn = ACT2FN[config.hidden_act]
 assert device.lower() != "cpu", "Marlin experts can only be loaded on GPU"
 self.device = device
+self.elements_per_tensor = config.moe_intermediate_size * config.hidden_size

 # create empty marlin experts according to the number of experts per token
 # up
 self.up_projs = [KLinearMarlin(key+ "." + "ffn_up_exps", gguf_loader, config, device=device) for i in range(self.expert_num)]
@@ -299,17 +316,34 @@ class KExpertsMarlin(KExpertsBase):
 def load(self, w: dict | nn.Parameter | tuple | None = None, device: str | None = None, warmup: bool = False):
 if device is None: device = self.device
 assert device.lower() != "cpu", "Marlin experts can only be loaded on GPU"
-if w is None: w = self.load_weights()[self.key]
-if isinstance(w, dict):
-self.gate = w["gate"]
-self.up = (w["up"])
-self.down = (w["down"])
-for i in range(self.expert_num):
-self.up_projs[i].load(nn.Parameter(self.up[i,...]), device=device)
-self.gate_projs[i].load(nn.Parameter(self.gate[i,...]), device=device)
-self.down_projs[i].load(nn.Parameter(self.down[i,...]), device=device)
-self.loaded_experts_idx.append(i)
+if w is None:
+w = self.load_weights()
+load_by_experts = True
+
+if load_by_experts:
+if isinstance(w, dict):
+self.gate = w["gate"]
+self.up = (w["up"])
+self.down = (w["down"])
+for i in tqdm(range(self.expert_num), desc=f"Dequanting and quanting for KExpertsMarlin {self.key}"):
+up_weights = self.gguf_loader.load_expert_tensor(self.key + ".ffn_up_exps.weight", self.up, i, self.elements_per_tensor, device=self.device)
+gate_weights = self.gguf_loader.load_expert_tensor(self.key + ".ffn_gate_exps.weight", self.gate, i, self.elements_per_tensor, device=self.device)
+down_weights = self.gguf_loader.load_expert_tensor(self.key + ".ffn_down_exps.weight", self.down, i, self.elements_per_tensor, device=self.device)
+
+self.up_projs[i].load(nn.Parameter(up_weights), device=device)
+self.gate_projs[i].load(nn.Parameter(gate_weights), device=device)
+self.down_projs[i].load(nn.Parameter(down_weights), device=device)
+self.loaded_experts_idx.append(i)
+else:
+if isinstance(w, dict):
+self.gate = w["gate"]
+self.up = (w["up"])
+self.down = (w["down"])
+for i in range(self.expert_num):
+self.up_projs[i].load(nn.Parameter(self.up[i,...]), device=device)
+self.gate_projs[i].load(nn.Parameter(self.gate[i,...]), device=device)
+self.down_projs[i].load(nn.Parameter(self.down[i,...]), device=device)
+self.loaded_experts_idx.append(i)
 return

 def unload(self):
@@ -329,20 +363,13 @@ class KExpertsMarlin(KExpertsBase):
 gate = None
 up = None
 down = None
-gate_type = None
-up_type = None
-down_type = None

 for key in keys:
 if key + ".ffn_gate_exps.weight" in self.gguf_loader.tensor_info:
-gate = self.gguf_loader.load_gguf_tensor(key + ".ffn_gate_exps.weight")
-up = self.gguf_loader.load_gguf_tensor(key + ".ffn_up_exps.weight")
-down = self.gguf_loader.load_gguf_tensor(key + ".ffn_down_exps.weight")
-gate_type = self.gguf_loader.tensor_info[key + ".ffn_gate_exps.weight"]["ggml_type"]
-up_type = self.gguf_loader.tensor_info[key + ".ffn_up_exps.weight"]["ggml_type"]
-down_type = self.gguf_loader.tensor_info[key + ".ffn_down_exps.weight"]["ggml_type"]
-# tensors = self.load_multi(key, [".ffn_gate_exps.weight", ".ffn_up_exps.weight", ".ffn_down_exps.weight"])
-res = {key:{"gate": nn.Parameter(gate), "up": nn.Parameter(up), "down": nn.Parameter(down), "gate_type": gate_type, "up_type": up_type, "down_type": down_type}}
+gate = self.gguf_loader.get_mmap_tensor(key + ".ffn_gate_exps.weight")
+up = self.gguf_loader.get_mmap_tensor(key + ".ffn_up_exps.weight")
+down = self.gguf_loader.get_mmap_tensor(key + ".ffn_down_exps.weight")
+res = {"gate": gate, "up": up, "down": down}
 return res

 def forward(self, hidden_states_cpu: torch.Tensor, selected_experts_cpu: torch.Tensor, routing_weights_cpu: torch.Tensor) -> torch.Tensor:
@@ -381,6 +408,7 @@ class KExpertsMarlin(KExpertsBase):

 return final_hidden_states.to(dtype=org_dtype, device=org_device)

+# untested, CUDA OOM
 class KExpertsTorch(KExpertsBase):
 expert_num: int
 loaded_experts_idx: list[int]
@@ -402,19 +430,39 @@ class KExpertsTorch(KExpertsBase):
 # self.loaded_experts_idx = []
 self.act_fn = ACT2FN[config.hidden_act]
 self.device = device
-self.gate = None
-self.up = None
-self.donw = None
+self.elements_per_tensor = config.moe_intermediate_size * config.hidden_size
+self.gate = [None for _ in range(self.expert_num)]
+self.up = [None for _ in range(self.expert_num)]
+self.down = [None for _ in range(self.expert_num)]
 self.dtype = torch.get_default_dtype()

 def load(self, w: dict | nn.Parameter | tuple | None = None, device: str | None = None, warmup: bool = False):
 if device is None: device = self.device
-if w is None: w = self.load_weights(device=device)[self.key]
-if isinstance(w, dict):
-self.gate = w["gate"].to(device=device, dtype=self.dtype)
-self.up = w["up"].to(device=device, dtype=self.dtype)
-self.down = w["down"].to(device=device, dtype=self.dtype)
+if w is None:
+w = self.load_weights()
+load_by_experts = True
+
+if load_by_experts:
+if isinstance(w, dict):
+for i in tqdm(range(self.expert_num), desc=f"Dequanting for KExpertsTorch {self.key}"):
+up_weights = self.gguf_loader.load_expert_tensor(self.key + ".ffn_up_exps.weight", w["up"], i, self.elements_per_tensor, device=self.device)
+gate_weights = self.gguf_loader.load_expert_tensor(self.key + ".ffn_gate_exps.weight", w["gate"], i, self.elements_per_tensor, device=self.device)
+down_weights = self.gguf_loader.load_expert_tensor(self.key + ".ffn_down_exps.weight", w["down"], i, self.elements_per_tensor, device=self.device)
+
+self.up[i] = up_weights
+self.gate[i] = gate_weights
+self.down[i] = down_weights
+else:
+if isinstance(w, dict):
+for i in range(self.expert_num):
+self.gate[i] = w["gate"][i, ...].to(device=device, dtype=self.dtype)
+self.up[i] = w["up"][i, ...].to(device=device, dtype=self.dtype)
+self.down[i] = w["down"][i, ...].to(device=device, dtype=self.dtype)
+
+self.up = torch.stack(self.up, dim=0)
+self.gate = torch.stack(self.gate, dim=0)
+self.down = torch.stack(self.down, dim=0)
+return

 def unload(self):
 if self.gate is not None:
@@ -422,6 +470,25 @@ class KExpertsTorch(KExpertsBase):
 self.up = None
 self.down = None

+def load_weights(self, override_key: str | None = None):
+res = {}
+if override_key is not None:
+keys = override_key
+else:
+keys = [self.key]
+
+gate = None
+up = None
+down = None
+
+for key in keys:
+if key + ".ffn_gate_exps.weight" in self.gguf_loader.tensor_info:
+gate = self.gguf_loader.get_mmap_tensor(key + ".ffn_gate_exps.weight")
+up = self.gguf_loader.get_mmap_tensor(key + ".ffn_up_exps.weight")
+down = self.gguf_loader.get_mmap_tensor(key + ".ffn_down_exps.weight")
+res = {"gate": gate, "up": up, "down": down}
+return res
+
 def forward(self, hidden_states_cpu: torch.Tensor, selected_experts_cpu: torch.Tensor, routing_weights_cpu: torch.Tensor) -> torch.Tensor:

 org_device = hidden_states_cpu.device
@@ -478,7 +545,7 @@ class KTransformersExperts(BaseInjectedModule, KExpertsBase):
 generate_device: str = "cpu",
 generate_op: str | None = "KExpertsCPU",
 **kwargs):
-BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, generate_device, **kwargs)
+BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs)
 KExpertsBase.__init__(self, key, gguf_loader, config, orig_module, generate_device, **kwargs)
 if generate_op is not None:
 self.generate_experts = EXPERTS_MAP[generate_op](key, gguf_loader, config, len(orig_module), device=generate_device, **kwargs)
@@ -582,7 +649,7 @@ class KQwen2MoeSparseMoeBlock(BaseInjectedModule, Qwen2MoeSparseMoeBlock):

 if isinstance(self.experts, KExpertsBase):
 y = (
-self.moe_on_cpuinfer(
+self.moe_kexperts(
 hidden_states_expert, selected_experts_expert, routing_weights_expert
 )
 .view(*orig_shape)
@@ -601,8 +668,7 @@ class KQwen2MoeSparseMoeBlock(BaseInjectedModule, Qwen2MoeSparseMoeBlock):
 return y, router_logits

 @torch.no_grad()
-def moe_on_cpuinfer(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
-outs = torch.empty_like(x)
+def moe_kexperts(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
 outs = self.experts(x, topk_ids, topk_weight)
 return outs
@@ -672,7 +738,7 @@ class KDeepseekV2MoE(BaseInjectedModule, DeepseekV2MoE):
 y_ = self.shared_experts(identity).squeeze(0)

 if isinstance(self.experts, KExpertsBase):
-y = self.moe_on_cpuinfer(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
+y = self.moe_kexperts(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
 elif hidden_states.size(0) > 10:
 # TODO may bugs here
 y = (
@@ -692,8 +758,7 @@ class KDeepseekV2MoE(BaseInjectedModule, DeepseekV2MoE):
 return y

 @torch.no_grad()
-def moe_on_cpuinfer(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
-outs = torch.empty_like(x)
+def moe_kexperts(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
 outs = self.experts(x, topk_ids, topk_weight)
 return outs
@@ -773,7 +838,7 @@ class KDeepseekV3MoE(BaseInjectedModule, DeepseekV3MoE):
 y_ = self.shared_experts(identity).squeeze(0)

 if isinstance(self.experts, KExpertsBase):
-y = self.moe_on_cpuinfer(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
+y = self.moe_kexperts(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
 elif hidden_states.size(0) > 10:
 # TODO may bugs here
 y = (
@@ -793,8 +858,7 @@ class KDeepseekV3MoE(BaseInjectedModule, DeepseekV3MoE):
 return y

 @torch.no_grad()
-def moe_on_cpuinfer(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
-outs = torch.empty_like(x)
+def moe_kexperts(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
 outs = self.experts(x, topk_ids, topk_weight)
 return outs
@@ -881,7 +945,7 @@ class KMistralSparseMoEBlock(BaseInjectedModule, MixtralSparseMoeBlock):

 if isinstance(self.experts, KExpertsBase):
 y = (
-self.moe_on_cpuinfer(
+self.moe_kexperts(
 hidden_states_expert, selected_experts_expert, routing_weights_expert
 )
 .view(*orig_shape)
@@ -900,8 +964,7 @@ class KMistralSparseMoEBlock(BaseInjectedModule, MixtralSparseMoeBlock):
 return y, router_logits

 @torch.no_grad()
-def moe_on_cpuinfer(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
-outs = torch.empty_like(x)
+def moe_kexperts(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
 outs = self.experts(x, topk_ids, topk_weight)
 return outs
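All four MoE blocks now delegate to a uniformly named moe_kexperts helper and drop the unused torch.empty_like pre-allocation, since the experts backend returns the combined output itself. As a point of reference, a dense (and deliberately slow) equivalent of what experts(x, topk_ids, topk_weight) computes might look like the following sketch; the shapes and the expert_mlps list are assumptions for illustration only:

    import torch

    @torch.no_grad()
    def dense_moe_reference(expert_mlps, x, topk_ids, topk_weight):
        # x: [tokens, hidden]; topk_ids/topk_weight: [tokens, k]; expert_mlps: list of callables.
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for j in range(topk_ids.shape[1]):
                e = int(topk_ids[t, j])
                out[t] += topk_weight[t, j].to(x.dtype) * expert_mlps[e](x[t])
        return out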
291
ktransformers/operators/flashinfer_wrapper.py
Normal file

@@ -0,0 +1,291 @@
+'''
+Description : flashinfer MLA wrapper
+Author : Boxin Zhang
+Version : 0.2.2
+'''
+import torch
+
+flashinfer_enabled = False
+
+try:
+import flashinfer
+flashinfer_enabled = True
+print("found flashinfer")
+
+except ImportError:
+print("flashinfer not found, use triton for linux")
+
+import math
+
+def attention_ref(
+batch_size,
+q: torch.Tensor,
+k: torch.Tensor,
+v: torch.Tensor,
+causal: bool,
+sm_scale: float,
+) -> torch.Tensor:
+qo_len = q.shape[0] // batch_size
+kv_len = k.shape[0] // batch_size
+num_qo_heads = q.shape[1]
+head_dim_qk = q.shape[2]
+head_dim_vo = v.shape[2]
+logits = (
+torch.einsum(
+"bmhd,bnhd->bhmn",
+q.view(batch_size, qo_len, num_qo_heads, head_dim_qk).float(),
+k.view(batch_size, kv_len, num_qo_heads, head_dim_qk).float(),
+)
+* sm_scale
+)
+
+#print("attn weights", logits)
+
+if causal:
+mask = (
+torch.arange(kv_len - qo_len, kv_len).unsqueeze(1)
+>= torch.arange(0, kv_len).unsqueeze(0)
+).to(q.device)
+else:
+mask = torch.ones(qo_len, kv_len).to(q.device)
+
+logits = logits.masked_fill(mask.unsqueeze(0).unsqueeze(0) == 0, float("-inf"))
+lse_ref = torch.logsumexp(logits, -1).transpose(-1, -2)
+p = torch.softmax(logits, dim=-1)
+o_ref = (
+torch.einsum(
+"bhmn,bnhd->bmhd",
+p,
+v.view(batch_size, kv_len, num_qo_heads, head_dim_vo).float(),
+)
+.contiguous()
+.view(batch_size * qo_len, num_qo_heads, head_dim_vo)
+.to(q)
+)
+
+return o_ref, lse_ref * math.log2(math.e)
+
+class MLAWrapper():
+def __init__(self,
+max_batch_size,
+max_pages,
+use_cuda_graph = True,
+device = "cuda",
+):
+self.float_workspace_buffer = torch.empty(128*1024*1024, dtype=torch.int8, device=device)
+self.max_batch_size = max_batch_size
+self.max_pages = max_pages
+if use_cuda_graph:
+if self.max_batch_size == 1:
+self.qo_indptr_buf = torch.arange(0, max_batch_size+1, dtype=torch.int32, device=device)
+self.kv_indptr_buf = torch.tensor([0, max_pages], dtype=torch.int32, device=device)
+self.kv_indices_buf = torch.arange(0, max_pages, dtype=torch.int32, device=device)
+else:
+self.qo_indptr_buf = torch.empty(max_batch_size+1, dtype=torch.int32, device=device)
+self.kv_indptr_buf = torch.empty(max_batch_size+1, dtype=torch.int32, device=device)
+self.kv_indices_buf = torch.empty(max_pages, dtype=torch.int32, device=device)
+self.kv_len_arr_buf = torch.empty(max_batch_size, dtype=torch.int32, device=device)
+else:
+self.qo_indptr_buf = None
+self.kv_indptr_buf = None
+self.kv_indices_buf = None
+self.kv_len_arr_buf = None
+self.wrapper = flashinfer.mla.BatchMLAPagedAttentionWrapper(
+self.float_workspace_buffer,
+use_cuda_graph=False,
+qo_indptr=self.qo_indptr_buf,
+kv_indptr=self.kv_indptr_buf,
+kv_indices=self.kv_indices_buf,
+kv_len_arr=self.kv_len_arr_buf,
+)
+self.need_plan = True
+
+def plan(self,
+qo_indptr,
+kv_indptr,
+kv_indices,
+kv_len_arr,
+num_heads,
+head_dim_ckv,
+head_dim_kpe,
+page_size,
+sm_scale,
+q_data_type,
+kv_data_type,
+):
+if qo_indptr is None:
+assert self.max_batch_size == 1
+qo_indptr = self.qo_indptr_buf
+if kv_indptr is None:
+assert self.max_batch_size == 1
+kv_indptr = self.kv_indptr_buf
+if kv_indices is None:
+assert self.max_batch_size == 1
+kv_indices = self.kv_indices_buf
+
+self.wrapper.plan(
+qo_indptr,
+kv_indptr,
+kv_indices,
+kv_len_arr,
+num_heads,
+head_dim_ckv,
+head_dim_kpe,
+page_size,
+True, # causal
+sm_scale,
+q_data_type,
+kv_data_type,
+)
+
+def run(self, q_nope, q_pe, ckv, k_pe, return_lse = False):
+#print("run")
+#print(self.wrapper._qo_indptr_buf)
+#print(self.wrapper._kv_indptr_buf)
+#print(self.wrapper._kv_indices_buf)
+#print(self.wrapper._kv_len_arr_buf)
+return self.wrapper.run(q_nope, q_pe, ckv, k_pe, return_lse = return_lse)
+
+class MLAWrapperSingleton():
+wrappers:dict = {}
+
+@classmethod
+def get_instance(cls, device, *args, **kwargs)->MLAWrapper:
+if device not in cls.wrappers:
+cls.make_instance(device, *args, **kwargs)
+return cls.wrappers[device]
+
+@classmethod
+def make_instance(cls, device, *args, **kwargs):
+cls.wrappers[device] = MLAWrapper(*args, **kwargs, device=device)
+
+@classmethod
+def plan_all(cls, qo_indptr,
+kv_indptr,
+kv_indices,
+kv_len_arr,
+num_heads,
+head_dim_ckv,
+head_dim_kpe,
+page_size,
+sm_scale,
+q_data_type,
+kv_data_type,):
+for device, wrapper in cls.wrappers.items():
+kv_len_arr_cur_device = kv_len_arr.to(device)
+wrapper.plan(qo_indptr,
+kv_indptr,
+kv_indices,
+kv_len_arr_cur_device,
+num_heads,
+head_dim_ckv,
+head_dim_kpe,
+page_size,
+sm_scale,
+q_data_type,
+kv_data_type,)
+wrapper.need_plan = False
+
+@classmethod
+def need_plan_all(cls):
+for device, wrapper in cls.wrappers.items():
+wrapper.need_plan = True
+
+@classmethod
+def reset_buffer(cls):
+for device, wrapper in cls.wrappers.items():
+wrapper.qo_indptr_buf[1] = 1 # assert max_batch_size=1 here.
+
+@classmethod
+def update_buffer(cls, max_pages):
+for device, wrapper in cls.wrappers.items():
+wrapper.kv_indptr_buf[1] = max_pages # assert max_batch_size=1 here.
+wrapper.kv_indices_buf = torch.arange(0, max_pages, dtype=torch.int32, device=device)
+wrapper.wrapper._kv_indices_buf = wrapper.kv_indices_buf
+
+
+if __name__ == "__main__":
+torch.set_default_dtype(torch.bfloat16)
+max_batch_size = 1
+max_pages = 128
+page_size = 64
+num_heads = 128
+
+kv_len = 4023
+q_len = 1
+q_nope = torch.randn((q_len, num_heads, 512), dtype=torch.bfloat16, device="cuda")
+q_pe = torch.randn((q_len, num_heads, 64), dtype=torch.bfloat16, device="cuda")
+ckv = torch.randn((max_pages, page_size, 512), dtype=torch.bfloat16, device="cuda")
+k_pe = torch.randn((max_pages, page_size, 64), dtype=torch.bfloat16, device="cuda")
+
+
+wrapper = MLAWrapperSingleton.get_instance(
+"cuda",
+max_batch_size,
+max_pages,
+)
+
+kv_len_arr = torch.tensor([kv_len], dtype=torch.int32, device="cuda")
+qo_indptr = torch.tensor([0, q_len], dtype=torch.int32, device="cuda")
+wrapper.plan(
+qo_indptr,
+None,
+None,
+kv_len_arr,
+128,
+512,
+64,
+page_size,
+192 ** (-0.5),
+torch.bfloat16,
+torch.bfloat16,
+)
+
+attn_output = wrapper.run(q_nope, q_pe, ckv, k_pe)
+print(attn_output.shape)
+
+graph = torch.cuda.CUDAGraph()
+with torch.cuda.graph(graph):
+attn_output = wrapper.run(q_nope, q_pe, ckv, k_pe)
+
+kv_len = 6789
+kv_len_arr = torch.tensor([kv_len], dtype=torch.int32, device="cuda")
+qo_indptr = torch.tensor([0, q_len], dtype=torch.int32, device="cuda")
+wrapper.plan(
+qo_indptr,
+None,
+None,
+kv_len_arr,
+128,
+512,
+64,
+page_size,
+192 ** (-0.5),
+torch.bfloat16,
+torch.bfloat16,
+)
+
+graph.replay()
+
+k = (
+torch.cat([ckv, k_pe], dim=-1)
+.view(-1, 1, 512 + 64)
+.repeat_interleave(num_heads, dim=1)
+)
+v = ckv.view(-1, 1, 512).repeat_interleave(num_heads, dim=1)
+
+print(k[:kv_len].shape)
+print(v[:kv_len].shape)
+
+attn_ref, lse_ref = attention_ref(
+max_batch_size,
+torch.cat([q_nope, q_pe], dim=-1),
+k[:kv_len],
+v[:kv_len],
+True,
+192 ** (-0.5)
+)
+print(attn_ref.shape)
+
+torch.testing.assert_close(attn_output, attn_ref, rtol=1e-3, atol=1e-3)
+print("test past")
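For reference, attention_ref in the new file computes ordinary scaled-dot-product attention per batch and head and is used in the __main__ block to cross-check the flashinfer MLA kernel. In formula form (with mask m_{ij} = 0 for allowed positions and -infinity otherwise, matching the causal branch):

    \[
      L_{ij} = \text{sm\_scale}\; q_i^{\top} k_j, \qquad
      O_i = \sum_j \operatorname{softmax}_j\!\bigl(L_{ij} + m_{ij}\bigr)\, v_j, \qquad
      \text{lse}_i = \log_2 \sum_j e^{\,L_{ij} + m_{ij}},
    \]

the factor math.log2(math.e) in the return statement is what converts the natural-log logsumexp into the base-2 value shown above.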
@@ -67,7 +67,14 @@ class KMoEGateBase(ABC):

 for key in keys:
 key = ".".join(key.split(".")[:-1])
-if key + ".ffn_gate_inp.weight" in self.gguf_loader.tensor_info:
+if self.gguf_loader.safetensor_loader is not None:
+targets = [".ffn_gate_inp.weight", ".exp_probs_b.bias"]
+weight = self.gguf_loader.safetensor_loader.load_tensor(key + ".ffn_gate_inp.weight")
+e_score_correction_bias = self.gguf_loader.safetensor_loader.load_tensor(key + ".exp_probs_b.bias")
+weight_type = weight.dtype
+e_score_correction_bias_type = e_score_correction_bias.dtype
+res = {"weight": weight, "e_score_correction_bias": e_score_correction_bias, "weight_type": weight_type, "e_score_correction_bias_type": e_score_correction_bias_type}
+elif key + ".ffn_gate_inp.weight" in self.gguf_loader.tensor_info:
 targets = [".ffn_gate_inp.weight", ".exp_probs_b.bias"]
 tensors = self.load_multi(key, targets, device=device)
 weight = tensors[".ffn_gate_inp.weight"]
@@ -93,11 +100,11 @@ class KMoEGate(BaseInjectedModule, KMoEGateBase):
 gguf_loader: GGUFLoader,
 config: PretrainedConfig,
 orig_module: nn.Module = None,
-generate_device: str = "cuda",
 prefill_device: str = "cuda",
+generate_device: str = "cuda",
 **kwargs,
 ):
-BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, generate_device, **kwargs)
+BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs)
 KMoEGateBase.__init__(self, key, gguf_loader, config, orig_module, generate_device, **kwargs)
 self.generate_device = generate_device
 self.prefill_device = prefill_device
@@ -116,8 +123,8 @@ class KMoEGate(BaseInjectedModule, KMoEGateBase):
 self.orig_module.e_score_correction_bias = nn.Parameter(w["e_score_correction_bias"])
 else:
 raise ValueError("Invalid weight type")
-self.orig_module.weight = self.orig_module.weight.to(device)
-self.orig_module.e_score_correction_bias = self.orig_module.e_score_correction_bias.to(device)
+self.orig_module.weight = nn.Parameter(self.orig_module.weight.to(device))
+self.orig_module.e_score_correction_bias = nn.Parameter(self.orig_module.e_score_correction_bias.to(device))

 def unload(self):
 if self.weight is not None:
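The change above matters because a Tensor.to() call that actually converts device or dtype returns a plain tensor, so the nn.Parameter wrapper is lost, and a stock nn.Module will refuse to accept a bare tensor for a registered parameter. A small illustrative check, not taken from the repo:

    import torch
    import torch.nn as nn

    m = nn.Linear(4, 4)
    t = m.weight.to(torch.float64)        # a converting .to() returns a plain Tensor
    print(isinstance(t, nn.Parameter))    # False: the Parameter wrapper is gone

    try:
        m.weight = t                      # nn.Module rejects a bare Tensor for a registered Parameter
    except TypeError as e:
        print("rejected:", e)

    m.weight = nn.Parameter(t)            # re-wrapping, as the diff now does, keeps it a Parameter
    print(isinstance(m.weight, nn.Parameter))  # True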
@@ -21,10 +21,12 @@ from ktransformers.ktransformers_ext.operators.custom_marlin.quantize.utils.marl
 MarlinWorkspace,
 marlin_quantize,
 GPTQ_MARLIN_MIN_THREAD_N,
+GPTQ_MARLIN_MIN_THREAD_K,
 GPTQ_MARLIN_MAX_PARALLEL,
 )
 from ktransformers.operators.base_operator import BaseInjectedModule
 from transformers.configuration_utils import PretrainedConfig
+from ktransformers.ktransformers_ext.triton.fp8gemm import fp8_gemm, act_quant, weight_dequant
 from abc import ABC, abstractmethod
 import sys, os
 sys.path.append(os.path.join(os.path.dirname(__file__), "..", "ktransformers_ext", "build"))
@@ -64,6 +66,8 @@ class KLinearBase(ABC):
 self.in_features = self.gguf_loader.tensor_info[key + ".weight"]["shape"][0]
 self.out_features = self.gguf_loader.tensor_info[key + ".weight"]["shape"][1]

+self.loaded = False # for lm_head pre-load, TODO: use new way to do lm_head pre-load when layer wise prefill.
+
 @abstractmethod
 def forward(self, x: torch.Tensor) -> torch.Tensor:
 pass
@@ -75,7 +79,13 @@ class KLinearBase(ABC):
 keys = [self.key]

 for key in keys:
-if key + ".weight" in self.gguf_loader.tensor_file_map:
+if self.gguf_loader.safetensor_loader is not None:
+# using safetensor_loader
+tensor = self.gguf_loader.safetensor_loader.load_tensor(key+'.weight')
+weight_scale_inv = self.gguf_loader.safetensor_loader.load_tensor(key+'.weight_scale_inv')
+return nn.Parameter(tensor), nn.Parameter(weight_scale_inv)
+
+elif key + ".weight" in self.gguf_loader.tensor_file_map:
 if key + ".bias" in self.gguf_loader.tensor_file_map:
 tensors = self.load_multi(key, ["weight", "bias"], device=device)
 tensor = tensors["weight"]
@@ -119,7 +129,7 @@ class KLinearTorch(KLinearBase):
 super().__init__(key, gguf_loader, config, orig_module, device, **kwargs)
 self.has_bias = False
 self.dtype = torch.get_default_dtype()
-self.w = None
+self.weight = None
 self.has_bias = False

 def forward(self, x: torch.Tensor) -> torch.Tensor:
@@ -127,44 +137,100 @@ class KLinearTorch(KLinearBase):
 out_device = x.device
 # TODO: support CUDA Graph when using cpu, but CPUInfer is recommended.
 x = x.to(device=self.device, dtype=self.dtype)
-x = x @ self.w
+x = x @ self.weight
 if self.has_bias:
 x = x + self.bias
 x = x.to(dtype=dtype, device=out_device)
 return x

 def load(self, w: dict | nn.Parameter | tuple | None = None, device: str|None = None):
+if self.loaded: return
 if device is None: device = self.device
 if w is None: w = self.load_weight(device=device)
 # else: self.out_features = w.shape[0], self.in_features = w.shape[1]

 if isinstance(w, nn.Parameter):
 try:
-self.w = w.to(dtype=self.dtype).view(self.out_features, self.in_features).T
+self.weight = w.to(dtype=self.dtype).view(self.out_features, self.in_features).T
 except:
-self.w = w.to(dtype=self.dtype).T
+self.weight = w.to(dtype=self.dtype).T
 self.has_bias = False
 elif isinstance(w, tuple):
 try:
-self.w = w[0].to(dtype=self.dtype).view(self.out_features, self.in_features).T
+self.weight = w[0].to(dtype=self.dtype).view(self.out_features, self.in_features).T
 except:
-self.w = w[0].to(dtype=self.dtype).T
+self.weight = w[0].to(dtype=self.dtype).T
 self.bias = w[1].to(dtype=self.dtype)
 self.has_bias = True
 else:
 raise ValueError("Invalid weight type")
 # self.linear = self.linear.to(device)
-self.w = self.w.to(device)
+self.weight = self.weight.to(device)
 if self.has_bias:
 self.bias = self.bias.to(device)
+self.loaded = True

 def unload(self):
-if self.w is not None:
-self.w = None
+if self.weight is not None:
+self.weight = None
 if self.has_bias:
 self.bias = None

+class KLinearFP8(KLinearBase):
+# this kernel requires special handling for weight
+# Please load the weight file downloaded from KVCache.AI
+marlin_q_w: torch.Tensor
+marlin_s: torch.Tensor
+g_idx: torch.Tensor
+sort_indices: torch.Tensor
+has_bias: bool
+weight: torch.Tensor
+scale_w: torch.Tensor
+bias: torch.Tensor
+def __init__(
+self,
+key: str,
+gguf_loader: GGUFLoader,
+config: PretrainedConfig,
+orig_module: nn.Module = None,
+device: str = "cuda",
+block_size: int = 128,
+**kwargs,
+):
+super().__init__(key, gguf_loader, config, orig_module, device, **kwargs)
+self.has_bias = False
+self.dtype = torch.get_default_dtype()
+self.block_size = block_size
+
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+x = x.to(self.device)
+orig_dtype = x.dtype
+x_quantized, scale_x = act_quant(x, self.block_size)
+y = fp8_gemm(x_quantized, scale_x, self.weight, self.weight_scale_inv)
+return y.to(dtype=orig_dtype)
+
+def load(self, w: dict | nn.Parameter | tuple | None = None, device: str|None = None):
+if device is None: device = self.device
+if w is None:
+w = self.load_weight(device=device)
+### TODO fit weight_inv format
|
||||||
|
if isinstance(w, tuple):
|
||||||
|
self.weight = w[0].to(device)
|
||||||
|
self.weight_scale_inv = w[1].to(device)
|
||||||
|
self.has_bias = False
|
||||||
|
else:
|
||||||
|
raise ValueError("Invalid weight type")
|
||||||
|
self.weight = self.weight.to(device)
|
||||||
|
if self.has_bias:
|
||||||
|
self.bias = self.bias.to(device)
|
||||||
|
|
||||||
|
def unload(self):
|
||||||
|
if self.weight is not None:
|
||||||
|
self.weight = None
|
||||||
|
if self.has_bias:
|
||||||
|
self.bias = None
|
||||||
|
|
||||||
|
|
||||||
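For orientation, here is a slow-path sketch of what the new KLinearFP8.forward computes. It assumes the weight is stored as FP8 blocks with per-block scales in weight_scale_inv and oriented [out_features, in_features], and that weight_dequant (imported above) reconstructs the full-precision weight; the real path keeps the data in FP8 and rescales inside the Triton fp8_gemm kernel.

    # illustrative reference only; orientation and scale layout are assumptions, not taken from this commit
    def fp8_linear_reference(x, weight, weight_scale_inv):
        w = weight_dequant(weight, weight_scale_inv)  # assumed to return the dequantized weight
        return (x.to(w.dtype) @ w.T).to(x.dtype)      # same result KLinearFP8.forward targets, minus the fused quantized GEMM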
 class KLinearMarlin(KLinearBase):
     marlin_q_w: torch.Tensor
     marlin_s: torch.Tensor
@@ -190,20 +256,36 @@ class KLinearMarlin(KLinearBase):
         self.group_size = group_size
         self.act_order = act_order
         self.is_k_full = is_k_full
+        self.padding = False
+        self.orin_in_features = self.in_features
+        self.orin_out_features = self.out_features
+        if self.in_features%GPTQ_MARLIN_MIN_THREAD_K!=0 or self.out_features%GPTQ_MARLIN_MIN_THREAD_K!=0:
+            #print(f"warning!, in_features={in_features} or out_features={out_features} is undivisible by GPTQ_MARLIN_MIN_THREAD_K={GPTQ_MARLIN_MIN_THREAD_K} and GPTQ_MARLIN_MIN_THREAD_N={GPTQ_MARLIN_MIN_THREAD_N}, padding")
+            self.padding = True
+            self.in_features = (self.in_features+GPTQ_MARLIN_MIN_THREAD_K-1)//GPTQ_MARLIN_MIN_THREAD_K*GPTQ_MARLIN_MIN_THREAD_K
+            self.out_features = (self.out_features+GPTQ_MARLIN_MIN_THREAD_N-1)//GPTQ_MARLIN_MIN_THREAD_N*GPTQ_MARLIN_MIN_THREAD_N
+            #print(f"After padding: in_features={in_features}, out_features={out_features}")
+
+        self.k = self.in_features
+        self.n = self.out_features

     def load(self, w: dict | nn.Parameter | tuple | None = None, device: str|None = None):
+        if self.loaded: return
         if device is None: device = self.device
         assert device.lower() != "cpu", "Marlin quantized linear only supports GPU device"
+
+        #if self.in_features * self.out_features:
         if w is None:
             w = self.load_weight(device=device)

         if isinstance(w, nn.Parameter):
             # pad weight
-            weight = w.view(self.out_features, self.in_features).T
+            weight = w.view(self.orin_out_features, self.orin_in_features).T
             self.has_bias = False
         elif isinstance(w, tuple):
             w = list(w)
-            weight = w[0].view(self.out_features, self.in_features).T
+            weight = w[0].view(self.orin_out_features, self.orin_in_features).T
+            self.bias = w[1].view(self.orin_out_features)
             self.bias = w[1]
             self.has_bias = True
         else:
@@ -211,19 +293,27 @@ class KLinearMarlin(KLinearBase):
         weight = weight.to(device)
         if self.has_bias:
             self.bias = self.bias.to(device)
+
+        if self.padding:
+            padded_weight = torch.zeros(self.in_features, self.out_features, device=self.device)
+            padded_weight[:self.orin_in_features, :self.orin_out_features] = weight
+            weight = padded_weight
+
         # Pack Marlin linear
-        w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
+        marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
             weight, self.num_bits, self.group_size, self.act_order
         )
         self.workspace = MarlinWorkspace(
             self.out_features, GPTQ_MARLIN_MIN_THREAD_N, GPTQ_MARLIN_MAX_PARALLEL,self.device
         )
+        self.weight = marlin_q_w # modeling_xxx.py may use linear.weight
         self.marlin_q_w = marlin_q_w
         self.marlin_s = marlin_s
         self.g_idx = g_idx
         self.sort_indices = sort_indices
         self.k = weight.shape[0]
         self.n = weight.shape[1]
+        self.loaded = True

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         # Only support input x as BF16 and FP16
@@ -231,6 +321,11 @@ class KLinearMarlin(KLinearBase):
         orig_shape = list(x.shape)
         orig_dtype = x.dtype
         x = x.reshape(-1, orig_shape[-1])
+        x = x.reshape(-1, x.shape[-1])
+        if self.padding:
+            padding_input=torch.empty(x.shape[0], self.in_features, device=x.device, dtype=x.dtype)
+            padding_input[:,:self.orin_in_features] = x
+            x = padding_input
         marlin_s = self.marlin_s.to(x.dtype)
         x = KTransformersOps.gptq_marlin_gemm(
             x,
@@ -245,6 +340,11 @@ class KLinearMarlin(KLinearBase):
             x.shape[-1],
             self.is_k_full,
         )
+        if self.padding:
+            x = x[:,:self.orin_out_features]
+            orig_shape[-1] = self.orin_out_features
+        else:
+            orig_shape[-1] = self.out_features
         if self.has_bias:
             x = x + self.bias
         orig_shape[-1] = self.n
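The padding added above is the usual align-to-tile trick: features are rounded up to the Marlin thread-tile sizes, the input is zero-padded before the GEMM, and the extra output columns are sliced away afterwards. A tiny standalone illustration (the concrete tile value below is an assumption for the example; the real constants are GPTQ_MARLIN_MIN_THREAD_K/N):

    def round_up(x: int, multiple: int) -> int:
        return (x + multiple - 1) // multiple * multiple

    TILE = 64                    # assumed tile size, for illustration only
    print(round_up(1000, TILE))  # 1024: the padded in_features the kernel actually sees
    print(round_up(500, TILE))   # 512: extra output columns are cut back to 500 after the matmul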
@@ -365,7 +465,8 @@ class KLinearCPUInfer(KLinearBase):
 LINEAR_MAP = {
     "KLinearMarlin": KLinearMarlin,
     "KLinearTorch": KLinearTorch,
-    "KLinearCPUInfer": KLinearCPUInfer
+    "KLinearCPUInfer": KLinearCPUInfer,
+    "KLinearFP8": KLinearFP8,
 }

 class KTransformersLinear(BaseInjectedModule, KLinearBase):
@@ -382,29 +483,18 @@ class KTransformersLinear(BaseInjectedModule, KLinearBase):
         prefill_op: str| None = "KLinearTorch",
         **kwargs,
     ):
-        BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, generate_device, **kwargs)
+        BaseInjectedModule.__init__(self, key, gguf_loader, config, orig_module, prefill_device, generate_device, **kwargs)
         KLinearBase.__init__(self, key, gguf_loader, config, orig_module, generate_device, **kwargs)
         # build all the linear operators
         if prefill_op is not None:
             assert prefill_op in LINEAR_MAP, f"linear_type {prefill_op} not supported"
-            if prefill_op == "KLinearMarlin" and (orig_module.in_features%GPTQ_MARLIN_MIN_THREAD_N!=0 or orig_module.out_features%GPTQ_MARLIN_MIN_THREAD_N!=0):
-                print(f"This linear module's in_features or out_features is not divisible by GPTQ_MARLIN_MIN_THREAD_N({GPTQ_MARLIN_MIN_THREAD_N}), using KLinearTorch instead.")
-                print(f"module info: key:{key} orig_module:{orig_module}")
-                self.prefill_linear = KLinearTorch(key, gguf_loader, config, orig_module, prefill_device, **kwargs)
-            else:
-                self.prefill_linear = LINEAR_MAP[prefill_op](key, gguf_loader, config, orig_module, prefill_device, **kwargs)
+            self.prefill_linear = LINEAR_MAP[prefill_op](key, gguf_loader, config, orig_module, prefill_device, **kwargs)
         else:
             self.prefill_linear = None

         if generate_op is not None:
             assert generate_op in LINEAR_MAP, f"linear_type {generate_op} not supported"
-            if generate_op == "KLinearMarlin" and (orig_module.in_features%GPTQ_MARLIN_MIN_THREAD_N!=0 or orig_module.out_features%GPTQ_MARLIN_MIN_THREAD_N!=0):
-                print(f"This linear module's in_features or out_features is not divisible by GPTQ_MARLIN_MIN_THREAD_N({GPTQ_MARLIN_MIN_THREAD_N}), using KLinearTorch instead.")
-                print(f"module info: key:{key} orig_module:{orig_module}")
-                self.generate_op = "KLinearTorch"
-                self.generate_linear = KLinearTorch(key, gguf_loader, config, orig_module, generate_device, **kwargs)
-            else:
-                self.generate_linear = LINEAR_MAP[generate_op](key, gguf_loader, config, orig_module, generate_device, **kwargs)
+            self.generate_linear = LINEAR_MAP[generate_op](key, gguf_loader, config, orig_module, generate_device, **kwargs)
         else:
             self.generate_linear = None
         self.mode = InferenceState.UNLOAD
@@ -412,10 +502,11 @@ class KTransformersLinear(BaseInjectedModule, KLinearBase):
     def forward(self, x):
         if self.mode == InferenceState.PREFILL:
             assert self.prefill_linear is not None, "cpu linear is not initialized"
-            return self.prefill_linear.forward(x)
+            y = self.prefill_linear.forward(x)
         else:
             assert self.generate_linear is not None, "gpu linear is not initialized"
-            return self.generate_linear.forward(x)
+            y = self.generate_linear.forward(x)
+        return y

     def load(self, w: dict | nn.Parameter | tuple | None = None, mode: InferenceState = InferenceState.GENERATE):
         if not mode:
@@ -424,11 +515,13 @@ class KTransformersLinear(BaseInjectedModule, KLinearBase):
         if mode == InferenceState.PREFILL:
             self.generate_linear.unload()
             self.prefill_linear.load(w=w)
             self.device = self.prefill_linear.device
+            self.weight = self.prefill_linear.weight # modeling_xxx.py may use linear.weight
         elif mode == InferenceState.GENERATE:
             self.prefill_linear.unload()
             self.generate_linear.load(w=w)
             self.device = self.generate_linear.device
+            self.weight = self.generate_linear.weight # modeling_xxx.py may use linear.weight
         elif mode == InferenceState.UNLOAD:
             self.prefill_linear.unload()
             self.generate_linear.unload()
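A simplified picture of how the two wrapped operators are driven at runtime, purely as orientation for the load/forward changes above (sketch only; `linear` stands for any KTransformersLinear instance):

    linear.load(mode=InferenceState.PREFILL)   # unloads generate_linear, loads prefill_linear,
                                               # and re-points linear.weight at prefill_linear.weight
    y = linear.forward(x)                      # dispatches to prefill_linear.forward
    linear.load(mode=InferenceState.GENERATE)  # swaps back for decoding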
@@ -56,7 +56,7 @@ from ktransformers.models.modeling_deepseek import (
 from transformers.models.qwen2_moe.configuration_qwen2_moe import Qwen2MoeConfig
 from ktransformers.models.configuration_llama import LlamaConfig
 from ktransformers.operators.base_operator import BaseInjectedModule
-from ktransformers.util.utils import InferenceState
+from ktransformers.util.utils import InferenceState, get_compute_capability
 from ktransformers.util.custom_gguf import GGUFLoader
 from transformers.configuration_utils import PretrainedConfig
 from ktransformers.models.modeling_llama import (
@@ -649,9 +649,14 @@ class KDeepseekV2Model(BaseInjectedModule):
         if per_layer_prefill_flag:
             causal_mask = None
         else:
-            causal_mask = self._update_causal_mask(
-                attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
-            )
+            if os.name == 'nt' or get_compute_capability()<8:
+                print("for Windows or GPU before ampere, use forward_windows")
+                # only use mask in forward windows or can't flash attn
+                causal_mask = self._update_causal_mask(
+                    attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+                )
+            else:
+                causal_mask = None

         # embed positions
         hidden_states = inputs_embeds
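For context, get_compute_capability is the helper newly pulled in from ktransformers.util.utils; a plausible minimal implementation is sketched below (an assumption, the real function may differ, e.g. in how it handles multiple GPUs):

    import torch

    def get_compute_capability() -> int:
        # return the smallest major compute capability across visible GPUs, so the
        # masked (non-flash-attention) path is taken if any card is pre-Ampere (< 8)
        if not torch.cuda.is_available():
            return 0
        return min(torch.cuda.get_device_properties(i).major
                   for i in range(torch.cuda.device_count()))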
@@ -126,6 +126,8 @@ def optimize_and_load_gguf(module: nn.Module, rule_file: str, gguf_path: str, mo
     gguf_loader=GGUFLoader(gguf_path)
     with torch.device("meta"):
         inject(module, optimize_config, model_config, gguf_loader)
+    # pre load lm_head because its big inter result
+    load_weights(module.lm_head, gguf_loader, "lm_head.")
     load_weights(module, gguf_loader)
     module.gguf_loader = gguf_loader
     del_meta(module)
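One reason the double load is harmless: the new `self.loaded` flag in KLinearBase makes a second `load()` of the same module a no-op. A rough, hypothetical sketch of the interaction (the real `load_weights` walks the module tree differently):

    # illustration only, not the actual ktransformers.util implementation
    def load_weights_sketch(module, gguf_loader, prefix=""):
        for name, child in module.named_children():
            load_weights_sketch(child, gguf_loader, prefix + name + ".")
        if hasattr(module, "load"):
            module.load()  # KLinearTorch/KLinearMarlin now start with `if self.loaded: return`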
@@ -219,8 +219,20 @@
     kwargs:
       generate_device: "cuda:2"
       prefill_device: "cuda:2"

 - match:
-    name: "(^model\\.layers\\.([5][0-9]|[4][5-9])\\.)|(^model.norm)|(^lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([5][0-9]|[4][5-9])\\.)|(^model.norm)"
   replace:
     class: "default"
     kwargs:
@@ -118,7 +118,18 @@
       prefill_device: "cuda:0"

 - match:
-    name: "(^model\\.layers\\.([345][0-9])\\.)|(model.norm)|(lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([345][0-9])\\.)|(model.norm)"
   replace:
     class: "default"
     kwargs:
@@ -15,6 +15,18 @@
       prefill_device: "cuda"
       generate_op: "KLinearMarlin"
       prefill_op: "KLinearTorch"

+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
 - match:
     name: "^model\\.layers\\..*\\.mlp$"
     class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
@@ -118,7 +118,18 @@
       prefill_device: "cuda:0"

 - match:
-    name: "(^model\\.layers\\.([12][0-9])\\.)|(model.norm)|(lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([12][0-9])\\.)|(model.norm)"
   replace:
     class: "default"
     kwargs:
@@ -15,6 +15,18 @@
       prefill_device: "cuda"
       generate_op: "KLinearMarlin"
       prefill_op: "KLinearTorch"

+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
 - match:
     name: "^model\\.layers\\..*\\.mlp$"
     class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
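All of these rule changes rely on the same matching behaviour: `name` is a regular expression tested against the module path, and `class` additionally has to match the module's type (as the comments in the rule files note). A quick standalone check of the new pattern, for illustration only (the real matcher lives in ktransformers.optimize):

    import re

    def name_matches(pattern: str, module_name: str) -> bool:
        return re.match(pattern, module_name) is not None

    assert name_matches(r"^lm_head", "lm_head")
    assert not name_matches(r"^lm_head", "model.layers.0.self_attn.q_proj")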
@@ -0,0 +1,63 @@
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearFP8"
+      prefill_op: "KLinearTorch"
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE  # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda"
+  recursive: False  # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0  # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
@@ -182,6 +182,53 @@
       generate_device: "cuda:3"
       prefill_device: "cuda:3"

+# === MLP Experts Replacement ===
+# replace with marlin expert. Open and modify layer-num as needed.
+# Each layer of malin experts takes about 6GB of GPU memory.
+# !!!Do remember 'close' cuda graph if you are using marlin expert.!!!
+# !!!KExpertsTorch is untested, we don't have enough VRAM.!!!
+
+# GPU 0: layers 3–4
+# - match:
+#     name: "^model\\.layers\\.([3-4])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:0"
+#       generate_op: "KExpertsMarlin"
+#   recursive: False
+
+# # GPU 1: layers 15–17
+# - match:
+#     name: "^model\\.layers\\.(1[5-7])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:1"
+#       generate_op: "KExpertsMarlin"
+#   recursive: False
+
+# # GPU 2: layers 30–32
+# - match:
+#     name: "^model\\.layers\\.(3[0-2])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:2"
+#       generate_op: "KExpertsMarlin"
+#   recursive: False
+
+# # GPU 3: layers 45–46
+# - match:
+#     name: "^model\\.layers\\.(4[5-6])\\.mlp\\.experts$"
+#   replace:
+#     class: ktransformers.operators.experts.KTransformersExperts
+#     kwargs:
+#       generate_device: "cuda:3"
+#       generate_op: "KExpertsMarlin"
+#   recursive: False
+
+
 # === MLP Experts Replacement ===

 # GPU 0: layers 0–14
@@ -246,6 +293,7 @@
     kwargs:
       generate_device: "cuda:0"
       prefill_device: "cuda:0"
+      absorb_for_prefill: False

 # GPU 1: layers 15–29
 - match:
@@ -255,6 +303,7 @@
     kwargs:
       generate_device: "cuda:1"
       prefill_device: "cuda:1"
+      absorb_for_prefill: False

 # GPU 2: layers 30–44
 - match:
@@ -264,6 +313,7 @@
     kwargs:
       generate_device: "cuda:2"
       prefill_device: "cuda:2"
+      absorb_for_prefill: False

 # GPU 3: layers 45–60
 - match:
@@ -273,6 +323,7 @@
     kwargs:
       generate_device: "cuda:3"
       prefill_device: "cuda:3"
+      absorb_for_prefill: False

 # === Overall Model Replacement with Transfer Map ===

@@ -316,9 +367,20 @@
       generate_device: "cuda:2"
       prefill_device: "cuda:2"

-# For final modules (model.norm and lm_head), ensure they are on GPU 3 (as in your original config)
 - match:
-    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)|(^lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:3"
+      prefill_device: "cuda:3"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+# For final modules (model.norm), ensure they are on GPU 3 (as in your original config)
+- match:
+    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)"
   replace:
     class: "default"
     kwargs:
@@ -713,9 +713,20 @@
       generate_device: "cuda:7"
       prefill_device: "cuda:7"

-# For final modules (model.norm and lm_head), ensure they are on GPU 7 (as in your original config)
 - match:
-    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)|(^lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:7"
+      prefill_device: "cuda:7"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+# For final modules (model.norm), ensure they are on GPU 7 (as in your original config)
+- match:
+    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)"
   replace:
     class: "default"
     kwargs:
@@ -0,0 +1,157 @@
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\."
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.(?!self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+      generate_op: "KLinearFP8"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.(?!self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearFP8"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE  # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE  # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.mlp\\.gate$"
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate  # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda:0"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda:0"
+  recursive: False  # don't recursively inject submodules of this module
+
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda:1"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda:1"
+  recursive: False  # don't recursively inject submodules of this module
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+      absorb_for_prefill: False  # change this to True to enable long context(prefill may slower).
+
+- match:
+    name: "^model\\.layers\\.([3456][0-9])\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      absorb_for_prefill: False  # change this to True to enable long context(prefill may slower).
+
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0  # 0 is close layer wise prefill
+      transfer_map:
+        30: "cuda:1"
+
+- match:
+    name: "^model\\.layers\\.(0|[1-9]|[12][0-9])\\."
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+
+
+- match:
+    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
@@ -153,9 +153,20 @@
       prefill_device: "cuda:0"

 - match:
-    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)|(lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
   replace:
-    class: "default"
+    class: ktransformers.operators.linear.KTransformersLinear
     kwargs:
       generate_device: "cuda:0"
       prefill_device: "cuda:0"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
@@ -135,7 +135,18 @@
       prefill_device: "cuda:0"

 - match:
-    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)|(lm_head)"
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)"
   replace:
     class: "default"
     kwargs:
@@ -5,6 +5,18 @@
     kwargs:
       generate_device: "cuda"
       prefill_device: "cuda"

+- match:
+    name: "^lm_head$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
 - match:
     name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
     class: torch.nn.Linear  # only match modules matching name and class simultaneously
@@ -48,6 +60,7 @@
     kwargs:
       generate_device: "cuda"
       prefill_device: "cuda"
+      absorb_for_prefill: False  # change this to True to enable long context(prefill may slower).
 - match:
     name: "^model$"
   replace:
@@ -15,6 +15,16 @@
       prefill_device: "cuda"
       generate_op: "KLinearMarlin"
       prefill_op: "KLinearTorch"
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
 - match:
     name: "^model\\.layers\\..*\\.block_sparse_moe$"
     class: ktransformers.models.modeling_mixtral.MixtralSparseMoeBlock
86 ktransformers/optimize/optimize_rules/Moonlight-16B-A3B.yaml Normal file
@@ -0,0 +1,86 @@
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
+  replace:
+    class: ktransformers.operators.RoPE.RotaryEmbeddingV3
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+
+- match:
+    name: "^lm_head$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
+- match:
+    name: "^model\\.layers\\.(?!.*self_attn\\.kv_b_proj).*$"  # regular expression
+    class: torch.nn.Linear  # only match modules matching name and class simultaneously
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+- match:
+    name: "^model\\.layers\\..*\\.mlp$"
+    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
+  replace:
+    class: ktransformers.operators.experts.KDeepseekV3MoE  # mlp module with custom forward function
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    class: ktransformers.models.modeling_deepseek_v3.MoEGate
+  replace:
+    class: ktransformers.operators.gate.KMoEGate
+    kwargs:
+      generate_device: "cuda:0"
+      prefill_device: "cuda:0"
+- match:
+    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+  replace:
+    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert paralleism
+    kwargs:
+      prefill_device: "cuda"
+      prefill_op: "KExpertsTorch"
+      generate_device: "cpu"
+      generate_op: "KExpertsCPU"
+      out_device: "cuda"
+  recursive: False  # don't recursively inject submodules of this module
+# if want to use more VRAM, use experts Marlin and disable CUDA Graph(disable CUDA Graph may cause low performance)
+#- match:
+#    name: "^model\\.layers\\..*\\.mlp\\.experts$"
+#  replace:
+#    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE Kernel with expert paralleism
+#    kwargs:
+#      prefill_device: "cuda"
+#      prefill_op: "KExpertsTorch"
+#      generate_device: "cuda"
+#      generate_op: "KExpertsMarlin"
+#  recursive: False  # don't recursively inject submodules of this module
+- match:
+    name: "^model\\.layers\\..*\\.self_attn$"
+  replace:
+    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+- match:
+    name: "^model$"
+  replace:
+    class: "ktransformers.operators.models.KDeepseekV2Model"
+    kwargs:
+      per_layer_prefill_intput_threshold: 0  # 0 is close layer wise prefill
+- match:
+    name: "^model.embed_tokens"
+  replace:
+    class: "default"
+    kwargs:
+      generate_device: "cpu"
+      prefill_device: "cpu"
@@ -77,9 +77,19 @@
     kwargs:
       generate_device: "cpu"
       prefill_device: "cpu"
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda:1"
+      prefill_device: "cuda:1"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
+
 - match:
-    name: "(^model.norm)|(^lm_head)"
+    name: "(^model.norm)"
   replace:
     class: "default"
     kwargs:
@@ -15,6 +15,16 @@
       prefill_device: "cuda"
       generate_op: "KLinearMarlin"
       prefill_op: "KLinearTorch"
+- match:
+    name: "^lm_head"
+    class: torch.nn.Linear
+  replace:
+    class: ktransformers.operators.linear.KTransformersLinear
+    kwargs:
+      generate_device: "cuda"
+      prefill_device: "cuda"
+      generate_op: "KLinearMarlin"
+      prefill_op: "KLinearTorch"
 - match:
     name: "^model\\.layers\\..*\\.mlp$"
     class: ktransformers.models.modeling_qwen2_moe.Qwen2MoeSparseMoeBlock
@@ -12,8 +12,8 @@ from ktransformers.server.config.config import Config
 from ktransformers.server.utils.create_interface import get_interface
 from ktransformers.server.schemas.assistants.streaming import check_link_response
 from ktransformers.server.backend.base import BackendInterfaceBase
-router = APIRouter(prefix='/api')

+router = APIRouter(prefix='/api')

 # https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
 class OllamaGenerateCompletionRequest(BaseModel):
@@ -40,61 +40,121 @@ class OllamaGenerateCompletionRequest(BaseModel):
     keep_alive: Optional[str] = Field(
         "5m", description="Controls how long the model will stay loaded into memory following the request.")


 class OllamaGenerationStreamResponse(BaseModel):
     model: str
     created_at: str
     response: str
     done: bool = Field(...)


 class OllamaGenerationResponse(BaseModel):
     pass


 @router.post("/generate", tags=['ollama'])
 async def generate(request: Request, input: OllamaGenerateCompletionRequest):
     id = str(uuid4())

     interface: BackendInterfaceBase = get_interface()
     print(f'COMPLETION INPUT:----\n{input.prompt}\n----')

     config = Config()

     if input.stream:
         async def inner():
-            async for token in interface.inference(input.prompt,id):
-                d = OllamaGenerationStreamResponse(model=config.model_name,created_at=str(datetime.now()),response=token,done=False)
-                yield d.model_dump_json()+'\n'
-            # d = {'model':config.model_name,'created_at':"", 'response':token,'done':False}
-            # yield f"{json.dumps(d)}\n"
-            # d = {'model':config.model_name,'created_at':"", 'response':'','done':True}
-            # yield f"{json.dumps(d)}\n"
-            d = OllamaGenerationStreamResponse(model=config.model_name,created_at=str(datetime.now()),response='',done=True)
-            yield d.model_dump_json()+'\n'
-        return check_link_response(request,inner())
+            async for token in interface.inference(input.prompt, id):
+                d = OllamaGenerationStreamResponse(
+                    model=config.model_name,
+                    created_at=str(datetime.now()),
+                    response=token,
+                    done=False
+                )
+                yield d.model_dump_json() + '\n'
+            d = OllamaGenerationStreamResponse(
+                model=config.model_name,
+                created_at=str(datetime.now()),
+                response='',
+                done=True
+            )
+            yield d.model_dump_json() + '\n'
+        return check_link_response(request, inner())
     else:
         raise NotImplementedError

 # https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion
+class OllamaChatCompletionMessage(BaseModel):
+    role: str
+    content: str
+
 class OllamaChatCompletionRequest(BaseModel):
-    pass
+    model: str = Field(..., description="The model name, which is required.")
+    messages: List[OllamaChatCompletionMessage] = Field(
+        ..., description="A list of messages to generate a response for.")
+    stream: bool = Field(True, description="If true, the response will be streamed.")

 class OllamaChatCompletionStreamResponse(BaseModel):
-    pass
+    model: str
+    created_at: str
+    message: dict
+    done: bool = Field(...)
+    total_duration: Optional[int] = Field(None, description="Total time spent in nanoseconds")
+    load_duration: Optional[int] = Field(None, description="Time spent loading model in nanoseconds")
+    prompt_eval_count: Optional[int] = Field(None, description="Number of tokens in prompt")
+    prompt_eval_duration: Optional[int] = Field(None, description="Time spent evaluating prompt in nanoseconds")
+    eval_count: Optional[int] = Field(None, description="Number of tokens generated")
+    eval_duration: Optional[int] = Field(None, description="Time spent generating response in nanoseconds")


 class OllamaChatCompletionResponse(BaseModel):
     pass


 @router.post("/chat", tags=['ollama'])
 async def chat(request: Request, input: OllamaChatCompletionRequest):
-    raise NotImplementedError
+    id = str(uuid4())
+    interface: BackendInterfaceBase = get_interface()
+    config = Config()
+
+    # Convert the chat messages into a single prompt string
+    prompt = ""
+    for msg in input.messages:
+        prompt += f"{msg.role}: {msg.content}\n"
+    prompt += "assistant:"
+
+    if input.stream:
+        async def inner():
+            start_time = time()  # record the start time (seconds)
+            eval_count = 0  # number of generated tokens
+            tokens = []
+
+            async for token in interface.inference(prompt, id):
+                d = OllamaChatCompletionStreamResponse(
+                    model=config.model_name,
+                    created_at=str(datetime.now()),
+                    message={"role": "assistant", "content": token},
+                    done=False
+                )
+                yield d.model_dump_json() + '\n'
+            # compute timing statistics
+            end_time = time()
+            total_duration = int((end_time - start_time) * 1_000_000_000)  # convert to nanoseconds
+            prompt_eval_count = len(prompt.split())  # rough estimate of the prompt token count
+            eval_duration = total_duration  # assume all time was spent generating (simplification)
+            prompt_eval_duration = 0  # assume no separate prompt-evaluation time
+            load_duration = 0  # assume the load time is unknown
+
+            d = OllamaChatCompletionStreamResponse(
+                model=config.model_name,
+                created_at=str(datetime.now()),
+                message={},
+                done=True,
+                total_duration=total_duration,
+                load_duration=load_duration,
+                prompt_eval_count=prompt_eval_count,
+                prompt_eval_duration=prompt_eval_duration,
+                eval_count=eval_count,
+                eval_duration=eval_duration
+            )
+            yield d.model_dump_json() + '\n'
+        return check_link_response(request, inner())
+    else:
+        raise NotImplementedError("Non-streaming chat is not implemented.")

 # https://github.com/ollama/ollama/blob/main/docs/api.md#list-local-models
 class OllamaModel(BaseModel):
@@ -103,9 +163,8 @@ class OllamaModel(BaseModel):
     size: int
     # TODO: fill the rest correctly

 # mock ollama
-@router.get("/tags",tags=['ollama'])
+@router.get("/tags", tags=['ollama'])
 async def tags():
     config = Config()
     # TODO: fill this correctly, although it does not effect Tabby
@@ -138,25 +197,21 @@ class OllamaShowResponse(BaseModel):
     class Config:
         protected_namespaces = ()


 @router.post("/show", tags=['ollama'])
 async def show(request: Request, input: OllamaShowRequest):
     config = Config()
     # TODO: Add more info in config to return, although it does not effect Tabby
     return OllamaShowResponse(
-        modelfile = "# Modelfile generated by ...",
-        parameters = " ",
-        template = " ",
-        details = OllamaShowDetial(
-            parent_model = " ",
-            format = "gguf",
-            family = " ",
-            families = [
-                " "
-            ],
-            parameter_size = " ",
-            quantization_level = " "
+        modelfile="# Modelfile generated by ...",
+        parameters=" ",
+        template=" ",
+        details=OllamaShowDetial(
+            parent_model=" ",
+            format="gguf",
+            family=" ",
+            families=[" "],
+            parameter_size=" ",
+            quantization_level=" "
         ),
-        model_info = OllamaModelInfo()
+        model_info=OllamaModelInfo()
     )
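For a quick manual test of the new /api/chat route, a minimal streaming client could look like the sketch below. The port and model name are placeholders, and `requests` is not a project dependency; it is used here only for illustration.

    import json
    import requests  # illustration only, not a dependency of this repo

    def ollama_chat(prompt: str, base_url: str = "http://localhost:10002"):  # port is an assumption
        body = {
            "model": "ktransformers-model",  # any value; the server replies with its own model_name
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }
        with requests.post(f"{base_url}/api/chat", json=body, stream=True) as r:
            for line in r.iter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                if chunk["done"]:
                    break  # the final chunk carries total_duration, eval_count, ...
                print(chunk["message"]["content"], end="", flush=True)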
@@ -5,18 +5,15 @@ from fastapi import APIRouter
 from fastapi.requests import Request
 from ktransformers.server.utils.create_interface import get_interface
 from ktransformers.server.schemas.assistants.streaming import chat_stream_response
-from ktransformers.server.schemas.endpoints.chat import ChatCompletionCreate,ChatCompletionChunk,ChatCompletionObject
+from ktransformers.server.schemas.endpoints.chat import ChatCompletionCreate,ChatCompletionChunk,ChatCompletionObject, Usage
 from ktransformers.server.backend.base import BackendInterfaceBase
+from ktransformers.server.config.config import Config

 router = APIRouter()

-models = [
-    {"id": "0", "name": "ktranformers-model"},
-]

 @router.get('/models', tags=['openai'])
 async def list_models():
-    return models
+    return [{"id": Config().model_name, "name": Config().model_name}]


 @router.post('/chat/completions', tags=['openai'])
@@ -28,15 +25,19 @@ async def chat_completion(request:Request,create:ChatCompletionCreate):

    input_message = [json.loads(m.model_dump_json()) for m in create.messages]

+    if Config().api_key != '':
+        assert request.headers.get('Authorization', '').split()[-1] == Config().api_key
+
    if create.stream:
        async def inner():
            chunk = ChatCompletionChunk(id=id,object='chat.completion.chunk',created=int(time()))
-            async for token in interface.inference(input_message,id):
+            async for token in interface.inference(input_message,id,create.temperature,create.top_p):
                chunk.set_token(token)
                yield chunk
        return chat_stream_response(request,inner())
    else:
-        comp = ChatCompletionObject(id=id,object='chat.completion.chunk',created=int(time()))
-        async for token in interface.inference(input_message,id):
+        comp = ChatCompletionObject(id=id,object='chat.completion',created=int(time()))
+        comp.usage = Usage(completion_tokens=1, prompt_tokens=1, total_tokens=2)
+        async for token in interface.inference(input_message,id,create.temperature,create.top_p):
            comp.append_token(token)
        return comp
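Taken together, these changes let a client pass an API key and per-request sampling parameters through the OpenAI-compatible endpoint. A minimal client sketch follows; the host, port, and key value are assumptions, while the Authorization header and the temperature/top_p fields mirror what the endpoint now reads.

# Hypothetical client call against the updated /v1/chat/completions route.
import requests

resp = requests.post(
    "http://localhost:10002/v1/chat/completions",   # assumed server address
    headers={"Authorization": "Bearer my-secret-key"},
    json={
        "model": "ktransformers-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,
        "temperature": 0.6,
        "top_p": 0.95,
    },
)
print(resp.json())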
@@ -20,7 +20,7 @@ async def create_completion(request:Request,create:CompletionCreate):

    if create.stream:
        async def inner():
-            async for token in interface.inference(create.prompt,id):
+            async for token in interface.inference(create.prompt,id,create.temperature,create.top_p):
                d = {'choices':[{'delta':{'content':token}}]}
                yield f"data:{json.dumps(d)}\n\n"
            d = {'choices':[{'delta':{'content':''},'finish_reason':''}]}
@@ -28,6 +28,6 @@ async def create_completion(request:Request,create:CompletionCreate):
        return stream_response(request,inner())
    else:
        comp = CompletionObject(id=id,object='text_completion',created=int(time()))
-        async for token in interface.inference(create.prompt,id):
+        async for token in interface.inference(create.prompt,id,create.temperature,create.top_p):
            comp.append_token(token)
        return comp
@@ -10,6 +10,7 @@ class ArgumentParser:
        parser = argparse.ArgumentParser(prog="kvcache.ai", description="Ktransformers")
        parser.add_argument("--host", type=str, default=self.cfg.server_ip)
        parser.add_argument("--port", type=int, default=self.cfg.server_port)
+        parser.add_argument("--api_key", type=str, default=self.cfg.api_key)
        parser.add_argument("--ssl_keyfile", type=str)
        parser.add_argument("--ssl_certfile", type=str)
        parser.add_argument("--web", type=bool, default=self.cfg.mount_web)
@@ -23,13 +24,13 @@ class ArgumentParser:
        parser.add_argument("--optimize_config_path", default=self.cfg.optimize_config_path, type=str, required=False)
        parser.add_argument("--cpu_infer", type=int, default=self.cfg.cpu_infer)
        parser.add_argument("--type", type=str, default=self.cfg.backend_type)
+        parser.add_argument("--chunk_prefill_size", type=int, default=8192)

        # model configs
        # parser.add_argument("--model_cache_lens", type=int, default=self.cfg.cache_lens) # int?
        parser.add_argument("--paged", type=bool, default=self.cfg.paged)
        parser.add_argument("--total_context", type=int, default=self.cfg.total_context)
        parser.add_argument("--max_batch_size", type=int, default=self.cfg.max_batch_size)
-        parser.add_argument("--max_chunk_size", type=int, default=self.cfg.max_chunk_size)
        parser.add_argument("--max_new_tokens", type=int, default=self.cfg.max_new_tokens)
        parser.add_argument("--json_mode", type=bool, default=self.cfg.json_mode)
        parser.add_argument("--healing", type=bool, default=self.cfg.healing)
@@ -90,7 +91,8 @@ class ArgumentParser:
        # user config
        parser.add_argument("--user_secret_key", type=str, default=self.cfg.user_secret_key)
        parser.add_argument("--user_algorithm", type=str, default=self.cfg.user_algorithm)
-        parser.add_argument("--force_think", type=bool, default=self.cfg.user_force_think)
+        parser.add_argument("--force_think", action=argparse.BooleanOptionalAction, type=bool, default=self.cfg.user_force_think)
+        parser.add_argument("--use_cuda_graph", action=argparse.BooleanOptionalAction, type=bool, default=self.cfg.use_cuda_graph)

        # web config
        parser.add_argument("--web_cross_domain", type=bool, default=self.cfg.web_cross_domain)
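`argparse.BooleanOptionalAction` (Python 3.9+) gives each of these flags an automatic `--no-...` negative form, which is what makes it possible to switch CUDA graphs or forced thinking off from the command line. A standalone sketch of the behaviour, independent of the ktransformers parser:

# Minimal demonstration of BooleanOptionalAction; the defaults here are illustrative.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--force_think", action=argparse.BooleanOptionalAction, default=False)
parser.add_argument("--use_cuda_graph", action=argparse.BooleanOptionalAction, default=True)

print(parser.parse_args(["--force_think"]))         # force_think=True, use_cuda_graph=True
print(parser.parse_args(["--no-use_cuda_graph"]))   # force_think=False, use_cuda_graph=False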
@@ -23,7 +23,7 @@ class ConfigArgs(BaseModel):
    max_batch_size: int = Field(
        None, description="Max number of batches to run at once, assuming the sequences will fit within total_context"
    )
-    max_chunk_size: int = Field(
+    chunk_prefill_size: int = Field(
        None,
        description=(
            "Max chunk size. Determines the size of prefill operations. Can be reduced to reduce pauses whenever a new"
@@ -14,7 +14,10 @@ from ktransformers.models.custom_cache import StaticCache
 from ktransformers.util.cuda_graph_runner import CUDAGraphRunner
 from ktransformers.local_chat import custom_models, default_optimize_rules
 from ktransformers.util.utils import get_device
+from typing import Optional
+from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled, MLAWrapperSingleton

+warm_uped = False

 class KTransformersThreadContext(TransformersThreadContext):
    pass
@@ -23,19 +26,29 @@ class KTransformersThreadContext(TransformersThreadContext):
 class KTransformersInterface(TransformersInterface):
    def __init__(self, args: ConfigArgs = default_args):
        self.args = args
-        torch.set_default_dtype(torch.bfloat16)
        torch.set_grad_enabled(False)
        self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, device=args.device, trust_remote_code=args.trust_remote_code)
        config = AutoConfig.from_pretrained(args.model_dir, trust_remote_code=args.trust_remote_code)
+        try:
+            generation_config = GenerationConfig.from_pretrained(args.model_dir)
+        except:
+            generation_config = GenerationConfig(
+                max_length=args.max_new_tokens,
+                temperature=args.temperature,
+                top_p=args.temperature,
+                do_sample=True
+            )
+
+        torch.set_default_dtype(config.torch_dtype)
        if config.architectures[0] == "Qwen2MoeForCausalLM":
            config._attn_implementation = "flash_attention_2"

        with torch.device("meta"):
            self.model = custom_models[config.architectures[0]](config)
        if default_args.optimize_config_path is None:
-            optimize_rule_path = default_optimize_rules[config.architectures[0]]
+            optimize_config_path = default_optimize_rules[config.architectures[0]]
        else:
-            optimize_rule_path = args.optimize_config_path
+            optimize_config_path = args.optimize_config_path

        # print(optimize_config)
@@ -45,8 +58,8 @@ class KTransformersInterface(TransformersInterface):
                "please input the path of your gguf file(gguf file in the dir containing input gguf file must all"
                " belong to current model):"
            )
-        optimize_and_load_gguf(self.model, optimize_rule_path, gguf_path, config)
+        optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
+        self.model.generation_config = generation_config
        self.device_map = self.model.gguf_loader.tensor_device_map
        # logger.info(f"{args.model_name} loaded from {args.model_dir} to {self.device_map}")
        self.cache = StaticCache(
@@ -57,16 +70,7 @@ class KTransformersInterface(TransformersInterface):
            dtype=self.model.dtype,
        )
        # logger.info(f"StaticCache (length={args.cache_lens}), batch size:{args.batch_size}")
-        try:
-            self.model.generation_config = GenerationConfig.from_pretrained(args.model_dir)
-        except:
-            gen_config = GenerationConfig(
-                max_length=128,
-                temperature=0.7,
-                top_p=0.9,
-                do_sample=True
-            )
-            self.model.generation_config = gen_config
        if self.model.generation_config.pad_token_id is None:
            self.model.generation_config.pad_token_id = self.model.generation_config.eos_token_id
        self.streamer = TextStreamer(self.tokenizer)
@@ -74,10 +78,13 @@ class KTransformersInterface(TransformersInterface):
        self._infer_lock = asyncio.Lock()

    def decode_one_tokens(self):
+        global warm_uped
+
        device_map = self.model.gguf_loader.tensor_device_map
        torch_device = get_device("blk.0.self_attn", device_map)
        torch_device = "cuda:0" if torch_device == "cuda" else torch_device
-        if self.args.use_cuda_graph:
+        torch.cuda.set_device(torch_device)
+        if warm_uped and self.args.use_cuda_graph:
            if not hasattr(self, "cuda_graph_runner"):
                self.cuda_graph_runner = CUDAGraphRunner()
                self.cuda_graph_runner.capture(
@@ -99,14 +106,15 @@ class KTransformersInterface(TransformersInterface):
            torch.cuda.synchronize()
            logits = logits[0, -1, :]
            return self.logits_to_token(logits)

+        if self.args.use_cuda_graph:
+            warm_uped = True
+
        if self.use_static_cache:
-            mask = torch.ones((1, self.seq_length)).to(torch_device)
            logits = self.model(
                self.current_ids.to(torch_device),
                cache_position=self.active_cache_position,
                past_key_values=self.cache,
-                attention_mask=mask,
                return_dict=False,
                use_cache=True,
            )[0]
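The new `warm_uped` flag defers CUDA-graph capture until at least one eager decode step has run, so the graph is recorded against warmed-up kernels and allocations. A reduced sketch of that pattern with placeholder callables (these helpers are illustrative, not ktransformers APIs):

# Illustrative warm-up-then-capture pattern; run_eager / capture_graph / replay_graph
# stand in for the real eager forward, CUDAGraphRunner.capture, and graph replay.
warmed_up = False
graph = None

def decode_step(run_eager, capture_graph, replay_graph, use_cuda_graph=True):
    global warmed_up, graph
    if use_cuda_graph and warmed_up:
        if graph is None:
            graph = capture_graph()   # capture only after a real eager pass
        return replay_graph(graph)
    out = run_eager()                 # first call(s) stay on the eager path
    if use_cuda_graph:
        warmed_up = True
    return out

# decode_step(lambda: "eager", lambda: "graph", lambda g: f"replay {g}")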
@@ -119,55 +127,98 @@ class KTransformersInterface(TransformersInterface):

    @torch.no_grad
-    def prefill(self, input_ids: torch.Tensor, is_new: bool):
+    def prefill(self, input_ids: torch.Tensor, is_new: bool, temperature: Optional[float], top_p: Optional[float]):
        input_ids_length = input_ids.shape[-1]
-        self.profiler.set_counter("prefill", input_ids_length)
+        if(input_ids_length >= self.args.cache_lens):
+            logger.warning(f"input_ids_length {input_ids_length} > cache_lens {self.args.cache_lens}")
+            self.seq_length = input_ids_length
+            return
        logger.debug(f"input_ids: {input_ids.shape}")

        device = self.device_map.get("blk.0.self_attn", {}).get("generate_device", "cuda:0")
+        device = "cuda:0" if device == "cuda" else device
+
        if is_new:
-            self.cache.reset()
            self.ever_generated_ids.clear()
-            former_seq_length = 0
-            self.seq_length = input_ids_length
-            self.generated_ids = torch.zeros(
-                self.args.batch_size,
-                self.seq_length + self.args.max_new_tokens + 1,
-                dtype=torch.int,
-                device=self.args.device,
-            )
-        else:
-            logger.debug(f"generate_ids: {self.generated_ids.shape}")
-            former_seq_length = self.seq_length
-            self.seq_length += input_ids_length
-            expected_length = self.seq_length + self.args.max_new_tokens + 1
-            delta_length = expected_length - self.generated_ids.shape[-1]
-            if delta_length > 0:
-                new_generate_ids = torch.zeros(
-                    self.args.batch_size, delta_length, dtype=torch.int, device=self.args.device
-                )
-                self.generated_ids = torch.cat([self.generated_ids, new_generate_ids], dim=-1)
+            same_prefix = 0
+            flat_input_ids = input_ids.flatten()
+
+            if getattr(self, 'generated_ids', None) is None:
+                self.generated_ids = torch.zeros(
+                    self.args.batch_size,
+                    input_ids.shape[-1] + self.args.max_new_tokens + 1,
+                    dtype=torch.int,
+                    device=self.args.device,
+                )
+                self.seq_length = 1
+
+            flat_prev_ids = self.generated_ids.flatten()
+            for i in range(min(self.seq_length, flat_input_ids.shape[0]) - 1):
+                if flat_input_ids[i] == flat_prev_ids[i]:
+                    same_prefix += 1
+                else:
+                    break
+
+            logger.debug(f"same prefix len: {same_prefix}")
+            self.cache.remove_suffix(same_prefix)
+            self.seq_length = same_prefix
+            self.generated_ids = self.generated_ids[..., :same_prefix]
+            input_ids = input_ids[..., same_prefix:]
+            input_ids_length = input_ids.shape[-1]
+
+        self.ever_generated_ids.clear()
+        self.profiler.set_counter("prefill", input_ids_length)
+        logger.debug(f"input_ids: {input_ids.shape}")
+        logger.debug(f"generate_ids: {self.generated_ids.shape}")
+
+        former_seq_length = self.seq_length
+        self.seq_length += input_ids_length
+        expected_length = min(self.seq_length + self.args.max_new_tokens + 1, self.args.cache_lens)
+        delta_length = expected_length - self.generated_ids.shape[-1]
+        if delta_length > 0:
+            new_generate_ids = torch.zeros(
+                self.args.batch_size, delta_length, dtype=torch.int, device=self.args.device
+            )
+            self.generated_ids = torch.cat([self.generated_ids, new_generate_ids], dim=-1)
+        else:
+            logger.warning(f"seq_length bigger than cache_lens, killed")
+            exit(0)
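Instead of resetting the KV cache on every new request, the rewritten prefill compares the incoming prompt against the previously cached ids and keeps the longest common prefix, trimming the cache and re-prefilling only the tail (the loop deliberately stops one position early so at least one token is always reprocessed). A standalone sketch of the matching step, separate from the class above:

# Illustrative longest-common-prefix check, mirroring the same_prefix loop above.
def common_prefix_len(prev_ids, new_ids):
    n = 0
    for a, b in zip(prev_ids, new_ids):
        if a != b:
            break
        n += 1
    return n

cached = [1, 5, 9, 42, 7]
incoming = [1, 5, 9, 13]
keep = common_prefix_len(cached, incoming)
# keep == 3: the first three positions stay in the cache,
# only incoming[3:] has to go through prefill again.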
logger.debug(f"cache position: {former_seq_length} to {self.seq_length}")
|
logger.debug(f"cache position: {former_seq_length} to {self.seq_length}")
|
||||||
cache_position = torch.arange(former_seq_length, self.seq_length, device=device)
|
cache_position = torch.arange(former_seq_length, self.seq_length, device=device)
|
||||||
self.generated_ids[:, cache_position] = input_ids.to(self.args.device).to(torch.int)
|
self.generated_ids[:, cache_position] = input_ids.to(self.args.device).to(torch.int)
|
||||||
|
|
||||||
mask = torch.ones((1, self.seq_length)).to(device)
|
|
||||||
if not (type(self) is TransformersInterface):
|
if not (type(self) is TransformersInterface):
|
||||||
input_ids = input_ids.to("cpu")
|
input_ids = input_ids.to("cpu")
|
||||||
inputs_embeds = self.model.model.embed_tokens(input_ids).to(device)
|
|
||||||
if self.use_static_cache:
|
def chunk_prefill(input_ids, cache_position):
|
||||||
logits = self.model(
|
inputs_embeds = self.model.model.embed_tokens(input_ids).to(device)
|
||||||
inputs_embeds=inputs_embeds,
|
torch.cuda.set_device(device)
|
||||||
cache_position=cache_position,
|
if flashinfer_enabled:
|
||||||
past_key_values=self.cache,
|
MLAWrapperSingleton.need_plan_all()
|
||||||
return_dict=False,
|
if self.use_static_cache:
|
||||||
use_cache=True,
|
logits = self.model(
|
||||||
attention_mask=mask,
|
inputs_embeds=inputs_embeds,
|
||||||
)[0]
|
cache_position=cache_position,
|
||||||
else:
|
past_key_values=self.cache,
|
||||||
logits = self.model(inputs_embeds=inputs_embeds, return_dict=False)[0]
|
return_dict=False,
|
||||||
|
use_cache=True,
|
||||||
|
)[0]
|
||||||
|
else:
|
||||||
|
logits = self.model(inputs_embeds=inputs_embeds, return_dict=False)[0]
|
||||||
|
|
||||||
|
return logits
|
||||||
|
|
||||||
|
chunk_start = 0
|
||||||
|
while chunk_start < input_ids_length:
|
||||||
|
chunk_end = min(chunk_start + self.args.chunk_prefill_size, input_ids_length)
|
||||||
|
if self.cache != None:
|
||||||
|
self.cache.cur_idx=cache_position[chunk_start:chunk_end]
|
||||||
|
logits = chunk_prefill(input_ids[:, chunk_start:chunk_end], cache_position[chunk_start:chunk_end])
|
||||||
|
chunk_start += self.args.chunk_prefill_size
|
||||||
|
|
||||||
|
if flashinfer_enabled:
|
||||||
|
MLAWrapperSingleton.reset_buffer()
|
||||||
|
self.prepare_logits_wrapper(input_ids, device, temperature, top_p)
|
||||||
next_token = self.logits_to_token(logits[0, -1, :])
|
next_token = self.logits_to_token(logits[0, -1, :])
|
||||||
yield self.append_new_tokens(next_token)
|
yield self.append_new_tokens(next_token)
|
||||||
|
|
||||||
|
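Prefill now walks the prompt in `chunk_prefill_size` windows rather than pushing the whole prompt through the model in one forward pass, which bounds peak activation memory and shortens pauses on long prompts; only the logits of the final chunk matter for sampling the first new token. A minimal sketch of the loop shape, with `forward` standing in for the model call:

# Illustrative chunked-prefill loop; forward() is a placeholder, and 8192 mirrors
# the new --chunk_prefill_size default.
def chunked_prefill(input_ids, forward, chunk_prefill_size=8192):
    logits = None
    start = 0
    length = len(input_ids)
    while start < length:
        end = min(start + chunk_prefill_size, length)
        logits = forward(input_ids[start:end], positions=range(start, end))
        start += chunk_prefill_size
    return logits   # logits of the last chunk feed the first sampled token

last = chunked_prefill(list(range(20000)), lambda ids, positions: len(ids))
# last == 3616: the final window covers the remaining 20000 - 2 * 8192 tokens.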
@@ -176,7 +227,7 @@ class KTransformersInterface(TransformersInterface):
        device = self.device_map.get("blk.0.self_attn", {}).get("generate_device", "cuda:0")
        return torch.tensor([self.seq_length - 1], device=device)

-    async def inference(self, local_messages, thread_id: str):
+    async def inference(self, local_messages, thread_id: str, temperature: Optional[float], top_p: Optional[float]):
        async with self._infer_lock:
-            async for v in super().inference(local_messages, thread_id):
+            async for v in super().inference(local_messages, thread_id, temperature, top_p):
                yield v
@@ -19,7 +19,7 @@ import sys, os
 from ..base import ThreadContext, BackendInterfaceBase
 from ktransformers.server.config.log import logger
 from ..args import ConfigArgs, default_args
+from ktransformers.operators.flashinfer_wrapper import flashinfer_enabled, MLAWrapperSingleton

 # This TextStreamer is a modified version from https://github.com/huggingface/transformers/blob/main/src/transformers/generation/streamers.py
 class TextStreamer:
@@ -171,7 +171,7 @@ class TransformersInterface(BackendInterfaceBase):
        for m in messages[1:]:
            if m["role"] == "user" and new_messages[-1]["role"] == "user":
                logger.warning("merge two adjacent user messages")
-                new_messages[-1]["content"] += m["content"]
+                new_messages[-1]["content"] += '\n' + m["content"]
            else:
                new_messages.append(m)
        # if (self.last_request_id is not None) and self.last_request_id == thread_id:
@@ -180,7 +180,11 @@ class TransformersInterface(BackendInterfaceBase):
        # input_ids = self.tokenizer.apply_chat_template(
        #     new_messages, return_tensors="pt", add_generation_prompt=True
        # ).to(self.args.device)
-        input_ids = self.tokenizer.apply_chat_template(new_messages,return_tensors='pt',add_generation_prompt=True).to(self.args.device)
+        input_str: str = self.tokenizer.apply_chat_template(new_messages,tokenize=False,add_generation_prompt=True)
+        # drop <think> token in chat template
+        if input_str.endswith('<think>\n'):
+            input_str = input_str[:-len('<think>\n')]
+        input_ids = self.tokenizer.encode(input_str, return_tensors="pt").to(self.args.device)
        if (self.last_request_id is not None) and self.last_request_id == thread_id:
            x = self.generated_ids[:,:self.seq_length]
            y = input_ids[:,:self.seq_length]
@@ -199,14 +203,31 @@ class TransformersInterface(BackendInterfaceBase):
        self.seq_length += 1
        return self.streamer.put(new_tokens)

-    def logits_to_token(self, logits: torch.Tensor):
-        logits = logits / self.args.temperature if self.args.temperature!=0 else logits
-        for token_idx in self.ever_generated_ids:
-            if logits[token_idx] < 0:
-                logits[token_idx] *= self.args.repetition_penalty
-            else:
-                logits[token_idx] /= self.args.repetition_penalty
+    def prepare_logits_wrapper(self, inputs, device, temperature: Optional[float] = None, top_p: Optional[float] = None):
+        if temperature is None or temperature == 0:
+            temperature = self.model.generation_config.temperature
+        if top_p is None:
+            top_p = self.model.generation_config.top_p
+        generation_config, model_kwargs = self.model._prepare_generation_config(
+            None, max_length=self.args.max_new_tokens,
+            do_sample=True,
+            top_k=self.args.top_k,
+            top_p=top_p,
+            temperature=temperature,
+            repetition_penalty=self.args.repetition_penalty # change this to modify generate config
+        )
+        self.inputs = inputs
+        try: # transformers==4.43
+            self.logits_warper = (
+                self.model._get_logits_warper(generation_config, device=device)
+            )
+        except:
+            self.logits_warper = (
+                self.model._get_logits_warper(generation_config)
+            )
+
+    def logits_to_token(self, logits: torch.Tensor):
+        logits = self.logits_warper(self.inputs.view(1, -1), logits.view(1, -1))

        probs = torch.nn.functional.softmax(logits, dim=-1)
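Sampling now runs through the standard transformers logits-processor pipeline (built once per request in `prepare_logits_wrapper`) instead of the hand-rolled temperature division and repetition-penalty loop. A reduced, self-contained sketch of the same idea with the public warper classes and toy tensors:

# Illustrative sampling path using transformers' public logits processors;
# the penalty/temperature/top_k/top_p values and the vocab size are made up.
import torch
from transformers import (
    LogitsProcessorList,
    RepetitionPenaltyLogitsProcessor,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

warpers = LogitsProcessorList([
    RepetitionPenaltyLogitsProcessor(penalty=1.1),
    TemperatureLogitsWarper(temperature=0.6),
    TopKLogitsWarper(top_k=50),
    TopPLogitsWarper(top_p=0.95),
])

input_ids = torch.tensor([[1, 2, 3]])     # tokens generated so far
logits = torch.randn(1, 32000)            # raw logits for the next token
warped = warpers(input_ids, logits)       # apply penalty, temperature, top-k, top-p
probs = torch.softmax(warped, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)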
@@ -222,12 +243,10 @@ class TransformersInterface(BackendInterfaceBase):

    def decode_one_tokens(self):
        if self.use_static_cache:
-            mask = torch.ones((1, self.seq_length)).to(self.args.device)
            logits = self.model(
                self.current_ids,
                cache_position=self.active_cache_position,
                past_key_values=self.cache,
-                attention_mask=mask,
                return_dict=False,
                use_cache=True,
            )[0]
@@ -238,38 +257,57 @@ class TransformersInterface(BackendInterfaceBase):
        return self.logits_to_token(logits)

    @torch.no_grad
-    def prefill(self, input_ids: torch.Tensor, is_new: bool):
+    def prefill(self, input_ids: torch.Tensor, is_new: bool, temperature: Optional[float] = None, top_p: Optional[float] = None):
        input_ids_length = input_ids.shape[-1]
-        self.profiler.set_counter("prefill", input_ids_length)
        logger.debug(f"input_ids: {input_ids.shape}")

        if is_new:
-            self.cache.reset()
            self.ever_generated_ids.clear()
-            former_seq_length = 0
-            self.seq_length = input_ids_length
-            self.generated_ids = torch.zeros(
-                self.args.batch_size,
-                self.seq_length + self.args.max_new_tokens + 1,
-                dtype=torch.int,
-                device=self.args.device,
-            )
-        else:
-            logger.debug(f"generate_ids: {self.generated_ids.shape}")
-            former_seq_length = self.seq_length
-            self.seq_length += input_ids_length
-            expected_length = self.seq_length + self.args.max_new_tokens + 1
-            delta_length = expected_length - self.generated_ids.shape[-1]
-            if delta_length > 0:
-                new_generate_ids = torch.zeros(
-                    self.args.batch_size, delta_length, dtype=torch.int, device=self.args.device
-                )
-                self.generated_ids = torch.cat([self.generated_ids, new_generate_ids], dim=-1)
+            same_prefix = 0
+            flat_input_ids = input_ids.flatten()
+
+            if getattr(self, 'generated_ids', None) is None:
+                self.generated_ids = torch.zeros(
+                    self.args.batch_size,
+                    input_ids.shape[-1] + self.args.max_new_tokens + 1,
+                    dtype=torch.int,
+                    device=self.args.device,
+                )
+                self.seq_length = 1
+
+            flat_prev_ids = self.generated_ids.flatten()
+            for i in range(min(self.seq_length, flat_input_ids.shape[0]) - 1):
+                if flat_input_ids[i] == flat_prev_ids[i]:
+                    same_prefix += 1
+                else:
+                    break
+
+            logger.debug(f"same prefix len: {same_prefix}")
+            self.cache.remove_suffix(same_prefix)
+            self.seq_length = same_prefix
+            self.generated_ids = self.generated_ids[..., :same_prefix]
+            input_ids = input_ids[..., same_prefix:]
+            input_ids_length = input_ids.shape[-1]
+
+        self.ever_generated_ids.clear()
+        self.profiler.set_counter("prefill", input_ids_length)
+        logger.debug(f"input_ids: {input_ids.shape}")
+
+        logger.debug(f"generate_ids: {self.generated_ids.shape}")
+        former_seq_length = self.seq_length
+        self.seq_length += input_ids_length
+        expected_length = self.seq_length + self.args.max_new_tokens + 1
+        delta_length = expected_length - self.generated_ids.shape[-1]
+        if delta_length > 0:
+            new_generate_ids = torch.zeros(
+                self.args.batch_size, delta_length, dtype=torch.int, device=self.args.device
+            )
+            self.generated_ids = torch.cat([self.generated_ids, new_generate_ids], dim=-1)

        logger.debug(f"cache position: {former_seq_length} to {self.seq_length}")
        cache_position = torch.arange(former_seq_length, self.seq_length, device=self.args.device)
        self.generated_ids[:, cache_position] = input_ids.to(self.args.device).to(torch.int)

-        mask = torch.ones((1, self.seq_length)).to(self.args.device)
        device = input_ids.device
        if not (type(self) is TransformersInterface):
            input_ids = input_ids.to("cpu")
@@ -281,22 +319,34 @@ class TransformersInterface(BackendInterfaceBase):
            past_key_values=self.cache,
            return_dict=False,
            use_cache=True,
-            attention_mask=mask,
        )[0]
        else:
            logits = self.model(inputs_embeds=inputs_embeds, return_dict=False)[0]

+        self.prepare_logits_wrapper(input_ids, device, temperature, top_p)
        next_token = self.logits_to_token(logits[0, -1, :])
        yield self.append_new_tokens(next_token)

    @torch.no_grad
    def generate(self):
+        self.args.max_new_tokens = min(self.args.max_new_tokens, self.args.cache_lens - self.seq_length)
+        if(self.args.max_new_tokens <= 0):
+            logger.warning("max_new_tokens is less than 0")
+            yield self.streamer.end()
+            return
+        logger.info(f"max_new_tokens: {self.args.max_new_tokens}")
        self.profiler.set_counter("decode", 0)
-        for _ in range(1, self.args.max_new_tokens):
+
+        for i in range(1, self.args.max_new_tokens):
            with torch.nn.attention.sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION]):
+                if flashinfer_enabled:
+                    MLAWrapperSingleton.plan_all(None,None,None,self.active_cache_position.to(torch.int32)+1,
+                    num_heads=self.model.config.num_attention_heads, head_dim_ckv=self.model.config.kv_lora_rank,
+                    head_dim_kpe=self.model.config.qk_rope_head_dim, page_size=self.cache.page_size,
+                    sm_scale=(self.model.config.qk_rope_head_dim + self.model.config.qk_nope_head_dim) ** (-0.5), q_data_type=torch.bfloat16, kv_data_type=torch.bfloat16)
                next_token = self.decode_one_tokens()
            self.profiler.inc("decode")
-            if next_token == self.tokenizer.eos_token_id:
+            if next_token == self.tokenizer.eos_token_id or "<|im_end|>" == self.tokenizer.decode(next_token):
                assert self.args.batch_size == 1
                break
            yield self.append_new_tokens(next_token)
@@ -315,7 +365,8 @@ class TransformersInterface(BackendInterfaceBase):
        self.last_request_id = thread_id
        return True

-    async def inference(self, local_messages, thread_id: str):
+    async def inference(self, local_messages, thread_id: str, temperature: Optional[float] = None, top_p: Optional[float] = None):
+        self.streamer.reset()
        self.profiler.create_and_start_timer("tokenize")
        if isinstance(local_messages, List):
            input_ids = self.format_and_tokenize_input_ids(thread_id, local_messages)
@@ -325,8 +376,9 @@ class TransformersInterface(BackendInterfaceBase):
            #input_ids = torch.tensor([[6366]], device=input_ids.device)
        else:
            raise ValueError("local_messages should be List or str")

        if Config().user_force_think:
-            token_thinks = torch.tensor([self.tokenizer.encode("<think>\\n",add_special_tokens=False)],device=input_ids.device)
+            token_thinks = torch.tensor([self.tokenizer.encode("<think>\n",add_special_tokens=False)],device=input_ids.device)
            input_ids = torch.cat(
                [input_ids, token_thinks], dim=1
            )
@@ -334,11 +386,14 @@ class TransformersInterface(BackendInterfaceBase):
        self.profiler.pause_timer("tokenize")

        self.profiler.create_and_start_timer("prefill")

        if Config().user_force_think:
-            t = "<think>\n"
-            print(t,end="",flush=True)
-            yield t
-        for t in self.prefill(input_ids, self.check_is_new(thread_id)):
+            think = '<think>\n'
+            print(think, end="",flush=True)
+            yield think
+
+        for t in self.prefill(input_ids, self.check_is_new(thread_id), temperature, top_p):
+            # output think token after prefill done
            if t is not None:
                print(t, end="",flush=True)
                yield t
@@ -69,6 +69,7 @@ class Config(metaclass=Singleton):
        self.server: dict = cfg.get("server", {})
        self.server_ip = self.server.get("ip", "0.0.0.0")
        self.server_port = self.server.get("port", 9016)
+        self.api_key = self.server.get("api_key", "")

        # db configs
        self.db_configs: dict = cfg.get("db", {})
@@ -104,7 +105,8 @@ class Config(metaclass=Singleton):

        self.total_context = self.model.get("total_context", 2**18)
        self.max_batch_size = self.model.get("max_batch_size", 20 if self.paged else 1)
-        self.max_chunk_size = self.model.get("max_chunk_size", 2048)
+        self.chunk_prefill_size = self.model.get("chunk_prefill_size", 8192)

        self.max_new_tokens = self.model.get("max_new_tokens", 2000)
        self.json_mode = self.model.get("json_mode", False)
        self.healing = self.model.get("healing", False)
@@ -105,6 +105,7 @@ def custom_openapi(app):

 def main():
    cfg = Config()

    arg_parser = ArgumentParser(cfg)

    # Initialize messages
@@ -73,7 +73,7 @@ class RunStepDelta(Object):

 class Done():
    def to_stream_reply(self):
-        return f"event: done\ndata: [DONE]\n\n"
+        return f"data: [DONE]\n\n"


 async def check_client_link(request: Request, async_events: AsyncIterable):
@@ -25,7 +25,9 @@ class ChatCompletionCreate(BaseModel):
    messages: List[Message]
    model : str
    stream : bool = False
+    temperature: Optional[float] = None
+    top_p: Optional[float] = None

    def get_tokenizer_messages(self):
        return [m.to_tokenizer_message() for m in self.messages]

@@ -75,4 +77,4 @@ class ChatCompletionChunk(ChatCompletionBase):
    ]

    def to_stream_reply(self):
-        return f"data:{self.model_dump_json()}\n\n"
+        return f"data: {self.model_dump_json()}\n\n"
@@ -9,6 +9,8 @@ class CompletionCreate(BaseModel):
    model: str
    prompt: str | List[str]
    stream: bool = False
+    temperature: Optional[float] = None
+    top_p: Optional[float] = None

    def get_tokenizer_messages(self):
        if isinstance(self.prompt,List):
195 ktransformers/tests/mmlu_pro_test.py Normal file
@@ -0,0 +1,195 @@
import argparse
import random
import time
import json
import requests
import pandas as pd
from datasets import load_dataset

import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
os.environ['https_proxy'] = ''
os.environ['http_proxy'] = ''
hint = 'There is a single choice question. Answer the question by replying A, B, C, D, E, F, G, H, I, J. No other answers are accepted. Just the letter.'


class DataEvaluator:
    def __init__(self):
        # self.template_prompt = template_prompt
        self.data = []

    def load_data(self, file_path):
        """
        Load data from a Parquet file into a list.
        Each record in the Parquet file should represent an individual record.
        """
        # Read the Parquet file
        # dataset = load_dataset('parquet', data_files=file_path)
        ds = load_dataset("TIGER-Lab/MMLU-Pro")
        df = pd.DataFrame(ds['test'])
        # print(ds)
        # # ds_1 = ds['train']
        # ds_2 = ds['validation']
        # ds_3 = ds['test']
        # # Convert the dataset to a Pandas DataFrame
        # df_test = pd.DataFrame(ds['test'])
        # df_val = pd.DataFrame(ds['validation'])

        # for _, row in df.iterrows():
        #     self.data.append(row.to_dict())
        # df = pd.read_parquet(file_path)

        for _, row in df.iterrows():
            self.data.append(row.to_dict())

    def get_prompt(self, record):
        """
        Combine fields from a record with the template prompt to create a full prompt.
        :param record: Dictionary containing fields to populate the template.
        :return: A formatted prompt string.
        """
        # Build the A, B, C, D ... option list
        options_str = "\n".join([f"{chr(65+i)}. {opt}" for i, opt in enumerate(record['options'])])
        prompt = hint + "\nQuestion: " + record['question'] + "\n" + options_str + "\nAnswer: '"
        return prompt

    def post_processing(self, text):
        """
        Perform post-processing on the prediction string.
        :param text: The raw prediction string.
        :return: Processed prediction string.
        """
        text = text.lstrip('\n').split('\n')[0]
        return text[:1]

    def score(self, pred, answers):
        """
        Calculate scores between the prediction and the answer.
        Uses ROUGE scores as the evaluation metric.
        :param pred: The predicted string.
        :param answer: The reference answer string.
        :return: A dictionary containing ROUGE scores.
        """
        for answer in answers:
            if pred == answer:
                return 1

        return 0

# Function to generate text using API
def generate_text(api_url, question, model_name, stream=False):
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json',
        # Add the API key
        'Authorization' : 'Bearer '
    }
    data = {
        "messages": [{"content": question, "role": "user"}],
        "model": model_name,
        "stream": stream,
        # "temperature": 0.0
    }

    print("POST data:", data)
    response = requests.post(api_url, headers=headers, json=data)

    if response.status_code == 200:
        result = response.json()
        return result.get('choices', [{}])[0].get('message', {}).get('content', '').strip()
    else:
        print(f"API Request failed with status code {response.status_code}")
        return None

# Main function to handle multiple evaluations
def main(concurrent_requests, data_evaluator: DataEvaluator, result_file, log_file, api_url, model_name):
    start_total_time = time.time()

    total_score = 0

    results = []
    # Set the random seed
    random.seed(42)
    random.shuffle(data_evaluator.data)
    for i in range(min(concurrent_requests, len(data_evaluator.data))):
        # Randomly select a data item from data for each request
        data_item = data_evaluator.data[i]
        question = data_evaluator.get_prompt(data_item)
        # print(question)

        # Start the timer for this evaluation
        start_time = time.time()
        try:
            # Generate prediction using the API
            prediction = generate_text(api_url, question, model_name)

            if prediction is None:
                raise Exception(f"Failed to get prediction for {question}")

            answer = data_item['answer']
            # Compute score
            score = data_evaluator.score(data_evaluator.post_processing(prediction), answer)

            # Calculate the time taken
            elapsed_time = time.time() - start_time

            # Collect the result data
            result_data = {
                "question_id": data_item['question_id'],
                "answer": answer,
                "prediction": data_evaluator.post_processing(prediction),
                "score": score,
                "time": elapsed_time
            }

            # Write results to result.json with each field on a new line
            with open(result_file, 'a', encoding='utf-8') as f:
                json.dump(result_data, f, ensure_ascii=False, indent=4)
                f.write("\n")  # Ensure each JSON object is on a new line

            results.append(result_data)

            # Aggregate scores
            total_score += score

        except Exception as e:
            print(f"Error processing request {i}: {e}")

    # Calculate total time and throughput
    total_time = time.time() - start_total_time
    throughput = concurrent_requests / total_time

    # Log the total time, throughput, and average ROUGE scores
    with open(log_file, 'a', encoding='utf-8') as log_f:
        log_f.write(f"Total Time: {total_time:.2f} seconds\n")
        log_f.write(f"Throughput: {throughput:.2f} requests per second\n")
        log_f.write(f"Average Scores: {total_score / concurrent_requests}\n")
        log_f.write('-' * 40 + '\n')

    print(f"Results saved to {result_file}")
    print(f"Log saved to {log_file}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="API Generate Tester")
    parser.add_argument("--concurrent", type=int, default=1000, help="Number of concurrent evaluations")
    parser.add_argument("--file", type=str, default="TIGER-Lab/MMLU-Pro", help="Path to the mmlu.jsonl file")
    parser.add_argument("--result", type=str, default="./mmlu_result_pro.json", help="Path to save the result JSON file")
    parser.add_argument("--log", type=str, default="./mmlu_result_pro.log", help="Path to save the log file")
    parser.add_argument("--model", type=str, default="Pro/deepseek-ai/DeepSeek-V3", help="Model name or path")
    parser.add_argument("--api_url", type=str, default="http://localhost:15488/v1/chat/completions", help="API URL")
    # parser.add_argument("--api_url", type=str, default="https://api.siliconflow.cn/v1/chat/completions", help="API URL")

    args = parser.parse_args()

    # Load the data from the provided file
    # template_prompt = hint + "\nQuestion: {question}\nA. {options}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer: '"
    # template_prompt_pro = hint + "\nQuestion: {question}\nA. {options[0]}\nB. {options[1]}\nC. {options[2]}\nD. {options[3]}\nE. {options[4]}\nF. {options[5]}\nG. \
    # {options[6]}\nH. {options[7]}\nI. {options[8]}\nJ. {options[9]}\nAnswer: '"


    # Load the data from the provided file
    data_evaluator = DataEvaluator()
    data_evaluator.load_data(args.file)

    # Run the main function with the specified number of concurrent evaluations
    main(args.concurrent, data_evaluator, args.result, args.log, args.api_url, args.model)
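Note that despite the `--concurrent` flag name, the script above issues requests sequentially in a loop; the flag only caps how many samples are evaluated. A typical invocation against a locally running ktransformers server might look like the following (the port and model name are assumptions that depend on how the server was started):

python ktransformers/tests/mmlu_pro_test.py --api_url http://localhost:10002/v1/chat/completions --model DeepSeek-V3 --concurrent 100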
195
ktransformers/tests/mmlu_test.py
Normal file
195
ktransformers/tests/mmlu_test.py
Normal file
|
@ -0,0 +1,195 @@
|
||||||
|
import argparse
|
||||||
|
import random
|
||||||
|
import time
|
||||||
|
import json
|
||||||
|
import requests
|
||||||
|
import pandas as pd
|
||||||
|
from datasets import load_dataset
|
||||||
|
|
||||||
|
import os
|
||||||
|
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
|
||||||
|
os.environ['https_proxy'] = ''
|
||||||
|
os.environ['http_proxy'] = ''
|
||||||
|
hint = 'There is a single choice question. Answer the question by replying A, B, C, D. No other answers are accepted. Just the letter.'
|
||||||
|
|
||||||
|
|
||||||
|
class DataEvaluator:
|
||||||
|
def __init__(self):
|
||||||
|
# self.template_prompt = template_prompt
|
||||||
|
self.data = []
|
||||||
|
|
||||||
|
def load_data(self, file_path):
|
||||||
|
"""
|
||||||
|
Load data from a Parquet file into a list.
|
||||||
|
Each record in the Parquet file should represent an individual record.
|
||||||
|
"""
|
||||||
|
# 读取 Parquet 文件
|
||||||
|
# dataset = load_dataset('parquet', data_files=file_path)
|
||||||
|
ds = load_dataset(file_path,"all")
|
||||||
|
df = pd.DataFrame(ds['test'])
|
||||||
|
# print(ds)
|
||||||
|
# # ds_1 = ds['train']
|
||||||
|
# ds_2 = ds['validation']
|
||||||
|
# ds_3 = ds['test']
|
||||||
|
# # 将数据集转换为 Pandas DataFrame
|
||||||
|
# df_test = pd.DataFrame(ds['test'])
|
||||||
|
# df_val = pd.DataFrame(ds['validation'])
|
||||||
|
|
||||||
|
# for _, row in df.iterrows():
|
||||||
|
# self.data.append(row.to_dict())
|
||||||
|
# df = pd.read_parquet(file_path)
|
||||||
|
|
||||||
|
for _, row in df.iterrows():
|
||||||
|
self.data.append(row.to_dict())
|
||||||
|
|
||||||
|
def get_prompt(self, record):
|
||||||
|
"""
|
||||||
|
Combine fields from a record with the template prompt to create a full prompt.
|
||||||
|
:param record: Dictionary containing fields to populate the template.
|
||||||
|
:return: A formatted prompt string.
|
||||||
|
"""
|
||||||
|
# 查看ABCD。。。的选项
|
||||||
|
options_str = "\n".join([f"{chr(65 + i)}. {opt}" for i, opt in enumerate(record['choices'])])
|
||||||
|
prompt = hint + "\nQuestion: " + record['question'] + "\n" + options_str + "\nAnswer: '"
|
||||||
|
return prompt
|
||||||
|
|
||||||
|
def post_processing(self, text):
|
||||||
|
"""
|
||||||
|
Perform post-processing on the prediction string.
|
||||||
|
:param text: The raw prediction string.
|
||||||
|
:return: Processed prediction string.
|
||||||
|
"""
|
||||||
|
text = text.lstrip('\n').split('\n')[0]
|
||||||
|
return text[:1]
|
||||||
|
|
||||||
|
def score(self, pred, answers):
|
||||||
|
"""
|
||||||
|
Calculate scores between the prediction and the answer.
|
||||||
|
Uses ROUGE scores as the evaluation metric.
|
||||||
|
:param pred: The predicted string.
|
||||||
|
:param answer: The reference answer string.
|
||||||
|
:return: A dictionary containing ROUGE scores.
|
||||||
|
"""
|
||||||
|
for answer in answers:
|
||||||
|
if pred == answer:
|
||||||
|
return 1
|
||||||
|
|
||||||
|
return 0
|
||||||
|
|
||||||
|
# Function to generate text using API
|
||||||
|
def generate_text(api_url, question, model_name, stream=False):
|
||||||
|
headers = {
|
||||||
|
'accept': 'application/json',
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
# 添加 API Key
|
||||||
|
'Authorization' : 'Bearer '
|
||||||
|
}
|
||||||
|
data = {
|
||||||
|
"messages": [{"content": question, "role": "user"}],
|
||||||
|
"model": model_name,
|
||||||
|
"stream": stream,
|
||||||
|
# "temperature": 0.0
|
||||||
|
}
|
||||||
|
|
||||||
|
print("POST data:", data)
|
||||||
|
response = requests.post(api_url, headers=headers, json=data)
|
||||||
|
|
||||||
|
if response.status_code == 200:
|
||||||
|
result = response.json()
|
||||||
|
return result.get('choices', [{}])[0].get('message', {}).get('content', '').strip()
|
||||||
|
else:
|
||||||
|
print(f"API Request failed with status code {response.status_code}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Main function to handle multiple evaluations
|
||||||
|
def main(concurrent_requests, data_evaluator: DataEvaluator, result_file, log_file, api_url, model_name):
|
    start_total_time = time.time()
    total_score = 0
    results = []

    # Set the random seed
    random.seed(42)
    random.shuffle(data_evaluator.data)
    for i in range(min(concurrent_requests, len(data_evaluator.data))):
        # Randomly select a data item from data for each request
        data_item = data_evaluator.data[i]
        question = data_evaluator.get_prompt(data_item)
        # print(question)

        # Start the timer for this evaluation
        start_time = time.time()
        try:
            # Generate prediction using the API
            prediction = generate_text(api_url, question, model_name)

            if prediction is None:
                raise Exception(f"Failed to get prediction for {question}")

            answer = chr(data_item['answer'] + 65)
            # Compute score
            score = data_evaluator.score(data_evaluator.post_processing(prediction), answer)

            # Calculate the time taken
            elapsed_time = time.time() - start_time

            # Collect the result data
            result_data = {
                "question_id": i,
                "answer": answer,
                "prediction": data_evaluator.post_processing(prediction),
                "score": score,
                "time": elapsed_time
            }

            # Write results to result.json with each field on a new line
            with open(result_file, 'a', encoding='utf-8') as f:
                json.dump(result_data, f, ensure_ascii=False, indent=4)
                f.write("\n")  # Ensure each JSON object is on a new line

            results.append(result_data)

            # Aggregate scores
            total_score += score

        except Exception as e:
            print(f"Error processing request {i}: {e}")

    # Calculate total time and throughput
    total_time = time.time() - start_total_time
    throughput = concurrent_requests / total_time

    # Log the total time, throughput, and average scores
    with open(log_file, 'a', encoding='utf-8') as log_f:
        log_f.write(f"Total Time: {total_time:.2f} seconds\n")
        log_f.write(f"Throughput: {throughput:.2f} requests per second\n")
        log_f.write(f"Average Scores: {total_score / concurrent_requests}\n")
        log_f.write('-' * 40 + '\n')

    print(f"Results saved to {result_file}")
    print(f"Log saved to {log_file}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="API Generate Tester")
    parser.add_argument("--concurrent", type=int, default=1000, help="Number of concurrent evaluations")
    parser.add_argument("--file", type=str, default="cais/mmlu", help="Path to the mmlu.jsonl file")
    parser.add_argument("--result", type=str, default="./mmlu_result_silicon.json", help="Path to save the result JSON file")
    parser.add_argument("--log", type=str, default="./mmlu_result_silicon.log", help="Path to save the log file")
    parser.add_argument("--model", type=str, default="Pro/deepseek-ai/DeepSeek-V3", help="Model name or path")
    parser.add_argument("--api_url", type=str, default="http://localhost:10003/v1/chat/completions", help="API URL")
    # parser.add_argument("--api_url", type=str, default="https://api.siliconflow.cn/v1/chat/completions", help="API URL")

    args = parser.parse_args()

    # Load the data from the provided file
    # template_prompt = hint + "\nQuestion: {question}\nA. {options}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer: '"
    # template_prompt_pro = hint + "\nQuestion: {question}\nA. {options[0]}\nB. {options[1]}\nC. {options[2]}\nD. {options[3]}\nE. {options[4]}\nF. {options[5]}\nG. \
    # {options[6]}\nH. {options[7]}\nI. {options[8]}\nJ. {options[9]}\nAnswer: '"

    # Load the data from the provided file
    data_evaluator = DataEvaluator()
    data_evaluator.load_data(args.file)

    # Run the main function with the specified number of concurrent evaluations
    main(args.concurrent, data_evaluator, args.result, args.log, args.api_url, args.model)
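Note on the output format above: each record is written with json.dump(..., indent=4) followed by a bare newline, so the result file is a stream of concatenated pretty-printed JSON objects rather than strict JSON Lines. A minimal read-back sketch under that assumption (read_results is a hypothetical helper, not part of the test script):

import json

def read_results(path):
    # Decode concatenated JSON objects separated by arbitrary whitespace (hypothetical helper)
    decoder = json.JSONDecoder()
    records = []
    with open(path, encoding="utf-8") as f:
        text = f.read()
    idx = 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1  # skip whitespace between objects
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        records.append(obj)
    return records

# records = read_results("./mmlu_result_silicon.json")
# print(sum(r["score"] for r in records) / len(records))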
116 ktransformers/tests/triton_fp8gemm_test.py Normal file
@@ -0,0 +1,116 @@
import torch
import torch.nn.functional as F
from typing import Optional
import pytest
from typing import Tuple, Optional, Literal
import time
# use dir path
import os
import sys
sys.path.insert(0, "/home/azure/ktransformers")
print(sys.path)
from ktransformers.ktransformers_ext.triton.fp8gemm import fp8_gemm, act_quant, weight_dequant
from safetensors import safe_open

world_size = 1
rank = 0
block_size = 128
gemm_impl: Literal["bf16", "fp8"] = "bf16"
# Assuming `fp8_gemm`, `act_quant`, `weight_dequant` and other relevant functions are already defined

def test_fp8_gemm_vs_torch_matmul():
    # Test case 1: Create random matrices of size (M, K) and (K, N)
    M, K, N = 64, 128, 256  # Matrix dimensions
    x = torch.randn(M, K, dtype=torch.bfloat16, device='cuda')
    weight = torch.randn(N, K, dtype=torch.bfloat16, device='cuda')

    # Apply act_quant to both matrices
    x_quantized, scale_x = act_quant(x, block_size)
    weight_quantized, scale_w = act_quant(weight, block_size)

    # make contiguous
    x_quantized = x_quantized.contiguous()
    weight_quantized = weight_quantized.contiguous()
    scale_x = scale_x.contiguous()
    scale_w = scale_w.contiguous()

    # Perform fp8_gemm using the quantized tensors
    result_fp8_gemm = fp8_gemm(x_quantized, scale_x, weight_quantized, scale_w)

    # Perform torch.matmul using the original floating point tensors
    result_torch_matmul = torch.matmul(x, weight.T)
    print(f'result_torch_matmul: {result_torch_matmul.shape}')
    print(f'result_fp8_gemm: {result_fp8_gemm.shape}')

    print(f"result_fp8_gemm:\n {result_fp8_gemm}")
    print(f"result_torch_matmul:\n {result_torch_matmul}")

def test_fp8_gemm_vs_torch_matmul_load():
    file_path = "/mnt/data/model/DeepSeek-V3/model-00001-of-000163.safetensors"
    with safe_open(file_path, framework="pt", device=0) as f:
        weight = f.get_tensor("model.layers.0.mlp.down_proj.weight")
        scale = f.get_tensor("model.layers.0.mlp.down_proj.weight_scale_inv")

    # weight_dequant
    weight_dequantized = weight_dequant(weight, scale)
    print(f"weight_dequantized: {weight_dequantized.shape}")
    N, K = weight_dequantized.shape
    M = 64
    x = torch.randn(2, M, K, dtype=torch.bfloat16, device='cuda')
    x_quantized, scale_x = act_quant(x, block_size)

    # Test case 1: quantized x matmul with undequantized weight
    result_fp8_gemm = fp8_gemm(x_quantized, scale_x, weight, scale)
    print(f"result_fp8_gemm:\n {result_fp8_gemm}")
    print(f"dtype {result_fp8_gemm.dtype}")

    # Perform torch.matmul using the original floating point tensors
    result_torch_matmul = torch.matmul(x, weight_dequantized.to(torch.bfloat16).T)
    print(f"result_torch_matmul:\n {result_torch_matmul}")

def test_fp8_gemm_tplops():
    file_path = "/mnt/data/model/DeepSeek-V3/model-00001-of-000163.safetensors"
    with safe_open(file_path, framework="pt", device=0) as f:
        weight = f.get_tensor("model.layers.0.mlp.down_proj.weight")
        scale = f.get_tensor("model.layers.0.mlp.down_proj.weight_scale_inv")

    # weight_dequant
    weight_dequantized = weight_dequant(weight, scale)
    print(f"weight_dequantized: {weight_dequantized.shape}")
    N, K = weight_dequantized.shape
    M = 6400
    x = torch.randn(2, M, K, dtype=torch.bfloat16, device='cuda')
    # x_quantized, scale_x = act_quant(x, block_size)

    # Time i iterations of act_quant + fp8_gemm and report the achieved TFLOPS
    i = 10
    flops_per_gemm = 2 * M * N * K
    total_flops = i * flops_per_gemm

    # warm up (not timed)
    x_quantized, scale_x = act_quant(x, block_size)
    result_fp8_gemm = fp8_gemm(x_quantized, scale_x, weight, scale)
    x_quantized, scale_x = act_quant(x, block_size)
    result_fp8_gemm = fp8_gemm(x_quantized, scale_x, weight, scale)

    t0 = time.time()
    torch.cuda.synchronize()
    for i in range(i):
        x_quantized, scale_x = act_quant(x, block_size)
        result_fp8_gemm = fp8_gemm(x_quantized, scale_x, weight, scale)
    torch.cuda.synchronize()
    t1 = time.time()

    total_time = t1 - t0
    tflops = total_flops / total_time / 1e12
    print(f"total_time: {total_time}")
    print(f"tflops: {tflops}")

if __name__ == "__main__":
    test_fp8_gemm_vs_torch_matmul()
    test_fp8_gemm_vs_torch_matmul_load()
    test_fp8_gemm_tplops()
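For reference, the TFLOPS figure printed by test_fp8_gemm_tplops is total_flops = iters * 2 * M * N * K divided by the measured wall time. A small sketch of that arithmetic (the down_proj weight shape below is an assumption about DeepSeek-V3 layer 0, not read from the checkpoint here, and the leading batch dimension of 2 on x is not counted by the test's own formula):

M = 6400
N, K = 7168, 18432        # assumed shape of model.layers.0.mlp.down_proj.weight
iters = 10
flops_per_gemm = 2 * M * N * K
total_flops = iters * flops_per_gemm
elapsed = 0.05            # example wall-clock time in seconds
print(f"{total_flops / elapsed / 1e12:.1f} TFLOPS")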
@@ -25,6 +25,9 @@ import os
 from enum import IntEnum
 import torch
 import KTransformersOps
+from .custom_loader import SafeTensorLoader
+import ctypes
+import math
 
 class GGMLQuantizationType(IntEnum):
     F32 = 0
@@ -109,6 +112,7 @@ GGML_TYPES = {
     "Q5_K": 13,
     "Q6_K": 14,
     "IQ4_XS": 23,
+    "BF16": 30,
 }
 
 GGML_NAMES = {ggml_type: name for name, ggml_type in GGML_TYPES.items()}
@@ -116,6 +120,7 @@ GGML_NAMES = {ggml_type: name for name, ggml_type in GGML_TYPES.items()}
 GGML_BLOCK_SIZES = {
     "F32": 4,
     "F16": 2,
+    "BF16": 2,
     "Q4_0": 2 + 16,
     "Q5_0": 2 + 4 + 16,
     "Q8_0": 2 + 32,
@@ -125,11 +130,13 @@ GGML_BLOCK_SIZES = {
     "Q5_K": 2 + 2 + 12 + 256 // 8 + 256 // 2,
     "Q6_K": 256 // 2 + 256 // 4 + 256 // 16 + 2,
     "IQ4_XS": 2 + 2 + 256 // 2 + 256 // 64,
+    "FP8": 1,
 }
 
 GGML_ELEMENTS_PER_BLOCK = {
     "F32": 1,
     "F16": 1,
+    "BF16": 1,
     "Q4_0": 32,
     "Q5_0": 32,
     "Q8_0": 32,
@@ -139,6 +146,7 @@ GGML_ELEMENTS_PER_BLOCK = {
     "Q5_K": 256,
     "Q6_K": 256,
     "IQ4_XS": 256,
+    "FP8": 1,
 }
 
 DATA_TYPES = {
@@ -155,6 +163,7 @@ DATA_TYPES = {
     "uint64": 10,
     "int64": 11,
     "float64": 12,
+    "FP8": 13,
 }
 
 class GGUFLoader:
@@ -162,10 +171,15 @@ class GGUFLoader:
     gguf_path: str
     tensor_file_map: dict # {tensor_name: tensor_file_path}
     gguf_file_meta: dict
+    safetensor_loader: SafeTensorLoader
     def __init__(self, gguf_path: str):
         # Check dir exist
         if not os.path.exists(gguf_path):
             raise FileNotFoundError(f"GGUF dir not found: {gguf_path}")
+        if os.path.isfile(gguf_path):
+            gguf_path = os.path.dirname(gguf_path)
+
+        self.safetensor_loader = None
+
         self.tensor_info = {}
         self.gguf_path = gguf_path
@@ -173,16 +187,26 @@ class GGUFLoader:
         self.file_data_map = {}
         self.gguf_file_meta = {}
         self.tensor_device_map = {}
+
+        # I know this is ugly, but I don't want to change the original code too much
+        # TODO: merge gguf load and other loads.
+        safetensor_loader = SafeTensorLoader(gguf_path)
+        if safetensor_loader.tensor_file_map:
+            self.safetensor_loader = safetensor_loader
+            return
         # Walk through all the .gguf files in the directory
+        found_gguf = False
         for root, dirs, files in os.walk(gguf_path):
             for file in files:
                 if file.endswith(".gguf"):
+                    found_gguf = True
                    file_name = os.path.join(root, file)
                    with open(file_name, "rb") as f:
                        self.load_gguf(f)
                    if file_name not in self.file_data_map:
                        self.file_data_map[file_name] = np.memmap(file_name, mode = 'r')
+        if not found_gguf:
+            raise FileNotFoundError(f"Cannot find any .gguf files in: {gguf_path}")
 
     def load_gguf(self, f):
         f.seek(0)
@@ -207,7 +231,7 @@ class GGUFLoader:
             shape = [read_value(f, DATA_TYPES["uint64"]) for _ in range(shape_len)]
             ggml_type = read_value(f, DATA_TYPES["uint32"])
             bad_offset = read_value(f, DATA_TYPES["uint64"])
-            n_elems = int(np.prod(shape))
+            n_elems = int(math.prod(shape))
             block_size, type_size = GGML_QUANT_SIZES[ggml_type]
             n_bytes = n_elems * type_size // block_size
             np_dims = tuple(reversed(shape))
@@ -276,8 +300,49 @@ class GGUFLoader:
         itemsize = int(np.empty([], dtype = item_type).itemsize)
         return mmap_data[offset : offset + itemsize * item_count]
 
-    def load_gguf_tensor(self, name: str, device:str = "cpu")->torch.Tensor:
+    def get_undequanted_tensor_and_ggml_type(self, name):
         t = self.tensor_info[name]
+        data = self.get_mmap_tensor(name)
+        ggml_type = t["ggml_type"]
+        data = torch.from_numpy(data)
+        return data, ggml_type
+
+    def load_expert_tensor(self, name, data, expert_id, elements_per_expert, device = "cuda", target_dtype = torch.get_default_dtype())->torch.Tensor:
+        t = self.tensor_info[name]
+        if device.lower() == "cpu":
+            print(f"loading expert {expert_id} of {name} with CPU")
+        shape = t["shape"]
+        ggml_type = t["ggml_type"]
+        if ggml_type not in GGML_NAMES:
+            raise NotImplementedError(f"ggml_type {ggml_type} not implemented")
+        ggml_name = GGML_NAMES[ggml_type]
+
+        # TODO: experts may fused in quant block, split it
+        assert elements_per_expert % GGML_ELEMENTS_PER_BLOCK[ggml_name] == 0, "experts may fused in quant block, please use CPU dequant"
+
+        blocks_per_experts = elements_per_expert // GGML_ELEMENTS_PER_BLOCK[ggml_name]
+        block_size = GGML_BLOCK_SIZES[ggml_name]
+        offset = expert_id * block_size * blocks_per_experts
+        data = data[offset: offset + block_size * blocks_per_experts]
+
+        if "cuda" in device.lower():
+            values = GGML_DEQUANTIZE_GPU[ggml_name](data, device, target_dtype)
+        else:
+            values = GGML_DEQUANTIZE[ggml_name](data)
+            values = torch.from_numpy(values.copy())
+
+        if ggml_name == "BF16":
+            values = values.view(torch.bfloat16)
+        values = values.view(shape[-2::-1])
+
+        return values
+
+    def load_gguf_tensor(self, name: str, device:str = "cpu", target_dtype = None)->torch.Tensor:
+        t = self.tensor_info[name]
+        if device.lower() == "cpu":
+            print(f"loading {name} with CPU")
+        if target_dtype == None:
+            target_dtype = torch.get_default_dtype()
+
         shape = t["shape"]
         ggml_type = t["ggml_type"]
@@ -289,14 +354,38 @@ class GGUFLoader:
 
         data = self.get_mmap_tensor(name)
+
-        if "cuda" in device.lower():
-            values = GGML_DEQUANTIZE_GPU[ggml_name](data, device)
-            #values = GGML_DEQUANTIZE[ggml_name](data)
-            #print("load_gguf_tensor")
-            #values = torch.from_numpy(values).to(device = device)
+        block_size = GGML_BLOCK_SIZES[ggml_name]
+        elements_per_block = GGML_ELEMENTS_PER_BLOCK[ggml_name]
+        num_elements = int(np.prod(shape))
+        num_blocks = num_elements // elements_per_block
+
+        blocks_per_iter = 16384
+        if num_blocks > blocks_per_iter: # dequant large tensor
+            values = torch.empty((num_blocks, elements_per_block), dtype=target_dtype, device=device)
+            for i in range( (num_blocks + blocks_per_iter - 1) // blocks_per_iter):
+                blocks_begin = i * blocks_per_iter
+                blocks_end = min(blocks_begin + blocks_per_iter, num_blocks)
+                if "cuda" in device.lower():
+                    cur_values = GGML_DEQUANTIZE_GPU[ggml_name](data[blocks_begin*block_size : blocks_end*block_size], device, target_dtype)
+                else:
+                    cur_values = GGML_DEQUANTIZE[ggml_name](data[blocks_begin*block_size : blocks_end*block_size])
+                    cur_values = torch.from_numpy(cur_values.copy())
+
+                cur_values = cur_values.view(-1, elements_per_block)
+                if ggml_name == "BF16":
+                    cur_values = cur_values.view(torch.bfloat16)
+                values[blocks_begin : blocks_end] = cur_values
         else:
-            values = GGML_DEQUANTIZE[ggml_name](data)
-            values = torch.from_numpy(values)
+            if "cuda" in device.lower():
+                values = GGML_DEQUANTIZE_GPU[ggml_name](data, device)
+            else:
+                values = GGML_DEQUANTIZE[ggml_name](data)
+                values = torch.from_numpy(values)
+
+            if ggml_name == "BF16":
+                values = values.view(torch.bfloat16)
+
         values = values.view(shape[::-1])
         if "attn_q" in name and self.gguf_file_meta['general.architecture'] in ["llama"]:
             n_head = self.gguf_file_meta['llama.attention.head_count']
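The chunked path in load_gguf_tensor above dequantizes at most blocks_per_iter quant blocks per step, slicing the memory-mapped bytes as data[blocks_begin*block_size : blocks_end*block_size]. A small sketch of the bookkeeping (the tensor size here is an illustrative assumption, not taken from a real model):

num_blocks = 1_000_000            # assumed number of quant blocks in a large tensor
blocks_per_iter = 16384
iters = (num_blocks + blocks_per_iter - 1) // blocks_per_iter
print(iters)                      # 62 slices
last_begin = (iters - 1) * blocks_per_iter
last_end = min(last_begin + blocks_per_iter, num_blocks)
print(last_begin, last_end)       # the final, partial slice: 999424 1000000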
@@ -352,6 +441,9 @@ def read_value(f, data_type):
         elem_type, count = struct.unpack("<IQ", f.read(4 + 8))
         return [read_value(f, elem_type) for _ in range(count)]
 
+    elif data_type == DATA_TYPES["FP8"]:
+        return struct.unpack("<B", f.read(1))[0]
+
     else:
         raise NotImplementedError(f"Data type {data_type} not implemented")
 
@@ -392,14 +484,15 @@ def dequantize_q2_k(data):
 
     return d * (scales & 15) * (tmp & 3) - dmin * (scales >> 4)
 
-def dequantize_q2_k_gpu(data, device:str ="cuda"):
+def dequantize_q2_k_gpu(data, device:str ="cuda", target_dtype = torch.get_default_dtype()):
     block_size = GGML_BLOCK_SIZES["Q2_K"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["Q2_K"]
     data = np.frombuffer(data, dtype=data.dtype)
     device = torch.device(device)
     # TODO: this and from_numpy in other functions will cause a warning saying that numpy is not writable,
     # the best way to fix this is transfer ptr to KTransformersOps instead of Tensor.
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_q2_k(data, block_size, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_q2_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 def dequantize_q3_k(data):
     # C implementation
@@ -443,14 +536,15 @@ def dequantize_q3_k(data):
         (((qs[:, 48:64] >> 6) & 3) - bits[:, 16:, 7])
     ], axis=1)
 
-def dequantize_q3_k_gpu(data, device:str ="cuda"):
+def dequantize_q3_k_gpu(data, device:str ="cuda", target_dtype = torch.get_default_dtype()):
     block_size = GGML_BLOCK_SIZES["Q3_K"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["Q3_K"]
     data = np.frombuffer(data, dtype=data.dtype)
     device = torch.device(device)
     # TODO: this and from_numpy in other functions will cause a warning saying that numpy is not writable,
     # the best way to fix this is transfer ptr to KTransformersOps instead of Tensor.
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_q3_k(data, block_size, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_q3_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 def dequantize_q4_k(data):
     # C implementation
@@ -474,13 +568,15 @@ def dequantize_q4_k(data):
     # Dequantize final weights using scales and offsets
     return factors * qs2 - offsets
 
-def dequantize_q4_k_gpu(data, device:str ="cuda"):
+def dequantize_q4_k_gpu(data, device:str ="cuda", target_dtype = torch.get_default_dtype()):
+    block_size = GGML_BLOCK_SIZES["Q4_K"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["Q4_K"]
     data = np.frombuffer(data, dtype=data.dtype)
     device = torch.device(device)
     # TODO: this and from_numpy in other functions will cause a warning saying that numpy is not writable,
     # the best way to fix this is transfer ptr to KTransformersOps instead of Tensor.
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_q4_k(data, 144, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_q4_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 def dequantize_q5_k(data):
     # C implementation
@@ -538,14 +634,15 @@ def dequantize_q5_k(data):
         d8 * (qs_hi_4[:, 3] + (bits[:, :, 7] << 4)) - m8,
     ], axis=1)
 
-def dequantize_q5_k_gpu(data, device:str ="cuda"):
+def dequantize_q5_k_gpu(data, device:str ="cuda", target_dtype = torch.get_default_dtype()):
     block_size = GGML_BLOCK_SIZES["Q5_K"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["Q5_K"]
     data = np.frombuffer(data, dtype=data.dtype)
     device = torch.device(device)
     # TODO: this and from_numpy in other functions will cause a warning saying that numpy is not writable,
     # the best way to fix this is transfer ptr to KTransformersOps instead of Tensor.
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_q5_k(data, block_size, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_q5_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 def dequantize_q6_k(data):
     # C implementation
@@ -596,13 +693,14 @@ def dequantize_q6_k(data):
     ], axis=1)
 
 # @torch.jit.script
-def dequantize_q6_k_gpu(data: np.ndarray, device:str = "cuda"):
+def dequantize_q6_k_gpu(data: np.ndarray, device:str = "cuda", target_dtype = torch.get_default_dtype()):
     block_size = GGML_BLOCK_SIZES["Q6_K"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["Q6_K"]
     device = torch.device(device)
     num_blocks = len(data) // block_size
     data = np.frombuffer(data, dtype=data.dtype)
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_q6_k(data, block_size, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_q6_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 kvalues_iq4nl = np.array([-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113], dtype=np.int8)
 
@@ -636,13 +734,14 @@ def dequantize_iq4_xs(data):
 
     return y.flatten()
 
-def dequantize_iq4_xs_gpu(data: np.ndarray, device:str = "cuda"):
+def dequantize_iq4_xs_gpu(data: np.ndarray, device:str = "cuda", target_dtype = torch.get_default_dtype()):
     block_size = GGML_BLOCK_SIZES["IQ4_XS"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["IQ4_XS"]
     device = torch.device(device)
     num_blocks = len(data) // block_size
     data = np.frombuffer(data, dtype=data.dtype)
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_iq4_xs(data, block_size, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_iq4_xs(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 def dequantize_q4_0(data):
     # C implementation
@@ -659,7 +758,7 @@ def dequantize_q4_0(data):
         scales * ((qs >> 4).astype(np.int8) - 8),
     ], axis=1)
 
-def dequantize_q4_0_gpu(data):
+def dequantize_q4_0_gpu(data, device:str = "cuda", target_dtype = torch.get_default_dtype()):
     raise NotImplementedError()
 
 def dequantize_q5_0(data):
@@ -683,7 +782,7 @@ def dequantize_q5_0(data):
         scales * x1,
     ], axis=1)
 
-def dequantize_q5_0_gpu(data):
+def dequantize_q5_0_gpu(data, device:str = "cuda", target_dtype = torch.get_default_dtype()):
     raise NotImplementedError()
 
 def dequantize_q8_0(data):
@@ -695,32 +794,41 @@ def dequantize_q8_0(data):
     qs = np.frombuffer(data, dtype=np.int8).reshape(num_blocks, 2 + 32)[:, 2:]
     return scales * qs
 
-def dequantize_q8_0_gpu(data, device:str = "cuda"):
+def dequantize_q8_0_gpu(data, device:str = "cuda", target_dtype = torch.get_default_dtype()):
     # C struct definition
     # https://github.com/ggerganov/ggml/blob/fca1caafea7de9fbd7efc733b9818f9cf2da3050/src/ggml-quants.h#L43
-    num_blocks = len(data) // GGML_BLOCK_SIZES["Q8_0"]
+    block_size = GGML_BLOCK_SIZES["Q8_0"]
+    ele_per_blk = GGML_ELEMENTS_PER_BLOCK["Q8_0"]
     device = torch.device(device)
     data = np.frombuffer(data, dtype=data.dtype)
-    data = torch.from_numpy(data)
-    return KTransformersOps.dequantize_q8_0(data, 34, device)
+    c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
+    return KTransformersOps.dequantize_q8_0(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
 
 
 def dequantize_f32(data):
     return np.frombuffer(data, dtype=np.float32)
 
-def dequantize_f32_gpu(data, device):
+def dequantize_f32_gpu(data, device, target_dtype = torch.get_default_dtype()):
     data = np.frombuffer(data, dtype=np.float32)
-    res = torch.from_numpy(data)
-    res_gpu = torch.empty_like(res, device=device)
+    res = torch.from_numpy(data.copy())
+    res_gpu = torch.empty_like(res, device=device, dtype=target_dtype)
     res_gpu.copy_(res)
     return res_gpu
 
 def dequantize_f16(data):
     return np.frombuffer(data, dtype=np.float16)
 
-def dequantize_f16_gpu(data, device):
+def dequantize_f16_gpu(data, device, target_dtype = torch.get_default_dtype()):
     data = np.frombuffer(data, dtype=np.float16)
-    res = torch.from_numpy(data)
+    res = torch.from_numpy(data.copy())
+    res_gpu = torch.empty_like(res, device=device, dtype=target_dtype)
+    res_gpu.copy_(res)
+    return res_gpu
+
+def dequantize_bf16_gpu(data, device, target_dtype = torch.get_default_dtype()):
+    data = np.frombuffer(data, dtype=np.float16)
+    res = torch.from_numpy(data.copy())
     res_gpu = torch.empty_like(res, device=device)
     res_gpu.copy_(res)
     return res_gpu
@@ -728,6 +836,7 @@ def dequantize_f16_gpu(data, device):
 GGML_DEQUANTIZE = {
     "F32": dequantize_f32,
     "F16": dequantize_f16,
+    "BF16": dequantize_f16,
     "Q4_0": dequantize_q4_0,
     "Q5_0": dequantize_q5_0,
     "Q8_0": dequantize_q8_0,
@@ -742,6 +851,7 @@ GGML_DEQUANTIZE = {
 GGML_DEQUANTIZE_GPU = {
     "F32": dequantize_f32_gpu,
     "F16": dequantize_f16_gpu,
+    "BF16": dequantize_bf16_gpu,
     "Q4_0": dequantize_q4_0_gpu,
     "Q5_0": dequantize_q5_0_gpu,
     "Q8_0": dequantize_q8_0_gpu,
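The GPU dequant helpers above now hand KTransformersOps a raw pointer into the memory-mapped buffer instead of a torch.Tensor. A standalone sketch of that ctypes pattern, using an illustrative array rather than GGUF data:

import ctypes
import numpy as np

data = np.arange(16, dtype=np.int8)
# Address of the first element, as an integer, built the same way the dequant helpers build c_pointer
c_pointer = ctypes.addressof(ctypes.cast(data.ctypes.data, ctypes.POINTER(ctypes.c_int8)).contents)
print(c_pointer == data.ctypes.data)   # True: both are the buffer's base address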
86 ktransformers/util/custom_loader.py Normal file
@@ -0,0 +1,86 @@
import struct
import warnings
import numpy as np
import re
import numpy.typing as npt
from typing import Sequence
import os
from enum import IntEnum
import torch
import KTransformersOps
from safetensors import safe_open
from ktransformers.ktransformers_ext.triton.fp8gemm import fp8_gemm, act_quant, weight_dequant
from safetensors.torch import save_file

class SafeTensorLoader:
    tensor_file_map = {}
    tensor_type_map = {}
    file_handle_map = {}

    def __init__(self, file_path: str):
        self.__load_tensor_file_map(file_path)

    def __load_tensor_file_map(self, file_path: str):
        # Normalize the input path: accept either a file or a folder path
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Path not found: {file_path}")
        if os.path.isfile(file_path):
            folder_path = os.path.dirname(file_path)
        else:
            folder_path = file_path

        found_safetensor = False
        for root, _, files in os.walk(folder_path):
            files = sorted(files)
            for file in files:
                if file.endswith(".safetensors"):
                    found_safetensor = True
                    file_path = os.path.join(root, file)
                    if file not in self.file_handle_map:
                        try:
                            handle = safe_open(file_path, framework="pt")
                            self.file_handle_map[file] = handle
                        except Exception as e:
                            print(f"Error opening Safetensor file {file_path}: {e}")
                            continue

                    f = self.file_handle_map.get(file)
                    if f is None:
                        continue
                    try:
                        for key in f.keys():
                            self.tensor_file_map[key] = file
                    except Exception as e:
                        print(f"Error reading Safetensor file {file_path}: {e}")

        # if not found_safetensor:
        #     raise FileNotFoundError(f"No Safetensor files found in {folder_path}")

    def load_tensor(self, key: str, device: str="cpu"):
        if key not in self.tensor_file_map:
            raise KeyError(f"Key {key} not found in Safetensor files")
        file = self.tensor_file_map[key]
        f = self.file_handle_map.get(file)
        if f is None:
            raise FileNotFoundError(f"File {file} not found in Safetensor files")
        tensor = f.get_tensor(key)
        return tensor.to(device)

    def close_all_handles(self):
        for handle in self.file_handle_map.values():
            handle.close()
        self.file_handle_map.clear()

    def load_dequantized_tensor(self, key:str, device: str="cpu"):
        if key not in self.tensor_file_map:
            raise KeyError(f"Key {key} not found in Safetensor files")
        file = self.tensor_file_map[key]
        f = self.file_handle_map.get(file)
        if f is None:
            raise FileNotFoundError(f"File {file} not found in Safetensor files")
        tensor = f.get_tensor(key).to(device)
        if key.endswith(".weight"):
            if key[:-7] + ".weight_scale_inv" in self.tensor_file_map:
                weight_scale_inv = f.get_tensor(key[:-7] + ".weight_scale_inv").to(device)
                tensor = weight_dequant(tensor, weight_scale_inv)
        return tensor.to(device)
@@ -17,9 +17,22 @@ from ktransformers.operators import base_operator
 from ktransformers.models.custom_cache import StaticCache
 from ktransformers.util.cuda_graph_runner import CUDAGraphRunner
 from ktransformers.util.textstream import TextStreamer
+from ktransformers.operators.flashinfer_wrapper import MLAWrapperSingleton
 
 warm_uped = False
 
+def get_compute_capability(device:torch.device = None):
+    if torch.cuda.is_available():
+        if device is None:
+            num_gpus = torch.cuda.device_count()
+            min_compute_capability_major = 100
+            for gpu_id in range(num_gpus):
+                gpu_props = torch.cuda.get_device_properties(gpu_id)
+                min_compute_capability_major = min(min_compute_capability_major, gpu_props.major)
+            return min_compute_capability_major
+        else:
+            return torch.cuda.get_device_properties(device)
+
 def set_module(model, submodule_key, module):
     tokens = submodule_key.split('.')
     sub_tokens = tokens[:-1]
@@ -65,12 +78,22 @@ def load_cur_state_dict(module: nn.Module, gguf_loader: GGUFLoader, prefix: str
     for name, param in local_state.items():
         key = prefix + name
         translated_key = translate_name_to_gguf(key)
-        if translated_key in gguf_loader.tensor_file_map:
+
+        # TODO: Merge all loader.
+        # I know this is ugly but lets do it for now.
+        if gguf_loader.safetensor_loader is not None:
+            load_dequantized_tensor = gguf_loader.safetensor_loader.load_dequantized_tensor
+            tensor_file_map = gguf_loader.safetensor_loader.tensor_file_map
+        else:
+            load_dequantized_tensor = gguf_loader.load_gguf_tensor
+            tensor_file_map = gguf_loader.tensor_file_map
+
+        if translated_key in tensor_file_map:
             target_dtype = torch.get_default_dtype()
             device = get_device(translated_key[:translated_key.rfind(".")], gguf_loader.tensor_device_map)
             print(f"loading {translated_key} to {device}")
-            # device = "cpu" if "embd" in translated_key else "cuda"
-            weights = gguf_loader.load_gguf_tensor(translated_key, device = device).to(dtype = target_dtype)
+            torch.cuda.empty_cache()
+            weights = load_dequantized_tensor(translated_key, device=device).to(dtype=target_dtype)
             set_param(module, name, weights)
             del weights
         else:
@@ -78,7 +101,7 @@ load_cur_state_dict(module: nn.Module, gguf_loader: GGUFLoader, prefix: str
             raise Exception(f"can't find {translated_key} in GGUF file!")
 
 def load_weights(module:nn.Module, gguf_loader:GGUFLoader, prefix=''):
-    # print(f"recursively loading weights {prefix},{return_when_injected=}, {only_load_injected=}")
+    #print(f"recursively loading weights {prefix}")
     if not isinstance(module, base_operator.BaseInjectedModule):
         load_cur_state_dict(module, gguf_loader, prefix)
         for name, child in module._modules.items():
@@ -87,7 +110,8 @@ def load_weights(module:nn.Module, gguf_loader:GGUFLoader, prefix=''):
         module.load()
 
 def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cuda_graph: bool = True,
-                         mode = 'normal', force_think: bool = False):
+                         mode = 'normal', force_think: bool = False, chunk_prefill_size = 16384, use_flashinfer_mla = False,
+                         num_heads = None, head_dim_ckv = None, head_dim_kpe = None, q_head_dim = None):
     import os
     os.environ["TOKENIZERS_PARALLELISM"] = "false"
     torch._dynamo.config.suppress_errors = True
@@ -100,7 +124,7 @@ def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cud
 
     tokens = []
 
-    def decode_one_tokens(cuda_graph_runner, cur_token, position_ids, cache_position, past_key_values, use_cuda_graph: bool = True):
+    def decode_one_tokens(cuda_graph_runner, cur_token, position_ids, cache_position, past_key_values, logits_warper, generation_config, use_cuda_graph: bool = True):
         if cuda_graph_runner is None:
             use_cuda_graph = False
         if use_cuda_graph:
@@ -128,8 +152,25 @@ def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cud
             next_token = torch.argmax(next_token_scores, dim=-1)
         return next_token
 
+    # TODO: use CUDA Graph for chunk prefill, may get small improvement
+    def chunk_prefill(inputs, cache_position, past_key_values):
+        if mode == "long_context":
+            inputs_embeds = model.model.embed_tokens(inputs.to("cpu"))
+        else:
+            inputs_embeds = model.model.embed_tokens(inputs.to("cpu")).to(torch_device)
+        if use_flashinfer_mla:
+            MLAWrapperSingleton.update_buffer(past_key_values.max_pages)
+            MLAWrapperSingleton.need_plan_all()
+
+        logits = model(
+            inputs_embeds = inputs_embeds, cache_position=cache_position, past_key_values=past_key_values, return_dict=False, use_cache=True
+        )[0][:,-1,:].unsqueeze(0).clone().to(torch_device)
+
+        return logits
+
     torch.cuda.set_device(torch_device)
     with torch.no_grad():
 
         stream = TextStreamer(tokenizer)
         if mode != 'long_context':
             past_key_values = StaticCache(
@@ -137,26 +178,11 @@ def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cud
             )
         else:
             past_key_values = None
-        cache_position = torch.arange(seq_length, device=torch_device, dtype=torch.long)
-        generated_ids = torch.zeros(
-            batch_size, seq_length + max_new_tokens + 1, dtype=torch.int, device=torch_device
-        )
-        generated_ids[:, cache_position] = inputs.to(torch_device).to(torch.int)
-        if past_key_values != None:
-            past_key_values.cur_idx=cache_position
-        start_time = time.time()
-
-        inputs_embeds = model.model.embed_tokens(inputs.to("cpu")).to(torch_device)
-        if mode == "long_context":
-            inputs_embeds = model.model.embed_tokens(inputs.to("cpu"))
-        else:
-            inputs_embeds = model.model.embed_tokens(inputs.to("cpu")).to(torch_device)
-        logits = model(
-            inputs_embeds = inputs_embeds, cache_position=cache_position, past_key_values=past_key_values, return_dict=False, use_cache=True
-        )[0][:,-1,:].unsqueeze(0).clone().to(torch_device)
+
         generation_config, model_kwargs = model._prepare_generation_config(
-            None, max_length=max_new_tokens,
-            do_sample=True, top_k=5, top_p=0.85, temperature=0.1 # change this to modify generate config
+            None, do_sample=True
+            # change this to modify generate config
+            #top_k=5, top_p=0.85, temperature=0.1
         )
         try: # transformers==4.43
             logits_warper = (
@@ -166,23 +192,43 @@ def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cud
             logits_warper = (
                 model._get_logits_warper(generation_config)
             )
+
+        cache_position = torch.arange(seq_length, device=torch_device, dtype=torch.int32)
+        generated_ids = torch.zeros(
+            batch_size, seq_length + max_new_tokens + 1, dtype=torch.int, device=torch_device
+        )
+        generated_ids[:, cache_position] = inputs.to(torch_device).to(torch.int)
+        start_time = time.time()
+
+        chunk_start = 0
+        while chunk_start < seq_length:
+            chunk_end = min(chunk_start + chunk_prefill_size, seq_length)
+            if past_key_values != None:
+                past_key_values.cur_idx=cache_position[chunk_start:chunk_end]
+            logits = chunk_prefill(inputs[:, chunk_start:chunk_end], cache_position[chunk_start:chunk_end], past_key_values)
+            chunk_start += chunk_prefill_size
+
         next_token_scores = logits_warper(inputs, logits[:, -1, :])
         if generation_config.do_sample:
             probs = nn.functional.softmax(next_token_scores, dim=-1)
             next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
         else:
             next_token = torch.argmax(next_token_scores, dim=-1)
 
         first_token_time = time.time() - start_time
+
+        if use_flashinfer_mla:
+            MLAWrapperSingleton.reset_buffer()
+
         prefill_count = seq_length
         prefill_time = first_token_time
         if force_think:
-            print("<think>\n")
+            print("<think>")
         print(stream.put(next_token.item()), end="", flush=True)
         generated_ids[:, seq_length] = next_token
         tokens.append(int(next_token))
         inputs = torch.cat((inputs, next_token.unsqueeze(0)), dim=-1)
-        cache_position = torch.tensor([seq_length], device=torch_device, dtype=torch.long)
+        cache_position = torch.tensor([seq_length], device=torch_device, dtype=torch.int32)
         position_ids = cache_position.unsqueeze(0)
         seq_length += 1
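The loop added above walks the prompt in chunk_prefill_size slices; only the logits of the final chunk are kept for sampling the first token. An illustration of the slicing (the prompt length is chosen arbitrarily):

seq_length, chunk_prefill_size = 40_000, 16_384
chunk_start = 0
while chunk_start < seq_length:
    chunk_end = min(chunk_start + chunk_prefill_size, seq_length)
    print(chunk_start, chunk_end)   # (0, 16384), (16384, 32768), (32768, 40000)
    chunk_start += chunk_prefill_size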
@@ -190,19 +236,22 @@ def prefill_and_generate(model, tokenizer, inputs, max_new_tokens=10000, use_cud
 
         start_time = time.time()
         for i in range(1, max_new_tokens):
+            if use_flashinfer_mla:
+                MLAWrapperSingleton.plan_all(None,None,None,position_ids.squeeze(1)+1,
+                                             num_heads, head_dim_ckv, head_dim_kpe, past_key_values.page_size,
+                                             q_head_dim ** (-0.5), torch.bfloat16, torch.bfloat16)
             global warm_uped
             if use_cuda_graph and ( (warm_uped == True and int(i) == 1) or (warm_uped == False and int(i) == 2) ):
                 warm_uped = True
                 cuda_graph_runner = CUDAGraphRunner()
                 cuda_graph_runner.capture(model, next_token.unsqueeze(0), position_ids, cache_position, past_key_values, torch_device, return_dict=False, use_cache=True)
-            next_token = decode_one_tokens(cuda_graph_runner, next_token.unsqueeze(0), position_ids, cache_position, past_key_values, use_cuda_graph).to(torch_device)
+            next_token = decode_one_tokens(cuda_graph_runner, next_token.unsqueeze(0), position_ids, cache_position, past_key_values, logits_warper, generation_config, use_cuda_graph).to(torch_device)
             inputs = torch.cat((inputs, next_token.unsqueeze(0)), dim=-1)
             generated_ids[:, cache_position] = next_token.int()
             tokens.append(int(next_token))
             seq_length += 1
 
-            if next_token[0].item() == tokenizer.eos_token_id or tokenizer.decode(next_token) == '<|im_end|>':
+            if next_token[0].item() == tokenizer.eos_token_id or tokenizer.decode(next_token.tolist()) == '<|im_end|>':
                 print(stream.end(), end="", flush=True)
                 break
             else:
214 merge_tensors/merge_safetensor_gguf.py Normal file
@@ -0,0 +1,214 @@
||||||
|
# this script targets to merge the fp8 safe tensor and the gguf quantized tensors.
|
||||||
|
|
||||||
|
import os
|
||||||
|
# insert the path of the project
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, "/home/azure/ktransformers")
|
||||||
|
import argparse
|
||||||
|
import torch
|
||||||
|
from ktransformers.util.custom_gguf import GGUFLoader, translate_name_to_gguf
|
||||||
|
from safetensors import safe_open
|
||||||
|
from safetensors.torch import save_file
|
||||||
|
import re
|
||||||
|
from collections import defaultdict
|
||||||
|
|
||||||
|
def read_safetensor_keys_from_folder(folder_path)->dict:
|
||||||
|
"""
|
||||||
|
:param folder_path: folder path
|
||||||
|
:return: key_to_file_map
|
||||||
|
"""
|
||||||
|
# check if the folder path is exist
|
||||||
|
if not os.path.exists(folder_path):
|
||||||
|
raise FileNotFoundError(f"GGUF dir not found: {folder_path}")
|
||||||
|
if os.path.isfile(folder_path):
|
||||||
|
folder_path = os.path.dirname(folder_path)
|
||||||
|
|
||||||
|
key_to_file_map = {}
|
||||||
|
|
||||||
|
found_safetensor = False
|
||||||
|
for root, dirs, files in os.walk(folder_path):
|
||||||
|
# sort files
|
||||||
|
files = sorted(files)
|
||||||
|
for file in files:
|
||||||
|
if file.endswith(".safetensors"):
|
||||||
|
found_safetensor = True
|
||||||
|
file_path = os.path.join(root, file)
|
||||||
|
try:
|
||||||
|
with safe_open(file_path, framework="pt") as f:
|
||||||
|
for key in f.keys():
|
||||||
|
if "model.layers.61" in key:
|
||||||
|
# skip MTP layer
|
||||||
|
continue
|
||||||
|
# try:
|
||||||
|
# if int(key.split('.')[2]) > 4:
|
||||||
|
# continue
|
||||||
|
# except:
|
||||||
|
# pass
|
||||||
|
key_to_file_map[key] = file_path
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error reading Safetensor file {file_path}: {e}")
|
||||||
|
|
||||||
|
if not found_safetensor:
|
||||||
|
raise FileNotFoundError(f"No Safetensor files found in {folder_path}")
|
||||||
|
|
||||||
|
return key_to_file_map
|
||||||
|
|
||||||
|
tensor_from_gguf = [] # todo: add keys in gguf that should be used in the final tensor
|
||||||
|
|
||||||
|
def translate_name(name:str)->str:
|
||||||
|
"""
|
||||||
|
:param name: name of the tensor
|
||||||
|
:return: translated name
|
||||||
|
"""
|
||||||
|
name = translate_name_to_gguf(name)
|
||||||
|
name = name.replace(".up_proj.", ".ffn_up_exps.")
|
||||||
|
name = name.replace(".down_proj.", ".ffn_down_exps.")
|
||||||
|
name = name.replace(".gate_proj.", ".ffn_gate_exps.")
|
||||||
|
name = name.replace(".ffn_gate_inp.e_score_correction_bias", ".exp_probs_b.bias")
|
||||||
|
return name
|
||||||
|
|
||||||
|
|
||||||
|
def combine_tensor_sources(safetensor_path:str, gguf_path:str):
|
||||||
|
gguf_loader = GGUFLoader(gguf_path)
|
||||||
|
gguf_tensor_file_map = gguf_loader.tensor_file_map
|
||||||
|
safetensor_tensor_file_map = read_safetensor_keys_from_folder(safetensor_path)
|
||||||
|
|
||||||
|
# build a map for the key to the tensor
|
||||||
|
# according to the key, we can get the tensor from the file
|
||||||
|
|
||||||
|
target_tensor_map = {}
|
||||||
|
for key in safetensor_tensor_file_map.keys():
|
||||||
|
# for all experts, we use the gguf tensor
|
||||||
|
if ".mlp.experts." in key:
|
||||||
|
if '.weight_scale_inv' in key:
|
||||||
|
continue
|
||||||
|
key = '.'.join(key.split('.')[:5]+key.split('.')[-2:])
|
||||||
|
translated_key = translate_name(key)
|
||||||
|
target_tensor_map[key] = gguf_tensor_file_map[translated_key]
|
||||||
|
continue
|
||||||
|
|
||||||
|
if any(target_key in key for target_key in tensor_from_gguf):
|
||||||
|
target_tensor_map[key] = gguf_tensor_file_map[translate_name(key)]
|
||||||
|
else:
|
||||||
|
target_tensor_map[key] = safetensor_tensor_file_map[key]
|
||||||
|
|
||||||
|
return target_tensor_map, gguf_loader
|
||||||
|
|
||||||
|
def write_combined_tensor(target_tensor_map: dict, output_path: str, gguf_loader: GGUFLoader):
|
||||||
|
# Ensure output directory exists
|
||||||
|
os.makedirs(output_path, exist_ok=True)
|
||||||
|
|
||||||
|
# Cache for safetensor file handles and GGUF loaders
|
||||||
|
safetensors_cache = {}
|
||||||
|
gguf_cache = {}
|
||||||
|
|
||||||
|
# Group tensors by layer
|
||||||
|
layer_groups = defaultdict(list)
|
||||||
|
non_layer_keys = []
|
||||||
|
layer_pattern = re.compile(r'\.layers\.(\d+)\.')
|
||||||
|
|
||||||
|
for key in target_tensor_map:
|
||||||
|
match = layer_pattern.search(key)
|
||||||
|
if match:
|
||||||
|
layer_num = int(match.group(1))
|
||||||
|
layer_groups[layer_num].append(key)
|
||||||
|
else:
|
||||||
|
non_layer_keys.append(key)
|
||||||
|
|
||||||
|
# Calculate total shards
|
||||||
|
total_shards = len(layer_groups) + (1 if non_layer_keys else 0) - 1
|
||||||
|
if total_shards == 0:
|
||||||
|
raise ValueError("No tensors to save")
|
||||||
|
|
||||||
|
shard_idx = 0
|
||||||
|
|
||||||
|
# Save non-layer tensors to the first shard if they exist
|
||||||
|
if non_layer_keys:
|
||||||
|
tensors = {}
|
||||||
|
for key in non_layer_keys:
|
||||||
|
file_path = target_tensor_map[key]
|
||||||
|
tensor = None
|
||||||
|
ggml_type = None
|
||||||
|
if file_path.endswith('.safetensors'):
|
||||||
|
if file_path not in safetensors_cache:
|
||||||
|
safetensors_cache[file_path] = safe_open(file_path, framework='pt')
|
||||||
|
f = safetensors_cache[file_path]
|
||||||
|
tensor = f.get_tensor(key)
|
||||||
|
elif file_path.endswith('.gguf'):
|
||||||
|
gguf_name = translate_name(key)
|
||||||
|
tensor, ggml_type = gguf_loader.get_undequanted_tensor_and_ggml_type(gguf_name)
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unsupported file format: {file_path}")
|
||||||
|
tensors[translate_name(key)] = tensor
|
||||||
|
if ggml_type:
|
||||||
|
ggml_type = torch.tensor(ggml_type)
|
||||||
|
ggml_key = translate_name(key)[:-7] + ".ggml_type" if translate_name(key).endswith(".weight") else translate_name(key) + ".ggml_type"
|
||||||
|
tensors[ggml_key] = ggml_type
|
||||||
|
|
||||||
|
output_file = os.path.join(output_path, f"model-{shard_idx:05}-of-{total_shards:05}.safetensors")
|
||||||
|
print(f"Saving non-layer tensors to {output_file}")
|
||||||
|
save_file(tensors, output_file)
|
||||||
|
print(tensors.keys())
|
||||||
|
|
||||||
|
shard_idx += 1
|
||||||
|
|
||||||
|
# Save each layer's tensors to subsequent shards
|
||||||
|
for layer_num in sorted(layer_groups.keys()):
|
||||||
|
layer_keys = layer_groups[layer_num]
|
||||||
|
tensors = {}
|
||||||
|
for key in layer_keys:
|
||||||
|
file_path = target_tensor_map[key]
|
||||||
|
tensor = None
|
||||||
|
ggml_type = None
|
||||||
|
if file_path.endswith('.safetensors'):
|
||||||
|
if file_path not in safetensors_cache:
|
||||||
|
safetensors_cache[file_path] = safe_open(file_path, framework='pt')
|
||||||
|
f = safetensors_cache[file_path]
|
||||||
|
tensor = f.get_tensor(key)
|
||||||
|
tensor_info = tensor.shape
|
||||||
|
elif file_path.endswith('.gguf'):
|
||||||
|
gguf_name = translate_name(key)
|
||||||
|
tensor, ggml_type = gguf_loader.get_undequanted_tensor_and_ggml_type(gguf_name)
|
||||||
|
# tensor_info = gguf_loader.tensor_info[gguf_name]
|
||||||
|
# ggml_type = gguf_loader.tensor_info[gguf_name]['ggml_type']
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unsupported file format: {file_path}")
|
||||||
|
tensors[translate_name(key)] = tensor
|
||||||
|
if ggml_type:
|
||||||
|
ggml_type = torch.tensor(ggml_type)
|
||||||
|
ggml_key = translate_name(key)[:-7] + ".ggml_type" if translate_name(key).endswith(".weight") else translate_name(key) + ".ggml_type"
|
||||||
|
tensors[ggml_key] = ggml_type
|
||||||
|
|
||||||
|
output_file = os.path.join(output_path, f"model-{shard_idx:05}-of-{total_shards:05}.safetensors")
|
||||||
|
print(f"Saving layer {layer_num} to {output_file}")
|
||||||
|
# print(tensors.keys())
|
||||||
|
save_file(tensors, output_file)
|
||||||
|
shard_idx += 1
|
||||||
|
|
||||||
|
return
|
||||||
|
|
||||||
|
def main():
    # Create the command-line argument parser
    parser = argparse.ArgumentParser(description="Read parameters from Safetensor and GGUF files")
    parser.add_argument("--safetensor_path", type=str, help="Path to the Safetensor file", default="/mnt/data/model/DeepSeek-V3")
    parser.add_argument("--gguf_path", type=str, help="Path to the GGUF file", default="/mnt/data/model/DeepseekV3-q4km-gguf")
    parser.add_argument("--output_path", type=str, help="Path to the output file", default="/mnt/data/model/ktrans-safetensors/DeepSeek-V3-q4km-fp8")

    # print all the arguments
    print("All the arguments:")
    print(parser.parse_args())

    # Parse the command-line arguments
    args = parser.parse_args()

    safetensor_path = args.safetensor_path
    gguf_path = args.gguf_path
    output_path = args.output_path

    target_tensor_map, gguf_loader = combine_tensor_sources(safetensor_path, gguf_path)
    write_combined_tensor(target_tensor_map, output_path, gguf_loader)

    return


if __name__ == "__main__":
    main()
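The same conversion can also be driven programmatically through the two entry points that main() calls. A minimal sketch, assuming the script above has been saved as an importable module; the module name merge_script is a placeholder, and the paths are just the defaults from the argument parser:

# Programmatic sketch; "merge_script" is a placeholder name for the module above.
from merge_script import combine_tensor_sources, write_combined_tensor

safetensor_path = "/mnt/data/model/DeepSeek-V3"
gguf_path = "/mnt/data/model/DeepseekV3-q4km-gguf"
output_path = "/mnt/data/model/ktrans-safetensors/DeepSeek-V3-q4km-fp8"

# Map each target tensor name to its source file, then write the merged tensors
# out as sharded .safetensors files.
target_tensor_map, gguf_loader = combine_tensor_sources(safetensor_path, gguf_path)
write_combined_tensor(target_tensor_map, output_path, gguf_loader)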
@ -4,4 +4,6 @@ numpy
torch>=2.3.0
packaging
cpufeature
protobuf
+tiktoken
+blobfile
121
setup.py
@ -1,16 +1,16 @@
#!/usr/bin/env python
# coding=utf-8
'''
Description :
Author : chenxl
Date : 2024-07-27 16:15:27
Version : 1.0.0
LastEditors : chenxl
LastEditTime : 2024-08-14 16:36:19
Adapted from:
https://github.com/Dao-AILab/flash-attention/blob/v2.6.3/setup.py
Copyright (c) 2023, Tri Dao.
Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
'''

import os
@ -30,6 +30,11 @@ from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
from setuptools import setup, Extension
from cpufeature.extension import CPUFeature
from torch.utils.cpp_extension import BuildExtension, CUDAExtension, CUDA_HOME
+try:
+    from torch_musa.utils.simple_porting import SimplePorting
+    from torch_musa.utils.musa_extension import BuildExtension, MUSAExtension, MUSA_HOME
+except ImportError:
+    MUSA_HOME=None

class CpuInstructInfo:
    CPU_INSTRUCT = os.getenv("CPU_INSTRUCT", "NATIVE")
@ -40,7 +45,7 @@ class CpuInstructInfo:
    CMAKE_FANCY = "-DLLAMA_NATIVE=OFF -DLLAMA_FMA=ON -DLLAMA_F16C=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=ON -DLLAMA_AVX512_FANCY_SIMD=ON"
    CMAKE_AVX512 = "-DLLAMA_NATIVE=OFF -DLLAMA_FMA=ON -DLLAMA_F16C=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=ON"
    CMAKE_AVX2 = "-DLLAMA_NATIVE=OFF -DLLAMA_FMA=ON -DLLAMA_F16C=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=ON"

class VersionInfo:
    THIS_DIR = os.path.dirname(os.path.abspath(__file__))
    PACKAGE_NAME = "ktransformers"
@ -49,6 +54,16 @@ class VersionInfo:
    )
    FORCE_BUILD = os.getenv("KTRANSFORMERS_FORCE_BUILD", "FALSE") == "TRUE"

+    def get_musa_bare_metal_version(self, musa_dir):
+        raw_output = subprocess.run(
+            [musa_dir + "/bin/mcc", "-v"], check=True,
+            stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout.decode("utf-8")
+        output = raw_output.split()
+        release_idx = output.index("version") + 1
+        bare_metal_version = parse(output[release_idx].split(",")[0])
+        musa_version = f"{bare_metal_version.major}{bare_metal_version.minor}"
+        return musa_version

    def get_cuda_bare_metal_version(self, cuda_dir):
        raw_output = subprocess.check_output(
            [cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
@ -58,7 +73,7 @@ class VersionInfo:
        cuda_version = f"{bare_metal_version.major}{bare_metal_version.minor}"
        return cuda_version

-    def get_cuda_version_of_torch(self,):
+    def get_cuda_version_of_torch(self):
        torch_cuda_version = parse(torch.version.cuda)
        cuda_version = f"{torch_cuda_version.major}{torch_cuda_version.minor}"
        return cuda_version
@ -117,7 +132,7 @@ class VersionInfo:
        torch_version_raw = parse(torch.__version__)
        torch_version = f"{torch_version_raw.major}{torch_version_raw.minor}"
        return torch_version

    def get_flash_version(self,):
        version_file = os.path.join(
            Path(VersionInfo.THIS_DIR), VersionInfo.PACKAGE_NAME, "__init__.py")
@ -128,12 +143,21 @@ class VersionInfo:
        return flash_version

    def get_package_version(self, full_version=False):
-        flash_version = self.get_flash_version()
-        package_version = f"{str(flash_version)}+cu{self.get_cuda_bare_metal_version(CUDA_HOME)}torch{self.get_torch_version()}{self.get_cpu_instruct()}"
+        flash_version = str(self.get_flash_version())
+        torch_version = self.get_torch_version()
+        cpu_instruct = self.get_cpu_instruct()
+        backend_version = ""
+        if CUDA_HOME is not None:
+            backend_version = f"cu{self.get_cuda_bare_metal_version(CUDA_HOME)}"
+        elif MUSA_HOME is not None:
+            backend_version = f"mu{self.get_musa_bare_metal_version(MUSA_HOME)}"
+        else:
+            raise ValueError("Unsupported backend: CUDA_HOME and MUSA_HOME are not set.")
+        package_version = f"{flash_version}+{backend_version}torch{torch_version}{cpu_instruct}"
        if full_version:
            return package_version
        if not VersionInfo.FORCE_BUILD:
-            return str(flash_version)
+            return flash_version
        return package_version


@ -218,11 +242,19 @@ class CMakeBuild(BuildExtension):
            f"-DPYTHON_EXECUTABLE={sys.executable}",
            f"-DCMAKE_BUILD_TYPE={cfg}",  # not used on MSVC, but no harm
        ]

+        if CUDA_HOME is not None:
+            cmake_args += ["-DKTRANSFORMERS_USE_CUDA=ON"]
+        elif MUSA_HOME is not None:
+            cmake_args += ["-DKTRANSFORMERS_USE_MUSA=ON"]
+        else:
+            raise ValueError("Unsupported backend: CUDA_HOME and MUSA_HOME are not set.")

        build_args = []
        if "CMAKE_ARGS" in os.environ:
            cmake_args += [
                item for item in os.environ["CMAKE_ARGS"].split(" ") if item]

        if CpuInstructInfo.CPU_INSTRUCT == CpuInstructInfo.FANCY:
            cpu_args = CpuInstructInfo.CMAKE_FANCY
        elif CpuInstructInfo.CPU_INSTRUCT == CpuInstructInfo.AVX512:
@ -231,7 +263,7 @@ class CMakeBuild(BuildExtension):
            cpu_args = CpuInstructInfo.CMAKE_AVX2
        else:
            cpu_args = CpuInstructInfo.CMAKE_NATIVE

        cmake_args += [
            item for item in cpu_args.split(" ") if item
        ]
@ -258,7 +290,7 @@ class CMakeBuild(BuildExtension):

            # CMake allows an arch-in-generator style for backward compatibility
            contains_arch = any(x in cmake_generator for x in {"ARM", "Win64"})
-            if not single_config and not contains_arch:
+            if not single_config and not contains_arch and cmake_generator:
                cmake_args += ["-A", PLAT_TO_CMAKE[self.plat_name]]

            # Multi-config generators have a different way to specify configs
@ -276,8 +308,13 @@ class CMakeBuild(BuildExtension):
                    "-DCMAKE_OSX_ARCHITECTURES={}".format(";".join(archs))]

        if "CMAKE_BUILD_PARALLEL_LEVEL" not in os.environ:
+            cpu_count = os.cpu_count()
+            if cpu_count is None:
+                cpu_count = 1
            if hasattr(self, "parallel") and self.parallel:
-                build_args += [f"-j{self.parallel}"]
+                build_args += [f"--parallel={self.parallel}"]
+            else:
+                build_args += [f"--parallel={cpu_count}"]
        print("CMake args:", cmake_args)
        build_temp = Path(ext.sourcedir) / "build"
        if not build_temp.exists():
@ -288,28 +325,56 @@ class CMakeBuild(BuildExtension):
        print("Standard output:", result.stdout)
        print("Standard error:", result.stderr)
        subprocess.run(
-            ["cmake", "--build", ".", *build_args], cwd=build_temp, check=True
+            ["cmake", "--build", ".", "--verbose", *build_args], cwd=build_temp, check=True
        )

+if CUDA_HOME is not None:
+    ops_module = CUDAExtension('KTransformersOps', [
+        'ktransformers/ktransformers_ext/cuda/custom_gguf/dequant.cu',
+        'ktransformers/ktransformers_ext/cuda/binding.cpp',
+        'ktransformers/ktransformers_ext/cuda/gptq_marlin/gptq_marlin.cu'
+    ],
+        extra_compile_args={
+            'cxx': ['-O3', '-DKTRANSFORMERS_USE_CUDA'],
+            'nvcc': [
+                '-O3',
+                '--use_fast_math',
+                '-Xcompiler', '-fPIC',
+                '-DKTRANSFORMERS_USE_CUDA',
+            ]
+        }
+    )
+elif MUSA_HOME is not None:
+    SimplePorting(cuda_dir_path="ktransformers/ktransformers_ext/cuda", mapping_rule={
+        # Common rules
+        "at::cuda": "at::musa",
+        "#include <ATen/cuda/CUDAContext.h>": "#include \"torch_musa/csrc/aten/musa/MUSAContext.h\"",
+        "#include <c10/cuda/CUDAGuard.h>": "#include \"torch_musa/csrc/core/MUSAGuard.h\"",
+        "nv_bfloat16": "mt_bfloat16",
+    }).run()
+    ops_module = MUSAExtension('KTransformersOps', [
+        'ktransformers/ktransformers_ext/cuda_musa/custom_gguf/dequant.mu',
+        'ktransformers/ktransformers_ext/cuda_musa/binding.cpp',
+        # TODO: Add Marlin support for MUSA.
+        # 'ktransformers/ktransformers_ext/cuda_musa/gptq_marlin/gptq_marlin.mu'
+    ],
+        extra_compile_args={
+            'cxx': ['force_mcc'],
+            'mcc': [
+                '-O3',
+                '-DKTRANSFORMERS_USE_MUSA',
+                '-DTHRUST_IGNORE_CUB_VERSION_CHECK',
+            ]
+        }
+    )
+else:
+    raise ValueError("Unsupported backend: CUDA_HOME and MUSA_HOME are not set.")

setup(
    version=VersionInfo().get_package_version(),
    cmdclass={"bdist_wheel":BuildWheelsCommand ,"build_ext": CMakeBuild},
    ext_modules=[
        CMakeExtension("cpuinfer_ext"),
-        CUDAExtension('KTransformersOps', [
-            'ktransformers/ktransformers_ext/cuda/custom_gguf/dequant.cu',
-            'ktransformers/ktransformers_ext/cuda/binding.cpp',
-            'ktransformers/ktransformers_ext/cuda/gptq_marlin/gptq_marlin.cu'
-        ],
-            extra_compile_args={
-                'cxx': ['-O3'],
-                'nvcc': [
-                    '-O3',
-                    '--use_fast_math',
-                    '-Xcompiler', '-fPIC',
-                ]
-            }
-        )
+        ops_module,
    ]
)
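With the setup.py changes above, the wheel's local version tag now encodes the accelerator backend (a cu prefix for CUDA, mu for MUSA) alongside the package version, torch version, and CPU-instruction component. A standalone sketch of that naming scheme; all values below are illustrative rather than read from a real toolchain:

# Sketch of the version-tag composition introduced in get_package_version above.
def make_package_version(flash_version, torch_version, cpu_instruct, cuda_version=None, musa_version=None):
    # Pick the backend tag the same way the patched setup.py does (CUDA first, then MUSA).
    if cuda_version is not None:
        backend_version = f"cu{cuda_version}"
    elif musa_version is not None:
        backend_version = f"mu{musa_version}"
    else:
        raise ValueError("Unsupported backend: CUDA_HOME and MUSA_HOME are not set.")
    return f"{flash_version}+{backend_version}torch{torch_version}{cpu_instruct}"

# Example values only; a real build derives these from nvcc/mcc, torch, and CPU_INSTRUCT.
print(make_package_version("0.2.0", "25", "fancy", cuda_version="121"))  # -> 0.2.0+cu121torch25fancy
print(make_package_version("0.2.0", "25", "avx2", musa_version="31"))    # -> 0.2.0+mu31torch25avx2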
@ -1,9 +0,0 @@
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it was possible to be. The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street. The Dursleys knew that the Potters had a small son, too, but they had never even seen him. This boy was another good reason for keeping the Potters away; they didn't want Dudley mixing with a child like that.When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair. None of them noticed a large, tawny owl flutter past the window. At half past eight, Mr. Dursley picked up his briefcase, pecked Mrs. Dursley on the cheek, and tried to kiss Dudley good-bye but missed, because Dudley was now having a tantrum and throwing his cereal at the walls. “Little tyke,” chortled Mr. Dursley as he left the house. He got into his car and backed out of number four's drive. It was on the corner of the street that he noticed the first sign of something peculiar — a cat reading a map. For a second, Mr. Dursley didn't realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn't a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat. It stared back. As Mr. Dursley drove around the corner and up the road, he watched the cat in his mirror. It was now reading the sign that said Privet Drive — no, looking at the sign; cats couldn't read maps or signs. Mr. Dursley gave himself a little shake and put the cat out of his mind. As he drove toward town he thought of nothing except a large order of drills he was hoping to get that day. But on the edge of town, drills were driven out of his mind by something else. As he sat in the usual morning traffic jam, he couldn't help noticing that there seemed to be a lot of strangely dressed people about. People in cloaks. Mr. Dursley couldn't bear people who dressed in funny clothes — the getups you saw on young people! He supposed this was some stupid new fashion. 
He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by. They were whispering excitedly together. Mr. Dursley was enraged to see that a couple of them weren't young at all; why, that man had to be older than he was, and wearing an emerald-green cloak! The nerve of him! But then it struck Mr. Dursley that this was probably some silly stunt — these people were obviously collecting for something… yes, that would be it. The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.
Read the text above and summarize its main idea.