From 8ed8eb2a9e820b39c5d2d88110dd200bbd26ef00 Mon Sep 17 00:00:00 2001
From: Atream <80757050+Atream@users.noreply.github.com>
Date: Sat, 15 Feb 2025 23:27:35 +0800
Subject: [PATCH] Update FAQ.md

---
 doc/en/FAQ.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/en/FAQ.md b/doc/en/FAQ.md
index 75e5e10..e738a29 100644
--- a/doc/en/FAQ.md
+++ b/doc/en/FAQ.md
@@ -25,7 +25,7 @@ from-https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
    1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value.
    2. server: Increase the `--cache_lens' to a larger value.
 2. Move more weights to the GPU.
-   Refer to the ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml
+   Refer to the ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml
 ```yaml
 - match:
     name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
@@ -39,6 +39,8 @@ from-https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
 
    You can modify layer as you want, eg. `name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$"` to move more weights to the GPU.
 
    > Note: The first matched rule in yaml will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid.
+   > Note: Currently, executing experts on the GPU conflicts with CUDA Graph, and running without CUDA Graph causes a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6 GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization.
+   > Note: KExpertsTorch is untested.
 
 ### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?
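
For illustration, below is a minimal sketch of a rule that would move the experts of layers 4-12 onto the GPU, following the pattern shown in the hunk above. The class path, kwargs, and `generate_op` value are assumptions modeled on the other files under ktransformers/optimize/optimize_rules/ and may differ between versions; treat it as a starting point rather than a drop-in rule.

```yaml
# Sketch only (assumed field names): inject the experts of layers 4-12 as Marlin experts on the GPU.
# The alternation ([4-9]|1[0-2]) matches the one- and two-digit layer indices 4 through 12.
- match:
    name: "^model\\.layers\\.([4-9]|1[0-2])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # assumed class path
    kwargs:
      generate_device: "cuda"        # run these experts on the GPU during generation
      generate_op: "KExpertsMarlin"  # Marlin GPU kernel; KExpertsTorch is untested
  recursive: False                   # do not recurse into the experts' submodules
```

As the FAQ's own note explains, place such a rule above any broader expert rule in the same file, since only the first matching rule is applied.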