Update FAQ.md

Atream 2025-02-15 23:27:35 +08:00 committed by GitHub
parent c189d55bd1
commit 8ed8eb2a9e


@@ -25,7 +25,7 @@ from-https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
   1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value.
   2. server: Increase the `--cache_lens` to a larger value.
2. Move more weights to the GPU.
    Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml
    Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml
```yaml
- match:
    name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
@@ -39,6 +39,8 @@ from-https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
You can modify the layers as you like, e.g. change `name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$"` to move more weights to the GPU (see the sketch after the notes below).
> Note: The first matching rule in the YAML file is applied. For example, if two rules match the same layer, only the first rule's replacement takes effect.
> Note: Currently, executing experts on the GPU conflicts with CUDA Graph. Without CUDA Graph, there will be a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6 GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization.
> Note: KExpertsTorch is untested.
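
For illustration, here is a minimal sketch of such a rule with the match pattern widened to layers 4~12. The `replace` section (class path, kwargs, device) follows the general structure of the multi-GPU rule files and is an assumption here, not a copy of the shipped yaml; verify it against your actual optimize_rules file before use.

```yaml
# Hypothetical sketch: widen the expert-injection rule to cover layers 4~12.
# The replace section is assumed to mirror the multi-GPU rule files;
# check class names, kwargs, and devices against your own yaml.
- match:
    name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$" # inject experts in layers 4~12
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
```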
### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?