Update FAQ.md
parent c189d55bd1
commit 8ed8eb2a9e

1 changed file with 3 additions and 1 deletion
@@ -25,7 +25,7 @@ from-https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value.
2. server: Increase `--cache_lens` to a larger value (see the example commands after the YAML snippet below).
2. Move more weights to the GPU.
- Refer to the ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml
+ Refer to the ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml
```yaml
- match:
name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
```
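For reference, here is roughly what the two context-window options above look like on the command line. Only `--max_new_tokens` and `--cache_lens` come from this FAQ entry; the model and GGUF paths and the numeric values are placeholders, so adapt them to your own setup.

```bash
# Sketch only: paths and values below are placeholders, not project defaults.
# 1) local_chat.py: allow longer generations via --max_new_tokens
python ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-V3 \
    --gguf_path /path/to/DeepSeek-V3-GGUF \
    --max_new_tokens 8192

# 2) server: pass a larger --cache_lens to your usual server launch command, e.g.
#    <your server command> --cache_lens 32768
```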
@@ -39,6 +39,8 @@ from-https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
You can modify the layer range as you like, e.g. change `name: "^model\\.layers\\.([4-9]|10)\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-9]|1[0-2])\\.mlp\\.experts$"` to move more weights to the GPU.
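
For context, each entry in those rule files pairs the `match` shown above with a `replace` block naming the operator to inject. The sketch below is only illustrative: the `class` and `kwargs` fields are assumptions based on the shipped optimize rules, so copy the exact `replace` section from the referenced multi-GPU YAML rather than from here.

```yaml
# Illustrative sketch: put the experts of layers 4~12 on the GPU as Marlin experts.
# class/kwargs names are assumptions; verify against the shipped optimize_rules files.
- match:
    name: "^model\\.layers\\.([4-9]|1[0-2])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
```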
> Note: The first matching rule in the YAML file is the one applied. For example, if two rules match the same layer, only the first rule's replacement takes effect.
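
As a small illustration of that first-match-wins behaviour (reusing the same assumed operator names as the sketch above, so the same caveat applies): with the two rules below, layer 4's experts are placed on the GPU by the first rule, and the broader second rule is ignored for that layer.

```yaml
# Rule order matters: for layer 4 only the first matching rule is applied.
- match:
    name: "^model\\.layers\\.4\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"   # layer 4 uses this rule
- match:
    name: "^model\\.layers\\.([0-9]+)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"
      generate_op: "KExpertsCPU"      # every other layer falls through to this rule
```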
> Note: Currently, executing experts on the GPU conflicts with CUDA Graph, and running without CUDA Graph causes a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6 GB, so moving layers 4~12, for example, needs roughly 9 × 5.6 GB ≈ 50 GB), we do not recommend enabling this feature. We are actively working on optimization.
> Note: KExpertsTorch is untested.
### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?