Merge pull request #166 from kvcache-ai/update-yaml

[Update] Update FAQ to address common questions
Authored by Azure on 2025-02-12 17:00:04 +08:00, committed via GitHub
commit 9e42f33c29
3 changed files with 55 additions and 7 deletions


@@ -9,7 +9,7 @@
</p>
<h3>A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations</h3>
<strong><a href="#show-cases">🌟 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 Discussion </a> </strong>
<strong><a href="#show-cases">🌟 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 Discussion </a>|<a href="#FAQ"> 🙋 FAQ</a> </strong>
</div>
<h2 id="intro">🎉 Introduction</h2>
@@ -23,7 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
<h2 id="Updates">🔥 Updates</h2>
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md)
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -41,7 +41,7 @@ https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
</p>
- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM ([Tutorial](./doc/en/DeepseekR1_V3_tutorial.md)).
- Prefill Speed (tokens/s):
  - KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
  - Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.
@@ -376,3 +376,7 @@ KTransformer is actively maintained and developed by contributors from the <a hr
<h2 id="ack">Discussion</h2>
If you have any questions, feel free to open an issue. Alternatively, you can join our WeChat group for further discussion. QR Code: [WeChat Group](WeChatGrouop.jpg)
<h2 id="FAQ">🙋 FAQ</h2>
Some common questions are answered in the [FAQ](doc/en/FAQ.md).


@@ -1,6 +1,6 @@
# FAQ
## Install
### 1 ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found
### Q: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found
On Ubuntu 22.04, you need to add the toolchain PPA and upgrade libstdc++:
```
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
@@ -8,7 +8,7 @@ sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
```
From https://github.com/kvcache-ai/ktransformers/issues/117#issuecomment-2647542979
### 2 DeepSeek-R1 not outputting initial <think> token
### Q: DeepSeek-R1 not outputting initial <think> token
> From the DeepSeek-R1 documentation:<br>
> Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think>\n\n\</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think>\n" at the beginning of every output.
@@ -18,7 +18,51 @@ and pass the arg `--force_think true ` can let the local_chat initiate the respo
From https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
### 3 version `GLIBCXX_3.4.30' not found
## Usage
### Q: If I have more VRAM than the model requires, how can I fully utilize it?
1. Use a larger context:
   1. local_chat.py: increase the context window by setting `--max_new_tokens` to a larger value.
   2. server: increase `--cache_lens` to a larger value.
2. Move more weights to the GPU.
   Refer to `ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml`:
```yaml
- match:
    name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0" # run in cuda:0; marlin only support GPU
      generate_op: "KExpertsMarlin" # use marlin expert
  recursive: False
```
You can modify the layer range as needed, e.g. change `name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$"` to move more weights to the GPU.
> Note: The first matching rule in the YAML file is applied; if two rules match the same layer, only the first rule's replacement takes effect.
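As a rough sketch (the invocation and GGUF path are placeholders; only the flags mentioned above come from this FAQ), a local_chat run that combines a larger context with a custom rule file might look like this:
```bash
# Hypothetical invocation: larger context plus a custom rule file that moves
# more expert layers onto the GPU. Replace both paths with your own.
python ktransformers/local_chat.py \
  --gguf_path /path/to/DeepSeek-V3-GGUF \
  --max_new_tokens 8192 \
  --optimize_rule_path ./my_rules/DeepSeek-V3-Chat-more-gpu-experts.yaml
```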
### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?
Use `--optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` to load the two-GPU optimized rule file. You can also use it as a template for writing your own rule file for 4 or 8 GPUs.
> Note: KTransformers' multi-GPU strategy is pipeline parallelism, which does not speed up inference by itself; it only distributes the model's weights across GPUs.
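For example (a hedged sketch; the GGUF path is a placeholder), a two-GPU run could look like this:
```bash
# Hypothetical two-GPU run: the rule file decides which layers live on which GPU,
# so both GPUs must be visible to the process.
CUDA_VISIBLE_DEVICES=0,1 python ktransformers/local_chat.py \
  --gguf_path /path/to/DeepSeek-V3-GGUF \
  --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml
```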
### Q: How to get the best performance?
Set `--cpu_infer` to the number of cores you want to use. More cores generally means faster inference, but more is not always better: set it slightly below your actual number of physical cores.
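For example, on a 32-core machine you might reserve a couple of cores for the system (a sketch; the GGUF path is a placeholder and other flags are omitted):
```bash
# Hypothetical example: 32 physical cores available, leave two free for the OS.
python ktransformers/local_chat.py --cpu_infer 30 --gguf_path /path/to/DeepSeek-V3-GGUF
```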
### Q: My DeepSeek-R1 model is not thinking.
According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output; pass the argument `--force_think true` to do so.
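For example (a sketch; the GGUF path is a placeholder):
```bash
# Hypothetical example: force DeepSeek-R1 to start every response with "<think>\n".
python ktransformers/local_chat.py --force_think true --gguf_path /path/to/DeepSeek-R1-GGUF
```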
### Q: Loading gguf error
Make sure that:
1. The GGUF file is in the directory passed to `--gguf_path`.
2. The directory contains GGUF files from only one model; if you have multiple models, separate them into different directories.
3. The folder name itself does not end with `.gguf`, e.g. `Deep-gguf` is correct, `Deep.gguf` is wrong.
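An illustrative layout (file and directory names are hypothetical) that satisfies all three points:
```bash
# One model per directory; the directory name does not end in ".gguf".
ls /models/DeepSeek-R1-gguf/
# DeepSeek-R1-Q4_K_M-00001-of-00009.gguf
# DeepSeek-R1-Q4_K_M-00002-of-00009.gguf
python ktransformers/local_chat.py --gguf_path /models/DeepSeek-R1-gguf/
```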
### Q: Version `GLIBCXX_3.4.30' not found
The detailed error:
>ImportError: /mnt/data/miniconda3/envs/xxx/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/xxx/xxx/ktransformers/./cpuinfer_ext.cpython-312-x86_64-linux-gnu.so)


@@ -1699,7 +1699,7 @@ class DeepseekV3ForCausalLM(DeepseekV3PreTrainedModel):
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = self.lm_head(hidden_states.to(self.lm_head.weight.device))
logits = logits.float()
loss = None