kvcache-ai-ktransformers/archive/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-8.yaml
Jiaqi Liao 57d14d22bc
Refactor: restructure repository to focus on kt-kernel and KT-SFT modules (#1581)
* refactor: move legacy code to archive/ directory

  - Moved ktransformers, csrc, third_party, merge_tensors to archive/
  - Moved build scripts and configurations to archive/
  - Kept kt-kernel, KT-SFT, doc, and README files in root
  - Preserved complete git history for all moved files

* refactor: restructure repository to focus on kt-kernel and KT-SFT modules

* fix README

* fix README

* fix README

* fix README

* docs: add performance benchmarks to kt-kernel section

Add comprehensive performance data for kt-kernel to match KT-SFT's presentation:
- AMX kernel optimization: 21.3 TFLOPS (3.9× faster than PyTorch)
- Prefill phase: up to 20× speedup vs baseline
- Decode phase: up to 4× speedup
- NUMA optimization: up to 63% throughput improvement
- Multi-GPU (8×L20): 227.85 tokens/s total throughput with DeepSeek-R1 FP8

Source: https://lmsys.org/blog/2025-10-22-KTransformers/

This provides users with concrete performance metrics for both core modules,
making it easier to understand the capabilities of each component.

* refactor: improve kt-kernel performance data with specific hardware and models

Replace generic performance descriptions with concrete benchmarks:
- Specify exact hardware: 8×L20 GPU + Xeon Gold 6454S, Single/Dual-socket Xeon + AMX
- Include specific models: DeepSeek-R1-0528 (FP8), DeepSeek-V3 (671B)
- Show detailed metrics: total throughput, output throughput, concurrency details
- Match KT-SFT presentation style for consistency

This provides users with actionable performance data they can use to evaluate
hardware requirements and expected performance for their use cases.

* fix README

* docs: clean up performance table and improve formatting

* add pic for README

* refactor: simplify .gitmodules and backup legacy submodules

- Remove 7 legacy submodules from root .gitmodules (archive/third_party/*)
- Keep only 2 active submodules for kt-kernel (llama.cpp, pybind11)
- Backup complete .gitmodules to archive/.gitmodules
- Add documentation in archive/README.md for researchers who need legacy submodules

This reduces initial clone size by ~500MB and avoids downloading unused dependencies.

* refactor: move doc/ back to root directory

Keep documentation in root for easier access and maintenance.

* refactor: consolidate all images to doc/assets/

- Move kt-kernel/assets/heterogeneous_computing.png to doc/assets/
- Remove KT-SFT/assets/ (images already in doc/assets/)
- Update KT-SFT/README.md image references to ../doc/assets/
- Eliminates ~7.9MB image duplication
- Centralizes all documentation assets in one location

* fix pic path for README
2025-11-10 17:42:26 +08:00


- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
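# Reading aid (derived from the regexes and the transfer_map in this file; it is
# a comment only and adds no rule). Every rule group below uses the same
# layer-to-device split:
#   cuda:0  layers  0-7    ([0-7])
#   cuda:1  layers  8-15   (8|9|1[0-5])
#   cuda:2  layers 16-23   (1[6-9]|2[0-3])
#   cuda:3  layers 24-31   (2[4-9]|3[0-1])
#   cuda:4  layers 32-39   (3[2-9])
#   cuda:5  layers 40-47   (4[0-7])
#   cuda:6  layers 48-55   (4[8-9]|5[0-5])
#   cuda:7  layers 56-60   (5[6-9]|60)
# GPUs 0-6 each hold 8 layers; GPU 7 holds the remaining 5 layers (56-60)
# together with lm_head and model.norm. model.embed_tokens stays on the CPU.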
# === Rotary Embedding Replacement ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.([3][2-9])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:4"
      prefill_device: "cuda:4"
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:5"
      prefill_device: "cuda:5"
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:6"
      prefill_device: "cuda:6"
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
# === Linear Layers Replacement (excluding self_attn.kv_b_proj) ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.(3[2-9])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:4"
      prefill_device: "cuda:4"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:5"
      prefill_device: "cuda:5"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:6"
      prefill_device: "cuda:6"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
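# Reading aid for the negative lookahead (?!self_attn\.kv_b_proj) in the Linear
# rules above. The module names are illustrative and assume the usual
# DeepSeek-V3 projection names; the outcomes follow from the GPU 0 regex alone:
#   model.layers.3.self_attn.q_a_proj   -> matches (torch.nn.Linear is replaced)
#   model.layers.3.self_attn.o_proj     -> matches
#   model.layers.3.self_attn.kv_b_proj  -> no match (rejected by the lookahead)
#   model.layers.13.self_attn.o_proj    -> no match for the GPU 0 rule, because
#                                          "1" is not followed by "."; the GPU 1
#                                          rule (8|9|1[0-5]) matches it instead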
# === MLP (MoE) Replacement ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.(3[2-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:4"
      prefill_device: "cuda:4"
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:5"
      prefill_device: "cuda:5"
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:6"
      prefill_device: "cuda:6"
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
# === MLP Gate Replacement ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.(3[2-9])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:4"
      prefill_device: "cuda:4"
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:5"
      prefill_device: "cuda:5"
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:6"
      prefill_device: "cuda:6"
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
# === MLP Experts Replacement (optional Marlin experts, disabled by default) ===
# Replace experts with Marlin experts. Uncomment and adjust the layer ranges as needed.
# Each layer of Marlin experts takes about 6 GB of GPU memory.
# !!! Remember to disable CUDA graph if you are using Marlin experts. !!!
# !!! Loading Marlin experts takes significant time. !!!
# GPU 0: layers 0-7
# - match:
#     name: "^model\\.layers\\.([0-7])\\.mlp\\.experts$"  # inject experts in layers 0-7 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:0"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 1: layers 8-15
# - match:
#     name: "^model\\.layers\\.(8|9|1[0-5])\\.mlp\\.experts$"  # inject experts in layers 8-15 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:1"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 2: layers 16-23
# - match:
#     name: "^model\\.layers\\.(1[6-9]|2[0-3])\\.mlp\\.experts$"  # inject experts in layers 16-23 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:2"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 3: layers 24-31
# - match:
#     name: "^model\\.layers\\.(2[4-9]|3[0-1])\\.mlp\\.experts$"  # inject experts in layers 24-31 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:3"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 4: layers 32-39
# - match:
#     name: "^model\\.layers\\.(3[2-9])\\.mlp\\.experts$"  # inject experts in layers 32-39 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:4"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 5: layers 40-47
# - match:
#     name: "^model\\.layers\\.(4[0-7])\\.mlp\\.experts$"  # inject experts in layers 40-47 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:5"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 6: layers 48-55
# - match:
#     name: "^model\\.layers\\.(4[8-9]|5[0-5])\\.mlp\\.experts$"  # inject experts in layers 48-55 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:6"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# # GPU 7: layers 56-60
# - match:
#     name: "^model\\.layers\\.(5[6-9]|60)\\.mlp\\.experts$"  # inject experts in layers 56-60 as Marlin experts
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:7"
#       generate_op: "KExpertsMarlin"
#   recursive: False
# === MLP Experts Replacement (prefill on GPU, generate on CPU) ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:2"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:2"
  recursive: False
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:3"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:3"
  recursive: False
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.(3[2-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:4"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:4"
  recursive: False
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:5"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:5"
  recursive: False
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:6"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:6"
  recursive: False
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:7"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:7"
  recursive: False
# === Self-Attention Replacement ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.(3[2-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:4"
      prefill_device: "cuda:4"
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:5"
      prefill_device: "cuda:5"
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:6"
      prefill_device: "cuda:6"
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
# === Overall Model Replacement with Transfer Map ===
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0  # 0 disables layer-wise prefill
      transfer_map:
        8: "cuda:1"
        16: "cuda:2"
        24: "cuda:3"
        32: "cuda:4"
        40: "cuda:5"
        48: "cuda:6"
        56: "cuda:7"
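# Sketch (hypothetical, not part of this config): each transfer_map key above
# equals the first layer index of the matching per-GPU range (8, 16, ..., 56),
# so the key appears to mark the layer at which activations move to the listed
# device. Under that assumption, a 4-GPU split into layers 0-15, 16-31, 32-47
# and 48-60 would need the per-layer regexes rewritten to those ranges and a
# map such as:
#       transfer_map:
#         16: "cuda:1"
#         32: "cuda:2"
#         48: "cuda:3"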
# === Default Catch-All for Other Modules ===
# GPU 0: layers 0-7
- match:
    name: "^model\\.layers\\.([0-7])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
# GPU 1: layers 8-15
- match:
    name: "^model\\.layers\\.(8|9|1[0-5])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
# GPU 2: layers 16-23
- match:
    name: "^model\\.layers\\.(1[6-9]|2[0-3])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:2"
      prefill_device: "cuda:2"
# GPU 3: layers 24-31
- match:
    name: "^model\\.layers\\.(2[4-9]|3[0-1])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:3"
      prefill_device: "cuda:3"
# GPU 4: layers 32-39
- match:
    name: "^model\\.layers\\.(3[2-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:4"
      prefill_device: "cuda:4"
# GPU 5: layers 40-47
- match:
    name: "^model\\.layers\\.(4[0-7])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:5"
      prefill_device: "cuda:5"
# GPU 6: layers 48-55
- match:
    name: "^model\\.layers\\.(4[8-9]|5[0-5])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:6"
      prefill_device: "cuda:6"
# GPU 7: layers 56-60
- match:
    name: "^model\\.layers\\.(5[6-9]|60)\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
# === Output Head and Final Norm ===
- match:
    name: "^lm_head"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
# Final modules (model.norm): keep them on GPU 7 alongside the last layers.
# Note: this regex also names layers 45-60, which the per-GPU catch-all rules
# above already cover.
- match:
    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:7"
      prefill_device: "cuda:7"