[docs]: update web doc (#1625)

ZiWei Yuan 2025-11-16 14:40:22 +08:00 committed by GitHub
parent be6db6f46b
commit ab8ad0a110
3 changed files with 27 additions and 44 deletions


@@ -1,23 +1,13 @@
- [KTransformers Fine-Tuning × LLaMA-Factory Integration User Guide](#ktransformers-fine-tuning-x-llama-factory-integration--user-guide)
- [Introduction](#introduction)
- [Fine-Tuning Results (Examples)](#fine-tuning-results-examples)
- [Stylized Dialogue (CatGirl tone)](#stylized-dialogue-catgirl-tone)
- [Benchmarks](#benchmarks)
- [Translational-Style dataset](#translational-style-dataset)
- [AfriMed-QA (short answer)](#afrimed-qa-short-answer)
- [AfriMed-QA (multiple choice)](#afrimed-qa-multiple-choice)
- [Fine-Tuning Results (Examples)](#fine-tuning-results-examples)
- [Quick to Start](#quick-to-start)
- [Environment Setup](#environment-setup)
- [Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models](#core-feature-1-use-ktransformers-backend-to-fine-tune-ultra-large-moe-models)
- [Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)](#core-feature-2-chat-with-the-fine-tuned-model-base--lora-adapter)
- [Core Feature 3: Batch inference + metrics (base + LoRA adapter)](#core-feature-3-batch-inference--metrics-base--lora-adapter)
- [KT Fine-Tuning Speed (User-Side View)](#kt-fine-tuning-speed-user-side-view)
- [End-to-End Performance](#end-to-end-performance)
- [GPU/CPU Memory Footprint](#gpucpu-memory-footprint)
- [Conclusion](#conclusion)
@@ -33,7 +23,7 @@ Our goal is to give resource-constrained researchers a **local path to explore f
As shown below, LLaMA-Factory is the unified orchestration/configuration layer for the whole fine-tuning workflow—handling data, training scheduling, LoRA injection, and inference interfaces. **KTransformers** acts as a pluggable high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**.
-![image-20251011010558909](../assets/image-20251011010558909.png)
+![image-20251011010558909](../../assets/image-20251011010558909.png)
Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unsloth**, and **KTransformers** backends. KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher throughput and lower GPU memory on smaller MoE models (e.g., DeepSeek-14B).
@@ -46,7 +36,7 @@ Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unslo
**1400 GB** is a **theoretical** FP16 full-parameter resident footprint (not runnable). **70 GB** is the **measured peak** with KT strategy (Attention on GPU + layered MoE offload).
-![按照模型划分的对比图_02](../assets/image-compare_model.png)
+![按照模型划分的对比图_02](../../assets/image-compare_model.png)
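As a back-of-envelope check (our arithmetic, not from the doc), the ~1400 GB theoretical figure is consistent with keeping all 671B parameters resident in FP16, plus runtime overhead:

```python
# Weights-only FP16 footprint for a 671B-parameter model.
# The quoted 1400 GB theoretical figure additionally includes
# runtime overhead beyond the raw weights.
params = 671e9          # parameter count
bytes_per_param = 2     # FP16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")  # -> 1342 GB for weights alone
```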
### Fine-Tuning Results (Examples)
@@ -56,7 +46,7 @@ Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). Goal: i
The figure compares responses from the base vs. fine-tuned models. The fine-tuned model maintains the target tone and address terms more consistently (red boxes), validating the effectiveness of **style-transfer fine-tuning**.
-![image-20251016175046882](../assets/image-20251016175046882.png)
+![image-20251016175046882](../../assets/image-20251016175046882.png)
#### Benchmarks
@@ -219,7 +209,7 @@ We recommend **AMX acceleration** where available (`lscpu | grep amx`). AMX supp
Outputs go to `output_dir` in safetensors format plus adapter metadata for later loading.
-![image-20251016171537997](../assets/image-20251016171537997.png)
+![image-20251016171537997](../../assets/image-20251016171537997.png)
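The `lscpu | grep amx` check can also be done programmatically; a minimal sketch (the helper name is ours) that looks for AMX feature flags in `/proc/cpuinfo`-style text:

```python
def has_amx(cpuinfo_text: str) -> bool:
    """Return True if any AMX flag (amx_tile/amx_int8/amx_bf16)
    appears in a flags line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if any(f.startswith("amx") for f in flags):
                return True
    return False

# On Linux: has_amx(open("/proc/cpuinfo").read())
```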
### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)
@@ -244,7 +234,7 @@ We also support **GGUF** adapters: for safetensors, set the **directory**; for G
During loading, LLaMA-Factory maps layer names to KT's naming. You'll see logs like `Loaded adapter weight: XXX -> XXX`:
-![image-20251016171526210](../assets/image-20251016171526210.png)
+![image-20251016171526210](../../assets/image-20251016171526210.png)
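A purely illustrative sketch of what that remapping log amounts to; the actual name mapping is internal to LLaMA-Factory/KTransformers, and the prefix rule below is a placeholder:

```python
# Hypothetical example: strip the PEFT wrapper prefix when mapping a
# HuggingFace-style adapter name onto a backend module name. The real
# mapping table lives inside LLaMA-Factory/KTransformers.
def remap(hf_name: str) -> str:
    return hf_name.replace("base_model.model.", "", 1)

src = "base_model.model.layers.0.self_attn.q_proj.lora_A.weight"
print(f"Loaded adapter weight: {src} -> {remap(src)}")
```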
### Core Feature 3: Batch inference + metrics (base + LoRA adapter)