[docs]: update web doc (#1625)
This commit is contained in:
parent be6db6f46b
commit ab8ad0a110
3 changed files with 27 additions and 44 deletions
@@ -1,23 +1,13 @@
- [KTransformers Fine-Tuning × LLaMA-Factory Integration – User Guide](#ktransformers-fine-tuning-x-llama-factory-integration-–-user-guide)
  - [Introduction](#introduction)
    - [Fine-Tuning Results (Examples)](#fine-tuning-results-examples)
      - [Stylized Dialogue (CatGirl tone)](#stylized-dialogue-catgirl-tone)
      - [Benchmarks](#benchmarks)
        - [Translational-Style dataset](#translational-style-dataset)
        - [AfriMed-QA (short answer)](#afrimed-qa-short-answer)
        - [AfriMed-QA (multiple choice)](#afrimed-qa-multiple-choice)
  - [Quick to Start](#quick-to-start)
    - [Environment Setup](#environment-setup)
    - [Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models](#core-feature-1-use-ktransformers-backend-to-fine-tune-ultra-large-moe-models)
    - [Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)](#core-feature-2-chat-with-the-fine-tuned-model-base--lora-adapter)
    - [Core Feature 3: Batch inference + metrics (base + LoRA adapter)](#core-feature-3-batch-inference--metrics-base--lora-adapter)
  - [KT Fine-Tuning Speed (User-Side View)](#kt-fine-tuning-speed-user-side-view)
    - [End-to-End Performance](#end-to-end-performance)
    - [GPU/CPU Memory Footprint](#gpucpu-memory-footprint)
  - [Conclusion](#conclusion)
@@ -33,7 +23,7 @@ Our goal is to give resource-constrained researchers a **local path to explore f
As shown below, LLaMA-Factory is the unified orchestration/configuration layer for the whole fine-tuning workflow—handling data, training scheduling, LoRA injection, and inference interfaces. **KTransformers** acts as a pluggable high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**.

Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unsloth**, and **KTransformers** backends. KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher throughput and lower GPU memory on smaller MoE models (e.g., DeepSeek-14B).
@@ -46,7 +36,7 @@ Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unslo
† **1400 GB** is a **theoretical** FP16 full-parameter resident footprint (not runnable). **70 GB** is the **measured peak** with KT strategy (Attention on GPU + layered MoE offload).
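As a quick sanity check on the theoretical figure (assuming the 671B-parameter DeepSeek model discussed above), FP16 weights alone come to

$$
671 \times 10^{9}\ \text{params} \times 2\ \text{bytes/param} \approx 1.34 \times 10^{12}\ \text{bytes} \approx 1342\ \text{GB},
$$

which matches the quoted ~1400 GB in order of magnitude before any activations, gradients, or runtime buffers are counted.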

### Fine-Tuning Results (Examples)
@@ -56,7 +46,7 @@ Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). Goal: i
The figure compares responses from the base vs. fine-tuned models. The fine-tuned model maintains the target tone and address terms more consistently (red boxes), validating the effectiveness of **style-transfer fine-tuning**.

#### Benchmarks
@@ -219,7 +209,7 @@ We recommend **AMX acceleration** where available (`lscpu | grep amx`). AMX supp
Outputs go to `output_dir` in safetensors format plus adapter metadata for later loading.
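For orientation, a finished run's `output_dir` typically ends up looking like the sketch below. The directory name follows the hypothetical config sketch earlier, and the exact file names depend on the LLaMA-Factory/PEFT versions in use.

```bash
ls saves/deepseek-v3-lora/
# adapter_model.safetensors   <- LoRA weights in safetensors format
# adapter_config.json         <- adapter metadata read back when the adapter is loaded
# trainer_log.jsonl           <- per-step training log
# training_args.bin, README.md, ...
```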

### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)
@@ -244,7 +234,7 @@ We also support **GGUF** adapters: for safetensors, set the **directory**; for G
During loading, LLaMA-Factory maps layer names to KT’s naming. You’ll see logs like `Loaded adapter weight: XXX -> XXX`:

### Core Feature 3: Batch inference + metrics (base + LoRA adapter)