
# Benchmark

To conduct a quick and convenient check, we employed a simple Python script, available here, to assess the precision of our ktransformers project. For this evaluation, we used the same dataset, shuffled in a consistent manner and limited to the first 1,000 data points, to test our implementation across a variety of CPU kernels, MLA kernels, and quantization formats.

We selected the DeepSeek-V3 model in its bf16, int8, and q4km versions for this test. We used the MMLU dataset, which can be found here (we selected all subsets and shuffled them with a fixed random seed).

!!! Note that we skipped the few-shot part and only used the first 1,000 data points for a quick check, so the results may not be consistent with the DeepSeek-V3 technical report. Tests of R1 and further tests are ongoing.

To verify our results, we chose the Siliconflow cloud service platform as the baseline. All tests were conducted with the same script and datasets, allowing us to make a preliminary assessment of our project's precision.
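Assuming both sides expose OpenAI-compatible chat APIs (ktransformers ships a local server, and Siliconflow serves a hosted endpoint), the comparison boils down to sending the same request to two endpoints. The sketch below illustrates this; the base URLs, port, model names, and the `ask` helper are illustrative assumptions, not taken from the actual script.

```python
# Hedged sketch: send the same single-choice prompt to a local ktransformers
# server and to the cloud baseline. Endpoints and model names are
# illustrative assumptions; the real benchmark script may differ.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")
cloud = OpenAI(base_url="https://api.siliconflow.cn/v1", api_key="YOUR_KEY")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    """Return the model's raw reply for one zero-shot question."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,  # same sampling setting on both sides, see below
    )
    return resp.choices[0].message.content
```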

We set the argument temperature=0.6, and to simplify the test process, we skipped the few-shot part and used the following prompt: `There is a single choice question. Answer the question by replying A, B, C, D. No other answers are accepted. Just the letter. \nQuestion: {question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer: `. For more details, please refer to the script.
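For concreteness, here is a minimal sketch of how such a prompt can be filled in and scored. The function names and the sample field layout (`question`, `choices`, `answer`) are hypothetical and follow the Hugging Face MMLU schema; the actual script may differ.

```python
# Minimal sketch of the zero-shot single-choice evaluation described above.
# Names (build_prompt, is_correct) are hypothetical, not from the real script.
PROMPT_TEMPLATE = (
    "There is a single choice question. Answer the question by replying "
    "A, B, C, D. No other answers are accepted. Just the letter.\n"
    "Question: {question}\nA. {option_a}\nB. {option_b}\n"
    "C. {option_c}\nD. {option_d}\nAnswer: "
)

def build_prompt(sample: dict) -> str:
    """Fill the zero-shot template with one MMLU-style sample."""
    return PROMPT_TEMPLATE.format(
        question=sample["question"],
        option_a=sample["choices"][0],
        option_b=sample["choices"][1],
        option_c=sample["choices"][2],
        option_d=sample["choices"][3],
    )

def is_correct(reply: str, answer_index: int) -> bool:
    """Compare the first letter of the reply with the gold label (0-3 -> A-D)."""
    reply = reply.strip().upper()
    return bool(reply) and reply[0] == "ABCD"[answer_index]
```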

Given that we only tested 1,000 cases, which provides only a preliminary judgment, some fluctuation in the results is reasonable. We selected all subsets and shuffled them with a fixed random seed to ensure consistency.
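The preparation step can be reproduced along these lines. This is a hedged sketch using the Hugging Face `datasets` library; the dataset path `cais/mmlu` and the seed value are assumptions, not the script's actual choices.

```python
# Hedged sketch of the dataset preparation: combine all MMLU subsets,
# shuffle with a fixed seed, and keep the first 1,000 samples.
# The path "cais/mmlu" and seed 42 are illustrative assumptions.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
subset = mmlu.shuffle(seed=42).select(range(1000))  # fixed seed, first 1k
print(len(subset), subset[0]["question"])
```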

## Some Details

- The bf16 model of DeepSeek-V3 is available here (you may convert it to GGUF with llama.cpp, e.g. via its `convert_hf_to_gguf.py` script). The q4km model can be found here.

- The optimization YAML file is located here. For the matrix multiplication (GEMM) kernel, you can change `KLinearMarlin` to `KLinearTorch`; see the sketch after this list.

- To switch the MLA kernel from Triton to Torch, check and modify this file, specifically by using the `forward_windows` method.

- When attempting to run the bf16 test (both CPU weight and GPU weight), you may encounter issues caused by older versions of `g++` and `as`, particularly on Ubuntu 20.04 or earlier. To make it easier to reproduce our results, we provide a development container with a pre-configured environment. Note, however, that the container does not have the ktransformers package installed, so you may still need to install certain packages manually.

  - You may configure the model mount directory in `devcontainer/devcontainer.json`; check the `"mounts"` entry (a bind mount such as `source=/path/to/models,target=/models,type=bind` is a typical, illustrative example).
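As referenced above, the GEMM kernel swap happens in the optimization YAML. The fragment below is a hedged sketch of what the relevant injection rule may look like, based on the rule format used in ktransformers' injection documentation; the actual file and match patterns may differ.

```yaml
# Illustrative injection rule: replace torch.nn.Linear modules with
# ktransformers' linear operator. Changing generate_op from
# "KLinearMarlin" to "KLinearTorch" selects the torch GEMM kernel.
- match:
    name: "^model\\.layers\\..*$"   # which modules this rule targets
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_op: "KLinearMarlin"  # change to "KLinearTorch" here
      prefill_op: "KLinearTorch"
```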

## The Result Table

| DataSet | CPU Weight Format | CPU Kernel | GPU Weight Format | GEMM Kernel | MLA Kernel | Siliconflow (baseline) | Ktrans Point |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (shuffle 1k) | bf16 | cpuinfer | bf16 | torch | torch | 81.6 | 81.9 |
|  | int8 | cpuinfer | bf16 | torch | torch | 81.6 | 83.1 |
|  | q4km | cpuinfer | bf16 | torch | torch | 81.6 | 82.8 |
|  | q4km | cpuinfer | bf16 | torch | triton | 81.6 | 81.4 |
|  | q4km | cpuinfer | q4km->marlin 8 | marlin | triton | 81.6 | 81.1 |
|  | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 81.6 | 81.0 |
|  | q4km | cpuinfer | fp8 | marlin | triton | 81.6 | 81.5 |
| MMLU-pro | q4km | cpuinfer | fp8 | fp8gemm | triton | 57.7 | 57.6 |
| MMLU-pro | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 57.7 | 57.5 |
| HumanEval | tbd | tbd | tbd | tbd | tbd | tbd | tbd |
| GSM8K | tbd | tbd | tbd | tbd | tbd | tbd | tbd |