Commit graph

1260 commits

Author SHA1 Message Date
mrhaoxx
b12b63d831 [perf]: Panel5 rank-outer optimization for LoRA fused_add kernel
Replaces O_BLOCK=16 rank-inner loop with Panel-5 rank-outer in
lora_fp32_bf16_fused_add_transposed. Each broadcast now drives
5 output vectors (80 elements) instead of 1, shifting bottleneck
from port 5 (broadcast) to FMA ports (0/1).

Benchmark (sap4, E=8, H=7168, I=2048, R=8):
- Forward: +30-85% throughput across all qlen
- Backward: +20-53% throughput
- vs torch autograd: 5.3-10.3x (up from 4.2-6.3x)
- Correctness: all PASS (cos > 0.999)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:39:44 +00:00
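The loop reordering described in b12b63d831 can be illustrated with a scalar model. This is only a sketch under stated assumptions, not the kernel itself: the function names, the `b_t[r][o]` layout, and the 16-lane vector width are hypothetical, and the real kernel presumably uses SIMD intrinsics rather than Python loops. Each function returns the number of scalar broadcasts it issued, so the two orderings can be compared.

```python
VEC = 16    # fp32 lanes per vector (assumption: 512-bit registers)
PANEL = 5   # output vectors driven by one broadcast, per the commit message


def fused_add_rank_inner(out, t, b_t):
    """out[o] += sum_r t[r] * b_t[r][o], one broadcast per (vector, rank)."""
    broadcasts = 0
    for o0 in range(0, len(out), VEC):          # one 16-element output block
        for r in range(len(t)):                 # rank loop is innermost
            s = t[r]                            # models broadcasting t[r]
            broadcasts += 1
            for o in range(o0, min(o0 + VEC, len(out))):
                out[o] += s * b_t[r][o]
    return broadcasts


def fused_add_panel(out, t, b_t):
    """Same result, but each broadcast feeds PANEL vectors (80 elements)."""
    broadcasts = 0
    step = VEC * PANEL
    for o0 in range(0, len(out), step):         # one 5-vector panel
        end = min(o0 + step, len(out))
        for r in range(len(t)):                 # rank loop wraps the panel
            s = t[r]                            # broadcast once per panel
            broadcasts += 1
            for o in range(o0, end):            # stands in for 5 vector FMAs
                out[o] += s * b_t[r][o]
    return broadcasts
```

Both functions accumulate each output element in the same rank order, so they produce identical results. In this model only the broadcast count changes: it drops by a factor of `PANEL` when the output length is a multiple of 80.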
mrhaoxx
c85bce7288 [perf]: merge backward dispatch phases, reduce barriers
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:39:44 +00:00
mrhaoxx
484711e873 revert tp opt which have correctness issues 2026-03-24 15:39:01 +08:00
mrhaoxx
0f85b2a744 fix seqlen buffer size int32 overflow 2026-03-18 06:05:09 +00:00
mrhaoxx
9881851c23 opt perf 2026-03-10 16:39:41 +00:00
mrhaoxx
e8a1d37e3b opt perf 2026-03-09 18:24:06 +00:00
mrhaoxx
442674b155 [fix]: fused expert axis mismatch for Qwen3.5 Int8 conversion
gate_up_proj is [E, 2I, H] not [E, H, 2I] -- remove incorrect transpose
and split on dim 1 using config dimensions. down_proj [E, H, I] needs
no transpose either.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 18:24:06 +00:00
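The axis fix in 442674b155 can be sketched with nested lists standing in for tensors. This is a minimal illustration, not the conversion script: `split_gate_up` is a hypothetical helper, and the assumption that the gate rows come first and the up rows second along dim 1 is inferred from the name `gate_up_proj`, not stated in the commit.

```python
def split_gate_up(gate_up, inter):
    """Split a fused [E, 2I, H] tensor on dim 1 into two [E, I, H] halves.

    gate_up is a nested list indexed as gate_up[expert][row][hidden].
    Assumption: rows [0, I) are the gate projection, rows [I, 2I) the up
    projection. No transpose is applied.
    """
    for expert in gate_up:
        if len(expert) != 2 * inter:
            raise ValueError("dim 1 must equal 2 * I; check the axis order")
    gate = [expert[:inter] for expert in gate_up]
    up = [expert[inter:] for expert in gate_up]
    return gate, up
```

Checking dim 1 against `2 * I` from the model config catches the mismatch the commit describes: an `[E, H, 2I]` tensor fails the check unless H happens to equal 2I.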
JimmyPeilinLi
40e8f21541 bench: align moe torch/amx benchmarks and add consolidated report 2026-03-02 10:07:33 +00:00
mrhaoxx
b1a9691201 online repack 2026-02-24 11:01:53 +00:00
mrhaoxx
950f052f96 ckpt 2026-02-24 03:03:29 +00:00
mrhaoxx
8aea4b3e2a ckpt 2026-02-22 09:18:26 +00:00
mrhaoxx
cf09b3e1ca [fix]: skiplora 2026-02-17 03:06:19 +00:00
mrhaoxx
a1a9eca311 [fix]: perf 2026-02-05 16:57:09 +00:00
mrhaoxx
b0740436b5 [fix]: perf 2026-02-05 14:52:59 +00:00
mrhaoxx
8825cbf4b6 [fix]: perf 2026-02-04 15:26:04 +00:00
mrhaoxx
3b2de00593 [fix]: wip 2026-02-04 05:40:47 +00:00
mrhaoxx
fac81ed147 [fix]: fix tp 2026-02-03 14:46:40 +00:00
mrhaoxx
391cb6f79d [fix]: fix memory footprint 2026-02-02 08:34:01 +00:00
mrhaoxx
06fb3b5dbf [fix]: prequant weight load 2026-02-01 15:17:47 +00:00
mrhaoxx
9efe1317b1 [fix]: fix forward cache (maybe) 2026-01-31 21:19:46 +00:00
mrhaoxx
e1e64f7948 [fix]: fix lora grad compute 2026-01-31 15:41:55 +00:00
mrhaoxx
6a1e7c48cb [fix]: direct accumulation 2026-01-27 04:27:59 +00:00
mrhaoxx
7b62d826e4 [fix]: pinned memory causes numa issue 2026-01-26 12:04:46 +00:00
mrhaoxx
7b432f4b5a [fix]: avoid unnecessary memcpy 2026-01-26 05:34:19 +00:00
mrhaoxx
773ac20847 [fix]: fix missing bufferB init 2026-01-25 18:58:49 +00:00
mrhaoxx
192a9584f1 [chore]: save 2026-01-25 18:28:40 +00:00
mrhaoxx
63863b6322 [feat]: disable timer 2026-01-25 17:43:40 +00:00
mrhaoxx
b53a3dbb2b [feat]: use vectorized transpose 2026-01-25 17:31:35 +00:00
mrhaoxx
ae83d8237b [fix]: use buffer pool 2026-01-25 17:09:31 +00:00
mrhaoxx
32dfc5390c [feat]: merge some kernel 2026-01-25 16:45:48 +00:00
mrhaoxx
0669b910aa [feat]: optimize to use tr lora params 2026-01-25 16:17:42 +00:00
mrhaoxx
9abb104c9b [feat]: optimize kernels 2026-01-25 13:00:29 +00:00
mrhaoxx
f80fe1682f [feat]: introduce json profiler 2026-01-24 17:18:54 +00:00
mrhaoxx
57580016ea [feat]: support async sft forward 2026-01-24 14:45:08 +00:00
mrhaoxx
03a710bc68 [chore]: remove some metrics 2026-01-24 09:42:18 +00:00
mrhaoxx
15d91e0880 [fix]: fix scheduling 2026-01-23 19:04:30 +00:00
mrhaoxx
ae1252e874 [fix]: fix buffer A out of bounds read 2026-01-23 18:10:08 +00:00
mrhaoxx
451c91dce1 [fix]: nan 2026-01-23 15:30:39 +00:00
mrhaoxx
503d109fbc [fix]: fix buffer memory overuse 2026-01-23 07:08:59 +00:00
mrhaoxx
d90f035735 [fix]: fix memory overflow 2026-01-23 05:40:30 +00:00
mrhaoxx
d50d19fcf9 [fix]: remove debug message 2026-01-22 20:14:45 +00:00
mrhaoxx
8f44a64a7a [fix]: fix memory pool 2026-01-22 19:41:07 +00:00
mrhaoxx
5c89cec5e3 [fix]: optimize job sched 2026-01-22 07:04:52 +00:00
mrhaoxx
8ff417f46c [feat]: vectorized lora compute 2026-01-22 06:17:09 +00:00
mrhaoxx
4826281455 [chore]: remove debug 2026-01-20 11:50:54 +00:00
mrhaoxx
5f6482ff50 [feat]: support skip lora 2026-01-20 05:38:11 +00:00
mrhaoxx
dfcc370756 [fix]: fix bugs for activation, sft forward and backward 2026-01-19 17:35:11 +00:00
mrhaoxx
4e5b1e7399 [chore]: Merge commit 'ddb957596f' into ksft-sglang 2026-01-17 09:44:04 +00:00
JimmyPeilinLi
e60f199510 [feat](kt-sft-refactor): load from huggingface safetensor file 2026-01-16 03:36:13 +00:00
JimmyPeilinLi
18ab0cb943 [feat](kt-sft-refactor): add KT-SFT to KTMoEWrapper 2026-01-15 12:29:52 +00:00