mrhaoxx
b12b63d831
[perf]: Panel5 rank-outer optimization for LoRA fused_add kernel
Replaces the O_BLOCK=16 rank-inner loop with a Panel-5 rank-outer loop in
lora_fp32_bf16_fused_add_transposed. Each broadcast now drives
5 output vectors (80 elements) instead of 1, shifting the bottleneck
from port 5 (broadcast) to the FMA ports (0/1).
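The loop reordering described above can be illustrated with a small scalar model. This is only a sketch, not the actual AVX-512 kernel: the function names `rank_inner` and `panel_rank_outer`, the `[R][O]` layout of `w_t`, and the broadcast/FMA counters are assumptions made for illustration. Each list slice of 16 floats stands in for one 512-bit fp32 vector.

```python
VEC = 16    # fp32 lanes in one 512-bit vector (assumed vector width)
PANEL = 5   # output vectors that share a single broadcast


def rank_inner(x, w_t, out):
    """Old ordering: per 16-wide output block, loop over rank.

    Every FMA is paired with its own broadcast of x[r].
    Returns (broadcast_count, fma_count); `out` is updated in place.
    """
    broadcasts = fmas = 0
    for o in range(0, len(out), VEC):
        acc = out[o:o + VEC]
        for r in range(len(x)):
            b = x[r]                      # models one broadcast
            broadcasts += 1
            acc = [a + b * w for a, w in zip(acc, w_t[r][o:o + VEC])]
            fmas += 1                     # models one vector FMA
        out[o:o + VEC] = acc
    return broadcasts, fmas


def panel_rank_outer(x, w_t, out):
    """New ordering: per panel of 5 vectors, rank is the outer loop.

    One broadcast of x[r] feeds up to PANEL vector FMAs.
    Returns (broadcast_count, fma_count); `out` is updated in place.
    """
    broadcasts = fmas = 0
    step = VEC * PANEL
    for p in range(0, len(out), step):
        starts = list(range(p, min(p + step, len(out)), VEC))
        accs = [out[s:s + VEC] for s in starts]
        for r in range(len(x)):
            b = x[r]                      # one broadcast per rank step
            broadcasts += 1
            for i, s in enumerate(starts):
                accs[i] = [a + b * w for a, w in zip(accs[i], w_t[r][s:s + VEC])]
                fmas += 1
        for i, s in enumerate(starts):
            out[s:s + VEC] = accs[i]
    return broadcasts, fmas
```

In this model, with R=8 and 160 outputs, the rank-inner ordering issues 80 broadcasts for 80 FMAs, while the panel ordering issues 16 broadcasts for the same 80 FMAs. Both orderings accumulate over rank in the same order per output element, so the results match exactly.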
Benchmark (sap4, E=8, H=7168, I=2048, R=8):
- Forward: +30-85% throughput across all qlen
- Backward: +20-53% throughput
- vs torch autograd: 5.3-10.3x (up from 4.2-6.3x)
- Correctness: all PASS (cos > 0.999)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:39:44 +00:00
mrhaoxx
c85bce7288
[perf]: merge backward dispatch phases, reduce barriers
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:39:44 +00:00
mrhaoxx
484711e873
revert tp opt, which has correctness issues
2026-03-24 15:39:01 +08:00
mrhaoxx
0f85b2a744
fix seqlen buffer size int32 overflow
2026-03-18 06:05:09 +00:00
mrhaoxx
9881851c23
opt perf
2026-03-10 16:39:41 +00:00
mrhaoxx
e8a1d37e3b
opt perf
2026-03-09 18:24:06 +00:00
mrhaoxx
442674b155
[fix]: fused expert axis mismatch for Qwen3.5 Int8 conversion
gate_up_proj is [E, 2I, H], not [E, H, 2I] -- remove the incorrect transpose
and split on dim 1 using config dimensions. down_proj [E, H, I] needs
no transpose either.
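A minimal sketch of the split described above, using nested Python lists as a stand-in for tensors. The helper name `split_gate_up` is hypothetical, and placing gate in the first I rows and up in the last I rows is an assumption; the commit only states that the split happens on dim 1.

```python
def split_gate_up(gate_up, intermediate_size):
    """Split a fused [E, 2I, H] weight into gate and up, each [E, I, H].

    No transpose is applied: dim 1 (length 2I) is cut in half.
    Assumption: gate occupies rows [0, I) and up occupies rows [I, 2I).
    """
    gate, up = [], []
    for expert in gate_up:
        if len(expert) != 2 * intermediate_size:
            raise ValueError(
                f"expected dim 1 == {2 * intermediate_size}, got {len(expert)}"
            )
        gate.append(expert[:intermediate_size])
        up.append(expert[intermediate_size:])
    return gate, up
```

Checking dim 1 against the configured intermediate size catches the [E, H, 2I] misreading early whenever H differs from 2I.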
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 18:24:06 +00:00
JimmyPeilinLi
40e8f21541
bench: align moe torch/amx benchmarks and add consolidated report
2026-03-02 10:07:33 +00:00
mrhaoxx
b1a9691201
online repack
2026-02-24 11:01:53 +00:00
mrhaoxx
950f052f96
ckpt
2026-02-24 03:03:29 +00:00
mrhaoxx
8aea4b3e2a
ckpt
2026-02-22 09:18:26 +00:00
mrhaoxx
cf09b3e1ca
[fix]: skiplora
2026-02-17 03:06:19 +00:00
mrhaoxx
a1a9eca311
[fix]: perf
2026-02-05 16:57:09 +00:00
mrhaoxx
b0740436b5
[fix]: perf
2026-02-05 14:52:59 +00:00
mrhaoxx
8825cbf4b6
[fix]: perf
2026-02-04 15:26:04 +00:00
mrhaoxx
3b2de00593
[fix]: wip
2026-02-04 05:40:47 +00:00
mrhaoxx
fac81ed147
[fix]: fix tp
2026-02-03 14:46:40 +00:00
mrhaoxx
391cb6f79d
[fix]: fix memory footprint
2026-02-02 08:34:01 +00:00
mrhaoxx
06fb3b5dbf
[fix]: prequant weight load
2026-02-01 15:17:47 +00:00
mrhaoxx
9efe1317b1
[fix]: fix forward cache (maybe)
2026-01-31 21:19:46 +00:00
mrhaoxx
e1e64f7948
[fix]: fix lora grad compute
2026-01-31 15:41:55 +00:00
mrhaoxx
6a1e7c48cb
[fix]: direct accumulation
2026-01-27 04:27:59 +00:00
mrhaoxx
7b62d826e4
[fix]: pinned memory causes numa issue
2026-01-26 12:04:46 +00:00
mrhaoxx
7b432f4b5a
[fix]: avoid unnecessary memcpy
2026-01-26 05:34:19 +00:00
mrhaoxx
773ac20847
[fix]: fix missing bufferB init
2026-01-25 18:58:49 +00:00
mrhaoxx
192a9584f1
[chore]: save
2026-01-25 18:28:40 +00:00
mrhaoxx
63863b6322
[feat]: disable timer
2026-01-25 17:43:40 +00:00
mrhaoxx
b53a3dbb2b
[feat]: use vectorized transpose
2026-01-25 17:31:35 +00:00
mrhaoxx
ae83d8237b
[fix]: use buffer pool
2026-01-25 17:09:31 +00:00
mrhaoxx
32dfc5390c
[feat]: merge some kernel
2026-01-25 16:45:48 +00:00
mrhaoxx
0669b910aa
[feat]: optimize to use tr lora params
2026-01-25 16:17:42 +00:00
mrhaoxx
9abb104c9b
[feat]: optimize kernels
2026-01-25 13:00:29 +00:00
mrhaoxx
f80fe1682f
[feat]: introduce json profiler
2026-01-24 17:18:54 +00:00
mrhaoxx
57580016ea
[feat]: support async sft forward
2026-01-24 14:45:08 +00:00
mrhaoxx
03a710bc68
[chore]: remove some metrics
2026-01-24 09:42:18 +00:00
mrhaoxx
15d91e0880
[fix]: fix scheduling
2026-01-23 19:04:30 +00:00
mrhaoxx
ae1252e874
[fix]: fix buffer A out of bounds read
2026-01-23 18:10:08 +00:00
mrhaoxx
451c91dce1
[fix]: nan
2026-01-23 15:30:39 +00:00
mrhaoxx
503d109fbc
[fix]: fix buffer memory overuse
2026-01-23 07:08:59 +00:00
mrhaoxx
d90f035735
[fix]: fix memory overflow
2026-01-23 05:40:30 +00:00
mrhaoxx
d50d19fcf9
[fix]: remove debug message
2026-01-22 20:14:45 +00:00
mrhaoxx
8f44a64a7a
[fix]: fix memory pool
2026-01-22 19:41:07 +00:00
mrhaoxx
5c89cec5e3
[fix]: optimize job sched
2026-01-22 07:04:52 +00:00
mrhaoxx
8ff417f46c
[feat]: vectorized lora compute
2026-01-22 06:17:09 +00:00
mrhaoxx
4826281455
[chore]: remove debug
2026-01-20 11:50:54 +00:00
mrhaoxx
5f6482ff50
[feat]: support skip lora
2026-01-20 05:38:11 +00:00
mrhaoxx
dfcc370756
[fix]: fix bugs for activation, sft forward and backward
2026-01-19 17:35:11 +00:00
mrhaoxx
4e5b1e7399
[chore]: Merge commit 'ddb957596f' into ksft-sglang
2026-01-17 09:44:04 +00:00
JimmyPeilinLi
e60f199510
[feat](kt-sft-refactor): load from huggingface safetensor file
2026-01-16 03:36:13 +00:00
JimmyPeilinLi
18ab0cb943
[feat](kt-sft-refactor): add KT-SFT to KTMoEWrapper
2026-01-15 12:29:52 +00:00