mrhaoxx
9544a8960d
feat(sft): AMX MoE SFT backend with LoRA support (#1936)
* feat(sft): AMX MoE SFT backend with LoRA support
Complete SFT (Supervised Fine-Tuning) backend for MoE models using AMX SIMD:
Core C++ implementation:
- sft_moe.hpp: Forward/backward with LoRA fused operations (~5500 lines)
- moe-sft-tp.hpp: Tensor-parallel wrapper for multi-NUMA
- amx/moe-sft-tp.hpp: AMX-specific TP implementation
- avx_kernels.hpp: AVX512 SIMD kernels for LoRA GEMM
- amx_kernels.hpp: AMX tile kernels for Panel5 rank-outer optimization
- worker_pool: RDTSC profiling, Chrome trace output, SFT timer infrastructure
- ext_bindings.cpp: SFT MoE pybind bindings (BF16/INT8/INT4 + SkipLoRA variants)
Python sft/ submodule (kt_kernel.sft):
- base.py: BaseSFTMoEWrapper with buffer management (template method pattern)
- amx.py: AMXSFTMoEWrapper (weight loading, C++ task construction)
- autograd.py: KTMoEFunction (torch.autograd.Function for distributed training)
- layer.py: KTMoELayerWrapper (nn.Module replacing HF MoE layers)
- arch.py: MOEArchConfig (Qwen3/DeepSeek/Mixtral architecture detection)
- weights.py: Expert weight extraction and checkpoint loading
- lora.py: PEFT LoRA adaptation (view buffers, grad buffers, save/load adapter)
- wrapper.py: wrap_moe_layers_with_kt_wrapper, load_kt_model, build_kt_device_map
- config.py: KTConfig dataclass (DeepSpeed-style opaque config passthrough)
- dist_utils.py: Distributed gather/scatter, checkpoint-phase detection
Design decisions:
- Rank-0-only expert pattern: only rank 0 holds C++ wrapper and expert weights
- DeepSpeed-style integration: accelerate keeps only KTransformersPlugin (the
framework-interaction fields); all logic lives in kt_kernel.sft
- Inference isolation: importing kt_kernel does not load sft/ submodule
- Old field name compatibility: _get_kt_config() converts kt_xxx→xxx automatically
Verified: Qwen3-235B-A22B 4GPU AMXBF16 training, loss converges normally.
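A minimal usage sketch of the Python entry point described above. Only wrap_moe_layers_with_kt_wrapper, KTConfig, and KTMoELayerWrapper are names taken from this commit; the module paths, signatures, and arguments shown are assumptions, not the actual API:
```python
# Hypothetical sketch; argument names and module paths are assumed.
from transformers import AutoModelForCausalLM
from kt_kernel.sft.config import KTConfig
from kt_kernel.sft.wrapper import wrap_moe_layers_with_kt_wrapper

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-235B-A22B")
kt_config = KTConfig()  # DeepSpeed-style opaque config passthrough

# Replace each HF MoE layer with a KTMoELayerWrapper; under the
# rank-0-only pattern, only rank 0 holds the C++ wrapper and expert weights.
wrap_moe_layers_with_kt_wrapper(model, kt_config)
```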
* refactor(sft): unify KTConfig field names with kt_ prefix, add share_cache_pool, remove dead code
- KTConfig fields all use the kt_ prefix, matching dict keys — eliminates the
_OLD_TO_NEW mapping and prefix-stripping in wrapper.py (sketched after this commit)
- Add kt_share_cache_pool field, auto-enabled when gradient_checkpointing
is on (via training_args.py), flows through to C++ cache allocation
- Remove dead checkpoint detection code: in_ckpt_recompute,
in_ckpt_first_forward vars (assigned but never read), fallback
_is_in_checkpoint_first_forward() function, unused inspect import
- Remove redundant env var fallbacks in wrapper.py for share_backward_bb
and share_cache_pool (KTConfig.__post_init__ already handles env vars)
- Simplify layer.py checkpoint logic to single _checkpoint_hook_mode() check
Verified: Qwen3-235B 3-step training on sap4, loss matches baseline
(1.2886 / 1.9824 / 1.377 vs 1.2886 / 1.9766 / 1.3809)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
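A sketch of the unified dataclass: only the kt_ naming rule, kt_share_cache_pool, and kt_share_backward_bb come from the commit text; every other detail (defaults, the from_dict helper) is illustrative:
```python
from dataclasses import dataclass

@dataclass
class KTConfig:
    # Field names equal the config-dict keys verbatim (kt_ prefix), so
    # wrapper.py needs no _OLD_TO_NEW mapping or prefix stripping.
    kt_share_backward_bb: bool = False  # placeholder default
    kt_share_cache_pool: bool = False   # auto-enabled with grad checkpointing

    @classmethod
    def from_dict(cls, cfg: dict) -> "KTConfig":
        # Keys pass straight through because they already match field names.
        known = cls.__dataclass_fields__
        return cls(**{k: v for k, v in cfg.items() if k in known})
```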
* refactor(sft): share_backward_bb default True, share_cache_pool auto-derived
- kt_share_backward_bb defaults to True (always saves memory)
- kt_share_cache_pool no longer reads from an env var; it defaults to False
and is auto-set to True by trainer_config_process when gradient checkpointing
is enabled (see the sketch below)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
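A sketch of that derivation; trainer_config_process and training_args.py are named above, but this signature and body are an assumption:
```python
def trainer_config_process(kt_config, training_args) -> None:
    # kt_share_cache_pool no longer reads an env var: it stays False unless
    # gradient checkpointing is active, in which case it is switched on so
    # the C++ side allocates a shared cache pool.
    if getattr(training_args, "gradient_checkpointing", False):
        kt_config.kt_share_cache_pool = True
```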
* fix: add missing gpu_experts_mask=None to KTMoEWrapper call in SFT wrapper
KTMoEWrapper.__new__() requires gpu_experts_mask as a positional argument,
but the SFT wrapper omitted it, causing MoE layer wrapping to fail silently
and FSDP2 to attempt broadcasting all expert weights (OOM/NCCL crash).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
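A sketch of the corrected call; apart from gpu_experts_mask, the arguments of KTMoEWrapper are placeholders here, not its real signature:
```python
# gpu_experts_mask is a required positional in KTMoEWrapper.__new__;
# the SFT wrapper now passes None explicitly (no experts pinned on GPU).
wrapper = KTMoEWrapper(
    gpu_experts_mask=None,
    # ...remaining constructor arguments omitted (assumed)...
)
```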
* feat(sft): support transformers v5 fused expert format
Fused experts (e.g. Qwen3MoeExperts) store weights as 3D Parameters
(gate_up_proj [E,2I,H], down_proj [E,H,I]) instead of per-expert
nn.Linear modules. PEFT cannot attach LoRA to these, so we create
KT-managed LoRA buffers with kaiming init, nn.Parameter wrappers
for the optimizer, and pre-assigned .grad for C++ backward (sketched below).
- arch.py: detect_fused_experts() detection
- weights.py: fused format extraction and weight clearing
- wrapper.py: detect fused at wrap time, store _fused_experts/_lora_rank
- lora.py: _create_fused_expert_lora_buffers, save/load fused LoRA,
get_kt_lora_params collects fused params, deduplicate wrapper finding
- layer.py: handle v5 TopKRouter tuple output, remove dead code
- autograd.py: sync_forward_sft/submit_forward_sft API rename
Verified: v5 loss/expert-LoRA values match v4 baseline, v4 backward compat.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
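A minimal sketch of the buffer scheme from the paragraph above, assuming [E, r, H] and [E, I, r] layouts; the real _create_fused_expert_lora_buffers may differ:
```python
import math
import torch
import torch.nn as nn

def create_fused_lora_buffers(E, H, I, r, dtype=torch.bfloat16):
    # One contiguous buffer per projection so C++ can address all experts.
    lora_a = torch.empty(E, r, H, dtype=dtype)
    lora_b = torch.zeros(E, I, r, dtype=dtype)   # standard LoRA: B starts at 0
    nn.init.kaiming_uniform_(lora_a, a=math.sqrt(5))
    # nn.Parameter wrappers let the optimizer track the buffers; the
    # pre-assigned .grad gives the C++ backward a fixed place to write into.
    a, b = nn.Parameter(lora_a), nn.Parameter(lora_b)
    a.grad, b.grad = torch.zeros_like(a), torch.zeros_like(b)
    return a, b
```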
* feat(sft): add Qwen3.5 MoE support + fused checkpoint loading
- arch.py: add Qwen3_5Moe arch match, read config from text_config,
_get_layers_prefix returns model.language_model.layers for Qwen3.5,
_get_model_container_and_layers searches language_model attr
- weights.py: load_experts_from_checkpoint_files detects fused format
(gate_up_proj in weight_map) and splits into gate/up/down (see the sketch below)
- wrapper.py: hidden_size fallback to text_config
Verified: Qwen3.5-35B-A3B (256 experts, fused format) E2E pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
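The split itself in toy form; whether gate and up are stacked or interleaved along the 2I axis is an assumption here, and the real load_experts_from_checkpoint_files works on safetensor shards:
```python
import torch

E, I, H = 4, 8, 16                            # toy sizes
gate_up = torch.randn(E, 2 * I, H)            # fused gate_up_proj [E, 2I, H]
gate_proj, up_proj = gate_up.chunk(2, dim=1)  # two [E, I, H] tensors
down_proj = torch.randn(E, H, I)              # already unfused
```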
* [fix](sft): align Python API with C++ backend after v5 refactor
- wrapper.py: pass gpu_experts_mask=None to KTMoEWrapper (required by C++ signature)
- layer.py: rename submit_forward_sft/sync_forward_sft to submit_forward/sync_forward
- autograd.py: rename sync_forward_sft to sync_forward
The sft-v5 refactor (commits 58d7eab, dd1da65) renamed Python-side method
calls but the C++ backend (AMXSFTMoEWrapper) still exposes the original
method names. This caused AttributeError on Qwen3.5-35B and other models.
* align sft branch with main: revert worker_pool, strip sft_timer, fix inference defaults
- Revert worker_pool.cpp/.h to main (remove RDTSC timer, Chrome Trace,
sft_timer namespace, ITT API, extended do_work_stealing_job API)
- Strip all sft_timer instrumentation from sft-only files (sft_moe.hpp,
moe-sft-tp.hpp, avx_kernels.hpp)
- Restore pin_memory=True in KExpertsCPUBuffer (inference path)
- Restore fused tensor transpose logic in convert_cpu_weights.py (main layout)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert CMakeLists.txt to main: remove debug flags and cpptrace dep
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* clean up dev artifacts: remove SFT design docs, debug examples, bench scripts
Remove files not needed in the merge:
- docs/SFT+KTWrapper/ (6 Chinese design docs)
- docs/sft_moe_amx/ (21 dev/debug docs)
- 12 debug/test example scripts
- 6 SFT-specific bench scripts and report
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove dev version stamps from ext_bindings, sft_moe, moe-sft-tp
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: JimmyPeilinLi <lipeilin@mail.nwpu.edu.cn>
2026-04-22 11:27:01 +08:00
callmegaga
a9411f1d72
Supports vnni-256 for GPTQ INT4 (#1926)
* [feat](kt-kernel): support avx-vnni-256 for gptq int4
2026-04-13 17:59:59 +08:00
Oql
9e6484a538
[fix]: fix --numa-nodes handling (#1904)
2026-03-31 17:50:22 +08:00
ErvinXie
3903c9afcc
(kt-kernel): add numa_nodes parameter for explicit NUMA node mapping (#1891)
Add numa_nodes parameter to BaseMoEWrapper and all subclasses, allowing
users to explicitly specify which NUMA node IDs to use for subpool
mapping instead of always defaulting to sequential [0, 1, ..., N-1].
This enables running multiple KTransformers instances on different NUMA
nodes of the same machine, e.g. --kt-threadpool-count 1 --kt-numa-nodes 1
to bind to NUMA node 1. Previously this required external numactl
workarounds since subpool_numa_map was hardcoded to start from 0.
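A self-contained sketch of the mapping rule; build_subpool_numa_map is a hypothetical helper, not the library's API:
```python
from typing import List, Optional

def build_subpool_numa_map(threadpool_count: int,
                           numa_nodes: Optional[List[int]] = None) -> List[int]:
    if numa_nodes is None:
        # Previous hardcoded behaviour: sequential nodes starting from 0.
        return list(range(threadpool_count))
    assert len(numa_nodes) == threadpool_count, "one node id per subpool"
    return list(numa_nodes)

# --kt-threadpool-count 1 --kt-numa-nodes 1 binds the single subpool to node 1:
print(build_subpool_numa_map(1, numa_nodes=[1]))  # [1]
```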
2026-03-31 10:27:50 +08:00
mrhaoxx
7a9daf0cd4
[feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892)
* feat: support avx2 bf16 fp8 inference
* feat: support avx2 gptq int4 inference
* fix: numeric issues in fp8 dequant
* Tutorial avx2 (#1900)
* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines
* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs
* Tutorial avx2 (#1901)
* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines
* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs
* docs: update README.md
---------
Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>
2026-03-27 14:45:02 +08:00
Chen Hongtao
9e69fccb02
[feat]: add mistral moe loader compatibility (#1873)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 17:50:23 +08:00
VYSE V.E.O
20262b2743
Fix Qwen3.5 FP8 load for VL detection (#1857)
* Fix Qwen3.5 FP8 load for VL detection
1. For VL models (Qwen3.5), change base_key from model.layers.{N} to model.language_model.layers.{N} (sketched below).
2. Remove the duplicated class BF16SafeTensorLoader(SafeTensorLoader), keeping only the first override.
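A sketch of the remap from fix 1; base_key here is a hypothetical helper that paraphrases the commit message rather than the loader's actual code:
```python
def base_key(layer_idx: int, is_vl: bool) -> str:
    if is_vl:  # Qwen3.5 VL nests the text stack under language_model
        return f"model.language_model.layers.{layer_idx}"
    return f"model.layers.{layer_idx}"

print(base_key(0, is_vl=True))  # model.language_model.layers.0
```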
* Indent type
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-26 15:47:22 +08:00
Jianwei Dong
16a8b98f3e
support qwen3.5 (#1846)
2026-02-16 15:48:14 +08:00
Jiaqi Liao
db82d99fa6
feat: add fallback expert prefix lookup in loader.py from kimi_k2.5 (#1822)
2026-01-30 14:09:38 +08:00
Jiaqi Liao
edc48aba37
[fix]: fix wrapper import issue (#1819)
2026-01-28 16:31:56 +08:00
Oql
bf4c8a690b
Add Native Precision Tutorial, update worker strategy and README.md (#1807)
2026-01-23 18:00:13 +08:00
Jianwei Dong
027832c590
[feat](kt-kernel): CPU-GPU experts sched (#1796)
2026-01-16 17:01:15 +08:00
Oql
6277da4c2b
support GLM 4.7 (#1791)
2026-01-13 17:36:25 +08:00
Oql
5edc456749
support Native BF16 format MoE (#1788)
2026-01-12 14:43:28 +08:00
ErvinXie
d8046e1bb4
Kt minimax (#1742)
[feat]: fp8 kernel and kt-cli support
2025-12-24 15:39:44 +08:00
SCDESPERTATE
008de19e16
[fix](kt-kernel): drop weights held in Python once the C++ weight-loading operation completes (#1695)
2025-12-12 11:42:33 +08:00
Oql
8139c092bf
Reduce CPU memory usage during large chunk prefill (Fixes #1676) (#1683)
* fix(amx): add BufferASmallKGroupImpl to fix buffer overflow in from_mat
The original BufferAKGroupImpl::from_mat writes 64 bytes per K_STEP iteration,
but with K_STEP=32 (for GemmKernel224Int4SmallKGroup) this overflows the
destination buffer. BufferASmallKGroupImpl overrides from_mat to write only 32 bytes per iteration.
* perf(k2-moe): optimize memory allocation with pooled buffers
- Replace per-expert buffer allocation with shared memory pools
- Dynamically assign buffer slices based on activated experts
- Add group_size inference from scale tensor shape in amx.py (sketched below)
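A hypothetical reconstruction of that inference: with group-quantized INT4 weights, each group of group_size input channels shares one scale, so the group count is the scale tensor's last dimension:
```python
import torch

def infer_group_size(in_features: int, scales: torch.Tensor) -> int:
    num_groups = scales.shape[-1]
    assert in_features % num_groups == 0
    return in_features // num_groups

print(infer_group_size(7168, torch.empty(2048, 56)))  # 128
```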
* delete kimi k2 forward test
* add TODO comment for pool_count_ calculation
2025-12-08 20:19:07 +08:00
ErvinXie
71f683acec
Support Native Kimi K2 Thinking (#1663)
* [feat]: fix k2 prefill
* Update Kimi-K2-Thinking.md
* Create Kimi-K2-Thinking-Native.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking-Native.md
* [perf] optimize K2 MoE weight loading with per-expert pointers
- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory (sketched below)
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis
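A sketch of the pointer-array idea (not the actual loader code); kt_cpuinfer_load is a stand-in name for the C++ binding:
```python
import torch

experts = [torch.randn(128, 256, dtype=torch.bfloat16) for _ in range(8)]

# Slow path (avoided): one big Python-side copy, ~6.6s at full scale.
# stacked = torch.stack(experts).contiguous()

# Fast path: collect per-expert data pointers; the C++ worker pool then
# memcpys each expert in parallel while slicing for TP.
gate_projs = [t.data_ptr() for t in experts]
# kt_cpuinfer_load(gate_projs, ...)  # hypothetical binding call
```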
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-05 21:53:05 +08:00
Jiaqi Liao
fcf8882075
[Feature] Add avx-based kimi-k2 support (#1656)
* support Kimi-K2-Thinking original weight
fix amx kernel bug
* update k2 avx kernel.
* feat: add CPUInfer write buffer task
* [feat]: add kimi k2 cpu write buffer support
- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp (see the sketch after this list)
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture
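A toy sketch of the column-wise down (w2) slice from the list above; the per-rank arithmetic is an assumption, and the three cpu_tp/gpu_tp cases only change which ranks copy which column range:
```python
import torch

hidden, inter, cpu_tp = 16, 8, 2          # toy sizes
down = torch.randn(hidden, inter)         # one expert's w2 [H, I]
cols = inter // cpu_tp
shards = [down[:, r * cols:(r + 1) * cols] for r in range(cpu_tp)]
print([s.shape for s in shards])          # 2 x [16, 4]
```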
* [fix]: correct write_weight_scale_to_buffer expert offset calculation
Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios.
Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* [feat]: add write buffer wrapper
* [fix] fix comment
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-02 16:01:07 +08:00
ZiWei Yuan
1374b98ee5
[feat](moe_kernel): add amd blis support (int8) (#1600)
* [feat]: init amd adaptation
* [feat]: add blis support
* [fix]: fix setup and moe kernel wrapper
* [fix](setup.py): support rebuild with cache; importing kt_kernel works fine
* [feat]: add moe_kernel converter for amd and implement the load method (not yet tested)
* [feat](moe_kernel/moe.hpp): free unused memory when saving
* [fix](moe_kernel): update PLAIN for pack
* [fix](moe_kernel): rm printf debug
* [fix](moe_kernel): skip gpu experts
* [fix](moe_kernel/moe.hpp): update include memory path
* [feat](moe_kernel/moe.hpp): support expert deferral
* [feat]: finish amd
---------
Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>
2025-11-27 12:08:53 +08:00
Jiaqi Liao
d483147307
Fix kt-kernel compile issue (#1595)
* update install.sh
* fix import issue
* update README
2025-11-11 19:30:27 +08:00
Jiaqi Liao
94c25626dc
Fix kt-kernel for new wrapper (#1588)
...
* update README for kt-kernel
* style: format C++ and Python code in kt-kernel
- Format C++ files: task_queue, ext_bindings, and MoE operators
- Format Python utility modules: amx, llamafile, and loader
- Improve code readability and consistency
2025-11-10 21:47:34 +08:00
Jiaqi Liao
9bc00e587b
Refactor KTMoEWrapper backend (#1587)
* universal backend for cpu inference
* expert deferral
2025-11-10 20:26:15 +08:00