vrr/kvcache-ai-ktransformers

mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-04-29 04:09:52 +00:00

Author	SHA1	Message	Date
mrhaoxx	a98d544833	merge: integrate origin/main into sft branch Resolved 6 conflicts: - CMakeLists.txt: keep cpptrace + debug flag, accept flexible build type - worker_pool.cpp: keep SFT profiling + main's block=1 spin fix - ext_bindings.cpp: keep both SFT MOE bindings and AVX2/BF16/FP8 bindings - common.hpp: keep gpu_experts_mask + SFT backward weight fields - __init__.py: export both generate_gpu_experts_masks and AMXSFTMoEWrapper - experts.py: gpu_experts_mask for inference, num_gpu_experts for SFT, new methods	2026-04-08 23:19:28 +08:00
mrhaoxx	f36699affd	feat(sft): AMX MoE SFT backend with LoRA support Complete SFT (Supervised Fine-Tuning) backend for MoE models using AMX SIMD: Core C++ implementation: - sft_moe.hpp: Forward/backward with LoRA fused operations (~5500 lines) - moe-sft-tp.hpp: Tensor-parallel wrapper for multi-NUMA - amx/moe-sft-tp.hpp: AMX-specific TP implementation - avx_kernels.hpp: AVX512 SIMD kernels for LoRA GEMM - amx_kernels.hpp: AMX tile kernels for Panel5 rank-outer optimization - worker_pool: RDTSC profiling, Chrome trace output, SFT timer infrastructure - ext_bindings.cpp: SFT MOE pybind bindings (BF16/INT8/INT4 + SkipLoRA variants) Python sft/ submodule (kt_kernel.sft): - base.py: BaseSFTMoEWrapper with buffer management (template method pattern) - amx.py: AMXSFTMoEWrapper (weight loading, C++ task construction) - autograd.py: KTMoEFunction (torch.autograd.Function for distributed training) - layer.py: KTMoELayerWrapper (nn.Module replacing HF MoE layers) - arch.py: MOEArchConfig (Qwen3/DeepSeek/Mixtral architecture detection) - weights.py: Expert weight extraction and checkpoint loading - lora.py: PEFT LoRA adaptation (view buffers, grad buffers, save/load adapter) - wrapper.py: wrap_moe_layers_with_kt_wrapper, load_kt_model, build_kt_device_map - config.py: KTConfig dataclass (DeepSpeed-style opaque config passthrough) - dist_utils.py: Distributed gather/scatter, checkpoint-phase detection Design decisions: - Rank-0-only expert pattern: only rank 0 holds C++ wrapper and expert weights - DeepSpeed-style integration: accelerate keeps only KTransformersPlugin (framework interaction fields), all logic in kt_kernel.sft - Inference isolation: importing kt_kernel does not load sft/ submodule - Old field name compatibility: _get_kt_config() converts kt_xxx→xxx automatically Verified: Qwen3-235B-A22B 4GPU AMXBF16 training, loss converges normally.	2026-04-08 23:11:00 +08:00
mrhaoxx	7a9daf0cd4	[feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892 ) Some checks are pending Book-CI / test (push) Waiting to run Details Book-CI / test-1 (push) Waiting to run Details Book-CI / test-2 (push) Waiting to run Details Deploy / deploy (macos-latest) (push) Waiting to run Details Deploy / deploy (ubuntu-latest) (push) Waiting to run Details Deploy / deploy (windows-latest) (push) Waiting to run Details * feat: support avx2 bf16 fp8 inference * feat: support avx2 gptq int4 inference * fix: numeric issues in fp8 dequant * Tutorial avx2 (#1900) * fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines * docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs * Tutorial avx2 (#1901) * fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines * docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs * docs: update README.md --------- Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>	2026-03-27 14:45:02 +08:00
Jianwei Dong	027832c590	[feat](kt-kernel): CPU-GPU experts sched (#1796 ) Some checks failed Book-CI / test (push) Has been cancelled Details Book-CI / test-1 (push) Has been cancelled Details Book-CI / test-2 (push) Has been cancelled Details Deploy / deploy (macos-latest) (push) Has been cancelled Details Deploy / deploy (ubuntu-latest) (push) Has been cancelled Details Deploy / deploy (windows-latest) (push) Has been cancelled Details	2026-01-16 17:01:15 +08:00
Oql	5edc456749	support Native BF16 format MoE. (#1788 ) support Native BF16 format MoE	2026-01-12 14:43:28 +08:00
ErvinXie	d8046e1bb4	Kt minimax (#1742 ) [feat]: fp8 kernel and kt-cli support	2025-12-24 15:39:44 +08:00
Jiaqi Liao	fcf8882075	[Feature] Add avx-based kimi-k2 support (#1656 ) Some checks are pending Book-CI / test-2 (push) Waiting to run Details Book-CI / test (push) Waiting to run Details Book-CI / test-1 (push) Waiting to run Details Deploy / deploy (macos-latest) (push) Waiting to run Details Deploy / deploy (ubuntu-latest) (push) Waiting to run Details Deploy / deploy (windows-latest) (push) Waiting to run Details * support Kimi-K2-Thinking original weight fix amx kernel bug * update k2 avx kernel. * feat: add CPUInfer write buffer task * [feat]: add kimi k2 cpu write buffer support - Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights - Fix down (w2) weight column-wise slicing for different TP configurations - Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp - Add comprehensive test cases for weight extraction validation - Ensure compatibility with Kimi model's MoE architecture * [fix]: correct write_weight_scale_to_buffer expert offset calculation Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios. Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * [feat]: add write buffer wrapper * [fix] fix comment --------- Co-authored-by: ouqingliang <1692110604@qq.com> Co-authored-by: Claude <noreply@anthropic.com>	2025-12-02 16:01:07 +08:00
ZiWei Yuan	1374b98ee5	[feat](moe_kernel): add amd blis support (int8) (#1600 ) Some checks are pending Book-CI / test (push) Waiting to run Details Book-CI / test-1 (push) Waiting to run Details Book-CI / test-2 (push) Waiting to run Details Deploy / deploy (macos-latest) (push) Waiting to run Details Deploy / deploy (ubuntu-latest) (push) Waiting to run Details Deploy / deploy (windows-latest) (push) Waiting to run Details * [feat]: init amd adaption * [feat]: add blis support * [fix]: fix setup and moe kernel warpper * [fix](setup.py): support rebuild with cache and import kt_kernel works fine * [feat]: add moe_kernel converter for amd and implement the load method(haven't tested yet) * [feat](moe_kernel/moe.hpp): delete unused memory when using save * [fix](moe_kernel): update PLAIN for pack * [fix](moe_kernel): rm printf debug * [fix](moe_kernel): skip gpu experts * [fix](moe_kernel/moe.hpp): update include memory path * [feat](moe_kernel/moe.hpp): support expert deferral * [feat]: finish amd --------- Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>	2025-11-27 12:08:53 +08:00
Jiaqi Liao	9bc00e587b	Refactor KTMoEWrapper backend (#1587 ) Some checks are pending Book-CI / test (push) Waiting to run Details Book-CI / test-1 (push) Waiting to run Details Book-CI / test-2 (push) Waiting to run Details Deploy / deploy (macos-latest) (push) Waiting to run Details Deploy / deploy (ubuntu-latest) (push) Waiting to run Details Deploy / deploy (windows-latest) (push) Waiting to run Details * universal backend for cpu inference * expert defer	2025-11-10 20:26:15 +08:00
chenht2022	6fe30af50d	Merge branch 'main' into develop-cht	2025-11-03 14:35:44 +00:00
ovowei	f854d03bd7	update kt-kernel	2025-11-03 15:19:52 +08:00
chenht2022	dd4377b60b	feat: add deferred expert scheduling support	2025-10-31 08:03:37 +00:00
ovowei	28d8663374	fix	2025-10-22 18:14:34 +08:00
Atream	4c5fcf9774	add kt-kernel	2025-10-12 05:13:00 +00:00

14 commits