vrr/kvcache-ai-ktransformers

mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-07-09 17:18:38 +00:00

Author	SHA1	Message	Date
VectorPeak	cb9f47d142	[fix](cli): detect bound ports before launch (#2071) * [fix](cli): detect bound ports before launch * [fix](cli): align port reuse check by platform	2026-07-06 18:25:31 +08:00
lutianshu824	79b265b2f6	fix: normalize compressed RAWINT4 weights (#2075) * fix: normalize compressed RAWINT4 weights * docs: add Hygon DCU ROCm notes --------- Co-authored-by: lutianshu824 <lutianshu824@users.noreply.github.com>	2026-07-06 18:06:52 +08:00
Benjamin	8e46e5896c	release: bump version to 0.6.3.post1 (#2063)	2026-06-25 22:42:55 +08:00
github-actions[bot]	459df61262	[build] Sync sglang-kt submodule (v0.6.3) (#2055)	2026-06-25 20:41:04 +08:00
Benjamin	a0c7431187	ci(release-pypi): make release pipeline self-consistent (#2062)	2026-06-25 20:05:06 +08:00
Willow Lopez	983a88b620	docs: add FAQ entry about SGLang MoE kernel config warning (#2029)	2026-06-24 15:12:54 +08:00
Hermann Hans Klie	08d70e8605	[fix](kt-kernel): enable -mavx512vl for AVX512_VBMI/BF16 multi-variant builds (#2021)	2026-06-24 15:05:36 +08:00
Jiaheng Dai	1ed332b6ea	[docs]: clean up Qwen3.5 KT LoRA serving guide (#2057) * [docs]: clean up Qwen3.5 KT LoRA serving guide * Update doc/zh/Qwen3.5-SGLang-LoRA-Serving_zh.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-06-23 20:52:23 +08:00
Benjamin	c884dbd221	release: bump version to 0.6.3 (#2054)	2026-06-21 17:55:24 +08:00
github-actions[bot]	ce7c3ddbe9	[build]: sync sglang submodule to 8b636f9008dbad58c0a8e481b03e794739e6c146 (#2047) Co-authored-by: ovowei <80044717+ovowei@users.noreply.github.com>	2026-06-21 17:37:52 +08:00
Benjamin	6c9c95601d	[fix](kt-kernel): CudaGraph replay fix and Add MoE startup log (#2037) CudaGraph replay fix and Add MoE startup log	2026-06-18 17:15:37 +08:00
Benjamin	943cc4daeb	[feat] MXFP8 MoE support (#2041) Add MXFP8 kernel for Minimax-M3-MXFP8 Day 0	2026-06-18 16:18:49 +08:00
Tai An	21d5b40323	docs(deepseek-v4-flash): pin apache-tvm-ffi<0.1.12 to avoid tilelang TVM FFI registration clash (#2045) (#2048)	2026-06-18 11:10:56 +08:00
Benjamin	50d9434800	[docs](kt-kernel): add MiniMax-M3 SGLang + KT-Kernel tutorial (#2051)	2026-06-18 10:40:51 +08:00
Jianwei Dong	0f2e7905e2	Update README.md (#2049)	2026-06-17 14:24:23 +08:00
Jianwei Dong	0e2ff783d8	update glm52 tutorial (#2046)	2026-06-16 17:15:23 +08:00
Dayuxiaoshui	7641f5445d	[feat] : Add MACA backend support for kt-kernel and fix CPU MoE tests (#2044) * Add MACA backend support for kt-kernel * Add MACA event API mappings * Fix AMX build flags and CPU MoE tests --------- Co-authored-by: <Engle_Chaveztih@sociologist.com>	2026-06-16 16:49:06 +08:00
devangpratap	89d30a3d01	[fix(loader)]: correct off-by-one expert-count guard in load_experts (#2026) * [fix(loader)]: correct off-by-one expert-count guard in SafeTensorLoader.load_experts After the discovery loop, max_experts_count is the highest expert index found (expert count - 1), and is -1 only when the key has no experts. The guard checked == 0, which falsely rejected single-expert layers and silently returned empty weight lists for the zero-expert case. Check == -1 instead. Adds a CPU regression test covering the single-, zero-, and multi-expert cases. * [test(loader)]: import loader as a top-level module in expert-count guard test Per review feedback: add python/utils to sys.path and import loader directly instead of the importlib.util boilerplate. Still bypasses utils/__init__.py (and the compiled kt_kernel_ext) while keeping the import idiomatic.	2026-06-07 23:41:04 +08:00
github-actions[bot]	c1cb22311b	[build]: sync sglang submodule to 51032b71279d9038058563f8d2e758d99b278ef4 (#2032) Co-authored-by: ovowei <80044717+ovowei@users.noreply.github.com>	2026-06-05 19:24:22 +08:00
Jiaheng Dai	c9a915e6ac	[feat](kt-lora): add end-to-end Qwen3.5 MoE KT LoRA serving workflow (#2031) * [feat](kt-lora): add KT expert LoRA adapter serving * [feat]: pin Qwen3.5 non-expert LoRA support * [feat](kt-lora): add merged SGLang adapter workflow Document the KT SFT to SGLang serving loop and extend the converter with optional split outputs so users can serve one merged adapter while retaining debug-friendly expert/non-expert artifacts. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](kt-lora): validate adapter conversion Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 16:57:14 +08:00
devangpratap	d41f569e84	[fix](cli): detect SGLANG_DSV4_2604_SUBMODE conflict before launch (#2025) * [fix](cli): detect SGLANG_DSV4_2604_SUBMODE conflict before launch * [fix](cli): tighten env-var validation per review feedback doctor.py: skip SGLANG_DSV4_2604_SUBMODE row when value is empty string, not just None, to avoid spurious noise in kt doctor output. run.py: guard kt_method against None/empty before calling .upper() in _check_conflicting_env_vars to prevent AttributeError.	2026-05-30 19:20:47 +08:00
Benjamin	ef6c47f9d2	[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch (#2015 ) * [feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch - Add AVX2 MXFP4 MoE kernel (mxfp4-moe.hpp) with 4-token M-blocking - Wire AVX2MXFP4_MOE binding in ext_bindings.cpp - Support TP_MOE down_proj slicing and multi-pool per-expert loading - Add test_fp4_moe_avx2.py integration test * [fix](kt-kernel): address PR #2010 review — memory leaks, alignment, dynamic expert update - Track aligned_alloc pointers in AVX2_MOE_BASE::owned_aligned_allocs_ and free them in the destructor (fixes BufferB backing memory leak on destroy). - Track per-TP down_buf allocations in TP_MOE::tp_owned_down_bufs_ with nullptr checks and size rounding to alignment boundary. - Add nibble-alignment runtime check for per_tp_interm in MXFP4 TP K-split. - Add write_weight_scale_to_buffer override to TP_MOE<AVX2_MXFP4_MOE_TP>, enabling dynamic expert update with kt-threadpool-count>=2. - Guard against ZeroDivisionError in test_fp4_moe_avx2.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [fix](kt-kernel): add intermediate_size parity check in MXFP4 TP flat-buffer path The per-expert path validates that intermediate_size is even (required for nibble-aligned FP4 addressing), but the flat-buffer path was missing this check — an odd value would silently truncate /2 divisions, corrupting memcpy sizes and offsets. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(avx2-moe): fix TP offset calculation and add safety checks C1-C4: Fix incorrect TP offset calculations in load_weights() - Per-expert mode used per_tp_interm instead of full_interm for offsets - This caused segfault when TP > 1 due to invalid pointer arithmetic H1-H3: Add safety checks - H1: Validate source weight pointers are not null - H2: Check lid index is within bounds - H3: Check BufferB.b is not null in gemm_mxfp4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(avx2-moe): revert incorrect C2/C4 offset changes, keep safety checks Reverts the incorrect offset calculation changes from previous commit. The original per_tp_interm-based offsets were correct: - gate/up weights are N-split (along intermediate dim) - Each TP partition handles per_tp_interm rows - Offset = i * per_tp_interm * hidden / 2 (not full_interm) Keeps H1-H3 safety checks: - H1: Validate source weight pointers are not null - H2: Check lid index is within bounds - H3: Check BufferB.b is not null in gemm_mxfp4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(avx2): copy weights to owned buffers in per-expert mode Previously, AVX2 MXFP4 MoE per-expert mode directly pointed BufferB.b into mmap'd safetensor data. This caused use-after-free when Python layer releases the mmap after load_weights() returns. Now AVX2 copies weights into owned buffers via memcpy/from_raw_mat(), matching AMX behavior. This decouples the MoE weights from mmap lifecycle. 
Changes: - buffer_b_required_size_impl: always allocate full buffer (weights + scales) - make_buffer_b_impl: always create full BufferB with owned storage - Single-TP per-expert: use from_raw_mat() instead of direct pointer - TP_MOE per-expert: add gate/up owned buffers with memcpy - Destructor: free gate/up buffers alongside down Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Revert "[fix] Add runtime AMX BF16 check to prevent SIGILL on pre-Sapphire Rapids CPUs (#2018)" This reverts commit `f1e2b82c74`. * Remove AMX tile MXFP4 kernel (GemmKernel224MXFP4) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-30 19:20:16 +08:00
Li Tingfang	f1e2b82c74	[fix] Add runtime AMX BF16 check to prevent SIGILL on pre-Sapphire Rapids CPUs (#2018)	2026-05-21 17:36:12 +08:00
login256	eeeeae5e91	Fix duplicate BF16 loader definition (#1984)	2026-05-20 15:04:47 +08:00
Jim James	f0772445a1	[perf]: native path for MXFP4 MoE on AVX512F (#2006) * [perf]: native path for MXFP4 MoE on AVX512F * [perf]: move inline static constants outside structs	2026-05-18 15:44:33 +08:00
github-actions[bot]	95e20f9c55	[build]: sync sglang submodule to ebaff7729b9e41c29d94f8d19a53473d321dc566 (#2005) Co-authored-by: ovowei <80044717+ovowei@users.noreply.github.com>	2026-05-14 22:25:31 +08:00
Benjamin F	f05b4009f3	[fix](kt-kernel): fix double mem used by safetensor loader (#1997) Release the SafeTensor mmap loader singleton after each layer's load_weights() completes. The C++ engine already holds a deep copy (cpu_infer.sync() guarantees this), so releasing the mmap handles is safe. The next layer recreates the loader on demand. This halves peak memory usage during model loading (e.g. DSv3.2: 1.2T -> 613G). Based on #1966 by @poryfly, adapted to v0.6.2.post3 codebase (adds MXFP4 support missing from the original PR). Co-authored-by: xiongchenhui <xiongchenhui@hisense.com>	2026-05-11 12:00:30 +08:00
Benjamin F	bb15fdf47e	Release/0.6.2.post3: carry kt-kernel SwiGLU clamp companion missing from post2	2026-05-10 03:55:02 +08:00
Benjamin F	37db9a3b83	0.6.2.post2: submodule refactor and update tutorial (#1993) - sglang submodule -> 43ed1ec77: V4-Flash hybrid SWA chunked-prefill hang fix (#44) + DSV4 plugin registry refactor (#47) - pin sglang-kt==0.6.2.post2 - tutorial: switch V4-Flash launch example from 8x RTX 5090 to single-card (decode 20+ tok/s); flip Ada Lovelace SM_89 row to validated; update Hardware Requirements GPU line accordingly	2026-05-09 18:53:59 +08:00
Jim James	f7c4fa68c5	[fix]: add guard for SFT MoE and remove guard for AMX FP4 MoE on AVX512F+BW (#1980)	2026-05-08 16:05:22 +08:00
Benjamin F	c465557c23	docs(v4-flash): add optional AMXINT4 CPU-weight conversion path (#1986) - Add convert_cpu_weights_ds4.py: dequantizes MXFP4 routed experts (E2M1 + ue8m0, group size 32) on GPU and re-quantizes to AMX-INT4 on CPU. - Document the script as Step 2 in DeepSeek-V4-Flash.md so AMX users can opt into AMXINT4 mode instead of the default MXFP4 CPU experts.	2026-05-08 15:35:05 +08:00
Benjamin F	8b9d233d42	docs(v4-flash): tilelang install, MTP flags, Ampere unsupported (#1979) - Prerequisites: add `pip install tilelang` (sglang-kt extras don't pull it; required for the NSA tilelang indexer fallback on non-Hopper GPUs). Validated with tilelang==0.1.8. - Step 2: add an "Optional: Enable MTP" subsection with the EAGLE speculative-decoding flags for V4-Flash NextN.	2026-05-06 17:29:38 +08:00
Benjamin F	d7b5b49a3e	[release]: 0.6.2.post1 V4-Flash MXFP4 full-GPU prefill fallback now works: - Previously crashed all TP schedulers with StopIteration/AttributeError whenever --kt-gpu-prefill-token-threshold was low enough to actually fire (path was hardcoded for FP8/INT4 layouts). - Now detects MXFP4, re-runs the V4 swizzle on the 256-expert gpu_layer, caches the load across prefill chunks. - Measured on 8x RTX 5090 (threshold=1024, chunked=1024): 16k input -> 2011 tok/s, 65k -> 2798, 262k -> 2154 prefill TPS.	2026-05-03 21:07:23 +08:00
Benjamin F	96189972d8	build: bump sglang submodule to c9edb75e0 (V4-Flash GPU prefill fallback fix + perf) (#1975)	2026-05-03 19:42:19 +08:00
Benjamin F	088ed979d5	docs(v4-flash): pin transformers==4.57.1 in tutorial prerequisites (#1974) V4-Flash is incompatible with the transformers 5.x series. transformers 5.x adds default-valued fields to PretrainedConfig that make DeepSeekV4Config's dataclass declaration crash at import time with: TypeError: non-default argument 'quantization_config' follows default argument Reproduced on a fresh venv with `pip install sglang-kt`: pip resolves transformers to 5.7.0 (sglang-kt's pyproject does not pin transformers), and `python -m sglang.launch_server --model .../DeepSeek-V4-Flash` fails during import of sglang.srt.configs.deepseek_v4. Pinning to 4.57.1 fixes it. Add a Prerequisite #4 documenting the explicit pin alongside the existing flashinfer override.	2026-05-03 16:07:31 +08:00
Benjamin F	4b4312c0a2	release: bump version to 0.6.2 (#1973)	2026-05-03 14:28:09 +08:00
Benjamin F	bb3b6e8413	build: bump sglang submodule to 40d3a82 (V4-Flash flashinfer guard) (#1972)	2026-05-03 14:06:33 +08:00
Benjamin F	041bdfc636	[New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970) * [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern. FP4 E2M1 weights are nibble-packed and decoded via PSHUFB LUT, then computed with BF16 activations using _mm512_dpbf16_ps. Supports weight-only per-kgroup scaling (group_size=32) and tensor parallelism. Includes a Python validation test covering uniform, alternating, ramp, and random weight patterns. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950) V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}` naming and converts ue8m0 → bf16 via a lossless bit-cast, register the model entry, and ship an end-to-end numerical validation script. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [perf](kt-kernel): MXFP4 MoE add mat-mat 4×4 tile, refine mat-vec reduce (#1957) mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a 4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode of four weight rows is reused across four tokens.
Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and flush them with a new reduce4 helper, removing per-row reduce_add_ps calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline. Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and a backend registry so future kernels can be added without changing the runner. Dispatch thresholds, derived_init, GeneralMOEConfig handling, load_weights, write_weights_to_buffer and the TP_MOE specialization are unchanged. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(loader): avoid uint16 lshift in ue8m0->bf16 conversion PyTorch CPU has no lshift kernel for UInt16, so the previous `(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when loading any V4-Flash MXFP4 routed-expert scale tensor on the host. Switch to int32 for the shift (kernel exists) and narrow to int16 afterwards. The shifted value max is 255<<7 = 32640, well within int16 range, so the narrow is lossless. The .view(bfloat16) bit pattern is identical (bf16 sign bit is always 0 for ue8m0 values). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29) which now contains DeepSeek V4 Flash model support + consumer-GPU (SM_120) portable Triton/TileLang fallbacks (kt-sglang PR #38). Adds doc/en/DeepSeek-V4-Flash.md tutorial: 8x RTX 5090 hybrid recipe with the full launch command, OpenAI-compatible /generate + /v1/chat/completions examples, and the kt chat CLI client. --------- Co-authored-by: ouqingliang <1692110604@qq.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-03 10:48:31 +08:00
github-actions[bot]	fe06c4d355	[build]: sync sglang submodule to 537eb762b0881071a0e098bd78666fe052b83deb (#1967) Co-authored-by: ovowei <80044717+ovowei@users.noreply.github.com>	2026-05-02 12:42:04 +08:00
Aliez Ren	02be2bf53f	[feat](kt-kernel): add AVX2/AVX-VNNI RAWINT4 MoE backend (#1942) * [feat](kt-kernel): add AVX2/AVX-VNNI RAWINT4 MoE backend * Update AVX2 tutorial with AVX2 compilation instructions Added instructions for forcing AVX2 compilation on AVX512 or AMX machines. * Add instructions for AVX2 compilation --------- Co-authored-by: Jiaheng Dai <108478605+jdai0@users.noreply.github.com>	2026-04-30 17:16:49 +08:00
Peilin Li	8c634d5dca	[docs]: refresh kt inference and sft entry points Refresh README entry points and add KT SFT Quick Start.	2026-04-30 16:25:34 +08:00
Peilin Li	24b1941b85	[fix]: point sglang extra to post2 (#1964)	2026-04-30 11:57:02 +08:00
Peilin Li	72044ad65f	[build]: bump v0.6.1 post1 package metadata Bump ktransformers/kt-kernel to 0.6.1.post1 and point extras to post1 companion packages.	2026-04-30 01:02:44 +08:00
Peilin Li	ef5822639f	[fix](kt-kernel): pin torch 2.9.1 wheel baseline Pin kt-kernel torch 2.9.1 metadata, update autosetup for cu130 wheels, register kt_kernel.kt_kernel_ext, and bump the sglang submodule.	2026-04-30 00:57:24 +08:00
Benjamin F	9f34ef46e6	[fix](Qwen3 series): fix gibberish output by correcting RoPE write-back (#31) (#1959)	2026-04-27 22:04:29 +08:00
Peilin Li	0656e01ac1	[docs]: refresh KT install commands (#1958)	2026-04-27 00:45:43 +08:00
Peilin Li	07e274467a	[build]: flatten ktransformers package shim (#1955)	2026-04-25 22:08:52 +08:00
Peilin Li	bfbd0e9352	[chore]: archive kt-sft package (#1954)	2026-04-25 21:49:21 +08:00
Peilin Li	85f1ab530b	[ci]: use hosted runner for sglang-kt release Use GitHub-hosted runners for the pure Python sglang-kt release workflow so the PyPI release is not blocked by unavailable self-hosted runners.	2026-04-25 21:05:18 +08:00
Peilin Li	bc7afff13b	[chore]: sync sglang-kt packaging fix Update third_party/sglang to the merged sglang-kt 0.6.1 dependency metadata fix so the release workflow builds the corrected inference package.	2026-04-25 21:02:25 +08:00
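
Note on commit 041bdfc636 (ue8m0 -> bf16 scale conversion): the message describes widening the 8-bit exponent, shifting it left by 7, and reinterpreting the result as bf16. The following is a minimal plain-Python sketch of that bit arithmetic only, not the loader's PyTorch code; the function names `ue8m0_to_bf16_bits` and `bf16_bits_to_float` are hypothetical and exist purely for illustration.

```python
import struct


def ue8m0_to_bf16_bits(e: int) -> int:
    """Place an 8-bit ue8m0 exponent into the exponent field of a bf16 word.

    bf16 layout is 1 sign bit, 8 exponent bits, 7 mantissa bits, so shifting
    the exponent left by 7 leaves the sign and mantissa at zero.
    """
    if not 0 <= e <= 0xFF:
        raise ValueError("ue8m0 exponent must fit in 8 bits")
    return e << 7  # max is 255 << 7 = 32640, which fits in a signed 16-bit int


def bf16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit bf16 pattern by padding it to an IEEE-754 float32."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


# Exponent 127 is the bias, so it decodes to 2**0.
print(bf16_bits_to_float(ue8m0_to_bf16_bits(127)))
```

For exponents 1 through 254 the decoded value is 2**(e - 127), which matches the power-of-two scale the ue8m0 byte stands for. The sketch does not special-case the end points: 0 decodes to 0.0 and 255 to infinity under bf16 rules.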

1297 commits
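
Note on commit 89d30a3d01 (expert-count guard): the message explains that the discovery loop yields the highest expert index, which is -1 only when no expert keys exist, so the guard must compare against -1 and not 0. The sketch below illustrates that reasoning under assumed key names; `highest_expert_index`, `expert_count`, and the `<prefix>.experts.<i>.` key layout are hypothetical and are not taken from `SafeTensorLoader`.

```python
import re


def highest_expert_index(keys, prefix):
    """Return the largest expert index under `prefix`, or -1 if none exist."""
    pattern = re.compile(re.escape(prefix) + r"\.experts\.(\d+)\.")
    max_index = -1
    for key in keys:
        match = pattern.match(key)
        if match:
            max_index = max(max_index, int(match.group(1)))
    return max_index


def expert_count(keys, prefix):
    """Count experts, rejecting only the true zero-expert case."""
    max_index = highest_expert_index(keys, prefix)
    # Comparing against 0 here would wrongly reject a layer whose only
    # expert has index 0; -1 is the "nothing found" sentinel.
    if max_index == -1:
        raise ValueError(f"no experts found under {prefix!r}")
    return max_index + 1


single = ["layers.0.ffn.experts.0.w1.weight"]
print(expert_count(single, "layers.0.ffn"))
```

A single-expert layer yields a highest index of 0 and a count of 1, while an empty key list raises, which covers the single-, zero-, and multi-expert cases the commit message lists.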