vrr/kvcache-ai-ktransformers

mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-04-28 11:49:51 +00:00

Author	SHA1	Message	Date
Benjamin F	8484ef8b16	[feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950) V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}` naming and converts ue8m0 → bf16 via a lossless bit-cast, register the model entry, and ship an end-to-end numerical validation script. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 18:11:53 +08:00
mrhaoxx	7a9daf0cd4	[feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892) * feat: support avx2 bf16 fp8 inference * feat: support avx2 gptq int4 inference * fix: numeric issues in fp8 dequant * Tutorial avx2 (#1900) * fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines * docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs * Tutorial avx2 (#1901) * fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines * docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs * docs: update README.md --------- Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>	2026-03-27 14:45:02 +08:00
Chen Hongtao	9e69fccb02	[feat]: add mistral moe loader compatibility (#1873) Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>	2026-02-28 17:50:23 +08:00
VYSE V.E.O	20262b2743	Fix Qwen3.5 FP8 load for VL detection (#1857) * Fix Qwen3.5 FP8 load for VL detection: 1. for VL models (Qwen3.5), change base_key from model.layers.{N} to model.language_model.layers.{N}; 2. clean up the DUPLICATED class BF16SafeTensorLoader(SafeTensorLoader) (only the first, overridden definition). * Indent type Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-02-26 15:47:22 +08:00
Jianwei Dong	16a8b98f3e	support qwen3.5 (#1846)	2026-02-16 15:48:14 +08:00
Jiaqi Liao	db82d99fa6	feat: add fallback expert prefix lookup in loader.py from kimi_k2.5 (#1822)	2026-01-30 14:09:38 +08:00
Oql	6277da4c2b	support GLM 4.7 (#1791)	2026-01-13 17:36:25 +08:00
Oql	5edc456749	support Native BF16 format MoE. (#1788)	2026-01-12 14:43:28 +08:00
ErvinXie	d8046e1bb4	Kt minimax (#1742) [feat]: fp8 kernel and kt-cli support	2025-12-24 15:39:44 +08:00
Jiaqi Liao	fcf8882075	[Feature] Add avx-based kimi-k2 support (#1656) * support Kimi-K2-Thinking original weight; fix amx kernel bug * update k2 avx kernel. * feat: add CPUInfer write buffer task * [feat]: add kimi k2 cpu write buffer support - Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights - Fix down (w2) weight column-wise slicing for different TP configurations - Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp - Add comprehensive test cases for weight extraction validation - Ensure compatibility with Kimi model's MoE architecture * [fix]: correct write_weight_scale_to_buffer expert offset calculation Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios. Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * [feat]: add write buffer wrapper * [fix] fix comment --------- Co-authored-by: ouqingliang <1692110604@qq.com> Co-authored-by: Claude <noreply@anthropic.com>	2025-12-02 16:01:07 +08:00
Jiaqi Liao	94c25626dc	Fix kt-kernel for new wrapper (#1588) * update README for kt-kernel * style: format C++ and Python code in kt-kernel - Format C++ files: task_queue, ext_bindings, and MoE operators - Format Python utility modules: amx, llamafile, and loader - Improve code readability and consistency	2025-11-10 21:47:34 +08:00
Jiaqi Liao	9bc00e587b	Refactor KTMoEWrapper backend (#1587) * universal backend for cpu inference * expert defer	2025-11-10 20:26:15 +08:00
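
The #1950 entry says V4-Flash experts are stored as E2M1 nibbles with a ue8m0 group scale, and that the scale is converted to bf16 by a lossless bit-cast. The plain-Python sketch below shows why that bit-cast works. It is not kt-kernel's loader. It assumes the OCP Microscaling (MX) encodings: E8M0 is an exponent byte with bias 127, and E2M1 magnitudes run from 0 to 6. bf16 shares that 8-bit exponent field and bias, so the shift is exact for exponent bytes 1..254. The sketch deliberately rejects 0 and 0xFF because the commit does not say how the loader treats them. The low-nibble-first packing order and all function names are hypothetical.

```python
import struct

# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def ue8m0_to_bf16_bits(e: int) -> int:
    """Move an unsigned E8M0 exponent byte into the bf16 exponent field.

    bf16 has 1 sign bit, 8 exponent bits (bias 127) and 7 mantissa bits, so
    e << 7 encodes exactly 2**(e - 127) for e in 1..254. Bytes 0 and 0xFF
    are rejected here (assumption: their handling is not described).
    """
    if not 1 <= e <= 254:
        raise ValueError(f"exponent byte {e} is outside the exact bit-cast range")
    return e << 7


def bf16_bits_to_float(bits: int) -> float:
    # A bf16 value is the upper 16 bits of a float32 with the same layout.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def decode_e2m1(nibble: int) -> float:
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def dequant_mxfp4_group(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize packed E2M1 values that share one E8M0 scale.

    Hypothetical layout: the low nibble of each byte is the earlier element.
    """
    scale = bf16_bits_to_float(ue8m0_to_bf16_bits(scale_byte))
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

For example, `dequant_mxfp4_group(bytes([0x21, 0xF7]), 128)` uses a scale of 2**(128 - 127) = 2 and yields `[1.0, 2.0, 12.0, -12.0]`. Both E2M1 products and power-of-two scales are exact in binary floating point, so this step introduces no rounding.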

12 commits
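
Two entries above involve building tensor keys from a prefix. #1950 quotes the V4 scheme `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}`, and #1857 switches the Qwen3.5 VL base key from `model.layers.{N}` to `model.language_model.layers.{N}`. The helpers below are hypothetical illustrations, not the project's loader code. In particular, choosing the VL prefix by probing the checkpoint's key set is an assumption made for this sketch; the commit only says VL detection selects the base key.

```python
from typing import Iterable


def v4_expert_keys(layer: int, expert: int) -> list[str]:
    """Enumerate the six tensor keys for one routed expert in the V4 scheme."""
    return [
        f"layers.{layer}.ffn.experts.{expert}.{proj}.{kind}"
        for proj in ("w1", "w3", "w2")
        for kind in ("weight", "scale")
    ]


def qwen_base_key(layer: int, tensor_names: Iterable[str]) -> str:
    """Pick the layer prefix, preferring the VL layout when it is present.

    Hypothetical: probes the available tensor names instead of reading a config.
    """
    vl_prefix = f"model.language_model.layers.{layer}."
    if any(name.startswith(vl_prefix) for name in tensor_names):
        return f"model.language_model.layers.{layer}"
    return f"model.layers.{layer}"
```

Here `v4_expert_keys(3, 7)` starts with `layers.3.ffn.experts.7.w1.weight` and ends with `layers.3.ffn.experts.7.w2.scale`.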