* Support for GLM-5 and MiniMax-M2.5
Add CPU weight conversion support for GLM-5 and MiniMax-M2.5
* fix: remove overly restrictive MiniMax condition and deduplicate code
- Remove `args.input_type == "fp8"` from MiniMaxConverter selection so
bf16/fp16 MiniMax models no longer fall through to OnlineQuantConverter
(which doesn't handle w1/w2/w3 naming and would fail); see the sketch after this list.
- Remove OnlineQuantConverter._find_expert_layers() which is identical
to the inherited ConverterBase._find_expert_layers().
- Remove redundant expert_key_filter assignment (same as base default).
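A minimal sketch of the selection logic after this change; only the class names come from the commit, the function shape and constructor signatures are assumptions:
```python
def select_converter(args, model_type):
    if model_type == "minimax":
        # No longer gated on args.input_type == "fp8", so bf16/fp16 MiniMax
        # checkpoints stay here instead of falling through to
        # OnlineQuantConverter, which does not know the w1/w2/w3 expert naming.
        return MiniMaxConverter(args)
    if args.input_type == "fp8":
        return OnlineQuantConverter(args)
    return ConverterBase(args)
```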
---------
Co-authored-by: ErvinXie <ervinxie@foxmail.com>
Add numa_nodes parameter to BaseMoEWrapper and all subclasses, allowing
users to explicitly specify which NUMA node IDs to use for subpool
mapping instead of always defaulting to sequential [0, 1, ..., N-1].
This enables running multiple KTransformers instances on different NUMA
nodes of the same machine, e.g. --kt-threadpool-count 1 --kt-numa-nodes 1
to bind to NUMA node 1. Previously this required external numactl
workarounds since subpool_numa_map was hardcoded to start from 0.
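A minimal sketch of the new mapping, assuming `numa_nodes` arrives as a list of node IDs (names are illustrative, not the exact source):
```python
def build_subpool_numa_map(threadpool_count, numa_nodes=None):
    if numa_nodes is None:
        # Previous behavior: always bind subpools to nodes 0, 1, ..., N-1.
        return list(range(threadpool_count))
    if len(numa_nodes) != threadpool_count:
        raise ValueError("need one NUMA node id per thread pool")
    return list(numa_nodes)

# --kt-threadpool-count 1 --kt-numa-nodes 1  ->  build_subpool_numa_map(1, [1]) == [1],
# binding the single subpool to NUMA node 1.
```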
* initial fix for issue 1858
* [fix]: add done flag check to sync() wait predicate to prevent deadlock during destruction
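A rough illustration of the predicate change, using Python threading as a stand-in for the actual implementation (member names are assumptions):
```python
import threading

class Worker:
    def __init__(self):
        self._cv = threading.Condition()
        self._pending = 0
        self._done = False   # set when the worker is torn down

    def sync(self):
        with self._cv:
            # Without the `self._done` term, a sync() racing with destruction
            # could wait forever for work that will never be executed.
            self._cv.wait_for(lambda: self._pending == 0 or self._done)
```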
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ben Appleby <Ben.Appleby@microsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Increase the timeout for the "--kt-gpu-prefill-token-threshold is in the help output" check to 90 seconds.
In cloud environments, CUDA initialization and Python module loading can easily exceed 30 seconds.
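A hedged sketch of the probe with the longer timeout; the exact command launched by sglang_checker.py is an assumption:
```python
import subprocess

result = subprocess.run(
    ["python", "-m", "sglang.launch_server", "--help"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
    timeout=90,  # was 30s; cold CUDA init and imports can exceed that in the cloud
)
supported = "--kt-gpu-prefill-token-threshold" in result.stdout
```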
* Update kt-kernel/python/cli/utils/sglang_checker.py
Add a comment about the change
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [feat]: simplify sglang installation with submodule, auto-sync CI, and version alignment
- Add kvcache-ai/sglang as git submodule at third_party/sglang (branch = main)
- Add top-level install.sh for one-click source installation (sglang + kt-kernel)
- Add sglang-kt as hard dependency in kt-kernel/pyproject.toml
- Add CI workflow to auto-sync sglang submodule daily and create PR
- Add CI workflow to build and publish sglang-kt to PyPI
- Integrate sglang-kt build into release-pypi.yml (version.py bump publishes both packages)
- Align sglang-kt version with ktransformers via SGLANG_KT_VERSION env var injection
- Update Dockerfile to use submodule and inject aligned version
- Update all 13 doc files, CLI hints, and i18n strings to reference new install methods
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [build]: bump version to 0.5.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [build]: rename PyPI package from kt-kernel to ktransformers
Users can now `pip install ktransformers` to get everything
(sglang-kt is auto-installed as a dependency).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "[build]: rename PyPI package from kt-kernel to ktransformers"
This reverts commit e0cbbf6364.
* [build]: add ktransformers meta-package for PyPI
`pip install ktransformers` now works as a single install command.
It pulls kt-kernel (which in turn pulls sglang-kt).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [fix]: show sglang-kt package version in kt version command
- Prioritize sglang-kt package version (aligned with ktransformers)
over sglang internal __version__
- Update display name from "sglang" to "sglang-kt"
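A hedged sketch of the version lookup (the helper name is hypothetical; importlib.metadata is the standard-library API):
```python
from importlib import metadata

def sglang_display_version():
    # Prefer the sglang-kt distribution version (kept in lockstep with
    # ktransformers) over sglang's internal __version__.
    try:
        return metadata.version("sglang-kt")
    except metadata.PackageNotFoundError:
        import sglang
        return getattr(sglang, "__version__", "unknown")
```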
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [fix]: improve sglang-kt detection in kt doctor and kt version
Recognize sglang-kt package name as proof of kvcache-ai fork installation.
Previously both commands fell through to "PyPI (not recommended)" for
non-editable local source installs. Now version.py reuses the centralized
check_sglang_installation() logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [build]: bump version to 0.5.2.post1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Handle unquoted paths and special characters in model scanner
* Fix ValueError: capture_output cannot be used with stderr
`capture_output=True` internally sets `stderr=PIPE`, which conflicts
with `stderr=subprocess.DEVNULL`. Replace `capture_output=True` with
explicit `stdout=subprocess.PIPE` to keep stderr suppressed correctly.
Also remove redundant `shell=False` (already the default).
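A condensed before/after of the call described above; the command itself is a placeholder:
```python
import subprocess

# Before (raises ValueError because capture_output=True implies stderr=PIPE,
# conflicting with an explicit stderr argument):
#   subprocess.run(cmd, capture_output=True, stderr=subprocess.DEVNULL)

# After: capture stdout explicitly and keep stderr suppressed.
result = subprocess.run(
    ["ls"],                      # placeholder; the real command comes from the scanner
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
)
```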
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: ErvinXie <ervinxie@foxmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [feat]: redesign kt run interactive configuration with i18n support
- Redesign kt run with 8-step interactive flow (model selection, inference method, NUMA/CPU, GPU experts, KV cache, GPU/TP selection, parsers, host/port)
- Add configuration save/load system (~/.ktransformers/run_configs.yaml)
- Add i18n support for kt chat (en/zh translations)
- Add universal input validators with auto-retry and Chinese comma support
- Add port availability checker with auto-suggestion (see the sketch after this list)
- Add parser configuration (--tool-call-parser, --reasoning-parser)
- Remove tuna command and clean up redundant files
- Fix a variable reference bug in run.py and filter the model list to show only MoE models
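The port checker with auto-suggestion could look roughly like this; the function name and search strategy are assumptions:
```python
import socket

def suggest_port(preferred: int, max_tries: int = 50) -> int:
    """Return `preferred` if it is free, otherwise the next free port after it."""
    for port in range(preferred, preferred + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port found in [{preferred}, {preferred + max_tries})")
```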
* [feat]: unify model selection UI and enable shared experts fusion by default
- Unify kt run model selection table with kt model list display
* Add Total size, MoE Size, Repo, and SHA256 status columns
* Use consistent formatting and styling
* Improve user decision-making with more information
- Enable --disable-shared-experts-fusion by default
* Change default value from False to True
* Users can still override with --enable-shared-experts-fusion
* [feat]: improve kt chat with performance metrics and better CJK support
- Add performance metrics display after each response
* Total time, TTFT (Time To First Token), TPOT (Time Per Output Token)
* Accurate input/output token counts using model tokenizer
* Fallback to estimation if tokenizer unavailable
* Metrics shown in dim style (not prominent)
- Fix Chinese character input issues
* Replace Prompt.ask() with console.input() for better CJK support
* Fixes backspace deletion showing half-characters
- Suppress NumPy subnormal warnings
* Filter "The value of the smallest subnormal" warnings
* Cleaner CLI output on certain hardware environments
* [fix]: correct TTFT measurement in kt chat
- Move start_time initialization before API call
- Previously start_time was set when receiving the first chunk, causing TTFT ≈ 0ms
- Now correctly measures time from request sent to first token received
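A hedged sketch of the corrected measurement; the helper and its arguments are hypothetical, only the timing order comes from the fix:
```python
import time
from typing import Callable, Iterable

def measure_stream(chunks: Iterable[str], count_tokens: Callable[[str], int]):
    """Return (total_s, ttft_s, tpot_s) for a lazily streamed chat response.

    `chunks` is assumed to be a lazy generator whose first iteration sends the
    request, so the clock must start before iterating (the bug fixed above).
    """
    start = time.perf_counter()      # before the request, not at the first chunk
    first_token = None
    output_tokens = 0
    for chunk in chunks:
        if first_token is None:
            first_token = time.perf_counter()
        output_tokens += count_tokens(chunk)
    total = time.perf_counter() - start
    ttft = (first_token - start) if first_token is not None else total
    tpot = (total - ttft) / max(output_tokens - 1, 1)
    return total, ttft, tpot
```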
* [docs]: add Clawdbot integration guide - an enterprise-grade AI assistant deployment solution for KTransformers
* [docs]: emphasize Kimi K2.5 as the recommended core model, highlighting enterprise-grade inference capability
* [docs]: add link to the Clawdbot Feishu (Lark) integration tutorial
* [feat]: improve CLI table display, model verification, and chat experience
- Add sequence number (#) column to all model tables by default
- Filter kt edit to show only MoE GPU models (exclude AMX)
- Extend kt model verify to check *.json and *.py files in addition to weights
- Fix re-verification bug where repaired files caused false failures
- Suppress tokenizer debug output in kt chat token counting
* [fix]: fix CPU cores
---------
Co-authored-by: skqliao <skqliao@gmail.com>
* [feat]: Enhance CPU feature detection and support for AVX512 extensions
- Added cmake/DetectCPU.cmake for automatic CPU feature detection.
- Updated CMakeLists.txt to include auto-detection logic for AVX512 features.
- Modified install.sh to include new AVX512_VBMI option for FP8 MoE.
- Enhanced _cpu_detect.py to support progressive matching of CPU variants (see the sketch after this list).
- Created scripts/check_cpu_features.py for manual CPU feature checks.
- Updated setup.py to reflect changes in CPU variant building and environment variables.
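A hedged sketch of progressive variant matching; the variant table and flag spellings are illustrative (Linux /proc/cpuinfo uses e.g. "avx512f", "avx512_vnni", "avx512_bf16"), not the exact contents of _cpu_detect.py:
```python
VARIANT_REQUIREMENTS = [
    ("avx512-bf16", {"avx512f", "avx512_vnni", "avx512_bf16"}),
    ("avx512-vnni", {"avx512f", "avx512_vnni"}),
    ("avx512",      {"avx512f"}),
    ("avx2",        {"avx2"}),
]

def pick_cpu_variant(cpu_flags):
    # Progressive matching: walk from the most specific prebuilt variant to
    # the most generic one and take the first whose features are all present.
    for name, required in VARIANT_REQUIREMENTS:
        if required <= set(cpu_flags):
            return name
    return "generic"
```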
* [fix](kt-kernel): Add conditional inclusion of FP8 MoE for AVX512 BF16 support
* [chore](kt-kernel): update project version to 0.5.0 in CMakeLists.txt and version.py
* [fix](kt-kernel): fix AVX512 CPU instruction set detection
* [feat](kt-kernel): AVX512 fallback kernel for RAW-INT4
* [fix](kt-kernel): fix setup version issue
* [fix](kt-kernel): update install for custom build
* [docs](kt-kernel): new installation guide for various cpu instruction set
* [fix](kt-kernel): fix _mm512_dpbusd_epi32_compat fallback implementation
* [style](kt-kernel): clang format
* fix(amx): add BufferASmallKGroupImpl to fix buffer overflow in from_mat
The original BufferAKGroupImpl::from_mat writes 64 bytes per K_STEP iteration,
which overflows the buffer when K_STEP=32 (as used by GemmKernel224Int4SmallKGroup).
BufferASmallKGroupImpl overrides from_mat to write only 32 bytes per iteration.
* perf(k2-moe): optimize memory allocation with pooled buffers
- Replace per-expert buffer allocation with shared memory pools
- Dynamically assign buffer slices based on activated experts
- Add group_size inference from scale tensor shape in amx.py
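A hedged sketch of the group_size inference described above; the tensor layout assumption (one scale entry per group along the input dimension) may differ from the actual amx.py code:
```python
def infer_group_size(weight_shape, scale_shape):
    in_features = weight_shape[-1]
    num_groups = scale_shape[-1]
    if in_features % num_groups != 0:
        raise ValueError("in_features must be divisible by the number of scale groups")
    return in_features // num_groups

# e.g. a (4096, 7168) weight with (4096, 56) scales implies group_size = 128
```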
* delete kimi k2 forward test
* add TODO comment for pool_count_ calculation