Commit graph

6 commits

Author | SHA1 | Message | Date
ErvinXie
71f683acec
Support Native Kimi K2 Thinking (#1663)
* [feat]: fix k2 prefill

* Update Kimi-K2-Thinking.md

* Create Kimi-K2-Thinking-Native.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking-Native.md

* [perf] optimize K2 MoE weight loading with per-expert pointers

- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-05 21:53:05 +08:00
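The weight-loading optimization in the commit above can be sketched in pure Python. This is a hedged illustration only: the name `gate_projs` comes from the commit message, but the sizes are made up and a stdlib thread pool stands in for the C++ worker pool that performs the parallel memcpy for TP slicing.

```python
# Hypothetical sketch of per-expert pointer loading: instead of packing all
# expert weights into one contiguous tensor (the slow torch.stack().contiguous()
# path the commit removed), keep one buffer per expert and let a worker pool
# copy each expert's tensor-parallel (TP) slice concurrently.
from concurrent.futures import ThreadPoolExecutor

NUM_EXPERTS = 8          # illustrative sizes, not the real model's
ROWS_PER_EXPERT = 4
TP_RANK, TP_SIZE = 0, 2  # this rank keeps the first half of each row

# Source: one weight matrix (list of rows) per expert, as loaded from disk.
gate_projs = [
    [[e * 100 + r * 10 + c for c in range(6)] for r in range(ROWS_PER_EXPERT)]
    for e in range(NUM_EXPERTS)
]

def slice_expert(expert_weights):
    """Copy this rank's TP slice of one expert (stand-in for a memcpy)."""
    cols = len(expert_weights[0]) // TP_SIZE
    lo = TP_RANK * cols
    return [row[lo:lo + cols] for row in expert_weights]

# Per-expert pointer array: each entry is sliced independently, so the
# copies can run in parallel instead of forcing one big contiguous pack.
with ThreadPoolExecutor() as pool:
    sliced = list(pool.map(slice_expert, gate_projs))

assert len(sliced) == NUM_EXPERTS
assert len(sliced[0][0]) == 3  # half of 6 columns when TP_SIZE == 2
```

The design point the commit makes is that the expensive step was materializing one contiguous buffer in Python; with per-expert pointers, each copy is independent and parallelizable.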
Jiaqi Liao
46af8fcab5
[doc] fix kt parameters (#1629)
2025-11-19 16:41:57 +08:00
Atream
b67cc4095d
Change attention backend to 'flashinfer' in launch command
Updated the launch command to include 'flashinfer' as the attention backend.
2025-11-08 20:56:09 +08:00
Atream
0651dbda04
Simplify launch command by removing unused option
Removed the unused '--attention-backend triton' option from the launch command.
2025-11-08 16:54:18 +08:00
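The two commits above adjust the attention backend in the documented launch command: first dropping the unused `--attention-backend triton` option, then setting `flashinfer` explicitly. A hedged sketch of what the resulting command might look like (the model path and port are placeholders, not taken from the source):

```shell
# Illustrative SGLang launch command after the changes above;
# only '--attention-backend flashinfer' is confirmed by the commits.
python -m sglang.launch_server \
  --model-path /path/to/Kimi-K2-Thinking \
  --attention-backend flashinfer \
  --port 30000
```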
Atream
d6ee384fe2
Fix download link for Kimi-K2-Thinking weights
Updated the download link for AMX INT4 quantized weights.
2025-11-06 19:07:15 +08:00
Atream
d419024bb4
Add KTransformers SGLang inference documentation
Add documentation for KTransformers SGLang inference deployment, including installation steps, model download links, server launch instructions, and performance benchmarks.
2025-11-06 17:53:58 +08:00