liam
8ddc990668
⚡ fix server cache lens
2025-03-01 00:09:57 +08:00
Shuaiyi
a34a25d5cc
Delete unused code
2025-02-27 13:18:19 +00:00
wang jiahao
7a19f3b781
Merge pull request #721 from kvcache-ai/fix_temperature
fix temperature
2025-02-27 21:01:21 +08:00
qiyuxinlin
22df52e94e
fix temperature
2025-02-27 21:00:44 +08:00
Atream
85e2cc7bf4
Merge pull request #719 from kvcache-ai/fix-use-generation-json
use generation config from json file in official repo
2025-02-27 19:49:41 +08:00
Atream
e645d84794
use generation config from json file in official repo
2025-02-27 11:48:34 +00:00
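Background on the commit above: reading the sampling defaults shipped with the official checkpoint usually amounts to loading its generation_config.json. A minimal sketch using the Hugging Face transformers API; the path handling and fallback behaviour are illustrative, not the actual ktransformers code.

    # Hedged sketch: pull sampling defaults (temperature, top_p, ...) from the
    # generation_config.json bundled with the official model repo.
    from transformers import GenerationConfig

    def load_generation_defaults(model_path: str) -> GenerationConfig:
        try:
            return GenerationConfig.from_pretrained(model_path)
        except OSError:
            # Checkpoint ships no generation_config.json; fall back to library defaults.
            return GenerationConfig()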
lazymio
b121ca4df8
Fix according to upstream changes
2025-02-27 18:11:35 +08:00
wang jiahao
26f7b4af11
Merge branch 'main' into temperature_top_p_from_request
2025-02-27 18:08:55 +08:00
Atream
50c691297f
Merge pull request #622 from akemimadoka/fix-msvc
Fix missing macro definition for KTRANSFORMERS_USE_CUDA and <chrono> includes on MSVC
2025-02-27 17:42:00 +08:00
Atream
0422152cf3
Merge pull request #670 from akemimadoka/fix-win
Fix RuntimeError on Windows caused by integer overflow in np.prod
2025-02-27 17:40:27 +08:00
Atream
798e1d0cfa
Merge pull request #532 from xv44586/fix-sse-formatting
fix: fix SSE formatting
2025-02-27 12:19:23 +08:00
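As context for the SSE fix above, Server-Sent Events have a strict wire format: each event is a "data: <payload>" line terminated by a blank line, and OpenAI-compatible streams end with a [DONE] sentinel. A minimal, illustrative framing helper (not the actual server code):

    import json

    def sse_chunk(payload: dict) -> str:
        # One SSE event: "data: <json>" followed by a blank line.
        return f"data: {json.dumps(payload)}\n\n"

    def sse_done() -> str:
        # OpenAI-compatible streams finish with a literal [DONE] sentinel.
        return "data: [DONE]\n\n"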
Atream
f403cde6d4
Merge pull request #650 from ceerRep/main
feat: basic api key support
2025-02-27 12:16:53 +08:00
Atream
8db6a4d402
Merge branch 'main' into main
2025-02-27 12:12:32 +08:00
wang jiahao
3c8c580580
Merge pull request #691 from swu-hyk/ollama_api_chat
feat: implementation of chat routing for Ollama
2025-02-27 11:17:48 +08:00
Atream
90eb87b3fc
Update DeepSeek-V3-Chat-multi-gpu-marlin.yaml
2025-02-26 21:53:50 +08:00
swu-hyk
ec7e912fee
modify
2025-02-26 19:21:30 +08:00
swu-hyk
68e7df3a25
implementation of chat routing for Ollama
2025-02-26 17:05:00 +08:00
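For orientation, Ollama's chat endpoint is POST /api/chat, which streams newline-delimited JSON objects carrying an assistant message and a final done: true record. A rough FastAPI sketch of such a route; generate_chunks is a stand-in for the server's real generator, and every name here is illustrative rather than the actual implementation.

    import json
    import time
    from fastapi import APIRouter, Request
    from fastapi.responses import StreamingResponse

    router = APIRouter()

    async def generate_chunks(messages):
        # Stand-in for the real streaming backend.
        yield "Hello from the model."

    @router.post("/api/chat")
    async def ollama_chat(request: Request):
        body = await request.json()

        async def stream():
            async for text in generate_chunks(body.get("messages", [])):
                # Each chunk is one NDJSON line in Ollama's response format.
                yield json.dumps({
                    "model": body.get("model", ""),
                    "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                    "message": {"role": "assistant", "content": text},
                    "done": False,
                }) + "\n"
            yield json.dumps({"model": body.get("model", ""), "done": True}) + "\n"

        return StreamingResponse(stream(), media_type="application/x-ndjson")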
Chen Hongtao
9660b2cc1e
Merge pull request #685 from vproxy-tools/main
fix numa cpu distribution
2025-02-26 15:35:19 +08:00
liam
ffb86c66e3
⚡ fix experts torch
2025-02-26 15:04:40 +08:00
wkgcass
b2bff17775
fix numa cpu distribution
The NUMA node location is calculated from the total number of worker threads,
so we should always use the actual number of threads instead of clamping it with a min() op.
2025-02-26 14:49:57 +08:00
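The change itself lives in the C++ CPU-infer backend; the intent, sketched in Python with illustrative names, is to spread the real worker threads over NUMA nodes rather than a thread count clamped by min():

    def numa_node_for_thread(thread_idx: int, total_threads: int, numa_nodes: int) -> int:
        # Partition the actual worker-thread range evenly across NUMA nodes.
        # Distributing a min()-clamped thread count instead would leave some
        # nodes over- or under-subscribed.
        threads_per_node = (total_threads + numa_nodes - 1) // numa_nodes
        return thread_idx // threads_per_node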
akemimadoka
8817777e11
Fix RuntimeError on Windows caused by integer overflow in np.prod
2025-02-26 03:50:12 +08:00
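Background on this fix: on Windows, older NumPy releases use a 32-bit accumulator (C long) by default, so np.prod over a large tensor shape can overflow and trigger a downstream RuntimeError. The usual remedy, shown with an illustrative shape, is to force a 64-bit accumulator:

    import numpy as np

    shape = (61, 7168, 18432)  # illustrative: element count exceeds 2**31

    numel = np.prod(shape, dtype=np.int64)  # 64-bit accumulator, safe on Windows
    assert numel == 61 * 7168 * 18432       # np.prod(shape) alone may wrap on Windows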
liam
ddf3339339
⚡ release v0.2.2rc1
2025-02-25 22:06:36 +08:00
Azure
91c1619296
Merge branch 'develop-0.2.2' into support-fp8
Update README.md
2025-02-25 13:43:26 +00:00
Azure
2c0cce90d0
add fp8 multi gpu yaml example
2025-02-25 13:32:09 +00:00
Atream
d9b2895bd3
Merge branch 'fix-update-flashinfer_wrapper_local_chat' into develop-0.2.2
2025-02-25 12:47:48 +00:00
Atream
477ac28a9c
fix-update-flashinfer_wrapper_local_chat
2025-02-25 12:47:31 +00:00
Azure
7e5962af3d
fix fp8 multi gpu; update FAQ
2025-02-25 10:52:29 +00:00
liam
0ca0b99fab
⚡ update .gitignore; add docker dev container
2025-02-25 17:22:11 +08:00
Azure
5474be5299
Merge branch 'main' into develop-0.2.2
2025-02-25 09:04:22 +00:00
Atream
b443c7dfa2
Merge pull request #657 from kvcache-ai/feat-absorb-for-long-prefill
Feat absorb for long prefill
2025-02-25 16:53:21 +08:00
Atream
f4c198bd42
support absorb for long-context prefill
2025-02-25 08:52:02 +00:00
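For readers unfamiliar with the "absorb" optimization in MLA-style attention (the trick described in the DeepSeek-V2 paper, which this prefill path appears to build on): rather than decompressing the cached latents c_j^{KV} into full keys k_j = W^{UK} c_j^{KV} for a long prefill, the key up-projection is absorbed into the query, so scores are computed directly against the compressed latents:

    q_t^\top k_j = q_t^\top W^{UK} c_j^{KV} = \left( (W^{UK})^\top q_t \right)^\top c_j^{KV}

Only (W^{UK})^\top q_t is formed per query, and the per-token keys never need to be materialized, which keeps memory flat over long prefill contexts.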
Azure
36fbeee341
Update doc
2025-02-25 08:21:18 +00:00
ceerrep
f639fbc19e
feat: basic api key support
2025-02-25 14:11:39 +08:00
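A rough sketch of what bearer-token API-key checking on a FastAPI server can look like; the middleware shape, key source, and error body here are assumptions, not the actual ktransformers implementation.

    import secrets
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    API_KEY = "sk-example"  # in practice, read from server config or a CLI flag

    @app.middleware("http")
    async def check_api_key(request: Request, call_next):
        # Expect "Authorization: Bearer <key>" on every request.
        token = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
        if not secrets.compare_digest(token, API_KEY):
            return JSONResponse({"error": "invalid api key"}, status_code=401)
        return await call_next(request)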
Azure
4dc5518e4d
update fp8 kernel tutorial
2025-02-24 15:37:01 +00:00
Atream
7b2a6690ab
Merge pull request #608 from makllama/fix_musa_ext
musa: support bf16
2025-02-24 23:12:54 +08:00
Xiaodong Ye
f88c05a6f1
Ensure backward compatibility with Torch 2.2
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-24 21:55:30 +08:00
lazymio
07eb712a73
Left out
2025-02-24 21:51:14 +08:00
lazymio
91062a834f
Default values
2025-02-24 21:38:01 +08:00
lazymio
76487c4dcb
Revert repetition_penalty as it is not in API spec
2025-02-24 21:30:03 +08:00
lazymio
05ad288453
Also /chat/completions
2025-02-24 21:08:36 +08:00
lazymio
bf36547f98
Also allow repetition_penalty
2025-02-24 21:07:35 +08:00
lazymio
8704c09192
Allow temperature and top_p from requests
2025-02-24 21:01:33 +08:00
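The series of commits above lets clients override sampling parameters per request, OpenAI-style, while repetition_penalty stays out because it is not part of that API spec. A hedged sketch of the request model and fallback logic; the default values and names are illustrative.

    from typing import Optional
    from pydantic import BaseModel

    class CompletionRequest(BaseModel):
        prompt: str
        temperature: Optional[float] = None  # None means "use the server default"
        top_p: Optional[float] = None

    def resolve_sampling(req: CompletionRequest,
                         default_temperature: float = 0.6,
                         default_top_p: float = 0.95) -> tuple[float, float]:
        # Per-request values win; otherwise fall back to server-side defaults.
        temperature = req.temperature if req.temperature is not None else default_temperature
        top_p = req.top_p if req.top_p is not None else default_top_p
        return temperature, top_p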
Azure
ca7366d2db
Merge remote-tracking branch 'upstream/develop-0.2.2' into support-fp8
2025-02-24 11:58:10 +00:00
Azure
581a524f65
Add data loader to read special weights for fp8; add special weight processing script
2025-02-24 11:34:17 +00:00
Atream
e9b1216a9a
Merge branch 'main' into feat-absorb-for-long-prefill
2025-02-24 09:44:17 +00:00
Atream
f327695079
fix KExpertsMarlin on GPU without CUDA Graph
2025-02-24 09:30:54 +00:00
Yuhao Tsui
cea07d1998
Feat: Clear cache during weight loading to prevent OOM on GPUs with <=8GB VRAM
This change explicitly clears the CUDA cache during weight loading to mitigate memory fragmentation, which is particularly beneficial for low-VRAM GPUs.
2025-02-24 10:09:42 +08:00
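The pattern behind this commit, sketched with illustrative names: release PyTorch's cached CUDA blocks between per-layer loads so fragmentation does not tip a small (<=8 GB) GPU into OOM.

    import gc
    import torch

    def load_layers(layers, load_one_layer):
        for layer in layers:
            load_one_layer(layer)      # move / dequantize this layer's weights onto the GPU
            gc.collect()
            torch.cuda.empty_cache()   # drop cached blocks before touching the next layer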
akemimadoka
706e69f4fc
Fix missing macro definition for KTRANSFORMERS_USE_CUDA and <chrono> includes on MSVC
2025-02-24 01:37:50 +08:00
Atream
f5f6c6b95d
update yaml
2025-02-23 14:33:58 +00:00
Atream
e8e02e5ccc
support Moonlight
2025-02-23 14:21:18 +00:00