qiyuxinlin
03a65d6bea
roll back ktransformers backend; add max_tokens and max_completion_tokens params
2025-04-21 12:55:37 +00:00
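For context, the OpenAI chat API treats max_completion_tokens as the newer replacement for the legacy max_tokens field. A minimal sketch of how a server might reconcile the two request fields; the request model, field defaults, and function name here are illustrative assumptions, not code from this commit:

```python
from typing import Optional
from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Hypothetical request model: only the length-related fields are shown.
    max_tokens: Optional[int] = None             # legacy field
    max_completion_tokens: Optional[int] = None  # newer replacement


def resolve_max_new_tokens(req: ChatCompletionRequest, default: int = 2048) -> int:
    """Prefer max_completion_tokens, fall back to max_tokens, then to a server default."""
    if req.max_completion_tokens is not None:
        return req.max_completion_tokens
    if req.max_tokens is not None:
        return req.max_tokens
    return default
```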
qiyuxinlin
38e841900d
Move KV cache creation to balance_serve
2025-04-18 10:10:07 +00:00
sean.su
8699109129
Refactor the chat interface to support tool calling and parameter processing
...
Defined new data structures in chat.py to replace OpenAI's original implementation, adding support for tool calling.
Implemented logic for extracting and processing tool calls, enabling dynamic function invocation during conversations.
Added methods in balance_serve.py to retrieve sampling parameters, handling default values and edge cases.
Updated ktransformers.py and transformers.py to support the passing of tool parameters.
Modified the default value of top_p in config.py to 1.0 to increase generation diversity.
Extended the message model in chat.py to support the transmission of tool call information.
These changes enhance the system's flexibility and functionality, enabling more complex interaction patterns.
2025-04-14 15:23:37 +08:00
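The commit body above describes new chat data structures with tool-call support. A rough sketch of what such a message model might look like, following the OpenAI wire format; all class and field names are illustrative assumptions, not the actual definitions in chat.py:

```python
from typing import List, Optional
from pydantic import BaseModel


class FunctionCall(BaseModel):
    name: str
    arguments: str  # JSON-encoded arguments, as in the OpenAI wire format


class ToolCall(BaseModel):
    id: str
    type: str = "function"
    function: FunctionCall


class ChatMessage(BaseModel):
    role: str                                    # "system" | "user" | "assistant" | "tool"
    content: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = None  # set on assistant messages that invoke tools
    tool_call_id: Optional[str] = None           # set on "tool" role replies
```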
wangkuigang-yewu-cmss
4538bdae97
prevent RPC process from crashing on long prompts
...
When the prompt exceeds cache_len, the RPC process crashes and the whole service becomes unavailable.
Add a check so that over-long prompts are filtered out early in the request path.
2025-04-13 16:13:16 +08:00
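A minimal sketch of the kind of early rejection described above, assuming the tokenized prompt length and the configured cache_len are both known at request admission time; the function and error shape are hypothetical:

```python
from fastapi import HTTPException


def check_prompt_fits_cache(prompt_token_count: int, cache_len: int) -> None:
    """Reject over-long prompts before they ever reach the RPC/inference process."""
    if prompt_token_count >= cache_len:
        raise HTTPException(
            status_code=400,
            detail=f"prompt length {prompt_token_count} exceeds cache_len {cache_len}",
        )
```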
dongjw
ec03bcbd7f
fix flashinfer sampling error when temperature=0
2025-04-07 12:30:47 +08:00
dongjw
5c7ed7b579
fix top_p = 0 bug
2025-04-01 20:38:33 +08:00
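The two sampler fixes above guard against degenerate sampling parameters. A hedged sketch of one common way to handle them, treating temperature=0 as greedy decoding and clamping top_p away from 0; the actual behavior of the flashinfer-backed sampler may differ:

```python
def sanitize_sampling_params(temperature: float, top_p: float, eps: float = 1e-6):
    """Map degenerate values to safe equivalents before calling the sampler kernel."""
    greedy = temperature <= eps            # temperature == 0 means argmax decoding
    if greedy:
        temperature = 1.0                  # placeholder; the kernel is bypassed when greedy
    top_p = min(max(top_p, eps), 1.0)      # top_p == 0 would filter out every token
    return greedy, temperature, top_p


# Usage: decide between argmax and stochastic sampling.
greedy, temperature, top_p = sanitize_sampling_params(0.0, 0.0)
assert greedy and top_p > 0.0
```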
Azure-Tang
31677181c3
Fix ktransformers-server flashinfer wrapper position arg issue;
...
Fix db position issue
2025-04-01 07:30:23 +00:00
Atream
25cee5810e
add balance-serve, support concurrency
2025-03-31 22:55:32 +08:00
Atream
09c043d8a6
Merge pull request #842 from BITcyman/fix-openai_chat_completion
...
[fix] thread context bug
2025-03-07 22:56:19 +08:00
BITcyman
08a8b553d6
[fix] thread context bug
2025-03-07 14:52:16 +00:00
Atream
d453c320f1
fix flashinfer precision
2025-03-07 14:07:00 +00:00
BITcyman
299c4dca64
[update] support openai chat completion api
2025-03-07 08:51:09 +00:00
Azure
662c1e4c14
small fix for max new tokens
2025-03-05 09:25:41 +00:00
wang jiahao
48b9800790
Merge pull request #759 from 3wweiweiwu/fix_top_p_typo
...
fix typo for top_p
2025-03-02 13:58:11 +08:00
1668068727@qq.com
7cdf8139f0
fix ollama api temperature bug
2025-03-02 13:55:26 +08:00
Wix Woo
3aa0cfc29d
fix typo for top_p
2025-03-01 20:15:36 +00:00
Atream
ca1dc1e7d1
Merge branch 'main' into main
2025-03-01 23:24:10 +08:00
Atream
fa03ea48dd
Merge branch 'main' into feat-chunk-prefill-flashinfer
2025-03-01 11:35:09 +00:00
Atream
f35e8d41d8
support chunked prefill; support 139K context on 24 GB VRAM
2025-03-01 11:28:25 +00:00
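Chunked prefill splits a long prompt into fixed-size pieces so prefill activation memory stays bounded, which is what lets a very long context fit alongside the weights in 24 GB of VRAM. A simplified sketch of the loop; the model interface and chunk size are illustrative assumptions:

```python
import torch


@torch.no_grad()
def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int = 8192):
    """Run prefill chunk by chunk; the KV cache accumulates across chunks."""
    logits = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        # Hypothetical forward that appends to the model's KV cache in place.
        logits = model.forward_prefill(chunk, start_pos=start)
    return logits  # logits of the final chunk feed the first decode step
```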
liam
80e0536fb0
Merge branch 'main' of https://github.com/KMSorSMS/ktransformers into main
2025-03-01 00:12:21 +08:00
liam
8ddc990668
⚡ fix server cache lens
2025-03-01 00:09:57 +08:00
qiyuxinlin
22df52e94e
fix temperature
2025-02-27 21:00:44 +08:00
lazymio
b121ca4df8
Fix according to upstream changes
2025-02-27 18:11:35 +08:00
wang jiahao
26f7b4af11
Merge branch 'main' into temperature_top_p_from_request
2025-02-27 18:08:55 +08:00
Atream
b443c7dfa2
Merge pull request #657 from kvcache-ai/feat-absorb-for-long-prefill
...
Feat absorb for long prefill
2025-02-25 16:53:21 +08:00
Atream
f4c198bd42
support absorb for prefill long context
2025-02-25 08:52:02 +00:00
Azure
36fbeee341
Update doc
2025-02-25 08:21:18 +00:00
Azure
4dc5518e4d
update fp8 kernel tutorial
2025-02-24 15:37:01 +00:00
lazymio
07eb712a73
Left out
2025-02-24 21:51:14 +08:00
lazymio
76487c4dcb
Revert repetition_penalty as it is not in API spec
2025-02-24 21:30:03 +08:00
lazymio
bf36547f98
Also allow repetition_penalty
2025-02-24 21:07:35 +08:00
lazymio
8704c09192
Allow temperature and top_p from requests
2025-02-24 21:01:33 +08:00
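A sketch of the per-request sampling-parameter plumbing this cluster of commits introduces, falling back to server config defaults when a request omits a field; names and default values are illustrative (note that repetition_penalty was later reverted because it is not part of the OpenAI API spec):

```python
from typing import Optional
from pydantic import BaseModel


class SamplingDefaults(BaseModel):
    # Hypothetical server-side defaults, e.g. loaded from config.py.
    temperature: float = 0.6
    top_p: float = 1.0


def resolve_sampling(defaults: SamplingDefaults,
                     temperature: Optional[float],
                     top_p: Optional[float]) -> tuple[float, float]:
    """Use request values when present, otherwise fall back to config defaults."""
    return (
        temperature if temperature is not None else defaults.temperature,
        top_p if top_p is not None else defaults.top_p,
    )
```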
Atream
024009675e
Merge branch 'main' into feat-more-context
2025-02-22 06:17:39 +00:00
Atream
7e1fe256c8
optimize GPU
2025-02-21 05:06:57 +00:00
Atream
a529518346
clean PR code and disable flashinfer
2025-02-19 04:42:47 +00:00
ceerrep
73d072f609
Merge branch 'fix_precision_MLA' of https://github.com/kvcache-ai/ktransformers into server-prefix-cache
2025-02-18 11:44:28 +08:00
Xie Weiyu
f029588b61
fix server warmup
2025-02-18 11:39:45 +08:00
ceerrep
c70b6f4d5b
fix: use 'cuda:0' by default if torch_device is 'cuda'
2025-02-18 11:15:17 +08:00
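A one-line sketch of the device normalization this fix describes, assuming the configured device string is available as torch_device:

```python
def normalize_device(torch_device: str) -> str:
    """Map the bare 'cuda' device string to an explicit 'cuda:0'."""
    return "cuda:0" if torch_device == "cuda" else torch_device
```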
Xie Weiyu
c176e516b5
server mix mla
2025-02-17 20:40:28 +08:00
ceerrep
ee24eb8dc3
fix: fix server for triton kernel
2025-02-17 18:08:45 +08:00
ceerrep
cd9f7f8f34
fix: server: drop <think> tag in chat template
2025-02-17 14:25:27 +08:00
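A sketch of one way to strip reasoning tags from prior assistant turns, assuming DeepSeek-R1-style `<think>...</think>` blocks; the commit above does this inside the chat template rather than in Python, so this is only an equivalent illustration:

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def drop_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from an assistant message."""
    return THINK_BLOCK.sub("", text)


assert drop_think("<think>scratch work</think>final answer") == "final answer"
```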
ceerrep
bb0ccc7b1a
feat: add prefix cache for server
2025-02-17 00:10:55 +08:00
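Prefix caching reuses the KV cache of the longest previously seen token prefix so repeated system prompts and chat history are not prefilled again. A simplified sketch of the lookup; the data structure and granularity are assumptions, and real implementations usually match block-aligned prefixes:

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Maps a token-id prefix to an opaque KV-cache handle."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], object] = {}

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        self._entries[tuple(tokens)] = kv_handle

    def longest_match(self, tokens: List[int]):
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            handle = self._entries.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None
```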
MuWinds
f74c2d1d17
Fix use of deprecated torch.backends.cuda.sdp_kernel()
2025-02-15 12:41:51 +08:00
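For reference, the deprecated context manager and its replacement in PyTorch ≥ 2.3; the backends selected below are only an example, not necessarily what this commit chooses:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 128, 64)

# Deprecated spelling this commit replaces:
#   with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True,
#                                       enable_mem_efficient=False):
#       out = torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Current spelling:
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH]):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```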
hrz6976
2c3dcd9774
Add a lock to server inference()
2025-02-13 10:05:22 +00:00
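A minimal sketch of serializing access to a single-threaded inference engine with a lock, assuming an asyncio-based server; the actual lock placement and engine call in this commit may differ:

```python
import asyncio

_infer_lock = asyncio.Lock()


async def inference(prompt: str) -> str:
    """Only one request may drive the engine at a time."""
    async with _infer_lock:
        # run_engine is a placeholder for the real blocking generation call.
        return await asyncio.to_thread(run_engine, prompt)


def run_engine(prompt: str) -> str:
    return f"echo: {prompt}"
```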
liam
4385e85096
⚡ support force thinking
2025-02-12 12:43:53 +08:00
liam
6f3a39be08
⚡ update force_think config
2025-02-12 12:10:16 +08:00
liam
e536e1420d
⚡ update force_think
2025-02-12 11:42:55 +08:00
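The force_think commits above toggle whether the server forces the model to emit a reasoning block. A sketch of the usual trick, appending an opening <think> tag after the rendered prompt so a DeepSeek-R1-style model must start by reasoning; the config name and template details are assumptions:

```python
def apply_force_think(rendered_prompt: str, force_think: bool) -> str:
    """Append an opening <think> tag so generation starts inside a reasoning block."""
    return rendered_prompt + "<think>\n" if force_think else rendered_prompt
```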
liam
bf1d413be0
Merge branch 'feat-DeepSeekV3' of github.com:kvcache-ai/ktransformers into feat-DeepSeekV3
2025-02-08 13:17:10 +08:00
liam
c18ecd7b7f
⚡ add flushed print in local_chat output and change the default optimize YAML for DeepSeek-V3 to single GPU
2025-02-08 13:15:52 +08:00
Azure
c4d9bc6670
support KExpertsMarlin backend
2025-02-07 05:57:40 +00:00