qiyuxinlin
03a65d6bea
roll back ktransformers backend; add max_tokens and max_completion_tokens params
2025-04-21 12:55:37 +00:00
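For context, the OpenAI chat API treats max_completion_tokens as the newer replacement for the legacy max_tokens field. A minimal sketch of how a server might reconcile the two request fields; the request model, field defaults, and function name here are illustrative assumptions, not code from this commit:

```python
from typing import Optional
from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Hypothetical request model: only the length-related fields are shown.
    max_tokens: Optional[int] = None             # legacy field
    max_completion_tokens: Optional[int] = None  # newer replacement


def resolve_max_new_tokens(req: ChatCompletionRequest, default: int = 2048) -> int:
    """Prefer max_completion_tokens, fall back to max_tokens, then to a server default."""
    if req.max_completion_tokens is not None:
        return req.max_completion_tokens
    if req.max_tokens is not None:
        return req.max_tokens
    return default
```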
qiyuxinlin
38e841900d
Move KV cache creation to balance_serve
2025-04-18 10:10:07 +00:00
sean.su
8699109129
Refactor the chat interface to support tool calling and parameter processing
...
Defined new data structures in chat.py to replace OpenAI's original implementation, adding support for tool calling.
Implemented logic for extracting and processing tool calls, enabling dynamic function invocation during conversations.
Added methods in balance_serve.py to retrieve sampling parameters, handling default values and edge cases.
Updated ktransformers.py and transformers.py to support the passing of tool parameters.
Modified the default value of top_p in config.py to 1.0 to increase generation diversity.
Extended the message model in chat.py to support the transmission of tool call information.
These changes enhance the system's flexibility and functionality, enabling more complex interaction patterns.
2025-04-14 15:23:37 +08:00
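The commit body above describes new chat data structures with tool-call support. A rough sketch of what such a message model might look like, following the OpenAI wire format; all class and field names are illustrative assumptions, not the actual definitions in chat.py:

```python
from typing import List, Optional
from pydantic import BaseModel


class FunctionCall(BaseModel):
    name: str
    arguments: str  # JSON-encoded arguments, as in the OpenAI wire format


class ToolCall(BaseModel):
    id: str
    type: str = "function"
    function: FunctionCall


class ChatMessage(BaseModel):
    role: str                                    # "system" | "user" | "assistant" | "tool"
    content: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = None  # set on assistant messages that invoke tools
    tool_call_id: Optional[str] = None           # set on "tool" role replies
```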
wangkuigang-yewu-cmss
4538bdae97
prevent RPC process from crashing on long prompts
...
When the prompt exceeds cache_len, the RPC process crashes and the whole service becomes unavailable.
Add a check so that over-long prompts are filtered out early in the request path.
2025-04-13 16:13:16 +08:00
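A minimal sketch of the kind of early rejection described above, assuming the tokenized prompt length and the configured cache_len are both known at request admission time; the function and error shape are hypothetical:

```python
from fastapi import HTTPException


def check_prompt_fits_cache(prompt_token_count: int, cache_len: int) -> None:
    """Reject over-long prompts before they ever reach the RPC/inference process."""
    if prompt_token_count >= cache_len:
        raise HTTPException(
            status_code=400,
            detail=f"prompt length {prompt_token_count} exceeds cache_len {cache_len}",
        )
```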
dongjw
ec03bcbd7f
fix flashinfer sampling error when temperature=0
2025-04-07 12:30:47 +08:00
dongjw
5c7ed7b579
fix top_p = 0 bug
2025-04-01 20:38:33 +08:00
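The two sampler fixes above guard against degenerate sampling parameters. A hedged sketch of one common way to handle them, treating temperature=0 as greedy decoding and clamping top_p away from 0; the actual behavior of the flashinfer-backed sampler may differ:

```python
def sanitize_sampling_params(temperature: float, top_p: float, eps: float = 1e-6):
    """Map degenerate values to safe equivalents before calling the sampler kernel."""
    greedy = temperature <= eps            # temperature == 0 means argmax decoding
    if greedy:
        temperature = 1.0                  # placeholder; the kernel is bypassed when greedy
    top_p = min(max(top_p, eps), 1.0)      # top_p == 0 would filter out every token
    return greedy, temperature, top_p


# Usage: decide between argmax and stochastic sampling.
greedy, temperature, top_p = sanitize_sampling_params(0.0, 0.0)
assert greedy and top_p > 0.0
```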
Azure-Tang
31677181c3
Fix ktransformers-server flashinfer wrapper position arg issue;
...
Fix db position issue
2025-04-01 07:30:23 +00:00
Atream
25cee5810e
add balance-serve, support concurrency
2025-03-31 22:55:32 +08:00
Atream
09c043d8a6
Merge pull request #842 from BITcyman/fix-openai_chat_completion
...
[fix] thread context bug
2025-03-07 22:56:19 +08:00
BITcyman
08a8b553d6
[fix] thread context bug
2025-03-07 14:52:16 +00:00
Atream
d453c320f1
fix flashinfer precision
2025-03-07 14:07:00 +00:00
BITcyman
299c4dca64
[update] support openai chat completion api
2025-03-07 08:51:09 +00:00
Azure
662c1e4c14
small fix for max new tokens
2025-03-05 09:25:41 +00:00
wang jiahao
48b9800790
Merge pull request #759 from 3wweiweiwu/fix_top_p_typo
...
fix typo for top_p
2025-03-02 13:58:11 +08:00
1668068727@qq.com
7cdf8139f0
fix ollama api temperature bug
2025-03-02 13:55:26 +08:00
Wix Woo
3aa0cfc29d
fix typo for top_p
2025-03-01 20:15:36 +00:00
Atream
ca1dc1e7d1
Merge branch 'main' into main
2025-03-01 23:24:10 +08:00
Atream
fa03ea48dd
Merge branch 'main' into feat-chunk-prefill-flashinfer
2025-03-01 11:35:09 +00:00
Atream
f35e8d41d8
support chunked prefill; support 139K context on 24 GB VRAM
2025-03-01 11:28:25 +00:00
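Chunked prefill splits a long prompt into fixed-size pieces so prefill activation memory stays bounded, which is what lets a very long context fit alongside the weights in 24 GB of VRAM. A simplified sketch of the loop; the model interface and chunk size are illustrative assumptions:

```python
import torch


@torch.no_grad()
def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int = 8192):
    """Run prefill chunk by chunk; the KV cache accumulates across chunks."""
    logits = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        # Hypothetical forward that appends to the model's KV cache in place.
        logits = model.forward_prefill(chunk, start_pos=start)
    return logits  # logits of the final chunk feed the first decode step
```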
liam
80e0536fb0
Merge branch 'main' of https://github.com/KMSorSMS/ktransformers into main
2025-03-01 00:12:21 +08:00
liam
8ddc990668
⚡ fix server cache lens
2025-03-01 00:09:57 +08:00
qiyuxinlin
22df52e94e
fix temperature
2025-02-27 21:00:44 +08:00
lazymio
b121ca4df8
Fix according to upstream changes
2025-02-27 18:11:35 +08:00
wang jiahao
26f7b4af11
Merge branch 'main' into temperature_top_p_from_request
2025-02-27 18:08:55 +08:00
Atream
b443c7dfa2
Merge pull request #657 from kvcache-ai/feat-absorb-for-long-prefill
...
Feat absorb for long prefill
2025-02-25 16:53:21 +08:00
Atream
f4c198bd42
support absorb for prefill long context
2025-02-25 08:52:02 +00:00
Azure
36fbeee341
Update doc
2025-02-25 08:21:18 +00:00
Azure
4dc5518e4d
update fp8 kernel tutorial
2025-02-24 15:37:01 +00:00
lazymio
07eb712a73
Left out
2025-02-24 21:51:14 +08:00
lazymio
76487c4dcb
Revert repetition_penalty as it is not in API spec
2025-02-24 21:30:03 +08:00
lazymio
bf36547f98
Also allow repetition_penalty
2025-02-24 21:07:35 +08:00
lazymio
8704c09192
Allow temperature and top_p from requests
2025-02-24 21:01:33 +08:00
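A sketch of the per-request sampling-parameter plumbing this cluster of commits introduces, falling back to server config defaults when a request omits a field; names and default values are illustrative (note that repetition_penalty was later reverted because it is not part of the OpenAI API spec):

```python
from typing import Optional
from pydantic import BaseModel


class SamplingDefaults(BaseModel):
    # Hypothetical server-side defaults, e.g. loaded from config.py.
    temperature: float = 0.6
    top_p: float = 1.0


def resolve_sampling(defaults: SamplingDefaults,
                     temperature: Optional[float],
                     top_p: Optional[float]) -> tuple[float, float]:
    """Use request values when present, otherwise fall back to config defaults."""
    return (
        temperature if temperature is not None else defaults.temperature,
        top_p if top_p is not None else defaults.top_p,
    )
```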
Atream
024009675e
Merge branch 'main' into feat-more-context
2025-02-22 06:17:39 +00:00
Atream
7e1fe256c8
optimize GPU
2025-02-21 05:06:57 +00:00
Atream
a529518346
clean PR code and disable flashinfer
2025-02-19 04:42:47 +00:00
ceerrep
73d072f609
Merge branch 'fix_precision_MLA' of https://github.com/kvcache-ai/ktransformers into server-prefix-cache
2025-02-18 11:44:28 +08:00
Xie Weiyu
f029588b61
fix server warmup
2025-02-18 11:39:45 +08:00
ceerrep
c70b6f4d5b
fix: use 'cuda:0' by default if torch_device is 'cuda'
2025-02-18 11:15:17 +08:00
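A one-line sketch of the device normalization this fix describes, assuming the configured device string is available as torch_device:

```python
def normalize_device(torch_device: str) -> str:
    """Map the bare 'cuda' device string to an explicit 'cuda:0'."""
    return "cuda:0" if torch_device == "cuda" else torch_device
```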
Xie Weiyu
c176e516b5
server mix mla
2025-02-17 20:40:28 +08:00
ceerrep
ee24eb8dc3
fix: fix server for triton kernel
2025-02-17 18:08:45 +08:00
ceerrep
cd9f7f8f34
fix: server: drop <think> tag in chat template
2025-02-17 14:25:27 +08:00
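A sketch of one way to strip reasoning tags from prior assistant turns, assuming DeepSeek-R1-style `<think>...</think>` blocks; the commit above does this inside the chat template rather than in Python, so this is only an equivalent illustration:

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def drop_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from an assistant message."""
    return THINK_BLOCK.sub("", text)


assert drop_think("<think>scratch work</think>final answer") == "final answer"
```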
ceerrep
bb0ccc7b1a
feat: add prefix cache for server
2025-02-17 00:10:55 +08:00
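Prefix caching reuses the KV cache of the longest previously seen token prefix so repeated system prompts and chat history are not prefilled again. A simplified sketch of the lookup; the data structure and granularity are assumptions, and real implementations usually match block-aligned prefixes:

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Maps a token-id prefix to an opaque KV-cache handle."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], object] = {}

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        self._entries[tuple(tokens)] = kv_handle

    def longest_match(self, tokens: List[int]):
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            handle = self._entries.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None
```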
MuWinds
f74c2d1d17
Fix use of deprecated torch.backends.cuda.sdp_kernel()
2025-02-15 12:41:51 +08:00
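For reference, the deprecated context manager and its replacement in PyTorch ≥ 2.3; the backends selected below are only an example, not necessarily what this commit chooses:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 128, 64)

# Deprecated spelling this commit replaces:
#   with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True,
#                                       enable_mem_efficient=False):
#       out = torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Current spelling:
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH]):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```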
hrz6976
2c3dcd9774
Add a lock to server inference()
2025-02-13 10:05:22 +00:00
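A minimal sketch of serializing access to a single-threaded inference engine with a lock, assuming an asyncio-based server; the actual lock placement and engine call in this commit may differ:

```python
import asyncio

_infer_lock = asyncio.Lock()


async def inference(prompt: str) -> str:
    """Only one request may drive the engine at a time."""
    async with _infer_lock:
        # run_engine is a placeholder for the real blocking generation call.
        return await asyncio.to_thread(run_engine, prompt)


def run_engine(prompt: str) -> str:
    return f"echo: {prompt}"
```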
liam
4385e85096
⚡ support force thinking
2025-02-12 12:43:53 +08:00
liam
6f3a39be08
⚡ update force_think config
2025-02-12 12:10:16 +08:00
liam
e536e1420d
⚡ update force_think
2025-02-12 11:42:55 +08:00
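The force_think commits above toggle whether the server forces the model to emit a reasoning block. A sketch of the usual trick, appending an opening <think> tag after the rendered prompt so a DeepSeek-R1-style model must start by reasoning; the config name and template details are assumptions:

```python
def apply_force_think(rendered_prompt: str, force_think: bool) -> str:
    """Append an opening <think> tag so generation starts inside a reasoning block."""
    return rendered_prompt + "<think>\n" if force_think else rendered_prompt
```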
liam
bf1d413be0
Merge branch 'feat-DeepSeekV3' of github.com:kvcache-ai/ktransformers into feat-DeepSeekV3
2025-02-08 13:17:10 +08:00
liam
c18ecd7b7f
⚡ add flushed print in local_chat output and change the default optimize YAML for DeepSeek-V3 to single GPU
2025-02-08 13:15:52 +08:00
Azure
c4d9bc6670
support KExpertsMarlin backend
2025-02-07 05:57:40 +00:00