prima.cpp

mirror of https://github.com/Lizonghang/prima.cpp.git synced 2025-09-05 15:49:04 +00:00

Author	SHA1	Message	Date
Li, Zonghang	86ca21e49c	server: fix bugs when running speculative decoding	2025-07-13 21:52:59 +08:00
Li, Zonghang	b019a707b8	server: fix bugs	2025-07-13 13:42:24 +08:00
DeEMO	0cf87c8837	fix: set cache_prompt default to true	2025-07-06 10:05:24 +08:00
DeEMO	ca5996e7a6	fix: slot id	2025-06-30 09:35:35 +00:00
DeEMO	b4929d510a	fix: args in speculative	2025-06-30 09:35:35 +00:00
DeEMO	9bf6565df4	fix: load draft model first	2025-06-30 09:35:35 +00:00
DeEMO	d248f3c40e	fix: some fields in cparams_draft	2025-06-30 09:35:35 +00:00
DeEMO	2e8e42a5ad	Add speculative decoding support to the server and command-line interfaces	2025-06-30 09:35:35 +00:00
Zonghang Li	1ea2d61a97	speedup: add arg --keep-out-in-cuda to run the output layer on CUDA	2025-06-28 10:58:18 +04:00
Li, Zonghang	e8d3e5a631	update README	2025-06-27 20:16:30 +04:00
Zonghang Li	11ce0d58f7	fix compute buffer estimate: don't reverse CUDA VRAM for output layer	2025-06-27 12:42:16 +00:00
Li, Zonghang	3a03549fed	update README	2025-06-26 22:37:08 +04:00
Li, Zonghang	ba59a1a07a	update README	2025-06-26 22:33:28 +04:00
Li, Zonghang	aacfa8a231	fix compute buffer estimate: reserve 300 MiB VRAM to avoid potential OOM	2025-06-26 20:45:45 +04:00
Li, Zonghang	a05022c05a	communication: use barrier instead of manually adding delay	2025-06-26 17:30:47 +04:00
Li, Zonghang	3f27a25340	topo rebuild: add a delay to avoid packet interleaving	2025-06-26 14:50:58 +04:00
Li, Zonghang	729870fcd7	topo rebuild: add a delay to avoid packet interleaving	2025-06-26 14:47:34 +04:00
Li, Zonghang	50807fd4e1	halda: handle infeasible solution with weak device	2025-06-26 08:56:31 +04:00
Li, Zonghang	72701ae872	fix compute buffer estimate: reserve 200 MiB VRAM to avoid potential OOM	2025-06-24 20:39:49 +04:00
Li, Zonghang	4dde8458cf	fix compute buffer estimate: reserve 100 MiB VRAM to avoid potential OOM	2025-06-24 19:29:10 +04:00
Li, Zonghang	90b1079d78	fix compute_buffer estimate: remove unused memory for CUDA device	2025-06-24 16:37:16 +04:00
Li, Zonghang	16ba3564ce	fix compute_buffer estimate: add context GPU usage	2025-06-24 16:09:59 +04:00
Li, Zonghang	c926088d6a	fix compute buffer estimate: test without highs	2025-06-22 16:27:55 +04:00
Zonghang Li	45e8b0420c	fix compute buffer estimate: tested on cuda	2025-06-22 08:10:57 +00:00
Li, Zonghang	80e5b71b48	fix compute buffer estimate: tested on metal	2025-06-20 13:43:55 +04:00
Zonghang Li	dd589561b4	improve the computing buffer estimate	2025-06-19 08:02:43 +00:00
Li, Zonghang	0b4ffdfde5	Merge branch 'dev'	2025-06-17 09:40:27 +04:00
DeEMO	deeec668b8	fix: n_worker in draft model (cherry picked from commit 921ad2b453b24b715ad5db6a703fb3df65fdcb80)	2025-06-17 13:23:20 +08:00
Zonghang Li	2b902f89bd	fix: change default ip to `127.0.0.1` & improve args for setting ports	2025-06-17 08:23:25 +04:00
DeEMO	67c4f70357	fix: add log when serving as a proxy	2025-06-17 12:08:53 +08:00
DeEMO	6ff38b2a0c	add args: data-port and signal-port	2025-06-17 12:00:04 +08:00
DeEMO	104e3b2356	fix: replace localhost to 127.0.0.1	2025-06-17 11:27:58 +08:00
Li, Zonghang	fbbc30c950	Merge branch 'speculative' into dev	2025-06-16 13:27:36 +04:00
Zonghang Li	dc797e94f5	Fix speculative decoding. Power prima.cpp with speculative decoding: Further speeds up by up to 80%	2025-06-16 12:11:12 +04:00
Li, Zonghang	dfb1feb54e	update README	2025-06-16 12:09:07 +04:00
Li, Zonghang	45de284f3d	Merge branch 'fix' into speculative	2025-06-14 18:57:17 +04:00
Li, Zonghang	f38cfc625c	Merge branch 'fix' into dev	2025-06-14 18:56:36 +04:00
Li, Zonghang	b5ccd62135	fix n_gpu_layers allocation errors	2025-06-14 18:55:53 +04:00
Li, Zonghang	0a535cbdc1	Merge branch 'speculative' of github.com:Lizonghang/prima.cpp into speculative	2025-06-13 13:31:12 +04:00
Li, Zonghang	c9cae626cf	speculative: free sockets and send stop signal when inference ends	2025-06-13 13:30:29 +04:00
Li, Zonghang	2687ef3126	speculative: free sockets and send stop signal when inference ends	2025-06-13 11:25:42 +04:00
Li, Zonghang	dc875bbef9	fix speculative decoding	2025-06-13 08:18:12 +04:00
Zonghang Li	ba29717613	add feature: keep the forwarder if its previous device cannot directly connect to its next device. feat: nodes attempt connections during topology rebuild while preserving forwarders	2025-06-12 16:57:35 +04:00
DeEMO	d4618de991	fix: block when free socket	2025-06-12 12:26:10 +00:00
DeEMO	2039e3b0c1	fix: send and recv meta	2025-06-12 12:26:10 +00:00
DeEMO	d6c8d322cd	fix try_connect	2025-06-12 12:26:10 +00:00
DeEMO	d1b97f798e	support reconnection	2025-06-12 12:26:09 +00:00
Zonghang Li	e50b3aa473	Merge pull request #27 from Lizonghang/lizh_dev: Fix seq_id mismatch between the head and worker devices.	2025-06-11 17:12:08 +04:00
Li, Zonghang	3e6d831930	fix seq_id mismatch between head and worker devices	2025-06-11 17:10:21 +04:00
Li, Zonghang	fb9b1f2b00	reformat llama.cpp	2025-06-09 13:04:22 +04:00

4229 commits