prima.cpp

mirror of https://github.com/Lizonghang/prima.cpp.git synced 2025-09-05 18:19:03 +00:00

Commit graph

749934a5e0 fix: improve distributed sync robustness with retry mechanism and longer timeout tao_dev leeetao 2025-07-20 08:12:21 +00:00
663ad2896d feat(comm): Enhance robustness of communication leeetao 2025-07-19 07:57:57 +00:00
50a916f123 Fix batch metadata chain forwarding in distributed perplexity leeetao 2025-07-18 14:05:34 +00:00
bdf9d8e74b llama-server: fix k-shift when output overlength main dev Li, Zonghang 2025-07-17 21:03:41 +08:00
f032680cab Add support for speculative decoding in llama-server Zonghang Li 2025-07-13 21:54:41 +08:00
86ca21e49c server: fix bugs when running speculative decoding feat/speculative_server Li, Zonghang 2025-07-13 21:52:59 +08:00
b019a707b8 server: fix bugs Li, Zonghang 2025-07-13 13:42:24 +08:00
0cf87c8837 fix: set cache_prompt default to true DeEMO 2025-07-06 10:05:01 +08:00
da31acbe6a Modified batch backend_buffer size to actual size leeetao 2025-07-04 08:52:15 +00:00
82787be7eb Enable distributed model perplexity measurement for different bit-width models with -lw and -ngl parameters leeetao 2025-07-01 09:19:19 +00:00
ca5996e7a6 fix: slot id DeEMO 2025-06-30 07:31:05 +00:00
b4929d510a fix: args in speculative DeEMO 2025-06-30 04:35:59 +00:00
9bf6565df4 fix: load draft model first DeEMO 2025-06-27 06:30:57 +00:00
d248f3c40e fix: some fields in cparams_draft DeEMO 2025-06-27 06:07:47 +00:00
2e8e42a5ad Add speculative decoding support to the server and command-line interfaces DeEMO 2025-06-23 20:36:32 +08:00
1ea2d61a97 speedup: add arg --keep-out-in-cuda to run the output layer on CUDA Zonghang Li 2025-06-28 05:59:19 +00:00
e8d3e5a631 update README Li, Zonghang 2025-06-27 20:16:30 +04:00
11ce0d58f7 fix compute buffer estimate: don't reverse CUDA VRAM for output layer Zonghang Li 2025-06-27 12:42:16 +00:00
48b7f53abb Removed some unnecessary synchronization logic and added n_chunks communication content leeetao 2025-06-27 07:04:10 +00:00
5b4a63abb8 fix: load draft model first DeEMO 2025-06-27 06:30:57 +00:00
0cdf99828c fix: some fields in cparams_draft DeEMO 2025-06-27 06:07:47 +00:00
ea3ebcbca2 Add speculative decoding support to the server and command-line interfaces DeEMO 2025-06-23 20:36:32 +08:00
3a03549fed update README Li, Zonghang 2025-06-26 22:37:08 +04:00
ba59a1a07a update README Li, Zonghang 2025-06-26 22:33:28 +04:00
aacfa8a231 fix compute buffer estimate: reserve 300 MiB VRAM to avoid potential OOM Li, Zonghang 2025-06-26 20:45:45 +04:00
a05022c05a communication: use barrier instead of manually adding delay Li, Zonghang 2025-06-26 17:30:47 +04:00
3f27a25340 topo rebuild: add a delay to avoid packet interleaving Li, Zonghang 2025-06-26 14:50:58 +04:00
729870fcd7 topo rebuild: add a delay to avoid packet interleaving Li, Zonghang 2025-06-26 14:47:34 +04:00
50807fd4e1 halda: handle infeasible solution with weak device Li, Zonghang 2025-06-26 08:56:31 +04:00
72701ae872 fix compute buffer estimate: reserve 200 MiB VRAM to avoid potential OOM Li, Zonghang 2025-06-24 20:39:49 +04:00
4dde8458cf fix compute buffer estimate: reserve 100 MiB VRAM to avoid potential OOM Li, Zonghang 2025-06-24 19:29:10 +04:00
90b1079d78 fix compute_buffer estimate: remove unused memory for CUDA device Li, Zonghang 2025-06-24 16:37:16 +04:00
16ba3564ce fix compute_buffer estimate: add context GPU usage Li, Zonghang 2025-06-24 16:09:59 +04:00
a3becb586a Refactored the logic related to communication content and timing control leeetao 2025-06-24 10:40:37 +00:00
c926088d6a fix compute buffer estimate: test without highs Li, Zonghang 2025-06-22 16:27:55 +04:00
45e8b0420c fix compute buffer estimate: tested on cuda Zonghang Li 2025-06-22 08:10:57 +00:00
4b823775ec Fix compilation warnings and uninitialized variable in perplexity test leeetao 2025-06-22 06:58:12 +00:00
80e5b71b48 fix compute buffer estimate: tested on metal Li, Zonghang 2025-06-20 13:43:55 +04:00
dd589561b4 improve the computing buffer estimate Zonghang Li 2025-06-19 08:02:43 +00:00
2123879cfe Modify the perplexity test to a distributed version leeetao 2025-06-18 07:05:53 +00:00
0b4ffdfde5 Merge branch 'dev' Li, Zonghang 2025-06-17 09:40:27 +04:00
deeec668b8 fix: n_worker in draft model DeEMO 2025-06-17 13:20:06 +08:00
2b902f89bd fix: change default ip to 127.0.0.1 & improve args for setting ports Zonghang Li 2025-06-17 08:23:25 +04:00
67c4f70357 fix: add log when serving as a proxy DeEMO 2025-06-17 12:08:53 +08:00
6ff38b2a0c add args: data-port and signal-port DeEMO 2025-06-17 12:00:04 +08:00
104e3b2356 fix: replace localhost to 127.0.0.1 DeEMO 2025-06-17 11:27:58 +08:00
fbbc30c950 Merge branch 'speculative' into dev Li, Zonghang 2025-06-16 13:27:36 +04:00
dc797e94f5 Fix speculative decoding Zonghang Li 2025-06-16 12:11:12 +04:00
dfb1feb54e update README speculative Li, Zonghang 2025-06-16 12:09:07 +04:00
45de284f3d Merge branch 'fix' into speculative Li, Zonghang 2025-06-14 18:57:17 +04:00
f38cfc625c Merge branch 'fix' into dev Li, Zonghang 2025-06-14 18:56:36 +04:00
b5ccd62135 fix n_gpu_layers allocation errors Li, Zonghang 2025-06-14 18:55:53 +04:00
0a535cbdc1 Merge branch 'speculative' of github.com:Lizonghang/prima.cpp into speculative Li, Zonghang 2025-06-13 13:31:12 +04:00
c9cae626cf speculative: free sockets and send stop signal when inference ends Li, Zonghang 2025-06-13 13:30:29 +04:00
2687ef3126 speculative: free sockets and send stop signal when inference ends Li, Zonghang 2025-06-13 11:25:42 +04:00
dc875bbef9 fix speculative decoding Li, Zonghang 2025-06-13 08:18:12 +04:00
ba29717613 add feature: keep the forwarder if its previous device cannot directly connect to its next device. Zonghang Li 2025-06-12 16:57:35 +04:00
d4618de991 fix: block when free socket DeEMO 2025-06-11 21:47:58 +08:00
2039e3b0c1 fix: send and recv meta DeEMO 2025-06-11 21:05:31 +08:00
d6c8d322cd fix try_connect DeEMO 2025-06-03 15:02:59 +08:00
d1b97f798e support reconnection DeEMO 2025-05-23 10:08:30 +00:00
e50b3aa473 Merge pull request #27 from Lizonghang/lizh_dev Zonghang Li 2025-06-11 17:12:08 +04:00
3e6d831930 fix seq_id mismatch between head and worker devices Li, Zonghang 2025-06-11 17:10:21 +04:00
fb9b1f2b00 reformat llama.cpp Li, Zonghang 2025-06-09 13:04:22 +04:00
32e1088162 Fixed the issue where RAM reading was 0 in v2 containers leeetao 2025-06-07 08:41:53 +00:00
fbf853341b add endpoint /v1/cancel Li, Zonghang 2025-06-07 11:34:38 +04:00
c8af1be27e Merge pull request #24 from Lizonghang/lizh_dev Zonghang Li 2025-06-07 01:02:10 +04:00
22a6ddef13 fix batch decoding and dynamic batching Li, Zonghang 2025-06-07 00:53:56 +04:00
e56be76bdf assume only a single seq_id per token is needed Lizonghang 2025-06-07 00:42:44 +04:00
d8aea899d1 fix n_seq_id and seq_id Lizonghang 2025-06-06 23:58:03 +04:00
a1a2238831 add batch_all.n_seq_id and batch_all.seq_id to sync_meta Lizonghang 2025-06-06 23:36:53 +04:00
68ecc8509d add batch_all.logits to sync_meta Lizonghang 2025-06-06 22:58:48 +04:00
500e066a2f fix batch decoding and dynamic batching Lizonghang 2025-06-06 16:53:22 +04:00
e38f13ba17 Restored support for calculating perplexity in standalone test models leeetao 2025-06-06 12:12:17 +00:00
4adc3791dc Merge branch 'main' into review review Lizonghang 2025-06-04 15:17:54 +04:00
ef1e10101e add test for IQ1 and doc for device selection Lizonghang 2025-06-04 15:12:00 +04:00
27756ee182 fix: enable rolling back set assignment when all devices are assigned to M4 but no feasible solutions Lizonghang 2025-06-04 15:11:29 +04:00
6439090920 reformat code Li, Zonghang 2025-06-03 23:53:24 +04:00
b6fdbd541b Merge branch 'dev' of github.com:Lizonghang/prima.cpp into dev Li, Zonghang 2025-06-03 18:20:17 +04:00
9f0ec78a4b Merge branch 'dev' of github.com:Lizonghang/prima.cpp into dev Lizonghang 2025-06-03 18:18:53 +04:00
a01fafd126 Merge branch 'main' into dev Li, Zonghang 2025-06-03 17:56:47 +04:00
1b3b6a506f fix: add warm-up in profiling to prevent init delay Li, Zonghang 2025-06-03 17:10:09 +04:00
b30f749e5e fix n_embd cannot be divided by quantized block size Li, Zonghang 2025-06-03 14:06:31 +04:00
e25c739ecf Merge pull request #19 from yezhizi/feat/auto-exit Li, Zonghang 2025-05-20 02:04:50 +08:00
7b0ededd24 Merge branch 'dev' into feat/auto-exit Li, Zonghang 2025-05-20 02:04:14 +08:00
421b3deca5 fix llama-cli pos sync Lizonghang 2025-05-19 18:08:27 +04:00
c54a6a0132 fix context shifting Lizonghang 2025-05-19 16:58:35 +04:00
34eaa8224d fix: handle socket closure and connection in llama_rebuild_topo DeEMO 2025-05-16 20:48:51 +08:00
8b61cb2fa4 fix: adapt the new topo DeEMO 2025-05-16 17:03:36 +08:00
df16b1876f refactor: add zmq helper to generate message DeEMO 2025-05-16 16:02:25 +08:00
0ad009a2f4 fix: update serialization and deserialization for next_ip in device_info DeEMO 2025-05-16 15:26:16 +08:00
4b36aef157 fix some bugs DeEMO 2025-05-15 06:25:12 +00:00
cc46aa9828 update rank and n_world DeEMO 2025-05-15 13:57:16 +08:00
fdd6694633 add topo rebuild DeEMO 2025-05-15 04:22:12 +00:00
26bb86c09b Add tune_layer_allocation DeEMO 2025-05-14 07:50:04 +00:00
07c4966a80 reduce fio data size to 1gb to speed up profiling Lizonghang 2025-05-14 21:26:01 +04:00
2cc01483fd support server mode Lizonghang 2025-05-14 18:28:46 +04:00
ebd09fc83c Merge branch 'dev' Lizonghang 2025-05-14 14:19:53 +04:00
258fb2d06b add QA: How to manually profile a device Lizonghang 2025-05-14 14:19:20 +04:00
2fbc0c8da3 fix: reset -ngl to 0 when GPU is not used and reformat code Lizonghang 2025-05-14 13:27:20 +04:00