Commit graph

  • 749934a5e0 fix: improve distributed sync robustness with retry mechanism and longer timeout tao_dev leeetao 2025-07-20 08:12:21 +00:00
  • 663ad2896d feat(comm): Enhance robustness of communication leeetao 2025-07-19 07:57:57 +00:00
  • 50a916f123 Fix batch metadata chain forwarding in distributed perplexity leeetao 2025-07-18 14:05:34 +00:00
  • bdf9d8e74b llama-server: fix k-shift when output overlength main dev Li, Zonghang 2025-07-17 21:03:41 +08:00
  • f032680cab Add support for speculative decoding in llama-server Zonghang Li 2025-07-13 21:54:41 +08:00
  • 86ca21e49c server: fix bugs when running speculative decoding feat/speculative_server Li, Zonghang 2025-07-13 21:52:59 +08:00
  • b019a707b8 server: fix bugs Li, Zonghang 2025-07-13 13:42:24 +08:00
  • 0cf87c8837 fix: set cache_prompt default to true DeEMO 2025-07-06 10:05:01 +08:00
  • da31acbe6a Modified batch backend_buffer size to actual size leeetao 2025-07-04 08:52:15 +00:00
  • 82787be7eb Enable distributed model perplexity measurement for different bit-width models with -lw and -ngl parameters leeetao 2025-07-01 09:19:19 +00:00
  • ca5996e7a6 fix: slot id DeEMO 2025-06-30 07:31:05 +00:00
  • b4929d510a fix: args in speculative DeEMO 2025-06-30 04:35:59 +00:00
  • 9bf6565df4 fix: load draft model first DeEMO 2025-06-27 06:30:57 +00:00
  • d248f3c40e fix: some fields in cparams_draft DeEMO 2025-06-27 06:07:47 +00:00
  • 2e8e42a5ad Add speculative decoding support to the server and command-line interfaces DeEMO 2025-06-23 20:36:32 +08:00
  • 1ea2d61a97 speedup: add arg --keep-out-in-cuda to run the output layer on CUDA Zonghang Li 2025-06-28 05:59:19 +00:00
  • e8d3e5a631 update README Li, Zonghang 2025-06-27 20:16:30 +04:00
  • 11ce0d58f7 fix compute buffer estimate: don't reverse CUDA VRAM for output layer Zonghang Li 2025-06-27 12:42:16 +00:00
  • 48b7f53abb Removed some unnecessary synchronization logic and added n_chunks communication content leeetao 2025-06-27 07:04:10 +00:00
  • 5b4a63abb8 fix: load draft model first DeEMO 2025-06-27 06:30:57 +00:00
  • 0cdf99828c fix: some fields in cparams_draft DeEMO 2025-06-27 06:07:47 +00:00
  • ea3ebcbca2 Add speculative decoding support to the server and command-line interfaces DeEMO 2025-06-23 20:36:32 +08:00
  • 3a03549fed update README Li, Zonghang 2025-06-26 22:37:08 +04:00
  • ba59a1a07a update README Li, Zonghang 2025-06-26 22:33:28 +04:00
  • aacfa8a231 fix compute buffer estimate: reserve 300 MiB VRAM to avoid potential OOM Li, Zonghang 2025-06-26 20:45:45 +04:00
  • a05022c05a communication: use barrier instead of manually adding delay Li, Zonghang 2025-06-26 17:30:47 +04:00
  • 3f27a25340 topo rebuild: add a delay to avoid packet interleaving Li, Zonghang 2025-06-26 14:50:58 +04:00
  • 729870fcd7 topo rebuild: add a delay to avoid packet interleaving Li, Zonghang 2025-06-26 14:47:34 +04:00
  • 50807fd4e1 halda: handle infeasible solution with weak device Li, Zonghang 2025-06-26 08:56:31 +04:00
  • 72701ae872 fix compute buffer estimate: reserve 200 MiB VRAM to avoid potential OOM Li, Zonghang 2025-06-24 20:39:49 +04:00
  • 4dde8458cf fix compute buffer estimate: reserve 100 MiB VRAM to avoid potential OOM Li, Zonghang 2025-06-24 19:29:10 +04:00
  • 90b1079d78 fix compute_buffer estimate: remove unused memory for CUDA device Li, Zonghang 2025-06-24 16:37:16 +04:00
  • 16ba3564ce fix compute_buffer estimate: add context GPU usage Li, Zonghang 2025-06-24 16:09:59 +04:00
  • a3becb586a Refactored the logic related to communication content and timing control leeetao 2025-06-24 10:40:37 +00:00
  • c926088d6a fix compute buffer estimate: test without highs Li, Zonghang 2025-06-22 16:27:55 +04:00
  • 45e8b0420c fix compute buffer estimate: tested on cuda Zonghang Li 2025-06-22 08:10:57 +00:00
  • 4b823775ec Fix compilation warnings and uninitialized variable in perplexity test leeetao 2025-06-22 06:58:12 +00:00
  • 80e5b71b48 fix compute buffer estimate: tested on metal Li, Zonghang 2025-06-20 13:43:55 +04:00
  • dd589561b4 improve the computing buffer estimate Zonghang Li 2025-06-19 08:02:43 +00:00
  • 2123879cfe Modify the perplexity test to a distributed version leeetao 2025-06-18 07:05:53 +00:00
  • 0b4ffdfde5 Merge branch 'dev' Li, Zonghang 2025-06-17 09:40:27 +04:00
  • deeec668b8 fix: n_worker in draft model DeEMO 2025-06-17 13:20:06 +08:00
  • 2b902f89bd fix: change default ip to 127.0.0.1 & improve args for setting ports Zonghang Li 2025-06-17 08:23:25 +04:00
  • 67c4f70357 fix: add log when serving as a proxy DeEMO 2025-06-17 12:08:53 +08:00
  • 6ff38b2a0c add args: data-port and signal-port DeEMO 2025-06-17 12:00:04 +08:00
  • 104e3b2356 fix: replace localhost to 127.0.0.1 DeEMO 2025-06-17 11:27:58 +08:00
  • fbbc30c950 Merge branch 'speculative' into dev Li, Zonghang 2025-06-16 13:27:36 +04:00
  • dc797e94f5 Fix speculative decoding Zonghang Li 2025-06-16 12:11:12 +04:00
  • dfb1feb54e update README speculative Li, Zonghang 2025-06-16 12:09:07 +04:00
  • 45de284f3d Merge branch 'fix' into speculative Li, Zonghang 2025-06-14 18:57:17 +04:00
  • f38cfc625c Merge branch 'fix' into dev Li, Zonghang 2025-06-14 18:56:36 +04:00
  • b5ccd62135 fix n_gpu_layers allocation errors Li, Zonghang 2025-06-14 18:55:53 +04:00
  • 0a535cbdc1 Merge branch 'speculative' of github.com:Lizonghang/prima.cpp into speculative Li, Zonghang 2025-06-13 13:31:12 +04:00
  • c9cae626cf speculative: free sockets and send stop signal when inference ends Li, Zonghang 2025-06-13 13:30:29 +04:00
  • 2687ef3126 speculative: free sockets and send stop signal when inference ends Li, Zonghang 2025-06-13 11:25:42 +04:00
  • dc875bbef9 fix speculative decoding Li, Zonghang 2025-06-13 08:18:12 +04:00
  • ba29717613 add feature: keep the forwarder if its previous device cannot directly connect to its next device. Zonghang Li 2025-06-12 16:57:35 +04:00
  • d4618de991 fix: block when free socket DeEMO 2025-06-11 21:47:58 +08:00
  • 2039e3b0c1 fix: send and recv meta DeEMO 2025-06-11 21:05:31 +08:00
  • d6c8d322cd fix try_connect DeEMO 2025-06-03 15:02:59 +08:00
  • d1b97f798e support reconnection DeEMO 2025-05-23 10:08:30 +00:00
  • e50b3aa473 Merge pull request #27 from Lizonghang/lizh_dev Zonghang Li 2025-06-11 17:12:08 +04:00
  • 3e6d831930 fix seq_id mismatch between head and worker devices Li, Zonghang 2025-06-11 17:10:21 +04:00
  • fb9b1f2b00 reformat llama.cpp Li, Zonghang 2025-06-09 13:04:22 +04:00
  • 32e1088162 Fixed the issue where RAM reading was 0 in v2 containers leeetao 2025-06-07 08:41:53 +00:00
  • fbf853341b add endpoint /v1/cancel Li, Zonghang 2025-06-07 11:34:38 +04:00
  • c8af1be27e Merge pull request #24 from Lizonghang/lizh_dev Zonghang Li 2025-06-07 01:02:10 +04:00
  • 22a6ddef13 fix batch decoding and dynamic batching Li, Zonghang 2025-06-07 00:53:56 +04:00
  • e56be76bdf assume only a single seq_id per token is needed Lizonghang 2025-06-07 00:42:44 +04:00
  • d8aea899d1 fix n_seq_id and seq_id Lizonghang 2025-06-06 23:58:03 +04:00
  • a1a2238831 add batch_all.n_seq_id and batch_all.seq_id to sync_meta Lizonghang 2025-06-06 23:36:53 +04:00
  • 68ecc8509d add batch_all.logits to sync_meta Lizonghang 2025-06-06 22:58:48 +04:00
  • 500e066a2f fix batch decoding and dynamic batching Lizonghang 2025-06-06 16:53:22 +04:00
  • e38f13ba17 Restored support for calculating perplexity in standalone test models leeetao 2025-06-06 12:12:17 +00:00
  • 4adc3791dc Merge branch 'main' into review review Lizonghang 2025-06-04 15:17:54 +04:00
  • ef1e10101e add test for IQ1 and doc for device selection Lizonghang 2025-06-04 15:12:00 +04:00
  • 27756ee182 fix: enable rolling back set assignment when all devices are assigned to M4 but no feasible solutions Lizonghang 2025-06-04 15:11:29 +04:00
  • 6439090920 reformat code Li, Zonghang 2025-06-03 23:53:24 +04:00
  • b6fdbd541b Merge branch 'dev' of github.com:Lizonghang/prima.cpp into dev Li, Zonghang 2025-06-03 18:20:17 +04:00
  • 9f0ec78a4b Merge branch 'dev' of github.com:Lizonghang/prima.cpp into dev Lizonghang 2025-06-03 18:18:53 +04:00
  • a01fafd126 Merge branch 'main' into dev Li, Zonghang 2025-06-03 17:56:47 +04:00
  • 1b3b6a506f fix: add warm-up in profiling to prevent init delay Li, Zonghang 2025-06-03 17:10:09 +04:00
  • b30f749e5e fix n_embd cannot be divided by quantized block size Li, Zonghang 2025-06-03 14:06:31 +04:00
  • e25c739ecf Merge pull request #19 from yezhizi/feat/auto-exit Li, Zonghang 2025-05-20 02:04:50 +08:00
  • 7b0ededd24 Merge branch 'dev' into feat/auto-exit Li, Zonghang 2025-05-20 02:04:14 +08:00
  • 421b3deca5 fix llama-cli pos sync Lizonghang 2025-05-19 18:08:27 +04:00
  • c54a6a0132 fix context shifting Lizonghang 2025-05-19 16:58:35 +04:00
  • 34eaa8224d fix: handle socket closure and connection in llama_rebuild_topo DeEMO 2025-05-16 20:48:51 +08:00
  • 8b61cb2fa4 fix: adapt the new topo DeEMO 2025-05-16 17:03:36 +08:00
  • df16b1876f refactor: add zmq helper to generate message DeEMO 2025-05-16 16:02:25 +08:00
  • 0ad009a2f4 fix: update serialization and deserialization for next_ip in device_info DeEMO 2025-05-16 15:26:16 +08:00
  • 4b36aef157 fix some bugs DeEMO 2025-05-15 06:25:12 +00:00
  • cc46aa9828 update rank and n_world DeEMO 2025-05-15 13:57:16 +08:00
  • fdd6694633 add topo rebuild DeEMO 2025-05-15 04:22:12 +00:00
  • 26bb86c09b Add tune_layer_allocation DeEMO 2025-05-14 07:50:04 +00:00
  • 07c4966a80 reduce fio data size to 1gb to speed up profiling Lizonghang 2025-05-14 21:26:01 +04:00
  • 2cc01483fd support server mode Lizonghang 2025-05-14 18:28:46 +04:00
  • ebd09fc83c Merge branch 'dev' Lizonghang 2025-05-14 14:19:53 +04:00
  • 258fb2d06b add QA: How to manually profile a device Lizonghang 2025-05-14 14:19:20 +04:00
  • 2fbc0c8da3 fix: reset -ngl to 0 when GPU is not used and reformat code Lizonghang 2025-05-14 13:27:20 +04:00