Commit graph

91 commits

Author SHA1 Message Date
Zonghang Li
1ea2d61a97 speedup: add arg --keep-out-in-cuda to run the output layer on CUDA 2025-06-28 10:58:18 +04:00
Zonghang Li
45e8b0420c fix compute buffer estimate: tested on cuda 2025-06-22 08:10:57 +00:00
Li, Zonghang
80e5b71b48 fix compute buffer estimate: tested on metal 2025-06-20 13:43:55 +04:00
Zonghang Li
dd589561b4 improve the computing buffer estimate 2025-06-19 08:02:43 +00:00
DeEMO
6ff38b2a0c add args: data-port and signal-port 2025-06-17 12:00:04 +08:00
Li, Zonghang
fbbc30c950 Merge branch 'speculative' into dev 2025-06-16 13:27:36 +04:00
Li, Zonghang
dc875bbef9 fix speculative decoding 2025-06-13 08:18:12 +04:00
DeEMO
2039e3b0c1 fix: send and recv meta 2025-06-12 12:26:10 +00:00
DeEMO
d6c8d322cd fix try_connect 2025-06-12 12:26:10 +00:00
DeEMO
d1b97f798e support reconnection 2025-06-12 12:26:09 +00:00
Li, Zonghang
3e6d831930 fix seq_id mismatch between head and worker devices 2025-06-11 17:10:21 +04:00
Li, Zonghang
6439090920 reformat code 2025-06-03 23:53:24 +04:00
Li, Zonghang
7b0ededd24 Merge branch 'dev' into feat/auto-exit 2025-05-20 02:04:14 +08:00
Lizonghang
c54a6a0132 fix context shifting 2025-05-19 16:58:35 +04:00
DeEMO
cc46aa9828 update rank and n_world
Signed-off-by: DeEMO <yzzxrx@gmail.com>
2025-05-19 09:22:02 +00:00
DeEMO
fdd6694633 add topo rebuild
Signed-off-by: DeEMO <yzzxrx@gmail.com>
2025-05-19 09:21:53 +00:00
Lizonghang
2fbc0c8da3 fix: reset -ngl to 0 when GPU is not used and reformat code 2025-05-14 13:27:20 +04:00
DeEMO
168c14f4e8 remove unnecessary profile when --lw is specified 2025-04-17 13:49:09 +00:00
leeetao
fc1e2d3fc6 Added support for iq1s and iq1m quantization type 2025-04-17 10:27:53 +00:00
Zonghang Li
bcfdace59b add args -k and --force 2025-03-11 20:44:36 +04:00
leeetao
e2cda4cfa0 Removed support for GGML_TYPE_Q4_0_4_4, GGML_TYPE_0_4_8, and GGML_TYPE_0_8_8 (GGUF no longer supports these types) 2025-03-01 14:31:38 +00:00
leeetao
7bf1b743fb Merge branch 'dev' into lt_test
Merge dev branch updates into local branch lt_test.
2025-02-23 08:35:45 +00:00
leeetao
f99e08b9fe Added inference support for the Deepseek distilled model 2025-02-23 08:27:37 +00:00
Lizonghang
c84f9d29fe use arg prefetch and remove arg unload 2025-02-12 17:04:41 +04:00
Lizonghang
1c0087e919 rename arg --keep-inp-out-in-metal to --keep-out-in-metal 2025-01-23 23:17:06 +04:00
Lizonghang
78a544d716 add metal mem limit 2025-01-23 16:08:52 +04:00
Lizonghang
facb4ea736 add option --keep-inp-out-in-metal and fix bugs in unmap 2025-01-22 11:15:19 +04:00
Zonghang Li
46e99218b4 add arg --cuda-mem 2025-01-16 09:15:34 +04:00
Lizonghang
3d75b8576e add api llama_model_set_n_gpu_layers 2025-01-15 10:48:19 +04:00
Lizonghang
9279a2e3ff fix error in llama_context_n_gpu_layers 2025-01-15 10:08:41 +04:00
Lizonghang
5d9aadf3d5 use highs to solve the allocation program 2025-01-15 10:04:04 +04:00
Lizonghang
8e9ab45458 fix model bytes counter 2024-12-10 14:57:48 +04:00
Lizonghang
d78fa427e7 add memory copy speed test 2024-12-09 10:07:42 +04:00
Zonghang Li
df813675d0 fix flops count and ram/vram speed test 2024-12-08 10:14:05 +04:00
Lizonghang
cd823546dd llama_profile_device: add arg n_predict 2024-12-06 16:37:25 +04:00
Lizonghang
6f54a12c7d add gpu support in llama_model_kvcache_size and llama_model_compute_buf_size 2024-11-29 21:06:32 +04:00
Lizonghang
68ecabc8c3 add cpu_read_ram_bw, metal_read_vram_bw, cuda_read_vram_bw 2024-11-29 19:04:53 +04:00
Lizonghang
0f73d12247 decrease compute buf from available memory 2024-11-29 11:15:54 +04:00
Lizonghang
45a1e55eec reduce kv cache from available memory 2024-11-28 20:21:21 +04:00
Lizonghang
9a7bbce7ad fix t_load_us 2024-11-28 15:55:21 +04:00
Lizonghang
9cd22177d0 remove arg test_file 2024-11-27 21:34:45 +04:00
Zonghang Li
f78c437172 add device_inp_embd_delay test, device_memory_bw test, device_cuda_memory_bw test, 2024-11-26 22:28:02 +04:00
Lizonghang
3fe00a16a0 count model flops for f32xf32, f16xf32, q4kxf32, q6kxf32 2024-11-24 13:13:32 +04:00
Zonghang Li
7ee1423006 add model_flops 2024-11-21 20:06:16 +04:00
Lizonghang
477ecf2084 add llama_model_n_flops 2024-11-20 19:40:27 +04:00
Lizonghang
5fae6ac36f add cpu flops test 2024-11-09 20:53:42 +04:00
Lizonghang
2bd4d03aa8 add automatic layer window size assignment workflow 2024-11-08 18:21:03 +04:00
Lizonghang
53cb3a6069 synchronize device info 2024-11-07 22:02:01 +04:00
Lizonghang
ef7fdf70cc add LLAMA_API llama_profile_device 2024-11-07 09:30:39 +04:00
Lizonghang
407c71ae52 add cpu and gpu profile 2024-11-06 20:42:28 +04:00