Li, Zonghang
dc875bbef9
fix speculative decoding
2025-06-13 08:18:12 +04:00
Li, Zonghang
3e6d831930
fix seq_id mismatch between head and worker devices
2025-06-11 17:10:21 +04:00
Li, Zonghang
6439090920
reformat code
2025-06-03 23:53:24 +04:00
Li, Zonghang
7b0ededd24
Merge branch 'dev' into feat/auto-exit
2025-05-20 02:04:14 +08:00
Lizonghang
c54a6a0132
fix context shifting
2025-05-19 16:58:35 +04:00
DeEMO
cc46aa9828
update rank and n_world
Signed-off-by: DeEMO <yzzxrx@gmail.com>
2025-05-19 09:22:02 +00:00
DeEMO
fdd6694633
add topo rebuild
Signed-off-by: DeEMO <yzzxrx@gmail.com>
2025-05-19 09:21:53 +00:00
Lizonghang
2fbc0c8da3
fix: reset -ngl to 0 when GPU is not used and reformat code
2025-05-14 13:27:20 +04:00
DeEMO
168c14f4e8
remove unnecessary profile when --lw is specified
2025-04-17 13:49:09 +00:00
leeetao
fc1e2d3fc6
Added support for the IQ1_S and IQ1_M quantization types
2025-04-17 10:27:53 +00:00
Zonghang Li
bcfdace59b
add args -k and --force
2025-03-11 20:44:36 +04:00
leeetao
e2cda4cfa0
Removed support for GGML_TYPE_Q4_0_4_4, GGML_TYPE_Q4_0_4_8, and GGML_TYPE_Q4_0_8_8 (GGUF no longer supports these types)
2025-03-01 14:31:38 +00:00
leeetao
7bf1b743fb
Merge branch 'dev' into lt_test
Merge dev branch updates into local branch lt_test.
2025-02-23 08:35:45 +00:00
leeetao
f99e08b9fe
Added inference support for the DeepSeek distilled model
2025-02-23 08:27:37 +00:00
Lizonghang
c84f9d29fe
use arg prefetch and remove arg unload
2025-02-12 17:04:41 +04:00
Lizonghang
1c0087e919
rename arg --keep-inp-out-in-metal to --keep-out-in-metal
2025-01-23 23:17:06 +04:00
Lizonghang
78a544d716
add metal mem limit
2025-01-23 16:08:52 +04:00
Lizonghang
facb4ea736
add option --keep-inp-out-in-metal and fix bugs in unmap
2025-01-22 11:15:19 +04:00
Zonghang Li
46e99218b4
add arg --cuda-mem
2025-01-16 09:15:34 +04:00
Lizonghang
3d75b8576e
add api llama_model_set_n_gpu_layers
2025-01-15 10:48:19 +04:00
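A minimal sketch of how a post-load GPU-layer setter like this might be used; the exact signature of llama_model_set_n_gpu_layers is an assumption here, not taken from the commit:

```cpp
// Hypothetical usage sketch -- the signature of
// llama_model_set_n_gpu_layers() is assumed, not taken from the commit.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0; // start fully on CPU

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    // Assumed API: adjust the GPU layer count after the model is loaded,
    // e.g. once device profiling has picked a better split.
    llama_model_set_n_gpu_layers(model, 16);

    llama_free_model(model);
    return 0;
}
```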
Lizonghang
9279a2e3ff
fix error in llama_context_n_gpu_layers
2025-01-15 10:08:41 +04:00
Lizonghang
5d9aadf3d5
use HiGHS to solve the allocation program
2025-01-15 10:04:04 +04:00
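For context, HiGHS is an open-source LP/MIP solver. The commit's actual allocation model is not shown, so this toy linear program only illustrates the HiGHS C++ call pattern:

```cpp
// Toy linear program solved with the HiGHS C++ API, illustrating the call
// pattern only; the actual allocation model in the commit is not shown here.
//   minimize    x0 + x1
//   subject to  x0 + 2*x1 >= 10,  0 <= x0, x1 <= 8
#include "Highs.h"
#include <cstdio>

int main() {
    HighsModel model;
    model.lp_.num_col_ = 2;
    model.lp_.num_row_ = 1;
    model.lp_.col_cost_  = {1.0, 1.0};
    model.lp_.col_lower_ = {0.0, 0.0};
    model.lp_.col_upper_ = {8.0, 8.0};
    model.lp_.row_lower_ = {10.0};
    model.lp_.row_upper_ = {1.0e30}; // 1e30 is treated as +infinity
    model.lp_.a_matrix_.format_ = MatrixFormat::kRowwise;
    model.lp_.a_matrix_.start_  = {0, 2};
    model.lp_.a_matrix_.index_  = {0, 1};
    model.lp_.a_matrix_.value_  = {1.0, 2.0};

    Highs highs;
    if (highs.passModel(model) != HighsStatus::kOk) return 1;
    if (highs.run() != HighsStatus::kOk) return 1;

    const HighsSolution & sol = highs.getSolution();
    printf("x0 = %g, x1 = %g\n", sol.col_value[0], sol.col_value[1]);
    return 0;
}
```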
Lizonghang
8e9ab45458
fix model bytes counter
2024-12-10 14:57:48 +04:00
Lizonghang
d78fa427e7
add memory copy speed test
2024-12-09 10:07:42 +04:00
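A self-contained sketch of the kind of memory-copy bandwidth probe such a test performs (illustrative only, not the harness from the commit; a real benchmark would warm up and average several runs):

```cpp
// Standalone memory-copy bandwidth probe (illustrative).
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const size_t n = 512ull * 1024 * 1024; // 512 MiB
    std::vector<char> src(n, 1), dst(n, 0);

    const auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), n);
    const auto t1 = std::chrono::steady_clock::now();

    const double s = std::chrono::duration<double>(t1 - t0).count();
    printf("copied %.0f MiB in %.3f s -> %.2f GiB/s\n",
           n / 1048576.0, s, n / s / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```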
Zonghang Li
df813675d0
fix flops count and ram/vram speed test
2024-12-08 10:14:05 +04:00
Lizonghang
cd823546dd
llama_profile_device: add arg n_predict
2024-12-06 16:37:25 +04:00
Lizonghang
6f54a12c7d
add gpu support in llama_model_kvcache_size and llama_model_compute_buf_size
2024-11-29 21:06:32 +04:00
Lizonghang
68ecabc8c3
add cpu_read_ram_bw, metal_read_vram_bw, cuda_read_vram_bw
2024-11-29 19:04:53 +04:00
Lizonghang
0f73d12247
subtract the compute buffer size from available memory
2024-11-29 11:15:54 +04:00
Lizonghang
45a1e55eec
subtract the KV cache size from available memory
2024-11-28 20:21:21 +04:00
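The KV cache that gets subtracted can be estimated with the usual transformer formula (two tensors per layer, K and V, each n_ctx by n_embd_gqa elements); a worked example assuming fp16 K/V and made-up model dimensions:

```cpp
// Back-of-envelope KV-cache size: 2 tensors (K and V) per layer, each
// n_ctx * n_embd_gqa elements. All numbers below are illustrative.
#include <cstdio>

int main() {
    const long n_layer    = 32;   // e.g. a 7B-class model
    const long n_ctx      = 4096; // context length
    const long n_embd_gqa = 1024; // n_head_kv * head_dim
    const long bytes_elem = 2;    // fp16 K/V

    const long kv_bytes = 2 * n_layer * n_ctx * n_embd_gqa * bytes_elem;
    printf("KV cache: %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
    // Prints 0.50 GiB: the amount a scheduler would subtract from
    // available memory before placing layers.
    return 0;
}
```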
Lizonghang
9a7bbce7ad
fix t_load_us
2024-11-28 15:55:21 +04:00
Lizonghang
9cd22177d0
remove arg test_file
2024-11-27 21:34:45 +04:00
Zonghang Li
f78c437172
add device_inp_embd_delay, device_memory_bw, and device_cuda_memory_bw tests
2024-11-26 22:28:02 +04:00
Lizonghang
3fe00a16a0
count model flops for f32xf32, f16xf32, q4kxf32, q6kxf32
2024-11-24 13:13:32 +04:00
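Counters like this rest on a standard identity: multiplying an (m x k) matrix by a (k x n) matrix costs about 2*m*n*k FLOPs (one multiply plus one add per term), regardless of whether the weights are f32, f16, or a k-quant. A minimal sketch:

```cpp
// FLOP count for one matrix multiply: (m x k) * (k x n) ~ 2*m*n*k FLOPs.
#include <cstdio>

long long matmul_flops(long long m, long long n, long long k) {
    return 2 * m * n * k;
}

int main() {
    // Illustrative projection: n_embd = 4096, a single token.
    printf("%lld FLOPs\n", matmul_flops(1, 4096, 4096)); // 33554432
    return 0;
}
```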
Zonghang Li
7ee1423006
add model_flops
2024-11-21 20:06:16 +04:00
Lizonghang
477ecf2084
add llama_model_n_flops
2024-11-20 19:40:27 +04:00
Lizonghang
5fae6ac36f
add cpu flops test
2024-11-09 20:53:42 +04:00
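A naive single-thread FLOPS probe along these lines, shown only to illustrate the idea (a real test would use vectorized kernels and multiple threads):

```cpp
// Naive CPU FLOPS probe (illustrative): time a long chain of
// multiply-adds and divide the operation count by the elapsed time.
#include <chrono>
#include <cstdio>

int main() {
    const long long iters = 200000000LL;
    volatile float acc = 1.0f; // volatile keeps the loop from being elided

    const auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < iters; ++i) {
        acc = acc * 0.999999f + 0.000001f; // 2 FLOPs per iteration
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double s = std::chrono::duration<double>(t1 - t0).count();
    printf("~%.2f GFLOPS (single thread, scalar)\n", 2.0 * iters / s / 1e9);
    return 0;
}
```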
Lizonghang
2bd4d03aa8
add automatic layer window size assignment workflow
2024-11-08 18:21:03 +04:00
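One plausible shape for such an assignment is to split layers across devices in proportion to a per-device speed score; a toy sketch of that idea only (the commit's real workflow also has to weigh memory limits):

```cpp
// Illustrative layer-window assignment: split n_layer across devices in
// proportion to a relative speed score. Not the commit's actual algorithm.
#include <cstdio>
#include <vector>

int main() {
    const int n_layer = 32;
    const std::vector<double> speed = {4.0, 2.0, 2.0}; // relative scores

    double total = 0;
    for (double s : speed) total += s;

    std::vector<int> win(speed.size());
    int assigned = 0;
    for (size_t i = 0; i < speed.size(); ++i) {
        win[i] = (int)(n_layer * speed[i] / total);
        assigned += win[i];
    }
    win[0] += n_layer - assigned; // give rounding leftovers to the fastest

    for (size_t i = 0; i < win.size(); ++i)
        printf("device %zu: %d layers\n", i, win[i]);
    return 0;
}
```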
Lizonghang
53cb3a6069
synchronize device info
2024-11-07 22:02:01 +04:00
Lizonghang
ef7fdf70cc
add LLAMA_API llama_profile_device
2024-11-07 09:30:39 +04:00
Lizonghang
407c71ae52
add cpu and gpu profile
2024-11-06 20:42:28 +04:00
Lizonghang
76a7fc7527
support different window sizes
2024-10-26 12:34:14 +04:00
Lizonghang
c97ea10617
add mmap prefetch and unloading
2024-10-25 16:33:56 +04:00
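On Linux and other Unix-likes, prefetching and unloading mmap'd weights is typically done with madvise() hints; a generic sketch of that mechanism, not the fork's exact code path:

```cpp
// Prefetching and dropping mmap'd pages with madvise() (generic sketch).
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const int fd = open("model.gguf", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) return 1;

    // Prefetch: hint the kernel to page the file in ahead of first use.
    madvise(addr, st.st_size, MADV_WILLNEED);

    // ... use the weights ...

    // Unload: tell the kernel these pages can be dropped from RAM.
    madvise(addr, st.st_size, MADV_DONTNEED);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```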
Lizonghang
2a01ff5fb1
init
2024-10-23 09:42:32 +04:00
Georgi Gerganov
f4d2b8846a
llama : add reranking support (#9510)
* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classification head (wip)
* llama : add "rank" pooling type
ggml-ci
* server : add rerank endpoint
ggml-ci
* llama : avoid ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style
ggml-ci
* server : initiate tests for later
ggml-ci
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests
ggml-ci
* add reranking test
* change test data
* Update examples/server/server.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci]
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
Georgi Gerganov
739842703e
llama : add comment about thread-safety [no ci] (#9449)
2024-09-28 15:13:42 +03:00
nopperl
9a913110cf
llama : add support for Chameleon (#8543)
* convert chameleon hf to gguf
* add chameleon tokenizer tests
* fix lint
* implement chameleon graph
* add swin norm param
* return qk norm weights and biases to original format
* implement swin norm
* suppress image token output
* rem tabs
* add comment to conversion
* fix ci
* check for k norm separately
* adapt to new lora implementation
* fix layer input for swin norm
* move swin_norm in gguf writer
* add comment regarding special token regex in chameleon pre-tokenizer
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* fix punctuation regex in chameleon pre-tokenizer (@compilade)
Co-authored-by: compilade <git@compilade.net>
* fix lint
* trigger ci
---------
Co-authored-by: compilade <git@compilade.net>
2024-09-28 15:08:43 +03:00
Georgi Gerganov
b0f27361f3
sampling : avoid expensive softmax during greedy sampling (#9605)
* sampling : avoid expensive softmax during greedy sampling
ggml-ci
* speculative : fix default RNG seed + set sparams.n_probs
* Update tests/test-sampling.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* sampling : add clarifying comment [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 09:03:17 +03:00
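The optimization rests on softmax being monotonic: the argmax of the raw logits equals the argmax of the probabilities, so greedy sampling can skip the exp()/normalize pass entirely. A tiny illustration:

```cpp
// Greedy sampling needs no softmax: argmax over raw logits picks the
// same token as argmax over softmax probabilities.
#include <cstdio>
#include <vector>

int argmax_token(const std::vector<float> & logits) {
    int best = 0;
    for (int i = 1; i < (int)logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

int main() {
    std::vector<float> logits = {0.1f, 2.3f, -1.0f, 2.2f};
    printf("greedy pick: token %d\n", argmax_token(logits)); // token 1
    return 0;
}
```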
Michael Podvitskiy
37f3a3810e
llama : add llama_n_head() (#9512)
2024-09-17 09:23:30 +03:00
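Usage is a one-line query on a loaded model; llama_n_head() is the public llama.h accessor added by this commit (the model path below is a placeholder):

```cpp
// Querying the number of attention heads via the new accessor.
#include "llama.h"
#include <cstdio>

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    printf("n_head = %d\n", llama_n_head(model));

    llama_free_model(model);
    return 0;
}
```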
Georgi Gerganov
0abc6a2c25
llama : llama_perf + option to disable timings during decode (#9355)
* llama : llama_perf + option to disable timings during decode
ggml-ci
* common : add llama_arg
* Update src/llama.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* perf : separate functions in the API
ggml-ci
* perf : safer pointer handling + naming update
ggml-ci
* minor : better local var name
* perf : abort on invalid sampler pointer
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-13 09:53:38 +03:00
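A short sketch of the API introduced here: the no_perf context flag toggles timing collection, and llama_perf_context_print() reports the timings after decoding (the model path is a placeholder):

```cpp
// Using the llama_perf API: per-context timing toggle plus a print helper.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.no_perf = false; // set true to disable timing collection
    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) return 1;

    // ... run llama_decode() as usual ...

    llama_perf_context_print(ctx); // prints load/eval timings

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```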