prima.cpp

mirror of https://github.com/Lizonghang/prima.cpp.git synced 2025-09-06 10:39:02 +00:00

Author	SHA1	Message	Date
Lizonghang	2fbc0c8da3	fix: reset -ngl to 0 when GPU is not used and reformat code	2025-05-14 13:27:20 +04:00
DeEMO	168c14f4e8	remove unnecessary profile when `--lw` is specified	2025-04-17 13:49:09 +00:00
leeetao	fc1e2d3fc6	Added support for iq1s and iq1m quantization type	2025-04-17 10:27:53 +00:00
Zonghang Li	bcfdace59b	add args -k and --force	2025-03-11 20:44:36 +04:00
leeetao	e2cda4cfa0	Removed support for GGML_TYPE_Q4_0_4_4, GGML_TYPE_0_4_8, and GGML_TYPE_0_8_8 (GGUF no longer supports these types)	2025-03-01 14:31:38 +00:00
leeetao	7bf1b743fb	Merge branch 'dev' into lt_test Merge dev branch updates into local branch lt_test.	2025-02-23 08:35:45 +00:00
leeetao	f99e08b9fe	Added inference support for the Deepseek distilled model	2025-02-23 08:27:37 +00:00
Lizonghang	c84f9d29fe	use arg prefetch and remove arg unload	2025-02-12 17:04:41 +04:00
Lizonghang	1c0087e919	rename arg --keep-inp-out-in-metal to --keep-out-in-metal	2025-01-23 23:17:06 +04:00
Lizonghang	78a544d716	add metal mem limit	2025-01-23 16:08:52 +04:00
Lizonghang	facb4ea736	add option --keep-inp-out-in-metal and fix bugs in unmap	2025-01-22 11:15:19 +04:00
Zonghang Li	46e99218b4	add arg --cuda-mem	2025-01-16 09:15:34 +04:00
Lizonghang	3d75b8576e	add api llama_model_set_n_gpu_layers	2025-01-15 10:48:19 +04:00
Lizonghang	9279a2e3ff	fix error in llama_context_n_gpu_layers	2025-01-15 10:08:41 +04:00
Lizonghang	5d9aadf3d5	use highs to solve the allocation program	2025-01-15 10:04:04 +04:00
Lizonghang	8e9ab45458	fix model bytes counter	2024-12-10 14:57:48 +04:00
Lizonghang	d78fa427e7	add memory copy speed test	2024-12-09 10:07:42 +04:00
Zonghang Li	df813675d0	fix flops count and ram/vram speed test	2024-12-08 10:14:05 +04:00
Lizonghang	cd823546dd	llama_profile_device: add arg n_predict	2024-12-06 16:37:25 +04:00
Lizonghang	6f54a12c7d	add gpu support in llama_model_kvcache_size and llama_model_compute_buf_size	2024-11-29 21:06:32 +04:00
Lizonghang	68ecabc8c3	add cpu_read_ram_bw, metal_read_vram_bw, cuda_read_vram_bw	2024-11-29 19:04:53 +04:00
Lizonghang	0f73d12247	decrease compute buf from available memory	2024-11-29 11:15:54 +04:00
Lizonghang	45a1e55eec	reduce kv cache from available memory	2024-11-28 20:21:21 +04:00
Lizonghang	9a7bbce7ad	fix t_load_us	2024-11-28 15:55:21 +04:00
Lizonghang	9cd22177d0	remove arg test_file	2024-11-27 21:34:45 +04:00
Zonghang Li	f78c437172	add device_inp_embd_delay test, device_memory_bw test, device_cuda_memory_bw test,	2024-11-26 22:28:02 +04:00
Lizonghang	3fe00a16a0	count model flops for f32xf32, f16xf32, q4kxf32, q6kxf32	2024-11-24 13:13:32 +04:00
Zonghang Li	7ee1423006	add model_flops	2024-11-21 20:06:16 +04:00
Lizonghang	477ecf2084	add llama_model_n_flops	2024-11-20 19:40:27 +04:00
Lizonghang	5fae6ac36f	add cpu flops test	2024-11-09 20:53:42 +04:00
Lizonghang	2bd4d03aa8	add automatic layer window size assignment workflow	2024-11-08 18:21:03 +04:00
Lizonghang	53cb3a6069	synchronize device info	2024-11-07 22:02:01 +04:00
Lizonghang	ef7fdf70cc	add LLAMA_API llama_profile_device	2024-11-07 09:30:39 +04:00
Lizonghang	407c71ae52	add cpu and gpu profile	2024-11-06 20:42:28 +04:00
Lizonghang	76a7fc7527	support different window sizes	2024-10-26 12:34:14 +04:00
Lizonghang	c97ea10617	add mmap prefetch and unloading	2024-10-25 16:33:56 +04:00
Lizonghang	2a01ff5fb1	init	2024-10-23 09:42:32 +04:00
Georgi Gerganov	f4d2b8846a	llama : add reranking support (#9510 ) * py : add XLMRobertaForSequenceClassification [no ci] * py : fix scalar-tensor conversion [no ci] * py : fix position embeddings chop [no ci] * llama : read new cls tensors [no ci] * llama : add classigication head (wip) [no ci] * llama : add "rank" pooling type ggml-ci * server : add rerank endpoint ggml-ci * llama : aboud ggml_repeat during classification * rerank : cleanup + comments * server : accept /rerank endpoint in addition to /v1/rerank [no ci] * embedding : parse special tokens * jina : support v1 reranker * vocab : minor style ggml-ci * server : initiate tests for later ggml-ci * server : add docs * llama : add comment [no ci] * llama : fix uninitialized tensors * ci : add rerank tests ggml-ci * add reranking test * change test data * Update examples/server/server.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add `--reranking` argument * update server docs * llama : fix comment [no ci] ggml-ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-28 17:42:03 +03:00
Georgi Gerganov	739842703e	llama : add comment about thread-safety [no ci] (#9449 )	2024-09-28 15:13:42 +03:00
nopperl	9a913110cf	llama : add support for Chameleon (#8543 ) * convert chameleon hf to gguf * add chameleon tokenizer tests * fix lint * implement chameleon graph * add swin norm param * return qk norm weights and biases to original format * implement swin norm * suppress image token output * rem tabs * add comment to conversion * fix ci * check for k norm separately * adapt to new lora implementation * fix layer input for swin norm * move swin_norm in gguf writer * add comment regarding special token regex in chameleon pre-tokenizer * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix punctuation regex in chameleon pre-tokenizer (@compilade) Co-authored-by: compilade <git@compilade.net> * fix lint * trigger ci --------- Co-authored-by: compilade <git@compilade.net>	2024-09-28 15:08:43 +03:00
Georgi Gerganov	b0f27361f3	sampling : avoid expensive softmax during greedy sampling (#9605 ) * sampling : avoid expensive softmax during greedy sampling ggml-ci * speculative : fix default RNG seed + set sparams.n_probs * Update tests/test-sampling.cpp Co-authored-by: slaren <slarengh@gmail.com> * sampling : add clarifying comment [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 09:03:17 +03:00
Michael Podvitskiy	37f3a3810e	llama : add llama_n_head() (#9512 )	2024-09-17 09:23:30 +03:00
Georgi Gerganov	0abc6a2c25	llama : llama_perf + option to disable timings during decode (#9355 ) * llama : llama_perf + option to disable timings during decode ggml-ci * common : add llama_arg * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * perf : separate functions in the API ggml-ci * perf : safer pointer handling + naming update ggml-ci * minor : better local var name * perf : abort on invalid sampler pointer ggml-ci --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-13 09:53:38 +03:00
Gilad S.	bd35cb0ae3	feat: remove a sampler from a chain (#9445 ) * feat: remove a sampler from a chain * fix: return removed sampler * fix: safer casting	2024-09-13 03:54:49 +02:00
slaren	49006c67b4	llama : move random seed generation to the samplers (#9398 ) * llama_sampler_penalties : clamp penalty_last_n to zero	2024-09-10 18:04:25 +02:00
slaren	5fb5e24811	llama : minor sampling refactor (2) (#9386 )	2024-09-09 17:10:46 +02:00
Georgi Gerganov	df270ef745	llama : refactor sampling v2 (#9294 ) - Add `struct llama_sampler` and `struct llama_sampler_i` - Add `llama_sampler_` API - Add `llama_sampler_chain_` API for chaining multiple samplers - Remove `LLAMA_API_INTERNAL` - Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`	2024-09-07 15:16:19 +03:00
compilade	9bc6db28d0	ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151 ) * ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b * ggml-quants : faster 1.625 bpw AVX2 vec_dot Not using a lookup table anymore makes it match q4_0 speed. * gguf-py : fix formatting * llama : remove spaces on empty line * ggml-quants : subtract 1 when back in epi8 This makes the 1.625 bpw type go faster than q4_0. Still not the fastest. * ggml-quants : Q2_2 now faster than Q4_K on with AVX2 * ggml-quants : cleanup Q1_3 code formatting * ggml-quants : ARM NEON vec_dot for q2_2 and q1_3 * ggml-quants : use ceiling division when quantizing q1_3 * convert-hf : simplify BitNet pre-quantization This still results in the exact same tensor weights and scales, but it reveals some weirdness in the current algorithm. * convert-hf : allow converting the weird BitNet 1.3B Its FFN size is 5460 which is not convenient. The offending tensors are kept in F16, which makes the final model 5.01 bpw. * bitnet : replace 1.58b with b1.58, as in the paper * ggml-quants : fix build failure on Windows * ggml-quants : attempt to fix Arm 32-bit support * ggml : add some informative comments in q1_3 vec_dot * ggml : add TQ1_0 and TQ2_0 ternary quantization types * ggml : even faster TQ2_0 * ggml : also faster TQ1_0 Same optimization as for TQ2_0 by offsetting the sum instead of the weights. This makes TQ1_0 almost as fast as Q8_0 on AVX2. * ggml : fix build issues in certain environments * ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0 * ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat The compiler seems smart enough to use the same instruction even when using vget_high_s8 instead. * ggml : remove q1_3 and q2_2 No more 1.625 bpw and 2.000 bpw, now instead using 1.6875 bpw and 2.0625 bpw with TQ1_0 and TQ2_0, respectively. * llama : remove the separate scale tensors of BitNet b1.58 They won't be needed, since the remaining ternary quant types have built-in scales. * ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency * ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot Not yet tested on hardware which supports it, might not work or might not even compile. But also it might. It should make the performance better on recent ARM CPUs. * ggml-quants : remove comment about possible format change of TQ2_0 Making it slightly more convenient for AVX512 but less convenient for everything else is not worth the trouble. * gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0 This does not change anything for ternary models, since their values should never end up being in halfway cases anyway. * convert : allow direct conversion to TQ1_0 and TQ2_0 The token embeddings and output tensors are kept in F16 to allow quantizing them to Q4_K and Q6_K with llama-quantize. * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0 Q4_0 is not completely symmetric (so not lossless for ternary models), but it should be good enough. * ggml-quants : allow using ARM dot product instructions for TQ1_0 * ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support * ggml : remove unused ggml_mul special case It would otherwise conflict with the more general optimization coming with Mamba-2. * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators * test-backend-ops : add TQ1_0 and TQ2_0 comments for later Not yet adding uncommented, because some backends like SYCL and Metal do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT. (and Metal also doesn't handle it with GGML_OP_GET_ROWS) Support for TQ1_0 and TQ2_0 for other backends than CPU will be added in follow-up pull requests.	2024-09-05 21:48:47 -04:00
Molly Sophia	8f1d81a0b6	llama : support RWKV v6 models (#8980 ) * convert_hf_to_gguf: Add support for RWKV v6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add RWKV tokenization * Fix build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Do not use special tokens when matching in RWKV tokenizer * Fix model loading * Add (broken) placeholder graph builder for RWKV * Add workaround for kv cache * Add logits conversion to rwkv5 * Add rwkv5 layer norms * Add time mix KVRG & correct merge mistake * Add remaining time mix parameters * Add time mix output loading * Add placeholder llm_build_time_mix * Fix build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Load more tensors for rwkv v6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix rwkv tokenizer Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add unary operator Exp Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6 graph building Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add ``rescale_every_n_layers`` parameter Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add ``wkv.head_size`` key for RWKV so it doesn't reuse Mamba ssm parameters Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix offloading layers to CUDA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix parallel inferencing for RWKV Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Remove trailing whitespaces Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * build_rwkv: Avoid using inplace operations Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * convert_hf_to_gguf: rwkv: Avoid using ``eval`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * ggml: Add backward computation for unary op ``exp`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * build_rwkv6: Simplify graph Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Detect model.type Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Fix tensor loading for 7B/14B models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Fix group_norm assertion failure with Metal Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Clean up Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add quantization tensor exclusion Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Use the new advanced batch splits Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm`` Co-authored-by: compilade <git@compilade.net> * llama: rwkv6: Apply code style and misc changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Use class name ``Rwkv6Model`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Make use of key ``feed_forward_length`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Keep ``time_mix_w1/w2`` as F32 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Remove unused nodes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add lora for some supported tensors Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * rwkv : speed-up tokenization using trie * minor : style + indentation * llama: rwkv6: Avoid division by zero Co-authored-by: compilade <git@compilade.net> * ggml: rwkv_wkv: Avoid copying the state Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Layl Bongers <3094382+LaylBongers@users.noreply.github.com> Co-authored-by: compilade <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-01 17:38:17 +03:00
Sutou Kouhei	0ab30f8d82	llama : fix llama_split_mode enum values in main_gpu document (#9057 ) LLAMA_SPLIT_* were renamed to LLAMA_SPLIT_MODE_* in #5697.	2024-08-30 20:08:10 +02:00

1 2

75 commits