prima.cpp: Speeding up 70B-level LLM inference on low-resource everyday home clusters
prima.cpp is a magic trick that lets you run 70B-level LLMs on your everyday devices—💻 laptops, 🖥️ desktops, 📱 phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!
Worried about OOM or your device freezing? Never again! prima.cpp keeps its memory pressure below 10%, so you can run very large models while enjoying TikTok (if you don't mind the inference speed).
How about speed? prima.cpp is built on llama.cpp, but it's 15x faster! 🚀 On our modest devices, QwQ-32B generates 11 tokens per second and Llama 3-70B generates 1.5 tokens per second. That's about the same range as audiobook apps, from slow to fast narration. We plan to power a Home Siri soon, so we can have private chats without privacy concerns.
And if your devices are more powerful, you can unlock even more possibilities, like running LLM agents right in your home! If you do, we'd love to hear about it: just share your cluster setup and token throughput with us!
Table 1: Home cluster configurations.
| | D1 | D2 | D3 | D4 |
|---|---|---|---|---|
| Device | Mac M1 | Laptop | Desktop | Mate 40 Pro |
| OS | macOS (UMA) | Linux | Linux | Linux (on HarmonyOS) |
| CPU | Apple M1 | Intel i9 | Intel i9 | Kirin 9000 |
| CPU Cores | 8 | 8 | 16 | 8 |
| RAM (available) | 2.4 GiB | 4.1 GiB | 9.7 GiB | 1.9 GiB |
| Disk Read Speed | 0.72 GB/s | 2.98 GB/s | 3.17 GB/s | 1.37 GB/s |
| GPU Type | Apple Metal | 3070 | 2080 Ti | - |
| VRAM (available) | - | 8 GiB | 11 GiB | - |
Device D4 runs inside a Termux-simulated Linux. Device D1 reads disk data in random mode, while D2-D4 read in sequential mode.
Table 2: Token latency for Llama models.
| Model | llama.cpp | exo | dllama | prima.cpp |
|---|---|---|---|---|
| Llama 3-8B | 15 ms | 263 ms | 459 ms | 54 ms |
| Llama 3-14B | 20 ms | - | - | 65 ms |
| Llama 1-30B | 202 ms | - | - | 72 ms |
| Llama 3-45B | 328 ms | - | - | 233 ms |
| Llama 3-60B | 7965 ms | - | - | 468 ms |
| Llama 1-65B | 8807 ms | - | - | 569 ms |
| Llama 3-70B | 10120 ms | OOM | OOM | 674 ms |
Table 3: Token latency for Qwen 2.5, QwQ, and DeepSeek R1 models.
| Model | llama.cpp | exo | dllama | prima.cpp |
|---|---|---|---|---|
| Qwen-2.5-7B | 14 ms | 86 ms | - | 44 ms |
| DeepSeek-R1-Distill-Qwen-7B | 14 ms | 68 ms | - | 52 ms |
| DeepSeek-R1-Distill-Llama-8B | 14 ms | 77 ms | 435 ms | 59 ms |
| Qwen-2.5-14B | 23 ms | 31710 ms | - | 65 ms |
| DeepSeek-R1-Distill-Qwen-14B | 24 ms | 23475 ms | - | 76 ms |
| Qwen-2.5-32B and QwQ-32B | 224 ms | OOM | - | 89 ms |
| DeepSeek-R1-Distill-Qwen-32B | 232 ms | OOM | - | 93 ms |
| DeepSeek-R1-Distill-Llama-70B | 10978 ms | OOM | - | 724 ms |
| Qwen-2.5-72B | 12227 ms | OOM | - | 867 ms |
In the current implementation, each device is assigned at least one model layer. For Llama 3-8B, for example, this leads to a 1:1:29:1 split, which makes prima.cpp less efficient. In future updates we will support a 0:0:32:0 split and remove idle devices, so that llama.cpp becomes a special case of prima.cpp when serving small models.
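To make the "at least one layer per device" effect concrete, here is a minimal sketch of a proportional layer split. It is purely illustrative and is not prima.cpp's scheduler: the `split_layers` helper and the per-device speed scores below are made up for this example.

```python
# Illustrative only: a naive proportional layer split with an
# "at least one layer per device" constraint (NOT prima.cpp's scheduler).

def split_layers(total_layers, speed_scores, min_layers=1):
    """Assign layers to devices roughly in proportion to speed_scores."""
    n = len(speed_scores)
    total_score = sum(speed_scores)
    # Start from the minimum allocation, then hand out the rest by score.
    alloc = [min_layers] * n
    remaining = total_layers - min_layers * n
    for i, score in enumerate(speed_scores):
        alloc[i] += int(remaining * score / total_score)
    # Give any leftover layers (from rounding down) to the fastest device.
    alloc[speed_scores.index(max(speed_scores))] += total_layers - sum(alloc)
    return alloc

# Hypothetical speed scores for D1..D4: D3 dominates, so a 32-layer
# model ends up split roughly like 1:1:29:1.
print(split_layers(32, [1, 1, 40, 1]))   # -> [1, 1, 29, 1]
```

With a 0:0:32:0 split, the `min_layers=1` constraint would be dropped and slow devices would receive no layers at all.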
Key Features
- Run larger models with low memory pressure: Use mmap to lazily load model weights; the OS frees page cache on demand, so you can run models of any size with low memory pressure (see the sketch after this list).
- Faster speed on small-scale, heterogeneous, and cheap home clusters:
  - GPU & CPU Offloading: If a device has a GPU, both the GPU and CPU can be used for inference. For example, when VRAM is full, some model layers are offloaded to RAM.
  - Piped-ring parallelism with prefetching: Upcoming layer weights are prefetched to overlap disk loading with computation, and piped-ring parallelism prevents the "prefetch-release" effect. This parallelism improves on pipeline parallelism by using a ring structure that lets devices run multiple cycles to predict each new token.
  - Heterogeneity-aware workload distribution: A scheduler optimizes the workload distribution based on each device's computing power, disk speed, memory, and OS (the OS affects disk speed and the memory management strategy). It decides how many model layers a device should handle and how many of them should run on the GPU (if available).
- Quantization: We now support Q4K and IQ1 quantization (GGUF format) and are exploring a Q4K-IQ1 hybrid for a better balance between performance and speed.
- Supported Models: We now support hot models like the Llama, Qwen (and QwQ), and DeepSeek series. More will be added in future updates.
- Cross-Platform: The cluster can consist of devices with different OSs, including macOS, Linux, Android, HarmonyOS, etc. Android and HarmonyOS devices currently require Termux, and Windows support will be added in a future update.
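To illustrate the mmap-based lazy loading and prefetching ideas above, here is a minimal Python sketch. It is not prima.cpp's code: the file name and the byte ranges in `layer_ranges` are hypothetical placeholders, and the real GGUF layout is more involved.

```python
# Minimal sketch of mmap-based lazy loading with simple prefetching.
# NOT prima.cpp's implementation: MODEL_PATH and the byte ranges in
# layer_ranges are hypothetical placeholders.
import mmap

MODEL_PATH = "model.gguf"                  # hypothetical weight file
layer_ranges = {                           # layer id -> (offset, length)
    0: (0, 1 << 20),                       # made-up 1 MiB ranges
    1: (1 << 20, 1 << 20),
}

with open(MODEL_PATH, "rb") as f:
    # Map the file instead of reading it: no pages are loaded yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def prefetch(layer):
    """Hint the OS to read this layer's pages ahead of time (POSIX only)."""
    off, length = layer_ranges[layer]
    if hasattr(mm, "madvise"):             # Python 3.8+ on Linux/macOS
        mm.madvise(mmap.MADV_WILLNEED, off, length)

def load_layer(layer):
    """Touching the mapped slice faults its pages in on first access."""
    off, length = layer_ranges[layer]
    return mm[off:off + length]

prefetch(1)                # start disk I/O for the next layer
weights0 = load_layer(0)   # compute with layer 0 while layer 1 streams in
```

Because only the touched pages are resident and the OS can evict them on demand, memory pressure stays low even for models much larger than RAM.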
Models
Here are the models we have tested so far. You can also try more on Hugging Face!
Llama
- Llama 3-8B: Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
- Llama 3-14B: Llama-3-14B-Instruct-v1-Q4_K_M.gguf
- Llama 1-30B: upstage-llama-30b-instruct-2048.Q4_K_S.gguf
- Llama 3-45B: Llama-3-pruned-45B-Drobeta-Turnu-Severin-Q4_K_S.gguf
- Llama 3-60B: nyun-llama3-60B.Q4_K_S.gguf
- Llama 1-65B: llama-65b.Q4_K_S.gguf
- Llama 3-70B: Meta-Llama-3-70B-Instruct-Q4_K_S.gguf
Qwen 2.5 / QwQ
- Qwen 2.5-7B: Qwen2.5-7B-Instruct-Q4_K_M.gguf
- Qwen 2.5-14B: Qwen2.5-14B-Instruct-Q4_K_M.gguf
- Qwen 2.5-32B: Qwen2.5-32B-Instruct-Q4_K_M.gguf
- Qwen 2.5-72B: Qwen2.5-72B-Instruct-Q4_K_M.gguf
- QwQ-32B: qwq-32b-q4_k_m.gguf
DeepSeek
- DeepSeek R1-7B: deepseek-ai.DeepSeek-R1-Distill-Qwen-7B.Q4_K_M.gguf
- DeepSeek R1-8B: deepseek-ai.DeepSeek-R1-Distill-Llama-8B.Q4_K_M.gguf
- DeepSeek R1-14B: deepseek-ai.DeepSeek-R1-Distill-Qwen-14B.Q4_K_M.gguf
- DeepSeek R1-32B: deepseek-ai.DeepSeek-R1-Distill-Qwen-32B.Q4_K_M.gguf
- DeepSeek R1-70B: DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
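As one way to fetch the GGUF files above, you can use the `huggingface_hub` Python package. The `repo_id` and `filename` below are illustrative assumptions; check the model card on Hugging Face for the exact repository and file names of the quantization you want.

```python
# Download a GGUF file from Hugging Face.
# repo_id and filename are examples; verify them on the model card.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Qwen2.5-7B-Instruct-GGUF",   # example repository
    filename="Qwen2.5-7B-Instruct-Q4_K_M.gguf",     # example file name
    local_dir="./models",
)
print("Saved to", path)
```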
How to Use?
todo.
Prerequisites
Download, Compile and Run
Note: If an SSD and an HDD coexist on a device, put this project and the model files on the SSD.