prima.cpp: Speeding up 70B-level LLM inference on low-resource everyday home clusters

License: MIT

prima.cpp is a magic trick that lets you run 70B-level LLMs on your everyday devices: 💻 laptops, 🖥️ desktops, 📱 phones, and tablets (GPU or no GPU, it's all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!

Worried about OOM or your device freezing? Never again! prima.cpp keeps memory pressure below 10%, so you can run very large models while enjoying TikTok (if you don't mind the inference speed).

How about speed? prima.cpp is built on llama.cpp, but it's up to 15x faster! 🚀 On our modest devices, QwQ-32B generates 11 tokens per second, and Llama 3-70B generates 1.5 tokens per second, about the same pace as audiobook apps, from slow to fast speaking. We plan to power a home Siri soon, so we can have private chats without privacy concerns.

And, if your devices are more powerful, you could unlock even more possibilities, like running LLM agents right in your home! If you do, we'd love to hear about it; just share your cluster setup and token throughput with us!

Table 1: Home cluster configurations.

|                  | D1          | D2        | D3        | D4                   |
|------------------|-------------|-----------|-----------|----------------------|
| Device           | Mac M1      | Laptop    | Desktop   | Mate40Pro            |
| OS               | MacOS (UMA) | Linux     | Linux     | Linux (on HarmonyOS) |
| CPU              | Apple M1    | Intel i9  | Intel i9  | Kirin 9000           |
| CPU Cores        | 8           | 8         | 16        | 8                    |
| RAM (available)  | 2.4 GiB     | 4.1 GiB   | 9.7 GiB   | 1.9 GiB              |
| Disk Read Speed  | 0.72 GB/s   | 2.98 GB/s | 3.17 GB/s | 1.37 GB/s            |
| GPU Type         | Apple Metal | 3070      | 2080TI    | -                    |
| VRAM (available) | -           | 8 GiB     | 11 GiB    | -                    |

Device D4 runs inside a Termux-simulated Linux. Device D1 reads disk data in random mode and D2~D4 read in sequential mode.

Table 2: Token latency for Llama models.

| Model       | llama.cpp | exo    | dllama | prima.cpp |
|-------------|-----------|--------|--------|-----------|
| Llama 3-8B  | 15 ms     | 263 ms | 459 ms | 54 ms     |
| Llama 3-14B | 20 ms     | -      | -      | 65 ms     |
| Llama 1-30B | 202 ms    | -      | -      | 72 ms     |
| Llama 3-45B | 328 ms    | -      | -      | 233 ms    |
| Llama 3-60B | 7965 ms   | -      | -      | 468 ms    |
| Llama 1-65B | 8807 ms   | -      | -      | 569 ms    |
| Llama 3-70B | 10120 ms  | OOM    | OOM    | 674 ms    |

Table 3: Token latency for Qwen 2.5, QwQ, and DeepSeek R1 models.

| Model                         | llama.cpp | exo      | dllama | prima.cpp |
|-------------------------------|-----------|----------|--------|-----------|
| Qwen-2.5-7B                   | 14 ms     | 86 ms    | -      | 44 ms     |
| DeepSeek-R1-Distill-Qwen-7B   | 14 ms     | 68 ms    | -      | 52 ms     |
| DeepSeek-R1-Distill-Llama-8B  | 14 ms     | 77 ms    | 435 ms | 59 ms     |
| Qwen-2.5-14B                  | 23 ms     | 31710 ms | -      | 65 ms     |
| DeepSeek-R1-Distill-Qwen-14B  | 24 ms     | 23475 ms | -      | 76 ms     |
| Qwen-2.5-32B and QwQ-32B      | 224 ms    | OOM      | -      | 89 ms     |
| DeepSeek-R1-Distill-Qwen-32B  | 232 ms    | OOM      | -      | 93 ms     |
| DeepSeek-R1-Distill-Llama-70B | 10978 ms  | OOM      | -      | 724 ms    |
| Qwen-2.5-72B                  | 12227 ms  | OOM      | -      | 867 ms    |

In the current implementation, each device is assigned at least one model layer. For Llama 3-8B, for example, this leads to a 1:1:29:1 split, which makes prima.cpp less efficient on small models. In future updates, we will support a 0:0:32:0 split and remove idle devices, so llama.cpp becomes a special case of prima.cpp when serving small models.

Key Features

  • Run larger models with low memory pressure: Model weights are lazily loaded with mmap, and the OS frees page cache on demand, so you can run models of any size with low memory pressure.
  • Faster speed on small-scale, heterogeneous and cheap home clusters:
    • GPU & CPU Offloading: If a device has a GPU, you can use both GPU and CPU for inference. For example, when VRAM is full, we can offload some model layers to RAM.
    • Piped-ring parallelism with prefetching: Prefetch upcoming layer weights to overlap disk loading latency and use advanced piped-ring parallelism to prevent the "prefetch-release" effect. This new parallelism improves pipeline parallelism by using a ring structure and allows devices to run multiple cycles to predict a new token.
    • Heterogeneity-aware workload distribution: A scheduler is designed to optimize workload distribution based on each device's computing power, disk speed, memory, and OS (the OS will affect the disk speed and the memory management strategy). It decides how many model layers a device should handle and how many should run on GPU (if available).
    • Quantization: We now support Q4K and IQ1 quantization (GGUF format) and are exploring a Q4K-IQ1 hybrid for a better balance between performance and speed.
  • Supported Models: We now support hot models like the Llama, Qwen (and QwQ), and DeepSeek series. More will be added in future updates.
  • Cross-Platform: The cluster can consist of devices with different OSs, including macOS, Linux, Android, HarmonyOS, etc. Currently, Android and HarmonyOS devices require Termux, and Windows support will be added in a future update.

Models

Here are the models we have tested so far. You can also try more on Hugging Face!

Llama

Qwen 2.5 / QwQ

DeepSeek

How to Use?

Prerequisites

Before using this project, ensure you have the following dependencies installed:

  • gcc >= 9.4.0
  • make >= 4.2.1
  • cmake >= 3.16.3
  • fio >= 3.16 (used for disk speed tests; see the optional check after this list)
  • zmq >= 4.3.2 (used for cross-device communication)
  • HiGHS >= 1.9.0 (used for automatic workload distribution)
  • CUDA (optional, if you have a GPU)
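
If you want to manually sanity-check the sequential disk read speed that the scheduler relies on, a simple fio test works (an optional example; prima.cpp performs its own disk profiling, and the directory below is just a placeholder for wherever you plan to store model files):

# Sequential read test; fio lays out a 1 GiB test file named seqread.0.0 and reads it back
fio --name=seqread --directory=/path/to/model/dir --rw=read --bs=1M --size=1G --direct=1
# Remove the test file afterwards
rm -f /path/to/model/dir/seqread.0.0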

Linux (e.g., Ubuntu):

sudo apt update -y && sudo apt install -y gcc-9 make cmake fio git wget libzmq3-dev
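
To confirm that the installed tools meet the version requirements above, you can check them directly (a quick optional check):

gcc --version
make --version
cmake --version
fio --version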

For HiGHS, download and install from source:

git clone https://github.com/ERGO-Code/HiGHS.git
cd HiGHS
mkdir build && cd build
cmake ..
make -j$(nproc)
sudo make install
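
On Linux, if the later prima.cpp build cannot find the HiGHS shared library, refreshing the dynamic linker cache usually helps (an optional step, not needed on every system):

sudo ldconfig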

macOS:

brew install gcc make cmake fio git wget highs zeromq

Build, Download, and Test

First, clone our repo from GitHub:

git clone https://github.com/Lizonghang/prima.cpp.git
cd prima.cpp

Then, run the following command to build the project:

# If you are on the device with rank 0, USE_HIGHS=1 must be added:
make USE_HIGHS=1 -j$(nproc)

# If you have CUDA installed, add GGML_CUDA=1:
make GGML_CUDA=1 -j$(nproc)  

# For macOS with very large models, disabling Metal might be better:
make LLAMA_NO_METAL=1 -j$(nproc)  

# To enable debug mode, add LLAMA_DEBUG=1:
make LLAMA_DEBUG=1 -j$(nproc) 

# Otherwise, just use:
make -j$(nproc) 
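
These options can be combined as needed. For example, a head device (rank 0) with an NVIDIA GPU would likely build with both HiGHS and CUDA enabled (this assumes the Makefile options compose as usual; adjust to your setup):

# Head device (rank 0) with CUDA
make USE_HIGHS=1 GGML_CUDA=1 -j$(nproc)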

To test if it works, we download a GGUF model file from Hugging Face (e.g., qwq-32b-q4_k_m.gguf):

mkdir download  # You can put it in any other path, but try to put it on an SSD if possible.
wget https://huggingface.co/Qwen/QwQ-32B-GGUF/resolve/main/qwq-32b-q4_k_m.gguf -P download/

Note: If both an SSD and an HDD are available, put this project and the model files on the SSD.

After downloading, run the following command to launch the inference task (if running on a single device, prima.cpp degrades to llama.cpp):

./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -p "what is edge AI?" -n 256 -ngl 30

Adjust -ngl according to your VRAM capacity. Here, the VRAM is 11 GiB, so setting -ngl to a maximum of 30 will not cause the GPU to OOM. If there is no GPU, just omit it. For other parameters, please refer to llama.cpp.
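
For example, on a CPU-only device the same test can be run by simply dropping -ngl (same model and parameters as above):

./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -p "what is edge AI?" -n 256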

Run on Multiple Devices

To run on more home devices, first connect them to the same local Wi-Fi. For example, assume we have 4 devices with IP addresses and ranks as follows:

  • Rank 0: 192.168.1.2 (acts as the head device, which initiates the request)
  • Rank 1: 192.168.1.3 (worker device)
  • Rank 2: 192.168.1.4 (worker device)
  • Rank 3: 192.168.1.5 (worker device)

graph LR;
    Rank0["Rank 0 (192.168.1.2)"] --> Rank1["Rank 1 (192.168.1.3)"];
    Rank1 --> Rank2["Rank 2 (192.168.1.4)"];
    Rank2 --> Rank3["Rank 3 (192.168.1.5)"];
    Rank3 --> Rank0;

These devices are physically fully connected, since they are all on the same Wi-Fi network, but logically they follow a ring communication topology.
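
Before launching, it is worth confirming that the devices can actually reach each other over the LAN (a quick optional check using the example IPs above):

# From the rank 0 device (192.168.1.2), check that each worker is reachable
ping -c 3 192.168.1.3
ping -c 3 192.168.1.4
ping -c 3 192.168.1.5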

If possible, disable the firewall, or at least make sure the required ports (e.g., 9000, 10000) are not blocked.
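
For example, on Ubuntu with ufw you could open the required ports instead of disabling the firewall entirely (the exact ports depend on your configuration; 9000 and 10000 follow the example above):

sudo ufw allow 9000/tcp
sudo ufw allow 10000/tcp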