update README
parent aacfa8a231
commit ba59a1a07a

1 changed file (README.md) with 15 additions and 17 deletions
@@ -123,7 +123,7 @@ Before using this project, ensure you have the following dependencies installed:
 ```shell
 # Use apt in Linux and pkg in Termux
-sudo apt update -y && sudo apt install -y gcc-9 make cmake fio git wget libzmq3-dev
+sudo apt update -y && sudo apt install -y gcc-9 make cmake fio git wget libzmq3-dev curl
 ```
 
 For HiGHS, download and install from [source](https://github.com/ERGO-Code/HiGHS):
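For reference, building HiGHS from source typically follows the standard CMake flow sketched below; the README's exact commands are not shown in this hunk, so treat the paths and options as assumptions rather than the project's prescribed steps.

```shell
# Sketch: build and install HiGHS from source (assumed standard CMake steps)
git clone https://github.com/ERGO-Code/HiGHS.git
cd HiGHS
cmake -S . -B build
cmake --build build -j$(nproc)
sudo cmake --install build
sudo ldconfig  # refresh the shared-library cache, as in the next hunk
```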
@@ -141,7 +141,7 @@ sudo ldconfig
 **macOS:**
 
 ```shell
-brew install gcc make cmake fio git wget highs zeromq
+brew install gcc make cmake fio git wget highs zeromq curl
 ```
 
 ### Build, Download, and Test
@@ -205,7 +205,7 @@ graph LR;
 
 > **NOTE:** This ring communication is a communication overlay, not the physical topology. These devices are physically fully connected because they all connect to the same Wi-Fi.
 
-> If possible, disable the firewall to prevent the ports needed (e.g., 9000, 10000) been blocked.
+> If possible, disable the firewall so that the needed ports (e.g., 9000 and 10000) are not blocked, or use `--data-port` (9000 by default) and `--signal-port` (10000 by default) to customize the ports used.
 
 
 Taking QwQ-32B as an example, run the following commands on the devices to launch distributed inference:
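For example, a rank-0 launch with customized ports might look like the sketch below; the `--data-port` and `--signal-port` flags come from the note above, but the model path, port values, and rank are illustrative assumptions, and the other ranks would need matching settings.

```shell
# Sketch: pick non-default ports if 9000/10000 are blocked (values are illustrative)
./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 4 --rank 0 --prefetch \
    --data-port 9001 --signal-port 10001
```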
@@ -229,7 +229,7 @@ Once started, prima.cpp will profile each device and decide how much workload to
 ### (Optional) Run with Prebuilt Docker Image
 Assume we have a host machine with at least 32 CPU cores, 32 GiB RAM, and 32 GiB VRAM. We simulate 4 homogeneous nodes using Docker containers, with each node allocated 8 CPU cores, 8 GiB RAM, and 8 GiB VRAM. Follow the steps below to get started:
 
-1. Pull our prebuilt Docker image (e.g., [`prima.cpp:1.0.1-cuda`](https://hub.docker.com/repository/docker/lizonghango00o1/prima.cpp/general)) and run 4 containers:
+1. Pull our prebuilt Docker image (e.g., [`prima.cpp:1.0.2-cuda`](https://hub.docker.com/repository/docker/lizonghango00o1/prima.cpp/general)) and run 4 containers:
 
 ```shell
 sudo docker run -dit --name prima-v1 --memory=8gb --memory-swap=8gb --cpus 8 --cpuset-cpus="0-7" --network host --gpus all prima.cpp:1.0.1-cuda
@@ -238,9 +238,7 @@ sudo docker run -dit --name prima-v3 --memory=8gb --memory-swap=8gb --cpus 8 --c
 sudo docker run -dit --name prima-v4 --memory=8gb --memory-swap=8gb --cpus 8 --cpuset-cpus="24-31" --network host --gpus all prima.cpp:1.0.1-cuda
 ```
 
-> If your host machine does not have a GPU, ignore the `--gpus all` option.
-
-2. Download the model file [`qwq-32b-q4_k_m.gguf`](https://huggingface.co/Qwen/QwQ-32B-GGUF) and copy it into each container:
+1. Download the model file [`qwq-32b-q4_k_m.gguf`](https://huggingface.co/Qwen/QwQ-32B-GGUF) and copy it into each container:
 
 ```shell
 cd prima.cpp/download
@@ -250,27 +248,27 @@ sudo docker cp qwq-32b-q4_k_m.gguf prima-v3:/root/prima.cpp/download/
 sudo docker cp qwq-32b-q4_k_m.gguf prima-v4:/root/prima.cpp/download/
 ```
 
-3. (Optional) Enter each container, rebuild prima.cpp if your host machine does not have a GPU:
+1. Enter each container and build prima.cpp:
 
 ```shell
-cd /root/prima.cpp && make clean
-make -j$(nproc) # If not rank 0
-make USE_HIGHS=1 -j$(nproc) # If rank 0
+cd /root/prima.cpp
+make GGML_CUDA=1 USE_HIGHS=1 -j$(nproc) # For rank 0
+make GGML_CUDA=1 -j$(nproc) # For other ranks
 ```
 
 4. Enter each container and launch the distributed inference:
 
 ```shell
 cd /root/prima.cpp
-(prima-v1) ./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 -n 256 -p "what is edge AI?" --world 4 --rank 0 --prefetch --gpu-mem 8
-(prima-v2) ./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 1 --prefetch --gpu-mem 8
-(prima-v3) ./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 2 --prefetch --gpu-mem 8
-(prima-v4) ./llama-cli -m download/qwq-32b-q4_k_m.gguf -c 1024 --world 4 --rank 3 --prefetch --gpu-mem 8
+(prima-v1) ./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 4 --rank 0 --prefetch --gpu-mem 8 -c 4096 -n 256 -p "what is edge AI?"
+(prima-v2) ./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 4 --rank 1 --prefetch --gpu-mem 8
+(prima-v3) ./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 4 --rank 2 --prefetch --gpu-mem 8
+(prima-v4) ./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 4 --rank 3 --prefetch --gpu-mem 8
 ```
 
-> If your host machine does not have a GPU, ignore the `--gpu-mem` option.
+> You can ignore `--gpu-mem` if you don't want to limit VRAM usage.
 
-> If you update to the latest code, non-rank 0 nodes can omit `-c 1024`.
+> Always use `git fetch` to update the local repository.
 
 ### Run in Server Mode
 You can run prima.cpp in server mode by launching `llama-server` on the rank 0 device (with `--host` and `--port` specified) and `llama-cli` on the others. Here is an example with 2 devices:
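The 2-device example itself is not shown in this hunk. A minimal sketch, assuming `llama-server` accepts the same `--world`/`--rank` flags as `llama-cli` and using an illustrative host, port, and model path:

```shell
# Sketch: server mode with 2 devices (host, port, and model path are illustrative)
# On the rank 0 device (serves the HTTP API):
./llama-server -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 0 --host 0.0.0.0 --port 8080
# On the other device:
./llama-cli -m download/qwq-32b-q4_k_m.gguf --world 2 --rank 1
```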