add test for IQ1 and doc for device selection

Lizonghang 2025-06-04 15:12:00 +04:00
parent 27756ee182
commit ef1e10101e


@@ -34,7 +34,7 @@ And, if your devices are more powerful, you could unlock even more possibilities
> Device D4 runs inside a Termux-simulated Linux. Device D1 reads disk data in random mode and D2~D4 read in sequential mode.
-**Table 2:** Token latency for Llama models.
+**Table 2:** Token latency for Llama models (w/o device selection).
| **Model** | **llama.cpp** | **exo** | **dllama** | **prima.cpp** |
|-----------------|---------------|-----------|------------|---------------|
| Llama 3-8B | **15 ms** | 263 ms | 459 ms | 54 ms |
@@ -45,7 +45,7 @@ And, if your devices are more powerful, you could unlock even more possibilities
| Llama 1-65B | 8807 ms | - | - | **569 ms** |
| Llama 3-70B | 10120 ms | OOM | OOM | **674 ms** |
-**Table 3:** Token latency for Qwen 2.5, QwQ, and DeepSeek R1 models.
+**Table 3:** Token latency for Qwen 2.5, QwQ, and DeepSeek R1 models (w/o device selection).
| **Model** | **llama.cpp** | **exo** | **dllama** | **prima.cpp** |
|-----------------------------------|---------------|---------------|------------|---------------|
@@ -61,7 +61,9 @@ And, if your devices are more powerful, you could unlock even more possibilities
> As video recording consumes some RAM, prima.cpp proactively reduces memory usage, resulting in slightly higher latency in the video compared to the table.
-> In current implementation, each device is assigned at least one model layer. For example, this leads to a 1:1:29:1 split for Llama 3-8B, which makes prima.cpp less efficient. In future updates, we will have a 0:0:32:0 split and idle devices removed, then llama.cpp would become a special case of prima.cpp when serving small models.
+> In the old version (w/o device selection), each device is assigned at least one model layer. This would lead to a 1:1:29:1 split for Llama 3-8B, which makes prima.cpp slower than llama.cpp.
+>
+> **New:** In the latest version (with device selection), the split becomes 0:0:32:0 and weak devices are removed, so prima.cpp effectively reduces to llama.cpp when serving small models (see the toy sketch below).
## 🔑 Key Features
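To make the device-selection note above concrete, here is a minimal sketch. It is not prima.cpp's actual scheduler: the latency model, the fixed per-hop network cost, and the device speeds are all made-up assumptions, and the real scheduler also has to respect RAM/VRAM limits, disk speed, and the OS. It only illustrates why dropping weak devices (down to a single fast device for small models) can lower per-token latency.

```python
# Toy illustration of automatic device selection (NOT prima.cpp's real scheduler).
# Assumptions: layers are split in proportion to device speed, each extra device
# adds a fixed network hop, and memory/disk constraints are ignored.

def estimate_token_latency_ms(devices, n_layers, per_hop_ms=5.0):
    """Per-token latency if n_layers are split proportionally to speed (layers/s)."""
    total_speed = sum(d["layers_per_s"] for d in devices)
    latency = 0.0
    for d in devices:
        layers = n_layers * d["layers_per_s"] / total_speed
        latency += layers / d["layers_per_s"] * 1000.0 + per_hop_ms
    return latency

def select_devices(devices, n_layers):
    """Greedily drop the slowest device while doing so reduces estimated latency."""
    kept = sorted(devices, key=lambda d: d["layers_per_s"], reverse=True)
    while len(kept) > 1:
        without_weakest = kept[:-1]
        if estimate_token_latency_ms(without_weakest, n_layers) < estimate_token_latency_ms(kept, n_layers):
            kept = without_weakest
        else:
            break
    return kept

cluster = [  # made-up speeds: two capable machines, two weak phones
    {"name": "D1", "layers_per_s": 120.0},
    {"name": "D2", "layers_per_s": 4.0},
    {"name": "D3", "layers_per_s": 110.0},
    {"name": "D4", "layers_per_s": 3.0},
]
print([d["name"] for d in select_devices(cluster, n_layers=32)])
# For this small 32-layer model, all layers end up on the fastest device,
# mirroring the 0:0:32:0 split described above.
```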
@@ -70,6 +72,7 @@ And, if your devices are more powerful, you could unlock even more possibilities
- **GPU & CPU Offloading:** If a device has a GPU, you can use both GPU and CPU for inference. For example, when VRAM is full, we can offload some model layers to RAM.
- **Piped-ring parallelism with prefetching:** Prefetch upcoming layer weights to overlap disk loading latency and use advanced piped-ring parallelism to prevent the "prefetch-release" effect. This new parallelism improves pipeline parallelism by using a ring structure and allows devices to run multiple cycles to predict a new token.
- **Heterogeneity-aware workload distribution:** A scheduler is designed to optimize workload distribution based on each device's computing power, disk speed, memory, and OS (the OS will affect the disk speed and the memory management strategy). It decides how many model layers a device should handle and how many should run on GPU (if available); a toy allocation sketch follows this list.
+- **Automatic device selection:** If there are weak devices and removing them would speed up inference, prima.cpp will automatically discover and remove them.
- **Quantization:** We now support Q4K, Q6K, Q80, and IQ1 quantization (GGUF format) and are exploring a Q4K-IQ1 hybrid for a better balance between performance and speed.
- **Supported Models:** We now support hot models like the **Llama, Qwen (and QwQ), and DeepSeek series**. More will be added in future updates.
- **Cross-Platform:** The cluster can consist of devices with different OSs, including macOS, Linux, Android, HarmonyOS, etc. Now, Android and HarmonyOS devices require Termux, and Windows support will be added in a future update.
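As a companion to the workload-distribution bullet above, here is a deliberately simplified allocation sketch. It is an assumption-laden toy, not prima.cpp's algorithm: it splits layers in proportion to a single made-up "speed" number and caps each device's GPU layers by VRAM, whereas the real scheduler also weighs disk speed, RAM, and the OS.

```python
# Toy heterogeneity-aware split (NOT prima.cpp's real algorithm): layers are
# assigned roughly in proportion to device speed; each device then offloads as
# many of its layers to GPU as its VRAM can hold, keeping the rest on CPU/RAM.

def assign_layers(devices, n_layers, layer_size_gb):
    total_speed = sum(d["speed"] for d in devices)
    plan, assigned = [], 0
    for i, d in enumerate(devices):
        if i == len(devices) - 1:
            n = n_layers - assigned          # last device takes the remainder
        else:
            n = round(n_layers * d["speed"] / total_speed)
        n_gpu = min(n, int(d.get("vram_gb", 0) / layer_size_gb))
        plan.append({"device": d["name"], "layers": n, "gpu_layers": n_gpu})
        assigned += n
    return plan

# Made-up cluster and an assumed ~0.55 GB per layer for a quantized large model.
for row in assign_layers(
    [{"name": "D1", "speed": 100, "vram_gb": 6},
     {"name": "D2", "speed": 20},
     {"name": "D3", "speed": 120, "vram_gb": 11},
     {"name": "D4", "speed": 15}],
    n_layers=80, layer_size_gb=0.55):
    print(row)
```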
@@ -78,26 +81,26 @@ And, if your devices are more powerful, you could unlock even more possibilities
Here are the models we have tested so far. You can also try more on Hugging Face!
### Llama
-- **Llama 3-8B (Q4K, Q6K, Q80):** [Meta-Llama-3-8B-Instruct](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF)
+- **Llama 3-8B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/LLama-3-8b-Uncensored-i1-GGUF)):** [Meta-Llama-3-8B-Instruct](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF)
- **Llama 3-14B (Q4K, Q6K, Q80):** [Llama-3-14B-Instruct-v1](https://huggingface.co/RDson/Llama-3-14B-Instruct-v1-GGUF)
-- **Llama 1-30B (Q4K, Q6K, Q80):** [upstage-llama-30b-instruct-2048](https://huggingface.co/TheBloke/upstage-llama-30b-instruct-2048-GGUF)
-- **Llama 3-45B (Q4K, Q6K, Q80):** [Llama-3-pruned-45B-Drobeta-Turnu-Severin](https://huggingface.co/mradermacher/Llama-3-pruned-45B-Drobeta-Turnu-Severin-GGUF)
-- **Llama 3-60B (Q4K, Q6K, Q80):** [nyun-llama3-60B](https://huggingface.co/mradermacher/nyun-llama3-60B-GGUF)
-- **Llama 1-65B (Q4K, Q6K, Q80):** [llama-65b](https://huggingface.co/TheBloke/LLaMA-65B-GGUF)
-- **Llama 3-70B (Q4K, Q6K, Q80):** [Meta-Llama-3-70B-Instruct](https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF)
+- **Llama 1-30B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/LLaMA-30B-HF-i1-GGUF)):** [upstage-llama-30b-instruct-2048](https://huggingface.co/TheBloke/upstage-llama-30b-instruct-2048-GGUF)
+- **Llama 3-45B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Llama-3-pruned-45B-Drobeta-Turnu-Severin-i1-GGUF)):** [Llama-3-pruned-45B-Drobeta-Turnu-Severin](https://huggingface.co/mradermacher/Llama-3-pruned-45B-Drobeta-Turnu-Severin-GGUF)
+- **Llama 3-60B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/nyun-llama3-60B-i1-GGUF)):** [nyun-llama3-60B](https://huggingface.co/mradermacher/nyun-llama3-60B-GGUF)
+- **Llama 1-65B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/llama-65b-instruct-i1-GGUF)):** [llama-65b](https://huggingface.co/TheBloke/LLaMA-65B-GGUF)
+- **Llama 3-70B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Meta-Llama-3-70B-Instruct-DPO-i1-GGUF)):** [Meta-Llama-3-70B-Instruct](https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF)
### Qwen 2.5 / QwQ
-- **Qwen 2.5-7B (Q4K, Q6K, Q80):** [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF)
-- **Qwen 2.5-14B (Q4K, Q6K, Q80):** [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF)
-- **Qwen 2.5-32B (Q4K, Q6K, Q80):** [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GGUF)
-- **Qwen 2.5-72B (Q4K, Q6K, Q80):** [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GGUF)
-- **QwQ-32B (Q4K, Q6K, Q80):** [qwq-32b](https://huggingface.co/Qwen/QwQ-32B-GGUF)
+- **Qwen 2.5-7B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Qwen2.5-7B-i1-GGUF)):** [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF)
+- **Qwen 2.5-14B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Qwen2.5-14B-i1-GGUF)):** [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF)
+- **Qwen 2.5-32B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Qwen2.5-32B-i1-GGUF)):** [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GGUF)
+- **Qwen 2.5-72B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Qwen2.5-72B-Instruct-i1-GGUF)):** [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GGUF)
+- **QwQ-32B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/QwQ-32B-i1-GGUF)):** [qwq-32b](https://huggingface.co/Qwen/QwQ-32B-GGUF)
### DeepSeek
-- **DeepSeek R1-7B (Q4K, Q6K, Q80):** [deepseek-ai.DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Qwen-7B-GGUF)
-- **DeepSeek R1-8B (Q4K, Q6K, Q80):** [deepseek-ai.DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Llama-8B-GGUF)
-- **DeepSeek R1-14B (Q4K, Q6K, Q80):** [deepseek-ai.DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Qwen-14B-GGUF)
-- **DeepSeek R1-32B (Q4K, Q6K, Q80):** [deepseek-ai.DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Qwen-32B-GGUF)
+- **DeepSeek R1-7B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/DeepSeek-R1-Distill-Qwen-7B-Uncensored-i1-GGUF)):** [deepseek-ai.DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Qwen-7B-GGUF)
+- **DeepSeek R1-8B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/DeepSeek-R1-Distill-Llama-8B-i1-GGUF)):** [deepseek-ai.DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Llama-8B-GGUF)
+- **DeepSeek R1-14B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/Qwen2.5-14B-DeepSeek-R1-1M-Uncensored-GGUF)):** [deepseek-ai.DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Qwen-14B-GGUF)
+- **DeepSeek R1-32B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/mradermacher/deepseek-r1-qwen-2.5-32B-ablated-i1-GGUF)):** [deepseek-ai.DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/DevQuasar/deepseek-ai.DeepSeek-R1-Distill-Qwen-32B-GGUF)
+- **DeepSeek R1-70B (Q4K, Q6K, Q80, [IQ1](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF)):** [DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF)
## ⚙️ How to Use?
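If you want to fetch one of the IQ1 files linked above programmatically, something like the following works with the `huggingface_hub` Python package. The repo id is taken from the Qwen 2.5-7B entry above, but the filename is a placeholder assumption about the repo's naming scheme; check the repository's file listing for the exact IQ1_S/IQ1_M name before running it.

```python
# Hedged example: download an IQ1 GGUF from one of the i1-GGUF repos listed above.
# The filename below is a guess -- verify it on the Hugging Face "Files" tab first.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="mradermacher/Qwen2.5-7B-i1-GGUF",
    filename="Qwen2.5-7B.i1-IQ1_S.gguf",  # placeholder; confirm the exact name
)
print("GGUF saved at:", path)
```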
@@ -154,6 +157,7 @@ make GGML_CUDA=1 -j$(nproc)
make LLAMA_NO_METAL=1 -j$(nproc)
# To enable debug mode, add LLAMA_DEBUG=1:
+# WARNING: Running in DEBUG mode will slow down inference!
make LLAMA_DEBUG=1 -j$(nproc)
# Otherwise, just use:
@@ -356,6 +360,10 @@ Not yet—but it's on the roadmap. Currently, prima.cpp can run on Linux, macOS,
Not yet. For now, prima.cpp supports only CUDA-based GPUs. Vulkan is on our roadmap, and AMD GPUs will be supported once we have such a device.
+**7. Why did I get "No layer is assigned to me, exit"?**
+No worries, this is expected. Prima.cpp found that this device was too slow and that dropping it would speed up inference, so it was removed.
## ❤️ Acknowledgment
This project builds upon the incredible work from the open-source community, especially [ggml, gguf](https://github.com/ggml-org/ggml), and [llama.cpp](https://github.com/ggml-org/llama.cpp). We gratefully acknowledge their contributions.