mirror of https://github.com/kvcache-ai/ktransformers.git (synced 2025-09-09 05:54:06 +00:00)
Update balance-serve.md
This commit is contained in:
parent 8a1313ca4e
commit 6cbe044aae
1 changed file with 4 additions and 1 deletion
@@ -128,14 +128,17 @@ It features the following arguments:
- `--max_new_tokens`: Maximum number of tokens generated per request.
- `--cache_lens`: Total length of kvcache allocated by the scheduler. All requests share a kvcache space, corresponding to 32768 tokens, and the space occupied will be released after the requests are completed.
- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single run by the engine. (Supported only by `balance_serve`)
- `--chunk_size`: Maximum number of tokens processed in a single run by the engine.
- `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in v0.2.4; the original single-concurrency engine is `ktransformers`.
- `--model_path`: Path to the safetensors config directory (only the config is required, not the model safetensors).
Please note that, since v0.2.4, the last segment of the `${model_path}` directory name **MUST** be one of the model names defined in `ktransformers/configs/model_configs.json`.
- `--force_think`: Force the response to include the reasoning tag of `DeepSeek R1`.
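Putting the flags above together, a launch command might look like the following sketch. The entry point (`ktransformers/server/main.py`) and the model path are assumptions for illustration, not taken from this section; the numeric values are illustrative and satisfy the sizing rule discussed below:

```shell
# Sketch only: entry point and MODEL_PATH are assumptions, not from this doc.
# Last segment of MODEL_PATH must match a name in model_configs.json.
MODEL_PATH="/models/DeepSeek-R1"

CMD="python ktransformers/server/main.py \
 --model_path $MODEL_PATH \
 --backend_type balance_serve \
 --max_batch_size 4 \
 --max_new_tokens 4096 \
 --cache_lens 32768 \
 --chunk_size 512"

# Inspect before running; note 4 * 4096 = 16384 < 32768 = cache_lens.
echo "$CMD"
```

The illustrative values keep `cache_lens` above `max_batch_size * max_new_tokens`, so all four requests of a full batch can hold their kvcache simultaneously.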
The relationship between `max_batch_size`, `cache_lens`, and `max_new_tokens` should satisfy:
`cache_lens > max_batch_size * max_new_tokens`, otherwise the concurrency will decrease.
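The sizing rule above can be checked numerically. A minimal sketch, with illustrative values that are assumptions rather than recommendations from this doc:

```python
# Illustrative kvcache sizing check; the three values below are assumptions.
max_batch_size = 4        # concurrent requests per engine run
max_new_tokens = 4096     # tokens generated per request
cache_lens = 32768        # total kvcache length shared by all requests

# Worst case: every request in a full batch generates max_new_tokens tokens.
worst_case = max_batch_size * max_new_tokens   # 16384

# The rule cache_lens > max_batch_size * max_new_tokens must hold,
# otherwise requests contend for kvcache and concurrency decreases.
assert cache_lens > worst_case, "reduce max_batch_size or max_new_tokens"

# Upper bound on requests this cache can sustain at full generation length.
max_concurrency = cache_lens // max_new_tokens
print(max_concurrency)  # 8
```

If the assertion fails, either shrink `max_batch_size`/`max_new_tokens` or allocate a larger `cache_lens`.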
### 2. access server