## How to use ktransformers long context framework
Currently, long context is only supported by our **local_chat.py** interface; integration with the server interface is under development.
For convenience, we have uploaded the model config, GGUF weights, and tokenizer to a single repo: https://huggingface.co/nilv234/internlm2_5_to_llama_1m/tree/main
By setting both model_path and gguf_path in the local_chat function to **/path/to/repo** and setting mode to **"long_context"**, you can run the InternLM2.5-7B-Chat-1M model with its full 1M-token context on a GPU with 24 GB of VRAM.
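A minimal sketch of the equivalent Python call, assuming local_chat.py exposes a local_chat(...) function whose keyword arguments match the CLI flags shown later in this document:

```python
# Sketch only: the import path and keyword names are assumptions based on
# the local_chat.py interface described in this document.
from ktransformers.local_chat import local_chat

local_chat(
    model_path="/path/to/repo",   # HF config + tokenizer downloaded from the repo above
    gguf_path="/path/to/repo",    # GGUF weights from the same repo
    mode="long_context",          # enable the long context framework
)
```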
After running local_chat.py for the first time, a config.yaml file will be automatically created under **~/.ktransformers**. The relevant configurations for long context are as follows:
```yaml
chunk_size: 4096            # prefill chunk size
max_seq_len: 100000         # KVCache length (in tokens)
block_size: 128             # KVCache block size
local_windows_len: 4096     # a KVCache window of this length is kept on the GPU
second_select_num: 96       # number of KVCache blocks selected at each step after preselection; if >= preselect_block_count, the preselected blocks are used
threads_num: 64             # number of CPU threads
anchor_type: DYNAMIC        # how the representative token of a KVCache block is chosen
kv_type: FP16
dense_layer_num: 0          # number of initial layers that skip KVCache filling and selection
anchor_num: 1               # number of representative tokens within a KVCache block
preselect_block: False      # whether to preselect KVCache blocks
head_select_mode: SHARED    # all kv_heads select jointly
preselect_block_count: 96   # number of preselected blocks
layer_step: 1               # select every layer_step layers
token_step: 1               # select every token_step tokens
```
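If you need to adjust these values (for example, lowering max_seq_len on a machine with less DRAM), you can edit the file by hand or script it. A minimal sketch, assuming the file is plain YAML and PyYAML is installed:

```python
# Sketch: tweak long-context settings in ~/.ktransformers/config.yaml.
# The file location and key names come from this document; if your config
# nests these keys under a section, adjust the lookups accordingly.
import yaml
from pathlib import Path

cfg_path = Path.home() / ".ktransformers" / "config.yaml"
cfg = yaml.safe_load(cfg_path.read_text())

cfg["max_seq_len"] = 131072   # e.g. cap the KVCache at 128K tokens
cfg["chunk_size"] = 4096      # keep the default prefill chunk size

cfg_path.write_text(yaml.safe_dump(cfg))
```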
The memory required for different context lengths is shown in the table below:
| Context Length | 4K  | 32K  | 64K  | 128K | 512K | 1M     |
| -------------- | --- | ---- | ---- | ---- | ---- | ------ |
| DRAM Size (GB) | 0.5 | 4.29 | 8.58 | 17.1 | 68.7 | 145.49 |
Please choose an appropriate max_seq_len based on your DRAM size.
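As a rough rule of thumb from the table above, the KVCache costs on the order of 0.14 GB of DRAM per 1K tokens, so you can sketch a helper that picks the largest max_seq_len your machine can hold (illustrative only; the exact footprint depends on the model and kv_type):

```python
# Sketch: pick a max_seq_len that fits the DRAM you are willing to spend.
# The ~0.14 GB per 1K tokens figure is extrapolated from the table above
# (e.g. 17.1 GB at 128K, 145.49 GB at 1M) and is only an estimate.
GB_PER_1K_TOKENS = 0.14

def suggest_max_seq_len(dram_budget_gb: float, block_size: int = 128) -> int:
    """Return the largest block-aligned context length fitting the DRAM budget."""
    tokens = int(dram_budget_gb / GB_PER_1K_TOKENS * 1024)
    return max(block_size, tokens - tokens % block_size)

print(suggest_max_seq_len(32.0))   # roughly 230K tokens with a 32 GB budget
```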
For example, to launch **local_chat.py** in long context mode:
```bash
python local_chat.py --model_path="/data/model/internlm2_5_to_llama_1m" --gguf_path="/data/model/internlm2_5_to_llama_1m" --max_new_tokens=500 --cpu_infer=10 --use_cuda_graph=True --mode="long_context" --prompt_file="/path/to/file"
```
If you have already specified the input text via **prompt_file**, just press Enter when the terminal displays **chat:** to begin.