Aubrey Li
a12e8ab46e
yaml: fix Marlin AssertionError
...
Marlin quantized linear only supports GPU devices. When generate_op is changed
to "KLinearMarlin", generate_device must be changed to "cuda" accordingly.
Fixes: e5b001d76f
("Update readme; Format code; Add example yaml.")
2025-03-21 23:58:20 +08:00
Atream
167506b779
Update DeepSeek-V3-Chat-multi-gpu-marlin.yaml
2025-03-17 17:05:01 +08:00
Atream
c9a0c44213
Update DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml
2025-03-17 17:03:52 +08:00
liam
19f058ec9e
🔧 update multi-gpu-fp8-linear and multi-gpu marlin yaml
2025-03-17 15:08:12 +08:00
Azure-Tang
85c32fdd10
Fix rocm example yaml
2025-03-15 22:27:02 -04:00
Azure
3986e2d2cf
Merge pull request #178 from fxzjshm/hip
...
[Feat] Port to ROCm/HIP
2025-03-15 02:31:07 +08:00
Azure-Tang
e5b001d76f
Update readme; Format code; Add example yaml.
2025-03-14 14:25:52 -04:00
Atream
a889288fc1
use compile for gate, slight performance improvement
2025-03-14 12:43:28 +00:00
Azure-Tang
ed8437413b
merge main; Add torch q8 linear
2025-03-14 05:52:07 -04:00
Atream
90eb87b3fc
Update DeepSeek-V3-Chat-multi-gpu-marlin.yaml
2025-02-26 21:53:50 +08:00
Azure
91c1619296
Merge branch 'develop-0.2.2' into support-fp8
...
Update README.md
2025-02-25 13:43:26 +00:00
Azure
2c0cce90d0
add fp8 multi gpu yaml example
2025-02-25 13:32:09 +00:00
Atream
477ac28a9c
fix-update-flashinfer_wrapper_local_chat
2025-02-25 12:47:31 +00:00
Atream
b443c7dfa2
Merge pull request #657 from kvcache-ai/feat-absorb-for-long-prefill
...
Feat absorb for long prefill
2025-02-25 16:53:21 +08:00
Atream
f4c198bd42
support absorb for prefill long context
2025-02-25 08:52:02 +00:00
Azure
ca7366d2db
Merge remote-tracking branch 'upstream/develop-0.2.2' into support-fp8
2025-02-24 11:58:10 +00:00
Azure
581a524f65
Add data loader to read special weights for fp8; Add special weight process script
2025-02-24 11:34:17 +00:00
Atream
f327695079
fix KExpertsMarlin on GPU without CUDA Graph
2025-02-24 09:30:54 +00:00
Atream
f5f6c6b95d
update yaml
2025-02-23 14:33:58 +00:00
DDong Jianwei
95d937c51d
tmp
2025-02-23 18:51:42 +08:00
Atream
5ec33d046d
optimize gguf dequant, save mem, support Q2_K
...
use marlin for lm_head, lm_head only calc last token for prefill
extend context window to 19K for DeepSeek-V3/R1 within 24GB VRAM
2025-02-22 06:13:01 +00:00
Atream
7e1fe256c8
optimize GPU
2025-02-21 05:06:57 +00:00
Atream
c189d55bd1
toy support for experts on GPU, no CUDA Graph
2025-02-15 15:16:00 +00:00
Azure
b7653b9c4f
add V3/R1 8 gpu yaml example
2025-02-14 02:56:13 +00:00
MorphisZhang
aea4243712
Add optimization config for Deepseek V3/R1 with 4 GPUs
2025-02-13 16:32:28 +08:00
Azure
0564ac8465
update marlin expert example
2025-02-12 04:11:00 +00:00
liam
83401dbb3b
⚡ ready to publish
2025-02-10 12:29:23 +08:00
Azure
c4d9bc6670
support KExpertsMarlin backend
2025-02-07 05:57:40 +00:00
Azure
ee24a27001
update v3 single gpu rule yaml;
2025-02-04 16:14:35 +00:00
Azure
907251c743
done support deepseekv3
2025-02-04 15:53:38 +00:00
Azure
f748cd29f0
fix rope; update moegate
2025-02-01 18:05:45 +00:00
Azure
f873558a89
update rope calculation; update modeling.py; update gate for moe
2025-02-01 07:32:21 +00:00
Azure
476b1d8dc6
support deepseekv3; runnable but has precision problem
2025-01-31 08:27:24 +00:00
anyanqilin
2d67016d14
wjh-change
2024-11-04 14:02:19 +08:00
xhedit
234faf7987
typo fix: KMisrtal -> KMistral
2024-09-12 15:58:01 +00:00
chenxl
49cce0c437
[fix] bugs about Qwen57B, install requirement, Dockerfile
2024-08-30 09:51:32 +00:00
TangJingqi
8747c099f2
update yaml example; update version idx; update docker file
2024-08-29 22:39:20 +08:00
TangJingqi
abd4214b56
fix readme; adjust param
2024-08-29 10:40:08 +08:00
chenxl
4d1d561d28
[feature] release 0.1.3
2024-08-28 16:11:43 +00:00
TangJingqi
de3faaf55d
Update readme; add pipeline tutorial; add detailed inject tutorial
2024-08-15 20:42:54 +08:00
TangJingqi
c47205dce9
fix name
2024-08-15 11:25:12 +08:00
TangJingqi
67043b4b5c
[fix] format classes and files name
2024-08-15 10:44:59 +08:00
Atream
412055d450
[feature] experts can be injected using CPUInfer
...
[fix] fix ktransformers interface when use new CUDAGraphRunner
[fix] fix YAML and optimize logic, the top rule has the highest priority
2024-08-14 16:10:54 +08:00
chenxl
f5f79f5c0e
[ADD] support multi-gpu qlen>1 q5_k
2024-08-12 11:41:26 +00:00
chenxl
18c42e67df
Initial commit
2024-07-27 16:06:58 +08:00