optimize gguf dequant, save mem, support Q2_K

use marlin for lm_head, lm_head only calc last token for prefill
extend context window to 19K for DeepSeek-V3/R1 within 24GB VRAM
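
A rough illustration of the "lm_head only calc last token for prefill" item: during prefill, only the final position's logits are needed to pick the next token, so the output projection can be applied to a single hidden-state row instead of every row. This shrinks the logits buffer from seq_len x vocab to 1 x vocab. The sketch below uses plain Python lists and hypothetical names (`matvec`, `lm_head_all_tokens`, `lm_head_last_token`); it is not the ktransformers implementation, which per the message above runs lm_head through a Marlin kernel.

```python
# Illustrative sketch only; all names here are hypothetical,
# not ktransformers APIs.

def matvec(weight, vec):
    """Multiply a (vocab x hidden) weight matrix by a hidden-size vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def lm_head_all_tokens(hidden_states, weight):
    """Baseline: project every position, giving seq_len x vocab logits."""
    return [matvec(weight, h) for h in hidden_states]

def lm_head_last_token(hidden_states, weight):
    """Prefill shortcut: project only the last position (1 x vocab logits)."""
    return [matvec(weight, hidden_states[-1])]

# Toy sizes: seq_len=3, hidden=2, vocab=4.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = lm_head_all_tokens(hidden_states, weight)
last = lm_head_last_token(hidden_states, weight)
```

Here `last` holds the same values as the final row of `full`, while the projection work and the logits storage no longer grow with the prompt length.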
2025-09-12 08:09:42 +00:00 · 2025-02-22 06:13:01 +00:00 · 2025-02-22 06:13:01 +00:00 · 5ec33d046d
commit 5ec33d046d
parent 7e1fe256c8
27 changed files with 435 additions and 259 deletions
--- a/ktransformers/operators/flashinfer_wrapper.py
+++ b/ktransformers/operators/flashinfer_wrapper.py
@@ -9,7 +9,7 @@ flashinfer_enabled = False

 try:
     import flashinfer
-    flashinfer_enabled = True
+    flashinfer_enabled = False
     print("found flashinfer")
    
 except ImportError: