koboldcpp

vrr/koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-12 01:54:37 +00:00

Commit graph

Author	SHA1	Message	Date
Jeff Bolz	6efcd65945	vulkan: optimize flash attention split_k_reduce (#14554) * vulkan: allow FA split_k with smaller KV values * vulkan: spread split_k_reduce work across more threads k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).	2025-07-08 20:11:42 +02:00
Jeff Bolz	8875523eb3	vulkan: support softmax/FA batch and broadcast (#14449)	2025-07-02 15:48:33 +03:00
Jeff Bolz	f01bd02376	vulkan: Implement split_k for coopmat2 flash attention. (#12627) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-02 14:25:08 -05:00
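
The split_k commits above describe splitting the KV range across several workgroups and then reducing the per-split M/L values into one output. The following is a minimal CPU sketch of that general idea in plain Python, not the shader code from these commits. It assumes each split stores a running maximum `m`, a softmax denominator `l`, and an unnormalized output vector `o`; the function names and that storage layout are hypothetical and chosen only for illustration.

```python
import math

def partial_attention(scores, values):
    """Process one KV split: return (m, l, o) where o is NOT yet normalized."""
    m = max(scores)                              # running max for this split
    weights = [math.exp(s - m) for s in scores]  # numerically stable exponentials
    l = sum(weights)                             # softmax denominator for this split
    hsv = len(values[0])                         # size of the value head dimension
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(hsv)]
    return m, l, o

def split_k_reduce(partials):
    """Combine per-split (m, l, o) triples into the final attention output."""
    m_max = max(m for m, _, _ in partials)
    # Rescale every split to the shared maximum before summing.
    scales = [math.exp(m - m_max) for m, _, _ in partials]
    l_total = sum(s * l for s, (_, l, _) in zip(scales, partials))
    hsv = len(partials[0][2])
    # Each output element d is independent of the others, which is what
    # makes "one thread per HSV element" possible on a GPU.
    return [
        sum(s * o[d] for s, (_, _, o) in zip(scales, partials)) / l_total
        for d in range(hsv)
    ]

def split_k_attention(scores, values, k_num):
    """Split the KV range into at most k_num chunks, then reduce."""
    n = len(scores)
    chunk = -(-n // k_num)  # ceiling division
    partials = [
        partial_attention(scores[i:i + chunk], values[i:i + chunk])
        for i in range(0, n, chunk)
    ]
    return split_k_reduce(partials)

def attention_reference(scores, values):
    """Unsplit softmax-weighted sum, used only to check the split version."""
    m, l, o = partial_attention(scores, values)
    return [x / l for x in o]
```

Calling `split_k_attention(scores, values, k_num)` should agree with `attention_reference(scores, values)` up to floating-point rounding for any `k_num >= 1`. In this sketch the cost of the reduce step grows with both the number of splits and the head dimension, which is consistent with the commit's remark that spreading the reduction over more threads helps when `k_num` or HSV is large.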

3 commits