mla : make the V tensor a view of K (#18986)

* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-05 23:41:45 +00:00 · 2026-01-22 22:09:01 +02:00 · 2026-01-22 22:09:01 +02:00 · a5eaa1d6a3
commit a5eaa1d6a3
parent e2baf02162
8 changed files with 39 additions and 13 deletions
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -1565,6 +1565,11 @@ ggml_tensor * llm_graph_context::build_attn_mha(
            v = ggml_transpose(ctx0, v);
        }

+        // TODO: update llama_kv_cache to not store V cache in the MLA case and automatically return a view of K
+        if (v_mla) {
+            v = ggml_view_4d(ctx0, k, v->ne[0], v->ne[1], v->ne[2], v->ne[3], k->nb[1], k->nb[2], k->nb[3], 0);
+        }
+
        // this can happen when KV cache is not used (e.g. an embedding model with non-causal attn)
        if (k->type == GGML_TYPE_F32) {
            k = ggml_cast(ctx0, k, GGML_TYPE_F16);
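
Note on the hunk above: the added `ggml_view_4d` call keeps V's shape but takes its strides (`nb[1]`..`nb[3]`) and base from `k`, with a zero offset. The sketch below illustrates what such a strided view does, using plain C++ rather than ggml. It assumes that each V row is a prefix of the matching K row, which is what the zero offset suggests but is not stated in the diff. `View4D`, `make_view_4d`, `at`, and the sizes are hypothetical names and values chosen for illustration; none of them are part of ggml or llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tensor descriptor; this is NOT ggml_tensor.
struct View4D {
    const float * data;  // base pointer, shared with the parent buffer
    int64_t       ne[4]; // number of elements in each dimension
    size_t        nb[4]; // stride in bytes for each dimension
};

// Hypothetical analogue of a 4D view: new shape, caller-supplied strides,
// byte offset into the parent. No data is copied.
static View4D make_view_4d(const View4D & src,
        int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3,
        size_t nb1, size_t nb2, size_t nb3, size_t offset) {
    assert(ne0 <= src.ne[0]); // the view row must fit inside the parent row
    const char * base = reinterpret_cast<const char *>(src.data) + offset;
    return View4D{ reinterpret_cast<const float *>(base),
                   { ne0, ne1, ne2, ne3 },
                   { src.nb[0], nb1, nb2, nb3 } };
}

// Read one element by walking the byte strides.
static float at(const View4D & t, size_t i0, size_t i1, size_t i2, size_t i3) {
    const char * p = reinterpret_cast<const char *>(t.data)
        + i0*t.nb[0] + i1*t.nb[1] + i2*t.nb[2] + i3*t.nb[3];
    return *reinterpret_cast<const float *>(p);
}

// Illustrative sizes only: K rows of 6 floats, V rows of 4 floats, 3 tokens.
static const int64_t K_ROW = 6, V_ROW = 4, N_TOK = 3;

// Fill the K buffer with 0, 1, 2, ... so each element equals its flat index.
static std::vector<float> make_k_buffer() {
    std::vector<float> buf(K_ROW * N_TOK);
    for (size_t i = 0; i < buf.size(); ++i) buf[i] = static_cast<float>(i);
    return buf;
}

// Describe the buffer as a contiguous [K_ROW, N_TOK, 1, 1] tensor.
static View4D describe_k(const std::vector<float> & buf) {
    const size_t row = K_ROW * sizeof(float);
    return View4D{ buf.data(), { K_ROW, N_TOK, 1, 1 },
                   { sizeof(float), row, row * N_TOK, row * N_TOK } };
}

// Element (i0, i1) of V, where V is a view of K that keeps K's strides.
static float view_element(size_t i0, size_t i1) {
    const std::vector<float> buf = make_k_buffer();
    const View4D k = describe_k(buf);
    const View4D v = make_view_4d(k, V_ROW, k.ne[1], k.ne[2], k.ne[3],
                                  k.nb[1], k.nb[2], k.nb[3], 0);
    return at(v, i0, i1, 0, 0);
}

// True if every element of the view equals the matching element of K.
static bool view_matches_k_prefix() {
    const std::vector<float> buf = make_k_buffer();
    const View4D k = describe_k(buf);
    const View4D v = make_view_4d(k, V_ROW, k.ne[1], k.ne[2], k.ne[3],
                                  k.nb[1], k.nb[2], k.nb[3], 0);
    for (size_t i1 = 0; i1 < static_cast<size_t>(v.ne[1]); ++i1) {
        for (size_t i0 = 0; i0 < static_cast<size_t>(v.ne[0]); ++i0) {
            if (at(v, i0, i1, 0, 0) != at(k, i0, i1, 0, 0)) return false;
        }
    }
    return true;
}
```

Because the view keeps K's row stride instead of a stride derived from its own, narrower row, stepping to the next token skips the tail of each K row. In this sketch `view_element(2, 1)` reads the same float as element (2, 1) of K, so nothing is duplicated in memory.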