llama: avoid copying logits during prompt decode in MTP (#23198)

* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm
Aman Gupta 2026-05-17 23:30:25 +08:00 committed by GitHub
parent 39cf5d6191
commit 3e12fbdea5
10 changed files with 91 additions and 27 deletions

@@ -243,6 +243,11 @@ struct server_slot {
         return task->need_embd() || (spec && common_speculative_need_embd(spec));
     }

+    bool need_embd_pre_norm() const {
+        GGML_ASSERT(task);
+        return spec && common_speculative_need_embd_pre_norm(spec);
+    }
+
     // if the context does not have a memory module then all embeddings have to be computed within a single ubatch
     // also we cannot split if the pooling would require any past tokens
     // (MTP supports splitting — uses task->need_embd() not need_embd())