mirror of
https://github.com/LostRuins/koboldcpp.git
synced 2026-05-20 09:25:53 +00:00
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently, speculative decoding has to restart from a checkpoint when
some draft tokens are rejected, which wastes work because the target
has to be run again. This PR adds the ability to roll back up to
`draft_max` tokens by storing the GDN intermediate states.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
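
A minimal host-side sketch of the slot scheme listed above, not the actual Metal kernel. Everything named here is hypothetical: `toy_update` stands in for the real gated delta net step, and the slot mapping (state after token `t` goes to slot `min(t, K - 1)`, rollback copies the accepted slot back to slot 0) is an assumption chosen so that K=1 reduces to the single-slot behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy recurrent update standing in for the gated delta net step (assumption:
// the real update differs; this one only needs to be order-dependent).
static void toy_update(std::vector<float> & state, float token) {
    for (float & v : state) {
        v = 0.5f * v + token;
    }
}

// `slots` holds K snapshots of a state of size S; slot k occupies
// [k * S, (k + 1) * S). The input state is read from slot 0.
static void run_tokens(std::vector<float> & slots, size_t K, size_t S,
                       const std::vector<float> & tokens) {
    assert(K >= 1 && slots.size() == K * S);

    // Read the input state from slot 0 into a working copy.
    std::vector<float> state(slots.begin(), slots.begin() + S);

    for (size_t t = 0; t < tokens.size(); ++t) {
        toy_update(state, tokens[t]);

        // Assumed mapping: token t writes to slot min(t, K - 1).
        // With K == 1 every write lands in slot 0 (single-slot behavior).
        const size_t slot = std::min(t, K - 1);
        std::copy(state.begin(), state.end(), slots.begin() + slot * S);
    }
}

// Keep only the first `n_accepted` tokens: under the mapping above, the
// matching state lives in slot n_accepted - 1 and becomes the next input.
static void rollback(std::vector<float> & slots, size_t K, size_t S,
                     size_t n_accepted) {
    assert(n_accepted >= 1 && n_accepted <= K && slots.size() == K * S);
    if (n_accepted == 1) {
        return; // slot 0 already holds the state after the first token
    }
    std::copy_n(slots.begin() + (n_accepted - 1) * S, S, slots.begin());
}
```

With K at least as large as the number of draft tokens, a rejected draft only needs a `rollback` call rather than a second pass over the accepted tokens.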
Ref:
| Name |
|---|
| ggml-alloc.h |
| ggml-backend.h |
| ggml-blas.h |
| ggml-cann.h |
| ggml-cpp.h |
| ggml-cpu.h |
| ggml-cuda.h |
| ggml-hexagon.h |
| ggml-metal.h |
| ggml-opencl.h |
| ggml-openvino.h |
| ggml-opt.h |
| ggml-rpc.h |
| ggml-sycl.h |
| ggml-virtgpu.h |
| ggml-vulkan.h |
| ggml-webgpu.h |
| ggml-zdnn.h |
| ggml-zendnn.h |
| ggml.h |
| gguf.h |