* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI
For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).
This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.
* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers
This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).
The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.
Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).
Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.
See 59778f0196.
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>