* sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle
Signed-off-by: Chun Tao <chun.tao@intel.com>
* Use async mem ops for correctness when SYCL graphs are explicitly on.
Signed-off-by: Tao, Chun <chun.tao@intel.com>
---------
Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Tao, Chun <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.
closes: #23242
* ggml-hexagon: add PAD op HVX kernel
Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.
* hex-ggml: remove duplicate op cases (merge conflict)
* hex-pad: fix editorconfig checks and macro alignment
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* docker: add OCI image labels to all published images
* docker: propagate OCI labels as manifest and index annotations
* docker: drop hardcoded org URL and revert accidental intel version bump
The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit.
* docker: add skip_s390x workflow_dispatch input for fast test runs
Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
---------
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars
* refactor: skip MCP proxy probe when no server requires it
* refactor: suppress expected disconnect errors during MCP client shutdown
* refactor: Deduplicate requests
* refactor: deduplicate model fetching across ROUTER and MODEL modes
* refactor: Clean up models logic
* chore: Add `.env.example` file
* refactor: replace client-side CORS proxy probe with server status flag
* refactor: Post-review fixes
* test: add vitest client setup with API fetch mocks
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI
For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).
This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.
* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers
This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).
The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.
Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).
Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.
See 59778f0196.
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>