* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf
The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which cannot be
satisfied on platform/CPython combinations for which PyPI has no 2.6.x build.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.
Fixes #23408
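The difference between the two specifiers can be sketched in a few lines. This is a simplified illustration of compatible-release matching for plain numeric versions with the same number of components (no pre-releases, no local versions). The helper names are hypothetical and this is not how pip resolves requirements.

```python
def parse(version: str) -> tuple[int, ...]:
    """Parse a plain numeric version such as '2.6.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def matches_compatible(candidate: str, spec: str) -> bool:
    """Simplified '~=spec': at least spec, and equal in all but the last component."""
    cand, base = parse(candidate), parse(spec)
    prefix = base[:-1]
    return cand >= base and cand[: len(prefix)] == prefix


def matches_minimum(candidate: str, spec: str) -> bool:
    """Simplified '>=spec': any version at or above spec."""
    return parse(candidate) >= parse(spec)


if __name__ == "__main__":
    for version in ("2.6.0", "2.6.1", "2.7.0", "2.11.0"):
        print(version, matches_compatible(version, "2.6.0"), matches_minimum(version, "2.6.0"))
```

Under these assumptions, a platform that only offers 2.7.0 or later has no candidate for ~=2.6.0 but does have one for >=2.6.0.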
* requirements: bump torch floor to 2.11.0 per maintainer
* requirements: pin torch to ==2.11.0 per project policy
* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy
* requirements: suppress check_requirements pin warning on mtmd
The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.
* ty: silence Tensor/Module union check on model[0].auto_model
With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
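A minimal sketch of the two options, using stand-in classes rather than the real torch or sentence-transformers types. The class names and the "hf-model" value are hypothetical. Only the shape of the annotation (an index operation typed as Tensor | Module) and the ignore rule name come from this commit.

```python
from typing import Union, cast


class Tensor:  # stand-in for torch.Tensor
    pass


class Module:  # stand-in for torch.nn.Module
    pass


class Transformer(Module):  # stand-in for the module wrapped at index 0
    def __init__(self) -> None:
        self.auto_model = "hf-model"


class Sequential(Module):
    """Stand-in whose __getitem__ is annotated with the wider union."""

    def __init__(self, *modules: Module) -> None:
        self._items = list(modules)

    def __getitem__(self, idx: int) -> Union[Tensor, Module]:
        return self._items[idx]


model = Sequential(Transformer())

# Approach taken here: a targeted suppression on the attribute access.
inner = model[0].auto_model  # ty: ignore[unresolved-attribute]

# Possible follow-up: narrow the type explicitly instead of suppressing.
typed_inner = cast(Transformer, model[0]).auto_model
```

Both lines behave identically at runtime; the difference is only in what the type checker is told.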
* pi : update
* ci : fix ios build
* ci : fix android
* ci : fix apple builds
* cmake : add install() for impl libraries
Add install(TARGETS <target> LIBRARY) for all -impl libraries that were
changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in
commit bb28c1fe2. Without this, cmake --install fails to copy the shared
libraries, causing runtime errors like:
llama-server: error while loading shared libraries: libllama-server-impl.so
Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515
Assisted-by: llama.cpp:local pi
* ci : fix xcframework build
* cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control
Remove explicit STATIC from all -impl libraries (server, cli, completion, bench,
batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls
shared vs static linkage.
Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows.
Assisted-by: llama.cpp:local pi
* cmake : enable LLAMA_BUILD_APP by default
Assisted-by: llama.cpp:local pi
* ci : disable app in build-cmake-pkg.yml
Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache
to the /slots JSON response. These fields are already tracked internally
but were not exposed, making it impossible for clients to monitor prompt
evaluation progress during processing.
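As an illustration, a client could derive a progress fraction from one slot entry. The field names are the ones added above. The interpretation is an assumption not confirmed by this change: n_prompt_tokens_processed is taken to count newly evaluated tokens only, and cached tokens are taken to count as already done. The helper name and sample values are hypothetical.

```python
def prompt_progress(slot: dict) -> float:
    """Return the assumed fraction of the prompt handled so far, in [0.0, 1.0]."""
    total = slot.get("n_prompt_tokens", 0)
    processed = slot.get("n_prompt_tokens_processed", 0)
    cached = slot.get("n_prompt_tokens_cache", 0)
    if total <= 0:
        return 0.0  # no prompt assigned to this slot yet
    # Assumption: cached tokens need no evaluation, so they count as done.
    return min((processed + cached) / total, 1.0)


if __name__ == "__main__":
    sample_slot = {  # hypothetical values, not real server output
        "n_prompt_tokens": 100,
        "n_prompt_tokens_processed": 30,
        "n_prompt_tokens_cache": 20,
    }
    print(f"{prompt_progress(sample_slot):.0%}")
```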
The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).
For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.
Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.
ref: https://github.com/ggml-org/llama.cpp/issues/23395
Assisted-by: llama.cpp:local pi
- HunyuanOCR shares the same HF arch and vision layout as HunyuanVL but was split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference.
- Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch
* webui: Add max image size option
* remove magic numbers
* support all image formats
* use const
* Move regex to match b64 images to constants
* use SETTINGS_KEYS to get max image resolution setting
* Do not touch the image if already under the size threshold
* mtmd : deepseek-ocr fixes, improvements and refactoring
- image processing changes to achieve full parity with Pillow (reference impl)
- SAM mask casting only when flash-attn is on
- SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it)
- llama-chat changes to fix server/WebUI issue (new media_markers_first())
- adapted test-chat-template and added test cases for deepseek-ocr
- changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model
- ty.toml ignore unresolved-import for tools/mtmd/tests/**
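For reference, the CER half of that comparison can be sketched as a character-level edit distance divided by the reference length. This is a generic illustration of the metric, not the code of the regression test. The function names are hypothetical and chrF is not shown.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

A lower CER means the OCR output is closer to the ground truth; 0.0 is an exact match.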
* image-text reordering fix removed
* refactor bool add_padding + pad_rounding enum into a single pad_style enum
* mtmd: fit_params now takes mmproj into account
* rename alloc_compute_meta to reserve_compute_meta
* rm unused functions
* add ggml_backend_dev_t support
* add debug log
Add a graphs-reused counter to the per-slot timing output, printed via
llama_perf_context().
Assisted-by: llama.cpp:local pi
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars
* refactor: skip MCP proxy probe when no server requires it
* refactor: suppress expected disconnect errors during MCP client shutdown
* refactor: Deduplicate requests
* refactor: deduplicate model fetching across ROUTER and MODEL modes
* refactor: Clean up models logic
* chore: Add `.env.example` file
* refactor: replace client-side CORS proxy probe with server status flag
* refactor: Post-review fixes
* test: add vitest client setup with API fetch mocks
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
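The resulting precedence can be sketched as follows. The helper names are hypothetical, and only the L2 case (value 2) mentioned above is implemented; other values are passed through without interpretation.

```python
import math


def resolve_embd_normalize(server_default: int, body: dict) -> int:
    """Per-request 'embd_normalize' wins; otherwise use the server-wide flag value."""
    return int(body.get("embd_normalize", server_default))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


if __name__ == "__main__":
    mode = resolve_embd_normalize(2, {})  # no override in the request body
    embedding = [3.0, 4.0]
    if mode == 2:
        embedding = l2_normalize(embedding)
    print(mode, embedding)
```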
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.
See 59778f0196.
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently, when some draft tokens are rejected, speculative decoding
has to restart from a checkpoint, which wastes work by running the
target again. This PR adds the ability to roll back up to
`draft_max` tokens by storing the GDN intermediates.
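The idea can be illustrated with a toy recurrent state that keeps one snapshot per draft token. Everything here is hypothetical: the class, the arithmetic state update, and the list of snapshots stand in for the GDN state and its stored intermediates, and do not reflect the actual tensor layout.

```python
class ToyRecurrentState:
    """Toy stand-in for a recurrent state with per-token snapshots."""

    def __init__(self, draft_max: int) -> None:
        self.draft_max = draft_max
        self.state = 0
        self.snapshots: list[int] = []

    def step(self, token: int) -> None:
        # Arbitrary update rule; a recurrent state cannot be "un-stepped".
        self.state = self.state * 31 + token

    def run_draft(self, tokens: list[int]) -> None:
        """Process draft tokens, storing the state before and after each one."""
        if len(tokens) > self.draft_max:
            raise ValueError("draft longer than draft_max")
        self.snapshots = [self.state]
        for tok in tokens:
            self.step(tok)
            self.snapshots.append(self.state)

    def rollback(self, n_accepted: int) -> None:
        """Restore the state reached after the first n_accepted draft tokens."""
        self.state = self.snapshots[n_accepted]
        self.snapshots = []
```

With the snapshots available, rejecting the tail of a draft restores the matching intermediate directly instead of replaying tokens from an older checkpoint.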
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Support for Codex CLI by skipping unsupported Responses tools
* Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection
* Revert gpt-oss apply_patch special handling