Concedo
1e828ccabf
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/common.cpp
# ggml/CMakeLists.txt
# scripts/sync-ggml.last
# scripts/sync_vendor.py
# src/llama-context.cpp
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-05-17 11:26:18 +08:00
Concedo
9203b6a051
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# .github/workflows/server-sanitize.yml
# .github/workflows/server-self-hosted.yml
# .github/workflows/server.yml
# .github/workflows/ui-build.yml
# .github/workflows/ui-ci.yml
# .github/workflows/ui-publish.yml
# .gitignore
# CMakeLists.txt
# CODEOWNERS
# scripts/ui-download.cmake
# scripts/xxd.cmake
# tests/test-backend-ops.cpp
# tests/test-reasoning-budget.cpp
# tools/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/server/README.md
2026-05-16 22:56:33 +08:00
Aman Gupta
255582687b
llama + spec: MTP Support ( #22673 )
...
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7 )
* MTP: clean-up (#9 )
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently, a speculative checkpoint has to restart from a checkpoint
after some draft tokens are not accepted, which wastes work by
running the target again. This PR adds the ability to roll back up to
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11 )
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13 )
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
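A toy sketch of the partial-rollback idea described in this commit (keep one state snapshot per token so that rejected draft tokens can be undone without re-running from an older checkpoint). This is not the llama.cpp implementation: `toy_recurrent`, its scalar state and the `0.5` decay are hypothetical stand-ins for the GDN state tensor and kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: a scalar recurrent state updated once per token.
struct toy_recurrent {
    float state = 0.0f;
    // snapshots[0] is the state before the last batch,
    // snapshots[i] is the state after the first i tokens of that batch.
    std::vector<float> snapshots;

    // Process a batch of tokens, storing every intermediate state.
    void process(const std::vector<float> & tokens) {
        snapshots.assign(1, state);
        for (float x : tokens) {
            state = 0.5f * state + x; // illustrative update rule, an assumption
            snapshots.push_back(state);
        }
    }

    // Keep only the first n_keep tokens of the last batch.
    // Returns false if n_keep exceeds the number of tokens in that batch.
    bool rollback(std::size_t n_keep) {
        if (n_keep >= snapshots.size()) {
            return false;
        }
        state = snapshots[n_keep];
        snapshots.resize(n_keep + 1);
        return true;
    }
};
```

After a draft of two tokens in which only the first is accepted, `rollback(1)` restores the state that a fresh run over the accepted tokens would have produced.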
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-16 20:06:23 +08:00
Aleksander Grygier
59778f0196
ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming ( #23064 )
...
* webui: Move static build output from `tools/server/public` to `build/ui` directory
* refactor: Move to `tools/ui`
* refactor: rename CMake variables and preprocessor defines
- Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated)
- Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated)
- Backward compat: old vars auto-forward to new ones with DEPRECATION warning
- Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc.
- Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET
- Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines
- Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED
* refactor: rename CLI flags (--webui -> --ui) with backward compat
- Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases)
- Add --ui-config (old --webui-config kept as deprecated alias)
- Add --ui-config-file (old --webui-config-file kept as deprecated alias)
- Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated)
- Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY
- C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields
- Backward compat: old fields synced to new ones in g_params_to_internals
* refactor: update C++ server internals with backward compat
- Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta)
- Rename params.webui usage -> params.ui (both synced, old still works)
- JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys
- Server routes use params.ui_mcp_proxy || params.webui_mcp_proxy
- Preprocessor guards use #if defined(LLAMA_BUILD_UI) || defined(LLAMA_BUILD_WEBUI)
* refactor: rename CI/CD workflows, artifacts, and build script
- Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build
- Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT
- Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks
- Update server.yml: job/artifact refs webui-build -> ui-build
- Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT
- Update server-self-hosted.yml: webui-build -> ui-build
- Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION
- Rename webui-download.cmake -> ui-download.cmake (internal refs updated)
- Update labeler.yml: server/webui -> server/ui path label
* docs: update CODEOWNERS and server README docs
- Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/
- Update server README.md: CLI tables show --ui flags with deprecated --webui aliases
- Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/
* fix: Small fixes for UI build
* fix: CMake.txt syntax
* chore: Formatting
* fix: `.editorconfig` for llama-ui
* chore: Formatting
* refactor: Use `APP_NAME` in Error route
* refactor: Cleanup
* refactor: Single migration service
* make llama-ui a linkable target
* fix: UI Build output
* fix: Missing change
* fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI
* refactor: UI workflows cleanup
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-16 02:02:40 +02:00
Concedo
da2cc90723
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/build-and-test-snapdragon.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# .github/workflows/server-self-hosted.yml
# .github/workflows/server-webui.yml
# .github/workflows/server.yml
# .gitignore
# CMakeLists.txt
# CONTRIBUTING.md
# README.md
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# grammars/README.md
# scripts/snapdragon/qdc/run_qdc_jobs.py
# scripts/snapdragon/qdc/tests/run_backend_ops_posix.py
# scripts/snapdragon/qdc/tests/run_bench_tests_posix.py
# scripts/snapdragon/qdc/tests/utils.py
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
# tools/server/webui/src/lib/components/app/server/ServerLoadingSplash.svelte
# tools/server/webui/src/routes/(chat)/chat/[id]/+page.svelte
# ty.toml
2026-05-15 17:09:48 +08:00
Aleksander Grygier
253ba110bc
webui: Move static build output from repo code to HF Bucket ( #22937 )
...
* ci: add workflow to publish webui to Hugging Face bucket
* ci: add webui release job to release workflow
* ci: test webui release job
* chore: Return to default minification strategy for build output files
* ci: extract webui build into separate workflow and job
* chore: Ignore webui static output + clean up references
* chore: Delete legacy webui static output
* chore: Ignore webui build static output
* fix: Workflow
* fix: Versioning naming
* chore: Update package name
* test: Test CI fix
* refactor: Naming
* server: implement webui build strategy with HF Bucket support
* chore: Remove test workflow
* chore: Use WebUI build workflow call in other workflows
* server: HF Buckets fallback for WebUI build
* refactor: App name variable
* refactor: Naming
* fix: Retrieve loading.html
* fix: workflow syntax
* fix: Rewrite malformed release.yml
* fix: Req param
* test: Re-add missing Playwright installation for CI tests
* refactor: Logic & security improvements
* refactor: Retrieve publishing jobs and DRY the workflows
* fix: Test workflow syntax
* fix: Upstream Release Tag for test workflow
* chore: Remove test workflow
* ci: Run WebUI jobs on `ubuntu-24.04-arm`
* refactor: Post-CR cleanup
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: CI cleanup
* refactor: Cleanup
* test: Test workflow
* refactor: use LLAMA_BUILD_NUMBER instead of LLAMA_BUILD_TAG for HF Bucket webui downloads
* server: add fallback mechanism for HF Bucket webui downloads from latest directory
* fix: Incorrect argument order in file(SHA256) calls for checksum verification
* refactor: Use cmake script for handling the HF Bucket download on build time
* feat: support local npm build for WebUI assets
* refactor: add `HF_ENABLED` flag to control WebUI build/download provisioning
* refactor: Cleanup
* chore: Remove test workflow
* fix: remove s390x from release workflow
* fix: add webui-build dependency to ubuntu-22-rocm and windows-hip
* Revert "fix: remove s390x from release workflow"
This reverts commit debcfffa9bc1e3112eae41f2d29741b682e4eb19.
* fix: Release workflow file
* fix: Proper release tag used for HF Bucket upload
* fix: Remove duplicate steps in release workflow
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-14 13:21:41 +02:00
Concedo
cc82c3164e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/workflows/build-cross.yml
# .github/workflows/build-sycl.yml
# .github/workflows/build.yml
# .github/workflows/editorconfig.yml
# .github/workflows/release.yml
# cmake/riscv64-spacemit-linux-gnu-gcc.cmake
# docs/backend/OPENVINO.md
# docs/backend/SYCL.md
# docs/build-riscv64-spacemit.md
# docs/ops.md
# docs/ops/WebGPU.csv
# embd_res/ggml-vocab-qwen35.gguf
# embd_res/ggml-vocab-qwen35.gguf.inp
# embd_res/ggml-vocab-qwen35.gguf.out
# examples/model-conversion/Makefile
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# ggml/src/ggml-zendnn/CMakeLists.txt
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/snapdragon/adb/run-completion.sh
# tests/CMakeLists.txt
# tools/cli/README.md
# tools/completion/README.md
# tools/mtmd/clip-impl.h
# tools/mtmd/clip.cpp
# tools/mtmd/clip.h
# tools/server/README.md
2026-05-14 19:04:04 +08:00
Georgi Gerganov
67b2b7f2f2
logs : reduce ( #23021 )
...
* logs : reduce
* args : fix envs
* server : fix build
* common : print verbosity level at start
* server : clean-up logs
* server : print prompt processing timings + sampling params
* minor : whitespaces
2026-05-14 13:05:52 +03:00
Georgi Gerganov
634275fbbb
spec : update CLI arguments for better consistency ( #22964 )
...
* spec : update CLI arguments for better consistency
* cont : fix CLI arg message
2026-05-13 09:15:39 +03:00
Concedo
f7923b261f
need to fix cuda compile. Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# examples/speculative-simple/README.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-cuda/im2col.cu
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# tests/test-backend-ops.cpp
# tools/cli/README.md
# tools/mtmd/CMakeLists.txt
# tools/server/README.md
2026-05-12 20:47:07 +08:00
Georgi Gerganov
68e7ea3eab
spec : parallel drafting support ( #22838 )
...
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images through the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the spec types already provided by the user
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
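A minimal sketch of the selection rule described above: score each speculator by its acceptance rate (`n_acc_tokens / n_gen_drafts`) times the length of its current draft, and verify the draft with the highest score. The struct and function names are hypothetical and not taken from the PR; scoring a speculator with no history as 0 is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-speculator statistics (names are illustrative only).
struct spec_candidate {
    int n_acc_tokens; // draft tokens accepted so far
    int n_gen_drafts; // draft tokens generated so far
    int draft_len;    // length of the draft proposed in this round
};

// Expected number of accepted tokens: acceptance rate times draft length.
static double expected_accepted(const spec_candidate & c) {
    if (c.n_gen_drafts <= 0) {
        return 0.0; // assumption: no history yet means a score of 0
    }
    const double p_accept = static_cast<double>(c.n_acc_tokens) / c.n_gen_drafts;
    return p_accept * c.draft_len;
}

// Returns the index of the best-scoring candidate, or -1 if there are none.
// Ties are resolved in favor of the earlier candidate.
static int pick_best(const std::vector<spec_candidate> & cands) {
    int    best       = -1;
    double best_score = -1.0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        const double score = expected_accepted(cands[i]);
        if (score > best_score) {
            best_score = score;
            best       = static_cast<int>(i);
        }
    }
    return best;
}
```

Under this rule a long draft from a less accurate speculator can win over a short draft from a more accurate one, because the score is the product of both factors.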
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-05-11 19:09:43 +03:00
Concedo
45f8ff49bb
Merge commit '52e5f0a5c1' into concedo_experimental
...
# Conflicts:
# examples/gen-docs/gen-docs.cpp
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/binary.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/rms_norm_mul.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/ssm_scan.wgsl
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-reasoning-budget.cpp
# tools/llama-bench/llama-bench.cpp
# tools/rpc/rpc-server.cpp
# tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
# tools/server/webui/src/lib/components/app/chat/ChatSidebar/ChatSidebar.svelte
# tools/server/webui/src/routes/(chat)/+page.svelte
2026-04-29 22:27:36 +08:00
Georgi Gerganov
14e733e36f
spec : refactor params ( #22397 )
...
* spec : refactor params
* cont : fix
* cont : rename "sparam" to "sampling"
* cont : add spec params category
* cont : add info about removed arguments
* cont : skip param length check for spec params
* cont : adapt server tests
2026-04-28 09:07:33 +03:00
Concedo
340b22283e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# .gitignore
# docs/backend/SYCL.md
# docs/backend/snapdragon/README.md
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/mmvq.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_blk.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
# scripts/server-test-structured.py
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/qdc/requirements.txt
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-completion.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tools/cli/cli.cpp
# ty.toml
2026-04-25 12:13:14 +08:00
Matthias Straka
0dd7f915fd
cli : cleanup auto-completion code ( #21745 )
2026-04-23 15:03:28 +02:00
Concedo
0755f27372
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/openvino.Dockerfile
# .github/workflows/build-self-hosted.yml
# .github/workflows/build.yml
# common/chat.cpp
# docs/backend/OPENVINO.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-openvino/ggml-decoder.cpp
# ggml/src/ggml-openvino/ggml-openvino-extra.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-openvino/ggml-quants.cpp
# ggml/src/ggml-openvino/openvino/op/rope.cpp
# ggml/src/ggml-openvino/openvino/op_table.cpp
# ggml/src/ggml-openvino/openvino/op_table.h
# ggml/src/ggml-openvino/openvino/translate_session.cpp
# ggml/src/ggml-openvino/openvino/utils.cpp
# ggml/src/ggml-openvino/openvino/utils.h
# ggml/src/ggml-openvino/utils.cpp
# ggml/src/ggml-openvino/utils.h
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/mtmd/CMakeLists.txt
# tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
Ethan Turner
750579ff14
common: Refactoring sampler parameters ( #20429 ) ( #22233 )
...
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.
Issue: https://github.com/ggml-org/llama.cpp/issues/20429
Original PR: https://github.com/ggml-org/llama.cpp/pull/20297
2026-04-22 10:40:19 +02:00
Concedo
19a12bb080
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# common/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/sync-ggml.last
# tools/cli/cli.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device ( #22171 )
...
* fit-params : add option to output estimated memory per device
* cont : minor
* cont : refactor
* cont : move fit params implementation to libcommon
* cont : header
* cont : headers
* cont : codeowners
2026-04-21 09:54:36 +03:00
Concedo
cd6788007e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cross.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# examples/llama.android/lib/src/main/cpp/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/test-chat.cpp
# tests/test-mtmd-c-api.c
# tools/server/README.md
2026-04-20 20:19:11 +08:00
Georgi Gerganov
de71b5f81c
server : refactor "use checkpoint" logic ( #22114 )
2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots ( #21741 )
2026-04-20 08:30:24 +03:00
Sascha Rogmann
455d8e4be8
server : speculative checkpointing ( #19493 )
...
* server : speculative decoding using checkpoints
* server : fix draft check with checkpoints
* server : rename spec vars
* server : log levels
* server : refactored spec logic to speculative.cpp
* server : renamed spec checkpoints option
* server : fix spec checkpoints, logging
* speculative : checkpoints with draft model, logging
* server : n_tokens_cur and create_checkpoint in draft
* server : fix server_speculative_callback (slot.id)
* spec : fix ngram-map/begin idx_last_check
* spec : init ckpt (begin() wasn't called)
* chore: update webui build output
* server : restore sampler in spec checkpoint and clear mem
* cont : avoid --spec-use-checkpoints argument
* cont : remove server_prompt_checkpoint_with_size
* spec : rename (leave_draft_state)
* cont : clean-up
* cont : do not ignore partial drafts even if they are short
* cont : spec callback owned by session
* cont : simplify
* cont : avoid empty speculative session
* cont : simplify
* cont : simplify
* cont : enable mtmd speculative decoding
* cont : keep the spec sampler alive
* cont : simplify
* cont : fix nullptr deref + draft checkpoints
* cont : remove common_speculative_accept_response
* cont : remove callback
* cont : simplify
* cont : minor
* cont : simplify
* cont : fix accepted number
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common ( #21936 )
...
* cmake : allow libcommon to be shared
* cmake : rename libcommon to libllama-common
* cont : set -fPIC for httplib
* cont : export all symbols
* cont : fix build_info exports
* libs : add libllama-common-base
* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Concedo
2e4f94822e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-self-hosted.yml
# .github/workflows/docker.yml
# ci/run.sh
# docs/build.md
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# src/llama-vocab.cpp
# tests/test-chat.cpp
# tests/test-jinja.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-04-04 14:27:23 +08:00
Yes You Can Have Your Own
50e0ad08fb
server: save and clear idle slots on new task (--clear-idle) ( #20993 )
...
* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)
* server: move idle slot KV clearing to slot release
The save "cost" is now paid by the finishing request.
* server: add --kv-clear-idle flag, enable by default
* server: skip clearing last idle slot, clear on launch
* server: test --no-kv-clear-idle flag
* server: simplify on-release clearing loop
* server: remove on-release KV clearing, keep launch-only
* cont : clean-up
* tests: update log strings after --clear-idle rename
* tests: use debug tags instead of log message matching
* test: fix Windows CI by dropping temp log file unlink
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-03 19:02:27 +02:00
Concedo
34ad53e950
merged support for gemma4. the e2b, e4b and 26b work, the 31b does not
2026-04-03 11:07:46 +08:00
Ruben Ortlam
5803c8d115
tests: allow exporting graph ops from HF file without downloading weights ( #21182 )
...
* tests: allow exporting graph ops from HF file without downloading weights
* use unique_ptr for llama_context in HF metadata case
* fix missing non-required tensors falling back to type f32
* use unique pointers where possible
* use no_alloc instead of fixing f32 fallback
* fix missing space
2026-04-02 18:19:20 +02:00
Concedo
42ad89cd86
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/cann.Dockerfile
# .devops/cpu.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/nix/package.nix
# .github/workflows/build-android.yml
# .github/workflows/build-cann.yml
# .github/workflows/build-msys.yml
# .github/workflows/docker.yml
# .github/workflows/editorconfig.yml
# .github/workflows/gguf-publish.yml
# .github/workflows/python-lint.yml
# .github/workflows/release.yml
# CMakeLists.txt
# docs/backend/CANN.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-rpc/ggml-rpc.cpp
# scripts/sync_vendor.py
# tests/test-chat-auto-parser.cpp
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
# tests/test-reasoning-budget.cpp
# tools/cli/cli.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2026-03-30 20:45:38 +08:00
Sigbjørn Skjæret
c46758d28f
cli : add /glob command ( #21084 )
...
* add /glob command
* output error when max files reached
* support globbing outside curdir
2026-03-28 02:33:04 +01:00
Adrien Gallouët
5c1a7b8355
server : add custom socket options to disable SO_REUSEPORT ( #21056 )
...
* server : add custom socket options to disable SO_REUSEPORT
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add --reuse-port
$ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
$ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Update tools/server/README.md (llama-gen-docs)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Fix windows
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
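A rough POSIX-only sketch of the socket setup visible in the strace output above: `TCP_NODELAY` and `SO_REUSEADDR` are always set, and `SO_REUSEPORT` only when requested. The helper names are hypothetical and this is not the server's actual code; it only uses the standard `socket`, `setsockopt` and `getsockopt` calls.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: create a TCP socket and apply the options from the trace.
// Returns the file descriptor, or -1 on failure.
static int make_server_socket(bool reuse_port) {
    const int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    const int one = 1;
    bool ok = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) == 0 &&
              setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == 0;
    if (ok && reuse_port) {
        // Only set when the caller opts in, mirroring the --reuse-port flag.
        ok = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) == 0;
    }
    if (!ok) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: report whether a boolean socket option is enabled.
static bool option_enabled(int fd, int level, int name) {
    int       val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, level, name, &val, &len) != 0) {
        return false;
    }
    return val != 0;
}
```

Calling `option_enabled(fd, SOL_SOCKET, SO_REUSEPORT)` on the returned descriptor is one way to check which of the two traces a given configuration corresponds to.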
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 01:12:43 +01:00
Xuan-Son Nguyen
20197b6fe3
server: add built-in tools backend support ( #20898 )
...
* wip: server_tools
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* change arg to --tools all
* add readme mention
* llama-gen-docs
2026-03-27 10:07:11 +01:00
Concedo
6054bacadd
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/ai-issues.yml
# CONTRIBUTING.md
# docs/autoparser.md
# docs/ops.md
# docs/ops/Metal.csv
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hex-dma.h
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hip/CMakeLists.txt
# models/templates/Apriel-1.6-15b-Thinker-fixed.jinja
# models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
# models/templates/deepseek-ai-DeepSeek-V3.1.jinja
# models/templates/llama-cpp-deepseek-r1.jinja
# models/templates/meetkai-functionary-medium-v3.1.jinja
# scripts/fetch_server_test_models.py
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/adb/run-tool.sh
# tests/test-chat-auto-parser.cpp
# tests/test-chat-peg-parser.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/server/README.md
2026-03-21 12:06:01 +08:00
Piotr Wilkin (ilintar)
5e54d51b19
common/parser: add proper reasoning tag prefill reading ( #20424 )
...
* Implement proper prefill extraction
* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp
* Update tools/server/server-task.cpp
* refactor: move grammars to variant, remove grammar_external, handle exception internally
* Make code less C++y
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
Concedo
48f914e374
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ci/run.sh
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/arch/riscv/repack.cpp
# ggml/src/ggml-cpu/arch/x86/repack.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-exp.h
# ggml/src/ggml-hexagon/htp/hvx-sigmoid.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync-ggml.last
# tests/test-backend-sampler.cpp
# tests/test-chat.cpp
# tests/test-jinja.cpp
# tools/cli/cli.cpp
2026-03-19 02:23:06 +08:00
Piotr Wilkin (ilintar)
d2ecd2d1cf
common/parser: add --skip-chat-parsing to force a pure content parser. ( #20289 )
...
* Add `--force-pure-content` to force a pure content parser.
* Update common/arg.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Change parameter name [no ci]
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Concedo
b1c500ae2b
Merge commit '2948e6049a' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CONTRIBUTING.md
# docs/backend/VirtGPU/development.md
# docs/ops.md
# docs/ops/WebGPU.csv
# embd_res/templates/GigaChat3-10B-A1.8B.jinja
# embd_res/templates/GigaChat3.1-10B-A1.8B.jinja
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-grammar-integration.cpp
# tests/test-quantize-fns.cpp
2026-03-15 11:21:24 +08:00
Concedo
67c9798d0b
Merge commit '3ca19b0e9f' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# common/CMakeLists.txt
# common/chat-peg-parser.cpp
# docs/backend/SYCL.md
# docs/ops.md
# docs/ops/SYCL.csv
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/rope.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/compare-llama-bench.py
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tools/cli/cli.cpp
2026-03-15 11:11:31 +08:00
Concedo
04915d99ee
Merge commit '451ef08432' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/ops.md
# docs/ops/Vulkan.csv
# src/llama-model-loader.cpp
# src/llama-model.cpp
# src/llama.cpp
# tests/CMakeLists.txt
# tests/peg-parser/test-basic.cpp
# tests/peg-parser/test-json-parser.cpp
# tests/peg-parser/test-python-dict-parser.cpp
# tests/peg-parser/test-unicode.cpp
# tests/test-chat-auto-parser.cpp
# tests/test-chat-peg-parser.cpp
# tests/test-chat.cpp
# tools/CMakeLists.txt
2026-03-13 23:33:37 +08:00
Ruben Ortlam
128142fe7d
test-backend-ops: allow loading tests from file and parsing model operators into file ( #19896 )
...
* tests: allow loading test-backend-ops tests from json
* add error threshold based on op
* add error when file cannot be read
* add graph operator json extraction tool
* add nb parameter for non-contiguous input tensors
* fix view check
* only use view if non-contiguous/permuted, use C++ random instead of rand()
* replace internal API calls with public llama_graph_reserve call
* reduce test description length
* fix nb[0] not getting set for view
* add name to tests
* fix inplace error
* use text file instead of json
* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/
* fix missing declaration
* use pragma once
* fix indent
* fix Windows build
2026-03-12 13:26:00 +01:00
ddh0
4a748b8f15
common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up ( #20416 )
2026-03-12 00:13:28 +01:00
Piotr Wilkin (ilintar)
acb7c79069
common/parser: handle reasoning budget ( #20297 )
...
* v1
* Finished!
* Handle cli
* Reasoning sampler
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Less explosive terminology :)
* Add utf-8 case and tests
* common : migrate reasoning budget sampler to common
* cont : clean up
* cont : expose state and allow passing as initial state
* cont : remove unused imports
* cont : update state machine doc string
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Concedo
6adcd0b5db
Merge commit '34df42f7be' into concedo_experimental
...
# Conflicts:
# README.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-arith.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-inverse.h
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# tests/test-backend-ops.cpp
# tools/cli/cli.cpp
# tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
2026-03-10 22:20:04 +08:00
Concedo
746664fde6
Merge commit '2cd20b72ed' into concedo_experimental
...
# Conflicts:
# CONTRIBUTING.md
# docs/backend/CANN.md
# docs/backend/SYCL.md
# docs/backend/snapdragon/README.md
# docs/backend/snapdragon/windows.md
# docs/build.md
# docs/multimodal/MobileVLM.md
# docs/ops.md
# docs/ops/WebGPU.csv
# examples/debug/README.md
# examples/llama.vim
# examples/model-conversion/README.md
# examples/sycl/README.md
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-cpu/arch/x86/repack.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp-drv.cpp
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-copy.h
# ggml/src/ggml-hexagon/htp/hvx-inverse.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/worker-pool.c
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cpy.cl
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/quants.hpp
# ggml/src/ggml-sycl/softmax.cpp
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/pr2wt.sh
# scripts/server-bench.py
# scripts/snapdragon/windows/run-cli.ps1
# tests/test-alloc.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/completion/README.md
# tools/cvector-generator/cvector-generator.cpp
# tools/imatrix/README.md
# tools/perplexity/README.md
# tools/server/public_simplechat/readme.md
# tools/server/tests/README.md
2026-03-10 22:11:08 +08:00
Johannes Gäßler
a976ff081b
llama: end-to-end tests ( #19802 )
...
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
2026-03-08 12:30:21 +01:00
Piotr Wilkin (ilintar)
f5ddcd1696
Checkpoint every n tokens: squash ( #20087 )
2026-03-06 11:39:26 +01:00
Aleksander Grygier
f6235a41ef
webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts ( #18655 )
2026-03-06 10:00:39 +01:00
Marcel Petrick
92f7da00b4
chore : correct typos [no ci] ( #20041 )
...
* fix(docs): correct typos found during code review
Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.
Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
* Update docs/backend/CANN.md
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"
This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.
* Update tests/test-backend-ops.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
Concedo
4e358265a3
Merge commit '8387ffb28d' into concedo_experimental
...
# Conflicts:
# docs/backend/VirtGPU.md
# docs/backend/ZenDNN.md
# ggml/src/ggml-cpu/amx/amx.cpp
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-sycl/add-id.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
# ggml/src/ggml-virtgpu/backend/backend-dispatched.h
# ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h
# ggml/src/ggml-virtgpu/backend/backend.cpp
# ggml/src/ggml-virtgpu/backend/shared/api_remoting.h
# ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h
# ggml/src/ggml-virtgpu/backend/shared/apir_backend.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h
# ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-virtgpu/ggml-remoting.h
# ggml/src/ggml-virtgpu/include/apir_hw.h
# ggml/src/ggml-virtgpu/regenerate_remoting.py
# ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
# ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
# ggml/src/ggml-zendnn/CMakeLists.txt
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-tokenizer-0.sh
# tools/cli/README.md
# tools/completion/README.md
# tools/imatrix/imatrix.cpp
# tools/server/README.md
2026-02-28 12:45:16 +08:00