koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-26 07:33:39 +00:00

Author	SHA1	Message	Date
Aldehir Rojas	b22ff4b7b4	cmake/ui : refactor the build (#23352 )	2026-05-23 17:08:22 -04:00
Aditya Singh	c0c7e147e7	requirements : bump torch to 2.11.0 (#23503 ) * requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on PyPI for platform/CPython combinations where 2.6.x is not present. The accompanying comment already says 'PyTorch 2.6.0 or later', so the looser >=2.6.0 matches the documented intent and unblocks pip install -r requirements/requirements-convert_hf_to_gguf.txt. Fixes #23408 * requirements: bump torch floor to 2.11.0 per maintainer * requirements: pin torch to ==2.11.0 per project policy * requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy * requirements: suppress check_requirements pin warning on mtmd The check_requirements script flags '==' on lines in files matched by //requirements.txt. Append the documented suppression comment to the pinned torch and torchvision lines (and to the s390x platform marker lines) so the check passes while keeping the pins required by project policy. * ty: silence Tensor/Module union check on model[0].auto_model With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns Tensor \| Module rather than Module, so model[0].auto_model fails ty on the SentenceTransformer code path. The runtime behavior is unchanged because SentenceTransformer always wraps a Module at index 0. Adding a targeted unresolved-attribute ignore keeps the type-check green without altering behavior. A follow-up issue tracks typing the variable explicitly.	2026-05-23 18:24:39 +02:00
Aldehir Rojas	1acee6bf89	server: only parse empty msg if continuing an assistant msg (#23506 )	2026-05-22 11:58:15 -04:00
fairydreaming	ef570f6308	perplexity : fix integer overflow (#23496 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-22 15:50:44 +03:00
Georgi Gerganov	bbce619adb	cmake : add install() for impl libraries + fix apple builds (#23511 ) * pi : update * ci : fix ios build * ci : fix andoroid * ci : fix apple builds * cmake : add install() for impl libraries Add install(TARGETS <target> LIBRARY) for all -impl libraries that were changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in commit `bb28c1fe2`. Without this, cmake --install fails to copy the shared libraries, causing runtime errors like: llama-server: error while loading shared libraries: libllama-server-impl.so Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515 Assisted-by: llama.cpp:local pi * ci : fix xcframework build	2026-05-22 11:46:26 +03:00
Georgi Gerganov	bb28c1fe24	cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by default (#23462 ) * cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control Remove explicit STATIC from all -impl libraries (server, cli, completion, bench, batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls shared vs static linkage. Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows. Assisted-by: llama.cpp:local pi * cmake : enable LLAMA_BUILD_APP by default Assisted-by: llama.cpp:local pi * ci : disable app in build-cmake-pkg.yml	2026-05-21 21:13:59 +03:00
ScrewTSW	b65bb4baae	server: expose prompt token counts in /slots endpoint (#23454 ) Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache to the /slots JSON response. These fields are already tracked internally but were not exposed, making it impossible for clients to monitor prompt evaluation progress during processing.	2026-05-21 13:29:13 +02:00
Aman Gupta	52fb93a2bd	server : free draft/MTP resources on sleep to fix VRAM leak (#23461 ) The destroy() function in server_context_impl only cleaned up the main model and context (via llama_init.reset()) but did not free the speculative decoder (spec), draft context (ctx_dft), or draft model (model_dft). For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated resources (KV cache, compute buffers) that are not freed when entering the sleeping state. On each sleep/resume cycle, new resources are allocated without the old ones being freed, leading to a VRAM leak that eventually crashes the server with out-of-memory errors. Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy() before resetting llama_init, ensuring proper cleanup order to avoid use-after-free. ref: https://github.com/ggml-org/llama.cpp/issues/23395 Assisted-by: llama.cpp:local pi	2026-05-21 16:11:11 +08:00
Pascal	c9021714e8	server: re-inject subcommand when router spawns children under unified binary (#23442 )	2026-05-21 10:09:19 +02:00
Adrien Gallouët	1d7ab2b947	app : add batched-bench, fit-params, quantize & perplexity (#23459 ) * app : add batched-bench, fit-params, quantize & perplexity Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add missing main.cpp Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add EOL Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-21 10:29:44 +03:00
Aleksander Grygier	5e932a1c8d	ui: Improve Git Hooks for UI development (#23403 ) * refactor: Improve Git Hooks for UI development * fix: Address review comments * fix: Use absolute git path for `/hooks` Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com>	2026-05-21 08:27:50 +02:00
wendadawen	6a257d4463	mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (#23329 ) - HunyuanOCR shares the same HF arch and vision layout as HunyuanVL but was split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference. - Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch	2026-05-21 00:35:37 +02:00
stduhpf	3a479c9132	ui: Add max image size option (#22849 ) * webui: Add max image size option * remove magic numbers * support all image formats * use const * Move regex to match b64 images to constants * use SETTINGS_KEYS to get max image resolution setting * Do not touch the image if already under the size threshold	2026-05-21 00:00:09 +02:00
Saba Fallah	a8681a0ed2	mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (#23345 ) * mtmd : deepseek-ocr fixes, improvements and refactoring - image processing changes to achieve full parity with Pillow (reference impl) - SAM mask casting only when flash-attn is on - SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it) - llama-chat changes to fix server/WebUI issue (new media_markers_first()) - adapted test-chat-template and added test cases for deepseek-ocr - changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model - ty.toml ignore unresolved-import for tools/mtmd/tests/** * image-text reordering fix removed * refactor bool add_padding + pad_rounding enum into a single pad_style enum	2026-05-20 17:37:10 +02:00
Aleksander Grygier	6ce96713de	feat: Add WAV MIME type variants and improve audio format detection (#23396 )	2026-05-20 16:55:24 +02:00
Adrien Gallouët	29f1482221	app : introduce the llama unified executable (#23296 ) * app : introduce the llama unified executable Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use serve for server Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Hide completion and bench, add help command Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove STATIC Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use -impl targets instead of -lib Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Revert "Remove STATIC" This reverts commit cc44caccb9902b34a3531633edac911e5b3d65cd. --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-20 13:22:22 +02:00
Aleksander Grygier	e6b4acfe86	refactor: Move text attachments up before the message content in chat completions payload (#23406 )	2026-05-20 13:04:01 +02:00
Xuan-Son Nguyen	e2b129e1bf	mtmd: fit_params now take into account mmproj (#21489 ) * mtmd: fit_params now take into account mmproj * rename alloc_compute_meta to reserve_compute_meta * rm unused functions * add ggml_backend_dev_t support * add debug log	2026-05-20 11:27:44 +02:00
Aleksander Grygier	5028447384	ui: Refactor `isMobile` as reactive value in `viewport` store (#23330 ) * refactor: `isMobile` as reactive value in `viewport` store * refactor: Use Svelte media query for the viewport store	2026-05-20 10:52:00 +02:00
Aleksander Grygier	585080d310	fix: Div wrapper no pointer events on hidden (#23390 )	2026-05-20 09:46:31 +02:00
Aleksander Grygier	67ace021da	refactor: Chat Screen UI rendering (#23333 )	2026-05-19 22:38:42 +02:00
Johannes Gäßler	7256fce047	common: fix --fit verbosity with --verbosity 4 (#23282 )	2026-05-19 21:33:23 +02:00
Georgi Gerganov	d14ce3dab4	llama : MTP clean-up (#23269 ) * llama : disable equal splits for recurrent memory with partial rollback * spec : re-enable p-min with MTP drafts * spec : re-enable ngram spec in combination with RS rollback * spec : fix ngram-map-* params * spec : fix acceptance logic in combined ngram + draft configs * graph : fix reuse for combined `token` + `embd` batches * spec : log parameters for each speculative implementation - add LOG_INF in each constructor with implementation type and parameters - extract device string logic into common_speculative_get_devices_str() - move 'adding speculative implementation' log from init into constructors Assisted-by: llama.cpp:local pi * spec : extend --spec-default with ngram-map-k4v Assisted-by: llama.cpp:local pi * minor : fix n_embd log * args : update draft.n_max == 3 + regen docs * spec : relax ngram-mod rejection thold to 0.25 @ 5 low * logs : improve * docs : update speculative decoding CLI argument documentation - Add missing draft model CPU scheduling and tensor override parameters - Update --spec-type to include all available types (excluding draft-eagle3 WIP) - Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0) - Remove deprecated options (spec-draft-ctx-size, spec-draft-replace) - Add environment variables for new parameters Assisted-by: llama.cpp:local pi * arg : step-back on adding k4v to the default spec config * cont : fix name	2026-05-19 15:32:58 +03:00
Aleksander Grygier	6db130445d	ui: Bump packages + address build warnings (#23300 ) * chore: Update vulnerable packages * chore: Formatting * refactor: Update Tailwind CSS imports * ci: Use `ubuntu-latest` for Unit/E2E UI tests * chore: Bump package * fix: Add missing tag * refactor: Enums files naming	2026-05-19 10:16:04 +02:00
Pascal	ccee426426	server-context: guarantee there is at least 1 token to decode (#23280 )	2026-05-19 09:49:01 +03:00
Georgi Gerganov	3c81c8deea	server : print graphs reused in slot timings (#23279 ) Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-05-19 09:46:58 +03:00
Aleksander Grygier	3a9c1b854d	ui: Update KaTeX package and clean up logs from `sass` warnings (#23275 ) * ui: migrate katex imports to @use to resolve SCSS deprecation warnings * ci: Use `ubuntu-slim` for CI (UI) workflow	2026-05-18 16:26:01 +02:00
Aleksander Grygier	b9a2170fce	feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270 )	2026-05-18 16:17:21 +02:00
Aleksander Grygier	1ff0fc1384	ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236 ) * refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars * refactor: skip MCP proxy probe when no server requires it * refactor: suppress expected disconnect errors during MCP client shutdown * refactor: Deduplicate requests * refactor: deduplicate model fetching across ROUTER and MODEL modes * refactor: Clean up models logic * chore: Add `.env.example` file * refactor: replace client-side CORS proxy probe with server status flag * refactor: Post-review fixes * test: add vitest client setup with API fetch mocks	2026-05-18 16:09:40 +02:00
Aleksander Grygier	a135ec0baa	ui: Centralize monospace font styles in app.css (#23272 )	2026-05-18 15:10:14 +02:00
Martin Andersson	232f466583	webui: fix Tailwind v4 utility classes missing when built via cmake (#23253 )	2026-05-18 14:08:02 +02:00
Aldehir Rojas	87589042ca	cmake : fix LLAMA_BUILD_UI logic (#23190 )	2026-05-17 14:42:26 -04:00
Aman Gupta	3e12fbdea5	llama: avoid copying logits during prompt decode in MTP (#23198 ) * llama: avoid copying logits during prompt decode in MTP * review: update comment * llama-graph: call set_output for t_h_pre_norm	2026-05-17 23:30:25 +08:00
Aldehir Rojas	39cf5d6191	common : delegate assistant continuation to underlying template handlers (#23089 ) * common : delegate assistant continuation to template handler * server : implement echo parameter to exclude assistant prefill in the response * server : fix tests for prefill * server : use existing llama template * cont : clean up	2026-05-17 13:36:05 +02:00
Rares Vernica	1a68ec9378	server : honor --embd-normalize CLI arg (#23125 ) The --embd-normalize flag was registered only for the embedding and debug examples, so llama-server rejected it and the /embedding handler used a hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's example set and read params.embd_normalize as the handler's default. The per-request "embd_normalize" body field continues to override.	2026-05-17 09:39:04 +03:00
Judd	4f13cb7424	webui: support video files as input (#22830 )	2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen	b64739ea39	server: (router) alloc tmp buffer on heap (#23159 )	2026-05-16 23:42:16 +02:00
Pascal	64b38b561b	server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137 )	2026-05-16 21:21:06 +02:00
Aleksander Grygier	0253fb21f5	ui: Add request timeout for MCP tool calls (#23138 ) * feat: Add request timeout for MCP tool calls in llama-ui * feat: MCP Settings tab with max timeout setting	2026-05-16 15:20:27 +02:00
Holger Voormann	25b1bc9c2f	ui: Correct links in `tools/ui/README.md` [no ci] (#23139 ) In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`. See `59778f0196`.	2026-05-16 14:42:38 +02:00
Aman Gupta	255582687b	llama + spec: MTP Support (#22673 ) * spec: support MTP * fix batch size * rename files * cont : simplify (#7) * MTP: clean-up (#9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. 
- Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: `8c05923630` Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (#11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (#13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-16 20:06:23 +08:00
kubawoo	b81c2cdd74	ui: Fix handling of MCP resource template parameters (#23117 ) * Fix handling of MCP resource template parameters * Fix formatting for uri-template.test.ts --------- Co-authored-by: kuba <kuba@laptop.local.net>	2026-05-16 13:25:41 +02:00
viggy	1428004808	webui : [ChatFormActionAdd][a11y] fix accessibility issues in add menu trigger and items (#22736 ) * fix tab order on attach button, and don't focus on disabled menu item * add a11y tests	2026-05-16 12:00:46 +02:00
Pascal	366c5e2a3b	ui: untrack settings sync in props effect to prevent reactive loop (#23127 )	2026-05-16 11:25:34 +02:00
Aleksander Grygier	59778f0196	ui: Restructure repo to use `tools/ui` folder and `ui` / `UI` / `llama-ui` / `LLAMA_UI` naming (#23064 ) * webui: Move static build output from `tools/server/public` to `build/ui` directory * refactor: Move to `tools/ui` * refactor: rename CMake variables and preprocessor defines - Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated) - Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated) - Backward compat: old vars auto-forward to new ones with DEPRECATION warning - Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc. - Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET - Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines - Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED * refactor: rename CLI flags (--webui -> --ui) with backward compat - Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases) - Add --ui-config (old --webui-config kept as deprecated alias) - Add --ui-config-file (old --webui-config-file kept as deprecated alias) - Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated) - Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY - C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields - Backward compat: old fields synced to new ones in g_params_to_internals * refactor: update C++ server internals with backward compat - Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta) - Rename params.webui usage -> params.ui (both synced, old still works) - JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys - Server routes use params.ui_mcp_proxy \|\| params.webui_mcp_proxy - Preprocessor guards use #if defined(LLAMA_BUILD_UI) \|\| defined(LLAMA_BUILD_WEBUI) * refactor: rename CI/CD workflows, artifacts, and build script - Rename 
webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build - Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT - Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks - Update server.yml: job/artifact refs webui-build -> ui-build - Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT - Update server-self-hosted.yml: webui-build -> ui-build - Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION - Rename webui-download.cmake -> ui-download.cmake (internal refs updated) - Update labeler.yml: server/webui -> server/ui path label * docs: update CODEOWNERS and server README docs - Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/ - Update server README.md: CLI tables show --ui flags with deprecated --webui aliases - Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/ * fix: Small fixes for UI build * fix: CMake.txt syntax * chore: Formatting * fix: `.editorconfig` for llama-ui * chore: Formatting * refactor: Use `APP_NAME` in Error route * refactor: Cleanup * refactor: Single migration service * make llama-ui a linkable target * fix: UI Build output * fix: Missing change * fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI * refactor: UI workflows cleanup --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-05-16 02:02:40 +02:00
Julien Chaumond	6831fe470c	docs: document `usage` object in server timings response (#23110 ) * docs: document `usage` object in server timings response Co-Authored-By: julien-agent <Agents+cyolo@huggingface.co> * Apply suggestion from @julien-c --------- Co-authored-by: julien-agent <Agents+cyolo@huggingface.co>	2026-05-15 19:33:12 +02:00
Xuan-Son Nguyen	72e60f500d	mtmd: add chunks and fix preproc for qwen3a (#23073 ) * mtmd: add chunks and fix preproc for qwen3a * add attn_mask * limit mtmd_chunk size (avoid blow up memory) * correct audio tokens * re-order the set_input case * remove attn_mask	2026-05-15 19:32:47 +02:00
Pascal	8be1786707	webui: fix theme from --webui-config-file not applied on first load (fresh localStorage) (#22902 )	2026-05-15 19:25:38 +02:00
Pascal	d528444580	webui: preserve partial response on streaming error (#23090 )	2026-05-15 11:18:11 +02:00
Sid Shaytay	91e84fed64	Support for Codex CLI by skipping unsupported Responses tools (#23041 ) * Support for Codex CLI by skipping unsupported Responses tools * Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection * Revert gpt-oss apply_patch special handling	2026-05-15 09:03:24 +02:00

848 commits