koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-22 11:16:08 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	3c81c8deea	server : print graphs reused in slot timings (#23279 ) Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-05-19 09:46:58 +03:00
Georgi Gerganov	cd963fee6a	save-load-state : refactor tests and improve readability (#23196 ) * save-load-state : refactor into separate phase functions - Split monolithic main() into 4 self-contained phase functions, each managing its own context/sampler/batch lifecycle - Each function tokenizes internally using its local ctx instance - main() is now a clean orchestrator: init -> run phases -> assert results - Proper resource cleanup on every exit path (return {} on error) Assisted-by: llama.cpp:local pi * save-load-state : use params.out_file instead of separate state_file - Remove state_file parameter from all phase functions - Each function accesses params.out_file directly - Initialize params.out_file in main alongside params.prompt Assisted-by: llama.cpp:local pi * save-load-state : use smart pointers for ctx and smpl - Replace raw llama_context* with llama_context_ptr - Replace raw llama_sampler* with llama_sampler_ptr - Remove all manual llama_free() and llama_sampler_free() calls - Keep llama_batch as raw (managed manually with llama_batch_free) Assisted-by: llama.cpp:local pi * save-load-state : add local llama_batch_ptr RAII wrapper - Add llama_batch_ptr struct holding llama_batch by value - Calls llama_batch_free() in destructor - Eliminates all manual llama_batch_free() calls Assisted-by: llama.cpp:local pi * save-load-state : replace printf/fprintf with logging macros - Add log.h include - Replace fprintf(stderr, ...) errors with LOG_ERR - Replace fprintf(stderr, ...) info with LOG_TRC - Replace printf output with LOG Assisted-by: llama.cpp:local pi * save-load-state : refactor tests to check results inline Each follow-up phase now accepts an expected result and performs the comparison internally instead of collecting results in main(). Assisted-by: llama.cpp:local pi * save-load-state : improve test output readability Add phase labels, remove redundant run prefixes, and show PASS after each test. 
Assisted-by: llama.cpp:local pi * pi : add rule about git signing * save-load-state : simplify llama_batch_ptr Change get() to return a reference and remove operator(). Use batch.get() throughout for consistency. Assisted-by: llama.cpp:local pi save-load-state : extract generate_tokens helper Factor out the repeated token generation loop into a shared helper function used by all phases. Assisted-by: llama.cpp:local pi * save-load-state : update comments to use test terminology Replace "Phase" with "Test" and list each test's steps as bullet points. Assisted-by: llama.cpp:local pi * save-load-state : rename test functions Rename to test_baseline, test_state_load, test_seq_cp_host, test_seq_cp_device. Update comments and logs accordingly. Assisted-by: llama.cpp:local pi * pi : add rule to never git push without confirmation Assisted-by: llama.cpp:local pi * common : add model_only option to common_init_from_params Add bool model_only parameter to skip context creation, sampler init, and context-dependent setup. Use in save-load-state to initialize only the model, with each test creating its own context. Assisted-by: llama.cpp:local pi --------- Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-05-19 09:46:34 +03:00
Georgi Gerganov	d2e179a477	llama-eval : add per-task summary stats (#23151 ) * llama-eval : add per-problem summary table to HTML reports - Add chunk_idx and problem_idx to TaskState and saved case dicts - Group completed cases by problem_idx in dump_html() - Render per-problem summary table before individual task table - Columns: Problem (zero-padded), Runs, Correct (n/r), Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max) - Sorted by problem index, monospace font, right-aligned numbers - Colspan headers for grouped stats, auto width - Simulator: add /v1/models endpoint, timings in response, template-aware question matching, --dataset arg (aime/aime2025) Assisted-by: llama.cpp:local pi * llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally - Wrap Detailed and Summary tables in switchable tabs (Detailed active by default) - Remove summary-section wrapper, use tab labels instead - Apply monospace font to all tables and the top bar Assisted-by: llama.cpp:local pi * llama-eval : redesign top bar as CSS grid label/value pairs - Replace flat span list with 4-column grid layout (2 pairs per row) - Labels in muted color (#888), values in dark (#222) - Bold dataset name and model name - Removed media query, always uses 4 columns Assisted-by: llama.cpp:local pi * llama-eval : use realistic token counts and throughput in simulator - comp_tokens: [30, 80] → [10000, 60000] - tps_gen: derived → uniform [90.0, 110.0] - t_gen_ms: now computed from tokens/tps Assisted-by: llama.cpp:local pi * llama-eval : color Answer column green/red based on correctness Use the same .correct/.incorrect CSS classes on the Answer column to make correct answers green and incorrect answers red. 
Assisted-by: llama.cpp:local pi * llama-eval : fix pyright errors from max(..., key=len) type inference Use key=lambda x: len(x) instead of key=len so the type checker infers the return type as str instead of Sized, fixing: - unresolved-attribute: Object of type Sized has no attribute lower - not-subscriptable: Cannot subscript object of type Sized Assisted-by: llama.cpp:local pi	2026-05-19 09:46:05 +03:00
Reese Levine	c85a242ed0	ggml-webgpu : extend GDN for K>1 (#23299 )	2026-05-19 09:45:41 +03:00
Neo Zhang	aabee047d8	[SCYL] add chapter for performance reference in SYCL.md (#23315 ) * add chapter for performance reference * rm unsupported GPU	2026-05-19 09:44:51 +03:00
Sigbjørn Skjæret	f1c1c5c057	convert : filter lora tensor names (#23077 )	2026-05-19 09:44:25 +03:00
Intel AI Get-to Market Customer Success and Solutions	439f1b193d	sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (#22153 ) * sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle Signed-off-by: Chun Tao <chun.tao@intel.com> * Use async mem ops for correctness when SYCL graphs are explicitly on. Signed-off-by: Tao, Chun <chun.tao@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Tao, Chun <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-19 09:44:02 +03:00
Radoslav Gerganov	c3e9ade6dd	rpc : keep last_graph_uid in the device context (#23273 ) With the introduction of MTP we can have multiple compute contexts for the same RPC device. In this case last_graph_uid is not updated properly when contexts are being switched. This patch fixes this by moving last_graph_uid to the device context, making sure it is always updated. closes: #23242	2026-05-19 09:42:36 +03:00
Pranav Dhinakar	9a532ae4ba	hexagon: add support for TRI op (#22822 ) * Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context * addressed PR review comments for TRI op * hexagon: clang format * hex-unary: remove merge conflict markers * hex-ggml: remove duplicate op cases (merge conflict) * hex-ggml: fix editor config errors --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-18 14:04:57 -07:00
Pranav Dhinakar	b7340443d4	ggml-hexagon: add PAD op HVX kernel (#23078 ) * ggml-hexagon: add PAD op HVX kernel Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized kernels. Supports zero-padding and circular padding across all 4 tensor dimensions. * hex-ggml: remove duplicate op cases (merge conflict) * hex-pad: fix editorconfig checks and macro alignment --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-18 13:39:36 -07:00
SamareshSingh	5cbaa5e69e	docker : add OCI image labels for version and build date (#21653 ) * docker: add OCI image labels to all published images * docker: propagate OCI labels as manifest and index annotations * docker: drop hardcoded org URL and revert accidental intel version bump The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit. * docker: add skip_s390x workflow_dispatch input for fast test runs Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too. Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com> --------- Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>	2026-05-18 22:14:45 +02:00
Adrien Gallouët	45b455e66f	common : remove hf cache migration (#23266 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-18 17:11:47 +02:00
Aleksander Grygier	3a9c1b854d	ui: Update KaTeX package and clean up logs from `sass` warnings (#23275 ) * ui: migrate katex imports to @use to resolve SCSS deprecation warnings * ci: Use `ubuntu-slim` for CI (UI) workflow	2026-05-18 16:26:01 +02:00
Aleksander Grygier	b9a2170fce	feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270 )	2026-05-18 16:17:21 +02:00
Aleksander Grygier	1ff0fc1384	ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236 ) * refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars * refactor: skip MCP proxy probe when no server requires it * refactor: suppress expected disconnect errors during MCP client shutdown * refactor: Deduplicate requests * refactor: deduplicate model fetching across ROUTER and MODEL modes * refactor: Clean up models logic * chore: Add `.env.example` file * refactor: replace client-side CORS proxy probe with server status flag * refactor: Post-review fixes * test: add vitest client setup with API fetch mocks	2026-05-18 16:09:40 +02:00
Aleksander Grygier	a135ec0baa	ui: Centralize monospace font styles in app.css (#23272 )	2026-05-18 15:10:14 +02:00
Martin Andersson	232f466583	webui: fix Tailwind v4 utility classes missing when built via cmake (#23253 )	2026-05-18 14:08:02 +02:00
Andrei	49c21f97cd	llama: initialize pre-norm embedding mask flag (#23256 )	2026-05-18 14:20:49 +03:00
Sigbjørn Skjæret	77e38d68f2	add myself to conversion (#23261 )	2026-05-18 12:42:56 +02:00
Martin Klacer	053e01dff6	ci : added kleidiai-server to server-self-hosted workflow (#22435 ) * kleidiai: added kleidiai-server to server-self-hosted workflow * Added KleidiAI-enabled Arm64 Linux llama-server CI/integration test workflow into the server-self-hosted.yml configuration file Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I032e33c525b7e26bc5d53719f638bee610cec1ee * Added self-hosted executor for KleidiAI server workflow Signed-off-by: Martin Klacer <martin.klacer@arm.com> * Update .github/workflows/server-self-hosted.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-18 11:14:57 +02:00
Georgi Gerganov	c3f95c1f06	scripts : allow wc2wt with an existing branch (#23189 )	2026-05-18 08:57:28 +03:00
Intel AI Get-to Market Customer Success and Solutions	0caf2a1d48	sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156 ) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions	5511965b19	sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150 ) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-18 08:11:51 +03:00
Neo Zhang	e98bcfec28	sycl : fix error when use -mg 1 error (#23140 )	2026-05-18 08:11:19 +03:00
Incarnas	1867a0c692	update bid to match each layers MTP source (#23237 ) * update bid to match each layers MTP source * Update conversion/qwen.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-18 12:37:12 +08:00
Sigbjørn Skjæret	dd7cad7197	cmake : do not check for bin install dir (#23234 )	2026-05-18 02:33:14 +02:00
Gabe Goodhart	726704a160	feat: Support d_conv=15 for ssm-conv.cu (#23017 ) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2026-05-17 23:05:11 +02:00
Aldehir Rojas	87589042ca	cmake : fix LLAMA_BUILD_UI logic (#23190 )	2026-05-17 14:42:26 -04:00
Sigbjørn Skjæret	e0de4c2419	cmake : do not install conversion script (#23204 )	2026-05-17 18:07:21 +02:00
Oliver Simons	84c678242a	CUDA: Continue directly including cuda/iterator (#23102 ) Cont of #22936, forgot to update one site	2026-05-17 18:00:10 +02:00
Aman Gupta	3e12fbdea5	llama: avoid copying logits during prompt decode in MTP (#23198 ) * llama: avoid copying logits during prompt decode in MTP * review: update comment * llama-graph: call set_output for t_h_pre_norm	2026-05-17 23:30:25 +08:00
Aldehir Rojas	39cf5d6191	common : delegate assistant continuation to underlying template handlers (#23089 ) * common : delegate assistant continuation to template handler * server : implement echo parameter to exclude assistant prefill in the response * server : fix tests for prefill * server : use existing llama template * cont : clean up	2026-05-17 13:36:05 +02:00
Jan Ekström	a6d6183dbc	ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009 ) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`), nor what gets installed when you run the cmake build+install for SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`). This allows for SPIRV-Headers to be found, as currently the CI runner's setup does not seem to include the relevant path in list of search locations. * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers This is installed by the project if it is built and installed. Receiving an error during the configuration step is generally preferred to receiving an error in the middle of a build.	2026-05-17 13:12:11 +02:00
Pascal	fcae601e44	vulkan: add cpy bf16 -> f32 pipelines (#22677 )	2026-05-17 11:31:20 +02:00
Jeff Bolz	7ba22c6a09	vulkan: Support unaligned tensors for ROPE (#22637 )	2026-05-17 11:30:16 +02:00
Aldehir Rojas	f4cc787b9f	common : enable streaming JSON argument values (#23173 ) * common : remove atomic from json arguments * common : remove parsing logic on JSON arguments	2026-05-17 03:44:34 -05:00
Jeff Bolz	3fbadb06dc	vulkan: fuse SSM_CONV + BIAS + SILU (#22653 )	2026-05-17 10:25:50 +02:00
Rares Vernica	1a68ec9378	server : honor --embd-normalize CLI arg (#23125 ) The --embd-normalize flag was registered only for the embedding and debug examples, so llama-server rejected it and the /embedding handler used a hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's example set and read params.embd_normalize as the handler's default. The per-request "embd_normalize" body field continues to override.	2026-05-17 09:39:04 +03:00
ddh0	a16cce81d3	ngram : reduce noisy logs (#23185 ) * ngram : reduce noisy logs * ngram : reduce noisy logs	2026-05-17 09:38:17 +03:00
Judd	4f13cb7424	webui: support video files as input (#22830 )	2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen	b64739ea39	server: (router) alloc tmp buffer on heap (#23159 )	2026-05-16 23:42:16 +02:00
Pascal	64b38b561b	server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137 )	2026-05-16 21:21:06 +02:00
Winston Ma	6049906133	vulkan: removed duplicate #include <memory> in headers (#23144 )	2026-05-16 19:57:35 +02:00
Aleksander Grygier	0253fb21f5	ui: Add request timeout for MCP tool calls (#23138 ) * feat: Add request timeout for MCP tool calls in llama-ui * feat: MCP Settings tab with max timeout setting	2026-05-16 15:20:27 +02:00
Georgi Gerganov	3a92bc99db	sync : ggml	2026-05-16 16:11:29 +03:00
Georgi Gerganov	e6c37a1adc	ggml : bump version to 0.12.0 (ggml/1494)	2026-05-16 16:11:29 +03:00
CrispStrobe	560445bf34	metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477) For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs). The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs. Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0). Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.	2026-05-16 16:11:29 +03:00
Steve Lhomme	2eb3e6b242	ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480) That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code	2026-05-16 16:11:29 +03:00
Holger Voormann	25b1bc9c2f	ui: Correct links in `tools/ui/README.md` [no ci] (#23139 ) In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`. See `59778f0196`.	2026-05-16 14:42:38 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	18675b6bbc	vendor : update cpp-httplib to 0.45.0 (#23103 )	2026-05-16 15:25:21 +03:00
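
Note on commit 560445bf34 (metal : tighten input-position loop in kernel_conv_transpose_1d): the loop-bound argument in that message can be sketched in plain Python. This is only an illustration, not the Metal kernel. It assumes a single input and output channel, no padding or dilation, and an output length of (IL - 1) * s0 + K; the function names are hypothetical and do not appear in ggml.

```python
def conv_transpose_1d_naive(x, w, s0):
    """Scan every input position i and filter with an `if` (the pre-patch shape of the loop)."""
    IL, K = len(x), len(w)
    OL = (IL - 1) * s0 + K  # assumed output length: no padding, no dilation
    out = [0.0] * OL
    for j in range(OL):
        for i in range(IL):
            k = j - i * s0  # kernel tap that maps input i onto output j
            if 0 <= k < K:
                out[j] += x[i] * w[k]
    return out


def conv_transpose_1d_tight(x, w, s0):
    """Visit only i in [ceil((j - K + 1) / s0), floor(j / s0)], clamped to [0, IL - 1]."""
    IL, K = len(x), len(w)
    OL = (IL - 1) * s0 + K
    out = [0.0] * OL
    for j in range(OL):
        # Integer ceil((j - K + 1) / s0) written as -((K - 1 - j) // s0)
        i_min = max(0, -((K - 1 - j) // s0))
        i_max = min(IL - 1, j // s0)
        for i in range(i_min, i_max + 1):
            out[j] += x[i] * w[j - i * s0]
    return out
```

Both functions accumulate the same products in the same ascending order of i, so the results compare equal element by element. The tight version runs at most ceil(K / s0) inner iterations per output position instead of IL.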


9230 commits