Commit graph

564 commits

Author SHA1 Message Date
Concedo
80ce8a50b3 allow token bans and eos handling in 2026-05-16 15:20:46 +08:00
askmyteapot
3174fccb83
Fix linker error for kcpp_permit_any_repack (#2210)
Fixes `gpttype_adapter.lib(gpttype_adapter.obj) : error LNK2019: unresolved external symbol "int kcpp_permit_any_repack"`

It's a `bool`, not an `int`
2026-05-16 08:55:56 +08:00
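
The fix above comes down to a declaration and a definition disagreeing on type. MSVC encodes a global variable's type in its decorated symbol name, so an `extern int` declaration cannot resolve against a `bool` definition; other toolchains may not encode the type, which may be why the mismatch only showed up in the Windows build. A minimal sketch of the consistent form follows, collapsed into one file for illustration. The header name, the `repack_allowed` and `self_check` helpers, and the default value `false` are assumptions, not taken from the repository.

```cpp
#include <cassert>
#include <type_traits>

// --- shared header (hypothetical name: kcpp_flags.h) ---
// A single declaration that every translation unit includes.
extern bool kcpp_permit_any_repack;

// --- defining translation unit ---
// The default value here is an assumption made for the example.
bool kcpp_permit_any_repack = false;

// --- consuming translation unit ---
// A stray `extern int kcpp_permit_any_repack;` here would compile on its own,
// but MSVC would then look for an `int` symbol that nothing defines (LNK2019).
bool repack_allowed() {
    return kcpp_permit_any_repack;
}

// Compile-time guard: the visible declaration must stay `bool`.
static_assert(std::is_same<decltype(kcpp_permit_any_repack), bool>::value,
              "kcpp_permit_any_repack must be declared as bool");

// Hypothetical self-check: writes to the definition are seen via the declaration.
void self_check() {
    kcpp_permit_any_repack = true;
    assert(repack_allowed());
    kcpp_permit_any_repack = false;
    assert(!repack_allowed());
}
```

Keeping the declaration in one shared header, rather than repeating `extern` lines by hand in each source file, lets the compiler catch this kind of mismatch before the linker does.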
Concedo
66a7b5e5de prevent repacking if mmap is used, reduces memory footprint 2026-05-15 16:40:53 +08:00
Concedo
286e62267e adjust batching eligibility 2026-05-11 21:54:32 +08:00
Concedo
bfaddd7a3b added support for added memory and gemma and glm prompt fixes for batching mode 2026-05-10 23:39:03 +08:00
Concedo
33ca75d56f ci for tools upload, minor function reordering 2026-05-10 23:10:43 +08:00
AlpinDale
c03302b670
feat: add a primitive form of continuous batching (#2167)
* feat: add a primitive form of continuous batching

* fix: deadlock in batching fallback

* fix: windows build

* chore: suppress the contbatch arg from --help

* feat: batch-aware rep_pen_slope

* fix: automatically disable shifting when batching is enabled

* fix: mixed-path state corruption

* fix: attempt to fully separate the two pipelines

* added a semaphore to prevent non-batchable requests from starting while batched requests are running

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2026-05-10 17:50:31 +08:00
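
The last bullet in the commit above describes a gate that keeps non-batchable requests from starting while batched requests are running. The log does not show the actual implementation, so the following is only a sketch of that idea. It assumes a counter of active batched requests guarded by a mutex and condition variable instead of a literal semaphore, and the `BatchGate` class and its method names are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical gate; names are illustrative and not taken from the repository.
class BatchGate {
public:
    // Called when a batched request starts running.
    void enter_batched() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++active_batched_;
    }

    // Called when a batched request finishes; wakes any waiting requests.
    void leave_batched() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(active_batched_ > 0);  // leave must pair with an earlier enter
            --active_batched_;
        }
        idle_.notify_all();
    }

    // Blocks a non-batchable request until no batched request is running.
    void wait_until_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return active_batched_ == 0; });
    }

    // Non-blocking query of the same condition.
    bool is_idle() {
        std::lock_guard<std::mutex> lock(mutex_);
        return active_batched_ == 0;
    }

private:
    std::mutex mutex_;
    std::condition_variable idle_;
    int active_batched_ = 0;
};
```

This sketch only covers the direction the commit message mentions. It does not stop a new batched request from starting after `wait_until_idle` returns, so a full separation of the two pipelines would need an additional lock held by the non-batchable path.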
Concedo
eb30b29d69 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/gguf-publish.yml
#	CODEOWNERS
#	examples/sycl/test.sh
#	pyproject.toml
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
2026-05-08 14:48:57 +08:00
Concedo
950676fdb7 split utils.cpp into 2 files to support sd.cpp 2026-05-04 15:04:12 +08:00
Concedo
8b62e7b667 allow splitmode to be set independently, enable tensor parallelism 2026-05-02 16:41:28 +08:00
Concedo
0755f27372 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/openvino.Dockerfile
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/build.yml
#	common/chat.cpp
#	docs/backend/OPENVINO.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/libggml-htp.inf
#	ggml/src/ggml-openvino/ggml-decoder.cpp
#	ggml/src/ggml-openvino/ggml-openvino-extra.cpp
#	ggml/src/ggml-openvino/ggml-openvino.cpp
#	ggml/src/ggml-openvino/ggml-quants.cpp
#	ggml/src/ggml-openvino/openvino/op/rope.cpp
#	ggml/src/ggml-openvino/openvino/op_table.cpp
#	ggml/src/ggml-openvino/openvino/op_table.h
#	ggml/src/ggml-openvino/openvino/translate_session.cpp
#	ggml/src/ggml-openvino/openvino/utils.cpp
#	ggml/src/ggml-openvino/openvino/utils.h
#	ggml/src/ggml-openvino/utils.cpp
#	ggml/src/ggml-openvino/utils.h
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/convert.hpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync_vendor.py
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
Concedo
becf70d49b fixed logspam for fit 2026-04-23 00:43:09 +08:00
Concedo
96ec87127a updated colab, handle connection dropping during prompt processing 2026-04-21 21:46:13 +08:00
Concedo
4629b49afb updated to handle changes for clip_is_mrope 2026-04-21 19:34:32 +08:00
Concedo
19a12bb080 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	common/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	scripts/sync-ggml.last
#	tools/cli/cli.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Concedo
1feba4e4ea fixed koboldcpp.sh, fixed vision max/min when one param is missing, fixed processing count wrong, updated lite 2026-04-21 18:36:47 +08:00
Concedo
71b4107bb6 fixed terminal logs 2026-04-19 11:31:12 +08:00
Concedo
e5eab545f3 handle override jinja template 2026-04-19 00:30:28 +08:00
Concedo
17c754a5fc improved reasoning budget 2026-04-18 17:19:09 +08:00
Concedo
0b37cb9a57 added preliminary support for reasoning budget 2026-04-18 11:56:33 +08:00
Concedo
9a38091207 support q5_1 kv 2026-04-17 17:06:15 +08:00
Concedo
cccb45a00a summary outputs include processed amt 2026-04-17 14:22:51 +08:00
Concedo
64ce5fca15 better approach when SWA window exceeded, simply refill the window. this is not 100% correct but good enough for fastforward users. Disable FF or increase window if not good enough 2026-04-17 11:44:13 +08:00
Concedo
b5e317e015 SWA fix attempt 2 2026-04-17 00:33:45 +08:00
Concedo
ae292c496e handle SWA conflicting with rewind, increased default SWA padding. 2026-04-16 17:00:26 +08:00
Concedo
0251c6dbde added swa padding controls 2026-04-16 16:21:48 +08:00
Concedo
535df844dd touchup for min/max tokens ui 2026-04-16 14:56:22 +08:00
Llama
c592bd01da
Pass img_min_params and img_max_params to ctx_clip_params (#2133)
* Pass img_min_params and img_max_params to ctx_clip_params

These values determine the minimum and maximum size (in
tokens) of vision embeddings. The default value of -1
uses a model-dependent default size, for example for
Gemma 4 the default is a 280 token embedding. For higher
quality results (at the cost of using more memory and
slower speed) you can increase the size of the embedding
to 1120 tokens.

* Change dict to mydict to match change to method
2026-04-16 12:27:06 +08:00
Concedo
9c0b9b0bb1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/development/HOWTO-add-model.md
#	docs/multimodal.md
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/gated_delta_net.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/upscale.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	tests/test-backend-ops.cpp
#	tests/test-llama-archs.cpp
#	tools/mtmd/CMakeLists.txt
2026-04-14 20:06:04 +08:00
Concedo
ae60ea0009 handle updated gemma templates 2026-04-14 00:29:58 +08:00
Concedo
5361b45fba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	requirements/requirements-tool_bench.txt
2026-04-12 16:22:26 +08:00
Concedo
919a010ebc support outro but don't actually use it yet 2026-04-10 23:18:30 +08:00
Concedo
a1fc912452 try don't trigger the magic if input is following the jinja template.exactly for thinking models (+1 squashed commits)
Squashed commits:

[5542e81dc] try don't trigger the magic if input is following the jinja template.
2026-04-07 21:47:38 +08:00
Concedo
2d3fe0c113 revert my tweak, switch back to henk's original implementation for now, we can explore this again next time. 2026-04-07 19:00:58 +08:00
Concedo
e991bc044e updated lite, modify henk fix to allow triggering on missing close only 2026-04-06 23:41:45 +08:00
Concedo
82cc19e055 calculate some fields before autofit for more accurate estimate 2026-04-06 20:44:37 +08:00
Concedo
f6e712d919 universal gemma4 fix, add memory check 2026-04-06 19:20:44 +08:00
henk717
4e30294cb1
Henk's Gemma4 31B Magic (#2096) 2026-04-06 18:49:19 +08:00
Concedo
6c937c05d9 improve ncmoe / moecpu regex 2026-04-04 23:53:13 +08:00
Concedo
db8bc40731 add some warnings if shifting fails 2026-04-04 23:16:26 +08:00
Concedo
eb3422996a BOS fix for gemma4 2026-04-04 22:15:01 +08:00
Concedo
97f785efce ensure BOS on vision prefix 2026-04-03 16:20:36 +08:00
Concedo
e8cffa37c8 fixed gemma4v image crashing on encode, however images are not yet working correctly 2026-04-03 15:56:35 +08:00
Concedo
0c2b679ea3 support bf16 quantkv cache type 2026-03-28 00:01:17 +08:00
Concedo
c91f350ed5 increase max images, take images from the end instead of beginning if too many images 2026-03-26 23:03:52 +08:00
Concedo
993925ba96 gracefully handle bad grammar instead of crashing 2026-03-23 17:00:53 +08:00
Concedo
07327b6c10 double n_batch size when pipeline parallel is enabled, keep u_batch the same 2026-03-21 11:22:10 +08:00
Concedo
3113e3a643 move main device print 2026-03-21 10:47:21 +08:00
Concedo
f579939057 updated lite, change smartcache snapshot behavior to conserve slots 2026-03-15 15:15:39 +08:00
Concedo
fcdf2f40d5 no need snapshot after gen is complete. 2026-03-15 12:34:48 +08:00