koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-12 18:09:42 +00:00

Author	SHA1	Message	Date
Olivier Chafik	a83f528688	`tool-call`: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539 ) * An empty tool_call_id is better than none! * sync: minja (tool call name optional https://github.com/google/minja/pull/36) * Force-disable parallel_tool_calls if template doesn't support it * More debug logs * Llama 3.x tools: accept / trigger on more varied spaced outputs * Fix empty content for functionary v3.2 tool call * Add proper tool call docs to server README * readme: function calling is supported now * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-31 14:15:25 +00:00
Olivier Chafik	b1bcd309fc	fix stop regression (#11543 )	2025-01-31 13:48:31 +00:00
Olivier Chafik	5783575c9d	Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533 )	2025-01-31 08:24:29 +00:00
Olivier Chafik	4a2b196d03	server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531 )	2025-01-31 10:12:40 +02:00
Steve Grubb	1bd3047a93	common: Add missing va_end (#11529 ) The va_copy man page states that va_end must be called to revert whatever the copy did. For some implementations, not calling va_end has no consequences. For others it could leak memory.	2025-01-31 07:58:55 +02:00
Daniel Bevenius	a2df2787b3	server : update help metrics processing/deferred (#11512 ) This commit updates the help text for the metrics `requests_processing` and `requests_deferred` to be more grammatically correct. Currently the returned metrics look like this: ```console \# HELP llamacpp:requests_processing Number of request processing. \# TYPE llamacpp:requests_processing gauge llamacpp:requests_processing 0 \# HELP llamacpp:requests_deferred Number of request deferred. \# TYPE llamacpp:requests_deferred gauge llamacpp:requests_deferred 0 ``` With this commit, the metrics will look like this: ```console \# HELP llamacpp:requests_processing Number of requests processing. \# TYPE llamacpp:requests_processing gauge llamacpp:requests_processing 0 \# HELP llamacpp:requests_deferred Number of requests deferred. \# TYPE llamacpp:requests_deferred gauge llamacpp:requests_deferred 0 ``` This is also consistent with the description of the metrics in the server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).	2025-01-31 06:04:53 +01:00
Olivier Chafik	553f1e46e9	`ci`: ccache for all github workflows (#11516 )	2025-01-30 22:01:06 +00:00
Olivier Chafik	8b576b6c55	Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639 ) --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-01-30 19:13:58 +00:00
uvos	27d135c970	HIP: require at least HIP 5.5	2025-01-30 16:25:44 +01:00
uvos	6af1ca48cb	HIP: Prepare reduction operators for wave 64	2025-01-30 16:25:44 +01:00
uvos	c300e68ef4	CUDA/HIP: add warp_size to cuda_device_info	2025-01-30 16:25:44 +01:00
Concedo	7a5499e77b	added one more backend for clblast noavx2 and clblast failsafe	2025-01-30 22:47:22 +08:00
Concedo	898856e183	cleaned up unused flags from makefile, updated lite	2025-01-30 19:34:55 +08:00
Concedo	fd84b062f9	allow reuse of clip embds	2025-01-30 19:02:45 +08:00
Olivier Chafik	3d804dec76	sync: minja (#11499 )	2025-01-30 10:30:27 +00:00
mgroeber9110	ffd0821c57	vocab : correctly identify LF token for GPT-2 style BPE tokenizer (#11496 )	2025-01-30 12:10:59 +02:00
Daniel Bevenius	4314e56c4f	server : use lambda instead of std::bind (#11507 ) This commit replaces the two usages of `std::bind` with lambdas for the callback functions `callback_new_task` and `callback_update_slots`. The motivation for this change is consistency with the rest of the code in server.cpp (lambdas are used for all other callbacks/handlers). Lambdas are also more readable (perhaps this is subjective) and are recommended over `std::bind` in modern C++. Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md	2025-01-30 11:05:00 +01:00
Concedo	ba5e94eed2	Revert "Update requirements.txt - include pyinstaller (#1341 )" This reverts commit `c27fcc4d4f`.	2025-01-30 17:57:48 +08:00
askmyteapot	c27fcc4d4f	Update requirements.txt - include pyinstaller (#1341 )	2025-01-30 17:34:44 +08:00
Isaac McFadyen	496e5bf46b	server : (docs) added response format for /apply-template [no ci] (#11503 )	2025-01-30 10:11:53 +01:00
Guspan Tanadi	7919256c57	readme : reference examples relative links (#11505 )	2025-01-30 06:58:02 +01:00
Daniel Bevenius	e0449763a4	server : update json snippets in README.md [no ci] (#11492 ) This commit updates some of the JSON snippets in the README.md file and removes the `json` language tag from the code blocks. The motivation for this change is that invalid JSON in a code snippet is highlighted in red, which can make it somewhat difficult to read and can be a little distracting.	2025-01-30 05:48:14 +01:00
Nigel Bosch	eb7cf15a80	server : add /apply-template endpoint for additional use cases of Minja functionality (#11489 ) * add /apply-template endpoint to server * remove unnecessary line * add /apply-template documentation * return only "prompt" field in /apply-template * use suggested idea instead of my overly verbose way	2025-01-29 19:45:44 +01:00
Rémy Oudompheng	66ee4f297c	vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360 ) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-01-29 18:29:39 +01:00
Concedo	f4e2f4b069	disable context shift when using mrope	2025-01-30 00:36:05 +08:00
Concedo	646df4b126	default to autoguess for chat completions adapter	2025-01-30 00:25:13 +08:00
Concedo	70f1d8d746	vision can set max res (+1 squashed commits) Squashed commits: [938fc655] vision can set max res	2025-01-30 00:19:49 +08:00
Daniel Bevenius	e51c47b401	server : update auto gen files comments [no ci] (#11484 ) * server : update auto gen files comments This commit updates the 'auto generated files' comments in server.cpp and removes `deps.sh` from the comment. The motivation for this change is that `deps.sh` was removed in Commit `91c36c269b` ("server : (web ui) Various improvements, now use vite as bundler (#10599)"). * squash! server : update auto gen files comments [no ci] Move comments about file generation to README.md. * squash! server : update auto gen files comments [no ci] Remove the comments in server.cpp that mention that information can be found in the README.md file.	2025-01-29 16:34:18 +01:00
Jeff Bolz	2711d0215f	vulkan: Catch pipeline creation failure and print an error message (#11436 ) * vulkan: Catch pipeline creation failure and print an error message Also, fix some warnings from my on-demand compile change. * vulkan: fix pipeline creation logging	2025-01-29 09:26:50 -06:00
Concedo	2f69432774	makefile indentation fix (+1 squashed commits) Squashed commits: [f640eb59] makefile indentation fix	2025-01-29 22:18:54 +08:00
Eric Curtin	f0d4b29edf	Parse https://ollama.com/library/ syntax (#11480 ) People search for ollama models using the web ui, this change allows one to copy the url from the browser and for it to be compatible with llama-run. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-29 11:23:10 +00:00
Georgi Gerganov	815857791d	sync : ggml	2025-01-29 11:25:29 +02:00
William Tambellini	1a0e87d291	ggml : add option to not print stack on abort (ggml/1081) * Add option to not print stack on abort Add option/envvar to disable stack printing on abort. Also link some unittests with Threads to fix link errors on ubuntu/g++11. * Update ggml/src/ggml.c --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-01-29 11:24:53 +02:00
issixx	d2e518e9b4	ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065) some threads kept looping and failed to terminate properly after an abort during CPU execution. Co-authored-by: issi <issi@gmail.com>	2025-01-29 11:24:51 +02:00
Daniel Bevenius	b636228c0a	embedding : enable --no-warmup option (#11475 ) This commit enables the `--no-warmup` option for the llama-embeddings. The motivation for this change is to allow the user to disable the warmup when running the program.	2025-01-29 10:38:54 +02:00
Molly Sophia	325afb370a	llama: fix missing k_cache store for rwkv6qwen2 (#11445 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-01-29 12:07:21 +08:00
Emreerdog	794fe23f29	cmake: add hints for locating ggml on Windows using Llama find-package (#11466 )	2025-01-28 19:22:06 -04:00
peidaqi	cf8cc856d7	server : Fixed wrong function name in llamacpp server unit test (#11473 ) The test_completion_stream_with_openai_library() function is actually with stream=False by default, and test_completion_with_openai_library() with stream=True	2025-01-29 00:03:42 +01:00
Xuan-Son Nguyen	d0c08040b6	ci : fix build CPU arm64 (#11472 ) * ci : fix build CPU arm64 * failed, trying ubuntu 22 * vulkan: ubuntu 24 * vulkan : jammy --> noble	2025-01-29 00:02:56 +01:00
uvos	be5ef7963f	HIP: Suppress transformation warning in softmax.cu. Loops with bounds not known at compile time cannot be unrolled. When ncols_template == 0, the bounds of the loop are not constexpr, thus LLVM can't unroll the loops here.	2025-01-28 23:06:32 +01:00
Nikita Sarychev	cae9fb4361	HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (#11080 ) This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.	2025-01-28 16:42:20 +01:00
Eric Curtin	7fee2889e6	Add github protocol pulling and http:// (#11465 ) As pulling protocols to llama-run Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2025-01-28 14:45:41 +00:00
Nuno	d7d1eccacc	docker: allow installing pip packages system-wide (#11437 ) Signed-off-by: rare-magma <rare-magma@posteo.eu>	2025-01-28 14:17:25 +00:00
someone13574	4bf3119d61	cmake : don't fail on `GGML_CPU=OFF` (#11457 )	2025-01-28 15:15:34 +01:00
Concedo	558bc5c901	tts can now set a length limit	2025-01-28 22:06:59 +08:00
Nuno	f643120bad	docker: add perplexity and bench commands to full image (#11438 ) Signed-off-by: rare-magma <rare-magma@posteo.eu>	2025-01-28 10:42:32 +00:00
Concedo	c5d4e07664	Merge commit '`acd38efee3`' into concedo_experimental # Conflicts: # .devops/cpu.Dockerfile # .devops/vulkan.Dockerfile # .github/workflows/build.yml # .github/workflows/docker.yml # CMakeLists.txt # README.md # cmake/llama-config.cmake.in # examples/simple-cmake-pkg/.gitignore # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-hip/CMakeLists.txt	2025-01-28 18:16:44 +08:00
Akarshan Biswas	6e84b0ab8e	SYCL : SOFTMAX F16 mask support and other fixes (#11261 ) Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021. To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it). * SYCL: SOFTMAX F16 mask support and other fixes * test-backend-ops: Add F16 mask test cases	2025-01-28 09:56:58 +00:00
Concedo	6bf0b2d062	try casting the numeric fields read	2025-01-28 17:43:28 +08:00
Michael Engel	2b8525d5c8	Handle missing model in CLI parameters for llama-run (#11399 ) The HTTP client in llama-run only prints an error in case the download of a resource failed. If the model name in the CLI parameter list is missing, this causes the application to crash. In order to prevent this, a check for the required model parameter has been added and errors for resource downloads get propagated to the caller. Signed-off-by: Michael Engel <mengel@redhat.com>	2025-01-28 08:32:40 +00:00
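
Commit 1bd3047a93 above notes that every `va_copy` must be paired with a `va_end`. The following is a minimal sketch of that pairing, not the code from the commit: `format_string` is a hypothetical helper written for illustration, and it assumes the common pattern of traversing the argument list twice (once to measure, once to write).

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not taken from the repository): formats printf-style
// arguments into a std::string by walking the argument list twice.
std::string format_string(const char * fmt, ...) {
    assert(fmt != nullptr);

    va_list args;
    va_list args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args);  // the second traversal needs its own va_list

    // First pass: measure the required length (excluding the terminator).
    const int size = std::vsnprintf(nullptr, 0, fmt, args);
    if (size < 0) {
        // Even on the error path, both lists must be released.
        va_end(args_copy);
        va_end(args);
        return std::string();
    }

    // Second pass: write into a buffer using the copied list.
    std::vector<char> buf(static_cast<size_t>(size) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, args_copy);

    va_end(args_copy);  // pairs with va_copy
    va_end(args);       // pairs with va_start
    return std::string(buf.data(), static_cast<size_t>(size));
}
```

Calling `format_string("%s-%d", "abc", 42)` builds the string from the two passes; the point of the sketch is that `va_end(args_copy)` appears on every return path, including the early exit.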


7391 commits
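
Commit 4314e56c4f in the list above replaces `std::bind` with lambdas for two server callbacks. The sketch below shows the general shape of such a change under stated assumptions: `task`, `task_queue`, `server_context`, and both `register_*` functions are hypothetical stand-ins invented for illustration and are not the types or functions used in server.cpp.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins for a task, a queue with a callback slot, and a
// context object whose member function handles new tasks.
struct task {
    int id;
};

struct task_queue {
    std::function<void(const task &)> on_new_task;

    void post(const task & t) {
        assert(on_new_task && "a callback must be registered before posting");
        on_new_task(t);
    }
};

struct server_context {
    std::vector<int> seen;

    void process_task(const task & t) { seen.push_back(t.id); }
};

// Older style: bind the member function and the object pointer, leaving a
// placeholder for the task argument.
void register_with_bind(task_queue & q, server_context & ctx) {
    q.on_new_task = std::bind(&server_context::process_task, &ctx,
                              std::placeholders::_1);
}

// Lambda style: the capture and the call are spelled out explicitly.
void register_with_lambda(task_queue & q, server_context & ctx) {
    q.on_new_task = [&ctx](const task & t) { ctx.process_task(t); };
}
```

Both registrations behave the same for a caller that invokes `q.post(task{1})`. In either form the context object must outlive the queue's callback, since the callback holds a pointer or reference to it.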