Concedo
a0ae187563
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# examples/llava/CMakeLists.txt
# examples/llava/clip.cpp
# examples/rpc/rpc-server.cpp
# examples/run/run.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
2025-04-12 10:06:47 +08:00
Concedo
ea9bd61e47
Merge commit '64eda5deb9' into concedo_experimental
# Conflicts:
# .devops/cuda.Dockerfile
# .devops/intel.Dockerfile
# .devops/llama-cli-cann.Dockerfile
# .devops/musa.Dockerfile
# .devops/rocm.Dockerfile
# .devops/vulkan.Dockerfile
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# docs/backend/SYCL.md
# examples/llava/clip.cpp
# examples/server_embd.py
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# src/CMakeLists.txt
# tests/test-chat-template.cpp
2025-04-12 08:31:22 +08:00
Yuxuan Zhang
06bb53ad9b
llama-model : add Glm4Model implementation for GLM-4-0414 (#12867)
* GLM-4-0414
* use original one
* Using with tensor map
* fix bug
* change order
* change order
* format with flake8
2025-04-11 12:10:10 +02:00
Xuan-Son Nguyen
8b91d5355a
llama : correct rms norm for llama 4 (#12882)
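A brief aside for context, since the one-line subject assumes familiarity with RMS norm: unlike standard LayerNorm, it scales each hidden vector by its root mean square (no mean subtraction), with a small epsilon for stability. Sketched in LaTeX:

```latex
% RMS norm over a hidden vector x of dimension d, with learned gain g:
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}} \, g_i
```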
2025-04-11 08:49:50 +02:00
Bo Zheng
d3bd7193ba
llama : Support Qwen3 and Qwen3MoE (#12828)
* add qwen3 & qwen3moe support.
* fix
---------
Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
2025-04-09 11:47:36 +02:00
Plamen Minev
381603a775
ci: detach common from the library (#12827)
* fix: detach common from the library
* fix: building chat test template
2025-04-09 10:11:11 +02:00
Georgi Gerganov
a19b5cef16
llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
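A minimal sketch of the casting idea in the bullets above, assuming the public ggml API (`ggml_cast`, `ggml_flash_attn_ext`); the function and variable names are hypothetical, not the actual `llama.cpp` graph-builder code:

```cpp
#include "ggml.h"

// Hypothetical fragment: with no KV cache, K and V come straight from the
// graph as F32; casting them to F16 lets the existing F16 flash-attention
// path handle the embeddings case.
struct ggml_tensor * build_fa_no_cache(struct ggml_context * ctx,
                                       struct ggml_tensor * q,
                                       struct ggml_tensor * k,
                                       struct ggml_tensor * v,
                                       struct ggml_tensor * mask,
                                       float kq_scale) {
    struct ggml_tensor * k_f16 = ggml_cast(ctx, k, GGML_TYPE_F16);
    struct ggml_tensor * v_f16 = ggml_cast(ctx, v, GGML_TYPE_F16);
    // max_bias = 0 (no ALiBi), logit_softcap = 0 (disabled)
    return ggml_flash_attn_ext(ctx, q, k_f16, v_f16, mask, kq_scale, 0.0f, 0.0f);
}
```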
2025-04-08 19:54:51 +03:00
Concedo
ebf924c5d1
Merge branch 'upstream' into concedo_experimental
2025-04-08 21:46:30 +08:00
Concedo
822cf2430e
Merge commit 'f1e3eb4249' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/backend/SYCL.md
# examples/llava/clip.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/cmake/host-toolchain.cmake.in
2025-04-08 20:48:53 +08:00
Xuan-Son Nguyen
1466621e73
llama : Support llama 4 text-only (#12791)
* llama4 conversion
* initial support, no chat template
* clean up a bit
* fix tokenizer conversion
* correct hparams
* try this
* fix shexp
* ffn_inp_normed
* chat template
* clean up model conversion
* add_bos
* add scale_before_ffn
* fix order
* weight_before_ffn
* llm_graph_input_attn_temp
* add chunk attn mask
* build_inp_attn_scale()
* add comment about ggml_repeat
* clarify comments
* fix build
2025-04-07 23:06:44 +02:00
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models (#12756)
ggml-ci
2025-04-04 21:48:10 +03:00
Concedo
4e740311fe
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# tests/test-chat-template.cpp
2025-04-04 15:07:47 +08:00
yumeyao
5dd5d1ab00
vocab : use string_view::find() to avoid unnecessary lookups beyond the fragment range (#12706)
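The idea in the title, illustrated with a standalone sketch (the function and names here are illustrative, not the actual vocab code): slicing a `string_view` down to the fragment first means `find()` can never scan past the fragment's end:

```cpp
#include <string>
#include <string_view>

// Search for `needle` only inside text[offset, offset + length): taking a
// string_view slice first bounds the search, so find() cannot look beyond
// the fragment range. Returns an offset into the full text, or npos.
size_t find_in_fragment(std::string_view text, std::string_view needle,
                        size_t offset, size_t length) {
    std::string_view frag = text.substr(offset, length);
    size_t pos = frag.find(needle);
    return pos == std::string_view::npos ? std::string_view::npos : offset + pos;
}
```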
2025-04-03 18:32:54 +03:00
Concedo
103d60ed2c
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# common/common.cpp
# examples/batched-bench/batched-bench.cpp
# examples/batched/batched.cpp
# examples/export-lora/export-lora.cpp
# examples/gritlm/gritlm.cpp
# examples/parallel/parallel.cpp
# examples/passkey/passkey.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-vulkan/CMakeLists.txt
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Georgi Gerganov
833e2b7409
model : print tensor size during load (#12711)
* model : print tensor size during load
* cont : fix units MB -> MiB
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
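On the MB -> MiB fix: the two units differ (1 MB = 10^6 bytes, 1 MiB = 2^20 = 1,048,576 bytes), so sizes computed by dividing by 1024*1024 should be labeled MiB. A trivial illustration (not the actual loader code):

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t nbytes = 178257920;                      // example tensor size
    std::printf("%8.2f MiB\n", nbytes / (1024.0 * 1024.0)); //   170.00 MiB
    std::printf("%8.2f MB\n",  nbytes / 1e6);               //   178.26 MB
}
```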
2025-04-02 16:38:54 +03:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
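A sketch of how the option from this PR might be used from the C API; the struct and field names (`llama_model_tensor_buft_override`, `tensor_buft_overrides`) and the null-pattern terminator follow my reading of #11397 and should be treated as assumptions, not a verified API reference:

```cpp
#include "llama.h"
#include "ggml-backend.h"

// Hypothetical sketch: keep tensors whose names match an "exps" pattern in
// plain CPU memory while the rest follow the default placement.
int main() {
    llama_model_tensor_buft_override overrides[] = {
        { "\\.ffn_.*_exps\\.", ggml_backend_cpu_buffer_type() },
        { nullptr,             nullptr                        }, // assumed terminator
    };

    llama_model_params mparams   = llama_model_default_params();
    mparams.tensor_buft_overrides = overrides;

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    // ... use the model ...
    llama_model_free(model);
}
```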
2025-04-02 14:52:01 +02:00
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard
ggml-ci
* cont : fix comment [no ci]
* llama : fix kv_cache restore logic
ggml-ci
* context : simplify kv cache updates
ggml-ci
* cont : better name [no ci]
* llama : fix llama_decode return code when could not find KV slot
ggml-ci
* context : change log err -> warn [no ci]
* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Sigbjørn Skjæret
83a88bd6af
vocab : BailingMoE : change possessive quantifiers to greedy (#12677)
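Context for the title (my illustration, not code from the change): a possessive quantifier such as `a++` never gives back characters once matched, while a greedy `a+` backtracks; notably, `std::regex` has no possessive quantifiers at all, which makes the greedy form the portable choice:

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Greedy: "a+" first grabs all three 'a's, then backtracks one so the
    // trailing literal 'a' can match -> the overall match succeeds.
    // Possessive "a++a" (unsupported by std::regex) would consume every 'a'
    // and refuse to give one back, so the same input would fail to match.
    std::regex  greedy("a+a");
    std::string s = "aaa";
    std::cout << std::boolalpha << std::regex_match(s, greedy) << "\n"; // true
}
```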
2025-04-02 11:21:48 +02:00
jklincn
e39e727e9a
llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672)
2025-04-01 14:54:28 +02:00
Concedo
9e182b3e78
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# README.md
# docs/backend/SYCL.md
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/CMakeLists.txt
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# scripts/sync-ggml.last
# tests/test-chat-template.cpp
2025-04-01 20:16:07 +08:00
Daniel Bevenius
c80a7759da
vocab : add special infill tokens for CodeLlama (#11850)
* vocab : add special infill tokens for CodeLlama
The commit adds the following special tokens for CodeLlama infill:
- `▁<PRE>`
- `▁<SUF>`
- `▁<MID>`
The motivation for this is that the infill example currently suggests CodeLlama as a model, but when using this model the following error is generated:
```console
/llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
305251 Aborted (core dumped)
./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \
-c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \
--in-prefix "def helloworld():\n print(\"hell" \
--in-suffix "\n print(\"goodbye world\")\n "
```
* squash! vocab : add special infill tokens for CodeLlama
Add `▁<EOT>` as well.
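To make the token layout concrete, a hedged sketch (assuming the public `llama_vocab_fim_*` accessors; the helper itself is hypothetical) of how an infill prompt is typically assembled as `<PRE> prefix <SUF> suffix <MID>`, after which the model generates the middle and stops at `<EOT>`:

```cpp
#include "llama.h"
#include <vector>

// Hypothetical helper: build a fill-in-the-middle token sequence from
// pre-tokenized prefix and suffix, using the vocab's special FIM tokens.
static std::vector<llama_token> build_fim_prompt(const llama_vocab * vocab,
                                                 const std::vector<llama_token> & prefix,
                                                 const std::vector<llama_token> & suffix) {
    std::vector<llama_token> out;
    out.push_back(llama_vocab_fim_pre(vocab));    // ▁<PRE>
    out.insert(out.end(), prefix.begin(), prefix.end());
    out.push_back(llama_vocab_fim_suf(vocab));    // ▁<SUF>
    out.insert(out.end(), suffix.begin(), suffix.end());
    out.push_back(llama_vocab_fim_mid(vocab));    // ▁<MID>; the middle is generated next
    return out;
}
```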
2025-03-31 18:40:56 +02:00
Sigbjørn Skjæret
2c3f8b850a
llama : support BailingMoE (Ling) (#12634)
2025-03-30 22:21:03 +02:00
Juyoung Suk
b3de7cac73
llama : add Trillion 7B model support (#12556)
* Support Trillion 7B
* Update llama.h
* Update llama.h
* Update llama-vocab.cpp for Trillion
* Update llama-vocab.cpp
2025-03-30 20:38:33 +02:00
Sergei Vorobyov
7242dd9675
llama-chat : Add Yandex instruct model template support (#12621)
* add yandex template
* update yandex chat template
* fix tests
* adjust chat template
* fix style
* fix tool macro in template
* add clarify comment
---------
Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-30 20:12:03 +02:00
Concedo
e6337ff957
Merge commit 'e408d4351a' into concedo_experimental
# Conflicts:
# ggml/CMakeLists.txt
2025-03-30 18:26:02 +08:00
Concedo
ce05aa722d
Merge commit '0bb2919335' into concedo_experimental
# Conflicts:
# ggml/src/CMakeLists.txt
# src/llama-model.cpp
2025-03-30 18:18:20 +08:00
Xuan-Son Nguyen
af6ae1efb2
llama : fix non-causal mask for gemma 3 (#12615)
2025-03-30 00:07:37 +01:00
Djip007
0bb2919335
llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632)
This allows using the GPU host buffer type when possible instead of CPU repack.
This has the same effect as resolving issue #12498 without completely disabling the CPU extra buffer.
Co-authored-by: philou <philou@framework>
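The ordering described above, as a tiny illustrative priority list (the names are mine, not the actual `cpu_buft_list` code): candidate buffer types are tried front to back and the first usable one wins, so placing GPU host before CPU extra prefers pinned host memory over CPU repack buffers:

```cpp
#include <vector>

// Illustrative only: the preference order from this commit. A real
// implementation would scan this list and pick the first buffer type
// that can actually allocate the tensor in question.
enum class buft_kind { ACCEL, GPU_HOST, CPU_EXTRA, CPU };

static const std::vector<buft_kind> cpu_buft_order = {
    buft_kind::ACCEL,      // accelerator-specific buffers first
    buft_kind::GPU_HOST,   // pinned host memory usable by the GPU
    buft_kind::CPU_EXTRA,  // CPU repack/extra buffers
    buft_kind::CPU,        // plain CPU memory as the fallback
};
```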
2025-03-29 14:07:37 +01:00
Concedo
396875e1c4
update api docs and lite
2025-03-29 15:39:25 +08:00
Sigbjørn Skjæret
3714c3ee1a
llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631)
2025-03-28 22:13:02 +01:00
Georgi Gerganov
b4ae50810e
metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
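One bullet above restricts the op on other backends to equal K/V head sizes; a hedged sketch of what such a `supports_op`-style check can look like (in ggml, `ne[0]` of K and V is the head-size dimension; the function name is mine):

```cpp
#include "ggml.h"

// Hypothetical check: reject flash-attention ops whose K and V head sizes
// differ, for backends that only implement the equal-head-size case.
// For GGML_OP_FLASH_ATTN_EXT, src[0] = Q, src[1] = K, src[2] = V.
static bool fa_equal_head_sizes(const struct ggml_tensor * op) {
    const struct ggml_tensor * k = op->src[1];
    const struct ggml_tensor * v = op->src[2];
    return k->ne[0] == v->ne[0];
}
```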
2025-03-28 20:21:59 +02:00
Johannes Gäßler
dd373dd3bf
llama: fix error on bad grammar (#12628)
2025-03-28 18:08:52 +01:00
Si1w
f125b8dccf
llama : add PLM GGUF Conversion & Inference Support (#12457)
* add edgellm model arch [conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of model arch in convert gguf
* [Model] Refactor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig errors
* [Code] Remove Trailing whitespace
* [Code] Remove Trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 Lint errors
* Remove trailing white space
* [Code] Remove call in model arch
2025-03-27 12:49:15 +02:00
HighDoping
953c2a62cf
model : restore support for T5Encoder (#12590)
2025-03-27 11:43:33 +01:00
Georgi Gerganov
f28bc4c286
llama : make loras compatible with repacking (#12593)
* llama : make loras compatible with repacking
ggml-ci
* cont : simplify
ggml-ci
* cont : add TODO [no ci]
2025-03-27 08:24:10 +02:00
Concedo
ea358369cc
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# ci/README.md
# ci/run.sh
# docs/backend/CUDA-FEDORA.md
# docs/build.md
# docs/install.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/common.cuh
# tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Georgi Gerganov
2d77d88e70
context : fix worst-case reserve outputs (#12545)
ggml-ci
2025-03-25 09:19:23 +02:00
compilade
00d53800e0
llama-vocab : add SuperBPE pre-tokenizer (#12532)
2025-03-24 11:47:24 +01:00
Prajwal B Mehendarkar
c54f6b7988
mmap : skip resource limit checks on AIX (#12541)
2025-03-24 12:17:10 +02:00
Xuan-Son Nguyen
fbdfefe74e
llama : gemma3 : use output tensor if it exists in model weight (#12506)
* llama : gemma3 : use output tensor if it exists in model weight
* also add to the llm_tensor_names
2025-03-22 23:28:19 +01:00
Concedo
ae670dbe0e
no repacking for avx2 for kcpp because it breaks 4_0_4_4 quants
2025-03-22 00:33:27 +08:00
Concedo
7030ebf401
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# docs/backend/SYCL.md
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
Georgi Gerganov
af04481e6b
model : do not repack if a GPU device is present (#12498)
ggml-ci
2025-03-21 16:14:29 +02:00
Sigbjørn Skjæret
960e726077
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)
2025-03-21 10:21:36 +01:00
Sigbjørn Skjæret
dbb3a4739e
llama : make Qwen2MoE QKV bias optional (#12477)
2025-03-20 12:49:59 +01:00
fairydreaming
568013d0cd
context : clear sets containing encoder output sequence ids before storing new values (#12470)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-19 21:01:57 +01:00
Concedo
0c90d2ebcf
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# cmake/common.cmake
# docs/backend/SYCL.md
# examples/main/README.md
# examples/speculative/speculative.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-musa/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Sigbjørn Skjæret
108e53c2f1
llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)
* Add support for GPT2, Bloom and CodeShell tied word embeddings
* Deduplicate tied word embeddings weights
* Workaround for incorrect weight map
It appears transformer.wte.weight is in the weight map even though the weights are not there; remove it if output weights are encountered first.
* check++
* fatfingers--
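The tied-embedding idea, sketched (hedged: an illustrative fragment, not the loader's actual structure): when a model ships no separate output head, the output projection reuses the token-embedding matrix:

```cpp
#include "ggml.h"

// Illustrative fragment only (the real loader uses llama.cpp's tensor-loading
// machinery): with tied word embeddings, the LM head is the same tensor as
// the input embedding, so a missing "output.weight" falls back to "tok_embd".
struct toy_model {
    struct ggml_tensor * tok_embd; // token embedding [n_embd, n_vocab]
    struct ggml_tensor * output;   // LM head, may be absent in the file
};

static void tie_output_if_missing(toy_model & m) {
    if (m.output == nullptr) {
        m.output = m.tok_embd; // tied: the logits projection reuses the embedding
    }
}
```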
2025-03-19 09:08:49 +01:00
Georgi Gerganov
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention (#12449)
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
Xuan-Son Nguyen
99aa304fb9
llama : add support for EXAONE tied word embeddings (#12451)
2025-03-18 17:24:33 +01:00