Commit graph

55 commits

Author SHA1 Message Date
Concedo
9f976e9c65 full SWA cache used unless context shift and fast-forward are disabled 2025-05-21 22:47:45 +08:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
2025-05-20 08:05:46 +03:00
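
The SWA work above introduces two user-visible pieces: a context parameter that controls whether a full-size SWA cache is kept, and a query for the oldest position still present in a sequence. A minimal sketch, using the `swa_full` field and `llama_kv_self_seq_pos_min()` named in the commit messages:

    // sketch based on #13194; names taken from the commit messages above
    #include "llama.h"

    llama_context * make_ctx(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx    = 8192;
        cparams.swa_full = false; // use the smaller sliding-window cache, not a full-size one
        return llama_init_from_model(model, cparams);
    }

    void check_swa(llama_context * ctx) {
        // oldest position still cached for sequence 0; a value > 0 means part of
        // the SWA context was discarded (the "partial SWA" case the server now disallows)
        const llama_pos pos_min = llama_kv_self_seq_pos_min(ctx, 0);
        (void) pos_min;
    }
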
Concedo
e5d26a2356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	ggml/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/binbcast.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	ggml/src/gguf.cpp
#	scripts/compare-llama-bench.py
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-05-16 15:30:31 +08:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 (#13546) 2025-05-14 21:22:49 +03:00
Gilad S.
017f10b5fa
fix: crash when calling llama_state_get_size on a context without a KV cache (#13542) 2025-05-14 19:18:18 +03:00
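
The crash fixed here was triggered by querying the serialized-state size of a context that owns no KV cache. A minimal save sketch around the two state calls from llama.h:

    // llama_state_get_size()/llama_state_get_data(); with #13542 this also
    // works for contexts created without a KV cache
    #include <cstdint>
    #include <vector>
    #include "llama.h"

    std::vector<uint8_t> save_state(llama_context * ctx) {
        std::vector<uint8_t> buf(llama_state_get_size(ctx));
        llama_state_get_data(ctx, buf.data(), buf.size());
        return buf;
    }
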
Concedo
21e31e255b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	README.md
#	build-xcframework.sh
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-metal/ggml-metal.m
#	ggml/src/ggml-metal/ggml-metal.metal
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	scripts/compare-llama-bench.py
#	src/CMakeLists.txt
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
#	tools/mtmd/clip.cpp
#	tools/rpc/rpc-server.cpp
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
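
The training PR adds an optimization interface on top of ggml_opt. A rough sketch of the flow using the names from the commit bullets; treat the exact signatures, and especially the save-call name, as assumptions rather than the final merged API:

    // assumption: llama_opt_params carries a parameter filter deciding which
    // tensors receive gradients; the save call is named as in the commit message
    llama_opt_params oparams = {};
    oparams.param_filter = llama_opt_param_filter_all; // train every tensor
    llama_opt_init(ctx, model, oparams);
    // ... run optimization epochs over a dataset here ...
    llama_save_model_to_file(model, "finetuned.gguf");
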
Georgi Gerganov
064cc596ac
context : fix state io for memory-less contexts (#13470)
ggml-ci
2025-05-12 15:12:27 +03:00
David Huang
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
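
`--no-op-offload` keeps host-resident tensor operations on the host instead of copying them to a device, which helps prompt-processing speed when `-ot` pins large expert weights to CPU memory. A hedged sketch of the assumed programmatic counterpart:

    // assumption: the flag maps to an op_offload field in llama_context_params
    llama_context_params cparams = llama_context_default_params();
    cparams.op_offload = false; // run ops where their weights live instead of offloading
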
Concedo
2439014a03 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	examples/embedding/embedding.cpp
#	tools/imatrix/imatrix.cpp
#	tools/perplexity/perplexity.cpp
2025-05-08 23:41:02 +08:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
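
With a cache-less context, embedding models skip KV-cache allocation entirely and go through `llama_encode()`. A minimal sketch for pooled sequence embeddings (the mean-pooling choice is illustrative):

    // embedding extraction without a KV cache, per #13108
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;
    llama_context * ctx = llama_init_from_model(model, cparams);

    llama_encode(ctx, batch);                             // not llama_decode()
    const float * emb = llama_get_embeddings_seq(ctx, 0); // pooled embedding, sequence 0
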
Georgi Gerganov
51fb96b1ff
context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
2025-05-08 14:26:50 +03:00
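
With `logits_all` removed, logits are requested per token through the batch flags instead. A sketch that asks for logits only at the final position, assuming `tokens`, `n_tokens` and `ctx` are in scope:

    llama_batch batch = llama_batch_init(n_tokens, /*embd*/ 0, /*n_seq_max*/ 1);
    for (int32_t i = 0; i < n_tokens; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == n_tokens - 1); // logits only for the last token
    }
    batch.n_tokens = n_tokens;
    llama_decode(ctx, batch);
    const float * logits = llama_get_logits_ith(ctx, n_tokens - 1);
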
Concedo
0951ad9f58 temp merge, not working 2025-05-03 11:42:01 +08:00
Georgi Gerganov
a75cb30dc9
context : fix reorder logic (#13267)
ggml-ci
2025-05-02 20:54:13 +03:00
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl (#12799)
* kv-cache : separate recurrent vs non-recurrent impl (wip)

ggml-ci

* kv-cache : init -> constructor + add llama_memory_params

ggml-ci

* kv-cache : fix callback reference

ggml-ci

* context : llama_kv_cache -> llama_memory_i

ggml-ci

* context : move memory creation logic to model

ggml-ci

* llama : remove reference of memory during encode

ggml-ci

* kv-cache : hide padding details in the implementation

ggml-ci

* kv-cache : add ubatch_next()

ggml-ci

* context : simplify sbatch logic

ggml-ci

* kv-cache : hide defrag logic in the implementation

ggml-ci

* context : hide kv cache details in implementation

ggml-ci

* build : fix

ggml-ci

* cont : another fix

ggml-ci

* kv-cache : simplify interface (wip)

ggml-ci

* kv-cache : use separate KV cell structs for unified/recurrent

ggml-ci

* kv-cache : clean-up

ggml-ci

* model : better llama_model::create_model() signature

ggml-ci

* kv-cache : fix recurrent seq_rm()

ggml-ci

* kv-cache : replace `struct callbacks` with `llama_model &`

ggml-ci

* kv-cache : replace `struct graph_params` with `llama_context &`

ggml-ci

* kv-cache : fix offload check

ggml-ci

* context : avoid passing unique_ptr

ggml-ci

* kv-cache : avoid using the backends from the llama_context

ref #13113

ggml-ci

* kv-cache : more consistent debug logs [no ci]

* kv-cache : do not pass the full llama_context for kv graphs

ggml-ci

* kv-cache : remove comment

* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext

ggml-ci

* kv-cache : fix recurrent multi-user case

ggml-ci

* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
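
The refactor above replaces the single KV-cache type the context used to hold with an abstract memory interface and two implementations. An illustrative shape of the split, not the actual headers:

    // context code talks to "memory" through an interface; attention models get
    // the unified KV cache, recurrent models (Mamba/RWKV-style) get their own impl
    struct llama_memory_i {
        virtual ~llama_memory_i() = default;
        virtual void clear() = 0;
        virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    };

    class llama_kv_cache_unified   : public llama_memory_i { /* non-recurrent KV cells */ };
    class llama_kv_cache_recurrent : public llama_memory_i { /* recurrent state cells  */ };
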
Concedo
ca53d1bedc Merge commit '13c9a3319b' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2025-05-02 16:42:16 +08:00
ddh0
16a457facd
fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221) 2025-04-30 21:28:43 +01:00
Concedo
b2ecfa0f55 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	examples/llama-bench/README.md
#	examples/llama-bench/llama-bench.cpp
#	examples/llava/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-chat-template.cpp
2025-04-29 21:05:16 +08:00
pockers21
fb0471d175
context : do not clear output buffer on reserve (#13152)
Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-04-28 16:45:40 +03:00
Concedo
3f545eadbe Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-backend-ops.cpp
2025-04-26 09:12:40 +08:00
Diego Devesa
295354ea68
llama : fix K-shift with quantized K and BLAS backend (#13113) 2025-04-25 19:40:11 +02:00
Concedo
bce519cee7 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	tests/test-backend-ops.cpp
2025-04-18 12:44:20 +08:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Concedo
06159939d9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
#	docs/build.md
#	examples/rpc/rpc-server.cpp
#	examples/sycl/build.sh
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	scripts/sync-ggml.last
2025-04-17 00:52:37 +08:00
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation (#12801)
* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm)

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF

* Removed MQA optimisation from `build_attn_mha()` as no gains now

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
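
For context on the PR above: MLA (multi-head latent attention) caches a single low-rank latent per token instead of full K/V heads. In the notation of the DeepSeek-V2 paper:

    c^{KV}_t = W^{DKV} h_t, \qquad
    k^{C}_t  = W^{UK} c^{KV}_t, \qquad
    v^{C}_t  = W^{UV} c^{KV}_t

Since $q_t^\top k^{C}_s = \big( (W^{UK})^\top q_t \big)^\top c^{KV}_s$, the up-projection can be absorbed into the query side, which is what the decomposed `wk_b`/`wv_b` tensors enable.
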
Concedo
822cf2430e Merge commit 'f1e3eb4249' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	docs/backend/SYCL.md
#	examples/llava/clip.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/cmake/host-toolchain.cmake.in
2025-04-08 20:48:53 +08:00
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models (#12756)
ggml-ci
2025-04-04 21:48:10 +03:00
Concedo
103d60ed2c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/common.cpp
#	examples/batched-bench/batched-bench.cpp
#	examples/batched/batched.cpp
#	examples/export-lora/export-lora.cpp
#	examples/gritlm/gritlm.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
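
The override is exposed on the command line as `--override-tensor` (`-ot`) with `regex=buffer-type` pairs, e.g. keeping MoE expert weights in host memory. A hedged sketch of the C-side equivalent; the struct layout is an assumption based on the PR:

    // assumption: a pattern-terminated array of {regex, buffer type} overrides
    static const llama_model_tensor_buft_override overrides[] = {
        { "ffn_.*_exps", ggml_backend_cpu_buffer_type() }, // keep experts in host memory
        { nullptr,       nullptr                        }, // terminator
    };

    llama_model_params mparams = llama_model_default_params();
    mparams.tensor_buft_overrides = overrides;
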
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
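
One of the bullets above fixes the `llama_decode` return code when no KV slot is found. Per the llama.h contract, 0 is success, 1 means no slot was found for the batch (recoverable), and negative values are errors:

    const int ret = llama_decode(ctx, batch);
    if (ret == 1) {
        // no KV slot for this batch: reduce the batch size or free cache
        // space, then retry
    } else if (ret < 0) {
        // hard error; the context state should not be relied on
    }
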
Concedo
e6337ff957 Merge commit 'e408d4351a' into concedo_experimental
# Conflicts:
#	ggml/CMakeLists.txt
2025-03-30 18:26:02 +08:00
Xuan-Son Nguyen
af6ae1efb2
llama : fix non-causal mask for gemma 3 (#12615) 2025-03-30 00:07:37 +01:00
Concedo
396875e1c4 update api docs and lite 2025-03-29 15:39:25 +08:00
Georgi Gerganov
b4ae50810e
metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 20:21:59 +02:00
Concedo
ea358369cc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ci/README.md
#	ci/run.sh
#	docs/backend/CUDA-FEDORA.md
#	docs/build.md
#	docs/install.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/common.cuh
#	tests/test-backend-ops.cpp
2025-03-26 00:18:01 +08:00
Georgi Gerganov
2d77d88e70
context : fix worst-case reserve outputs (#12545)
ggml-ci
2025-03-25 09:19:23 +02:00
Concedo
7030ebf401 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/backend/SYCL.md
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-22 00:32:42 +08:00
fairydreaming
568013d0cd
context : clear sets containing encoder output sequence ids before storing new values (#12470)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-19 21:01:57 +01:00
Concedo
0c90d2ebcf Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	docs/backend/SYCL.md
#	examples/main/README.md
#	examples/speculative/speculative.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-musa/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/vulkan-shaders/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-03-19 19:27:11 +08:00
Georgi Gerganov
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention (#12449)
* graph : normalize Q, K, V shapes and add comments

ggml-ci

* context : synchronize before getting cross attention data

* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
Georgi Gerganov
8551c44d84
context : always use non-causal attention for encoder graphs (#12447)
* context : always use non-causal attention for encoder graphs

ggml-ci

* context : move the change to llama_context::encode()

ggml-ci
2025-03-18 13:05:49 +02:00
Georgi Gerganov
dc079cfdff
context : fix init of n_outputs (#12397)
ggml-ci
2025-03-16 19:29:36 +02:00
Concedo
6888f5495d allow quantkv with contextshift 2025-03-16 21:48:42 +08:00
Concedo
67851e5415 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/run/run.cpp
#	ggml/src/ggml-cann/aclnn_ops.cpp
2025-03-15 19:54:19 +08:00
fairydreaming
8fcb563613
Load all MoE experts during warmup (#11571)
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-14 13:47:05 +01:00
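
The new call switches the context into warmup mode so a single dummy decode touches every expert. A minimal sketch; preparing the dummy batch is the caller's responsibility:

    llama_set_warmup(ctx, true);     // next decode loads all MoE experts
    llama_decode(ctx, warmup_batch);
    llama_set_warmup(ctx, false);
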
Concedo
be3bba67ff Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	src/llama-model.cpp
2025-03-14 18:25:21 +08:00
Georgi Gerganov
081bee8c64
hparams : add SWA rope parameters (#12374)
ggml-ci
2025-03-14 09:03:24 +02:00
Concedo
7dc72db9de Merge branch 'upstream' into concedo_experimental 2025-03-14 11:58:53 +08:00
Concedo
0db4ae6237 traded my ink for a pen 2025-03-14 11:58:15 +08:00
Georgi Gerganov
84d5475541
llama : fix Gemma3 SWA KV cache shift (#12373)
* llama : fix Gemma3 SWA KV cache shift

ggml-ci

* hparams : add comment [no ci]
2025-03-13 19:08:07 +02:00