koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-10 17:14:36 +00:00

Author	SHA1	Message	Date
Pierrick Hymbert	3ab8b3a92e	llama : cleanup unused mmq flags (#5772 ) * cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q * remove: mul_mat_q in compare llama bench and usage * update llama-bench --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-03-01 13:39:06 +02:00
Concedo	ad638285de	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # README.md # flake.lock # ggml-cuda.cu # llama.cpp # tests/test-backend-ops.cpp # tests/test-quantize-fns.cpp	2024-02-28 13:41:35 +08:00
Georgi Gerganov	9d533a77d0	llama : fix defrag bugs + add parameter (#5735 ) * llama : fix defrag bugs + enable by default ggml-ci * llama : add defrag_thold parameter ggml-ci * llama : cont * llama : disable log message ggml-ci * llama : fix graph size check during defrag	2024-02-27 14:35:51 +02:00
Concedo	3ccaf8e09a	Merge commit 'f7625019c5' into concedo_experimental # Conflicts: # .github/ISSUE_TEMPLATE/bug.md # .github/workflows/build.yml # CMakeLists.txt # Makefile # README.md # tests/test-backend-ops.cpp # tests/test-opt.cpp # tests/test-quantize-fns.cpp	2024-02-26 10:44:23 +08:00
Georgi Gerganov	ab336a9d5e	code : normalize enum names (#5697 ) * coda : normalize enum names ggml-ci * code : cont * code : cont	2024-02-25 12:09:09 +02:00
Concedo	8d5e25008f	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # ci/run.sh # tests/test-tokenizer-0-falcon.cpp # tests/test-tokenizer-0-llama.cpp # tests/test-tokenizer-1-bpe.cpp # tests/test-tokenizer-1-llama.cpp	2024-02-17 15:22:05 +08:00
Alexey Parfenov	6dcc02d244	server : add "samplers" param to control the samplers order (#5494 )	2024-02-16 13:33:25 +02:00
bmwl	f486f6e1e5	ggml : add numa options (#5377 ) * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-16 11:31:07 +02:00
Concedo	3cec37c2e0	Merge branch 'master' into concedo_experimental # Conflicts: # .flake8 # .github/workflows/python-lint.yml # flake.lock # ggml-cuda.cu # ggml-quants.c # llama.cpp # pocs/vdot/q8dot.cpp # pocs/vdot/vdot.cpp # tests/test-quantize-fns.cpp # tests/test-quantize-perf.cpp	2024-02-13 00:14:22 +08:00
Alexey Parfenov	a803333a4e	common : use enums for sampler types (#5418 ) * common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:43:31 +02:00
Concedo	6dc01297f8	Merge branch 'master' into concedo_experimental # Conflicts: # .devops/nix/package.nix # .github/workflows/build.yml # CMakeLists.txt # Makefile # README.md # flake.nix # llama.cpp # llama.h # tests/test-llama-grammar.cpp	2024-02-04 19:42:57 +08:00
Alexander Abushady	4cb956c7db	Quadratic Sampling UI (#652 ) * Quadratic Sampling UI Kalomaze's Quadratic Sampling, now has a UI within KCPP. * remove debug prints * cleanup, add smooth sampler to dynatemp --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2024-02-04 16:26:27 +08:00
Jared Van Bortel	1ec3332ade	YaRN : store rope scaling type as int32_t in memory (#5285 ) * YaRN : store rope scaling type as int32_t in memory * llama : store mapped names as const char *	2024-02-03 13:22:06 +02:00
Georgi Gerganov	5cb04dbc16	llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240 ) * llama : remove LLAMA_MAX_DEVICES from llama.h ggml-ci * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * server : remove LLAMA_MAX_DEVICES ggml-ci * llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD ggml-ci * train : remove LLAMA_SUPPORTS_GPU_OFFLOAD * readme : add deprecation notice * readme : change deprecation notice to "remove" and fix url * llama : remove gpu includes from llama.h ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-31 17:30:17 +02:00
Concedo	08236ccc97	better abort handling, added support for dynatemp exponent	2024-01-23 16:56:12 +08:00
Concedo	f96f29be7b	Merge branch 'master' into concedo_experimental # Conflicts: # .devops/nix/nixpkgs-instances.nix # .devops/nix/package.nix # .devops/nix/scope.nix # .github/workflows/build.yml # .github/workflows/nix-ci.yml # CMakeLists.txt # flake.nix # ggml.c	2024-01-22 22:31:22 +08:00
Kawrakow	6f9939d119	KL-divergence (#5076 ) * kl-divergence: be able to save all logits to a file * Add ability to compute KL-divergence --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 16:10:14 +02:00
Kawrakow	7dcbe39d36	Add ability to evauate multiple choice tasks (#5047 ) * TruthfulQA: 1st attempt, does not look like it is working The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right. * TruthfulQA: works but the result is bad I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess, the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with random chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leader board results for these two models are 42.2% and 68.3%, respectively. * TruthfulQA: fix random sample * TruthfulQA: prepare tasks in parallel for large test datasets * Rename truthful_qa to multiple_choice * Make MSVC happy I had forgotten that MSVC does not make constexpr's available inside a lambda. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-21 14:42:44 +02:00
Concedo	1cb8a5e955	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # CMakeLists.txt # Makefile # README.md # ci/run.sh # flake.lock # flake.nix # ggml-cuda.cu # ggml-cuda.h # scripts/get-wikitext-2.sh # tests/CMakeLists.txt	2024-01-21 14:32:15 +08:00
Concedo	71e9a64171	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/nix-ci.yml # CMakeLists.txt # Makefile # ggml-cuda.cu # ggml-opencl.cpp # llama.cpp	2024-01-20 23:27:42 +08:00
Kawrakow	682986a08e	Add Winogrande evaluation (#5015 ) * winogrande: simple implementation It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leader board. 1-sigma statistical uncertainty for 1267 tasks is ~1.4, so no way the difference is due to statistics. * winogrande: somewhat better Score for Mistrali7-B is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before. * winogrande: improving Mistral-7B score is now 73.56. Still not quite 78.4 but getting there. We are also getting a lower score on HellaSwag compared to HF leader board, so I'm not expecting we will get up to 78.4 anyway. It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood. This kind of makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing about the follow up context, and this will skew the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence. It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last. * winogrande: add dataset instructions --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-18 13:46:27 +02:00
stduhpf	e0324285a5	speculative : threading options (#4959 ) * speculative: expose draft threading * fix usage format * accept -td and -tbd args * speculative: revert default behavior when -td is unspecified * fix trailing whitespace	2024-01-16 13:04:32 +02:00
Concedo	dc7bc0cb50	Merge commit '584d674be6' into concedo_experimental # Conflicts: # .github/workflows/nix-flake-update.yml # Makefile # Package.swift # ggml-cuda.cu # tests/test-quantize-fns.cpp	2024-01-14 16:29:44 +08:00
Yann Follet	722d33f34e	main : add parameter --no-display-prompt (#4541 ) * add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens * remove empty line --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-13 18:09:08 +02:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Georgi Gerganov	7edefbd79c	main : better name for variable n_print (#4874 )	2024-01-11 22:46:26 +02:00
Georgi Gerganov	3ca63b4538	main : disable token count by default (#4874 )	2024-01-11 22:43:05 +02:00
pudepiedj	43f76bf1c3	main : print total token count and tokens consumed so far (#4874 ) * Token count changes * Add show token count * Updating before PR * Two requested changes * Move param def posn	2024-01-11 18:14:52 +02:00
Concedo	66533c8424	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # Package.swift # README.md # tests/test-quantize-fns.cpp	2024-01-09 17:48:18 +08:00
Georgi Gerganov	52531fdff8	main : add self-extend support (#4815 ) * examples : add passkey test * passkey : better prints * passkey : select pass key pos from CLI * passkey : simplify n_past logic * llama : "self-extend"-like context extension * passkey : add comment * main : add Self-Extend support * llama : add comment about llama_kv_cache_seq_div	2024-01-08 11:18:32 +02:00
kalomaze	123bff9a0f	Full DynaTemp implementation + UI (#600 ) * move Dynatemp changes to new branch * fix float header * Properly reintroduce variable expert count Controllable through experts.txt * first pass at DynaTemp UI Checkbox partial implemented, Min and Max Temp implemented * DynaTemp UI Checkbox Trigger DynaTemp on checkbox * DynaTemp UI checkbox edition Hell Yeah! DynaTemp! * Remove greedy dynatemp * Fix race condition caused by debug print * Fixed broken presets and miro Fixes broken presets and mirostat * Remove debug function + HHI temp Also removed unnecessary softmax double precision * Fix whitespace (?) for generate function * epic upstream renaming scheme fix * fix stupid indents * Other cleanup Reintroduce unused rep pen function, move temp functions first before entropy dynamic temp * Slight indent fix * revert batch pyinstaller maker to mainline and also delete experts.txt since adjustable routing is also being removed for the PR * compact dynatemp into a single value dynatemp_range. This is a float which represents the allowed deviation from the min and max temperature when using dynatemp. Thus, if we want a value of dynatemp_min=0.3, dynatemp_max=0.5, then we would simply set temperature=0.4 and dynatemp_range=0.1. Functionally dynatemp would operate the same, but it would simplify usage and make it a single easy to adjust value. --------- Co-authored-by: Alexander Abushady <aabushady214@gmail.com> Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2024-01-06 11:13:16 +08:00
Concedo	4a8308b1c8	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile	2023-12-23 10:40:29 +08:00
LeonEricsson	7082d24cec	lookup : add prompt lookup decoding example (#4484 ) * initial commit, going through initializations * main loop finished, starting to debug * BUG: generates gibberish/repeating tokens after a while * kv_cache management * Added colors to distinguish drafted tokens (--color). Updated README * lookup : fix token positions in the draft batch * lookup : use n_draft from CLI params * lookup : final touches --------- Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 18:05:56 +02:00
Concedo	ec21fa7712	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # CMakeLists.txt # Makefile # Package.swift # README.md # ggml-cuda.cu # llama.cpp # llama.h # scripts/sync-ggml.sh # tests/CMakeLists.txt	2023-12-08 17:42:26 +08:00
Georgi Gerganov	bcc0eb4591	llama : per-layer KV cache + quantum K cache (#4309 ) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 13:03:17 +02:00
Kerfuffle	5aa365d88f	llama : allow overriding GGUF metadata when loading model (#4092 ) * feat: Allow overriding GGUF metadata when loading model * Fix the one time GCC is stricter than clang about something * Step1 * Refactor... basically everything! * Nuke obsolete GetArrayLen struct * simplify std::string specialization * Various cleanups Add informational output when overrides are applied Warn user when an override with the wrong type is specified * Fix broken logic for parsing bool KV overrides Fix issue where overrides didn't apply when key missing in GGUF metadata Resolve merge changes * llama : rearrange model params * Update new GET_KEY call Add note that metadata KV overrides aren't reflected in initial metadata KV info dump --------- Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-05 19:19:18 +02:00
MaggotHATE	52c8bc3cf3	sampling : custom samplers order (#4285 ) * Samplers sequence order w parameter * Cleaned commented code * Fixed formatting * Rewrote with unordered_map * Revert and rewrite, too many problems and safeguards would be needed * Fixed code style * Code style fixes according to review * More readable samplers input string, fixed help * Style fix in sampler_queue * Formatting fixes * Fixing whitespaces	2023-12-05 12:05:51 +02:00
Concedo	8acd7be734	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # README.md	2023-11-27 14:06:14 +08:00
Georgi Gerganov	6b0a7420d0	llama : KV cache view API + better KV cache management (#4170 ) * llama : keep track of used KV cells + better KV cache management * llama : zero KV cache used upon clear ggml-ci * llama : allow exporting a view of the KV cache (#4180) * Allow exporting a view of the KV cache * Allow dumping the sequences per cell in common * Track max contiguous cells value and position as well * Fix max contiguous empty cells index calculation Make dump functions deal with lengths or sequences counts > 10 better * Fix off by one error in dump_kv_cache_view * Add doc comments for KV cache view functions Eliminate cell sequence struct; use llama_seq_id directly Minor cleanups * common : add -dkvc arg for enabling kv cache dumps --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-11-23 19:07:56 +02:00
Concedo	56a5fa7a60	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # tests/test-tokenizer-0-falcon.py # tests/test-tokenizer-0-llama.py	2023-11-20 22:37:06 +08:00
Seb C	881800d1f0	main : Add ChatML functionality to main example (#4046 ) Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com>	2023-11-20 14:56:59 +01:00
Concedo	6bf8ee4aea	Merge branch 'master' into concedo_experimental # Conflicts: # Makefile # ggml-cuda.cu # tests/test-tokenizer-0-falcon.py # tests/test-tokenizer-0-llama.py	2023-11-18 11:10:45 +08:00
Kerfuffle	91f6499393	Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040 ) * gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time	2023-11-16 19:14:37 -07:00
Concedo	d7729ac3eb	Merge branch 'master' into concedo_experimental	2023-11-03 16:00:05 +08:00
Georgi Gerganov	8f961abdc4	speculative : change default p_accept to 0.5 + CLI args (#3919 ) ggml-ci	2023-11-03 09:41:56 +02:00
Georgi Gerganov	05816027d6	common : YAYF (yet another YARN fix) (#3925 ) ggml-ci	2023-11-03 09:24:00 +02:00
Concedo	bc4ff72317	not working merge	2023-11-02 17:52:40 +08:00
cebtenzzre	b12fa0d1c1	build : link against build info instead of compiling against it (#3879 ) * cmake : fix build when .git does not exist * cmake : simplify BUILD_INFO target * cmake : add missing dependencies on BUILD_INFO * build : link against build info instead of compiling against it * zig : make build info a .cpp source instead of a header Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com> * cmake : revert change to CMP0115 --------- Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>	2023-11-02 08:50:16 +02:00
Concedo	1ab18ecb53	Merge commit 'c43c2da8af' into concedo_experimental # Conflicts: # llama.cpp	2023-11-02 11:17:59 +08:00
cebtenzzre	898aeca90a	llama : implement YaRN RoPE scaling (#2268 ) Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>	2023-11-01 18:04:33 -04:00


334 commits