koboldcpp

Mirror of https://github.com/LostRuins/koboldcpp.git (last synced 2026-05-08 18:30:50 +00:00)

Author	SHA1	Message	Date
Concedo	32102c2064	Merge branch 'master' into concedo_experimental # Conflicts: # README.md	2023-07-07 14:15:39 +08:00
Judd	36680f6e40	convert : update for baichuan (#2081 ) 1. guess n_layers; 2. relax warnings on context size; 3. add a note that its derivations are also supported. Co-authored-by: Judd <foldl@boxvest.com>	2023-07-06 19:23:49 +03:00
tslmy	a17a2683d8	alpaca.sh : update model file name (#2074 ) The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `ggmlv3.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.	2023-07-06 19:17:50 +03:00
Concedo	220aa707e6	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile # README.md # pocs/vdot/q8dot.cpp # pocs/vdot/vdot.cpp # scripts/sync-ggml.sh # tests/test-grad0.c # tests/test-quantize-fns.cpp # tests/test-quantize-perf.cpp	2023-07-06 15:40:40 +08:00
Tobias Lütke	31cfbb1013	Expose generation timings from server & update completions.js (#2116 ) * use javascript generators as much cleaner API Also add ways to access completion as promise and EventSource * export llama_timings as struct and expose them in server * update readme, update baked includes * llama : uniform variable names + struct init --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson	983b555e9d	Update Server Instructions (#2113 ) * Update server instructions for web front end * Update server README * Remove duplicate OAI instructions * Fix duplicate text --------- Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 21:03:19 +03:00
Stephan Walter	1b107b8550	ggml : generalize `quantize_fns` for simpler FP16 handling (#1237 ) * Generalize quantize_fns for simpler FP16 handling * Remove call to ggml_cuda_mul_mat_get_wsize * ci : disable FMA for mac os actions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson	8567c76b53	Update server instructions for web front end (#2103 ) Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 18:13:35 +03:00
Nigel Bosch	7f0e9a775e	embd-input: Fix input embedding example unsigned int seed (#2105 )	2023-07-05 07:33:33 +08:00
jwj7140	f257fd2550	Add an API example using server.cpp similar to OAI. (#2009 ) * add api_like_OAI.py * add evaluated token count to server * add /v1/ endpoints binding	2023-07-04 21:06:12 +03:00
Tobias Lütke	7ee76e45af	Simple webchat for server (#1998 ) * expose simple web interface on root domain * embed index and add --path for choosing static dir * allow server to multithread because web browsers send a lot of garbage requests we want the server to multithread when serving 404s for favicon's etc. To avoid blowing up llama we just take a mutex when it's invoked. * let's try this with the xxd tool instead and see if msvc is happier with that * enable server in Makefiles * add /completion.js file to make it easy to use the server from js * slightly nicer css * rework state management into session, expose historyTemplate to settings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-04 16:05:27 +02:00
Concedo	69add28324	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml	2023-07-04 18:51:42 +08:00
Henri Vasserman	1cf14ccef1	fix server crashes (#2076 )	2023-07-04 00:05:23 +03:00
WangHaoranRobin	d7d2e6a0f0	server: add option to output probabilities for completion (#1962 ) * server: add option to output probabilities for completion * server: fix issue when handling probability output for incomplete tokens for multibyte character generation * server: fix llama_sample_top_k order * examples/common.h: put all bool variables in gpt_params together	2023-07-03 00:38:44 +03:00
Concedo	b85ea580d3	Merge branch 'master' into concedo_experimental # Conflicts: # README.md	2023-07-02 14:45:25 +08:00
Georgi Gerganov	79f634a19d	embd-input : fix returning ptr to temporary	2023-07-01 18:46:00 +03:00
Georgi Gerganov	04606a1599	train : fix compile warning	2023-07-01 18:45:44 +03:00
Concedo	67cb0b2760	Merge branch 'master' into concedo_experimental	2023-06-30 23:25:40 +08:00
Howard Su	b8c8dda75f	Use unsigned for random seed (#2006 ) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-29 06:15:15 -07:00
Concedo	dff5575647	Merge branch 'master' into concedo_experimental # Conflicts: # .gitignore # Makefile # ggml-opencl.cpp # llama.cpp	2023-06-29 17:35:28 +08:00
Johannes Gäßler	7f9753fa12	CUDA GPU acceleration for LoRAs + f16 models (#1970 )	2023-06-28 18:35:54 +02:00
ningshanwutuobang	cfa0750bc9	llama : support input embeddings directly (#1910 ) * add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add READMD for llava.py * add READMD for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error	2023-06-28 18:53:37 +03:00
Concedo	282376c85a	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # tests/test-quantize-perf.cpp	2023-06-27 19:15:27 +08:00
Howard Su	0be54f75a6	baby-llama : fix build after ggml_rope change (#2016 )	2023-06-27 08:07:13 +03:00
Georgi Gerganov	181e8d9755	llama : fix rope usage after ChatGLM change	2023-06-27 00:37:33 +03:00
David Yang	eaa6ca5a61	ggml : increase max tensor name + clean up compiler warnings in train-text (#1988 ) * Clean up compiler warnings in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues	2023-06-26 22:45:32 +03:00
zrm	b853d45601	ggml : add NUMA support (#1556 ) * detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-26 20:57:59 +03:00
Concedo	e4c9aea840	Merge branch 'master' into concedo_experimental # Conflicts: # README.md	2023-06-26 10:35:47 +08:00
Concedo	d2034ced7b	Merge branch 'master' into concedo_experimental # Conflicts: # README.md # build.zig # flake.nix # tests/test-grad0.c # tests/test-sampling.cpp # tests/test-tokenizer-0.cpp	2023-06-25 17:01:15 +08:00
anon998	c2a08f87b8	fix server sampling: top k sampler first (#1977 ) Co-authored-by: anon <anon@example.org>	2023-06-25 10:48:36 +02:00
Didzis Gosko	527b6fba1d	llama : make model stateless and context stateful (llama_state) (#1797 ) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-24 11:47:58 +03:00
Concedo	b4c532e862	Merge branch 'master' into concedo_experimental	2023-06-20 17:26:27 +08:00
Henri Vasserman	20568fe60f	[Fix] Reenable server embedding endpoint (#1937 ) * Add back embedding feature * Update README	2023-06-20 01:12:39 +03:00
Concedo	d0d3c4f32b	Merge remote-tracking branch 'origin/master' into concedo_experimental # Conflicts: # README.md	2023-06-18 22:53:10 +08:00
Kawrakow	90cc59d6ab	examples : fix examples/metal (#1920 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-18 10:52:10 +03:00
Concedo	278427d9a4	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md	2023-06-18 15:29:44 +08:00
Georgi Gerganov	4f9c43e3bd	minor : warning fixes	2023-06-17 20:24:11 +03:00
Johannes Gäßler	2c9380dd2f	Only one CUDA stream per device for async compute (#1898 )	2023-06-17 19:15:02 +02:00
Georgi Gerganov	051e1b0e6a	llama : fix kv_cache `n` init (close #1903 )	2023-06-17 19:31:20 +03:00
Concedo	9f8e2f8a18	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # pocs/vdot/vdot.cpp # scripts/verify-checksum-models.py # tests/test-quantize-fns.cpp # tests/test-quantize-perf.cpp # tests/test-sampling.cpp # tests/test-tokenizer-0.cpp	2023-06-17 20:02:32 +08:00
Randall Fitzgerald	794db3e7b9	Server Example Refactor and Improvements (#1570 ) A major rewrite for the server example. Note that if you have built something on the previous server API, it will probably be incompatible. Check out the examples for how a typical chat app could work. This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing. Summary of the changes: - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos - applies missing top k sampler - removes interactive mode/terminal-like behavior, removes exclude parameter - moves threads and batch size to server command-line parameters - adds LoRA loading and matches command line parameters with main example - fixes stopping on EOS token and with the specified token amount with n_predict - adds server timeouts, host, and port settings - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text - sets defaults for unspecified parameters between requests - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming - adds CORS headers to responses - adds request logging, exception printing and optional verbose logging - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string - adds printing an error when it can't bind to the host/port specified - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses - prints timing and build info on startup - adds logit bias to request parameters - removes embedding mode - updates documentation; adds streaming Node.js and Bash examples - fixes code formatting - sets server threads to 1 since the current global state doesn't work well with simultaneous requests - adds truncation of the input prompt and better context reset - removes token limit from the input prompt - significantly simplified the logic and removed a lot of variables --------- Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Felix Hellmann <privat@cirk2.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>	2023-06-17 14:53:04 +03:00
Jiří Podivín	5ddf7ea1fb	hooks : setting up flake8 and pre-commit hooks (#1681 ) Small, non-functional changes were made to non-compliant files. These include breaking up long lines, whitespace sanitation and unused import removal. Maximum line length in python files was set to a generous 125 chars, in order to minimize number of changes needed in scripts and general annoyance. The "txt" prompts directory is excluded from the checks as it may contain oddly formatted files and strings for a good reason. Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-06-17 13:32:48 +03:00
David Yang	92f20d9942	train : get raw text instead of page with html (#1905 ) We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.	2023-06-17 09:51:54 +03:00
SuperUserNameMan	b41b4cad6f	examples : add "simple" (#1840 ) * Create `simple.cpp` * minimalist example `CMakeLists.txt` * Update Makefile for minimalist example * remove 273: Trailing whitespace * removed trailing white spaces simple.cpp * typo and comments simple.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-16 21:58:09 +03:00
FrankHB	5b9ccaf104	Fixed possible macro redefinition (#1892 ) MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.	2023-06-16 21:25:01 +03:00
Borislav Stanimirov	9cbf50c041	build : fix and ignore MSVC warnings (#1889 )	2023-06-16 21:23:53 +03:00
Concedo	7ef8d740b9	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile	2023-06-16 16:37:14 +08:00
yangli2	c36e81da62	examples : add chat-vicuna.sh (#1854 ) Co-authored-by: Yang Li <yangliyl@google.com>	2023-06-15 21:05:53 +03:00
Srinivas Billa	9dda13e5e1	readme : server compile flag (#1874 ) Explicitly include the server make instructions for C++ noobsl like me ;)	2023-06-15 20:36:38 +03:00
Johannes Gäßler	6b8312e797	Better error when using both LoRA + GPU layers (#1861 )	2023-06-15 19:06:46 +02:00

216 commits