Mirror of https://github.com/LostRuins/koboldcpp.git, synced 2026-05-17 12:39:09 +00:00
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images through the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace the old `type` field of type `common_speculative_type` in the `common_params_speculative` struct with a vector, so that multiple types can be specified
* introduce `common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)` to determine which implementations the user has enabled
* introduce `common_speculative_type_from_names(const std::vector<std::string> & names)` to parse the user-provided spec types
* all speculators run sequentially; the best one wins (we verify its drafted tokens)
* maximize the expected number of accepted tokens for the current round by computing the product of the probability of accepting the current token (n_acc_tokens / n_gen_drafts) and the draft's length

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
| Name |
|---|
| batched |
| batched.swift |
| convert-llama2c-to-ggml |
| debug |
| deprecation-warning |
| diffusion |
| embedding |
| eval-callback |
| gen-docs |
| gguf |
| gguf-hash |
| idle |
| llama.android |
| llama.swiftui |
| lookahead |
| lookup |
| model-conversion |
| parallel |
| passkey |
| retrieval |
| save-load-state |
| simple |
| simple-chat |
| simple-cmake-pkg |
| speculative |
| speculative-simple |
| sycl |
| training |
| CMakeLists.txt |
| convert_legacy_llama.py |
| json_schema_pydantic_example.py |
| json_schema_to_grammar.py |
| llama.vim |
| pydantic_models_to_grammar.py |
| pydantic_models_to_grammar_examples.py |
| reason-act.sh |
| regex_to_grammar.py |
| server-llama2-13B.sh |
| server_embd.py |
| ts-type-to-grammar.sh |