* PoC: add chat template heuristics
The Vicuna fallback chat template adapter is not ideal in some cases: for example, a test against a sub-portion of the BBC news classification task on Kaggle gave 82% accuracy with Vicuna versus 88% with the official ChatML format for a q4_k_m Qwen 2.5 3B-Instruct GGUF.
This PR adds a simple proof-of-concept heuristic that inspects the chat template and upgrades the adapter when it can.
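The idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name and the rule table below are hypothetical stand-ins for the JSON list of search strings the later commits introduce.

```python
# Hypothetical sketch: map marker substrings found in a model's chat
# template to a better adapter than the generic fallback. The rules
# below are illustrative only, not the PR's real heuristic table.
ADAPTER_RULES = [
    ("<|im_start|>", "ChatML"),          # Qwen and other ChatML models
    ("<start_of_turn>", "Gemma 2"),
    ("<|start_header_id|>", "Llama 3.x"),
    ("[INST]", "Mistral"),
]

def guess_adapter(chat_template: str, fallback: str = "Vicuna") -> str:
    """Return the first adapter whose marker appears in the template,
    or the fallback when nothing matches."""
    for marker, adapter in ADAPTER_RULES:
        if marker in chat_template:
            return adapter
    return fallback
```

The first matching rule wins, so more specific markers should be listed before generic ones, which mirrors the "better qwen vs generic heuristic" refinement below.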
* gemma 2 heuristic
* Phi 4, Llama 3.x heuristics
* better qwen vs generic heuristic
* cleanup
* mistral (generic) heuristic
* fix sys msg for mistral
* phi 3.5
* mistral v3
* cohere (aya expanse 32b based)
* only derive from chat template if AutoGuess
* add notes about alpaca fallbacks
* added AutoGuess.json dummy
* add mistral v7
* switch to using a json list with search strings
Squashed commit:
[0a6306ca0] draft wip don't use (will be squashed)
[a758a1c9c] wip don't use (will be squashed)
[e1994d3ce] wip don't use
[f59690d68] wip
[77228147d] wip on spec decoding. don't use yet
[2445bca54] wip adding speculative decoding (+1 squashed commits)
Squashed commits:
[50e341bb7] wip adding speculative decoding
* Support chunked encoding.
The koboldcpp API does not support HTTP chunked encoding. Some HTTP libraries, notably Go's net/http, can automatically choose to use chunked encoding. This adds support for chunked encoding within the do_POST() handler.
* refactor slightly to add additional safety checks and follow original format
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* API: add an /extra/chat_template route
A lot of manual tweaking is done when swapping between models. We can automate some of it, or at least make better assumptions, by exposing more information such as the chat template. This PR adds an /extra/chat_template endpoint that returns the model's chat template string as-is under a 'chat_template' key. The front end can then use it to derive the proper templates, use it directly, or at least warn the user when they are trying to use e.g. a Mistral preset with a Llama 3.1 model.
* switch to pre-established /props endpoint for chat template
* bug-fix (upstream): off-by-one in string juggling