* feat: add a primitive form of continuous batching
* fix: deadlock in batching fallback
* fix: windows build
* chore: suppress the contbatch arg from --help
* feat: batch-aware rep_pen_slope
* fix: automatically disable shifting when batching is enabled
* fix: mixed-path state corruption
* fix: attempt to fully separate the two pipelines
* added a semaphore to prevent non-batchable requests from starting while batched requests are running
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* API: add an /extra/chat_template route
A lot of manual tweaking is done when swapping between models. We can automate or make better assumptions about some of them by having more information, such as chat template. This PR adds an endpoint /extra/chat_template which returns the model chat template string as is in a 'chat_template' key. The front end can then use this to derive the proper templates or use it as is, or at least warn the user when they are trying to use e.g. a Mistral preset with a Llama 3.1 model.
* switch to pre-established /props endpoint for chat template
* bug-fix (upstream): one-off in string juggling
* GradientAI Auto ROPE Base calculation
https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
has a formula that better fits the ideal rope scaling.
Tested with Lllama3, checked calculation is correct for llama2. Retains logic for not scaling rope if under trained CTX.
* add in solar scaling logic
Solar based models require the context values to be multiplied by 8. This is (i'm guessing) because the positions as based on a 32k context, but sliding window of 4k.
* Update model_adapter.h
adding in tensor count to identify solar models based on tensor count of 435.
* Update model_adapter.cpp
add in n_tensor count for solar identification
* refactor and cleanup GradientAI rope scaling
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>