* improvements to tool calling logic (merged changes from old PR branch)
* added tweaks so improved tool calls reuse the old context, but this needs testing; refer to the PR.
* fixes for breakage introduced by concedo's modifications
* fixed error in reasoning
* extremely hacky way to cache the tool list; please fix
* oops forgot to add this
* slightly less hacky way to preserve the tool list in context
* prevented unintended tool calls when the LLM outputs something irrelevant to the tool-call decision
* fixed something that broke koboldlite
* fixed bug added by concedo that broke jinja tools
* experimental further compression of tools array, needs testing
* reverted experimental further compression of tools array
* final cleanup
* add newline after memory insert
* changed tool reasoning to always be emitted as JSON, to enforce inclusion of the final decision
* used the new JSON format to skip the extra LLM call when it is not necessary (see the sketch after this list)
* catch more cases of bad LLM output
* further cleanup
* got it down to just one llm call!
* better json format
* even better json format
* further refinement to json format
* further refinement to json format
* fixed broken tool calling
* single-call enforced json method now seems to work well. removed fallbacks as they are no longer required.
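A minimal sketch of what the enforced single-call JSON decision could look like, assuming nlohmann::json; the field names ("decision", "tool") are invented here, since the log does not spell out the actual format:

```cpp
#include <string>
#include <nlohmann/json.hpp>

// Hypothetical shape of the enforced decision object:
// {"reasoning": "...", "decision": "call_tool" | "no_tool", "tool": "name"}
struct tool_decision {
    bool        call_tool = false;
    std::string tool_name;
};

// Parse the model's single JSON reply; returns false on malformed output so
// the caller can fall back to a plain chat response.
static bool parse_tool_decision(const std::string & raw, tool_decision & out) {
    auto j = nlohmann::json::parse(raw, /*cb=*/nullptr, /*allow_exceptions=*/false);
    if (j.is_discarded() || !j.is_object()) {
        return false; // bad LLM output: not a valid JSON object
    }
    if (j.value("decision", "") == "call_tool") {
        out.call_tool = true;
        out.tool_name = j.value("tool", "");
        return !out.tool_name.empty();
    }
    out.call_tool = false; // explicit "no_tool": skip any further LLM call
    return true;
}
```

Because the reply is either valid JSON with an explicit decision or is rejected outright, no second LLM call is needed to disambiguate.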
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* Fix DoS / integer overflow
* Remove the optional; use INT64_MAX as the placeholder value instead (it's technically -1, so it fits :)
* White space
* Actually, since it's unsigned, use UINT64_MAX
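A sketch of the sentinel-plus-overflow-guard pattern these commits describe; the helper names are hypothetical and the real validation site is not shown in the log:

```cpp
#include <cstdint>

// "unset" is encoded as UINT64_MAX (i.e. (uint64_t) -1) instead of an optional
constexpr uint64_t SIZE_UNSET = UINT64_MAX;

static bool size_is_set(uint64_t v) {
    return v != SIZE_UNSET;
}

// checked addition: reject sizes/offsets that would wrap around (DoS guard)
static bool checked_add_u64(uint64_t a, uint64_t b, uint64_t & out) {
    if (a > UINT64_MAX - b) {
        return false; // would overflow
    }
    out = a + b;
    return true;
}
```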
* fix: TypeError when loading base model remotely in convert_lora_to_gguf
* refactor: simplify base model loading using cache_dir from HuggingFace
* Update convert_lora_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: add remote_hf_model_id to trigger lazy mode in LoRA converter
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory; when more than 1024 threads are needed,
it runs multiple workgroups and synchronizes them through a pipeline barrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied the same change to the original shared-memory
version of the shader, which is still used when ncols <= 1024. (A CPU-side
sketch of the pair-based sort follows this group of commits.)
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
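The CPU-side sketch referenced above: a reference bitonic argsort over (value, index) pairs, assuming the row length is a power of two (shorter rows would be padded). Keeping a copy of the float key next to its index means each compare-and-swap touches one contiguous pair instead of chasing the index into a second array; on the GPU, each inner j-step over more than 1024 elements becomes its own dispatch separated by a pipeline barrier:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct kv {
    float    val; // copy of the key, kept alongside the index
    uint32_t idx;
};

static void bitonic_argsort(std::vector<kv> & a) {
    const size_t n = a.size(); // assumed to be a power of two
    for (size_t k = 2; k <= n; k *= 2) {        // bitonic sequence size
        for (size_t j = k / 2; j > 0; j /= 2) { // compare distance
            for (size_t i = 0; i < n; ++i) {
                const size_t l = i ^ j;
                if (l > i) {
                    const bool asc = (i & k) == 0; // direction of this block
                    if ((a[i].val > a[l].val) == asc) {
                        std::swap(a[i], a[l]);
                    }
                }
            }
        }
    }
}
```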
* Fix overly relaxed check on the CUDA "fast copy" (can_be_transposed) condition (see the sketch after this group of commits)
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
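A hedged sketch of what a stricter can_be_transposed-style condition verifies; this illustrates the idea only, and is not ggml's actual code or field names:

```cpp
#include <cstddef>
#include <cstdint>

// A 2D tensor may take the transposed fast-copy path only when it really is
// a pure transpose view of a contiguous row-major buffer, i.e. its byte
// strides are exactly the swapped contiguous strides. A check inspecting
// only one stride would wrongly accept padded or non-contiguous views.
struct view2d {
    int64_t ne0, ne1; // extents
    size_t  nb0, nb1; // byte strides along dim 0 / dim 1
    size_t  ts;       // element (type) size in bytes
};

static bool can_be_transposed(const view2d & v) {
    return v.nb1 == v.ts && v.nb0 == v.ts * (size_t) v.ne1;
}
```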
Test 'Q4_K_M' quantization on https://huggingface.co/pfnet/plamo-2-translate.
'suffix_to_score' ends up holding 193510 entries; if the memory is not
reserved up front, it takes 19 allocations to reach the final capacity of
262144, as sketched below.
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
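A sketch of the reservation this commit adds, assuming suffix_to_score is a std::vector grown by push_back (the doubling growth 1, 2, 4, …, 262144 accounts for the 19 allocations):

```cpp
#include <cstddef>
#include <vector>

static std::vector<float> build_scores(size_t n /* 193510 in the test */) {
    std::vector<float> suffix_to_score;
    suffix_to_score.reserve(n); // one allocation instead of ~19 doublings
    for (size_t i = 0; i < n; ++i) {
        suffix_to_score.push_back(0.0f); // placeholder for the real score
    }
    return suffix_to_score;
}
```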
* Add files via upload
* fix unit test
* fix crashes for --reasoning-format=none
* Patch buggy official MiniMax-M2 chat template
* add upstream minja fix: https://github.com/ochafik/minja/pull/7
* Fix <think> token not being generated
* add test copied from https://github.com/ggml-org/llama.cpp/pull/16946
* cleanup
* Hopefully fixes the compilation error on CI
* Delete chat template patching since it’s fixed by upstream Minja
* Remove unneeded MiniMax-M2 template patch
https://github.com/ochafik/minja/pull/7#issuecomment-3480356100
* Add proper handling of optional parameters with test
merged tests from: https://github.com/ggml-org/llama.cpp/pull/16946/commits/23d4bb75c485c12ac89f81c424dc03c87a640e8c
* Fix bug that made all tool parameters optional
* Move xml tool parser to separate file
* cleanup & add tests for GLM4.5
* add streaming tests & enhancement & cleanups
Add streaming tests for both GLM 4.5 and MiniMax-M2.
Clean up preserved_tokens.
Clean up grammar rule names.
Improve the parser's stability.
* cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo
* apply suggestions from reviewers
* fix a misuse for data.grammar_lazy
* fix grammar when a tool has no arguments
* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2
* update chat.cpp
* fix grammar for GLM 4.5/4.6
* Try to fix the Jinja template for GLM
* Try to fix GLM-4.6.jinja
* Update common/chat-parser-xml-toolcall.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking
* Improve Kimi-K2 chat template
* Fix unit test
* Fix "Invalid tool call arguments passed" in a rare case.
In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
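A minimal sketch of the guard implied by the fix, assuming nlohmann::json; is_complete_json is a hypothetical name. json::accept() returns true only when the whole input is a single well-formed JSON value, so a raw string that merely begins with JSON is rejected:

```cpp
#include <string>
#include <nlohmann/json.hpp>

// Treat content as tool-call arguments only if the entire payload is valid
// JSON, not merely a JSON-looking prefix followed by other text.
static bool is_complete_json(const std::string & s) {
    return nlohmann::json::accept(s);
}

// is_complete_json(R"({"location":"Paris"})")          -> true
// is_complete_json(R"({"location":"Paris"} trailing)") -> false
```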
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation
Fix compilation errors in the ASCEND_310P-specific ROPE operation code
by adding .get() calls when passing acl_tensor_ptr smart pointers to
functions expecting raw aclTensor* pointers.
This fixes code that was missed in the previous refactoring commit
(8981848), which changed the ggml_cann_create_tensor() return type from
aclTensor* to acl_tensor_ptr.
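A minimal stand-alone illustration of the pattern; aclTensor, the kernel, and the deleter below are stand-ins, not the real CANN API:

```cpp
#include <memory>

struct aclTensor { /* opaque device tensor handle */ };

static void acl_destroy(aclTensor * t) { delete t; }
static void acl_rope_kernel(aclTensor * /*src*/) { /* launch kernel */ }

// the refactor changed factories to return this smart pointer type
using acl_tensor_ptr = std::unique_ptr<aclTensor, void (*)(aclTensor *)>;

int main() {
    acl_tensor_ptr t(new aclTensor, acl_destroy);
    // The C-style API still takes a raw pointer, hence the explicit .get();
    // passing `t` directly was the compile error this commit fixes:
    acl_rope_kernel(t.get());
}
```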
* cann: format code
Update openEuler version
Remove variable ASCEND_SOC_TYPE
Modify the chip type
Fix case in zip filename
Change "device" to "chip_type"
Modify the value of chip_type
* server: split HTTP into its own interface
* move server-http and httplib to its own file
* add the remaining endpoints
* fix exception/error handling
* renaming
* missing header
* fix missing windows header
* fix error responses from http layer
* fix slot save/restore handler
* fix case where only one stream chunk is returned
* add NOMINMAX
* do not call sink.write on empty data
* use safe_json_to_str for SSE (see the sketch after this list)
* clean up
* add some comments
* improve usage of next()
* bring back the "server is listening on" message
* more generic handler
* add req.headers
* move the chat template print to init()
* add req.path
* cont : minor
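A sketch of the SSE details called out in the list above; the sink signature is assumed, and the payload is taken to be pre-serialized upstream by safe_json_to_str into a single-line JSON string safe to embed in a `data:` field:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// assumed shape of the write callback the HTTP layer exposes
using http_sink = std::function<bool(const char *, size_t)>;

static bool send_sse_chunk(http_sink & sink, const std::string & payload) {
    if (payload.empty()) {
        return true; // do not call sink.write on empty data:
                     // a zero-byte write can be taken as end-of-stream
    }
    const std::string event = "data: " + payload + "\n\n"; // SSE framing
    return sink(event.data(), event.size());
}
```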
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>