* fix: token usage fix for mistral-vibe
* fix: generate unique request IDs for OAI-compatible responses
* fix: prompt_tokens reporting KV cache size instead of actual count during streaming
* fixes for PR #2015
For (1), this is not a good idea. If it returned 0 (e.g. during an error), this value may not be updated and will return the value of a previous or different request. It's better to return 0 in those cases.
For (2), this is a good idea but we don't need that level of randomness. I'll probably swap it with a 6-digit random number instead.
For (3), the official OpenAI spec gates it behind stream_options.include_usage = true, so I'll do that too.
* missed 1 item
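A rough sketch of points (2) and (3), for illustration only. The names `make_request_id` and `should_report_usage` and the caller-supplied prefix are hypothetical, not taken from the actual patch; the only assumptions carried over from the notes above are a 6-digit random suffix and usage reporting gated on `stream_options.include_usage` while streaming.

```cpp
#include <random>
#include <string>

// Hypothetical helper: builds a request ID from a caller-supplied prefix
// plus a random 6-digit number (100000..999999).
std::string make_request_id(const std::string & prefix) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(100000, 999999);
    return prefix + std::to_string(dist(rng));
}

// Hypothetical helper: non-streaming responses always carry usage,
// streaming responses only when the client set include_usage = true.
bool should_report_usage(bool streaming, bool include_usage) {
    return !streaming || include_usage;
}
```

Six digits keep IDs short while still making collisions between nearby requests unlikely; they are not meant to be globally unique.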
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* tweak format string types
This may not be all of them, but it's the ones which warn on OpenBSD
* complete the changes needed to fix the format string specifiers
* avoid using inttypes, directly cast to size_t (u64 usually) instead
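A minimal sketch of the cast-to-size_t approach, assuming the values being printed fit in `size_t` on the target platform. `format_count` is a hypothetical name used only for this example.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: prints a 64-bit count without the <cinttypes> macros.
// The explicit cast makes the argument type match %zu on every platform,
// which is what silences the format warnings.
std::string format_count(uint64_t n) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%zu", (size_t) n);
    return buf;
}
```

The trade-off is that on a platform where `size_t` is 32 bits the cast would truncate larger values, which is why the note above says "u64 usually".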
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Many models have vocabulary sizes, and thus tensor shapes, with more
than 5 digits (ex: Gemma 3's vocab size is 262,208).
I already fixed this for `llama_format_tensor_shape` but missed it for
`llama_format_tensor_shape` until now. Oops.
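To illustrate why the field width matters, here is a small sketch. `format_shape` is a hypothetical stand-in, not the real function, and the width of 6 is only an example value, not necessarily the one used in the fix.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in: right-aligns each dimension in a field of
// `width` characters, separated by ", ".
std::string format_shape(const std::vector<int64_t> & ne, int width) {
    std::string out;
    char buf[64];
    for (size_t i = 0; i < ne.size(); i++) {
        std::snprintf(buf, sizeof(buf), "%s%*lld", i == 0 ? "" : ", ", width, (long long) ne[i]);
        out += buf;
    }
    return out;
}
```

With a width of 5, a 6-digit dimension such as 262208 overflows its column, so rows printed for different tensors no longer line up; a wider field keeps them aligned.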
* Set C locale for consistent float formatting across all binaries.
* Add C locale setting to all tools binaries
Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.
* Apply suggestion from @JohannesGaessler
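A minimal sketch of the effect, assuming the call is made once near the start of the program. `format_float_c_locale` is a hypothetical helper for this example only; the binaries themselves just call `std::setlocale` directly.

```cpp
#include <clocale>
#include <cstdio>
#include <string>

// Hypothetical helper: pins the numeric locale to "C" so that printf-style
// formatting always uses '.' as the decimal separator.
std::string format_float_c_locale(double v) {
    std::setlocale(LC_NUMERIC, "C");
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.2f", v);
    return buf;
}
```

Without this, a process running under a locale that uses a decimal comma could print `1,50` instead of `1.50`, which breaks anything that parses the output.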
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml-webgpu: fix workgroup dispatch limit for large batch sizes
WebGPU limits dispatches to 65535 workgroups per dimension. Large MUL_MAT
operations with batch sizes exceeding this limit would fail.
* add compute_2d_workgroups() helper to split total workgroup ID across
X/Y dimensions
* update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
dispatch
* update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
from 2D dispatch
* update mul_mat.wgsl to compute global index from 2D workgroup
coordinates
* refactor all three mul_mat dispatch paths to use the shared helper
* ggml-webgpu: add bounds checking for over-dispatched workgroups
2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.
* add batch_idx bound check to mul_mat_reg_tile.wgsl and
mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
from accessing invalid memory
* fixes test failures with large batch sizes (e.g., bs=[128, 1024])
* ggml-webgpu: add back TODO for splitting large sizes into batches
* Optimize 2d workgroup provisioning
* Set some parameters that increase speed
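The 2D dispatch and the bounds check described above can be sketched on the host side as follows. This is an illustration under assumptions, not the actual `compute_2d_workgroups()` code: `split_workgroups`, `linear_workgroup_id`, and `is_active` are hypothetical names, and the split strategy (fill X first, then round Y up) is just one simple option.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t MAX_WG_PER_DIM = 65535; // per-dimension dispatch limit

struct Dispatch2D {
    uint32_t x;
    uint32_t y;
};

// Hypothetical helper: spreads `total` workgroups over X and Y so that
// neither dimension exceeds the limit. May over-dispatch (x * y >= total).
Dispatch2D split_workgroups(uint32_t total) {
    assert(total > 0);
    if (total <= MAX_WG_PER_DIM) {
        return { total, 1 };
    }
    // Ceiling division in 64 bits to avoid overflow near UINT32_MAX.
    uint32_t y = (uint32_t) (((uint64_t) total + MAX_WG_PER_DIM - 1) / MAX_WG_PER_DIM);
    assert(y <= MAX_WG_PER_DIM);
    return { MAX_WG_PER_DIM, y };
}

// What a shader would do: rebuild the linear workgroup ID from 2D coordinates.
uint32_t linear_workgroup_id(uint32_t wg_x, uint32_t wg_y, uint32_t num_wg_x) {
    return wg_y * num_wg_x + wg_x;
}

// Bounds check: over-dispatched workgroups must exit without touching memory.
bool is_active(uint32_t linear_id, uint32_t total) {
    return linear_id < total;
}
```

For example, 70000 workgroups would be dispatched as 65535 x 2, so the last 61070 workgroups in the second row fall outside the valid range and must return early.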
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Allow webgpu_buf_pool to resize if needed, remove inflight_threads, and replace inflight_threads with num_kernels for submission
* Run clang-format
* Keep track of num batched kernels that have not been submitted yet
* Run clang-format
* Increase buf pool max size
* Increase param buf pool init size
* Remove webgpu buf pool resizing
* Merge with master
* Add buffer pool growth
* Move buffer pool growth outside of lock
* Reduce max pool size to 32
* Run clang-format
* Only resize param buf pool
* ggml-webgpu: Add binary op support for overlapping and non-contiguous tensors
* Add newline to binary.wgsl
* Add a binary op test for overlapping src tensors to test_bin_bcast.
* Remove unnecessary newline.
* vulkan: fix and enable cpy_tensor_async function
* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore
* update offload_op logic
* fix missing transfer submission
* disable async transfer queue on AMD GCN
* revert op batch size change
* fix cpy_tensor_async checks
* Add model metadata loading from huggingface for use with other tests
* Add incremental chunking instead of full redownload, fix caching issue and add warning when it fails
* Add support for split models, load metadata from each individual split file, also avoid mmproj
* Code cleanup, revert incremental downloading
* Only compile when cpp-httplib has SSL support
* Fix formatting