Create a pool of N threads, each of which grabs a chunk of up to 100 tests at a
time to iterate through. The chunk size decreases as fewer tests remain.
Each thread uses its own dev and cpu backend, and set_n_threads_fn is not
called on the cpu backend.
Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Guard calls to print_test_result with a mutex.
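
A minimal sketch of the chunked work distribution described above, not the actual test harness code. The names `chunk_queue` and `run_all` are hypothetical, and the shrink rule (`remaining / n_threads`, capped at 100) is an assumed heuristic; the commit only states that chunks hold up to 100 tests and get smaller as fewer remain. Output is serialized with a mutex, in the spirit of the print_test_result fix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical work queue that hands out [begin, end) ranges of test indices.
struct chunk_queue {
    std::mutex mtx;
    size_t next = 0;
    size_t total;
    size_t n_threads;
    size_t max_chunk;

    chunk_queue(size_t total_, size_t n_threads_, size_t max_chunk_ = 100)
        : total(total_), n_threads(n_threads_), max_chunk(max_chunk_) {
        assert(n_threads > 0 && max_chunk > 0);
    }

    // Returns false once no tests remain.
    bool grab(size_t & begin, size_t & end) {
        std::lock_guard<std::mutex> lock(mtx);
        if (next >= total) {
            return false;
        }
        const size_t remaining = total - next;
        // Assumed heuristic: an even share of what is left, at least 1, at most max_chunk.
        const size_t chunk = std::min(max_chunk, std::max<size_t>(1, remaining / n_threads));
        begin = next;
        end   = next + chunk;
        next  = end;
        return true;
    }
};

// Runs n_tests placeholder tests on n_threads workers and returns how many completed.
size_t run_all(size_t n_tests, size_t n_threads, bool verbose = false) {
    chunk_queue queue(n_tests, n_threads);
    std::mutex print_mtx;
    std::vector<char> done(n_tests, 0);
    std::vector<std::thread> workers;

    for (size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            size_t begin = 0, end = 0;
            while (queue.grab(begin, end)) {
                for (size_t i = begin; i < end; ++i) {
                    done[i] = 1; // placeholder for running test i; each index is written by one thread only
                }
                if (verbose) {
                    // Serialize output so lines from different threads do not interleave.
                    std::lock_guard<std::mutex> lock(print_mtx);
                    std::printf("thread %zu: tests [%zu, %zu)\n", t, begin, end);
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return static_cast<size_t>(std::count(done.begin(), done.end(), 1));
}
```

Calling `run_all(250, 4, true)` prints one line per chunk; under this heuristic the first chunk covers 62 tests and later chunks shrink toward 1.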
* initial talkie support, coherent
* reorder to follow convention
* absorb inverse rope
* stop folding scalars to improve quantization
* use broadcasting instead of duplication
* style cleanup
* add scaling support to LoraTorchTensor; use that path in conversion
* use layer_out_scale instead of embd_skip_scale
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K
* Fix to pass editorconfig checks
* Remove mul-mat-legacy pipeline
* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
* Only run webgpu CI on my fork
* Add webgpu only workflow
* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled
* restore build.yml
* ci : disable SYCL f16 builds
* ci : extract android and hip into separate workflows
* ci : move webgpu to separate workflow
* ci : move the rpc to a separate workflow
* ci : extract s390x and ppcl jobs
* ci : extract opencl job into a separate workflow
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question; the math is unchanged.
Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
* Refactored Compressed Tensors NVFP4 support for new base.py
* Support compressed-tensors NVFP4 conversion
* Moved Qwen MTP remap into filter_tensors
* simplify
* pathlib no longer used
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* TP: fix ggml context size calculation, memory leak
* move split state cache back into the context
* revert to constant ggml context size for cgraphs
* increase headroom for statically allocated tensors
* remove obsolete include
* ggml: implement `gguf_init_from_buffer`
* test: `gguf_init_from_buffer`
* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer
* fix: use `GGML_UNUSED`
Co-authored-by: Copilot <copilot@github.com>
* fix: remove `total_size` from `gguf_reader`
* fix: file offset calculation, rename `offset` to `data_offset`
Co-authored-by: Copilot <copilot@github.com>
* refactor: extract model loader bug fixes to another PR
* feat: add `gguf_init_from_callback`
* fix: always require a max expected size
* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`
* fix: harden against offset overflow in buffer read
* fix: remove seek behavior from the callback
* feat: `max_chunk_read == 0` means `SIZE_MAX`
* fix: seeking in a gguf file with no tensors
---------
Co-authored-by: Copilot <copilot@github.com>
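
A generic sketch of the kind of bounds check that "harden against offset overflow in buffer read" refers to, under the assumption that offsets and sizes are `uint64_t` as the earlier bullet states. `buffer_reader` is a hypothetical type for illustration, not the actual `gguf_reader`. The point is that `offset + n > size` can wrap around for a very large `n`, whereas comparing `n` against the remaining byte count cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-memory reader (illustration only, not the gguf API).
struct buffer_reader {
    const uint8_t * data;
    uint64_t        size;
    uint64_t        offset;

    buffer_reader(const uint8_t * data_, uint64_t size_)
        : data(data_), size(size_), offset(0) {}

    // Copies n bytes into out and advances the offset.
    // Returns false, leaving the offset untouched, if the read would run past the end.
    bool read(void * out, uint64_t n) {
        // Overflow-safe form of `offset + n <= size`.
        if (offset > size || n > size - offset) {
            return false;
        }
        std::memcpy(out, data + offset, static_cast<size_t>(n));
        offset += n;
        return true;
    }
};
```

With a 4-byte buffer, a request for `UINT64_MAX` bytes is rejected by the subtraction-based check, while the naive addition would wrap and could pass.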
* fix(action): update SpacemiT toolchain URL and version
Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c
* fix(action): add -L flag to curl command for URL redirection
Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* ci : remove tag from build-self-hosted.yml
* ci : slim -> self-hosted
* ci : prevent heavy CPU jobs from running on fast runners
* ci : prevent cmake pkg from running on dedicated fast runners
* ci : try to bump 3.11 -> 3.13
* ci : move lint back to 3.11
* ci : back to 3.11
* ci : add comment about UI jobs
* ci : move python requirements check to CPU runners
this job is a bit slow for a dedicated "fast" runner
* ci : add self-hosted ui workflow
* ci : fix UI naming
* tmp to check if arm64 fast is compatible with all jobs
* revert last commit
* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf
The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on
PyPI for platform/CPython combinations where 2.6.x is not present.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.
Fixes #23408
* requirements: bump torch floor to 2.11.0 per maintainer
* requirements: pin torch to ==2.11.0 per project policy
* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy
* requirements: suppress check_requirements pin warning on mtmd
The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.
* ty: silence Tensor/Module union check on model[0].auto_model
With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
- change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping in which all rows "owned" by an expert sit in one slice with a known start and end
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
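
The counting-sort idea can be sketched on the host as follows. This is a serial CPU illustration, not the GPU kernel from the change; `expert_slices`, `group_rows_by_expert` and `expert_of_row` are hypothetical names. It takes one pass over the rows to count, one pass over the experts for the prefix sum, and one pass over the rows to scatter, which gives the `O(n_as + n_routed_rows)` bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: expert e owns rows[starts[e]] .. rows[starts[e + 1] - 1].
struct expert_slices {
    std::vector<int32_t> starts; // size n_as + 1
    std::vector<int32_t> rows;   // routed row indices, grouped by expert
};

// expert_of_row[r] is the expert that routed row r is assigned to.
expert_slices group_rows_by_expert(const std::vector<int32_t> & expert_of_row, int32_t n_as) {
    expert_slices out;
    out.starts.assign(static_cast<size_t>(n_as) + 1, 0);

    // Pass 1: histogram of rows per expert, stored shifted by one slot.
    for (int32_t e : expert_of_row) {
        assert(e >= 0 && e < n_as);
        out.starts[e + 1]++;
    }

    // Pass 2: prefix sum turns the counts into slice start offsets.
    for (int32_t e = 0; e < n_as; ++e) {
        out.starts[e + 1] += out.starts[e];
    }

    // Pass 3: scatter each row into the next free slot of its expert's slice.
    std::vector<int32_t> cursor(out.starts.begin(), out.starts.end() - 1);
    out.rows.resize(expert_of_row.size());
    for (size_t r = 0; r < expert_of_row.size(); ++r) {
        out.rows[cursor[expert_of_row[r]]++] = static_cast<int32_t>(r);
    }
    return out;
}
```

For `expert_of_row = {2, 0, 2, 1, 0}` and `n_as = 3`, the slices are `starts = {0, 2, 3, 5}` and `rows = {1, 4, 3, 0, 2}`, so each expert's rows are contiguous and the original order within an expert is preserved.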