Commit graph

3371 commits

Author SHA1 Message Date
Daniel Han
08bb85fcda Create CODEOWNERS (#4039) 2026-02-12 02:56:13 -08:00
Lei Zhenyuan
cdc9dc1fb1 fix for tma (#4023) 2026-02-10 17:50:33 -08:00
Datta Nimmaturi
6804c05130 Misc fixes (#4018)
* convert print to logger

* Print but cleaner

* Hide model on multiple devices

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix typo transfomers -> transformers, revert MoE message change

* Update MoE detection message to show num_experts and target_modules

* Fix llama-cli path in save info message

* target_parameters warning for moe

* fix should_convert_module for llm_int8_skip_modules

* fix should_convert_module for llm_int8_skip_modules

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Logging filters

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* negation

* remove should_convert_module patch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-10 06:31:34 -08:00
Daniel Han
10338dbaa4 Fix warmup_ratio deprecation for transformers >= 5.0 (#4019)
* Fix warmup_ratio deprecation warning for transformers >= 5.0

In transformers 5.0, warmup_ratio is deprecated in favor of
warmup_steps which now accepts float values (< 1 = ratio,
>= 1 = absolute steps).

The compiler now conditionally sets warmup_steps=0.1 on
transformers >= 5.0 (same semantics as warmup_ratio=0.1) and
keeps warmup_ratio=0.1 on older versions where warmup_steps
only accepts int.
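
A minimal sketch of the version switch described above, assuming a check on the major version is enough. `warmup_kwargs` is a hypothetical helper for illustration, not the compiler's actual code:

```python
def warmup_kwargs(transformers_version: str, ratio: float = 0.1) -> dict:
    """Pick the warmup keyword argument for a given transformers version string."""
    # Only the major version matters here; suffixes such as "rc1" are ignored.
    major = int(transformers_version.split(".")[0])
    if major >= 5:
        # transformers >= 5.0: warmup_steps accepts a float (< 1 means ratio).
        return {"warmup_steps": ratio}
    # Older versions: warmup_steps only accepts int, so keep warmup_ratio.
    return {"warmup_ratio": ratio}
```

Both branches request the same schedule; only the keyword differs.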

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-10 06:17:47 -08:00
Daniel Han
f106eec5e9 Fix Gemma3 4B training on transformers 5.x (token_type_ids) (#4017)
* Inject token_type_ids for Gemma3 multimodal training on transformers 5.x

In transformers 5.x, create_causal_mask_mapping() raises ValueError when
is_training=True and token_type_ids is None. When doing text-only SFT on
Gemma3 4B (a multimodal model), the dataset_utils detection for
_needs_token_type_ids can miss because:
- The model is wrapped in PeftModel, so type(model).__module__ points to
  peft.peft_model instead of transformers
- The processing_class is a tokenizer (not Gemma3Processor), so the
  fallback MRO check resolves to a module without create_causal_mask_mapping

This adds a fallback in _unsloth_pre_compute_loss that injects
token_type_ids=zeros when:
1. token_type_ids is not already in inputs
2. The inner model config has model_type "gemma3"
3. The model's module has create_causal_mask_mapping (transformers 5.x)
4. The model is in training mode

On transformers 4.x, create_causal_mask_mapping does not exist so this
check is inert.
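
The four conditions can be sketched roughly as follows. This is an illustration only: `maybe_inject_token_type_ids` and the stand-in modules are hypothetical, and nested lists replace the tensors the real fallback would use.

```python
from types import SimpleNamespace


def maybe_inject_token_type_ids(inputs, model_type, model_module, training):
    """Return inputs, adding all-zero token_type_ids when all four conditions hold."""
    should_inject = (
        "token_type_ids" not in inputs                           # 1. not already provided
        and model_type == "gemma3"                               # 2. inner config is gemma3
        and hasattr(model_module, "create_causal_mask_mapping")  # 3. transformers 5.x only
        and training                                             # 4. model is in training mode
    )
    if should_inject:
        # Plain nested lists stand in for tensors in this sketch.
        zeros = [[0] * len(row) for row in inputs["input_ids"]]
        inputs = {**inputs, "token_type_ids": zeros}
    return inputs


# Hypothetical stand-ins for the modeling module on each transformers version.
module_v5 = SimpleNamespace(create_causal_mask_mapping=lambda *a, **k: None)
module_v4 = SimpleNamespace()
batch = {"input_ids": [[5, 6, 7], [8, 9, 10]]}
```

With `module_v4` the attribute check fails, so the batch passes through unchanged, which is the "inert on 4.x" behaviour noted above.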

Depends on: unslothai/unsloth-zoo#488

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-10 05:14:36 -08:00
andrewor14
cd24ea0e50 FP8: Load model on-the-fly in vLLM (#3717)
* FP8: Load model on-the-fly in vLLM

**Summary:** Existing support for `load_in_fp8=True` performs
an offline quantization when loading the initial model.
This is no longer necessary as of vllm==0.12.0 (after
https://github.com/vllm-project/vllm/pull/23014), where we
can quantize the model on-the-fly when we load it:

```
llm = LLM(
  ...
  hf_overrides={
    "quantization_config_dict_str": json.dumps(torchao_config),
  },
)
```

**Note:** Needs https://github.com/unslothai/unsloth-zoo/pull/380

**Test Plan:**
https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix on-the-fly FP8: always check mapper first, fallback to on-the-fly

The original implementation bypasses the FP8 mapper entirely for
vllm >= 0.12.0, meaning models like Llama-3.2-1B-Instruct and Qwen3-8B
that have pre-quantized FP8-Block/FP8 checkpoints would never use them.

This fixes the priority order:
1. Mapper has a pre-quantized model -> use it (always)
2. Mapper has no match + vllm >= 0.12.0 -> on-the-fly FP8 via torchao
3. Mapper has no match + vllm < 0.12.0 -> offline quantization
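
The priority order above can be sketched as a small decision function. `choose_fp8_path` and its return labels are hypothetical names used only for illustration; the real logic lives in loader_utils.py and may be structured differently.

```python
def choose_fp8_path(mapper_match, vllm_version):
    """Return which FP8 loading path applies.

    mapper_match: name of a pre-quantized checkpoint, or None if the mapper has no match.
    vllm_version: version as a tuple of ints, e.g. (0, 12, 0).
    """
    if mapper_match is not None:
        return "prequantized"   # 1. mapper hit always wins
    if vllm_version >= (0, 12, 0):
        return "on_the_fly"     # 2. no match, new vllm: quantize while loading
    return "offline"            # 3. no match, old vllm: offline quantization
```

Checking the mapper first is what keeps existing pre-quantized checkpoints in use on newer vllm versions.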

Changes:
- loader_utils.py: Move vllm >= 0.12.0 check after mapper lookups
- loader.py: Set load_in_fp8=False when mapper resolves to a
  pre-quantized model to prevent double quantization

Tested on B200 with Llama-3.2-1B-Instruct and Qwen3-8B. Corrected code
produces results matching baseline (pre-quantized path preserved).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-10 05:10:13 -08:00
Datta Nimmaturi
3df65308f3 [Misc] Fixes (#4015)
* convert print to logger

* Print but cleaner

* Hide model on multiple devices

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix typo transfomers -> transformers, revert MoE message change

* Update MoE detection message to show num_experts and target_modules

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-10 02:08:55 -08:00
Roland Tannous
fe5a7d11b6 add llama.cpp prefix to gguf conversion help messages (#4016) 2026-02-10 01:59:05 -08:00
Fizza Mukhtar
a353fad514 Fix #3397: Prevent trainer tokenization hang with safe num_proc (#4013)
* Fix #3397: Prevent trainer tokenization hang with safe num_proc

* Fix #3397: Add missing import sys for Windows-safe tokenization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Consolidate with existing num_proc guard in dataset_utils.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-10 01:53:46 -08:00
Daniel Han
acfe670357 Fix EmbeddingGemma float16 NaN via FORCE_FLOAT32 for gemma3_text (#4014)
* Fix EmbeddingGemma float16 NaN by adding gemma3_text to FORCE_FLOAT32 and SDPA lists

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-10 01:40:13 -08:00
Daniel Han
a2f4f04ea5 Inject model reference for dynamic token_type_ids detection in SFTTrainer (#4012)
* Inject model reference for dynamic token_type_ids detection in SFTTrainer

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-10 00:37:07 -08:00
Daniel Han
a35e866625 Suppress vLLM v1 executor sleep/wake log messages (#4011)
* Suppress vLLM v1 executor sleep/wake log messages

Add HideLoggingMessage filters for vllm.v1.executor.abstract logger to
suppress repetitive sleep/wake INFO and WARNING messages that spam training
output when UNSLOTH_VLLM_STANDBY is enabled. The existing filter at line 275
handles the legacy vllm.executor.executor_base path; this adds coverage for
the v1 engine path used by vllm 0.11+.
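
For readers unfamiliar with the pattern, a message-hiding filter can be sketched with the standard `logging` module. `HideMessages` below is a hypothetical stand-in, not the repository's `HideLoggingMessage`, and the substring shown is a placeholder rather than vLLM's actual log text.

```python
import logging


class HideMessages(logging.Filter):
    """Drop any log record whose message contains one of the given substrings."""

    def __init__(self, *substrings):
        super().__init__()
        self.substrings = substrings

    def filter(self, record):
        message = record.getMessage()
        # Returning False tells the logging machinery to discard the record.
        return not any(s in message for s in self.substrings)


# Attach to the logger named in the commit message; the substring is a placeholder.
logging.getLogger("vllm.v1.executor.abstract").addFilter(HideMessages("going to sleep"))
```

Filters attached to a logger apply only to records created on that logger, so other vLLM output is unaffected.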

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 23:51:58 -08:00
pre-commit-ci[bot]
293b431e77 [pre-commit.ci] pre-commit autoupdate (#4009)
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.14.14 → v0.15.0](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.14...v0.15.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 17:32:18 -08:00
Daniel Han
4f5de9ba93 Silence peft target_parameters RuntimeWarning for MoE models (#4008)
* Silence peft target_parameters RuntimeWarning for MoE models

Wrap _get_peft_model calls with warnings.catch_warnings() to suppress
the "target_parameters were set but no parameter was matched" warning.
This fires on MoE models where expert layers use nn.Parameter naming
that peft warns about but handles correctly.
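
A minimal sketch of the suppression pattern, assuming the warning is raised as a `RuntimeWarning` whose text starts with the quoted message. `call_quietly` is a hypothetical wrapper, not the function used in the patch.

```python
import warnings


def call_quietly(fn, *args, message_regex, **kwargs):
    """Call fn while ignoring RuntimeWarnings whose message starts with message_regex."""
    with warnings.catch_warnings():
        # The filter is scoped to this block; previous filters are restored on exit.
        warnings.filterwarnings(
            "ignore", message=message_regex, category=RuntimeWarning
        )
        return fn(*args, **kwargs)
```

Because `catch_warnings()` restores the previous filter state on exit, unrelated warnings raised elsewhere remain visible.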

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 08:25:40 -08:00
Daniel Han
4924a5f6aa Silence TRL's batch_size=1 padding-free warning in compiled trainer source (#4007)
Strip the "anihilate"/"annihilate" warning block from compiled trainer
source so it does not fire when Unsloth auto-enables padding-free mode
with batch size 1 (the common single-GPU case).

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-09 07:55:29 -08:00
Daniel Han
f3f3c9dfb9 Fix dtype mismatch in fp16 + 4-bit/8-bit LoRA training (#4005)
* Fix dtype mismatch in fp16 + 4-bit/8-bit LoRA training

Two fixes for training with dtype=torch.float16 and load_in_4bit=True:

1. fast_lora.py: fast_dequantize() returns tensors in quant_state.dtype
   (typically bfloat16 or float32), but activations may be float16. The
   subsequent matmul/addmm operations require matching dtypes. Add dtype
   casts after each fast_dequantize() call in LoRA_MLP.backward and
   LoRA_QKV.backward (5 locations total).

2. rl.py: TRL unconditionally casts trainable parameters to bfloat16 in
   the peft init block. When training with fp16=True, this causes
   GradScaler to crash since it requires float32 parameters. Make the
   cast conditional -- use float32 when fp16 is enabled, bfloat16
   otherwise. This is a no-op for GRPOTrainer (whose peft init block is
   already removed by the existing regex), but fixes SFTTrainer and
   other TRL trainers.

Tested with Llama-3.2-1B-Instruct 4-bit on both fp16 and bf16 training.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix fp16 + 4-bit LoRA: thread correct_dtype through post_patch

Root cause: fast_dequantize returns tensors in quant_state.dtype, which
for pre-quantized models is bfloat16 (from config.json). The post_patch
methods in llama/gemma/gemma2 call patch_model_and_tokenizer without
passing correct_dtype, so quant_state.dtype is never overridden to match
the user's requested dtype. This causes a dtype mismatch crash in the
backward pass when training with dtype=torch.float16.

Fix: pass the user's dtype from from_pretrained through post_patch to
patch_model_and_tokenizer as correct_dtype, matching the pattern already
used by vision.py.

Revert the 5 symptom-level dtype casts in fast_lora.py (upW, gateW, QW,
KW, VW) since they are no longer needed with quant_state.dtype properly
set at the source.

Tested: fp16+4bit and bf16+4bit Llama-3.2-1B-Instruct 15-step SFT runs
both complete successfully with similar losses (~1.558 vs ~1.563).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove TRL's unconditional bfloat16 cast instead of patching the dtype

TRL 0.26.0+ hardcodes `param.data.to(torch.bfloat16)` for all trainable
params in quantized models, citing the QLoRA paper recommendation. This
is wrong: it ignores the user's requested dtype and breaks GradScaler
when fp16=True. The block exists in sft_trainer, grpo_trainer,
rloo_trainer, and reward_trainer (not dpo_trainer).

Previous fix patched the cast to be dtype-conditional. This commit
replaces the entire guard `if getattr(model, "is_loaded_in_4bit", ...)
or getattr(model, "is_loaded_in_8bit", ...):` with `if False:` to
disable the block entirely. Unsloth already handles adapter dtype via
patch_model_and_tokenizer, making TRL's cast both unnecessary and
harmful.

For GRPOTrainer the enclosing peft init block is already removed by
the regex above, making this a no-op for GRPO.

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 07:39:26 -08:00
Daniel Han
0a04b1b22c Fix trl.experimental thin wrapper compilation and OOM from peft_config overwrite (#4006)
* Fix trainer compilation failures from trl.experimental thin wrappers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix OOM from prepare_model_for_kbit_training overwriting peft_config patching

---------

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 07:04:55 -08:00
Daniel Han
14fe579629 Fix VLM model + text-only dataset ValueError in TRL 0.22.x (#4004)
TRL 0.22.x checks _is_vlm (model type) instead of _is_vision_dataset
(dataset content, added in 0.25.1+) in _set_signature_columns_if_needed.
When _is_vlm=True (e.g. Gemma3), signature columns are set to vision-only
["messages","prompt","completion","images"], which has zero overlap with
tokenized text columns [input_ids, labels, attention_mask, ...], causing
a ValueError.

Fix: expand the VLM branch signature columns to include both vision and
text column names. Extra columns not present in the dataset are harmlessly
ignored by _remove_unused_columns (it only raises when zero columns match).
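
The failure mode and the fix can be illustrated with a simplified model of the column check. `kept_columns` is a hypothetical stand-in for the behaviour described above, not TRL's or transformers' actual code, and the text column list is shortened.

```python
VISION_COLUMNS = ["messages", "prompt", "completion", "images"]
TEXT_COLUMNS = ["input_ids", "labels", "attention_mask"]


def kept_columns(dataset_columns, signature_columns):
    """Keep dataset columns named in the signature; fail only when none match."""
    kept = [c for c in dataset_columns if c in signature_columns]
    if not kept:
        raise ValueError("No dataset column matches the signature columns")
    return kept


tokenized = ["input_ids", "labels", "attention_mask"]
# Before the fix: kept_columns(tokenized, VISION_COLUMNS) raises ValueError.
# After the fix: the expanded list overlaps with the tokenized text columns.
expanded = VISION_COLUMNS + TEXT_COLUMNS
```

Signature entries absent from the dataset (here the vision names) simply never match, which is why the expansion is harmless.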

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-09 06:24:58 -08:00
Daniel Han
ba7366be53 Fix notebook compatibility for transformers 4.57.6 and TRL 0.22-0.27 (#3998)
* Patch before compile?

* Fix notebook compatibility for transformers 4.57.6 and TRL 0.22-0.27

Fixes several notebook failures discovered during testing all 125
notebooks with transformers==4.57.6 + TRL 0.22.2 and TRL 0.27.1.

Warning suppression (import_fixes.py):
- Suppress torch 2.9+ pin_memory/is_pinned device deprecation warnings
- Suppress cuda.cudart/cuda.nvrtc module deprecation FutureWarning
- Filter vllm "Level is deprecated" stderr noise
- Filter PydanticSerializationUnexpectedValue warnings
- Filter Triton "df: No such file" stderr noise

VLM tokenizer loading (vision.py):
- Add _construct_vlm_processor_fallback() for models where
  AutoProcessor.from_pretrained fails (e.g., ERNIE 4.5 VL, LFM2.5-VL)
- Wrap processor loading in try/except with fallback to manual
  construction from separate image_processor + tokenizer components
- Add fallback to AutoTokenizer/PreTrainedTokenizerFast when tokenizer
  loading or patching fails

TRL 0.27.1 trainer compatibility (trainer.py):
- Add _resolve_trainer_params() to handle thin wrapper trainers that
  only have def __init__(self, *args, **kwargs) (e.g., ORPOTrainer
  in TRL 0.27.1) by walking MRO for real parameter signature

VLM _is_vlm detection (rl.py):
- Replace blanket _is_vlm=False override with model-architecture-based
  detection that checks vision_config or ForConditionalGeneration class
  name, fixing VLM training when bare tokenizer is passed as
  processing_class

ModernBERT SDPA compatibility (loader.py, sentence_transformer.py):
- Add "modernbert" to DISABLE_SDPA_MODEL_NAMES to avoid stride
  alignment issues with torch.compile backward pass
- Add DISABLE_SDPA check for sentence transformer models

Other fixes (_utils.py):
- Suppress false uninitialized weight warnings for VLM
  multi_modal_projector.layer_norm

Tested: 92/125 notebooks pass with TRL 0.22.2, 94/125 with TRL 0.27.1.
Remaining failures are infra (missing FFmpeg, network timeouts, GPU
arch) not code bugs.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix KTO shape mismatch on TRL 0.27.2+ and truncation alignment

- Patch KTO get_batch_logps to auto-align logits and labels when Unsloth
  model forward truncates input_ids beyond max_seq_length. TRL 0.27.2
  changed _process_tokens to only truncate completions (not prompts), so
  sequences with long prompts exceed max_seq_length and trigger model-side
  truncation. The original ValueError is replaced with min-length alignment.

- Also truncate attention_mask in LlamaModel forward when input_ids are
  truncated to max_seq_length, preventing shape mismatches in attention.

- Widen except clause in rl_replacements.py openenv import from
  `except ImportError` to `except (ImportError, NameError, Exception)` to
  handle vllm SamplingParams NameError in TRL 0.27.2.

* Fix TRL 0.26+ thin wrapper resolution, enable ModernBERT SDPA, clean up warning filters

TRL 0.26+ thin wrapper resolution (rl.py):
- Filter _-prefixed private imports when discovering Trainer/Config classes
- Look up Config in separate *_config.py module when not found in trainer module
- Detect thin wrappers (<1000 chars source) and resolve to experimental parent
  via MRO walk; use resolved module for imports and create_new_function
- Enables all 15 trainers to patch successfully (was 5/15 before)

ModernBERT SDPA (loader.py):
- Remove "modernbert" from DISABLE_SDPA_MODEL_NAMES
- SDPA works correctly for both classification and sentence transformers
- Verified: 88.9% accuracy on emotion classification, correct domain-specific
  embeddings after sentence transformer fine-tuning

Warning filter cleanup (import_fixes.py):
- Remove cuda.cudart/cuda.nvrtc FutureWarning filters (no such warnings
  exist in torch 2.9.1+; proactive suppression is unnecessary)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove multi_modal_projector.layer_norm from uninitialized weight guard

The LFM2.5-VL projector LayerNorm is properly initialized by
transformers and does not need to be excluded from the uninitialized
weight check. The original exclusion was added as a workaround but is
no longer needed after the upstream fix.

* Add transformers 5.0 compat: rope_theta helper, config-as-dim detection, BatchEncoding guard, try/except for TRL trainer source, push_to_hub_token compiler fix

- llama.py: Add _get_rope_theta() helper handling both config.rope_theta and rope_parameters dict
- llama.py: Handle BatchEncoding in unsloth_fast_generate (transformers 5.0+ returns BatchEncoding from apply_chat_template)
- gemma.py: Detect config passed as dim arg in GemmaFixedRotaryEmbedding
- tokenizer_utils.py: Add try/except for TRL trainer getsource in patch_sft_trainer_tokenizer
- rl_replacements.py: Add compiler fix replacing bare pop("push_to_hub_token") with pop(..., None)
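
As an illustration of the first item, a rope_theta lookup that tolerates both config layouts might look like the sketch below. The `"rope_theta"` key inside `rope_parameters` and the `10000.0` default are assumptions, and `get_rope_theta` is not the repository's `_get_rope_theta()`.

```python
from types import SimpleNamespace


def get_rope_theta(config, default=10000.0):
    """Read rope_theta from either a flat attribute or a rope_parameters dict."""
    theta = getattr(config, "rope_theta", None)
    if theta is not None:
        return theta  # flat attribute layout
    params = getattr(config, "rope_parameters", None)
    if isinstance(params, dict) and "rope_theta" in params:
        return params["rope_theta"]  # dict layout (key name is an assumption)
    return default  # assumed fallback value


# Hypothetical configs showing the two layouts.
old_style = SimpleNamespace(rope_theta=500000.0)
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1000000.0})
```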

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use trl.experimental string check instead of char-count heuristic for thin wrapper detection

The <1000 / >1000 char threshold was fragile -- XPOConfig's parent is only
994 chars and would be skipped. All thin wrappers in TRL 0.26+ contain
"trl.experimental" in their deprecation warning, while no real trainer or
config class does, making it a reliable detection marker.
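
A rough sketch of the marker check combined with an MRO walk. `resolve_real_class` is a hypothetical helper; the source lookup is passed in as `get_source` (in practice something like `inspect.getsource`) so the example stays self-contained, and the sample sources are invented.

```python
def resolve_real_class(cls, get_source, marker="trl.experimental"):
    """Walk the MRO and return the first class whose source lacks the marker."""
    for candidate in cls.__mro__:
        if candidate is object:
            continue
        if marker not in get_source(candidate):
            return candidate  # first class that is not a thin wrapper
    return cls  # nothing better found; fall back to the original class


class RealTrainer:
    pass


class ThinTrainer(RealTrainer):
    pass


# Invented source snippets keyed by class, standing in for inspect.getsource.
sources = {
    ThinTrainer: 'warnings.warn("This trainer moved to trl.experimental")',
    RealTrainer: "def __init__(self, model, args): ...",
}
```

Unlike a character-count threshold, the marker check does not depend on how long the parent class happens to be.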

* Move DISABLE_SDPA_MODEL_NAMES import to module level in sentence_transformer

The function-level import was redundant since loader.py is already imported
at module level. Move it to the existing loader import line.

---------

Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 05:11:50 -08:00
siddhu donda
884ce4601f fix: add inputs_embeds support in _fast_prepare_inputs_for_generation (#3798) (#3814)
Add `inputs_embeds` parameter to `_fast_prepare_inputs_for_generation` so
`model.generate(inputs_embeds=...)` works with Unsloth-patched models.

Changes:
- Add `inputs_embeds=None` to function signature (fixes HF inspect check)
- Track `use_inputs_embeds` flag: True when inputs_embeds provided and no cache
- Conditionally return inputs_embeds on first step, input_ids on subsequent steps
- Handle input_ids being None/empty for batch size and device extraction
- Add attention_mask None-guard before slicing

Fixes: https://github.com/unslothai/unsloth/issues/3798

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: siddhudonda <siddhudonda@users.noreply.github.com>
2026-02-09 04:59:43 -08:00
Daniel Han
3b1e8d0ae6 Update README.md 2026-02-09 04:50:54 -08:00
Daniel Han
60dd7269a5 Fix broken documentation links, typos, and formatting in README (#4003)
- Fix 14 broken documentation links (all returning 404) caused by docs
  site restructuring (install-and-update -> install, pages moved to
  /docs/blog/ and /docs/models/tutorials/)
- Fix "Qwen2.3-VL" -> "Qwen3-VL" (no "Qwen2.3-VL" model exists)
- Fix incorrect "GSPO" label on gpt-oss GRPO notebook
- Fix "4b-bit" typo -> "4-bit"
- Fix "sodoku" typo -> "sudoku"
- Fix double dash formatting on FP8 GRPO notebook list item
- Fix citation URL from http:// to https://
- Update "MultiGPU coming soon" to "is now supported"
- Fix Windows installation step numbering (1,3,5,6,7 -> 1,2,3,4,5)
- Fix Advanced/Troubleshooting step numbering (5,6,5 -> 4,5,6)

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-09 04:46:46 -08:00
Fizza Mukhtar
c98312f229 Fix multi-GPU loading for quantized models in distributed training (#3917)
When using torchrun with quantized models (4bit/8bit/fp8), each rank
must load the model directly onto its own GPU. The default device_map
("sequential") places everything on GPU 0, causing illegal memory
access errors when Accelerate tries to relocate quantized weights.

Use the existing prepare_device_map() utility from loader_utils to
detect distributed training via LOCAL_RANK/WORLD_SIZE env vars and
override device_map to target each rank's local GPU. This is applied
in both FastLanguageModel.from_pretrained and FastModel.from_pretrained,
covering text, vision, and audio model paths.
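
A minimal sketch of the rank-aware override, assuming the `{"": index}` form of `device_map` (which places the whole model on one device). `device_map_for_rank` is a hypothetical helper; the real `prepare_device_map()` may differ in detail.

```python
import os


def device_map_for_rank(default="sequential", env=None):
    """Return a per-rank device_map under distributed launch, else the default."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size > 1 and "LOCAL_RANK" in env:
        # Load the entire model directly onto this rank's local GPU.
        return {"": int(env["LOCAL_RANK"])}
    return default
```

Under `torchrun`, each process sees its own `LOCAL_RANK`, so every rank ends up with a different target device.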

Fixes #3914

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-09 04:26:21 -08:00
Mohammad Miadh Angkad
336bec216a Refactor Ollama template wiring and harden packing helpers (#3890)
* Refactor Ollama template wiring and harden packing helpers

Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>

* Fix Qwen3 and Gemma3n template bindings and tidy packing test helper

* Fix gptoss Ollama comment and tinyllama stop parameter

- Fix wrong comment referencing gemma3n for gptoss_ollama in chat_templates.py
- Add missing stop keyword to tinyllama PARAMETER in ollama_template_mappers.py

* Fix _DummyTrainer compatibility across TRL versions

The try/except only handled the removal of return_position_ids
(TRL v0.24+) but not the absence of padding_free (TRL v0.18.2).
Gracefully degrade through all optional collator flags so the
test works from trl>=0.18.2 through v0.27+.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 04:04:48 -08:00
RektPunk
f868d8b073 [Feature] separate gguf file path (#3934)
* separate gguf

* fix Modelfile log

* ollama Modelfile create

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix GGUF file placement: move initial conversion to _gguf dir, fix cleanup

- Move initial GGUF files (from convert_to_gguf) into {model_directory}_gguf/
  immediately after conversion, so all GGUF outputs live in the dedicated
  directory regardless of quantization method (fixes bf16-only case where
  quant == first_conversion skipped the loop and _gguf dir was never created)
- Remove redundant gguf_directory/makedirs from inside the re-quant loop
  since the directory is now created before the loop
- Use Path.unlink(missing_ok=True) for base GGUF cleanup robustness
- Unify Modelfile location to {save_directory}_gguf/Modelfile for both
  VLM and non-VLM models
- Fix print message to show actual modelfile_location path
- Add gguf_directory key to return dict
- Clean up {save_directory}_gguf in push_to_hub_gguf error/finally blocks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-09 04:00:14 -08:00
Etherll
315178b5c3 Add push_to_hub_gguf support for FastSentenceTransformer (#4002)
* Implement GGUF upload method for SentenceTransformer

Added a method to convert and upload SentenceTransformer models to GGUF format, including handling of tokenizer, quantization methods, and repository management on Hugging Face Hub.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-09 00:51:26 -08:00
Daniel Han
b47b081f99 Fix triton 3.6.0 + torch 2.9.x torch.compile crash (missing cluster_dims) (#4001)
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-08 20:18:25 -08:00
Daniel Han
c43a5b8f02 Fix multiprocessing crash on Windows/macOS and unify num_proc logic (#3999)
On Windows and macOS (Python 3.8+), multiprocessing uses the spawn
start method. When datasets .map(num_proc=N) is called, it creates a
Pool(N) which re-imports __main__ in each worker, causing infinite
recursion and a RuntimeError during bootstrapping.

Guard the auto-computed dataset_num_proc in the generated Config
__init__ by checking multiprocessing.get_start_method() != 'fork'.
When the start method is not fork (spawn/forkserver), force
dataset_num_proc = None so datasets takes the single-process path.
Linux fork behavior is unchanged.

Also replace the fixed memory threshold logic with the simpler
adaptive approach: cap at 64, then min(num_proc, int(available_gb)),
with a safety floor of 1 when available memory is at or below 2GB.
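
The guard and the adaptive cap can be sketched together as follows. `safe_num_proc` is a hypothetical helper whose parameters are illustrative; the generated Config `__init__` may order these steps differently.

```python
import multiprocessing


def safe_num_proc(cpu_count, available_gb, start_method=None):
    """Pick a dataset num_proc, or None to force the single-process path."""
    if start_method is None:
        start_method = multiprocessing.get_start_method()
    if start_method != "fork":
        # spawn/forkserver re-import __main__ in workers, so avoid a Pool entirely.
        return None
    if available_gb <= 2:
        return 1  # safety floor when memory is very tight
    num_proc = min(cpu_count, 64)  # cap at 64
    return max(1, min(num_proc, int(available_gb)))  # roughly one worker per GB
```

Returning `None` rather than `1` matters because it lets `datasets` skip multiprocessing altogether.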

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-08 02:50:06 -08:00
pluesclues
c6de138e62 Update rl_replacements.py (#3990) 2026-02-05 08:22:42 -08:00
Daniel Han
3a4b1e7fc5 Disable torchcodec in transformers when FFmpeg is missing (#3989)
* Disable torchcodec in transformers when FFmpeg is missing

When torchcodec is installed but FFmpeg libraries are unavailable,
transformers still thinks torchcodec is available (via find_spec check)
and tries to use it for audio loading, causing RuntimeError.

This adds disable_torchcodec_if_broken() which tests if torchcodec can
actually load its native libraries, and if not, patches transformers'
_torchcodec_available to False so it falls back to librosa instead.
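
The probing idea can be sketched in a generic form: try the real import and treat load-time failures as "unavailable". `native_import_works` is a hypothetical helper, and the exception list is an assumption about which errors a broken native library might raise.

```python
import importlib


def native_import_works(module_name):
    """Return True only if the module actually imports, not merely if it is installed."""
    try:
        importlib.import_module(module_name)
    except (ImportError, RuntimeError, OSError):
        # Installed-but-broken packages can fail here even though find_spec succeeds.
        return False
    return True
```

A `find_spec`-style check only confirms the package is present on disk, which is why it misses the missing-FFmpeg case.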

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-05 06:54:09 -08:00
Daniel Han
145f6aaeb1 Fix cutlass inductor options for PyTorch < 2.8.0 (#3988)
The cuda.cutlass_epilogue_fusion_enabled and cuda.cutlass_tma_only
inductor config options were added in PyTorch 2.8.0. Using these
options on older PyTorch versions causes a RuntimeError during
GRPOTrainer initialization.

This fix adds a version check to only include these options when
running PyTorch 2.8.0 or later, allowing GRPO training to work on
older PyTorch versions (e.g., Colab environments with PyTorch 2.5-2.7).
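
A sketch of the version gate, assuming the version string has the usual `major.minor...` shape. `inductor_options` is a hypothetical helper, and the `True` values are placeholders rather than the settings Unsloth actually uses.

```python
def inductor_options(torch_version: str) -> dict:
    """Build inductor options, adding the cutlass keys only on PyTorch >= 2.8."""
    options = {}  # other options omitted in this sketch
    # Strip local build tags such as "+cu128" before parsing major/minor.
    major, minor = (int(p) for p in torch_version.split("+")[0].split(".")[:2])
    if (major, minor) >= (2, 8):
        options["cuda.cutlass_epilogue_fusion_enabled"] = True  # placeholder value
        options["cuda.cutlass_tma_only"] = True                 # placeholder value
    return options
```

Comparing integer tuples avoids the string-comparison trap where "2.10" would sort before "2.8".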

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-05 06:40:11 -08:00
Daniel Han
7b42acae94 Fix RuntimeError not caught when torchcodec fails to load (#3987)
When datasets library has torchcodec installed but FFmpeg libraries
are missing, torchcodec raises a RuntimeError during import. The
exception handler only caught ImportError and AttributeError, causing
the error to propagate and crash Unsloth imports in environments
like Colab where FFmpeg may not be installed.

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-02-05 06:35:10 -08:00
Daniel Han
ce256c43bc Merge branch 'main' of https://github.com/unslothai/unsloth 2026-02-05 06:10:06 -08:00
Daniel Han
f463f692d6 MoE release 2026-02-05 06:09:56 -08:00
Datta Nimmaturi
fad6957555 [MoE] Improve moe kernels for unsloth fine tuning (#3812)
* Improve MoE performance

* small changes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix imports

* disable autotune

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* LoRA for MoE

* Make autotune default

* make dy contiguous

* use non lora model as base for RL

* Revert "use non lora model as base for RL"

This reverts commit bc8f15629d060593b2eaf436f158ff5ac9df0d5d.

* fixup derp

* non TMA [T4]

* Revert "non TMA [T4]"

This reverts commit 35304566690e7c9ab9632899920c85bff322409a.

* Fixes for VL MoE and v5 transformers

* [transformers] [v5] remove unused hybridcache (#3910)

* remove unused hybridcache

* cleanup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* No double compile for qwen3moe

* Fix top_k on trl GRPO

* Recognise GLM as MoE

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix missing RotaryEmbeddingConfigMixin

* Licensing for autotuning cache

* Cleanup

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Erland366 <erland.pg366@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-02-05 06:03:25 -08:00
Daniel Han
2883ce4091 Update _utils.py 2026-02-05 05:58:00 -08:00
Daniel Han
ff3f78b6b9 Add PyTorch 2.10 and xformers 0.0.34 support (#3985)
- Add cu126/cu128/cu130 xformers 0.0.34 wheel dependencies for torch 2.10
- Add cu126-torch2100, cu128-torch2100, cu130-torch2100 meta-dependencies
- Add cu126-ampere-torch2100, cu128-ampere-torch2100, cu130-ampere-torch2100 variants
- Update _auto_install.py version detection for torch 2.10.x
- Add CUDA check for torch 2.10 (requires CUDA 12.6, 12.8, or 13.0)
- Update README.md with torch 2.10 installation instructions

Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
2026-02-05 05:56:26 -08:00
Daniel Han
7ceebe4554 Silence non-actionable TRL trainer import failures (#3980)
_patch_trl_rl_trainers enumerates all trainer modules from dir(trl.trainer)
and attempts to import each one. Modules like alignprop_trainer fail because
they depend on optional packages (diffusers) that may not be installed. The
failure is harmless but the print() call produces noise on every import.

Change print() to logger.info() so these messages only appear when
UNSLOTH_ENABLE_LOGGING=1.
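
A minimal sketch of the print-to-logger change described above. The logger name and the helper `try_import_all` are hypothetical and are not taken from the codebase; only the `UNSLOTH_ENABLE_LOGGING` variable comes from the message.

```python
import importlib
import logging
import os

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("unsloth_sketch")

# INFO messages are only surfaced when the user opts in.
if os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1":
    logging.basicConfig(level=logging.INFO)


def try_import_all(module_names):
    """Import each module; log (rather than print) the ones that fail."""
    imported, failed = [], []
    for name in module_names:
        try:
            importlib.import_module(name)
            imported.append(name)
        except Exception as error:
            # Harmless failure (e.g. optional dependency missing): stay quiet by default.
            logger.info("Could not import %s: %s", name, error)
            failed.append(name)
    return imported, failed
```

With the default logging configuration, the `logger.info` call produces no output, so a missing optional dependency no longer adds noise on import.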

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-02-05 05:32:52 -08:00
Daniel Han
5798267401 Silence third-party deprecation warnings and fix socket leak (#3983)
* Silence third-party deprecation warnings and fix socket resource leak

- Add warning filters for TorchAO deprecated import paths
- Filter SWIG builtin type warnings from bitsandbytes/triton
- Filter Triton autotuner deprecation warnings
- Filter Python 3.12+ multiprocessing fork warnings
- Filter resource warnings for unclosed sockets/files
- Fix socket leak in has_internet() by properly closing socket
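
One way to avoid this kind of socket leak is to open the connection in a `with` block, so the socket is closed on every path. The sketch below is not the actual `has_internet()` implementation; the function name, default host, port, and timeout are assumptions.

```python
import socket


def has_internet_sketch(host="8.8.8.8", port=53, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds.

    The `with` block closes the socket even on early return, which is
    what prevents the ResourceWarning about unclosed sockets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```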

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-05 04:55:52 -08:00
Daniel Han
649865caca Fix GPT-OSS BlockMask error during inference (#3982)
GPT-OSS models use eager attention during inference because flex
attention returns incorrect results (likely due to left padding).
However, when _attn_implementation is set to "flex_attention",
transformers creates BlockMask objects which cause a TypeError
when passed to the eager attention path:

  TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'

This fix excludes GPT-OSS from using flex_attention, keeping it on
the eager path to avoid the BlockMask/Tensor type mismatch.
2026-02-05 04:28:46 -08:00
Daniel Han
6f3e52bbcf Prefer flex attention when available (#3979)
* Enable flex attention by default

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Avoid dropping flex attention when SDPA unsupported

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-05 03:19:04 -08:00
pluesclues
9b34982509 Trl 0.27.0 update (#3965)
* Update rl_replacements.py

* Update rl_replacements.py

* Update rl.py

* Update rl_replacements.py

* Update rl_replacements.py

* Update rl.py

* Update rl.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update rl_replacements.py

* Update rl.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update rl_replacements.py, remove chat template from codexes commits

* Update rl.py, got rid of gradient checkpointing code that did not work

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-04 23:01:16 -08:00
Daniel Han
e1c682e6d2 Fix torchvision compatibility check for source builds and future torch versions (#3978)
* Fix torchvision compatibility check for source builds and future torch versions

The torchvision version check raised a hard ImportError for custom/source-built
PyTorch installations (e.g. AMD ROCm from source with +git* suffixes), even when
the actual build was functional. This also silently skipped any torch version
not already in the hardcoded table, giving no warning at all for future releases.

Changes:
- Detect custom/source builds by checking the raw version string's local
  identifier against known standard prefixes (cu, rocm, cpu, xpu). Our custom
  Version() strips local identifiers via regex, so detection must happen on the
  raw string before parsing.
- Downgrade to a warning (instead of ImportError) for custom/source builds,
  since their version numbers may not follow standard PyPI release pairings.
- Add formula-based inference for future torch versions not yet in the table.
  The torch->torchvision minor version formula (torch 2.x -> tv 0.(x+15)) has
  held for every release from torch 2.0 through 2.9. For formula-predicted
  versions, mismatches produce a warning rather than a hard error.
- Add UNSLOTH_SKIP_TORCHVISION_CHECK=1 env var to skip the check entirely.
- Wrap importlib_version and Version calls in try/except so broken metadata
  never crashes the import.
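
The formula-based inference can be sketched as follows. The function name and the simple string parsing are illustrative assumptions; only the torch 2.x -> torchvision 0.(x+15) rule comes from the message above.

```python
def predicted_torchvision(torch_version_raw):
    """Predict the (major, minor) torchvision version for a torch 2.x version.

    Uses the torch 2.x -> torchvision 0.(x + 15) pairing described above.
    """
    public = torch_version_raw.split("+", 1)[0]  # drop a local identifier such as +cu128
    major, minor = (int(part) for part in public.split(".")[:2])
    if major != 2:
        raise ValueError("the formula is only stated for torch 2.x")
    return (0, minor + 15)
```

For example, `predicted_torchvision("2.10.0+cu128")` gives `(0, 25)`; a mismatch against a predicted (rather than tabulated) version would then be reported as a warning, not an error.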

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: stricter regex, case insensitivity, pre-release detection

Fixes three edge cases found during review:

1. Regex precision: cu/rocm now require a trailing digit (cu\d, rocm\d) to
   avoid false negatives on suffixes like "+custom_build" that happen to
   start with "cu". cpu/xpu match as exact strings only.

2. Case insensitivity: added re.IGNORECASE so "+ROCM6.3" and "+CPU" are
   correctly recognized as standard builds rather than custom ones.

3. Pre-release detection: nightly/dev/alpha/beta/rc builds with standard
   CUDA/ROCm suffixes (e.g. "2.7.0.dev20250301+cu124") now produce a
   warning instead of a hard ImportError. These builds commonly have
   version mismatches that are expected during development.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address PR review comments: fullmatch, env var casing, torchvision pre-release

1. Switch re.match to re.fullmatch for the custom build regex so the
   entire local identifier must match. Fixes false negatives where
   suffixes like +cu124_custom were misclassified as standard because
   re.match only checked the start of the string.

2. Use .lower() for the UNSLOTH_SKIP_TORCHVISION_CHECK env var so
   any casing of "true" / "TRUE" / etc. is accepted.

3. Check torchvision_version_raw for pre-release tags in addition to
   torch_version_raw, so a stable torch paired with a nightly
   torchvision (e.g. 0.23.0.dev...) also gets a warning instead of
   a hard ImportError.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-04 04:50:26 -08:00
Daniel Han
4f75ec2fc8 Add vLLM + torch < 2.9.0 + SM100 compatibility check (#3973)
vLLM's distributed module (device_communicators) crashes with std::bad_alloc
when imported on SM100 GPUs (B200/B100/Blackwell) with torch < 2.9.0.

This adds an early check that runs before vLLM is imported, providing a
helpful error message instead of a cryptic C++ exception.

The check:
1. Detects if vLLM is installed
2. Checks if torch version is < 2.9.0
3. Checks if any GPU is SM100 (Blackwell)
4. If all conditions met, raises RuntimeError with clear upgrade instructions
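
The four steps above can be sketched as a pure function. The names below are hypothetical, and treating SM100 as compute capability (10, 0) is an assumption of this sketch; the real check may differ.

```python
def vllm_sm100_problem(vllm_installed, torch_version, gpu_capabilities):
    """Return True when all three conditions from the check are met.

    torch_version: tuple such as (2, 8, 0)
    gpu_capabilities: list of (major, minor) compute capabilities, one per GPU
    """
    old_torch = tuple(torch_version[:2]) < (2, 9)
    has_sm100 = any(tuple(cap) == (10, 0) for cap in gpu_capabilities)
    return bool(vllm_installed and old_torch and has_sm100)


def check_vllm_torch_sm100(vllm_installed, torch_version, gpu_capabilities):
    """Raise a readable error before vLLM is imported, instead of std::bad_alloc."""
    if vllm_sm100_problem(vllm_installed, torch_version, gpu_capabilities):
        raise RuntimeError(
            "vLLM with torch < 2.9.0 is not supported on SM100 GPUs; "
            "please upgrade torch to 2.9.0 or newer."
        )
```

In practice the inputs could come from `importlib.util.find_spec("vllm")`, `torch.__version__`, and `torch.cuda.get_device_capability(i)` for each device; they are passed in here so the logic can be exercised without a GPU.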
2026-02-03 03:10:24 -08:00
Daniel Han
d5f5b7d6a6 Add TRL truncation regression and metadata loss fixes (Fixes 1 and 3) (#3971)
* Add TRL truncation regression and metadata loss fixes

Fix 1: TRL 0.24.0-0.25.1 right-truncation regression
- These versions pass max_length=self.max_prompt_length and truncation=True
  to the tokenizer, which right-truncates prompts and strips the assistant
  turn suffix
- Use regex to remove these kwargs from the generated code
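
A sketch of the regex-based removal in Fix 1, applied to a made-up snippet. The `generated` source text and the function name are hypothetical; the actual generated TRL code and the exact patterns used may differ.

```python
import re

# Hypothetical excerpt of generated trainer code (not the real TRL source).
generated = '''inputs = tokenizer(
    text=prompts_text,
    padding=True,
    max_length=self.max_prompt_length,
    truncation=True,
    add_special_tokens=False,
)'''


def strip_truncation_kwargs(source):
    """Remove the max_length / truncation kwargs, one full line each."""
    source = re.sub(
        r"^[ \t]*max_length\s*=\s*self\.max_prompt_length\s*,[ \t]*\n",
        "",
        source,
        flags=re.MULTILINE,
    )
    source = re.sub(
        r"^[ \t]*truncation\s*=\s*True\s*,[ \t]*\n",
        "",
        source,
        flags=re.MULTILINE,
    )
    return source


cleaned = strip_truncation_kwargs(generated)
```

If neither pattern matches (for example on a TRL version that never passed these kwargs), `re.sub` returns the source unchanged, so the same patch can be applied across versions.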

Fix 3: Metadata loss for chat_template_kwargs
- TRL 0.24.0+ extracts prompts = [x["prompt"] for x in inputs], losing metadata
  like reasoning_effort
- Inject code to store per-sample chat_template_kwargs on self before extraction
- Preserve these kwargs in prompts_text generation for all TRL versions

Tested with TRL versions 0.22.2, 0.23.1, 0.24.0, 0.25.1, 0.26.2, and 0.27.1.

* Update Fix 1 comment with detailed TRL version behavior explanation

Expand the comment for the TRL 0.24.0-0.25.1 truncation regression fix
to clarify what each TRL version does:

- TRL 0.22.2-0.23.1: Uses truncate_with_protected_tokens() for smart
  truncation that preserves rightmost tokens and protects special tokens
- TRL 0.24.0-0.25.1: Removed smart truncation, passes kwargs directly
  to tokenizer (max_length, truncation=True, add_special_tokens=False)
- TRL 0.26.2+: Removed these kwargs entirely

The fix removes these problematic kwargs so 0.24.0-0.25.1 behaves like
0.26.2+ (no tokenizer-level truncation).

---------

Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
2026-02-03 03:00:12 -08:00
Daniel Han
8f44ae0eda Fix num_train_epochs=None causing TypeError in GRPOConfig (#3972)
When users pass `num_train_epochs=None` to GRPOConfig (relying on
max_steps to control training duration), Trainer.__init__ fails with:

  TypeError: '>' not supported between instances of 'NoneType' and 'int'

This happens because transformers.Trainer does `args.num_train_epochs > 0`
in its __init__ which fails when the value is None.

This fix converts None to 3.0 (the default) before Trainer initialization.
The actual training duration is still controlled by max_steps since it
takes precedence when both are set.

Example that now works:
```python
config = GRPOConfig(
    num_train_epochs=None,  # Previously caused TypeError
    max_steps=500,          # This controls actual duration
    ...
)
```
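
The normalization itself can be sketched in a few lines. The helper name is hypothetical; the default of 3.0 is the value stated above.

```python
DEFAULT_NUM_TRAIN_EPOCHS = 3.0  # default stated in the commit message


def normalize_num_train_epochs(num_train_epochs):
    """Replace None with the default so numeric comparisons cannot fail."""
    if num_train_epochs is None:
        return DEFAULT_NUM_TRAIN_EPOCHS
    return num_train_epochs


# The comparison described above fails on None:
try:
    None > 0
except TypeError as error:
    comparison_error = str(error)
```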
2026-02-03 02:48:40 -08:00
Daniel Han
41417693e4 Fix Vision GRPO string prompts and OpenEnv async compatibility (#3964)
* [fix] Vision GRPO string prompts and OpenEnv async compatibility

- Guard prepare_multimodal_messages in GRPO trainer to skip processing
  when prompts are pre-templated strings. Notebooks that pre-apply
  apply_chat_template() produce strings with image tokens already
  embedded; calling prepare_multimodal_messages on those crashes with
  TypeError.
- Apply nest_asyncio when OpenEnv EnvClient exposes async reset/step,
  so scripts using run_until_complete() wrappers work in all contexts.
- Add wrapper to call patch_torchcodec_audio_decoder() from unsloth_zoo
  for AudioDecoder dict-compatibility.
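
The string-prompt guard can be sketched as below. The helper name is hypothetical, and `prepare_fn` stands in for the real multimodal preparation step, whose signature is not reproduced here.

```python
def maybe_prepare_prompts(prompts, prepare_fn):
    """Skip preparation when prompts are already templated strings.

    prepare_fn is a stand-in for the message-preparation step; it is only
    called for prompts that are still in conversational (list) form.
    """
    if prompts and all(isinstance(prompt, str) for prompt in prompts):
        return prompts  # already templated, image tokens are embedded
    return [prepare_fn(prompt) for prompt in prompts]
```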

* Add apply_chat_template guard for pre-templated string prompts in Vision GRPO

When notebooks pre-apply apply_chat_template, prompts become strings.
The existing guard skips prepare_multimodal_messages for strings. This
adds a second guard to skip apply_chat_template in the forward_kwargs
block, using prompts directly as prompts_text instead. Covers both
TRL 0.25.x (no tools param) and TRL 0.26.2+ (with tools=self.tools).
Non-matching replacements silently pass for older TRL versions.

* Add TRL 0.25.1 single-line variant for apply_chat_template guard

TRL 0.25.1 uses single-line formatting for apply_chat_template:
  apply_chat_template({"prompt": prompt}, ...)["prompt"]

While TRL 0.26.2+ uses multi-line formatting:
  apply_chat_template(
      {"prompt": prompt}, ...
  )["prompt"]

Add both variants to ensure full backwards compatibility.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-03 02:03:46 -08:00
Daniel Han
949f1ce573 Fix TRL 0.27.0 GRPO compatibility and PEFT model handling (#3969)
* Fix TRL 0.27.0 GRPO compatibility and PEFT model handling

- Remove use_reentrant=False from gradient_checkpointing_kwargs for TRL 0.27.0+
  TRL 0.27.0 auto-sets use_reentrant=False in GRPOConfig.__post_init__, but
  Unsloth gradient checkpointing requires use_reentrant=True. This adds a
  post-init cleanup that removes the setting when present.

- Handle prepare_peft_model standalone function pattern for TRL 0.22.0+
  TRL changed from self._prepare_peft_model() method to prepare_peft_model()
  standalone function. Both patterns are now bypassed to let Unsloth handle
  PEFT model preparation.
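
A minimal sketch of the post-init cleanup from the first bullet, operating on a plain dict. The function name is hypothetical, and the real patch operates on the config object rather than on a standalone dict.

```python
def drop_use_reentrant_false(gradient_checkpointing_kwargs):
    """Return a copy of the kwargs without an auto-set use_reentrant=False."""
    if not gradient_checkpointing_kwargs:
        return gradient_checkpointing_kwargs
    cleaned = dict(gradient_checkpointing_kwargs)
    if cleaned.get("use_reentrant") is False:
        del cleaned["use_reentrant"]
    return cleaned
```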

Tested with TRL versions 0.22.2, 0.23.1, 0.24.0, 0.25.1, 0.26.2, and 0.27.1.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-03 01:56:31 -08:00
Kaitao Yang
7dd3ae8768 reduce code duplication (#3877)
* reduce code duplication

* address reviewer feedback: keep original function name

- Keep original function name `_offload_frozen_module_for_training`
- Make `offload_device` parameter Optional (can be None)
- Keep original error handling (return None for missing modules_to_save)
- Maintain code deduplication by reusing the helper function

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-02-03 00:27:49 -08:00
Daniel Han
8f0b57ae18 Use standard gradient checkpointing for small sequence lengths (#3867)
* Use standard gradient checkpointing for small sequence lengths

When max_seq_length < 512, the overhead of gradient offloading in
gc="unsloth" mode is not worth it. Benchmarks on B200 show:

| seq_len | gc=unsloth | gc=True    | Difference |
|---------|------------|------------|------------|
| 256     | 6,803 t/s  | 6,993 t/s  | +2.8%      |
| 384     | 9,889 t/s  | 9,963 t/s  | +0.7%      |
| 512     | 13,151 t/s | 13,092 t/s | -0.4%      |
| 1024    | 26,662 t/s | 25,094 t/s | -5.9%      |

The crossover point is around seq_len 384-512. For sequences shorter
than 512, we now automatically use standard gradient checkpointing
instead of the custom offloading implementation.

Additionally, when the user explicitly sets use_gradient_checkpointing to
True or False in get_peft_model, it now correctly overrides any
previous "unsloth" patching from from_pretrained. This ensures
consistent behavior regardless of the order of function calls.

Updated in three locations:
- FastLlamaModel.get_peft_model (llama.py)
- FastLanguageModel.from_pretrained (loader.py)
- FastModel.from_pretrained (loader.py)

* Refactor: extract gradient checkpointing heuristic into utility function

Addresses code review feedback to reduce duplication. The gradient
checkpointing heuristic logic was duplicated in 3 places:
- FastLlamaModel.get_peft_model (llama.py)
- FastLanguageModel.from_pretrained (loader.py)
- FastModel.from_pretrained (loader.py)

Created apply_unsloth_gradient_checkpointing() utility function in
_utils.py that handles:
- Heuristic: seq < 512 falls back to standard gc
- Explicit True/False overrides, which undo any previous patching
- Returns the effective use_gradient_checkpointing value
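
The heuristic can be sketched as a small pure function. This is a simplified illustration with a hypothetical name; it only computes the effective value and does not perform the patching or unpatching that apply_unsloth_gradient_checkpointing() is described as handling.

```python
def effective_gradient_checkpointing(use_gradient_checkpointing, max_seq_length):
    """Return the gradient checkpointing mode to use.

    "unsloth" falls back to standard checkpointing (True) for sequences
    shorter than 512; explicit True/False values are returned unchanged.
    """
    if use_gradient_checkpointing == "unsloth" and max_seq_length < 512:
        return True
    return use_gradient_checkpointing
```

For example, `effective_gradient_checkpointing("unsloth", 256)` returns `True`, while a sequence length of 512 or more keeps `"unsloth"`.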

Net reduction of ~6 lines while improving maintainability.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-02-02 23:57:09 -08:00