* convert print to logger
* Print but cleaner
* Hide model on multiple devices
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix typo transfomers -> transformers, revert MoE message change
* Update MoE detection message to show num_experts and target_modules
* Fix llama-cli path in save info message
* target_parameters warning for moe
* fix should_convert_module for llm_int8_skip_modules
* fix should_convert_module for llm_int8_skip_modules
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Logging filters
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* negation
* remove should_convert_module patch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix warmup_ratio deprecation warning for transformers >= 5.0
In transformers 5.0, warmup_ratio is deprecated in favor of
warmup_steps which now accepts float values (< 1 = ratio,
>= 1 = absolute steps).
The compiler now conditionally sets warmup_steps=0.1 on
transformers >= 5.0 (same semantics as warmup_ratio=0.1) and
keeps warmup_ratio=0.1 on older versions where warmup_steps
only accepts int.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Inject token_type_ids for Gemma3 multimodal training on transformers 5.x
In transformers 5.x, create_causal_mask_mapping() raises ValueError when
is_training=True and token_type_ids is None. When doing text-only SFT on
Gemma3 4B (a multimodal model), the dataset_utils detection for
_needs_token_type_ids can miss because:
- The model is wrapped in PeftModel, so type(model).__module__ points to
peft.peft_model instead of transformers
- The processing_class is a tokenizer (not Gemma3Processor), so the
fallback MRO check resolves to a module without create_causal_mask_mapping
This adds a fallback in _unsloth_pre_compute_loss that injects
token_type_ids=zeros when:
1. token_type_ids is not already in inputs
2. The inner model config has model_type "gemma3"
3. The model's module has create_causal_mask_mapping (transformers 5.x)
4. The model is in training mode
On transformers 4.x, create_causal_mask_mapping does not exist so this
check is inert.
Depends on: unslothai/unsloth-zoo#488
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* FP8: Load model on-the-fly in vLLM
**Summary:** Existing support for `load_in_fp8=True` performs
an offline quantization when loading the initial model.
This is no longer necessary as of vllm==0.12.0 (after
https://github.com/vllm-project/vllm/pull/23014), where we
can quantize the model on-the-fly when we load it:
```
llm = LLM(
...
hf_overrides={
"quantization_config_dict_str": json.dumps(torchao_config),
},
)
```
**Note:** Needs https://github.com/unslothai/unsloth-zoo/pull/380
**Test Plan:**
https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix on-the-fly FP8: always check mapper first, fallback to on-the-fly
The original implementation bypasses the FP8 mapper entirely for
vllm >= 0.12.0, meaning models like Llama-3.2-1B-Instruct and Qwen3-8B
that have pre-quantized FP8-Block/FP8 checkpoints would never use them.
This fixes the priority order:
1. Mapper has a pre-quantized model -> use it (always)
2. Mapper has no match + vllm >= 0.12.0 -> on-the-fly FP8 via torchao
3. Mapper has no match + vllm < 0.12.0 -> offline quantization
Changes:
- loader_utils.py: Move vllm >= 0.12.0 check after mapper lookups
- loader.py: Set load_in_fp8=False when mapper resolves to a
pre-quantized model to prevent double quantization
Tested on B200 with Llama-3.2-1B-Instruct and Qwen3-8B. Corrected code
produces results matching baseline (pre-quantized path preserved).
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* convert print to logger
* Print but cleaner
* Hide model on multiple devices
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix typo transfomers -> transformers, revert MoE message change
* Update MoE detection message to show num_experts and target_modules
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix#3397: Prevent trainer tokenization hang with safe num_proc
* Fix#3397: Add missing import sys for Windows-safe tokenization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Consolidate with existing num_proc guard in dataset_utils.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix EmbeddingGemma float16 NaN by adding gemma3_text to FORCE_FLOAT32 and SDPA lists
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Inject model reference for dynamic token_type_ids detection in SFTTrainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Suppress vLLM v1 executor sleep/wake log messages
Add HideLoggingMessage filters for vllm.v1.executor.abstract logger to
suppress repetitive sleep/wake INFO and WARNING messages that spam training
output when UNSLOTH_VLLM_STANDBY is enabled. The existing filter at line 275
handles the legacy vllm.executor.executor_base path; this adds coverage for
the v1 engine path used by vllm 0.11+.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Silence peft target_parameters RuntimeWarning for MoE models
Wrap _get_peft_model calls with warnings.catch_warnings() to suppress
the "target_parameters were set but no parameter was matched" warning.
This fires on MoE models where expert layers use nn.Parameter naming
that peft warns about but handles correctly.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Strip the "anihilate"/"annihilate" warning block from compiled trainer
source so it does not fire when Unsloth auto-enables padding-free mode
with batch size 1 (the common single-GPU case).
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix dtype mismatch in fp16 + 4-bit/8-bit LoRA training
Two fixes for training with dtype=torch.float16 and load_in_4bit=True:
1. fast_lora.py: fast_dequantize() returns tensors in quant_state.dtype
(typically bfloat16 or float32), but activations may be float16. The
subsequent matmul/addmm operations require matching dtypes. Add dtype
casts after each fast_dequantize() call in LoRA_MLP.backward and
LoRA_QKV.backward (5 locations total).
2. rl.py: TRL unconditionally casts trainable parameters to bfloat16 in
the peft init block. When training with fp16=True, this causes
GradScaler to crash since it requires float32 parameters. Make the
cast conditional -- use float32 when fp16 is enabled, bfloat16
otherwise. This is a no-op for GRPOTrainer (whose peft init block is
already removed by the existing regex), but fixes SFTTrainer and
other TRL trainers.
Tested with Llama-3.2-1B-Instruct 4-bit on both fp16 and bf16 training.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix fp16 + 4-bit LoRA: thread correct_dtype through post_patch
Root cause: fast_dequantize returns tensors in quant_state.dtype, which
for pre-quantized models is bfloat16 (from config.json). The post_patch
methods in llama/gemma/gemma2 call patch_model_and_tokenizer without
passing correct_dtype, so quant_state.dtype is never overridden to match
the user's requested dtype. This causes a dtype mismatch crash in the
backward pass when training with dtype=torch.float16.
Fix: pass the user's dtype from from_pretrained through post_patch to
patch_model_and_tokenizer as correct_dtype, matching the pattern already
used by vision.py.
Revert the 5 symptom-level dtype casts in fast_lora.py (upW, gateW, QW,
KW, VW) since they are no longer needed with quant_state.dtype properly
set at the source.
Tested: fp16+4bit and bf16+4bit Llama-3.2-1B-Instruct 15-step SFT runs
both complete successfully with similar losses (~1.558 vs ~1.563).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove TRL's unconditional bfloat16 cast instead of patching the dtype
TRL 0.26.0+ hardcodes `param.data.to(torch.bfloat16)` for all trainable
params in quantized models, citing the QLoRA paper recommendation. This
is wrong: it ignores the user's requested dtype and breaks GradScaler
when fp16=True. The block exists in sft_trainer, grpo_trainer,
rloo_trainer, and reward_trainer (not dpo_trainer).
Previous fix patched the cast to be dtype-conditional. This commit
replaces the entire guard `if getattr(model, "is_loaded_in_4bit", ...)
or getattr(model, "is_loaded_in_8bit", ...):` with `if False:` to
disable the block entirely. Unsloth already handles adapter dtype via
patch_model_and_tokenizer, making TRL's cast both unnecessary and
harmful.
For GRPOTrainer the enclosing peft init block is already removed by
the regex above, making this a no-op for GRPO.
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix trainer compilation failures from trl.experimental thin wrappers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix OOM from prepare_model_for_kbit_training overwriting peft_config patching
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
TRL 0.22.x checks _is_vlm (model type) instead of _is_vision_dataset
(dataset content, added in 0.25.1+) in _set_signature_columns_if_needed.
When _is_vlm=True (e.g. Gemma3), signature columns are set to vision-only
["messages","prompt","completion","images"], which has zero overlap with
tokenized text columns [input_ids, labels, attention_mask, ...], causing
a ValueError.
Fix: expand the VLM branch signature columns to include both vision and
text column names. Extra columns not present in the dataset are harmlessly
ignored by _remove_unused_columns (it only raises when zero columns match).
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Patch before compile?
* Fix notebook compatibility for transformers 4.57.6 and TRL 0.22-0.27
Fixes several notebook failures discovered during testing all 125
notebooks with transformers==4.57.6 + tRL 0.22.2 and TRL 0.27.1.
Warning suppression (import_fixes.py):
- Suppress torch 2.9+ pin_memory/is_pinned device deprecation warnings
- Suppress cuda.cudart/cuda.nvrtc module deprecation FutureWarning
- Filter vllm "Level is deprecated" stderr noise
- Filter PydanticSerializationUnexpectedValue warnings
- Filter Triton "df: No such file" stderr noise
VLM tokenizer loading (vision.py):
- Add _construct_vlm_processor_fallback() for models where
AutoProcessor.from_pretrained fails (e.g., ERNIE 4.5 VL, LFM2.5-VL)
- Wrap processor loading in try/except with fallback to manual
construction from separate image_processor + tokenizer components
- Add fallback to AutoTokenizer/PreTrainedTokenizerFast when tokenizer
loading or patching fails
TRL 0.27.1 trainer compatibility (trainer.py):
- Add _resolve_trainer_params() to handle thin wrapper trainers that
only have def __init__(self, *args, **kwargs) (e.g., ORPOTrainer
in TRL 0.27.1) by walking MRO for real parameter signature
VLM _is_vlm detection (rl.py):
- Replace blanket _is_vlm=False override with model-architecture-based
detection that checks vision_config or ForConditionalGeneration class
name, fixing VLM training when bare tokenizer is passed as
processing_class
ModernBERT SDPA compatibility (loader.py, sentence_transformer.py):
- Add "modernbert" to DISABLE_SDPA_MODEL_NAMES to avoid stride
alignment issues with torch.compile backward pass
- Add DISABLE_SDPA check for sentence transformer models
Other fixes (_utils.py):
- Suppress false uninitialized weight warnings for VLM
multi_modal_projector.layer_norm
Tested: 92/125 notebooks pass with TRL 0.22.2, 94/125 with TRL 0.27.1.
Remaining failures are infra (missing FFmpeg, network timeouts, GPU
arch) not code bugs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix KTO shape mismatch on TRL 0.27.2+ and truncation alignment
- Patch KTO get_batch_logps to auto-align logits and labels when Unsloth
model forward truncates input_ids beyond max_seq_length. TRL 0.27.2
changed _process_tokens to only truncate completions (not prompts), so
sequences with long prompts exceed max_seq_length and trigger model-side
truncation. The original ValueError is replaced with min-length alignment.
- Also truncate attention_mask in LlamaModel forward when input_ids are
truncated to max_seq_length, preventing shape mismatches in attention.
- Widen except clause in rl_replacements.py openenv import from
`except ImportError` to `except (ImportError, NameError, Exception)` to
handle vllm SamplingParams NameError in TRL 0.27.2.
* Fix TRL 0.26+ thin wrapper resolution, enable ModernBERT SDPA, clean up warning filters
TRL 0.26+ thin wrapper resolution (rl.py):
- Filter _-prefixed private imports when discovering Trainer/Config classes
- Look up Config in separate *_config.py module when not found in trainer module
- Detect thin wrappers (<1000 chars source) and resolve to experimental parent
via MRO walk; use resolved module for imports and create_new_function
- Enables all 15 trainers to patch successfully (was 5/15 before)
ModernBERT SDPA (loader.py):
- Remove "modernbert" from DISABLE_SDPA_MODEL_NAMES
- SDPA works correctly for both classification and sentence transformers
- Verified: 88.9% accuracy on emotion classification, correct domain-specific
embeddings after sentence transformer fine-tuning
Warning filter cleanup (import_fixes.py):
- Remove cuda.cudart/cuda.nvrtc FutureWarning filters (no such warnings
exist in torch 2.9.1+; proactive suppression is unnecessary)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove multi_modal_projector.layer_norm from uninitialized weight guard
The LFM2.5-VL projector LayerNorm is properly initialized by
transformers and does not need to be excluded from the uninitialized
weight check. The original exclusion was added as a workaround but is
no longer needed after the upstream fix.
* Add transformers 5.0 compat: rope_theta helper, config-as-dim detection, BatchEncoding guard, try/except for TRL trainer source, push_to_hub_token compiler fix
- llama.py: Add _get_rope_theta() helper handling both config.rope_theta and rope_parameters dict
- llama.py: Handle BatchEncoding in unsloth_fast_generate (transformers 5.0+ returns BatchEncoding from apply_chat_template)
- gemma.py: Detect config passed as dim arg in GemmaFixedRotaryEmbedding
- tokenizer_utils.py: Add try/except for TRL trainer getsource in patch_sft_trainer_tokenizer
- rl_replacements.py: Add compiler fix replacing bare pop("push_to_hub_token") with pop(..., None)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use trl.experimental string check instead of char-count heuristic for thin wrapper detection
The <1000 / >1000 char threshold was fragile -- XPOConfig's parent is only
994 chars and would be skipped. All thin wrappers in TRL 0.26+ contain
"trl.experimental" in their deprecation warning, while no real trainer or
config class does, making it a reliable detection marker.
* Move DISABLE_SDPA_MODEL_NAMES import to module level in sentence_transformer
The function-level import was redundant since loader.py is already imported
at module level. Move it to the existing loader import line.
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Add `inputs_embeds` parameter to `_fast_prepare_inputs_for_generation` so
`model.generate(inputs_embeds=...)` works with Unsloth-patched models.
Changes:
- Add `inputs_embeds=None` to function signature (fixes HF inspect check)
- Track `use_inputs_embeds` flag: True when inputs_embeds provided and no cache
- Conditionally return inputs_embeds on first step, input_ids on subsequent steps
- Handle input_ids being None/empty for batch size and device extraction
- Add attention_mask None-guard before slicing
Fixes: https://github.com/unslothai/unsloth/issues/3798
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: siddhudonda <siddhudonda@users.noreply.github.com>
When using torchrun with quantized models (4bit/8bit/fp8), each rank
must load the model directly onto its own GPU. The default device_map
("sequential") places everything on GPU 0, causing illegal memory
access errors when Accelerate tries to relocate quantized weights.
Use the existing prepare_device_map() utility from loader_utils to
detect distributed training via LOCAL_RANK/WORLD_SIZE env vars and
override device_map to target each rank's local GPU. This is applied
in both FastLanguageModel.from_pretrained and FastModel.from_pretrained,
covering text, vision, and audio model paths.
Fixes#3914
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Refactor Ollama template wiring and harden packing helpers
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
* Fix Qwen3 and Gemma3n template bindings and tidy packing test helper
* Fix gptoss Ollama comment and tinyllama stop parameter
- Fix wrong comment referencing gemma3n for gptoss_ollama in chat_templates.py
- Add missing stop keyword to tinyllama PARAMETER in ollama_template_mappers.py
* Fix _DummyTrainer compatibility across TRL versions
The try/except only handled the removal of return_position_ids
(TRL v0.24+) but not the absence of padding_free (TRL v0.18.2).
Gracefully degrade through all optional collator flags so the
test works from trl>=0.18.2 through v0.27+.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* seperate gguf
* fix Modelfile log
* ollama Modelfile create
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix GGUF file placement: move initial conversion to _gguf dir, fix cleanup
- Move initial GGUF files (from convert_to_gguf) into {model_directory}_gguf/
immediately after conversion, so all GGUF outputs live in the dedicated
directory regardless of quantization method (fixes bf16-only case where
quant == first_conversion skipped the loop and _gguf dir was never created)
- Remove redundant gguf_directory/makedirs from inside the re-quant loop
since the directory is now created before the loop
- Use Path.unlink(missing_ok=True) for base GGUF cleanup robustness
- Unify Modelfile location to {save_directory}_gguf/Modelfile for both
VLM and non-VLM models
- Fix print message to show actual modelfile_location path
- Add gguf_directory key to return dict
- Clean up {save_directory}_gguf in push_to_hub_gguf error/finally blocks
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Implement GGUF upload method for SentenceTransformer
Added a method to convert and upload SentenceTransformer models to GGUF format, including handling of tokenizer, quantization methods, and repository management on Hugging Face Hub.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
On Windows and macOS (Python 3.8+), multiprocessing uses the spawn
start method. When datasets .map(num_proc=N) is called, it creates a
Pool(N) which re-imports __main__ in each worker, causing infinite
recursion and a RuntimeError during bootstrapping.
Guard the auto-computed dataset_num_proc in the generated Config
__init__ by checking multiprocessing.get_start_method() != 'fork'.
When the start method is not fork (spawn/forkserver), force
dataset_num_proc = None so datasets takes the single-process path.
Linux fork behavior is unchanged.
Also replace the fixed memory threshold logic with the simpler
adaptive approach: cap at 64, then min(num_proc, int(available_gb)),
with a safety floor of 1 when available memory is at or below 2GB.
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Disable torchcodec in transformers when FFmpeg is missing
When torchcodec is installed but FFmpeg libraries are unavailable,
transformers still thinks torchcodec is available (via find_spec check)
and tries to use it for audio loading, causing RuntimeError.
This adds disable_torchcodec_if_broken() which tests if torchcodec can
actually load its native libraries, and if not, patches transformers'
_torchcodec_available to False so it falls back to librosa instead.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The cuda.cutlass_epilogue_fusion_enabled and cuda.cutlass_tma_only
inductor config options were added in PyTorch 2.8.0. Using these
options on older PyTorch versions causes a RuntimeError during
GRPOTrainer initialization.
This fix adds a version check to only include these options when
running PyTorch 2.8.0 or later, allowing GRPO training to work on
older PyTorch versions (e.g., Colab environments with PyTorch 2.5-2.7).
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
When datasets library has torchcodec installed but FFmpeg libraries
are missing, torchcodec raises a RuntimeError during import. The
exception handler only caught ImportError and AttributeError, causing
the error to propagate and crash Unsloth imports in environments
like Colab where FFmpeg may not be installed.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Improve MoE performance
* small changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix imports
* disable autotune
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* LoRA for MoE
* Make autotune default
* make dy contiguous
* use non lora model as base for RL
* Revert "use non lora model as base for RL"
This reverts commit bc8f15629d060593b2eaf436f158ff5ac9df0d5d.
* fixup derp
* non TMA [T4]
* Revert "non TMA [T4]"
This reverts commit 35304566690e7c9ab9632899920c85bff322409a.
* Fixes for VL MoE and v5 transformers
* [transformers] [v5] remove unused hybridcache (#3910)
* remote unused hybridcache
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* No double compile for qwen3moe
* Fix top_k on trl GRPO
* Recognise GLM as MoE
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix missing RotaryEmbeddingConfigMixin
* Licensing for autotuning cache
* Cleanup
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Erland366 <erland.pg366@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
_patch_trl_rl_trainers enumerates all trainer modules from dir(trl.trainer)
and attempts to import each one. Modules like alignprop_trainer fail because
they depend on optional packages (diffusers) that may not be installed. The
failure is harmless but the print() call produces noise on every import.
Change print() to logger.info() so these messages only appear when
UNSLOTH_ENABLE_LOGGING=1.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
GPT-OSS models use eager attention during inference because flex
attention returns incorrect results (likely due to left padding).
However, when _attn_implementation is set to "flex_attention",
transformers creates BlockMask objects which cause a TypeError
when passed to the eager attention path:
TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'
This fix excludes GPT-OSS from using flex_attention, keeping it on
the eager path to avoid the BlockMask/Tensor type mismatch.
* Enable flex attention by default
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Avoid dropping flex attention when SDPA unsupported
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update rl_replacements.py
* Update rl_replacements.py
* Update rl.py
* Update rl_replacements.py
* Update rl_replacements.py
* Update rl.py
* Update rl.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update rl_replacements.py
* Update rl.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update rl_replacements.py, remove chat template from codexes commits
* Update rl.py, got rid of gradient checkpointing code that did not work
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix torchvision compatibility check for source builds and future torch versions
The torchvision version check raised a hard ImportError for custom/source-built
PyTorch installations (e.g. AMD ROCm from source with +git* suffixes), even when
the actual build was functional. This also silently skipped any torch version
not already in the hardcoded table, giving no warning at all for future releases.
Changes:
- Detect custom/source builds by checking the raw version string's local
identifier against known standard prefixes (cu, rocm, cpu, xpu). Our custom
Version() strips local identifiers via regex, so detection must happen on the
raw string before parsing.
- Downgrade to a warning (instead of ImportError) for custom/source builds,
since their version numbers may not follow standard PyPI release pairings.
- Add formula-based inference for future torch versions not yet in the table.
The torch->torchvision minor version formula (torch 2.x -> tv 0.(x+15)) has
held for every release from torch 2.0 through 2.9. For formula-predicted
versions, mismatches produce a warning rather than a hard error.
- Add UNSLOTH_SKIP_TORCHVISION_CHECK=1 env var to skip the check entirely.
- Wrap importlib_version and Version calls in try/except so broken metadata
never crashes the import.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: stricter regex, case insensitivity, pre-release detection
Fixes three edge cases found during review:
1. Regex precision: cu/xpu now require a trailing digit (cu\d, xpu\d) to
avoid false negatives on suffixes like "+custom_build" that happen to
start with "cu". cpu/xpu match as exact strings only.
2. Case insensitivity: added re.IGNORECASE so "+ROCM6.3" and "+CPU" are
correctly recognized as standard builds rather than custom ones.
3. Pre-release detection: nightly/dev/alpha/beta/rc builds with standard
CUDA/ROCm suffixes (e.g. "2.7.0.dev20250301+cu124") now produce a
warning instead of a hard ImportError. These builds commonly have
version mismatches that are expected during development.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address PR review comments: fullmatch, env var casing, torchvision pre-release
1. Switch re.match to re.fullmatch for the custom build regex so the
entire local identifier must match. Fixes false negatives where
suffixes like +cu124_custom were misclassified as standard because
re.match only checked the start of the string.
2. Use .lower() for the UNSLOTH_SKIP_TORCHVISION_CHECK env var so
any casing of "true" / "TRUE" / etc. is accepted.
3. Check torchvision_version_raw for pre-release tags in addition to
torch_version_raw, so a stable torch paired with a nightly
torchvision (e.g. 0.23.0.dev...) also gets a warning instead of
a hard ImportError.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
vLLM's distributed module (device_communicators) crashes with std::bad_alloc
when imported on SM100 GPUs (B200/B100/Blackwell) with torch < 2.9.0.
This adds an early check that runs before vLLM is imported, providing a
helpful error message instead of a cryptic C++ exception.
The check:
1. Detects if vLLM is installed
2. Checks if torch version is < 2.9.0
3. Checks if any GPU is SM100 (Blackwell)
4. If all conditions met, raises RuntimeError with clear upgrade instructions
* Add TRL truncation regression and metadata loss fixes
Fix 1: TRL 0.24.0-0.25.1 right-truncation regression
- These versions pass max_length=self.max_prompt_length and truncation=True
to the tokenizer, which right-truncates prompts and strips the assistant
turn suffix
- Use regex to remove these kwargs from the generated code
Fix 3: Metadata loss for chat_template_kwargs
- TRL 0.24.0+ extracts prompts = [x["prompt"] for x in inputs], losing metadata
like reasoning_effort
- Inject code to store per-sample chat_template_kwargs on self before extraction
- Preserve these kwargs in prompts_text generation for all TRL versions
Tested with TRL versions 0.22.2, 0.23.1, 0.24.0, 0.25.1, 0.26.2, and 0.27.1.
* Update Fix 1 comment with detailed TRL version behavior explanation
Expand the comment for the TRL 0.24.0-0.25.1 truncation regression fix
to clarify what each TRL version does:
- TRL 0.22.2-0.23.1: Uses truncate_with_protected_tokens() for smart
truncation that preserves rightmost tokens and protects special tokens
- TRL 0.24.0-0.25.1: Removed smart truncation, passes kwargs directly
to tokenizer (max_length, truncation=True, add_special_tokens=False)
- TRL 0.26.2+: Removed these kwargs entirely
The fix removes these problematic kwargs so 0.24.0-0.25.1 behaves like
0.26.2+ (no tokenizer-level truncation).
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
When users pass `num_train_epochs=None` to GRPOConfig (relying on
max_steps to control training duration), Trainer.__init__ fails with:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
This happens because transformers.Trainer does `args.num_train_epochs > 0`
in its __init__ which fails when the value is None.
This fix converts None to 3.0 (the default) before Trainer initialization.
The actual training duration is still controlled by max_steps since it
takes precedence when both are set.
Example that now works:
```python
config = GRPOConfig(
num_train_epochs=None, # Previously caused TypeError
max_steps=500, # This controls actual duration
...
)
```
* [fix] Vision GRPO string prompts and OpenEnv async compatibility
- Guard prepare_multimodal_messages in GRPO trainer to skip processing
when prompts are pre-templated strings. Notebooks that pre-apply
apply_chat_template() produce strings with image tokens already
embedded; calling prepare_multimodal_messages on those crashes with
TypeError.
- Apply nest_asyncio when OpenEnv EnvClient exposes async reset/step,
so scripts using run_until_complete() wrappers work in all contexts.
- Add wrapper to call patch_torchcodec_audio_decoder() from unsloth_zoo
for AudioDecoder dict-compatibility.
* Add apply_chat_template guard for pre-templated string prompts in Vision GRPO
When notebooks pre-apply apply_chat_template, prompts become strings.
The existing guard skips prepare_multimodal_messages for strings. This
adds a second guard to skip apply_chat_template in the forward_kwargs
block, using prompts directly as prompts_text instead. Covers both
TRL 0.25.x (no tools param) and TRL 0.26.2+ (with tools=self.tools).
Non-matching replacements silently pass for older TRL versions.
* Add TRL 0.25.1 single-line variant for apply_chat_template guard
TRL 0.25.1 uses single-line formatting for apply_chat_template:
apply_chat_template({"prompt": prompt}, ...)["prompt"]
While TRL 0.26.2+ uses multi-line formatting:
apply_chat_template(
{"prompt": prompt}, ...
)["prompt"]
Add both variants to ensure full backwards compatibility.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix TRL 0.27.0 GRPO compatibility and PEFT model handling
- Remove use_reentrant=False from gradient_checkpointing_kwargs for TRL 0.27.0+
TRL 0.27.0 auto-sets use_reentrant=False in GRPOConfig.__post_init__, but
Unsloth gradient checkpointing requires use_reentrant=True. This adds a
post-init cleanup that removes the setting when present.
- Handle prepare_peft_model standalone function pattern for TRL 0.22.0+
TRL changed from self._prepare_peft_model() method to prepare_peft_model()
standalone function. Both patterns are now bypassed to let Unsloth handle
PEFT model preparation.
Tested with TRL versions 0.22.2, 0.23.1, 0.24.0, 0.25.1, 0.26.2, and 0.27.1.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* reduce code duplication
* address reviewer feedback: keep original function name
- Keep original function name `_offload_frozen_module_for_training`
- Make `offload_device` parameter Optional (can be None)
- Keep original error handling (return None for missing modules_to_save)
- Maintain code deduplication by reusing the helper function
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Use standard gradient checkpointing for small sequence lengths
When max_seq_length < 512, the overhead of gradient offloading in
gc="unsloth" mode is not worth it. Benchmarks on B200 show:
| seq_len | gc=unsloth | gc=True | Difference |
|---------|------------|----------|------------|
| 256 | 6,803 t/s | 6,993 t/s| +2.8% |
| 384 | 9,889 t/s | 9,963 t/s| +0.7% |
| 512 | 13,151 t/s | 13,092 t/s| -0.4% |
| 1024 | 26,662 t/s | 25,094 t/s| -5.9% |
The crossover point is around seq_len 384-512. For sequences shorter
than 512, we now automatically use standard gradient checkpointing
instead of the custom offloading implementation.
Additionally, when user explicitly sets use_gradient_checkpointing to
True or False in get_peft_model, it now correctly overrides any
previous "unsloth" patching from from_pretrained. This ensures
consistent behavior regardless of the order of function calls.
Updated in three locations:
- FastLlamaModel.get_peft_model (llama.py)
- FastLanguageModel.from_pretrained (loader.py)
- FastModel.from_pretrained (loader.py)
* Refactor: extract gradient checkpointing heuristic into utility function
Addresses code review feedback to reduce duplication. The gradient
checkpointing heuristic logic was duplicated in 3 places:
- FastLlamaModel.get_peft_model (llama.py)
- FastLanguageModel.from_pretrained (loader.py)
- FastModel.from_pretrained (loader.py)
Created apply_unsloth_gradient_checkpointing() utility function in
_utils.py that handles:
- Heuristic: seq < 512 falls back to standard gc
- Explicit True/False overrides unpatch previous patching
- Returns the effective use_gradient_checkpointing value
Net reduction of ~6 lines while improving maintainability.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>