Related to #1615
Add documentation and a function for exporting models from Colab to a local machine.
* **README.md**: Add a new section, "Exporting Models from Colab to Local Machine", under "✨ Finetune for Free" with step-by-step export instructions.
* **CONTRIBUTING.md**: Add a note pointing to the new Colab export section.
* **unsloth/save.py**: Add a new function `export_model_to_local` to handle the export (sketched below).
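The merged implementation lives in `unsloth/save.py`; purely as an illustration, a function along these lines could save the model, zip the folder, and trigger a Colab browser download (the body below is a hypothetical sketch, not the actual code):
```
import os
import shutil

def export_model_to_local(model, tokenizer, save_directory = "exported_model"):
    # Hypothetical sketch: save model + tokenizer, zip the folder, and
    # trigger a browser download when running inside Colab.
    model.save_pretrained(save_directory)
    tokenizer.save_pretrained(save_directory)
    archive = shutil.make_archive(save_directory, "zip", save_directory)
    try:
        from google.colab import files  # only importable inside Colab
        files.download(archive)
    except ImportError:
        print(f"Not in Colab; archive written to {os.path.abspath(archive)}")
```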
(cherry picked from commit 0361bd658f)
* Enable FP8 + RL training for bf16 models (#3440)
* Enable FP8 + RL training for bf16 models
**Summary:** Enable FP8 + RL training using TorchAO for 1.33x faster training and 42% less model memory usage:
- We quantize the frozen base model weights to fp8 and keep the LoRA adapters in bf16
- We leverage TorchAO's `Float8Tensor`, which calls into fbgemm's fp8 x fp8 rowwise matmul kernel
- For now, we need to do an offline quantization first, because vllm doesn't support on-the-fly quantization for torchao yet (this is in progress: https://github.com/vllm-project/vllm/pull/26327)
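For orientation, the offline step amounts to a torchao `quantize_` call over the frozen base linears; a minimal sketch using torchao's public config API (not Unsloth's exact internals):
```
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig
from torchao.quantization.granularity import PerRow

def quantize_base_to_fp8(model):
    # Swap the frozen base Linear weights for torchao Float8Tensor with
    # rowwise scales; the bf16 LoRA adapters are attached afterwards and
    # trained as usual, so only the frozen weights shrink.
    quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity = PerRow()))
    return model
```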
**Example usage:**
```
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = True, # set this to True
)

# the rest is the same as before
model = FastLanguageModel.get_peft_model(...)
```
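The `...` above is elided in the original; purely as an illustration, a typical Unsloth LoRA setup looks like:
```
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```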
**Initial results:**
```
# fp8
{'train_runtime': 1725.4337, 'train_samples_per_second': 0.232, 'train_steps_per_second': 0.058, 'train_loss': 0.00015715716748673002, 'epoch': 0.01}
# bf16
{'train_runtime': 2297.8145, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.044, 'train_loss': 0.00016081033063528594, 'epoch': 0.01}
```
<img width="1199" height="448" alt="Screenshot 2025-11-11 at 4 10 50 PM" src="https://github.com/user-attachments/assets/b6304afd-89e9-42b1-8064-775807e17b23" />
Test script: https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
**Requires:**
- https://github.com/pytorch/ao/pull/3158 (torchao nightly or 0.15.0+)
- https://github.com/unslothai/unsloth-zoo/pull/351
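As a convenience (our suggestion, not part of the PR), the torchao requirement can be checked up front:
```
from importlib.metadata import version
from packaging.version import Version

# Pre-release (nightly) builds of 0.15.0 also satisfy this bound.
assert Version(version("torchao")) >= Version("0.15.0.dev0"), \
    "FP8 + RL training requires torchao nightly or 0.15.0+"
```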
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* _get_inference_mode_context_manager
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update utils.py
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update __init__.py
* Fix/save torchao model loading logic (#3621)
* make loading gpt-oss-BF16 faster. Linked to unsloth-zoo PR #314
* fix model loading and clean merged model directory
* revert default quant
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert mapper.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update loader_utils.py
* Update loader_utils.py
* Add 128x128 PerBlock FP8 + RL (#3629)
* Add 128x128 PerBlock FP8 + RL
**Summary:** Following https://github.com/unslothai/unsloth/pull/3440,
this PR extends torchao FP8 + RL support to also handle 128x128
PerBlock granularity (in addition to PerRow).
**Example usage:**
```
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = "block", # or "row" or True
)
```
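Here `load_in_fp8 = "block"` selects 128x128 blockwise weight scales instead of rowwise ones. Assuming the `PerBlock` granularity from the torchao PR noted below, the underlying config choice might look roughly like:
```
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig
from torchao.quantization.granularity import PerBlock, PerRow

def fp8_weight_config(mode):
    # "block" -> one fp8 scale per 128x128 weight tile (assumed PerBlock API);
    # "row" or True -> one fp8 scale per output row.
    granularity = PerBlock((128, 128)) if mode == "block" else PerRow()
    return Float8DynamicActivationFloat8WeightConfig(granularity = granularity)
```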
**Initial results:** TBD
**Note:**
- Requires https://github.com/pytorch/ao/pull/3370
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Version
* Update vision.py
* Update rl.py
* Add torch 2.9.1
* Fix auto installer
* Update fp8.py
* Float8
* Update fp8.py
* Update mapper.py
* Update mapper.py
* Update loader_utils.py
* Update loader.py
* Update fp8.py
* Versioning
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: andrewor14 <andrewor14@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>