unsloth/tests/saving/test_patch_saving_none_tokenizer.py
Datta Nimmaturi 77756faa46
Fix tokenizer save gemma (#5115)
* [WIP] Fast inference for qwen3.5

* fix tokenizer not saving properly

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* extend to VLM and clenaup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* gate tokenizer.model saving

* fix for gated/private models

* Fix tokenizer save review findings

- save.py:261 restore dict-based _TOKENIZER_MODEL_CACHE so negative
  results are cached; the set() in 0129fb5e regressed non-SentencePiece
  tokenizer saves to a fresh HfApi.model_info call on every checkpoint.
  Don't cache on exception so gated/private repos can retry later with a
  valid token.
- save.py:282 guard `repo_info.siblings` with `or []`; huggingface_hub
  types this Optional and returns None for empty or new repos, which
  made any() raise TypeError out of save_pretrained.
- save.py:3487 split push_to_hub into local save + _preserve + push so
  uploaded tokenizer_config.json/tokenizer.model include the fix rather
  than the unfixed copies written before the upload.
- save.py:3352 call patch_saving_functions on tokenizers passed to
  unsloth_save_pretrained_torchao to match the other three save
  entrypoints; previously torchao saves skipped the preservation patch.

* Fix push_to_hub repo_id conflict and torchao token forwarding

- save.py:3493-3496 pop `repo_id` from kwargs (defaulting to
  `save_directory`) before calling `self.push_to_hub(repo_id, **kwargs)`.
  The previous `self.push_to_hub(save_directory, **kwargs)` passed
  `save_directory` as the first positional `repo_id` while also
  forwarding a user-supplied `repo_id` through kwargs, raising
  `TypeError: got multiple values for argument 'repo_id'` on the
  standard `save_pretrained(local_path, push_to_hub=True, repo_id=...)`
  call shape. This regression was introduced by the earlier iteration
  that split push_to_hub into an explicit second step.
- save.py:3314 forward `token=token` on the torchao non-PEFT
  `tokenizer.save_pretrained(torchao_save_directory)` call so the
  patched wrapper can reach gated repos when HF_TOKEN is not in the
  environment. Left the sibling `unsloth_generic_save` call at 3063
  untouched (blame points at an earlier full-finetuned
  save_pretrained_merged fix and the token gap there is lower risk).

* Fix torchao tokenizer reload and push_to_hub repo_id default

- save.py:3283 after `auto_processor.from_pretrained(save_directory)`
  re-runs `patch_saving_functions(tokenizer)` on the freshly loaded
  tokenizer. The rebind at 3283 was overwriting the patched tokenizer
  passed into `unsloth_save_pretrained_torchao`, so the subsequent
  `tokenizer.push_to_hub` (3309) and `tokenizer.save_pretrained`
  (3314) bypassed `_preserve_sentencepiece_tokenizer_assets` and left
  `{save_directory}-torchao` without `tokenizer.model` / restored
  `added_tokens_decoder`.
- save.py:3497 fall back to `os.path.basename(save_directory)` for
  `repo_id` instead of the raw `save_directory`. The round-2 fallback
  diverged from `transformers.PreTrainedTokenizerBase.save_pretrained`,
  which defaults `repo_id = save_directory.split(os.path.sep)[-1]`;
  nested local paths like `./out/my-repo` now resolve to `my-repo`
  (the Hub id) instead of the full filesystem path.

* Revert tokenizer save_pretrained repo_id basename fallback

- save.py:3497 default `repo_id` back to `save_directory` as-is rather
  than `os.path.basename(save_directory)`. The basename fallback (added
  last iteration to match upstream transformers) stripped the user
  namespace from the Unsloth convention `tokenizer.save_pretrained(
  "user/repo", push_to_hub=True)`, redirecting the upload to
  `{current_user}/repo`. save.py itself treats `save_directory` as the
  repo id at 572, 593, 1723, 1779, 1836, 1844, 1858, and 3025, so the
  wrapper should follow the same convention. Users who pass a nested
  filesystem path with `push_to_hub=True` can supply explicit
  `repo_id=...`.

* Guard processor.tokenizer recursion against None

save.py:3511 change `elif hasattr(model, "tokenizer")` to
`elif getattr(model, "tokenizer", None) is not None`. The previous
guard only checked attribute existence; a ProcessorMixin that sets
`tokenizer = None` (audio-only or manually constructed) would enter
the branch and crash inside the recursive patch_saving_functions on
`model.push_to_hub.__name__`.

* Add review tests for tokenizer save

* Consolidate review tests

Drop redundant assertion in test_patch_saving_functions_still_patches_non_none_tokenizer.
The hasattr check already proves the patch applied; the or-chained
repeat assertion added no signal.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-22 09:03:20 -07:00

47 lines
1.2 KiB
Python

from unittest.mock import MagicMock
from transformers import PreTrainedTokenizerBase
from unsloth.save import patch_saving_functions
class _ProcessorWithNoneTokenizer:
tokenizer = None
def push_to_hub(self, *args, **kwargs):
return None
push_to_hub.__doc__ = "stub"
def save_pretrained(self, *args, **kwargs):
return None
def test_patch_saving_functions_no_crash_on_none_tokenizer():
proc = _ProcessorWithNoneTokenizer()
patch_saving_functions(proc)
def test_patch_saving_functions_still_patches_non_none_tokenizer():
inner = MagicMock(spec = PreTrainedTokenizerBase)
inner.save_pretrained = MagicMock()
inner.save_pretrained.__name__ = "save_pretrained"
inner.push_to_hub = MagicMock()
inner.push_to_hub.__name__ = "push_to_hub"
inner.push_to_hub.__doc__ = "tokenizer doc"
class _Proc:
def __init__(self, tok):
self.tokenizer = tok
def push_to_hub(self, *args, **kwargs):
return None
push_to_hub.__doc__ = "proc doc"
def save_pretrained(self, *args, **kwargs):
return None
proc = _Proc(inner)
patch_saving_functions(proc)
assert hasattr(inner, "original_save_pretrained")