mirror of
https://github.com/unslothai/unsloth.git
synced 2026-05-19 07:42:36 +00:00
* studio: engage draft-mtp on vision MTP GGUFs

The draft-mtp auto-promotion in LlamaCppBackend.load_model was gated on not effective_is_vision, and the spec-emit branch repeated the same guard. Every Unsloth -MTP GGUF repo ships an mmproj projector, so effective_is_vision was always True for those repos and the MTP speedup silently never engaged out of the box.

llama.cpp #22673 explicitly states MTP is compatible with vision input. The bundled b9204 server happily loads both: a manual run with --mmproj ... --spec-type draft-mtp --spec-draft-n-max 6 logs "loaded multimodal model" followed by "adding speculative implementation 'draft-mtp'".

Drop the vision gate from both sites and rewrite the matching short circuit in _already_in_target_state so reload checks reach the auto promotion path on vision MTP loads. Add three regression tests covering vision MTP match (auto and default), and non MTP vision repo unaffected.

Verified on a B200 with unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL: base decode 179.7 t/s vs MTP decode 253.8 t/s, draft acceptance 0.57, 1.41x speedup on a 255 token completion. mmproj still loads and image input remains available.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: prefer Qwen3.5 -MTP GGUF variants in default model lists

With the vision gate dropped in the previous commit, draft-mtp now auto-engages on -MTP GGUF repos out of the box. Swap the four Qwen3.5 recommended entries in DEFAULT_MODELS_GGUF and DEFAULT_MODELS_STANDARD to their -MTP-GGUF counterparts so new users get the speedup by default:

  unsloth/Qwen3.5-4B-GGUF -> unsloth/Qwen3.5-4B-MTP-GGUF
  unsloth/Qwen3.5-9B-GGUF -> unsloth/Qwen3.5-9B-MTP-GGUF
  unsloth/Qwen3.5-35B-A3B-GGUF -> unsloth/Qwen3.5-35B-A3B-MTP-GGUF
  unsloth/Qwen3.5-0.8B-GGUF -> unsloth/Qwen3.5-0.8B-MTP-GGUF

All four HF repos exist (HEAD 200) and ship the same UD-Q4_K_XL quant layout as the non-MTP variants. Non-Qwen3.5 entries are untouched.
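
Illustrative sketch of the vision-gate change described in the first commit above. This is not the actual LlamaCppBackend.load_model code: LoadContext, is_mtp_repo, promotes_to_draft_mtp and the gate_on_vision parameter are hypothetical names, and detecting an MTP repo by a "-MTP" marker in the repo id is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadContext:
    repo_id: str      # e.g. "unsloth/Qwen3.5-4B-MTP-GGUF"
    has_mmproj: bool  # True when the repo ships a vision projector


def is_mtp_repo(repo_id: str) -> bool:
    # Assumption: MTP repos are recognised by a "-MTP" marker in the name.
    return "-MTP" in repo_id.upper()


def promotes_to_draft_mtp(ctx: LoadContext, gate_on_vision: bool) -> bool:
    """Return True when a load would auto-promote to draft-mtp."""
    if not is_mtp_repo(ctx.repo_id):
        return False
    if gate_on_vision and ctx.has_mmproj:
        # Old behaviour: any repo with an mmproj projector was excluded,
        # which in practice excluded every -MTP GGUF repo.
        return False
    return True


vision_mtp = LoadContext("unsloth/Qwen3.5-4B-MTP-GGUF", has_mmproj=True)
print(promotes_to_draft_mtp(vision_mtp, gate_on_vision=True))   # old gate
print(promotes_to_draft_mtp(vision_mtp, gate_on_vision=False))  # gate dropped
```

Under these assumptions the same vision MTP repo is rejected with the gate and promoted without it, while a non-MTP vision repo is unaffected either way.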

* bump version to 2026.5.4

Picks up the studio MTP vision-gate fix and the Qwen3.5 -MTP default swap in this PR.

* studio: prefer Qwen3.6-35B-A3B-MTP-GGUF in default model lists

Same rationale as the previous Qwen3.5 swap. The Qwen3.6 MTP variant exists at unsloth/Qwen3.6-35B-A3B-MTP-GGUF (HF HEAD 200) and now auto-engages draft-mtp out of the box with the gate fix.

* studio: drop --spec-draft-n-max from 6 to 3 for draft-mtp

n=6 is too greedy: on Qwen3.6 the draft has to guess 6 tokens ahead and acceptance crashes to ~0.45, leaving only ~14% throughput gain. PR ggml-org/llama.cpp#22673's author benched n=3 at ~0.72 acceptance and 2 to 3x speedup on the same Qwen3.6 family, and the README sample command uses n=2 or n=3. Match that. CPU/Mac branch already uses n=3, so this aligns both paths.

* studio: set --spec-draft-n-max back to 6 for draft-mtp on GPU

Reverts the n=3 tuning. n=6 is the original default; user-side comparisons hold the larger draft window steady so the toggle (next commit) is the primary on/off lever.

* studio: add Speculative Decoding toggle under Max Tokens

Adds a top-level kill switch (panel-switch under Max Tokens, mirroring Auto-Healing Tool Calls) that forces the /load request's speculative_type to "off" when disabled. The backend "off" branch in LlamaCppBackend.load_model skips both the draft-mtp auto-promotion and the spec-emit branch, so neither --spec-type draft-mtp nor --spec-default reaches llama-server.

Wiring:
- chat-runtime-store: new speculativeDecodingEnabled bool, default true, persisted to localStorage under unsloth_speculative_decoding, plus a setSpeculativeDecodingEnabled setter.
- chat-settings-sheet: SpeculativeDecodingToggle rendered immediately beneath the Max Tokens slider for non-external models.
- use-chat-model-runtime: when speculativeDecodingEnabled is false, override speculative_type to "off" in the loadModel call so the switch wins over any pre-existing speculativeType state (including the existing per-model toggle in Model Settings).

Verified end to end on unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL: toggle ON emits --spec-type draft-mtp --spec-draft-n-max 6; toggle OFF emits zero --spec-* flags on the same MTP GGUF.

* studio: relocate Speculative Decoding toggle into Model Settings

Move the toggle out from under Max Tokens and back into the Model Settings section, directly beneath KV Cache Dtype, where the existing Apply/Reset workflow already drives a reload on dirty. This way flipping the switch in the UI actually picks up: the section becomes dirty, Apply re-runs /load with the new speculative_type.

Drop the !currentModelIsMultimodal gate so vision MTP GGUFs can also disable speculative decoding from the UI. Switch the toggle's off-value from null to "off" so the backend's "off" short-circuit fires for MTP models too (null normalises to None which re-triggers the draft-mtp auto-promotion). Tooltip now reads "Faster generation with 0% accuracy hit".

Remove the now-redundant speculativeDecodingEnabled bool + setter from the runtime store and the load-time override in use-chat-model-runtime; the toggle binds directly to speculativeType.

* studio: restore OOM/TIGHT badge on recommended GGUF rows

The recommended-list row passed vramStatus=null for any GGUF repo because the existing useRecommendedModelVram hook reads safetensors totals from HF model info, which GGUF-only repos do not expose. As a result, an OOM Q-quant repo would render with only a "GGUF" badge and no visual signal that nothing in it fits.

Add useGgufRecommendedFit: per repo, fetch the variant list via the existing /api/models/gguf-variants endpoint, take the smallest variant's size_bytes, and classify with the same 0.7*GPU + 0.7*RAM thresholds as GgufVariantExpander.
Session-scoped cache + in-flight dedup so a repo is requested at most once. Wire the result into the three GGUF row sites in pickers.tsx so OOM and TIGHT badges show on the collapsed cards.

* Revert "studio: restore OOM/TIGHT badge on recommended GGUF rows"

This reverts commit 07793b1240df72b13e51d6dc15f63c4ee8c6cba9.

The new useGgufRecommendedFit hook was treating the symptom. PR #5561 identified the real root cause: useGpuInfo was calling /api/system with plain fetch instead of authFetch, so the session-auth check failed silently and gpu.available stayed false everywhere. With no GPU info, every fit check (variant expander, recommended carousel) fell back to "no signal" and dropped the OOM/TIGHT badges.

Reverting the over-engineered hook and applying the authFetch fix in the next commit, which restores the existing badges with one line.

* chore: replace qwen suggested with MTP variant

* fix: restore GPU info auth for GGUF fit badges

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
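
Illustrative sketch of why the relocated toggle sends "off" rather than null, as described in the Speculative Decoding commits above. The function names and the constant are hypothetical and the real backend may be structured differently; only the --spec-type draft-mtp and --spec-draft-n-max 6 flags and the "off"/None behaviour are taken from the log.

```python
from typing import List, Optional

# Draft window the log reports for GPU loads after the revert to n=6.
AUTO_MTP_DRAFT_N_MAX = 6


def normalise_speculative_type(value: Optional[str]) -> Optional[str]:
    # Assumption: null and empty strings both collapse to None ("no preference").
    if value is None or value.strip() == "":
        return None
    return value.strip().lower()


def spec_flags(speculative_type: Optional[str], is_mtp_repo: bool) -> List[str]:
    """Return the --spec-* flags a load would pass to llama-server."""
    requested = normalise_speculative_type(speculative_type)
    if requested == "off":
        # Explicit opt-out: skip auto-promotion and emit no --spec-* flags.
        return []
    if requested is None and is_mtp_repo:
        # No preference on an MTP repo: auto-promote to draft-mtp.
        requested = "draft-mtp"
    if requested == "draft-mtp":
        return ["--spec-type", "draft-mtp",
                "--spec-draft-n-max", str(AUTO_MTP_DRAFT_N_MAX)]
    return []  # other speculative types are not modelled in this sketch


print(spec_flags(None, is_mtp_repo=True))   # null re-triggers auto-promotion
print(spec_flags("off", is_mtp_repo=True))  # "off" emits no flags
```

With null as the off-value, an MTP repo would still be promoted to draft-mtp; only the explicit "off" string reaches the short-circuit.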

| Name |
|---|
| public |
| src |
| .gitignore |
| .gitkeep |
| .npmrc |
| biome.json |
| components.json |
| data-designer.openapi (1).yaml |
| eslint.config.js |
| index.html |
| package-lock.json |
| package.json |
| tsconfig.app.json |
| tsconfig.json |
| tsconfig.node.json |
| vite.config.ts |