unsloth/studio/frontend
Daniel Han 4699c7e291
studio: engage draft-mtp on vision MTP GGUFs (drop incorrect vision gate) (#5560)
* studio: engage draft-mtp on vision MTP GGUFs

The draft-mtp auto-promotion in LlamaCppBackend.load_model was gated on
not effective_is_vision, and the spec-emit branch repeated the same
guard. Every Unsloth -MTP GGUF repo ships an mmproj projector, so
effective_is_vision was always True for those repos and the MTP speedup
silently never engaged out of the box.

llama.cpp #22673 explicitly states MTP is compatible with vision input.
The bundled b9204 server happily loads both: a manual run with
--mmproj ... --spec-type draft-mtp --spec-draft-n-max 6 logs
"loaded multimodal model" followed by
"adding speculative implementation 'draft-mtp'".

Drop the vision gate from both sites and rewrite the matching
short-circuit in _already_in_target_state so reload checks reach the
auto-promotion path on vision MTP loads. Add three regression tests:
vision MTP match (auto), vision MTP match (default), and a non-MTP
vision repo that stays unaffected.
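
As a rough illustration of the gate change above, the old and new
promotion conditions can be compared side by side. This is a sketch
only: promotes_draft_mtp and its vision_gate parameter are hypothetical
names, not the actual code in LlamaCppBackend.load_model.

```python
def promotes_draft_mtp(
    is_mtp_repo: bool,
    effective_is_vision: bool,
    *,
    vision_gate: bool,
) -> bool:
    """Hypothetical stand-in for the draft-mtp auto-promotion decision.

    vision_gate=True models the old behaviour (gated on
    `not effective_is_vision`); vision_gate=False models the fix.
    """
    if not is_mtp_repo:
        # Non-MTP repos are never promoted, with or without the gate.
        return False
    if vision_gate and effective_is_vision:
        # Old behaviour: any repo that ships an mmproj was skipped.
        return False
    return True


# -MTP GGUF repos ship an mmproj projector, so effective_is_vision is True.
old_behaviour = promotes_draft_mtp(True, True, vision_gate=True)
new_behaviour = promotes_draft_mtp(True, True, vision_gate=False)
non_mtp_vision = promotes_draft_mtp(False, True, vision_gate=False)
print(old_behaviour, new_behaviour, non_mtp_vision)
```

The third call mirrors the "non-MTP vision repo unaffected" regression
case: removing the gate does not promote repos without MTP weights.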

Verified on a B200 with unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL:
base decode 179.7 t/s vs MTP decode 253.8 t/s, draft acceptance 0.57,
1.41x speedup on a 255 token completion. mmproj still loads and image
input remains available.
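
The speedup figure follows directly from the two decode rates quoted
above. A quick check of the arithmetic, where the wall-clock estimate
is an added assumption that counts decode time only and ignores prompt
processing:

```python
base_tps = 179.7   # base decode rate from the measurement above, tokens/s
mtp_tps = 253.8    # draft-mtp decode rate from the same run, tokens/s
completion_tokens = 255

speedup = mtp_tps / base_tps

# Assumption: decode-only time; prompt processing and load time are ignored.
base_seconds = completion_tokens / base_tps
mtp_seconds = completion_tokens / mtp_tps

print(f"speedup: {speedup:.2f}x")
print(f"decode time: {base_seconds:.2f}s -> {mtp_seconds:.2f}s")
```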

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: prefer Qwen3.5 -MTP GGUF variants in default model lists

With the vision gate dropped in the previous commit, draft-mtp now
auto-engages on -MTP GGUF repos out of the box. Swap the four Qwen3.5
recommended entries in DEFAULT_MODELS_GGUF and DEFAULT_MODELS_STANDARD
to their -MTP-GGUF counterparts so new users get the speedup by default:

  unsloth/Qwen3.5-4B-GGUF        -> unsloth/Qwen3.5-4B-MTP-GGUF
  unsloth/Qwen3.5-9B-GGUF        -> unsloth/Qwen3.5-9B-MTP-GGUF
  unsloth/Qwen3.5-35B-A3B-GGUF   -> unsloth/Qwen3.5-35B-A3B-MTP-GGUF
  unsloth/Qwen3.5-0.8B-GGUF      -> unsloth/Qwen3.5-0.8B-MTP-GGUF

All four HF repos exist (HEAD 200) and ship the same UD-Q4_K_XL quant
layout as the non-MTP variants. Non-Qwen3.5 entries are untouched.

* bump version to 2026.5.4

Picks up the studio MTP vision-gate fix and the Qwen3.5 -MTP default
swap in this PR.

* studio: prefer Qwen3.6-35B-A3B-MTP-GGUF in default model lists

Same rationale as the previous Qwen3.5 swap. The Qwen3.6 MTP variant
exists at unsloth/Qwen3.6-35B-A3B-MTP-GGUF (HF HEAD 200) and now
auto-engages draft-mtp out of the box with the gate fix.

* studio: drop --spec-draft-n-max from 6 to 3 for draft-mtp

n=6 is too greedy: on Qwen3.6 the draft has to guess 6 tokens ahead
and acceptance crashes to ~0.45, leaving only ~14% throughput gain.

PR ggml-org/llama.cpp#22673's author benched n=3 at ~0.72 acceptance
and 2 to 3x speedup on the same Qwen3.6 family, and the README sample
command uses n=2 or n=3. Match that.

CPU/Mac branch already uses n=3, so this aligns both paths.

* studio: set --spec-draft-n-max back to 6 for draft-mtp on GPU

Reverts the n=3 tuning. n=6 is the original default; keeping the larger
draft window fixed means user-side comparisons vary only the toggle
(next commit), which becomes the primary on/off lever.

* studio: add Speculative Decoding toggle under Max Tokens

Adds a top-level kill switch (panel-switch under Max Tokens, mirroring
Auto-Healing Tool Calls) that forces the /load request's
speculative_type to "off" when disabled. The backend "off" branch in
LlamaCppBackend.load_model skips both the draft-mtp auto-promotion and
the spec-emit branch, so neither --spec-type draft-mtp nor
--spec-default reaches llama-server.

Wiring:

- chat-runtime-store: new speculativeDecodingEnabled bool, default
  true, persisted to localStorage under unsloth_speculative_decoding,
  plus a setSpeculativeDecodingEnabled setter.
- chat-settings-sheet: SpeculativeDecodingToggle rendered immediately
  beneath the Max Tokens slider for non-external models.
- use-chat-model-runtime: when speculativeDecodingEnabled is false,
  override speculative_type to "off" in the loadModel call so the
  switch wins over any pre-existing speculativeType state (including
  the existing per-model toggle in Model Settings).

Verified end to end on unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL:
toggle ON emits --spec-type draft-mtp --spec-draft-n-max 6; toggle
OFF emits zero --spec-* flags on the same MTP GGUF.
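
The ON/OFF behaviour verified above can be modelled as a small
flag-building helper. This is a sketch: spec_flags and its parameters
are hypothetical, and only the two flag names and the n=6 value come
from the verification run.

```python
from typing import List, Optional


def spec_flags(
    toggle_enabled: bool,
    is_mtp_repo: bool,
    draft_n_max: int = 6,
) -> List[str]:
    """Hypothetical model of which --spec-* flags reach llama-server."""
    # Toggle OFF forces speculative_type to "off" in the /load request;
    # toggle ON leaves it unset so the backend can auto-promote.
    speculative_type: Optional[str] = None if toggle_enabled else "off"

    if speculative_type == "off" or not is_mtp_repo:
        # "off" skips both the auto-promotion and the spec-emit branch.
        return []
    return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]


flags_on = spec_flags(toggle_enabled=True, is_mtp_repo=True)
flags_off = spec_flags(toggle_enabled=False, is_mtp_repo=True)
print(flags_on)
print(flags_off)
```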

* studio: relocate Speculative Decoding toggle into Model Settings

Move the toggle out from under Max Tokens and back into the Model
Settings section, directly beneath KV Cache Dtype, where the existing
Apply/Reset workflow already drives a reload on dirty. Flipping the
switch now takes effect: the section becomes dirty, and Apply re-runs
/load with the new speculative_type.

Drop the !currentModelIsMultimodal gate so vision MTP GGUFs can also
disable speculative decoding from the UI.

Switch the toggle's off-value from null to "off" so the backend's "off"
short-circuit fires for MTP models too (null normalises to None, which
re-triggers the draft-mtp auto-promotion).
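
The difference between sending null and sending "off" can be sketched
as follows. normalize_speculative_type and resolve_spec_type are
hypothetical names used for illustration, not the backend's actual
helpers.

```python
from typing import Optional


def normalize_speculative_type(value: Optional[str]) -> Optional[str]:
    """Hypothetical normaliser: JSON null arrives as None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def resolve_spec_type(requested: Optional[str], is_mtp_repo: bool) -> Optional[str]:
    """Return the speculative type to emit, or None for no --spec-* flags."""
    spec = normalize_speculative_type(requested)
    if spec == "off":
        # Explicit opt-out: short-circuit before any auto-promotion.
        return None
    if spec is None and is_mtp_repo:
        # Unset value: auto-promotion turns draft-mtp back on.
        return "draft-mtp"
    return spec


print(resolve_spec_type(None, is_mtp_repo=True))   # null re-enables draft-mtp
print(resolve_spec_type("off", is_mtp_repo=True))  # "off" really disables it
```

Under this model a null off-value cannot disable speculative decoding
on an MTP repo, which is why the toggle has to send the literal "off".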

Tooltip now reads "Faster generation with 0% accuracy hit".

Remove the now-redundant speculativeDecodingEnabled bool + setter from
the runtime store and the load-time override in use-chat-model-runtime;
the toggle binds directly to speculativeType.

* studio: restore OOM/TIGHT badge on recommended GGUF rows

The recommended-list row passed vramStatus=null for any GGUF repo
because the existing useRecommendedModelVram hook reads safetensors
totals from HF model info, which GGUF-only repos do not expose. As a
result, an OOM Q-quant repo would render with only a "GGUF" badge and
no visual signal that nothing in it fits.

Add useGgufRecommendedFit: per repo, fetch the variant list via the
existing /api/models/gguf-variants endpoint, take the smallest
variant's size_bytes, and classify with the same 0.7*GPU + 0.7*RAM
thresholds as GgufVariantExpander. Session-scoped cache + in-flight
dedup so a repo is requested at most once.
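
A minimal sketch of the classification step, under a loudly stated
assumption: the message only names "0.7*GPU + 0.7*RAM thresholds", so
the way the two budgets combine below (GPU budget first, then GPU plus
RAM) and the FITS label are guesses, and classify_fit / repo_fit are
hypothetical names. Caching and in-flight dedup are left out.

```python
from typing import Dict, List

GIB = 1024 ** 3


def classify_fit(size_bytes: int, gpu_bytes: int, ram_bytes: int,
                 factor: float = 0.7) -> str:
    """Assumed rule: FITS within 0.7*GPU, TIGHT within 0.7*GPU + 0.7*RAM."""
    gpu_budget = factor * gpu_bytes
    total_budget = gpu_budget + factor * ram_bytes
    if size_bytes <= gpu_budget:
        return "FITS"
    if size_bytes <= total_budget:
        return "TIGHT"
    return "OOM"


def repo_fit(variants: List[Dict[str, int]], gpu_bytes: int, ram_bytes: int) -> str:
    """Classify a repo by its smallest variant, as the hook description says."""
    smallest = min(v["size_bytes"] for v in variants)
    return classify_fit(smallest, gpu_bytes, ram_bytes)


# Made-up sizes for illustration only.
variants = [{"size_bytes": 20 * GIB}, {"size_bytes": 12 * GIB}]
status = repo_fit(variants, gpu_bytes=16 * GIB, ram_bytes=32 * GIB)
print(status)
```

Using the smallest variant means a repo is only flagged OOM when
nothing in it fits, which matches the badge's stated purpose.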

Wire the result into the three GGUF row sites in pickers.tsx so OOM
and TIGHT badges show on the collapsed cards.

* Revert "studio: restore OOM/TIGHT badge on recommended GGUF rows"

This reverts commit 07793b1240df72b13e51d6dc15f63c4ee8c6cba9.

The new useGgufRecommendedFit hook was treating the symptom. PR #5561
identified the real root cause: useGpuInfo was calling /api/system
with plain fetch instead of authFetch, so the session-auth check
failed silently and gpu.available stayed false everywhere. With no
GPU info, every fit check (variant expander, recommended carousel)
fell back to "no signal" and dropped the OOM/TIGHT badges.

Revert the over-engineered hook here; the next commit applies the
authFetch fix, which restores the existing badges with one line.

* chore: replace qwen suggested with MTP variant

* fix: restore GPU info auth for GGUF fit badges

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
2026-05-18 08:42:55 -07:00
Name | Last commit | Date
public | Polish/cloud to providers (#5450) | 2026-05-15 19:29:21 +04:00
src | studio: engage draft-mtp on vision MTP GGUFs (drop incorrect vision gate) (#5560) | 2026-05-18 08:42:55 -07:00
.gitignore | perf(studio): upgrade to Vite 8 + auto-install bun for faster frontend builds (#4522) | 2026-03-25 04:27:41 -07:00
.gitkeep | add studio root folder | 2026-02-02 09:14:35 +00:00
.npmrc | security: NOT affected by Mini Shai-Hulud (May-12 wave) -- forward-looking hardening only (#5397) | 2026-05-13 04:58:12 -07:00
biome.json | feat: add seed dataset support with configuration, preview, and builder utilities | 2026-02-14 18:44:38 +01:00
components.json | add studio root folder | 2026-02-02 09:14:35 +00:00
data-designer.openapi (1).yaml | save and import, and fixes | 2026-02-04 14:32:49 +01:00
eslint.config.js | Final cleanup | 2026-03-12 18:28:04 +00:00
index.html | Final cleanup | 2026-03-12 18:28:04 +00:00
package-lock.json | Add OpenDocument chat attachments (#5510) | 2026-05-18 03:41:24 -07:00
package.json | Add OpenDocument chat attachments (#5510) | 2026-05-18 03:41:24 -07:00
tsconfig.app.json | Relax frontend unused local check (#4388) | 2026-03-17 16:04:11 -07:00
tsconfig.json | cleanup | 2026-02-04 13:28:39 +01:00
tsconfig.node.json | cleanup | 2026-02-04 13:28:39 +01:00
vite.config.ts | Fix Install commands for Windows + 1 line installs (#4447) | 2026-03-19 02:09:09 -07:00