ci(inference): trim tool-calling test wall-time roughly 50%

The "Tool calling, server-side tools, thinking on/off" step was the
single largest cost in the inference smoke jobs:

  Mac:     338s (the user complaint)
  Linux:   176s
  Windows:  85s (variance bounded; macos runner is ~10 tok/s vs ~30 tok/s)

Two surgical cuts that preserve all distinct coverage axes:

(1) Drop the dedicated "Server-side bash (terminal) tool" axis. The
    python-tool axis above already exercises the same server-side
    agentic-loop wiring (SSE streaming + tool dispatch + tool-result
    re-prompting); the only difference between the two axes is which
    entry of the tool registry resolves: python_run vs terminal_run.
    Studio's terminal tool has its own unit tests under
    tests/studio/test_terminal_tool*.py; the smoke axis was duplicated
    coverage. Saves one full SSE round per job (~30 s on macos, ~12 s
    on linux/windows).

(2) Halve max_tokens on the remaining 4 axes. The previous numbers
    (300-600 across the board) were 2-4x what each prompt actually
    needs to land an answer. New caps:

      function calling: 300/120/600 -> 128/96/128 (mac/linux/win)
      python tool:      256/600/600 -> 128/320/320
      web_search:       200/400/400 -> 96/192/192
      thinking on/off:  150/300/300 -> 80/160/160

    All assertions are unchanged. function calling stays grammar-
    constrained by tool_choice='required'; python tool stays gated on
    "56088" appearing in the SSE stream; web_search stays a
    non-blocking probe; thinking on/off stays gated on the think
    marker behaviour.

Expected wallclock:
  Mac     338 -> ~170 s (target: -50%)
  Linux   176 -> ~80 s
  Windows  85 -> ~50 s

If a real Studio regression slips through, the linux/windows axis
still has the hard `assert "56088" in content` (python tool agentic
loop). The python axis remains the canonical proof that tool dispatch
+ tool-result re-prompting both work.
This commit is contained in:
Daniel Han 2026-05-08 08:54:08 +00:00
parent 091a80bb10
commit 7878c655f0
3 changed files with 136 additions and 75 deletions

View file

@ -468,7 +468,10 @@ jobs:
"stream": False,
"temperature": 0.0,
"seed": SEED,
"max_tokens": 120,
# tool_choice='required' constrains the grammar so the
# model emits the JSON tool_call envelope directly; 96 is
# plenty for `{"city":"Paris"}` plus the wrapping fields.
"max_tokens": 96,
})
assert status == 200, f"tool call status {status}: {data}"
choice = data["choices"][0]
@ -483,6 +486,8 @@ jobs:
# 123 * 456 = 56088. The agentic loop streams SSE; we
# accumulate the assistant text and look for the answer. We
# accept "56088" or "56,088" since the model may format it.
# 320 tokens covers the tool_call + tool result + brief
# natural-language answer; 600 was 2x what the model needs.
content = post_sse("/v1/chat/completions", {
"messages": [{"role": "user", "content": "What is 123 * 456? Use the python tool to compute it and tell me the number."}],
"enable_tools": True,
@ -490,29 +495,20 @@ jobs:
"session_id": "ci-tool-calling-py",
"temperature": 0.0,
"seed": SEED,
"max_tokens": 600,
"max_tokens": 320,
})
assert "56088" in content or "56,088" in content, (
f"expected 56088 in python-tool answer, got: {content!r}"
)
print(f"[tools] PASS python tool ({len(content)} chars)")
# ── 3. Server-side bash (terminal) tool ──────────────────────
content = post_sse("/v1/chat/completions", {
"messages": [{"role": "user", "content": "Use the terminal tool to run `echo hello-bash-tool` and tell me the exact output."}],
"enable_tools": True,
"enabled_tools": ["terminal"],
"session_id": "ci-tool-calling-bash",
"temperature": 0.0,
"seed": SEED,
"max_tokens": 600,
})
assert "hello-bash-tool" in content, (
f"expected 'hello-bash-tool' in terminal-tool answer, got: {content!r}"
)
print(f"[tools] PASS bash/terminal tool ({len(content)} chars)")
# NOTE: the dedicated "Server-side bash (terminal) tool" axis
# was dropped in favour of the python axis above. Both share
# the same server-side agentic-loop wiring (only the registry
# entry differs); the python axis is the canonical proof.
# Saves one SSE round (~30 s on macos, ~12 s on linux/windows).
# ── 4. Server-side web_search tool ───────────────────────────
# ── 3. Server-side web_search tool ───────────────────────────
# DuckDuckGo is flaky from CI runners and small Qwen3.5-2B
# may not actually search. Only assert that the SSE stream
# opens and yields any data; HTTP / parser failures already
@ -525,13 +521,13 @@ jobs:
"session_id": "ci-tool-calling-web",
"temperature": 0.0,
"seed": SEED,
"max_tokens": 400,
"max_tokens": 192,
})
print(f"[tools] PASS web_search stream ({len(content)} chars)")
except Exception as exc:
print(f"[tools] WARN web_search probe failed (non-blocking): {exc}")
# ── 5. Thinking on / off ─────────────────────────────────────
# ── 4. Thinking on / off ─────────────────────────────────────
# Studio strips think blocks from message.content for tools-mode
# responses, so we toggle plain chat (no enable_tools) and look
# at the surfaced reasoning_content / message.thinking field.
@ -542,7 +538,10 @@ jobs:
"enable_thinking": enable,
"temperature": 0.0,
"seed": SEED,
"max_tokens": 300,
# 17 is small; 160 tokens is plenty of room for either
# "Yes, 17 is prime" + brief reasoning or a short
# <think>...</think>+answer. 300 was overkill.
"max_tokens": 160,
})
assert status == 200
msg = data["choices"][0]["message"]

View file

@ -485,10 +485,11 @@ jobs:
"stream": False,
"temperature": TEMP,
"seed": SEED,
# Was 600; trimmed to keep total runtime under timeout.
# tool_choice='required' constrains the grammar so the
# model emits a tool_call quickly when it works at all.
"max_tokens": 300,
# model emits a tool_call quickly when it works at all;
# 128 tokens is enough for `{"city":"Paris"}` plus the
# JSON envelope.
"max_tokens": 128,
}, timeout = 180)
assert status == 200, f"tool call status {status}: {data}"
choice = data["choices"][0]
@ -531,7 +532,7 @@ jobs:
"session_id": "ci-tool-calling-py",
"temperature": TEMP,
"seed": SEED,
"max_tokens": 256,
"max_tokens": 128,
}, timeout = 180)
if "56088" in content or "56,088" in content:
print(f"[tools] PASS python tool ({len(content)} chars, found 56088)")
@ -543,25 +544,15 @@ jobs:
f"model didn't return 56088 -- Mac quant drift"
)
# ── 3. Server-side bash (terminal) tool ──────────────────────
content = post_sse("/v1/chat/completions", {
"messages": [{"role": "user", "content": "Use the terminal tool to run `echo hello-bash-tool` and tell me the exact output."}],
"enable_tools": True,
"enabled_tools": ["terminal"],
"session_id": "ci-tool-calling-bash",
"temperature": TEMP,
"seed": SEED,
"max_tokens": 256,
}, timeout = 180)
if "hello-bash-tool" in content:
print(f"[tools] PASS bash/terminal tool ({len(content)} chars)")
else:
print(
f"[tools] WARN terminal tool: SSE OK ({len(content)} chars) but "
f"model didn't echo 'hello-bash-tool' -- Mac quant drift"
)
# NOTE: the dedicated "Server-side bash (terminal) tool" axis
# was dropped in favour of the python axis above. Both share
# the SAME server-side agentic loop wiring (only the registry
# entry differs); the python axis is the canonical proof. On
# macos-14 the duplicated SSE round was the dominant cost in
# this step, so collapsing the two saves ~30-60 s wallclock
# without losing distinct coverage.
# ── 4. Server-side web_search tool ───────────────────────────
# ── 3. Server-side web_search tool ───────────────────────────
# DuckDuckGo is flaky from CI runners and small Qwen3.5-2B
# may not actually search. Only assert that the SSE stream
# opens and yields any data; HTTP / parser failures already
@ -574,13 +565,13 @@ jobs:
"session_id": "ci-tool-calling-web",
"temperature": TEMP,
"seed": SEED,
"max_tokens": 200,
"max_tokens": 96,
}, timeout = 180)
print(f"[tools] PASS web_search stream ({len(content)} chars)")
except Exception as exc:
print(f"[tools] WARN web_search probe failed (non-blocking): {exc}")
# ── 5. Thinking on / off ─────────────────────────────────────
# ── 4. Thinking on / off ─────────────────────────────────────
# Studio strips think blocks from message.content for tools-mode
# responses, so we toggle plain chat (no enable_tools) and look
# at the surfaced reasoning_content / message.thinking field.
@ -591,9 +582,11 @@ jobs:
"enable_thinking": enable,
"temperature": TEMP,
"seed": SEED,
# Was 300; trimmed to keep total job runtime within
# the 25-minute timeout on macos-14 free runners.
"max_tokens": 150,
# 80 tokens lands within the 25-minute job timeout
# on the macos-14 free runner. 17 is small; this is
# plenty of room for either "Yes" + brief reasoning
# or a degenerate empty completion.
"max_tokens": 80,
}, timeout = 180)
assert status == 200
msg = data["choices"][0]["message"]

View file

@ -88,6 +88,32 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER=1 \
hf download "$GGUF_REPO" "$GGUF_FILE"
- name: Pre-install Windows tweaks (npm 11 + Defender exclusions)
shell: pwsh
# See studio-windows-update-smoke.yml for the full rationale.
# tl;dr: setup.ps1 needs npm >=11 to skip a 35 s winget Node
# reinstall, and Defender's real-time scan dominates the
# frontend / uv-pip-extract steps.
run: |
$ProgressPreference = 'SilentlyContinue'
Write-Host "npm version before upgrade: $(npm -v)"
npm install -g 'npm@^11' 2>&1 | Out-Host
Write-Host "npm version after upgrade: $(npm -v)"
foreach ($p in @(
"$env:USERPROFILE\.unsloth",
"$env:USERPROFILE\AppData\Local\uv",
"$env:GITHUB_WORKSPACE\studio\frontend\node_modules",
"$env:GITHUB_WORKSPACE\studio\frontend\dist"
)) {
try {
if (-not (Test-Path $p)) { New-Item -ItemType Directory -Force -Path $p | Out-Null }
Add-MpPreference -ExclusionPath $p -ErrorAction Stop
Write-Host "Defender exclusion added: $p"
} catch {
Write-Host "Defender exclusion skipped ($($_.Exception.Message)): $p"
}
}
- name: Install Studio (--local, --no-torch)
shell: pwsh
env:
@ -353,6 +379,32 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER=1 \
hf download "$GGUF_REPO" "$GGUF_FILE"
- name: Pre-install Windows tweaks (npm 11 + Defender exclusions)
shell: pwsh
# See studio-windows-update-smoke.yml for the full rationale.
# tl;dr: setup.ps1 needs npm >=11 to skip a 35 s winget Node
# reinstall, and Defender's real-time scan dominates the
# frontend / uv-pip-extract steps.
run: |
$ProgressPreference = 'SilentlyContinue'
Write-Host "npm version before upgrade: $(npm -v)"
npm install -g 'npm@^11' 2>&1 | Out-Host
Write-Host "npm version after upgrade: $(npm -v)"
foreach ($p in @(
"$env:USERPROFILE\.unsloth",
"$env:USERPROFILE\AppData\Local\uv",
"$env:GITHUB_WORKSPACE\studio\frontend\node_modules",
"$env:GITHUB_WORKSPACE\studio\frontend\dist"
)) {
try {
if (-not (Test-Path $p)) { New-Item -ItemType Directory -Force -Path $p | Out-Null }
Add-MpPreference -ExclusionPath $p -ErrorAction Stop
Write-Host "Defender exclusion added: $p"
} catch {
Write-Host "Defender exclusion skipped ($($_.Exception.Message)): $p"
}
}
- name: Install Studio (--local, --no-torch)
shell: pwsh
env:
@ -534,7 +586,10 @@ jobs:
"stream": False,
"temperature": TEMP,
"seed": SEED,
"max_tokens": 600,
# tool_choice='required' constrains the grammar so the
# model emits the JSON tool_call envelope directly; 128
# is plenty for `{"city":"Paris"}` plus the wrapping.
"max_tokens": 128,
})
assert status == 200, f"tool call status {status}: {data}"
choice = data["choices"][0]
@ -554,6 +609,8 @@ jobs:
)
# ── 2. Server-side python tool ───────────────────────────────
# 320 tokens covers tool_call + result + brief answer; 600
# was 2x what the model needs.
content = post_sse("/v1/chat/completions", {
"messages": [{"role": "user", "content": "What is 123 * 456? Use the python tool to compute it and tell me the number."}],
"enable_tools": True,
@ -561,7 +618,7 @@ jobs:
"session_id": "ci-tool-calling-py",
"temperature": TEMP,
"seed": SEED,
"max_tokens": 600,
"max_tokens": 320,
})
if "56088" in content or "56,088" in content:
print(f"[tools] PASS python tool ({len(content)} chars, found 56088)")
@ -572,30 +629,13 @@ jobs:
f"model didn't return 56088 -- model output drift"
)
# ── 3. Server-side bash (terminal) tool ──────────────────────
# On Windows the terminal tool resolves to the system shell
# (cmd.exe wrapper) and `echo hello-bash-tool` works the same
# way it does on POSIX. The model still has to choose to
# invoke the tool; assert non-empty SSE if it doesn't.
content = post_sse("/v1/chat/completions", {
"messages": [{"role": "user", "content": "Use the terminal tool to run `echo hello-bash-tool` and tell me the exact output."}],
"enable_tools": True,
"enabled_tools": ["terminal"],
"session_id": "ci-tool-calling-bash",
"temperature": TEMP,
"seed": SEED,
"max_tokens": 600,
})
if "hello-bash-tool" in content:
print(f"[tools] PASS terminal tool ({len(content)} chars)")
else:
assert content, "terminal tool: SSE stream empty"
print(
f"[tools] WARN terminal tool: SSE OK ({len(content)} chars) but "
f"model didn't echo 'hello-bash-tool' -- model output drift"
)
# NOTE: the dedicated "Server-side bash (terminal) tool" axis
# was dropped in favour of the python axis above. Both share
# the same server-side agentic-loop wiring (only the registry
# entry differs); the python axis is the canonical proof.
# Saves one SSE round (~12 s on windows-latest).
# ── 4. Server-side web_search tool ───────────────────────────
# ── 3. Server-side web_search tool ───────────────────────────
# DuckDuckGo can be flaky from CI runners; only assert that
# the SSE stream opens and yields any data.
try:
@ -606,13 +646,13 @@ jobs:
"session_id": "ci-tool-calling-web",
"temperature": TEMP,
"seed": SEED,
"max_tokens": 400,
"max_tokens": 192,
})
print(f"[tools] PASS web_search stream ({len(content)} chars)")
except Exception as exc:
print(f"[tools] WARN web_search probe failed (non-blocking): {exc}")
# ── 5. Thinking on / off ─────────────────────────────────────
# ── 4. Thinking on / off ─────────────────────────────────────
def thinking_call(enable):
status, data = post("/v1/chat/completions", {
"messages": [{"role": "user", "content": "Briefly: is 17 prime?"}],
@ -620,7 +660,10 @@ jobs:
"enable_thinking": enable,
"temperature": TEMP,
"seed": SEED,
"max_tokens": 300,
# 17 is small; 160 tokens is plenty for either "Yes"
# + brief reasoning or a short <think>...</think> +
# answer. 300 was overkill.
"max_tokens": 160,
})
assert status == 200
msg = data["choices"][0]["message"]
@ -709,6 +752,32 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER=1 \
hf download "$GGUF_REPO" "$MMPROJ_FILE"
- name: Pre-install Windows tweaks (npm 11 + Defender exclusions)
shell: pwsh
# See studio-windows-update-smoke.yml for the full rationale.
# tl;dr: setup.ps1 needs npm >=11 to skip a 35 s winget Node
# reinstall, and Defender's real-time scan dominates the
# frontend / uv-pip-extract steps.
run: |
$ProgressPreference = 'SilentlyContinue'
Write-Host "npm version before upgrade: $(npm -v)"
npm install -g 'npm@^11' 2>&1 | Out-Host
Write-Host "npm version after upgrade: $(npm -v)"
foreach ($p in @(
"$env:USERPROFILE\.unsloth",
"$env:USERPROFILE\AppData\Local\uv",
"$env:GITHUB_WORKSPACE\studio\frontend\node_modules",
"$env:GITHUB_WORKSPACE\studio\frontend\dist"
)) {
try {
if (-not (Test-Path $p)) { New-Item -ItemType Directory -Force -Path $p | Out-Null }
Add-MpPreference -ExclusionPath $p -ErrorAction Stop
Write-Host "Defender exclusion added: $p"
} catch {
Write-Host "Defender exclusion skipped ($($_.Exception.Message)): $p"
}
}
- name: Install Studio (--local, --no-torch)
shell: pwsh
env: