Commit graph

8 commits

Author SHA1 Message Date
Daniel Han
eb8b0dee2e
Studio: make stop button actually stop generation (#5069)
* Studio: make stop button actually stop generation

The UI stop button routes through assistant-ui's cancelRun, which aborts
the frontend fetch. Four issues combined to let llama-server keep decoding
long after the user clicked stop:

1. request.is_disconnected() does not fire reliably behind proxies
   (e.g. Colab) that don't propagate fetch aborts.
2. llama-server defaults n_predict to n_ctx when max_tokens is not sent,
   so a cancelled request keeps producing tokens up to 262144.
3. The httpx.Client pool keeps TCP keep-alive, so even a cleanly closed
   stream reuses the same connection and llama-server's liveness poll
   never sees a disconnect.
4. No explicit backend route to cancel - every cancel path relied on
   is_disconnected.

Changes:
- Add POST /api/inference/cancel keyed by session_id/completion_id, with
  a registry populated for the lifetime of each streaming response.
- Have the frontend (chat-adapter.ts) POST /inference/cancel on
  AbortController abort, alongside the existing fetch teardown.
- Send max_tokens=4096 + t_max_predict_ms=120000 as defaults on every
  outbound chat completion to llama-server; honoured by user overrides.
- Disable httpx keep-alive on the streaming client so connection close
  reaches llama-server and its 1s liveness check fires.

No behaviour changes for non-streaming paths or for existing callers
that already pass max_tokens/session_id.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: harden stop-button cancel path and scope cancel route

- Require at least one identifier for /api/inference/cancel so a missing
  thread id cannot silently cancel every in-flight generation.
- Scope /cancel to a dedicated studio_router so it is not exposed under
  the /v1 OpenAI-compat prefix as a surprise endpoint.
- Store a set of cancel events per key in _CANCEL_REGISTRY so concurrent
  requests on the same session_id do not overwrite each other, and
  deduplicate in _cancel_by_keys so the cancelled count reflects unique
  requests.
- Always send session_id with chat completions (not only when tools are
  enabled) so non-tool GGUF streams register under it and are reachable
  from /cancel.
- Register the non-GGUF stream_chunks path in the cancel registry too,
  so transformers-based stop-button works behind proxies that swallow
  fetch aborts.
- Only apply the 2-minute t_max_predict_ms wall-clock cap when the
  caller did not pass max_tokens, so legitimate long generations on
  slow CPU/macOS/Windows supported installs are not silently truncated.
- Remove the abort listener on normal stream completion so reused
  AbortSignals cannot fire a spurious cancel POST after the fact.

* studio: close cancel-race and stale-cancel gaps in stop path

- Register the cancel tracker before returning StreamingResponse so a
  stop POST that arrives during prefill / warmup / proxy buffering
  finds an entry in _CANCEL_REGISTRY. Cleanup now runs via a Starlette
  BackgroundTask instead of a finally inside the async generator body.
- Add a per-run cancel_id on the frontend (crypto.randomUUID) and in
  ChatCompletionRequest so /api/inference/cancel matches one specific
  generation. Removes the stale-cancel bug where pressing stop then
  starting a new run in the same thread would cancel the retry.
- Apply t_max_predict_ms unconditionally in all three llama-server
  payload builders (previously gated on max_tokens=None, which made it
  dead code for UI callers that always send params.maxTokens). Raise
  the default to 10 minutes so slow CPU / macOS / Windows installs are
  not cut off mid-generation.
- Make _cancel_by_keys refuse empty input (return 0) so a future
  internal caller can not accidentally mass-cancel every in-flight
  request.
- Accept cancel_id (primary), session_id, and completion_id on the
  /api/inference/cancel route. Unify the three streaming sites on the
  same _cancel_keys / _tracker variable names.
- Annotate _CANCEL_REGISTRY as dict[str, set[threading.Event]].

* Add review tests for PR #5069

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: harden stop-button cancel semantics and wall-clock cap

- Make /inference/cancel match cancel_id EXCLUSIVELY when supplied.
  Previously the handler iterated ('cancel_id','session_id','completion_id')
  and unioned matches, so a stale cancel POST carrying {cancel_id:old,
  session_id:thr} would still cancel a later run on the same thread via
  the shared session_id. cancel_id is now a per-run exclusive key;
  session_id / completion_id are only used as fallbacks when cancel_id
  is absent.

- Close the early-cancel race. If /inference/cancel lands before the
  streaming handler reaches _TrackedCancel.__enter__() (stop clicked
  during prefill / warmup / proxy buffering), the cancel was silently
  dropped. Stash unmatched cancel_ids in _PENDING_CANCELS with a 30 s
  TTL; _TrackedCancel.__enter__() now replays any matching pending
  cancel by set()-ing the event immediately after registration.

- Make t_max_predict_ms = _DEFAULT_T_MAX_PREDICT_MS conditional on
  max_tokens is None at all three llama-server payload sites. The cap
  is a safety net for callers who leave max_tokens unset (otherwise
  llama-server defaults n_predict to n_ctx, up to 262144). Callers who
  set an explicit max_tokens are already self-limiting and must not be
  silently truncated at 10 minutes on slow CPU / macOS / Windows
  legitimate long generations.

- Guard each StreamingResponse return with try/except BaseException so
  _tracker.__exit__ runs even if StreamingResponse construction or any
  preceding statement raises between _tracker.__enter__() and the
  BackgroundTask attachment. Prevents a registry leak on that narrow
  window.

* studio: close TOCTOU race and restore wall-clock backstop on UI path

- Close TOCTOU race in the pending-cancel mechanism. The previous fix
  split cancel_inference's (cancel_by_keys + remember_pending_cancel)
  and _TrackedCancel.__enter__'s (register + consume_pending) into
  four separate lock acquisitions. Under contention a cancel POST
  could acquire-then-release the lock, find the registry empty, and
  stash ONLY AFTER __enter__ had already registered and consumed an
  empty pending map -- silently dropping the cancel. Both call sites
  now do their work inside a single _CANCEL_LOCK critical section, via
  the new atomic helper _cancel_by_cancel_id_or_stash() and an
  inlined consume-pending step in __enter__. Reproduced the race under
  forced interleaving pre-fix; 0/2000 drops post-fix under parallel
  stress.

- Apply t_max_predict_ms UNCONDITIONALLY at all three llama-server
  payload sites. The previous iteration gated the cap on
  `max_tokens is None`, which turned out to be dead code on the
  primary Studio UI path: chat-adapter.ts sets
  maxTokens=loadResp.context_length after every model load, so every
  chat request carries an explicit max_tokens and the wall-clock
  safety net never fired. The cap's original purpose is to bound
  stuck decodes regardless of the token budget; it must always apply.

- Raise _DEFAULT_T_MAX_PREDICT_MS from 10 minutes to 1 hour. 10
  minutes was too aggressive for legitimate slow-CPU chat responses
  (a 4096-token reply at 2 tok/s takes ~34 min); 1 hour accommodates
  that and still catches genuine zombie decodes.

- Prune _PENDING_CANCELS inside _cancel_by_keys as well, so stashed
  entries expire proportionally to overall cancel traffic rather than
  only to cancel_id-specific POSTs.

* studio: trim verbose comments and docstrings in cancel path

* studio/llama_cpp: drop upstream PR hashes from benchmark comment

* Add review tests for Studio stop button

* Consolidate review tests for Studio stop button

* Align cancel-route test with exclusive cancel_id semantics

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: move cancel cleanup to generator finally; drop dead helper

- Move _tracker.__exit__ from Starlette BackgroundTask into each
  streaming generator's finally block. Starlette skips the background
  callback when stream_response raises (OSError / ClientDisconnect),
  which leaked _CANCEL_REGISTRY entries on abrupt disconnect.
- Check cancel_event.is_set() at the top of each GGUF while loop so a
  pending-replay cancel falls through to final_chunk + [DONE] instead
  of propagating GeneratorExit out of _stream_with_retry.
- Remove unused _remember_pending_cancel; _cancel_by_cancel_id_or_stash
  superseded it.

* Add review tests for Studio stop-button

* studio: wire audio-input stream into cancel registry

- Register cancel_event with _TrackedCancel on the audio-input streaming
  path so POST /api/inference/cancel can stop whisper / audio-input GGUF
  runs. Previously the registry stayed empty on this branch, so the stop
  button returned {"cancelled":0} and the decode ran to completion.
- Apply the same finally-based cleanup and pre-iteration cancel-event
  check used on the other three streaming paths.
- Update the _CANCEL_REGISTRY block comment to list cancel_id as the
  primary key (was stale "session_id preferred").

* Consolidate review tests for Studio stop-button cancel flow

- Merge the 6 behavioral tests from test_stream_cleanup_on_disconnect.py
  (finally cleanup on normal/exception/aclose, pre-set cancel_event
  pattern, and its regressions) into test_stream_cancel_registration_timing.py,
  which is the PR's existing file covering the same area.
- Extend structural invariants to include audio_input_stream alongside the
  three GGUF / Unsloth streaming generators: no _tracker.__enter__ inside
  the async gen body, cleanup via try/finally, no background= on
  StreamingResponse.
- Delete test_stream_cleanup_on_disconnect.py (now empty).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: make cancel-via-POST interrupt Unsloth and audio-input streams

Close two remaining gaps in the stop-button cancellation wiring:

- stream_chunks (Unsloth path): add a top-of-loop cancel_event check and
  call backend.reset_generation_state() so cancel POSTs flush GPU state
  and close the SSE cleanly instead of relying on request.is_disconnected
  (which does not fire through proxies like Colab's).
- audio_input_stream: run the synchronous audio_input_generate() via
  asyncio.to_thread so blocking whisper chunks do not freeze the event
  loop, matching the pattern already used by the GGUF streaming paths.

* Add review tests for Studio stop-button cancel flow

* Consolidate review tests for Studio stop-button cancel flow

- Delete standalone test_cancel_registry.py at repo root: tests duplicated
  test_cancel_atomicity.py / test_cancel_id_wiring.py and re-implemented
  registry primitives inline (scaffolding).
- Extend tests/studio/test_stream_cancel_registration_timing.py with
  regression guards for the iter-1 cancel-loop fixes:
    structural: each streaming generator checks cancel_event in its loop;
                audio_input_stream offloads next() via asyncio.to_thread;
                stream_chunks cancel branch calls reset_generation_state().
    runtime:    Unsloth loop breaks on external cancel and resets state;
                audio loop stays responsive under blocking next();
                both loops emit zero tokens on pre-set cancel (replay path).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: extend stop-path to passthrough streams; tighten wall-clock cap

- Lower _DEFAULT_T_MAX_PREDICT_MS from 1 hour to 10 minutes so the
  wall-clock backstop actually bounds runaway decodes when cancel
  signaling fails.
- Wire _TrackedCancel and cancel_event.is_set() into
  _openai_passthrough_stream and _anthropic_passthrough_stream and
  disable httpx keepalive so stop requests from /v1 and /v1/messages
  tool-calling clients reach llama-server.
- Apply t_max_predict_ms to the tool-passthrough request body so the
  backstop covers passthrough paths as well.
- Symmetric pre-registration stash for session_id/completion_id
  cancels (_cancel_by_keys_or_stash) so early cancels by those keys
  replay on later registration like cancel_id.
- Drop dead except BaseException guards around StreamingResponse()
  at four streaming sites; cleanup lives in the generator's finally.

* studio: harden cancel registry against ghost-cancel and leak paths

- Revert the session_id/completion_id stash in the fallback cancel
  helper. session_id is thread-scoped and reused across runs, so
  stashing it on an unmatched POST would fire cancel_event for the
  user's next unrelated request via _TrackedCancel.__enter__.
  cancel_id remains the only per-run unique key that gets stashed.
- Default max_tokens to _DEFAULT_MAX_TOKENS in the tool-passthrough
  body. Mirror the direct GGUF path so OpenAI/Anthropic passthrough
  callers who omit max_tokens get the same zombie-decode cap instead
  of relying on the wall-clock backstop alone.
- Wrap _openai_passthrough_stream setup with an outer try/except
  BaseException. The inner except httpx.RequestError does not catch
  asyncio.CancelledError at await client.send, which would otherwise
  leave _tracker registered in _CANCEL_REGISTRY indefinitely.
- Frontend stop POST uses plain fetch + manual Authorization header
  instead of authFetch. A 401 on the cancel POST no longer refreshes
  tokens or redirects the user to the login page mid-stop.

* Add review tests for Studio stop-button cancel flow

* studio: trim comments on stop-button review changes

Collapse multi-paragraph rationale blocks on the cancel registry,
_openai_passthrough_stream, and the frontend onAbortCancel handler
into one-line explanations of why the non-obvious behaviour exists.
Drop authFetch import that became unused when the cancel POST
switched to plain fetch.

* Consolidate review tests for Studio stop-button cancel flow

Move review-added tests out of test_cancel_dispatch_edges.py into the
existing PR test files that already cover the same areas:
- backend registry fan-out / exclusivity / idempotency / falsy-keys
  edge cases moved into tests/studio/test_cancel_atomicity.py
- frontend plain-fetch (not authFetch) + manual Authorization header
  moved into tests/studio/test_cancel_id_wiring.py
Delete the now-empty test_cancel_dispatch_edges.py.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: stop default-capping responses at 4096 tokens (follow-up to #5069) (#5174)

* Studio: stop default-capping responses at 4096 tokens

Follow-up to #5069. The 4096 default introduced for runaway-decode
defense silently truncates any caller that omits max_tokens. The
Studio chat UI sets params.maxTokens = loadResp.context_length after
a GGUF load, so it's fine, but every other consumer is not:

- OpenAI-API direct callers (/v1/chat/completions, /v1/responses,
  /v1/messages, /v1/completions) where the OpenAI default is
  effectively unlimited per response. langchain, llama-index, raw
  curl, and the openai SDK all rely on that.
- Reasoning models. Qwen3 / gpt-oss reasoning traces routinely exceed
  4096 tokens before the model emits a single visible content token.
  The user sees the trace cut off mid-thought.
- Long-form generation ("write a chapter", "produce a full SVG").

Reproduced on this branch: gemma-4-E2B-it-GGUF Q8_0, prompt asking
for a 10000-word story, no max_tokens in the request:

    finish_reason: stop  (misleading -- should be 'length')
    content_chars: 19772
    content_tail: ...'a comforting, yet immense, pressure.\n\n*"'

Body ended mid-sentence on a stray opening quote, right at the 4096
token mark.

After this patch the same request returns 38357 chars ending with
'...held in a perfect, dynamic equilibrium.' -- a natural stop, not
a truncation.

Implementation: rename the constant to _DEFAULT_MAX_TOKENS_FLOOR and
set it to 32768. Each call site now uses the model's effective
context length when known, falling back to the floor:

    default_cap = self._effective_context_length or _DEFAULT_MAX_TOKENS_FLOOR

The 10-minute t_max_predict_ms wall-clock backstop from #5069 is
preserved as the second line of defense.

Plumbed _build_passthrough_payload + _build_openai_passthrough_body
through the routes layer so the Anthropic and OpenAI passthrough
paths also respect the model's context length.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Studio: cancel passthrough streams during llama-server prefill + route through apiUrl for Tauri

Three reviewer-flagged correctness gaps in the stop-button mechanism.

1) `_openai_passthrough_stream` could not honor cancel during prefill.
   The cancel check ran inside the `async for raw_line in lines_iter`
   body, so a cancel POST that arrived before llama-server emitted the
   first SSE line was unobservable until prefill completed. With a long
   prompt under proxy/Colab conditions -- the exact target scenario for
   this PR -- that left the model decoding for a long time after the
   user clicked Stop. Add an asyncio watcher task that closes `resp` as
   soon as `cancel_event` is set, raising in `aiter_lines` so the
   generator can exit. The watcher polls a threading.Event because the
   cancel registry is keyed by threading.Event for the synchronous
   /cancel handler.

2) `_anthropic_passthrough_stream` had the same blocking-prefill pattern.
   Same fix.

3) The frontend's stop-button cancel POST used a bare relative
   `fetch("/api/inference/cancel", ...)`, which targets the webview
   origin in Tauri production builds (where the backend is at
   `http://127.0.0.1:8888`). Route through the existing `apiUrl()`
   helper from `lib/api-base.ts` to match every other Studio call.
   Browser/dev builds get the empty base, so behavior is unchanged
   there.

Verified via temp/pr_simulation/sim_5069_prefill_cancel.py: cancel
during prefill terminates within ~250ms on both passthrough paths
(was 145s+ on the Anthropic path before this change), and the standard
non-passthrough chat path still cancels with no regression.

* Studio: log cancel-body parse errors instead of silently swallowing

Reviewer-flagged defensive logging gap. The bare `except Exception: pass`
in `cancel_inference` would mask malformed payloads that hint at a buggy
client or a transport issue. Log at debug so future investigation isn't
left guessing whether `body={}` came from a missing body or a parse
failure. Behavior is unchanged: an unparseable body still falls through
to the empty-dict path and the cancel call returns `{"cancelled": 0}`.

* Studio: Anthropic passthrough cancel parity with OpenAI passthrough

Two reviewer-flagged consistency gaps in the cancel surface for
/v1/messages.

1) Anthropic passthrough did not register cancel_id, so a per-run cancel
   POST (the cleanest Studio-style cancel path) silently missed when
   the route hit `_anthropic_passthrough_stream`. The OpenAI passthrough
   has registered (cancel_id, session_id, completion_id) since this PR
   was first opened; mirror that here. Also add `cancel_id` to
   `AnthropicMessagesRequest` so the route handler can plumb it through.

2) The cancel handler's fallback key list checked only completion_id
   and session_id, never message_id. Anthropic clients that send their
   native `id` (returned in the SSE message_start event) for cancel had
   no way to hit the registry. Add message_id to the fallback list.

Verified via temp/pr_simulation/sim_5069_prefill_cancel.py: P2 now
cancels by cancel_id in 137ms (was hanging pre-fix), and the new P2b
case cancels by message_id in 77ms. P1 (OpenAI) and P3 (standard chat)
still pass with no regression.

---------

Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
2026-04-24 10:09:25 -07:00
Daniel Han
93a24f6698
Add ROCm test suite for PR #4720 (#4824)
95 Python tests and 23 shell tests covering ROCm detection,
torch index URL selection, hardware flags, prebuilt asset selection,
and install pathway logic. All tests use mocks -- no AMD hardware required.

Companion to #4720 (AMD ROCm/HIP support).
2026-04-11 04:44:13 -07:00
Daniel Han
8981e6c804
Update test_pr4562_bugfixes.py for simplified install policy (#4817)
- Add TestFetchJsonRetries for JSON retry logic and max_pages
- Update TestSourceCodePatterns for simplified --simple-policy flow
- Add tests for installed prebuilt release reporting
- Add test for CUDA toolkit version-sorted nvcc discovery
- Remove assertions for removed --resolve-install-tag / --resolve-source-build paths
2026-04-03 04:06:14 -07:00
DoubleMathew
7ae9b7f45f
fix windows llama.cpp compile from source issue (#4793)
* fix windows llama.cpp compile from source issue

* undo local repo usage

* fix llama.cpp install

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix windows

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: route resolve-source-build call through Invoke-LlamaHelper

The --resolve-source-build call at the source-build resolution path
was still calling install_llama_prebuilt.py directly instead of going
through Invoke-LlamaHelper. On PS7+ with ErrorActionPreference=Stop,
stderr from the 422 response (when tag is "master") would trigger a
terminating NativeCommandError and crash setup.

* fix: suppress stderr error records from Invoke-LlamaHelper

ErrorActionPreference=Continue prevents termination but PowerShell
still displays stderr lines as visible ErrorRecord objects. Capture
all output via 2>&1 and split stdout from stderr manually so that
stderr lines never appear on the console. When StderrPath is given
the stderr content is written to that file for diagnostics.

* fix: always rebuild llama.cpp on Windows when tag is master

When the requested llama.cpp tag is "master" (a moving target), skip
the "already built" early exit so the build path runs and syncs to
the latest commit. Without this, existing llama-server binaries from
an older build (e.g. b8635 which lacks Gemma 4 support) are reused
and model loading fails.

Pinned tags (e.g. b8635) still skip the rebuild when the binary
already exists, since the tag is immutable.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-04-02 11:43:46 -07:00
Daniel Han
b20efc370a
Add regression tests for custom llama prebuilt installer (#4772)
Expand test coverage for install_llama_prebuilt.py:
- Add tests for source build plan resolution with custom repos
- Add tests for branch/commit/PR ref matching and normalization
- Add tests for manifest checksum validation
- Add tests for Windows CUDA upstream asset name patterns
- Update capsys checks to capture stderr after log() redirect
2026-04-02 04:45:09 -07:00
DoubleMathew
71b934ef9d
Fix custom llama.cpp source builds and macos metal source builds (#4762)
* Fix script unbound variable error

* remove stale test script, add llama.cpp metal source builds, update tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Metal precedence, test sync, and add behavioral tests

- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
  decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
  for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
  rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
  existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
  Metal flag selection, nvcc precedence, and CPU fallback behavior

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Metal CPU fallback to also cover cmake build failures and update tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. _GPU_BACKEND_FRAGMENT synced -- removed dead CPU_FALLBACK_CMAKE_ARGS= init (6/8)
2. RPATH assertion replaced -- new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. _TRY_METAL_CPU_FALLBACK=false reset after both configure-failure and build-failure fallback branches in setup.sh (4/8)
4. macOS test now removes libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Empty-string tag test added -- test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. RPATH checks on cmake call logs -- both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests clean up

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 14:06:39 -05:00
Daniel Han
f84c2d03d3
Add installer test coverage for prebuilt llama.cpp changes (#4756)
Split out from #4741 to keep the main PR focused on installer logic.

- New test_install_llama_prebuilt_logic.py: tests for resolve logic,
  fallback behavior, env_int, busy/lock handling
- New test_validate_llama_prebuilt.py: validator tests for staged
  release_tag/upstream_tag handling
- New test_llama_pr_force_and_source.py: tests for PR_FORCE and
  LLAMA_SOURCE maintainer defaults
- Updated test_selection_logic.py: expanded selection/fallback coverage
- Updated test_pr4562_bugfixes.py: updated bugfix tests for new logic
- Updated smoke_test_llama_prebuilt.py: minor update
2026-04-01 06:06:29 -07:00
DoubleMathew
f4d8a246bf
Use prebuilt llama.cpp for unsloth studio setup (#4562)
* Use prebuilt llama.cpp for unsloth studio setup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix 3 issues that cause unnecessary fallback to source build

1. Make filelock import optional -- environments without filelock
   (e.g. minimal installs) crashed at import time instead of
   gracefully skipping the lock.

2. Use already-verified converter script from the hydrated source
   tree instead of re-downloading from raw.githubusercontent.com
   with no checksum. Adds symlink with copy fallback for the
   legacy filename.

3. Initialize $SkipPrebuiltInstall in setup.ps1 before first use
   to prevent potential uninitialized variable errors.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Keep network fallback in ensure_converter_scripts

Prefer the local verified copy from the hydrated source tree, but
retain the original network download as a fallback if the file is
missing. Create the legacy hyphenated filename as a symlink with a
copy fallback instead of writing a second full copy.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix 4 bugs in source-build fallback and binary_env paths

- setup.ps1: Replace git pull + checkout FETCH_HEAD with fetch + checkout -B
  to avoid detached HEAD state that breaks re-runs. Use pinned tag in both
  fetch and clone paths.
- setup.sh: Move rm -rf after cmake/git prerequisite checks so a missing
  tool no longer deletes the existing install. Add --branch tag to clone.
- install_llama_prebuilt.py: Add binary_path.parent to Linux LD_LIBRARY_PATH
  in binary_env() so bundled .so files in build/bin are found even without
  RPATH, matching the existing Windows PATH logic.
- Add test for binary_env LD_LIBRARY_PATH on Linux.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Handle unresolved "latest" tag in source-build fallback clone

When tag resolution fails and the requested tag is "latest", both
setup scripts now omit --branch from git clone so the default branch
is cloned instead of failing on a nonexistent "latest" branch/tag.
Similarly, the PS1 fetch path fetches the default ref when the tag
is "latest".

* Resolve actual latest ggml-org tag instead of using literal "latest"

When both Python tag resolution attempts fail and the requested tag
is "latest", query the GitHub API for the actual latest release tag
from ggml-org/llama.cpp (e.g. b8508) instead of passing the literal
string "latest" to git clone --branch, which would fail since no
such branch/tag exists.

setup.sh uses curl + python json parsing; setup.ps1 uses
Invoke-RestMethod. Both fall back to the raw requested tag if the
API call also fails.

* Try Unsloth release repo before ggml-org when resolving latest tag

When falling back to the GitHub API to resolve "latest", query the
Unsloth release repo (unslothai/llama.cpp) first since it has the
prebuilt binaries pinned to tested tags. Only fall back to
ggml-org/llama.cpp if the Unsloth repo query fails.

* Add comprehensive sandbox tests for PR #4562 bug fixes

35 tests covering all fixes across platforms:
- binary_env cross-platform (Linux LD_LIBRARY_PATH, Windows PATH,
  macOS DYLD_LIBRARY_PATH) with edge cases (dedup, ordering, existing paths)
- resolve_requested_llama_tag (concrete, latest, None, empty)
- setup.sh logic via subprocess: prereq check ordering (cmake/git missing
  preserves install), pinned tag in clone, fetch+checkout -B pattern,
  fetch failure warns instead of aborting
- "latest" tag resolution fallback chain (Unsloth API -> ggml-org ->
  raw) with mock curl: success, failure, malformed JSON, empty body,
  empty tag_name, env overrides
- Source code pattern verification for both .sh and .ps1 files

All 138 tests pass in isolated uv venv.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add binary_path.parent to macOS DYLD_LIBRARY_PATH in binary_env

macOS prebuilt .dylib files are overlaid into build/bin (same as
Linux), but binary_env only added install_dir to DYLD_LIBRARY_PATH.
Add binary_path.parent so the loader can find sibling dylibs even
without embedded loader paths.

Mirrors the existing fix for Linux LD_LIBRARY_PATH and the Windows
PATH pattern.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Guard --branch when resolved tag is "latest"; fix broken test assertion

When all API fallbacks fail and the tag stays as literal "latest",
omit --branch from git clone (clones default branch instead of
failing). Both setup.sh and setup.ps1 now check for "latest" before
passing --branch to git clone/fetch.

Also fix test_setup_ps1_clone_uses_branch_tag which used Python
tuple syntax (assert "x", "y" in z) that always passes. Changed to
assert "x" in z and "y" in z.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix macOS DYLD trailing colon, install_lock no-op, and debug log

- binary_env macOS: use dedupe_existing_dirs instead of raw string
  concatenation. Eliminates trailing colon in DYLD_LIBRARY_PATH
  (which causes dyld to search CWD for libraries) and deduplicates
  when binary_path.parent == install_dir. Now consistent with the
  Linux and Windows branches.
- install_lock: when filelock is not installed, use os.O_CREAT|O_EXCL
  as a fallback exclusive file lock with timeout, instead of yielding
  with no locking. Prevents concurrent installs from corrupting each
  other's staging directories.
- setup.ps1: remove [DEBUG] log line that printed to every user on
  every Windows setup run.

* Add stale-lock detection and atomic clone-then-swap

install_lock fallback (no filelock): write PID to lock file and
check if the holder process is still alive on contention. Dead PIDs
(ProcessLookupError) and unreadable lock files trigger immediate
cleanup. Live processes owned by other users (PermissionError) are
correctly recognized as alive -- the lock is not removed.

setup.sh/setup.ps1 source-build: clone into a temporary directory
first, then swap into place only on success. If git clone fails,
the existing install is preserved instead of being deleted by the
premature rm -rf.

* Remove redundant upstream_tag != release_tag check

load_approved_release_checksums compared checksums.upstream_tag
against the Unsloth release_tag, which are different namespaces
(upstream ggml-org tag vs Unsloth published tag). This only worked
because both happened to be "b8508" by convention. Would break if
Unsloth ever uses a different release naming scheme.

The existing check at parse_approved_release_checksums (line 950)
already validates the release_tag field correctly.

* Fix lock TOCTOU race and build-in-temp-dir swap

install_lock fallback: add os.fsync(fd) after writing PID to ensure
the PID is visible to racing processes before they check. Treat
empty lock files (PID not yet written) as "wait and retry" instead
of stale, closing the window where two processes could both see an
empty file, both unlink it, and both acquire the lock.

setup.sh/setup.ps1 source-build: clone AND build in a temp directory
(LLAMA_CPP_DIR.build.$$). Only swap into the final LLAMA_CPP_DIR
after the build succeeds. If clone or cmake or build fails, the temp
dir is cleaned up and the existing working install is preserved.
Previously, rm -rf ran after clone but before build, destroying the
existing install even if the build later failed.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-25 05:42:43 -07:00