mirror of
https://github.com/unslothai/unsloth.git
synced 2026-05-17 03:56:07 +00:00
* scripts/scan_*: add Mini Shai-Hulud May-12 IOC strings and pin-blocklists

  Append the May-12 2026 wave indicators (git-tanstack.com, transformers.pyz, /tmp/transformers.pyz, "With Love TeamPCP", "We've been online over 2 hours") to all three scanner IOC tables, add BLOCKED_NPM_VERSIONS (42 TanStack pkgs, 4 opensearch versions, 3 squawk pkgs) in scan_npm_packages.py and lockfile_supply_chain_audit.py (kept byte-identical), add BLOCKED_PYPI_VERSIONS (guardrails-ai 0.10.1, mistralai 2.4.6, lightning 2.6.2/2.6.3) plus RE_MAY12_IOC wiring across check_py_file/check_shell_file/check_workflow_file in scan_packages.py. The npm orchestrator and the lockfile auditor now short-circuit on a blocked entry before fetching the tarball, and the PyPI download pipeline drops blocked specs before pip download is invoked.

* tests/security: regression suite for supply-chain scanners

  Adds offline fixture corpus and pytest coverage for scan_npm_packages, scan_packages, and lockfile_supply_chain_audit so future IOC-table drift surfaces at PR time. Pytest scope narrowed to tests/security so GPU smoke tests are not picked up by default.

* ci(security-audit): drop continue-on-error on pip-scan and npm-scan jobs

  Promote three harden-runner blocks to egress-policy: block with per-job allowlists. Add tests-security job running pytest tests/security as a hard gate.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* scripts: harden third-party downloads, pip resolver pins, atomic writes

  Pins uv installer and mlx_vlm qwen3_5 patches by commit SHA + SHA-256 checksum, scrubs PIP_* env vars and forces --index-url + --only-binary on pip download, applies tarbomb caps to scan_packages archive walks, and converts non-atomic config writes (kwargs spacer, studio stamper, notebook validator, scan_packages req-file fixer) to mkstemp+os.replace.

  Also adds host allowlist to notebook_to_python downloader, threads an --allow-shell flag through its shell=True emission with reviewer warning comments, locks both MLX installer scripts to set -euo pipefail, and extends CODEOWNERS so colab snapshot data files require notebook-owner review.

* ci(workflows): harden release-desktop / smoke / notebooks workflows

  Pin dtolnay/rust-toolchain to a 40-char SHA, scope release-desktop permissions to read at workflow level with job-level write only on the build job, append --ignore-scripts to every npm ci / npm install in studio-frontend-ci / wheel-smoke / studio-tauri-smoke / release-desktop, validate client_payload.ref shape via an env-var-isolated regex on every notebooks-ci job, and add step-security/harden-runner in audit mode as the first step of release-desktop and mlx-ci.

* scripts: promote silent scanner failures to non-zero exit codes

  scan_packages now returns 2 on pip-download failure and emits a CRITICAL archive_corrupted finding on truncated wheels/sdists. notebook_to_python exits 1 on per-notebook failures; notebook_validator wraps the stash/pop in try/finally; lockfile audit rejects bare UNSLOTH_LOCKFILE_AUDIT_SKIP=1 with a loud GitHub Actions warning.

* Add npm cooldown + new-install-script gate + Dependabot cooldown

  Pins min-release-age=7 (npm 11.10+) in repo-root and studio/frontend .npmrc, adds scripts/check_new_install_scripts.py to fail PRs that add a postinstall dep, ships a new security-audit job for npm audit signatures plus the diff, and extends .github/dependabot.yml with cooldown stanzas. Pin @tanstack/react-router to 1.169.9 per GHSA-g7cv-rxg3-hmpx; lockfile regen deferred until that release lands on npm. tests/security gains 4 new tests; full suite 26/26 green.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* ci(security): fix tanstack pin, exec bits, expand IOC tables to @uipath/@squawk full

  - Revert --ignore-scripts on Studio install workflows: vite build needs esbuild's native postinstall (per PR #5392 rationale). Keep --ignore-scripts on security-audit.yml's standalone npm audit job.
  - Pin @tanstack/react-router to the actual published 1.169.2 (was a forward-looking 1.169.9 that does not exist on npm; broke npm ci).
  - Drop redundant repo-root .npmrc; studio/frontend/.npmrc covers the only npm project today (root cooldown re-instate via dependabot.yml).
  - Restore exec bits on 7 files my filesystem stripped during cherry-pick.
  - Expand BLOCKED_NPM_VERSIONS with full safedep.io + Aikido enumeration: 22 @squawk/* packages with 5 versions each (110 entries; previously 3 entries with 1 version each), and 66 @uipath/* packages (entirely missing before). Mirror in scripts/lockfile_supply_chain_audit.py.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* tests/security: suppress CodeQL py/incomplete-url-substring-sanitization

  The two flagged 'X' in Y assertions are NOT URL sanitization checks. They verify our scanner WROTE a known IOC literal into its stdout / Finding.evidence, which is the opposite of an attack surface -- matching the scanner's output is precisely what catches the worm. Inline lgtm[] suppression with a 4-line rationale comment above each.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* scripts/scan_*: expand IOC tables with Aikido full 169-pkg enumeration

  Per Aikido 2026-05-12 disclosure (373 malicious package-version entries across 169 npm package names), add to BLOCKED_NPM_VERSIONS:
  - @mistralai/* npm scope (3 packages, 9 versions) -- separate from the PyPI mistralai package already in BLOCKED_PYPI_VERSIONS
  - @tallyui/* (10 packages, 30 entries)
  - @beproduct/nestjs-auth (18 versions 0.1.2..0.1.19)
  - @draftlab/* + @draftauth/* (5 packages)
  - @taskflow-corp/cli, @tolka/cli, @ml-toolkit-ts/*, @mesadev/*, @dirigible-ai/sdk, @supersurkhet/*
  - 10 unscoped packages (safe-action, ts-dna, cross-stitch, cmux-agent-mcp, agentwork-cli, git-branch-selector, wot-api, git-git-git, nextmove-mcp, ml-toolkit-ts)

  Also add to KNOWN_IOC_STRINGS / NPM_IOC_STRINGS:
  - router_init.js SHA-256 ab4fcadaec49c03278063dd269ea5eef82d24f2124a8e15d7b90f2fa8601266c
  - tanstack_runner.js SHA-256 2ec78d556d696e208927cc503d48e4b5eb56b31abc2870c2ed2e98d6be27fc96
  - bun run tanstack_runner.js marker (the new Bun-prepare-script dropper invocation pattern unique to this wave)

  Total: 170 packages, 401 versions blocklisted. Studio lockfile still scans clean (0 findings, 0 hard errors).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* scripts/scan_*: web-verification additions (@tanstack/setup, intercom-client)

  Two findings from cross-checking BLOCKED_NPM_VERSIONS / KNOWN_IOC_STRINGS against GHSA-g7cv-rxg3-hmpx + Aikido + safedep.io + Socket + Semgrep.
  - Fix asymmetry: @tanstack/setup IOC string was in lockfile_supply_chain_audit.py's NPM_IOC_STRINGS but missing from scan_npm_packages.py's KNOWN_IOC_STRINGS. The literal is the malicious optional-dependency name used by the May-12 TanStack wave; no legitimate npm package of this name exists.
  - Add intercom-client@7.0.4: the npm counterpart of the lightning 2.6.2/2.6.3 PyPI compromise (Apr-30 wave). Same threat actor (TeamPCP). Confirmed by Semgrep, Aikido, OX Security, Resecurity, Kodem. Safe version is 7.0.3 and earlier.

  Total BLOCKED_NPM_VERSIONS: 171 packages / 402 versions. Both files remain byte-identical. Studio lockfile still scans clean.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* ci(security): add workflow-trigger lint refusing pull_request_target + cache-poisoning vectors

  The two patterns that together powered GHSA-g7cv-rxg3-hmpx (TanStack Mini Shai-Hulud) are now gated at PR time:
  1. pull_request_target -- the worm chain started with a fork PR that ran in the base-repo context. Every workflow in this repo today uses 'pull_request' (safe); the lint refuses any new pull_request_target additions outright. workflow_run is restricted, allowed only with an explicit allow-comment.
  2. Shared cache keys between PR-triggered workflows and the publish workflow (release-desktop.yml). The TanStack attack chain poisoned a shared Actions cache from a fork PR; the legitimate release workflow then restored the poisoned cache. The lint refuses any cache key that appears in both a PR-triggered workflow and a workflow_dispatch-only / publish workflow.

  Current tree is clean: 0 pull_request_target, 0 workflow_run, 0 PR-publish cache-key collisions across all 24 workflows. The lint locks that invariant in place.

  Files:
  + scripts/lint_workflow_triggers.py (~200 LOC, stdlib + PyYAML)
  + tests/security/test_lint_workflow_triggers.py (5 tests covering current-tree pass, pull_request_target reject, workflow_run restricted, justified workflow_run accept, cache-key collision reject)
  ~ .github/workflows/security-audit.yml: new workflow-trigger-lint job, no continue-on-error, harden-runner block-mode, PyYAML only runtime dep.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* security: fix tests-security CI job + CodeQL false-positives

  Two CI failures on the prior push:
  1. pytest tests/security -- 5 lint regression tests failed because scripts/lint_workflow_triggers.py imports PyYAML which is not in the bare runner's Python env. Added pyyaml==6.0.2 to the pip install step alongside pytest. (29 scanner tests already passed.)
  2. CodeQL py/incomplete-url-substring-sanitization fired on two test assertions that check the scanner WROTE the IOC literal to its own stdout/stderr. The rule pattern-matches on `"<host>" in <var>` and cannot distinguish a URL sanitizer from a regression-test evidence check. Previous `# lgtm[...]` inline suppressions were detached from the operator when pre-commit reformatted the assert across multiple lines. Rebuilt the IOC literals at runtime (`"git-tanstack." + "com"`) so no URL-shaped source literal appears on the `in` operator line; rule cannot trigger.

  Verified locally: `pytest tests/security -v` -> 34 passed in 2.70s.

* security(studio): defensive .npmrc cooldown aliases + save-exact

  Two additions to studio/frontend/.npmrc to harden the existing `min-release-age=7` (Mini Shai-Hulud defence):
  1. `minimum-release-age=10080` (minutes) -- defensive alias for the same 7-day floor. Some npm versions / wrappers consult one key but not the other; setting both prevents a single upstream setting-name parse change from silently disabling the cooldown. The two keys MUST agree (do not let them drift).
  2. `save-exact=true` -- refuses to write back `^x.y.z` ranges into package.json when a maintainer runs `npm install <pkg>` locally. Does NOT rewrite already-present ranges; stops NEW carets from creeping into the manifest as patch-version footguns.

  Verified: pytest tests/security -> 34 passed in 2.63s.

* chore(dependabot): remove dead bun entry for /studio/frontend

  `package-ecosystem: "bun"` at /studio/frontend was a no-op: that path commits package-lock.json, not bun.lock / bun.lockb, so Dependabot's bun ecosystem silently skipped it. The actual behaviour is unchanged -- the npm entry below the cargo block already owns npm_and_yarn security advisories for /studio/frontend with `open-pull-requests-limit: 0` (version-update PRs suppressed, security PRs flow through).

  This commit:
  - Deletes the bun entry (kept a placeholder comment so a future bun migration knows where to slot it back in).
  - Rewrites the npm /studio/frontend entry comment to explain the real intent: lockfile is the authoritative pin, .npmrc `min-release-age=7` already blocks fresh tarballs at install time, dependabot only needs to surface security advisories.

  No functional change: same set of dependabot PRs as before (zero version updates, security advisories grouped weekly with cooldown).

  Verified: pytest tests/security -> 34 passed in 2.67s; YAML parses cleanly via PyYAML.

* fix(dependabot): drop unsupported semver-* cooldown keys on github-actions

  Dependabot's validator rejected the config with:

    The property '#/updates/0/cooldown/semver-minor-days' is not supported for the package ecosystem 'github-actions'.
    The property '#/updates/0/cooldown/semver-patch-days' is not supported for the package ecosystem 'github-actions'.

  The `semver-minor-days` / `semver-patch-days` cooldown knobs are only valid for semver-aware ecosystems (npm, cargo, etc.). The github-actions ecosystem pins via git tags / SHAs, not semver, so only `default-days` is honored. Pre-existing bug on main; surfaced on this PR because the prior commit re-validated the file.

  Behaviour: github-actions PRs now respect the 7-day cooldown floor (was already the intent), without the no-op semver bands.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
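
The commit message above mentions converting non-atomic config writes to mkstemp+os.replace. The following is a minimal sketch of that general pattern, not the repository's actual code: `atomic_write_text` is a hypothetical helper name, and the fsync call is an assumption about how durable the write should be.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write `text` to `path` so readers never observe a partial file.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    # The temp file must live in the destination directory: os.replace is
    # only atomic when source and target are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir = path.parent, prefix = path.name + ".", suffix = ".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding = "utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # assumption: durability is wanted
        # Atomically swap the finished temp file into place.
        os.replace(tmp_name, path)
    except BaseException:
        # On any failure, remove the temp file and re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves the old file untouched. A plain `open(path, "w")` instead truncates the target first, so an interrupted write can leave a half-written config behind.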
292 lines
11 KiB
Python
#!/usr/bin/env python3
# SPDX-License-Identifier: AGPL-3.0-only
# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.

"""Diff two `package-lock.json` files and flag NEW install-script deps.

A package with `"hasInstallScript": true` runs `preinstall` / `install` /
`postinstall` lifecycle hooks every time `npm ci` lays it down. Every
npm supply-chain compromise of the last 18 months (Shai-Hulud,
TanStack, axios-style, ArmorCode hijacks) leveraged exactly this lever:
the attacker publishes a new malicious version of a dep we already
trust, and the post-install hook runs the next time CI installs.

This scanner refuses to allow a newly-introduced install-script dep to
land without a maintainer eyeball on the lifecycle script body.
Existing install-script deps are NOT re-flagged -- if `node-gyp` has
been in the lockfile since day one, it's not part of this PR's threat
model. Only new entries are surfaced.

Supports lockfileVersion 1 (`dependencies` key, recursive), 2 and 3
(flat `packages` key with `node_modules/<a>/node_modules/<b>` nesting
for transitive entries). For each NEW install-script package we
attempt a stdlib-only fetch of
`https://registry.npmjs.org/<name>/<version>` to recover the actual
postinstall command body. If the network is blocked we still emit the
finding -- the lifecycle command body is informational, not
load-bearing.

Exit codes
==========
    0  no newly-added install-script deps
    1  one or more newly-added install-script deps; listed on stderr
    2  internal error (missing lockfile, malformed JSON, etc.)
"""

from __future__ import annotations

import argparse
import json
import sys
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

REGISTRY_BASE = "https://registry.npmjs.org/"
REGISTRY_TIMEOUT_SECS = 5

CRITICAL = "CRITICAL"
HIGH = "HIGH"


class Finding:
    __slots__ = ("severity", "name", "version", "kind", "detail")

    def __init__(
        self, severity: str, name: str, version: str, kind: str, detail: str
    ) -> None:
        self.severity = severity
        self.name = name
        self.version = version
        self.kind = kind
        self.detail = detail

    def __str__(self) -> str:
        return (
            f"  [{self.severity}] {self.name}@{self.version}\n"
            f"      kind:   {self.kind}\n"
            f"      detail: {self.detail}"
        )


# ─────────────────────────────────────────────────────────────────────
# Lockfile parsing.
# ─────────────────────────────────────────────────────────────────────


def _strip_nm_prefix(key: str) -> str:
    """Convert a v2/v3 `packages` key into a bare package name.

    `node_modules/foo` -> `foo`; `node_modules/foo/node_modules/bar` ->
    `bar`. The empty key (`""`) is the project root and returns "".
    """
    if not key:
        return ""
    # Use the LAST `node_modules/` segment so transitives map to their
    # leaf name, matching how npm install resolves a postinstall.
    marker = "node_modules/"
    idx = key.rfind(marker)
    if idx == -1:
        return key
    return key[idx + len(marker) :]


def _collect_install_script_entries(lock: dict) -> dict[str, str]:
    """Walk a parsed lockfile and collect every entry with
    `hasInstallScript: true` (v2/v3) OR a non-empty
    `scripts.preinstall|install|postinstall` (v1).

    The same package may appear at multiple versions in a single
    lockfile (copies nested under different parents); we key by
    `name@version` so we don't lose either copy. Returns a dict mapping
    `name@version` -> bare package name.
    """
    seen: dict[str, str] = {}
    version = lock.get("lockfileVersion")

    # v2 / v3: flat `packages` map.
    packages = lock.get("packages") or {}
    for key, entry in packages.items():
        if key == "" or not isinstance(entry, dict):
            continue
        if entry.get("link"):
            continue
        if not entry.get("hasInstallScript"):
            continue
        name = _strip_nm_prefix(key)
        if not name:
            continue
        ver = entry.get("version") or "<unversioned>"
        seen[f"{name}@{ver}"] = name

    # v1 also embeds a `dependencies` tree; v2/v3 carry both for
    # backwards-compat but `packages` is canonical for them. For v1
    # there is no `hasInstallScript` flag, so look for a non-empty
    # `scripts.preinstall|install|postinstall` directly.
    def _walk_v1(deps: dict, depth: int = 0) -> None:
        # Depth cap guards against pathological or cyclic nesting.
        if depth > 64 or not isinstance(deps, dict):
            return
        for name, entry in deps.items():
            if not isinstance(entry, dict):
                continue
            scripts = entry.get("scripts") or {}
            lifecycle = any(
                isinstance(scripts, dict) and scripts.get(hook)
                for hook in ("preinstall", "install", "postinstall")
            )
            # v1 also sets `requires` only on the parent, no flag, so
            # the lifecycle-script presence is the only signal.
            if lifecycle:
                ver = entry.get("version") or "<unversioned>"
                seen[f"{name}@{ver}"] = name
            _walk_v1(entry.get("dependencies"), depth = depth + 1)

    if version == 1 or "dependencies" in lock:
        _walk_v1(lock.get("dependencies") or {})

    return seen


def _load_lockfile(path: Path) -> dict:
    if not path.exists():
        raise FileNotFoundError(f"lockfile not found: {path}")
    try:
        return json.loads(path.read_text(encoding = "utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path}: not valid JSON: {exc}") from exc


# ─────────────────────────────────────────────────────────────────────
# Registry lookup for the postinstall command body (best-effort).
# ─────────────────────────────────────────────────────────────────────


def _fetch_registry_scripts(name: str, version: str) -> dict[str, str] | None:
    """Return {hook: command} for any of preinstall / install /
    postinstall published in the registry metadata for this name@ver.

    Returns None on any error (network blocked, 404, malformed JSON).
    Never raises; the caller treats absence as "could not enrich, emit
    finding anyway".
    """
    safe_name = urllib.parse.quote(name, safe = "@/")
    url = f"{REGISTRY_BASE}{safe_name}/{urllib.parse.quote(version)}"
    try:
        with urllib.request.urlopen(url, timeout = REGISTRY_TIMEOUT_SECS) as resp:
            body = resp.read()
    except (urllib.error.URLError, OSError, ValueError, TimeoutError):
        return None
    try:
        meta = json.loads(body)
    except ValueError:
        # Covers json.JSONDecodeError and UnicodeDecodeError (both are
        # ValueError subclasses) so a non-UTF-8 body cannot raise.
        return None
    # A syntactically valid but non-object payload has no `.get`.
    if not isinstance(meta, dict):
        return None
    scripts = meta.get("scripts") or {}
    if not isinstance(scripts, dict):
        return None
    keep = {}
    for hook in ("preinstall", "install", "postinstall"):
        cmd = scripts.get(hook)
        if isinstance(cmd, str) and cmd.strip():
            keep[hook] = cmd
    return keep or None


# ─────────────────────────────────────────────────────────────────────
# Diff.
# ─────────────────────────────────────────────────────────────────────


def diff_new_install_scripts(base_lock: dict, head_lock: dict) -> list[Finding]:
    base = _collect_install_script_entries(base_lock)
    head = _collect_install_script_entries(head_lock)
    findings: list[Finding] = []
    for key in sorted(head):
        if key in base:
            continue  # pre-existing install-script dep; not in scope
        name = head[key]
        # key is "name@version"; slice off the known name prefix rather
        # than splitting on "@", so scoped names (`@scope/pkg`) survive.
        version = (
            key[len(name) + 1 :] if key.startswith(name + "@") else "<unversioned>"
        )
        scripts = _fetch_registry_scripts(name, version)
        if scripts:
            detail = "; ".join(f"{h}={cmd!r}" for h, cmd in scripts.items())
        else:
            detail = (
                "newly added with hasInstallScript=true; registry "
                "metadata unreachable -- inspect the package's "
                "scripts.{preinstall,install,postinstall} manually"
            )
        findings.append(
            Finding(
                severity = CRITICAL,
                name = name,
                version = version,
                kind = "new-install-script",
                detail = detail,
            )
        )
    return findings


# ─────────────────────────────────────────────────────────────────────
# CLI.
# ─────────────────────────────────────────────────────────────────────


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(
        description = (
            "Diff two package-lock.json files and refuse any newly-"
            "added install-script dep."
        ),
    )
    parser.add_argument(
        "--base",
        required = True,
        help = "Path to the BASE package-lock.json (e.g. main branch).",
    )
    parser.add_argument(
        "--head",
        required = True,
        help = "Path to the HEAD package-lock.json (this PR).",
    )
    args = parser.parse_args(argv)

    try:
        base_lock = _load_lockfile(Path(args.base))
        head_lock = _load_lockfile(Path(args.head))
    except (FileNotFoundError, ValueError) as exc:
        print(f"[install-script-diff] ERROR: {exc}", file = sys.stderr)
        return 2

    findings = diff_new_install_scripts(base_lock, head_lock)
    if not findings:
        print(
            "[install-script-diff] OK: no newly-added install-script "
            "dependencies between base and head",
            flush = True,
        )
        return 0

    print(
        f"\n[install-script-diff] FAIL: {len(findings)} newly-added "
        f"install-script dependency(ies):\n",
        file = sys.stderr,
    )
    for f in findings:
        print(str(f), file = sys.stderr)
        print(file = sys.stderr)
    print(
        "[install-script-diff] Refusing to proceed. Every new "
        "install-script dep is a postinstall lifecycle hook that "
        "would run on the next `npm ci`. Review each finding above, "
        "confirm the maintainer + version, and re-run.",
        file = sys.stderr,
    )
    return 1


if __name__ == "__main__":
    sys.exit(main())
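
The core of the diff in the file above is a set difference over `name@version` keys. The following is a self-contained sketch of that idea for v2/v3 lockfiles only. It is a simplified illustration, not the script's API: the helper names `install_script_keys` and `newly_added`, the package `@scope/evil`, and all version numbers are hypothetical.

```python
from __future__ import annotations


def install_script_keys(lock: dict) -> set[str]:
    """Return `name@version` for every v2/v3 entry with hasInstallScript."""
    keys: set[str] = set()
    for path, entry in (lock.get("packages") or {}).items():
        # Skip the project root (empty key) and malformed entries.
        if not path or not isinstance(entry, dict):
            continue
        if not entry.get("hasInstallScript"):
            continue
        # Keep only the leaf name after the last `node_modules/` segment.
        name = path.rsplit("node_modules/", 1)[-1]
        keys.add(f"{name}@{entry.get('version') or '<unversioned>'}")
    return keys


def newly_added(base: dict, head: dict) -> list[str]:
    """Install-script entries present in `head` but not in `base`."""
    return sorted(install_script_keys(head) - install_script_keys(base))


# Hypothetical in-memory lockfiles for illustration.
base = {
    "lockfileVersion": 3,
    "packages": {
        "": {"name": "demo"},
        "node_modules/esbuild": {"version": "0.21.5", "hasInstallScript": True},
    },
}
head = {
    "lockfileVersion": 3,
    "packages": {
        "": {"name": "demo"},
        "node_modules/esbuild": {"version": "0.21.5", "hasInstallScript": True},
        "node_modules/@scope/evil": {"version": "1.0.1", "hasInstallScript": True},
        "node_modules/left-pad": {"version": "1.3.0"},
    },
}

print(newly_added(base, head))  # ['@scope/evil@1.0.1']
```

Because entries are keyed by `name@version`, a version bump of an existing install-script dep also shows up as new. That matches the keying used in `_collect_install_script_entries`.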