studio/ci: pre-install lockfile supply-chain audit (npm + cargo) (#5392)

* studio/ci: pre-install lockfile supply-chain audit (npm + cargo) The Mini Shai-Hulud wave that hit @tanstack/* on 2026-05-11 19:20-19:26 UTC (GHSA-g7cv-rxg3-hmpx) pushed 84 malicious versions across 42 packages. Each compromised tarball carried an `optionalDependencies` entry pointing at a GitHub-hosted prepare script that exfiltrated GitHub / npm / AWS / Vault / SSH credentials on `npm install` / `npm ci`. Our current lockfile pins ALL @tanstack/* at pre-malicious versions so we were not exposed, but the only defense layer between "dependabot opens a security-update PR during a malicious window" and "a compromised package's postinstall runs on the CI runner" is the advisory-DB latency. `npm audit` and OSV-Scanner are reactive: there is a window between malicious publication and GHSA landing. Add a pre-install lockfile audit that fires on the injection pattern itself, BEFORE `npm ci` gets a chance to execute lifecycle scripts: scripts/lockfile_supply_chain_audit.py npm side (studio/frontend/package-lock.json, lockfileVersion 2/3): 1. every `resolved` URL must point to registry.npmjs.org; direct GitHub / git+ / file: refs are the Shai-Hulud vector 2. every non-bundled entry must carry an `integrity` SHA 3. raw-text scan for known IOC strings (router_init.js, tanstack_runner.js, router_runtime.js, @tanstack/setup, the specific TanStack worm commit hash, getsession.org exfiltration host, "A Mini Shai-Hulud has Appeared" marker) 4. nested `node_modules/.../node_modules/` fold-ins are transparent -- they ride on the parent tarball's integrity cargo side (studio/src-tauri/Cargo.lock): 5. every `source` must be the crates.io registry 6. registry crates must have a `checksum` 7. one allowlist entry: fix-path-env from tauri-apps/fix-path-env-rs at pinned SHA c4c45d5. Any other non-registry source -- or a bump of that pinned SHA -- re-fires the audit until reviewed + appended Wire into four workflows: .github/workflows/security-audit.yml -- new step inside the advisory-audit job, immediately before `npm audit` so the structural pass and the advisory-DB pass appear together in the GitHub step summary. .github/workflows/studio-frontend-ci.yml, .github/workflows/wheel-smoke.yml, .github/workflows/studio-tauri-smoke.yml -- new step immediately BEFORE `npm ci`. If a future malicious bump lands in our lockfile, the audit refuses and `npm ci` never runs, so no `prepare` / `postinstall` from a compromised tarball can execute on the runner. Note on --ignore-scripts: every npm ci in our CI is followed directly by `npm run build` or `tauri build`, both of which depend on package install scripts (esbuild's native-binary postinstall, etc.). Blanket --ignore-scripts breaks the build, so the pre-install structural audit is the practical mitigation. The audit reads lockfiles only; it never executes anything from them. Verified: - Clean state: 0 findings on the current tree (npm + cargo). - Fault injection: synthetic `@tanstack/setup` IOC + non-registry `resolved` URL both fire with exit code 1. - YAML parses cleanly for all four modified workflows. Refs: - https://tanstack.com/blog/npm-supply-chain-compromise-postmortem - https://github.com/TanStack/router/issues/7383 - https://github.com/TanStack/router/security/advisories/GHSA-g7cv-rxg3-hmpx - https://www.aikido.dev/blog/mini-shai-hulud-is-back-tanstack-compromised - https://www.stepsecurity.io/blog/mini-shai-hulud-is-back-a-self-spreading-supply-chain-attack-hits-the-npm-ecosystem * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-05-17 03:56:07 +00:00 · 2026-05-11 20:36:52 -07:00 · 2026-05-11 20:36:52 -07:00 · ac765d2efb
commit ac765d2efb
parent 1794a544b5
5 changed files with 521 additions and 0 deletions
--- a/.github/workflows/security-audit.yml
+++ b/.github/workflows/security-audit.yml
@ -244,6 +244,27 @@ jobs:
            echo '```'
          } >> "$GITHUB_STEP_SUMMARY"

+      # ─────────────────────────────────────────────────────────────
+      # Pre-install lockfile supply-chain audit (npm + cargo).
+      # Catches structural anomalies (non-registry resolved URLs,
+      # missing integrity hashes, known IOC strings) BEFORE `npm
+      # audit` or OSV-Scanner consult the advisory DB. The advisory
+      # path is reactive -- there is a window between a malicious
+      # publication and the GHSA landing. This step fires on the
+      # injection pattern itself so it catches the same class of
+      # attack the moment the lockfile shape becomes wrong.
+      # ─────────────────────────────────────────────────────────────
+      - name: Lockfile supply-chain audit (pre-install scan)
+        run: |
+          python3 scripts/lockfile_supply_chain_audit.py
+          {
+            echo "## Lockfile supply-chain audit"
+            echo
+            echo "Scanned: studio/frontend/package-lock.json + studio/src-tauri/Cargo.lock"
+            echo
+            echo "No structural anomalies or known IOC strings."
+          } >> "$GITHUB_STEP_SUMMARY"
+
      # ─────────────────────────────────────────────────────────────
      # npm: Studio frontend
      # ─────────────────────────────────────────────────────────────
--- a/.github/workflows/studio-frontend-ci.yml
+++ b/.github/workflows/studio-frontend-ci.yml
@ -58,6 +58,14 @@ jobs:
          cache: 'npm'
          cache-dependency-path: studio/frontend/package-lock.json

+      # Run the structural lockfile scan BEFORE npm ci. A compromised
+      # tarball runs its `prepare` / `postinstall` during `npm ci`,
+      # so any catch has to fire upstream of that. The scanner is
+      # pure-Python read-only; safe to call ahead of every install.
+      - name: Lockfile supply-chain audit (pre-install scan)
+        working-directory: ${{ github.workspace }}
+        run: python3 scripts/lockfile_supply_chain_audit.py
+
      - name: Lockfile must agree with package.json (npm ci is strict)
        run: npm ci --no-fund --no-audit

--- a/.github/workflows/studio-tauri-smoke.yml
+++ b/.github/workflows/studio-tauri-smoke.yml
@ -69,6 +69,9 @@ jobs:
          echo "$out"
          [ "$out" = "tauri-cli 2.10.1" ] || { echo "::error::expected tauri-cli 2.10.1, got $out"; exit 1; }

+      - name: Lockfile supply-chain audit (pre-install scan)
+        run: python3 scripts/lockfile_supply_chain_audit.py
+
      - name: Frontend build (npm ci, vite)
        working-directory: studio/frontend
        run: |
--- a/.github/workflows/wheel-smoke.yml
+++ b/.github/workflows/wheel-smoke.yml
@ -53,6 +53,9 @@ jobs:
        with:
          python-version: '3.12'

+      - name: Lockfile supply-chain audit (pre-install scan)
+        run: python3 scripts/lockfile_supply_chain_audit.py
+
      - name: Build frontend
        run: |
          cd studio/frontend
--- a/scripts/lockfile_supply_chain_audit.py
+++ b/scripts/lockfile_supply_chain_audit.py
@ -0,0 +1,486 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
+
+"""Lockfile supply-chain audit for the Studio frontend and Tauri shell.
+
+Runs BEFORE `npm ci` / `cargo fetch` in CI. Refuses to proceed when a
+lockfile contains patterns that indicate the kind of supply-chain
+injection seen in the npm Shai-Hulud waves and the cargo
+crates.io brand-squat attempts.
+
+What it checks
+==============
+
+studio/frontend/package-lock.json (lockfileVersion 2 or 3):
+
+  1. `resolved` URL origin. Every entry must resolve through
+     `https://registry.npmjs.org/`. Direct GitHub-hosted dependencies
+     (`git+ssh://`, `git+https://`, `github:owner/repo#sha`,
+     `file:`, `http://`) are refused -- npm's TanStack incident used
+     exactly this vector to land an unaudited GitHub commit hash as
+     an optional dependency.
+
+  2. `integrity` field presence. Every non-workspace entry must carry
+     an `integrity` SHA. A missing integrity means the registry can
+     swap the tarball after lockfile generation and CI will not
+     notice.
+
+  3. Known IOC strings. A hardcoded set of indicator-of-compromise
+     substrings is grepped across the entire lockfile body (file
+     names, dependency keys, URLs). The list is updated as new
+     campaigns surface. Catching one means the local install was
+     about to pull a publicly-known malicious release.
+
+studio/src-tauri/Cargo.lock:
+
+  4. `source` field origin. Every entry with a `source` must point at
+     `registry+https://github.com/rust-lang/crates.io-index`. Direct
+     git sources (`git+https://...`) and `path+...` for cross-crate
+     paths warrant manual review and are flagged.
+
+  5. Known cargo IOC strings. Same idea as (3), separate list.
+
+Exit codes
+==========
+
+  0  no findings, or an opt-out env var (UNSLOTH_LOCKFILE_AUDIT_SKIP=1)
+     is set
+  1  one or more findings; stderr lists them with file path and line
+     number where derivable
+  2  internal error (missing dependency, malformed JSON, etc.)
+
+Operational stance
+==================
+
+This scanner only PARSES the lockfiles -- it never executes anything
+in them, never resolves anything against the network. Safe to run
+ahead of every `npm ci`. The IOC list is short by design; this
+complements (not replaces) `npm audit`, OSV-Scanner, and the
+advisory-DB pipeline in `.github/workflows/security-audit.yml`. The
+shape of the catch is "we refuse to proceed because the lockfile
+itself is shaped wrong", which fires before any third-party install
+script gets a chance to run on the runner.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import re
+import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+
+# ─────────────────────────────────────────────────────────────────────
+# Known IOC strings (case-sensitive substring match).
+# ─────────────────────────────────────────────────────────────────────
+#
+# Keep these short and FACTUAL. Each entry is tied to a public advisory
+# and is the literal string an attacker would have to embed for the
+# attack to work. Adding speculative or generic patterns here would
+# generate false positives on dependency upgrades.
+NPM_IOC_STRINGS: tuple[str, ...] = (
+    # Shai-Hulud TanStack wave -- May 11, 2026 (GHSA-g7cv-rxg3-hmpx).
+    "router_init.js",
+    "tanstack_runner.js",
+    "router_runtime.js",
+    "@tanstack/setup",
+    "github:tanstack/router#79ac49eedf774dd4b0cfa308722bc463cfe5885c",
+    # Exfiltration endpoints observed across both Shai-Hulud waves.
+    "filev2.getsession.org",
+    "getsession.org/file/",
+    # Campaign markers; the worm tarballs print this to stdout on run.
+    "A Mini Shai-Hulud has Appeared",
+)
+
+CARGO_IOC_STRINGS: tuple[str, ...] = (
+    # Reserved for future cargo-side incidents. Empty by default --
+    # `source` origin check below catches the structural pattern.
+)
+
+
+# ─────────────────────────────────────────────────────────────────────
+# Allowed lockfile origins.
+# ─────────────────────────────────────────────────────────────────────
+NPM_REGISTRY_PREFIX = "https://registry.npmjs.org/"
+
+# Tarballs are also fetched from this mirror on some GH Actions cached
+# runs (npm rewrites the resolved URL on cache hit). Allow either.
+NPM_REGISTRY_PREFIXES_ALLOWED: tuple[str, ...] = (NPM_REGISTRY_PREFIX,)
+
+CARGO_REGISTRY_SOURCE = "registry+https://github.com/rust-lang/crates.io-index"
+
+
+# ─────────────────────────────────────────────────────────────────────
+# Cargo non-registry source allowlist.
+# ─────────────────────────────────────────────────────────────────────
+#
+# Each entry is `(crate_name, exact_source_string)`. The crate must
+# match by name AND the source must match the full pinned-SHA string
+# verbatim. Bumping the commit SHA forces a re-review here: the
+# scanner fires until the new SHA is appended.
+#
+# Studio's Tauri shell pulls `fix-path-env` directly from
+# tauri-apps/fix-path-env-rs because the crate is not published to
+# crates.io. The pinned commit (c4c45d5) was reviewed at the time it
+# landed; future bumps need explicit approval.
+CARGO_SOURCE_ALLOWLIST: tuple[tuple[str, str], ...] = (
+    (
+        "fix-path-env",
+        "git+https://github.com/tauri-apps/fix-path-env-rs#"
+        "c4c45d503ea115a839aae718d02f79e7c7f0f673",
+    ),
+)
+
+
+# ─────────────────────────────────────────────────────────────────────
+# Finding container.
+# ─────────────────────────────────────────────────────────────────────
+
+
+class Finding:
+    __slots__ = ("path", "package", "kind", "detail")
+
+    def __init__(self, path: str, package: str, kind: str, detail: str) -> None:
+        self.path = path
+        self.package = package
+        self.kind = kind
+        self.detail = detail
+
+    def __str__(self) -> str:
+        return (
+            f"  [{self.kind}] {self.path}\n"
+            f"    package: {self.package}\n"
+            f"    detail:  {self.detail}"
+        )
+
+
+# ─────────────────────────────────────────────────────────────────────
+# package-lock.json audit.
+# ─────────────────────────────────────────────────────────────────────
+
+
+def audit_npm_lockfile(path: Path) -> list[Finding]:
+    findings: list[Finding] = []
+    if not path.exists():
+        return findings
+
+    raw = path.read_text(encoding = "utf-8")
+    try:
+        lock = json.loads(raw)
+    except json.JSONDecodeError as exc:
+        findings.append(
+            Finding(
+                path = str(path),
+                package = "<root>",
+                kind = "malformed-lockfile",
+                detail = f"could not parse as JSON: {exc}",
+            )
+        )
+        return findings
+
+    lockfile_version = lock.get("lockfileVersion")
+    if lockfile_version not in (2, 3):
+        findings.append(
+            Finding(
+                path = str(path),
+                package = "<root>",
+                kind = "unsupported-lockfile-version",
+                detail = (f"only lockfileVersion 2 or 3 audited; got {lockfile_version}"),
+            )
+        )
+
+    packages = lock.get("packages") or {}
+    for key, entry in packages.items():
+        # The empty key "" is the project root; workspace entries use
+        # keys like "node_modules/foo" or "studio/frontend/sub-pkg".
+        # Skip the project root (it has no `resolved`).
+        if key == "":
+            continue
+        if entry.get("link"):
+            # Workspace symlink; no tarball to resolve.
+            continue
+
+        resolved = entry.get("resolved")
+        # Entries living inside another package's `node_modules/`
+        # tree are bundled fold-ins -- the parent's tarball ships
+        # their source verbatim and the parent's `integrity` covers
+        # the whole subtree. npm represents them in lockfileVersion 3
+        # as nested entries with no `resolved` and no `integrity` of
+        # their own. Treat them as transparent to this audit.
+        nested = key.count("/node_modules/") >= 1
+
+        # 1. resolved-URL origin.
+        if resolved is None:
+            if nested or entry.get("bundled"):
+                # Bundled / fold-in entry; covered by parent integrity.
+                pass
+            elif entry.get("version"):
+                # Top-level entry without a resolved URL is suspicious.
+                findings.append(
+                    Finding(
+                        path = str(path),
+                        package = key,
+                        kind = "missing-resolved-url",
+                        detail = (
+                            f"version={entry['version']!r} but no `resolved` "
+                            "field; lockfile is incomplete"
+                        ),
+                    )
+                )
+        else:
+            if not any(resolved.startswith(p) for p in NPM_REGISTRY_PREFIXES_ALLOWED):
+                findings.append(
+                    Finding(
+                        path = str(path),
+                        package = key,
+                        kind = "non-registry-resolved-url",
+                        detail = (
+                            f"resolved={resolved!r}; only "
+                            f"{NPM_REGISTRY_PREFIX} is permitted. Direct "
+                            "GitHub / git / file references are the "
+                            "Shai-Hulud injection vector."
+                        ),
+                    )
+                )
+
+        # 2. integrity-hash presence.
+        if resolved is not None and not entry.get("integrity"):
+            findings.append(
+                Finding(
+                    path = str(path),
+                    package = key,
+                    kind = "missing-integrity-hash",
+                    detail = (
+                        "no `integrity` field; npm cannot verify the "
+                        "tarball SHA against the registry-published hash"
+                    ),
+                )
+            )
+
+    # 3. Known IOC strings: scan the raw file body so we hit fields the
+    #    structural pass above doesn't enumerate (scripts, optional
+    #    dependencies, etc.). Cheap and complete.
+    for ioc in NPM_IOC_STRINGS:
+        if ioc in raw:
+            # Best-effort line number lookup.
+            line_no = _first_line_containing(raw, ioc)
+            findings.append(
+                Finding(
+                    path = f"{path}:{line_no}" if line_no else str(path),
+                    package = "<ioc-match>",
+                    kind = "known-ioc-string",
+                    detail = (
+                        f"matched known IOC substring {ioc!r}; this is "
+                        "a public indicator of a recent supply-chain "
+                        "compromise. Refuse to install."
+                    ),
+                )
+            )
+
+    return findings
+
+
+def _first_line_containing(text: str, needle: str) -> int | None:
+    for i, line in enumerate(text.splitlines(), start = 1):
+        if needle in line:
+            return i
+    return None
+
+
+# ─────────────────────────────────────────────────────────────────────
+# Cargo.lock audit.
+# ─────────────────────────────────────────────────────────────────────
+
+
+# Cargo.lock is TOML; parse with stdlib tomllib (Python 3.11+). The
+# studio's Tauri shell already requires a modern toolchain so this is
+# always available where CI runs.
+_PACKAGE_HEADER = re.compile(r"^\[\[package\]\]\s*$")
+
+
+def audit_cargo_lockfile(path: Path) -> list[Finding]:
+    findings: list[Finding] = []
+    if not path.exists():
+        return findings
+
+    raw = path.read_text(encoding = "utf-8")
+    try:
+        import tomllib  # type: ignore[import-not-found]
+    except ImportError:
+        # Python <3.11; fall back to a tomli shim if importable.
+        try:
+            import tomli as tomllib  # type: ignore[no-redef]
+        except ImportError:
+            findings.append(
+                Finding(
+                    path = str(path),
+                    package = "<root>",
+                    kind = "missing-toml-parser",
+                    detail = (
+                        "Python 3.11+ tomllib or tomli is required to "
+                        "parse Cargo.lock; install tomli or upgrade "
+                        "Python before re-running this audit"
+                    ),
+                )
+            )
+            return findings
+
+    try:
+        lock = tomllib.loads(raw)
+    except Exception as exc:
+        findings.append(
+            Finding(
+                path = str(path),
+                package = "<root>",
+                kind = "malformed-lockfile",
+                detail = f"could not parse as TOML: {exc}",
+            )
+        )
+        return findings
+
+    for entry in lock.get("package", []):
+        name = entry.get("name") or "<unnamed>"
+        version = entry.get("version") or "<unversioned>"
+        source = entry.get("source")
+        # Workspace-local crates have no `source` field; skip them.
+        if source is None:
+            continue
+        if source != CARGO_REGISTRY_SOURCE:
+            if (name, source) in CARGO_SOURCE_ALLOWLIST:
+                # Pre-approved non-registry source pinned by SHA.
+                pass
+            else:
+                findings.append(
+                    Finding(
+                        path = str(path),
+                        package = f"{name}@{version}",
+                        kind = "non-registry-cargo-source",
+                        detail = (
+                            f"source={source!r}; only "
+                            f"{CARGO_REGISTRY_SOURCE!r} is permitted "
+                            "by default, and no allowlist entry covers "
+                            "this crate. If the source is legitimate, "
+                            "add `(name, source)` to "
+                            "CARGO_SOURCE_ALLOWLIST after reviewing the "
+                            "pinned commit."
+                        ),
+                    )
+                )
+        if not entry.get("checksum") and source == CARGO_REGISTRY_SOURCE:
+            findings.append(
+                Finding(
+                    path = str(path),
+                    package = f"{name}@{version}",
+                    kind = "missing-cargo-checksum",
+                    detail = (
+                        "registry crate without checksum; cargo cannot "
+                        "verify the downloaded source against the "
+                        "registry-published SHA"
+                    ),
+                )
+            )
+
+    for ioc in CARGO_IOC_STRINGS:
+        if ioc in raw:
+            line_no = _first_line_containing(raw, ioc)
+            findings.append(
+                Finding(
+                    path = f"{path}:{line_no}" if line_no else str(path),
+                    package = "<ioc-match>",
+                    kind = "known-ioc-string",
+                    detail = f"matched known IOC substring {ioc!r}",
+                )
+            )
+
+    return findings
+
+
+# ─────────────────────────────────────────────────────────────────────
+# CLI.
+# ─────────────────────────────────────────────────────────────────────
+
+
+DEFAULT_NPM_LOCKFILES = ("studio/frontend/package-lock.json",)
+DEFAULT_CARGO_LOCKFILES = ("studio/src-tauri/Cargo.lock",)
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(
+        description = "Pre-install lockfile supply-chain audit.",
+    )
+    parser.add_argument(
+        "--root",
+        default = str(REPO_ROOT),
+        help = "Repo root (default: parent of this script).",
+    )
+    parser.add_argument(
+        "--npm-lockfile",
+        action = "append",
+        default = None,
+        help = (
+            "Path to a package-lock.json (repeatable). "
+            "Default: studio/frontend/package-lock.json."
+        ),
+    )
+    parser.add_argument(
+        "--cargo-lockfile",
+        action = "append",
+        default = None,
+        help = (
+            "Path to a Cargo.lock (repeatable). "
+            "Default: studio/src-tauri/Cargo.lock."
+        ),
+    )
+    args = parser.parse_args(argv)
+
+    if os.environ.get("UNSLOTH_LOCKFILE_AUDIT_SKIP") == "1":
+        print(
+            "[lockfile-audit] UNSLOTH_LOCKFILE_AUDIT_SKIP=1; "
+            "audit skipped (expected only for local triage)",
+            flush = True,
+        )
+        return 0
+
+    root = Path(args.root).resolve()
+    npm_paths = [root / p for p in (args.npm_lockfile or DEFAULT_NPM_LOCKFILES)]
+    cargo_paths = [root / p for p in (args.cargo_lockfile or DEFAULT_CARGO_LOCKFILES)]
+
+    all_findings: list[Finding] = []
+    for p in npm_paths:
+        print(f"[lockfile-audit] npm: {p}", flush = True)
+        all_findings.extend(audit_npm_lockfile(p))
+    for p in cargo_paths:
+        print(f"[lockfile-audit] cargo: {p}", flush = True)
+        all_findings.extend(audit_cargo_lockfile(p))
+
+    if not all_findings:
+        print(
+            f"[lockfile-audit] OK: 0 findings across "
+            f"{len(npm_paths)} npm + {len(cargo_paths)} cargo lockfile(s)",
+            flush = True,
+        )
+        return 0
+
+    print(
+        f"\n[lockfile-audit] FAIL: {len(all_findings)} finding(s):\n",
+        file = sys.stderr,
+    )
+    for f in all_findings:
+        print(str(f), file = sys.stderr)
+        print(file = sys.stderr)
+    print(
+        "[lockfile-audit] Refusing to proceed. Each finding above is "
+        "either a structural lockfile anomaly or a public indicator-of-"
+        "compromise. Investigate before running `npm ci` or `cargo fetch`.",
+        file = sys.stderr,
+    )
+    return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())