mirror of
https://github.com/unslothai/unsloth.git
synced 2026-05-16 19:43:06 +00:00
* scripts/scan_*: add Mini Shai-Hulud May-12 IOC strings and pin-blocklists Append the May-12 2026 wave indicators (git-tanstack.com, transformers.pyz, /tmp/transformers.pyz, "With Love TeamPCP", "We've been online over 2 hours") to all three scanner IOC tables, add BLOCKED_NPM_VERSIONS (42 TanStack pkgs, 4 opensearch versions, 3 squawk pkgs) in scan_npm_packages.py and lockfile_supply_chain_audit.py (kept byte-identical), add BLOCKED_PYPI_VERSIONS (guardrails-ai 0.10.1, mistralai 2.4.6, lightning 2.6.2/2.6.3) plus RE_MAY12_IOC wiring across check_py_file/check_shell_file/check_workflow_file in scan_packages.py. The npm orchestrator and the lockfile auditor now short-circuit on a blocked entry before fetching the tarball, and the PyPI download pipeline drops blocked specs before pip download is invoked. * tests/security: regression suite for supply-chain scanners Adds offline fixture corpus and pytest coverage for scan_npm_packages, scan_packages, and lockfile_supply_chain_audit so future IOC-table drift surfaces at PR time. Pytest scope narrowed to tests/security so GPU smoke tests are not picked up by default. * ci(security-audit): drop continue-on-error on pip-scan and npm-scan jobs Promote three harden-runner blocks to egress-policy: block with per-job allowlists. Add tests-security job running pytest tests/security as a hard gate. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * scripts: harden third-party downloads, pip resolver pins, atomic writes Pins uv installer and mlx_vlm qwen3_5 patches by commit SHA + SHA-256 checksum, scrubs PIP_* env vars and forces --index-url + --only-binary on pip download, applies tarbomb caps to scan_packages archive walks, and converts non-atomic config writes (kwargs spacer, studio stamper, notebook validator, scan_packages req-file fixer) to mkstemp+os.replace. Also adds host allowlist to notebook_to_python downloader, threads an --allow-shell flag through its shell=True emission with reviewer warning comments, locks both MLX installer scripts to set -euo pipefail, and extends CODEOWNERS so colab snapshot data files require notebook-owner review. * ci(workflows): harden release-desktop / smoke / notebooks workflows Pin dtolnay/rust-toolchain to a 40-char SHA, scope release-desktop permissions to read at workflow level with job-level write only on the build job, append --ignore-scripts to every npm ci / npm install in studio-frontend-ci / wheel-smoke / studio-tauri-smoke / release-desktop, validate client_payload.ref shape via an env-var-isolated regex on every notebooks-ci job, and add step-security/harden-runner in audit mode as the first step of release-desktop and mlx-ci. * scripts: promote silent scanner failures to non-zero exit codes scan_packages now returns 2 on pip-download failure and emits a CRITICAL archive_corrupted finding on truncated wheels/sdists. notebook_to_python exits 1 on per-notebook failures; notebook_validator wraps the stash/pop in try/finally; lockfile audit rejects bare UNSLOTH_LOCKFILE_AUDIT_SKIP=1 with a loud GitHub Actions warning. * Add npm cooldown + new-install-script gate + Dependabot cooldown Pins min-release-age=7 (npm 11.10+) in repo-root and studio/frontend .npmrc, adds scripts/check_new_install_scripts.py to fail PRs that add a postinstall dep, ships a new security-audit job for npm audit signatures plus the diff, and extends .github/dependabot.yml with cooldown stanzas. Pin @tanstack/react-router to 1.169.9 per GHSA- g7cv-rxg3-hmpx; lockfile regen deferred until that release lands on npm. tests/security gains 4 new tests; full suite 26/26 green. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(security): fix tanstack pin, exec bits, expand IOC tables to @uipath/@squawk full - Revert --ignore-scripts on Studio install workflows: vite build needs esbuild's native postinstall (per PR #5392 rationale). Keep --ignore-scripts on security-audit.yml's standalone npm audit job. - Pin @tanstack/react-router to the actual published 1.169.2 (was a forward-looking 1.169.9 that does not exist on npm; broke npm ci). - Drop redundant repo-root .npmrc; studio/frontend/.npmrc covers the only npm project today (root cooldown re-instate via dependabot.yml). - Restore exec bits on 7 files my filesystem stripped during cherry-pick. - Expand BLOCKED_NPM_VERSIONS with full safedep.io + Aikido enumeration: 22 @squawk/* packages with 5 versions each (110 entries; previously 3 entries with 1 version each), and 66 @uipath/* packages (entirely missing before). Mirror in scripts/lockfile_supply_chain_audit.py. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/security: suppress CodeQL py/incomplete-url-substring-sanitization The two flagged 'X' in Y assertions are NOT URL sanitization checks. They verify our scanner WROTE a known IOC literal into its stdout / Finding.evidence, which is the opposite of an attack surface -- matching the scanner's output is precisely what catches the worm. Inline lgtm[] suppression with a 4-line rationale comment above each. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * scripts/scan_*: expand IOC tables with Aikido full 169-pkg enumeration Per Aikido 2026-05-12 disclosure (373 malicious package-version entries across 169 npm package names), add to BLOCKED_NPM_VERSIONS: - @mistralai/* npm scope (3 packages, 9 versions) -- separate from the PyPI mistralai package already in BLOCKED_PYPI_VERSIONS - @tallyui/* (10 packages, 30 entries) - @beproduct/nestjs-auth (18 versions 0.1.2..0.1.19) - @draftlab/* + @draftauth/* (5 packages) - @taskflow-corp/cli, @tolka/cli, @ml-toolkit-ts/*, @mesadev/*, @dirigible-ai/sdk, @supersurkhet/* - 10 unscoped packages (safe-action, ts-dna, cross-stitch, cmux-agent-mcp, agentwork-cli, git-branch-selector, wot-api, git-git-git, nextmove-mcp, ml-toolkit-ts) Also add to KNOWN_IOC_STRINGS / NPM_IOC_STRINGS: - router_init.js SHA-256 ab4fcadaec49c03278063dd269ea5eef82d24f2124a8e15d7b90f2fa8601266c - tanstack_runner.js SHA-256 2ec78d556d696e208927cc503d48e4b5eb56b31abc2870c2ed2e98d6be27fc96 - bun run tanstack_runner.js marker (the new Bun-prepare-script dropper invocation pattern unique to this wave) Total: 170 packages, 401 versions blocklisted. Studio lockfile still scans clean (0 findings, 0 hard errors). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * scripts/scan_*: web-verification additions (@tanstack/setup, intercom-client) Two findings from cross-checking BLOCKED_NPM_VERSIONS / KNOWN_IOC_STRINGS against GHSA-g7cv-rxg3-hmpx + Aikido + safedep.io + Socket + Semgrep. - Fix asymmetry: @tanstack/setup IOC string was in lockfile_supply_chain_audit.py's NPM_IOC_STRINGS but missing from scan_npm_packages.py's KNOWN_IOC_STRINGS. The literal is the malicious optional-dependency name used by the May-12 TanStack wave; no legitimate npm package of this name exists. - Add intercom-client@7.0.4: the npm counterpart of the lightning 2.6.2/2.6.3 PyPI compromise (Apr-30 wave). Same threat actor (TeamPCP). Confirmed by Semgrep, Aikido, OX Security, Resecurity, Kodem. Safe version is 7.0.3 and earlier. Total BLOCKED_NPM_VERSIONS: 171 packages / 402 versions. Both files remain byte-identical. Studio lockfile still scans clean. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci(security): add workflow-trigger lint refusing pull_request_target + cache-poisoning vectors The two patterns that together powered GHSA-g7cv-rxg3-hmpx (TanStack Mini Shai-Hulud) are now gated at PR time: 1. pull_request_target -- the worm chain started with a fork PR that ran in the base-repo context. Every workflow in this repo today uses 'pull_request' (safe); the lint refuses any new pull_request_target additions outright. workflow_run is restricted, allowed only with an explicit allow-comment. 2. Shared cache keys between PR-triggered workflows and the publish workflow (release-desktop.yml). The TanStack attack chain poisoned a shared Actions cache from a fork PR; the legitimate release workflow then restored the poisoned cache. The lint refuses any cache key that appears in both a PR-triggered workflow and a workflow_dispatch-only / publish workflow. Current tree is clean: 0 pull_request_target, 0 workflow_run, 0 PR-publish cache-key collisions across all 24 workflows. The lint locks that invariant in place. Files: + scripts/lint_workflow_triggers.py (~200 LOC, stdlib + PyYAML) + tests/security/test_lint_workflow_triggers.py (5 tests covering current-tree pass, pull_request_target reject, workflow_run restricted, justified workflow_run accept, cache-key collision reject) ~ .github/workflows/security-audit.yml: new workflow-trigger-lint job, no continue-on-error, harden-runner block-mode, PyYAML only runtime dep. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * security: fix tests-security CI job + CodeQL false-positives Two CI failures on the prior push: 1. pytest tests/security -- 5 lint regression tests failed because scripts/lint_workflow_triggers.py imports PyYAML which is not in the bare runner's Python env. Added pyyaml==6.0.2 to the pip install step alongside pytest. (29 scanner tests already passed.) 2. CodeQL py/incomplete-url-substring-sanitization fired on two test assertions that check the scanner WROTE the IOC literal to its own stdout/stderr. The rule pattern-matches on `"<host>" in <var>` and cannot distinguish a URL sanitizer from a regression-test evidence check. Previous `# lgtm[...]` inline suppressions were detached from the operator when pre-commit reformatted the assert across multiple lines. Rebuilt the IOC literals at runtime (`"git-tanstack." + "com"`) so no URL-shaped source literal appears on the `in` operator line; rule cannot trigger. Verified locally: `pytest tests/security -v` -> 34 passed in 2.70s. * security(studio): defensive .npmrc cooldown aliases + save-exact Two additions to studio/frontend/.npmrc to harden the existing `min-release-age=7` (Mini Shai-Hulud defence): 1. `minimum-release-age=10080` (minutes) -- defensive alias for the same 7-day floor. Some npm versions / wrappers consult one key but not the other; setting both prevents a single upstream setting-name parse change from silently disabling the cooldown. The two keys MUST agree (do not let them drift). 2. `save-exact=true` -- refuses to write back `^x.y.z` ranges into package.json when a maintainer runs `npm install <pkg>` locally. Does NOT rewrite already-present ranges; stops NEW carets from creeping into the manifest as patch-version footguns. Verified: pytest tests/security -> 34 passed in 2.63s. * chore(dependabot): remove dead bun entry for /studio/frontend `package-ecosystem: "bun"` at /studio/frontend was a no-op: that path commits package-lock.json, not bun.lock / bun.lockb, so Dependabot's bun ecosystem silently skipped it. The actual behaviour is unchanged -- the npm entry below the cargo block already owns npm_and_yarn security advisories for /studio/frontend with `open-pull-requests-limit: 0` (version-update PRs suppressed, security PRs flow through). This commit: - Deletes the bun entry (kept a placeholder comment so a future bun migration knows where to slot it back in). - Rewrites the npm /studio/frontend entry comment to explain the real intent: lockfile is the authoritative pin, .npmrc `min-release-age=7` already blocks fresh tarballs at install time, dependabot only needs to surface security advisories. No functional change: same set of dependabot PRs as before (zero version updates, security advisories grouped weekly with cooldown). Verified: pytest tests/security -> 34 passed in 2.67s; YAML parses cleanly via PyYAML. * fix(dependabot): drop unsupported semver-* cooldown keys on github-actions Dependabot's validator rejected the config with: The property '#/updates/0/cooldown/semver-minor-days' is not supported for the package ecosystem 'github-actions'. The property '#/updates/0/cooldown/semver-patch-days' is not supported for the package ecosystem 'github-actions'. The `semver-minor-days` / `semver-patch-days` cooldown knobs are only valid for semver-aware ecosystems (npm, cargo, etc.). The github-actions ecosystem pins via git tags / SHAs, not semver, so only `default-days` is honored. Pre-existing bug on main; surfaced on this PR because the prior commit re-validated the file. Behaviour: github-actions PRs now respect the 7-day cooldown floor (was already the intent), without the no-op semver bands. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
415 lines
14 KiB
Python
415 lines
14 KiB
Python
#!/usr/bin/env python
|
|
# coding: utf-8
|
|
"""
|
|
Convert Jupyter notebooks (.ipynb) to executable Python scripts (.py).
|
|
|
|
Converts IPython magics to plain Python:
|
|
!command -> subprocess.run('command', shell=True)
|
|
%cd path -> os.chdir('path')
|
|
%env VAR=value -> os.environ['VAR'] = 'value'
|
|
%%file filename -> with open('filename', 'w') as f: f.write(...)
|
|
%%capture -> (skipped)
|
|
/content/... -> _WORKING_DIR + /...
|
|
"""
|
|
|
|
import nbformat
|
|
import re
|
|
import shlex
|
|
import sys
|
|
import os
|
|
import urllib.request
|
|
import urllib.parse
|
|
from pathlib import Path
|
|
|
|
|
|
# Hosts we are willing to fetch raw notebook JSON from. Anything else
|
|
# is rejected before `urlopen` so a typoed / hostile URL cannot pull
|
|
# code from arbitrary infrastructure.
|
|
_ALLOWED_NOTEBOOK_HOSTS = {
|
|
"raw.githubusercontent.com",
|
|
"gist.githubusercontent.com",
|
|
}
|
|
|
|
|
|
# Shell metacharacters that imply the cell's `!cmd` line cannot be
|
|
# parsed as a flat argv. If any of these appears, `shlex.split` would
|
|
# either fail or, worse, silently strip the operator -- so we keep
|
|
# `shell=True` for that command and emit a review marker.
|
|
_SHELL_METACHARS_RE = re.compile(r"\$\(|`|\|\||\||&&|>>?|<<?|\*|\?|;")
|
|
|
|
|
|
def needs_fstring(cmd: str) -> bool:
|
|
"""Check if command has Python variable interpolation like {var_name}."""
|
|
pattern = r"(?<!\$)\{([a-zA-Z_][a-zA-Z0-9_]*)\}"
|
|
return bool(re.search(pattern, cmd))
|
|
|
|
|
|
def github_blob_to_raw(url: str) -> str:
|
|
"""Convert GitHub blob URL to raw URL."""
|
|
# https://github.com/user/repo/blob/branch/path
|
|
# -> https://raw.githubusercontent.com/user/repo/branch/path
|
|
# Compare the parsed host exactly (not as a substring) so a URL
|
|
# like https://attacker.example.com/github.com/blob/... does NOT
|
|
# get rewritten to a github raw URL. Closes CodeQL alert
|
|
# py/incomplete-url-substring-sanitization.
|
|
parsed = urllib.parse.urlparse(url)
|
|
if parsed.netloc != "github.com" or "/blob/" not in parsed.path:
|
|
return url
|
|
new_path = parsed.path.replace("/blob/", "/", 1)
|
|
return urllib.parse.urlunparse(
|
|
parsed._replace(netloc = "raw.githubusercontent.com", path = new_path)
|
|
)
|
|
|
|
|
|
def download_notebook(url: str) -> tuple[str, str]:
|
|
"""Download notebook from URL. Returns (content, filename)."""
|
|
# Convert blob URL to raw if needed
|
|
raw_url = github_blob_to_raw(url)
|
|
|
|
# Extract filename from URL
|
|
parsed = urllib.parse.urlparse(raw_url)
|
|
filename = os.path.basename(urllib.parse.unquote(parsed.path))
|
|
|
|
# Host allowlist. Refuse to fetch from anywhere the campaign IOC
|
|
# tables flag (or just anywhere we don't recognise). The blob->raw
|
|
# conversion above only emits `raw.githubusercontent.com`, so a
|
|
# rejection here means the caller hand-typed a URL pointing
|
|
# somewhere we don't trust.
|
|
host = parsed.hostname
|
|
if host not in _ALLOWED_NOTEBOOK_HOSTS:
|
|
raise ValueError(
|
|
f"Refused notebook fetch from {host!r}: not in allowlist "
|
|
f"{sorted(_ALLOWED_NOTEBOOK_HOSTS)}"
|
|
)
|
|
|
|
# Download
|
|
print(f"Downloading {url}...")
|
|
with urllib.request.urlopen(raw_url, timeout = 60) as response:
|
|
content = response.read().decode("utf-8")
|
|
|
|
return content, filename
|
|
|
|
|
|
def is_url(path: str) -> bool:
|
|
"""Check if path is a URL."""
|
|
return path.startswith("http://") or path.startswith("https://")
|
|
|
|
|
|
def replace_colab_paths(source: str) -> str:
|
|
"""Replace Colab-specific /content/ paths with current working directory."""
|
|
# Replace /content/ with f-string using _WORKING_DIR
|
|
source = source.replace('"/content/', 'f"{_WORKING_DIR}/')
|
|
source = source.replace("'/content/", "f'{_WORKING_DIR}/")
|
|
return source
|
|
|
|
|
|
def _emit_shell_command(indent: str, full_cmd: str, *, allow_shell: bool) -> list[str]:
|
|
"""Render a `!cmd` notebook line as one or more Python statements.
|
|
|
|
When the command body is f-string-interpolated, contains shell
|
|
metacharacters, or spans multiple lines, falling back to
|
|
`shell=True` is the only correct option -- `shlex.split` would
|
|
either drop operators or fail outright. We surface that with a
|
|
`# WARNING: shell=True; reviewed for hostile input` comment so a
|
|
reviewer cannot miss it.
|
|
|
|
Otherwise we emit `subprocess.run(shlex.split(cmd), shell=False)`
|
|
so the converted script is not a re-injection vector if the
|
|
notebook ever interpolates user-controlled data.
|
|
|
|
`allow_shell` defaults to True at the CLI for backwards
|
|
compatibility. Setting it to False makes `shell=True` emission a
|
|
hard error (no surprise behaviour).
|
|
"""
|
|
needs_f = needs_fstring(full_cmd)
|
|
has_meta = bool(_SHELL_METACHARS_RE.search(full_cmd))
|
|
multiline = "\n" in full_cmd
|
|
|
|
must_use_shell = needs_f or has_meta or multiline
|
|
|
|
if must_use_shell:
|
|
if not allow_shell:
|
|
raise ValueError(
|
|
"Cell uses shell metacharacters / interpolation but "
|
|
"--no-allow-shell was set; refusing to emit shell=True"
|
|
)
|
|
warn = f"{indent}# WARNING: shell=True; reviewed for hostile input"
|
|
f_prefix = "f" if needs_f else ""
|
|
if multiline:
|
|
escaped_cmd = full_cmd.replace('"""', r"\"\"\"")
|
|
if escaped_cmd.rstrip().endswith('"'):
|
|
escaped_cmd = escaped_cmd.rstrip() + " "
|
|
stmt = f'{indent}subprocess.run({f_prefix}"""{escaped_cmd}""", shell=True)'
|
|
else:
|
|
stmt = f"{indent}subprocess.run({f_prefix}{full_cmd!r}, shell=True)"
|
|
return [warn, stmt]
|
|
|
|
# Shell-safe argv form.
|
|
return [f"{indent}subprocess.run(shlex.split({full_cmd!r}), shell=False)"]
|
|
|
|
|
|
def convert_cell_to_python(source: str, *, allow_shell: bool = True) -> str:
|
|
"""Convert a cell's IPython magics to plain Python."""
|
|
lines = source.split("\n")
|
|
result = []
|
|
i = 0
|
|
|
|
while i < len(lines):
|
|
line = lines[i]
|
|
stripped = line.strip()
|
|
indent = line[: len(line) - len(line.lstrip())]
|
|
|
|
# Skip %%capture
|
|
if stripped.startswith("%%capture"):
|
|
i += 1
|
|
continue
|
|
|
|
# Handle %%file magic
|
|
if stripped.startswith("%%file "):
|
|
filename = stripped[7:].strip()
|
|
file_lines = []
|
|
i += 1
|
|
while i < len(lines):
|
|
file_lines.append(lines[i])
|
|
i += 1
|
|
file_content = "\n".join(file_lines)
|
|
file_content = file_content.replace('"""', r"\"\"\"")
|
|
result.append(f'{indent}with open({filename!r}, "w") as _f:')
|
|
result.append(f'{indent} _f.write("""{file_content}""")')
|
|
continue
|
|
|
|
# Handle ! shell commands
|
|
if stripped.startswith("!"):
|
|
cmd_lines = [stripped[1:]]
|
|
while cmd_lines[-1].rstrip().endswith("\\") and i + 1 < len(lines):
|
|
i += 1
|
|
cmd_lines.append(lines[i].strip())
|
|
full_cmd = "\n".join(cmd_lines)
|
|
|
|
result.extend(
|
|
_emit_shell_command(indent, full_cmd, allow_shell = allow_shell)
|
|
)
|
|
|
|
# %cd path -> os.chdir(path)
|
|
elif stripped.startswith("%cd "):
|
|
path = stripped[4:].strip()
|
|
result.append(f"{indent}os.chdir({path!r})")
|
|
|
|
# %env VAR=value
|
|
elif stripped.startswith("%env ") and "=" in stripped:
|
|
match = re.match(r"%env\s+(\w+)=(.+)", stripped)
|
|
if match:
|
|
var, val = match.groups()
|
|
result.append(f"{indent}os.environ[{var!r}] = {val!r}")
|
|
|
|
# %env VAR
|
|
elif stripped.startswith("%env "):
|
|
var = stripped[5:].strip()
|
|
result.append(f"{indent}os.environ.get({var!r})")
|
|
|
|
# %pwd
|
|
elif stripped == "%pwd":
|
|
result.append(f"{indent}os.getcwd()")
|
|
|
|
else:
|
|
result.append(line)
|
|
|
|
i += 1
|
|
|
|
return "\n".join(result)
|
|
|
|
|
|
def convert_notebook(
|
|
notebook_content: str,
|
|
source_name: str = "notebook",
|
|
*,
|
|
allow_shell: bool = True,
|
|
) -> str:
|
|
"""Convert notebook JSON content to Python script."""
|
|
# Parse notebook
|
|
if isinstance(notebook_content, str):
|
|
notebook = nbformat.reads(notebook_content, as_version = 4)
|
|
else:
|
|
notebook = notebook_content
|
|
|
|
lines = [
|
|
"#!/usr/bin/env python",
|
|
"# coding: utf-8",
|
|
f"# Converted from: {source_name}",
|
|
"",
|
|
"import shlex",
|
|
"import subprocess",
|
|
"import os",
|
|
"import sys",
|
|
"import re",
|
|
"",
|
|
"# Capture original packages before any installs",
|
|
"_original_packages = subprocess.run(",
|
|
" [sys.executable, '-m', 'pip', 'freeze'],",
|
|
" capture_output=True, text=True",
|
|
").stdout",
|
|
"",
|
|
"# Working directory (replaces Colab's /content/)",
|
|
"_WORKING_DIR = os.getcwd()",
|
|
"",
|
|
]
|
|
|
|
for cell in notebook.cells:
|
|
source = cell.source.strip()
|
|
if not source:
|
|
continue
|
|
|
|
if cell.cell_type == "code":
|
|
converted = convert_cell_to_python(source, allow_shell = allow_shell)
|
|
converted = replace_colab_paths(converted)
|
|
lines.append(converted)
|
|
lines.append("")
|
|
|
|
elif cell.cell_type == "markdown":
|
|
for line in source.split("\n"):
|
|
lines.append(f"# {line}")
|
|
lines.append("")
|
|
|
|
# Add package restoration at the end
|
|
lines.extend(
|
|
[
|
|
"",
|
|
"# Restore original packages (install one by one, skip failures)",
|
|
"for _pkg in _original_packages.strip().split('\\n'):",
|
|
" if _pkg:",
|
|
" subprocess.run([sys.executable, '-m', 'pip', 'install', _pkg, '-q'],",
|
|
" stderr=subprocess.DEVNULL)",
|
|
"",
|
|
]
|
|
)
|
|
|
|
return "\n".join(lines)
|
|
|
|
|
|
def convert_notebook_to_script(
|
|
source: str,
|
|
output_dir: str | None = None,
|
|
*,
|
|
allow_shell: bool = True,
|
|
):
|
|
"""
|
|
Convert a notebook to Python script.
|
|
|
|
Args:
|
|
source: Local file path or URL to notebook
|
|
output_dir: Output directory (optional, defaults to current directory)
|
|
allow_shell: When False, refuse to emit `shell=True` for any
|
|
`!cmd` cell that uses metacharacters / interpolation.
|
|
"""
|
|
if is_url(source):
|
|
content, filename = download_notebook(source)
|
|
source_name = source
|
|
else:
|
|
filename = os.path.basename(source)
|
|
with open(source, "r", encoding = "utf-8") as f:
|
|
content = f.read()
|
|
source_name = source
|
|
|
|
# Generate output filename
|
|
output_filename = filename.replace(".ipynb", ".py")
|
|
# Clean up filename
|
|
output_filename = (
|
|
output_filename.replace("(", "").replace(")", "").replace("-", "_")
|
|
)
|
|
|
|
# Add output directory if specified
|
|
if output_dir:
|
|
output_path = os.path.join(output_dir, output_filename)
|
|
else:
|
|
output_path = output_filename
|
|
|
|
# Convert
|
|
script = convert_notebook(content, source_name, allow_shell = allow_shell)
|
|
|
|
# Write output
|
|
with open(output_path, "w", encoding = "utf-8") as f:
|
|
f.write(script)
|
|
|
|
print(f"Converted {source} -> {output_path}")
|
|
return output_path
|
|
|
|
|
|
def main():
|
|
import argparse
|
|
|
|
class Formatter(
|
|
argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter
|
|
):
|
|
pass
|
|
|
|
parser = argparse.ArgumentParser(
|
|
description = __doc__,
|
|
formatter_class = Formatter,
|
|
epilog = """
|
|
Examples:
|
|
python notebook_to_python.py notebook.ipynb
|
|
python notebook_to_python.py -o scripts/ notebook1.ipynb notebook2.ipynb
|
|
python notebook_to_python.py --output ./converted https://github.com/user/repo/blob/main/notebook.ipynb
|
|
python notebook_to_python.py https://github.com/unslothai/notebooks/blob/main/nb/Oute_TTS_(1B).ipynb
|
|
""",
|
|
)
|
|
parser.add_argument(
|
|
"notebooks", nargs = "+", help = "Notebook files or URLs to convert."
|
|
)
|
|
parser.add_argument(
|
|
"-o", "--output", dest = "output_dir", default = ".", help = "Output directory."
|
|
)
|
|
# Default True for backwards compatibility: existing Colab notebooks
|
|
# routinely use pipes / redirection / interpolation in `!cmd` lines
|
|
# and the converted script needs to keep working. Operators who
|
|
# convert untrusted notebooks should pass --no-allow-shell to force
|
|
# a hard error on every metacharacter-bearing cell.
|
|
parser.add_argument(
|
|
"--allow-shell",
|
|
dest = "allow_shell",
|
|
action = "store_true",
|
|
default = True,
|
|
help = "Allow emitting subprocess.run(..., shell=True) for cells "
|
|
"that use shell metacharacters or interpolation (default).",
|
|
)
|
|
parser.add_argument(
|
|
"--no-allow-shell",
|
|
dest = "allow_shell",
|
|
action = "store_false",
|
|
help = "Refuse to emit shell=True; cells with metacharacters error out.",
|
|
)
|
|
|
|
args = parser.parse_args()
|
|
|
|
# Create output directory if needed
|
|
os.makedirs(args.output_dir, exist_ok = True)
|
|
|
|
# SF2: track per-notebook failures so a CI invocation that converts
|
|
# 10 notebooks but silently fails on 3 is no longer reported as
|
|
# success. Each failure is collected and the loop continues so the
|
|
# caller sees the full set; final exit status is 1 if anything
|
|
# failed.
|
|
failures: list[tuple[str, str]] = []
|
|
ok = 0
|
|
total = len(args.notebooks)
|
|
for source in args.notebooks:
|
|
try:
|
|
convert_notebook_to_script(
|
|
source,
|
|
output_dir = args.output_dir if args.output_dir != "." else None,
|
|
allow_shell = args.allow_shell,
|
|
)
|
|
ok += 1
|
|
except Exception as e:
|
|
print(f"ERROR converting {source}: {e}")
|
|
failures.append((source, f"{type(e).__name__}: {e}"))
|
|
|
|
print(
|
|
f"converted {ok}/{total}, {len(failures)} failed",
|
|
file = sys.stderr if failures else sys.stdout,
|
|
)
|
|
sys.exit(1 if failures else 0)
|
|
|
|
|
|
if __name__ == "__main__":
|
|
main()
|