2 commits

Author SHA1 Message Date
Daniel Han
fa3840cf6d scripts: harden github_blob_to_raw against substring URL spoofing
CodeQL flagged the check at scripts/notebook_to_python.py:33,
`if "github.com" in url and "/blob/" in url`, as
py/incomplete-url-substring-sanitization: "github.com" can sit
anywhere in the URL, so an attacker-controlled URL like
https://attacker.example.com/github.com/blob/x would be rewritten
to a raw.githubusercontent.com URL and fetched as if it were a
real GitHub blob.

Switch to urllib.parse.urlparse and require parsed.netloc ==
"github.com" exactly, then rebuild the URL with urlunparse from the
parsed components (only the first /blob/ in the path is replaced
with /). Query strings and fragments now round-trip correctly too,
fixing an incidental bug in the old string-replace path.
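
A minimal sketch of the hardened helper, assuming it returns non-GitHub
URLs unchanged and targets raw.githubusercontent.com as described above.
The function name comes from the commit subject; the exact signature and
the fall-through behaviour are assumptions, not the committed code.

```python
from urllib.parse import urlparse, urlunparse


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com blob URL to its raw.githubusercontent.com form.

    Assumption: any URL that is not an exact github.com blob URL is
    returned untouched rather than raising.
    """
    parsed = urlparse(url)
    # Compare the host exactly; a substring test would also match
    # "github.com" appearing in the path or in a look-alike hostname.
    if parsed.netloc != "github.com" or "/blob/" not in parsed.path:
        return url
    # Replace only the first /blob/ so later path segments are preserved.
    raw_path = parsed.path.replace("/blob/", "/", 1)
    # Query string and fragment are carried over by urlunparse.
    return urlunparse(
        parsed._replace(netloc="raw.githubusercontent.com", path=raw_path)
    )
```

With this shape, the spoofed URL from the alert is left alone because
its netloc is attacker.example.com, while a genuine blob URL keeps its
query string and fragment after the rewrite.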

Closes the high-severity CodeQL alert on PR head 08235625.
2026-05-08 02:43:47 +00:00
Daniel Han
bfb5c2872c CI(notebooks): cross-repo validator for unslothai/notebooks
New PR-time + scheduled workflow that walks every nb/, kaggle/, and
original_template/ notebook in unslothai/notebooks and statically
validates the install cells and user-facing code against:

  - googlecolab/backend-info pip-freeze.gpu.txt (Colab oracle, refreshed
    on every run; fallback snapshot committed under scripts/data/).
  - PyPI metadata for transitive constraint resolution.
  - Hardcoded torch/torchcodec ABI table.
  - Hardcoded peft/torchao floor table.
  - The live unsloth + trl API surface, introspected under
    tests/_zoo_aggressive_cuda_spoof.py so the api job runs on a
    GPU-less ubuntu-latest runner.
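
As an illustration of how a hardcoded floor table might be checked, the
sketch below encodes the one floor this message states (peft 0.19+ needs
torchao 0.16.0+). The names FLOORS, parse_version, and check_floors, the
pins dictionary, and the message wording are all hypothetical; the real
validator may structure this differently.

```python
def parse_version(text, width=3):
    """Turn '0.19.1' into (0, 19, 1), padding with zeros to `width`.

    Simplification: anything after the leading digits of a component
    (such as '+cu128' or 'dev0') is ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts = parts[:width]
    return tuple(parts + [0] * (width - len(parts)))


# Hypothetical table layout: (rule id, package, trigger, dependency, floor).
FLOORS = [
    ("R-INST-003", "peft", (0, 19, 0), "torchao", (0, 16, 0)),
]


def check_floors(pins):
    """Return one message per violated floor; `pins` maps name -> version."""
    violations = []
    for rule, pkg, trigger, dep, floor in FLOORS:
        if pkg not in pins or dep not in pins:
            continue  # nothing to compare when either pin is absent
        if parse_version(pins[pkg]) >= trigger and parse_version(pins[dep]) < floor:
            need = ".".join(str(n) for n in floor)
            violations.append(f"{rule}: {pkg} {pins[pkg]} requires {dep} >= {need}")
    return violations
```

Padding every version to the same width keeps a pin such as "0.16" from
comparing as lower than "0.16.0".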

Catches the bug classes from notebooks#258 / #260 / #261 / #264 / #221
and commit 51b1462 mechanically:

  R-INST-001  forbid git+ HEAD installs (notebooks#221)
  R-INST-002  --no-deps + transitive constraint violation
  R-INST-003  peft 0.19+ requires torchao 0.16.0+ (notebooks#258)
  R-INST-004  torch <-> torchcodec ABI mismatch (notebooks#261a)
  R-INST-005  --no-deps transformers + Colab tokenizers drift
              (notebooks#261b / #264)
  R-INST-006  forbid !!pip
  R-API-003   adamw_torch_fused -> adamw_8bit hint (warning)
  R-API-004   notebook references symbols outside live unsloth surface
  R-EXC-001   DONT_UPDATE_EXCEPTIONS notebooks must satisfy the same
              policy clauses as generated notebooks (notebooks#260)
  R-DRIFT-001 update_all_notebooks.py emits no diff (commit 51b1462)
  R-CONV-001  notebook_to_python.py converts every .ipynb cleanly
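
To make two of the install rules concrete, here is a rough sketch of a
per-line check for R-INST-001 and R-INST-006. It reads "git+ HEAD
install" as a git+ URL with no @<ref> suffix, which is an interpretation
of the rule name; lint_install_line and its return format are
hypothetical and not taken from scripts/notebook_validator.py.

```python
from urllib.parse import urlparse


def lint_install_line(line):
    """Return the rule ids (as strings) that a single install line trips."""
    findings = []
    stripped = line.strip()
    # R-INST-006: the doubled shell escape is forbidden outright.
    if stripped.startswith("!!pip"):
        findings.append("R-INST-006")
    for token in stripped.split():
        token = token.strip("\"'")  # tolerate quoted requirement strings
        if not token.startswith("git+"):
            continue
        # Look for the ref only in the URL path, so an "@" in the
        # userinfo part (git+ssh://git@host/...) is not mistaken for one.
        path = urlparse(token[len("git+"):]).path
        if "@" not in path:
            findings.append("R-INST-001")
    return findings
```

This sketch accepts any @<ref>, including a branch name; a stricter
policy could additionally require the ref to look like a commit hash or
a tag.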

Files:
  .github/workflows/notebooks-ci.yml          PR-time + cron + dispatch
  scripts/notebook_validator.py               1148 LOC, single-file
  scripts/notebook_to_python.py               battle-tested converter
  scripts/data/colab_pip_freeze.gpu.txt       fallback snapshot
  scripts/data/colab_to_cpu_pin.json          cu128 -> CPU wheel map
  tests/notebooks/test_validator_fixtures.py  21 golden tests, all green

CPU-only by design. The api-introspect job follows the existing
consolidated-tests-ci spoof pattern (lines 309/417/536/626/826/1081/
1586/1998 of consolidated-tests-ci.yml). The smoke-install job is
opt-in via workflow_dispatch and stubs torchcodec since no CPU wheel
exists.

Validated on the live unslothai/notebooks@7af0ac0f tree: every fixture
test passes, the exceptions check is silent, and lint surfaces 27
errors + 6 warnings on real notebooks (a mix of #258-class regressions
in 6 nb/ notebooks that the previous template fixes did not reach,
plus 14 git+-HEAD installs in hand-tuned exception notebooks).
2026-05-07 11:42:57 +00:00