feat(core): PDF text extraction fallback and Jupyter notebook parsing (#3160)

* feat(core): add PDF text extraction fallback and Jupyter notebook parsing

For text-only models (qwen3-coder, deepseek) that lack PDF modality support,
read_file now falls back to pdftotext (poppler-utils) for text extraction
instead of returning an unsupported error. A new `pages` parameter enables
paginated PDF reading (e.g. "1-5", "3-").

Also adds structured .ipynb parsing — notebooks are displayed as labeled
cells with code blocks and execution outputs rather than raw JSON.

Key changes:
- New utils/pdf.ts: pdftotext integration with availability caching,
  page range parsing, 5MB maxBuffer, and 100K char output truncation
- New utils/notebook.ts: .ipynb JSON parser with per-cell output
  truncation (10K chars) and overall notebook truncation (100K chars)
- Modified fileUtils.ts: new 'notebook' FileType, PDF fallback logic,
  pages parameter threading
- Modified read-file.ts: pages parameter in schema/validation/execution
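As a rough illustration of the structured .ipynb rendering described above (a simplified sketch only: the real utils/notebook.ts also renders execution outputs and applies the 10K/100K truncation limits, and its function names may differ):

```typescript
// Hypothetical sketch of labeled-cell notebook rendering.
// Cell labels and fencing are illustrative, not the shipped format.
interface NotebookCell {
  cell_type: "code" | "markdown" | "raw";
  source: string | string[];
}

function renderNotebook(raw: string): string {
  const nb = JSON.parse(raw) as { cells?: NotebookCell[] };
  const parts: string[] = [];
  (nb.cells ?? []).forEach((cell, i) => {
    // nbformat allows source as either a string or an array of lines.
    const src = Array.isArray(cell.source) ? cell.source.join("") : cell.source;
    if (cell.cell_type === "code") {
      parts.push(`## Cell ${i + 1} (code)\n\`\`\`python\n${src}\n\`\`\``);
    } else {
      parts.push(`## Cell ${i + 1} (${cell.cell_type})\n${src}`);
    }
  });
  return parts.join("\n\n");
}
```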

* fix(core): avoid circular dependency via shell-utils in pdf.ts

pdf.ts was importing execCommand from shell-utils.ts, which transitively
pulled in tool-utils.ts → ../index.js (barrel), creating a circular
dependency that caused AuthType to be undefined during vitest module
initialization in 46 test files.

Replace with a local execFile wrapper that has no transitive dependencies
beyond node:child_process.
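A minimal sketch of such a wrapper, assuming only node:child_process (the actual pdf.ts implementation may differ in shape and options):

```typescript
import { execFile } from "node:child_process";

// Local, dependency-free execFile wrapper: nothing here imports from the
// package barrel, so no circular dependency can form. Timeout and
// maxBuffer values mirror the limits described elsewhere in this PR.
function run(
  cmd: string,
  args: string[],
): Promise<{ stdout: string; stderr: string }> {
  return new Promise((resolve, reject) => {
    execFile(
      cmd,
      args,
      { timeout: 30_000, maxBuffer: 5 * 1024 * 1024 },
      (err, stdout, stderr) => {
        if (err) reject(Object.assign(err, { stdout, stderr }));
        else resolve({ stdout, stderr });
      },
    );
  });
}
```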

* fix(core): use optional call on getContentGeneratorConfig

Moving the modalities computation outside the if-block caused
readManyFiles.test.ts to fail because its mock config doesn't implement
getContentGeneratorConfig — previously the method was only called for
media files (image/pdf/audio/video), never for text files.

Use ?.() to gracefully fall back to an empty modalities object when the
method is not defined.
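The guard looks roughly like this (interface and property names are assumed from the surrounding text, not verified against the actual Config type):

```typescript
// Hypothetical partial-config shape: test mocks may omit the method.
interface PartialConfig {
  getContentGeneratorConfig?: () => { modalities?: Record<string, boolean> };
}

function getModalities(config: PartialConfig): Record<string, boolean> {
  // ?.() invokes the method only if it is defined; otherwise the whole
  // chain evaluates to undefined and ?? supplies the empty fallback.
  return config.getContentGeneratorConfig?.()?.modalities ?? {};
}
```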

* fix(core): reject open-ended PDF page ranges to enforce 20-page limit

Previously, parsePDFPageRange returned lastPage: Infinity for open-ended
ranges like "3-", which bypassed the 20-page validation check and caused
pdftotext to extract from the start page to EOF. This violated the
documented "Max 20 pages per request" contract.

Now validation explicitly rejects open-ended ranges with a helpful
message telling users to specify an explicit end page within the limit.
The pages parameter schema description and interface comment are also
updated to reflect this constraint.

* fix(core): tighten parsePDFPageRange to reject malformed tokens

parseInt() silently truncates invalid input, so values like "1-2-3",
"5abc", "1-2x", "1x-2", and "1.5" were accepted and then interpreted
as the wrong range (e.g. "1-2-3" parsed as 1-2). Switch to regex-based
whole-string validation so any non-matching input returns null at
ReadFileTool.build() time instead of reaching pdftotext.

* fix(core): surface processSingleFileContent errors in readManyFiles

readManyFiles previously dropped any file whose processSingleFileContent
result carried an error, so users only saw "No files matching the
criteria were found or all were skipped." This hid actionable guidance
such as the pdftotext-not-installed install hint, password-protected
PDF notices, and the >10MB size-limit message.

Now the per-file error message (already a human-readable string in
llmContent) is included as a content part, so batch reads surface the
same guidance as single-file reads.

* fix(core): tolerate whitespace around hyphen in parsePDFPageRange

The strict regex introduced in the previous commit stopped accepting
inputs like "1 - 5" or "3 -", which the old parseInt-based parser
handled (parseInt skips leading whitespace). Allow optional \s* on each
side of the hyphen while still rejecting malformed trailing tokens such
as "5abc" and "1-2-3".
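Pulling the last three commits together, the parser behaves roughly like this sketch (signature and the treatment of a bare page number are assumptions; the shipped parsePDFPageRange may differ, and the later audit commit additionally caps page numbers at 1,000,000):

```typescript
// Whole-string regex validation: optional whitespace around the hyphen
// is tolerated, malformed tokens and open-ended ranges return null.
function parsePageRange(
  input: string,
): { first: number; last: number } | null {
  const m = /^\s*(\d+)\s*(?:-\s*(\d+))?\s*$/.exec(input);
  if (m === null) return null; // rejects "5abc", "1-2-3", "1x-2", "1.5", "3-"
  const first = Number(m[1]);
  const last = m[2] !== undefined ? Number(m[2]) : first; // bare "3" = one page
  if (first < 1 || last < first) return null;
  if (last - first + 1 > 20) return null; // documented 20-page limit
  return { first, last };
}
```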

* fix(cli,core): render failed @file reads as Error in atCommandProcessor

The previous commit surfaced per-file errors through readManyFiles, but
FileReadInfo still lacked a status field and atCommandProcessor
hardcoded ToolCallStatus.Success for every entry in result.files. So a
failed read (missing pdftotext, password-protected PDF, >10MB file)
rendered in the UI as if it had succeeded, just with the error text
embedded in the LLM content.

Add an optional `error` field on FileReadInfo, populate it in
readFileContent, and use it in atCommandProcessor to pick
ToolCallStatus.Error plus a resultDisplay string the user can see.

* fix(core): treat pdftotext maxBuffer overrun as truncation

When a text-dense PDF produced more than 5MB of stdout, Node killed the
child and `execFile` delivered the error as `ERR_CHILD_PROCESS_STDIO_MAXBUFFER`,
which fell into the generic `pdftotext failed:` branch — so a perfectly
valid PDF failed instead of returning the usual truncated output.

Detect the maxBuffer error code in the execFile wrapper, and in
extractPDFText use the partial stdout with the existing truncation note.
Also lower the maxBuffer to 2×MAX_PDF_TEXT_OUTPUT_CHARS (from 5MB) since
anything past that is discarded anyway — this also caps RSS for
pathological inputs.
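The error-code check can be sketched like this (ERR_CHILD_PROCESS_STDIO_MAXBUFFER is the real Node.js error code; the function name and truncation note are illustrative):

```typescript
const MAX_PDF_TEXT_OUTPUT_CHARS = 100_000; // 100K char cap from this PR

// Returns usable (truncated) text when the error was a maxBuffer
// overrun with partial stdout; null means "not our case, use the
// generic error path".
function pdftotextOutputFromError(err: {
  code?: string;
  stdout?: string;
}): string | null {
  if (err.code === "ERR_CHILD_PROCESS_STDIO_MAXBUFFER" && err.stdout) {
    return (
      err.stdout.slice(0, MAX_PDF_TEXT_OUTPUT_CHARS) +
      "\n[output truncated at 100000 characters]"
    );
  }
  return null;
}
```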

* fix(core): skip 10MB size gate for PDF text-extraction path

The generic 9.9MB file-size check ran before the pdf branch knew whether
we were taking the base64 inline path or the pdftotext text-extraction
path. That meant `read_file("huge.pdf", pages="1-5")` was rejected up
front even though pdftotext streams through the file and only emits a
capped (100K char) text slice — never loading 15MB into Node memory.

Move the size gate past the fileType/modalities decision point and skip
it when the PDF will go through text extraction (pages parameter set,
or model lacks pdf modality). The base64 inline path still carries its
own encoded-size cap, so oversized PDFs continue to be rejected there.

* fix(core): harden pdftotext wrapper against six audit findings

An adversarial pass over the PDF utilities turned up several issues
that warrant hardening before the PR lands:

- Argument injection (C1): filenames starting with `-` (e.g.
  `-opw=foo.pdf`) are parsed as options by poppler's argv parser when
  passed positionally. Insert `--` before `filePath` in both
  `extractPDFText` and `getPDFPageCount` so the shell's option parser
  stops processing flags. Reproduced locally: `pdftotext -h -` prints
  help while `pdftotext -- -h -` treats `-h` as the input file.

- Brittle availability signal (H1): `isPdftotextAvailable` used
  `stderr.length > 0` as the positive signal, so a sandbox that
  suppresses stderr would cache `false` for the whole process. Switch
  to the exit code.

- Concurrent availability probes (H2): N parallel callers (e.g. an
  `@`-glob of PDFs) each spawned their own `pdftotext -v` before the
  first probe resolved. Cache the in-flight promise.

- Precision-loss bypass of the 20-page cap (H3): `Number()` collapses
  any integer past 2^53 onto the same value, so the string
  `"999999999999999998-999999999999999999"` parsed as a 1-page range
  and slid past the validator. Cap accepted page numbers at 1,000,000.

- Timeout error clarity (M2): 30s timeouts surfaced as the generic
  `pdftotext failed:` branch with empty stderr. Detect SIGTERM/killed
  and emit a dedicated "timed out after 30s" message.

- Over-eager maxBuffer success (M1): the previous commit treated any
  maxBuffer overrun with non-empty stdout as a truncated success. If
  the overrun was driven by stderr spam (password warnings, corrupt-
  PDF diagnostics), that delivered garbage as success. Require at
  least MAX_PDF_TEXT_OUTPUT_CHARS of stdout before treating as
  truncated; otherwise re-run the password/corrupt detectors on the
  captured stderr.

Added regression tests for each.
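Two of the fixes above (C1 and H3) can be sketched in a few lines; the helper names here are illustrative, not the actual utils/pdf.ts API:

```typescript
const MAX_PAGE_NUMBER = 1_000_000; // H3 cap from the audit

// C1: `--` stops poppler's option parsing, so a filename like
// "-opw=evil.pdf" is treated as a positional input, not a flag.
function pdftotextArgs(filePath: string, first: number, last: number): string[] {
  return ["-f", String(first), "-l", String(last), "--", filePath, "-"];
}

// H3: Number() collapses integers past 2^53, so two huge distinct
// strings can alias to the same value and fake a tiny range; cap
// accepted page numbers far below that.
function isSanePage(n: number): boolean {
  return Number.isInteger(n) && n >= 1 && n <= MAX_PAGE_NUMBER;
}
```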

* fix(core): gate non-regular files and oversized PDFs before extraction

Two defense-in-depth guards suggested by the adversarial audit:

- Non-regular files (FIFOs, sockets, /dev/zero, character devices)
  have meaningless `stats.size` (typically 0), so the 10MB size gate
  would happily wave them through. Handing `/dev/zero` to pdftotext
  then produced a 30s-timeout failure after the wrapper streamed
  megabytes into Node. Require `stats.isFile()` before routing into
  any extraction path.

- The previous commit skipped the 10MB gate for the PDF text-
  extraction path so `read_file("huge.pdf", pages="1-5")` could
  work. Unbounded, though, a multi-GB PDF would make pdftotext run
  until the 30s timeout fires. Add a separate 100MB ceiling for the
  extraction path with a guidance error pointing the user at `pages`
  or document splitting. The base64 inline path keeps its own encoded-
  size cap.

Added regression tests for both.
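The two guards combine into something like this sketch (the constant name PDF_EXTRACTION_MAX_MB comes from a later commit in this PR; the function name and error strings are assumptions):

```typescript
import { promises as fs } from "node:fs";

const PDF_EXTRACTION_MAX_MB = 100;

// Returns an error message when the file must not reach pdftotext,
// or null when extraction may proceed.
async function checkExtractable(filePath: string): Promise<string | null> {
  const stats = await fs.stat(filePath);
  if (!stats.isFile()) {
    // FIFOs, sockets, and device files report a meaningless size
    // (typically 0), so the size gate alone would wave them through.
    return "Not a regular file; refusing to extract text.";
  }
  if (stats.size > PDF_EXTRACTION_MAX_MB * 1024 * 1024) {
    return `PDF exceeds ${PDF_EXTRACTION_MAX_MB}MB; use the pages parameter or split the document.`;
  }
  return null;
}
```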

* fix(core): strip ANSI escapes and surface non-text outputs in notebooks

Two notebook-rendering issues surfaced by the audit:

- ipykernel emits ANSI CSI/SGR escape sequences (`\x1B[0;31m...`) in
  error tracebacks by default. Those codes add noise and burn tokens
  without conveying anything useful once we're rendering to plain
  text. Strip them from stream, execute_result, display_data, and
  error outputs.

- Cells whose only output was a non-text MIME type (image/png,
  text/html, application/vnd.jupyter.widget-view+json, ...) were
  silently dropped — the model saw the source code with no indication
  that a plot or HTML block existed. Emit a `[non-text output:
  <mime-types>]` placeholder so the model knows something was there
  without us inlining the payload.
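The placeholder emission looks roughly like this (a sketch only; the real utils/notebook.ts handles more text MIME types, and a later commit additionally filters the keys against the MIME-type grammar):

```typescript
const TEXT_MIMES = new Set(["text/plain"]);

// Render one output's MIME bundle: prefer text/plain (string or
// line-array per nbformat), otherwise name the non-text types so the
// model knows a plot/HTML block existed.
function renderOutputData(data: Record<string, unknown>): string {
  const text = data["text/plain"];
  if (typeof text === "string") return text;
  if (Array.isArray(text)) return text.join("");
  const nonText = Object.keys(data).filter((k) => !TEXT_MIMES.has(k));
  return nonText.length > 0 ? `[non-text output: ${nonText.join(", ")}]` : "";
}
```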

* fix(core): round-2 audit fixes (in-flight cleanup, Windows timeout, ANSI/MIME)

Reverse audit on the previous three commits surfaced four medium-
severity issues plus a polish item:

- isPdftotextAvailable in-flight promise leak: the `.then(...)` cleared
  the cached promise on success but a synchronous throw inside the
  IIFE would have left a rejected promise stuck in the slot forever.
  Switch to `.finally` so the slot is always cleared.

- Timeout detection on Windows: Node's `execFile` `timeout` terminates
  via TerminateProcess on Windows, where `signal` is typically `null`
  rather than `'SIGTERM'`. The previous SIGTERM-only check would let
  Windows timeouts fall through to the generic "pdftotext failed"
  branch. Accept null/undefined signal alongside SIGTERM.

- ANSI regex was CSI-only: missed OSC hyperlinks (`ESC ]8;;url`),
  DCS, APC/PM/SOS, and lone two-byte escapes that ipykernel and
  related tools sometimes emit. Extend the pattern to cover all four
  families.

- Non-text MIME placeholder was attacker-controlled: a malicious
  notebook could set `data: {"\nIGNORE PREVIOUS INSTRUCTIONS\n": ...}`
  and that key would flow unescaped into `[non-text output: ...]`,
  smuggling prompt-injection payload bytes into the LLM context.
  Filter keys against the IANA MIME-type grammar before joining.

- Hoisted PDF_EXTRACTION_MAX_MB to module scope alongside the other
  size constants so it's discoverable in one place.
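The extended ANSI pattern and the MIME-grammar filter can be sketched as follows (both regexes are approximations assembled from the families named above, not the exact shipped patterns):

```typescript
// Covers CSI (ESC [ ... final byte), OSC (ESC ] ... BEL or ESC \),
// DCS/SOS/PM/APC (ESC P/X/^/_ ... ESC \), and lone two-byte Fe escapes.
// CSI/OSC/DCS branches must come first so their introducers are not
// consumed by the two-byte branch.
const ANSI_RE = new RegExp(
  [
    "\\x1B\\[[0-?]*[ -/]*[@-~]",
    "\\x1B\\][^\\x07\\x1B]*(?:\\x07|\\x1B\\\\)?",
    "\\x1B[PX^_][^\\x1B]*(?:\\x1B\\\\)?",
    "\\x1B[@-Z\\\\-_]",
  ].join("|"),
  "g",
);

function stripAnsi(s: string): string {
  return s.replace(ANSI_RE, "");
}

// Rough approximation of the IANA type/subtype grammar: rejects keys
// containing newlines, spaces, or other smuggled payload bytes.
function isLikelyMimeType(key: string): boolean {
  return /^[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*\/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*$/.test(
    key,
  );
}
```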

* chore(core): correct ANSI comment example and rename cache-reset test

Comment/test polish from the convergence audit:

- The `[@-Z\-_]` C1-Fe branch of the ANSI regex does not actually match
  `ESC c` (RIS), `ESC 7`, or `ESC 8`, which sit at 0x63/0x37/0x38. It
  does match IND/NEL/HTS/RI (ESC D/E/H/M). Correct the jsdoc example.

- The `should clear the in-flight promise after a probe to allow retries`
  test wasn't distinguishing the `.finally` behaviour from the
  `resetPdftotextCache()` call that immediately precedes the second
  probe. Rename it to reflect what it actually verifies; the `.finally`
  remains as defense-in-depth (a synchronous throw inside the IIFE's
  own handlers can't leave the in-flight slot stuck on a rejected
  promise).

Shaojin Wen 2026-04-20 11:09:50 +08:00 committed by GitHub
parent 0b8b3da836
commit 9d8201d206
12 changed files with 1985 additions and 39 deletions


@@ -1132,5 +1132,30 @@ describe('handleAtCommand', () => {
       expect(result.toolDisplays!.length).toBeGreaterThanOrEqual(1);
       expect(result.toolDisplays![0].description).toContain('file.txt');
     });
+    it('should mark per-file failures as Error status, not Success', async () => {
+      // Trigger the >10MB size error in processSingleFileContent so the
+      // readManyFiles result carries a per-file `error` field.
+      const filePath = path.join(testRootDir, 'oversized.bin');
+      await fsPromises.mkdir(path.dirname(filePath), { recursive: true });
+      await fsPromises.writeFile(filePath, Buffer.alloc(10 * 1024 * 1024 + 1));
+      const query = `@${filePath}`;
+      const result = await handleAtCommand({
+        query,
+        config: mockConfig,
+        onDebugMessage: mockOnDebugMessage,
+        messageId: 504,
+        signal: abortController.signal,
+      });
+      expect(result.toolDisplays).toBeDefined();
+      expect(result.toolDisplays).toHaveLength(1);
+      expect(result.toolDisplays![0].status).toBe(ToolCallStatus.Error);
+      expect(result.toolDisplays![0].resultDisplay).toContain(
+        'Failed to read oversized.bin',
+      );
+      expect(result.toolDisplays![0].resultDisplay).toContain('10MB');
+    });
   });
 });


@@ -364,8 +364,10 @@ export async function handleAtCommand({
         description: file.isDirectory
           ? `Read directory ${path.basename(file.filePath)}`
           : `Read file ${path.basename(file.filePath)}`,
-        status: ToolCallStatus.Success,
-        resultDisplay: undefined,
+        status: file.error ? ToolCallStatus.Error : ToolCallStatus.Success,
+        resultDisplay: file.error
+          ? `Failed to read ${path.basename(file.filePath)}: ${file.error}`
+          : undefined,
         confirmationDetails: undefined,
       }),
     );