Restore remote document fetch compatibility for public sites after the
CVE-2026-4308 SSRF hardening.
The initial security fix correctly blocked non-public destinations, but
it also changed the outbound request fingerprint for `document_query`
remote fetches. Some public sites responded with HTTP 403 to the default
`requests` user agent even though they remained safe and publicly
routable. One such site, used for testing, is
https://nvd.nist.gov/vuln/detail/CVE-2026-4308.
This change keeps the centralized SSRF protections in place while
restoring the previous request compatibility behavior by sending the
configured `USER_AGENT` header, falling back to the prior
`@mixedbread-ai/unstructured` value.
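A minimal sketch of the header fallback described above. It assumes the
configured value is read from a `USER_AGENT` environment variable; the
helper name `build_fetch_headers` and the `env` parameter are
hypothetical and only illustrate the idea.

```python
import os

# Fallback value named in this change description.
DEFAULT_USER_AGENT = "@mixedbread-ai/unstructured"


def build_fetch_headers(env=None):
    """Return request headers carrying the configured or fallback user agent.

    `env` is a hypothetical injection point for testing; by default the
    process environment is used.
    """
    env = os.environ if env is None else env
    configured = (env.get("USER_AGENT") or "").strip()
    # An unset or blank value falls back to the prior default.
    return {"User-Agent": configured or DEFAULT_USER_AGENT}
```

The resulting dict can be passed as `headers=` to the outbound request so
that the fetch no longer presents the default `requests` user agent.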
What is fixed:
- public URLs such as `https://nvd.nist.gov/vuln/detail/CVE-2026-4308`
  no longer fail with site-specific HTTP 403 caused by the request
  fingerprint change introduced by the SSRF mitigation
Address CVE-2026-4308 in the `document_query` tool remote-fetch path.
The issue was originally reported by @YLChen-007.
This change replaces ad hoc remote document fetching with a centralized
safe fetch flow that validates remote URLs before any network request is
used for parsing. It blocks localhost and non-public IPv4/IPv6 targets,
validates every redirect hop, stops implicitly trusting proxy
environment settings for this path, and enforces a strict remote
document size cap.
It also removes direct third-party loader access to attacker-controlled
URLs by prefetching remote content first and then parsing only trusted
local bytes or temp files for HTML, text, PDF, image, and unstructured
document handling.
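The flow above could look roughly like the following sketch. It is not
the code from this change: the names `is_public_url` and `safe_fetch`,
the 10 MiB cap, the redirect limit, and the timeout are all assumptions
chosen for illustration. `session` is expected to behave like a
`requests.Session`.

```python
import ipaddress
import socket
from urllib.parse import urljoin, urlsplit

MAX_BYTES = 10 * 1024 * 1024  # assumed size cap, not taken from the change
MAX_REDIRECTS = 5             # assumed redirect limit


def is_public_url(url):
    """Return True only if every address the host resolves to is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
        # Strip any IPv6 zone id ("%eth0") before parsing the address.
        addrs = [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]
    except (socket.gaierror, UnicodeError, ValueError):
        return False
    # Loopback, private, link-local, and reserved ranges are not global.
    return bool(addrs) and all(addr.is_global for addr in addrs)


def safe_fetch(session, url):
    """Fetch `url` as bytes, validating each redirect hop and capping size."""
    for _ in range(MAX_REDIRECTS + 1):
        if not is_public_url(url):
            raise ValueError(f"blocked non-public URL: {url}")
        # Redirects are followed manually so each hop can be validated.
        resp = session.get(url, stream=True, allow_redirects=False, timeout=10)
        try:
            if resp.is_redirect:
                url = urljoin(url, resp.headers["Location"])
                continue
            resp.raise_for_status()
            body = bytearray()
            for chunk in resp.iter_content(chunk_size=65536):
                body.extend(chunk)
                if len(body) > MAX_BYTES:
                    raise ValueError("remote document exceeds size cap")
            return bytes(body)
        finally:
            resp.close()
    raise ValueError("too many redirects")
```

A caller would create a `requests.Session()`, set `trust_env = False` on
it so proxy environment variables are ignored, and hand the returned
bytes to the local parsers. Note that this sketch resolves the host
separately from the actual connection, so on its own it does not rule
out DNS rebinding between the check and the request.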
Refs:
- CVE-2026-4308
- Report by @YLChen-007