mirror of https://github.com/agent0ai/agent-zero.git synced 2026-06-02 15:31:28 +00:00

Alessandro a7b4fcd798 Remove document query plugin requirements The LiteParse dependency is already managed from the root requirements.txt, so the document query plugin should not carry a separate requirements file. This keeps dependency installation centralized for the bundled core plugin.		2026-06-01 02:29:22 +02:00
extensions/python/startup_migration	feat(document_query): add liteparse runtime and progressive skill	2026-05-29 12:45:14 +02:00
helpers	Tune LiteParse OCR defaults	2026-05-30 19:02:10 +02:00
prompts	fix(document_query): clean prompt spelling and legacy references	2026-05-29 12:46:32 +02:00
skills/document-query	feat(document_query): add liteparse runtime and progressive skill	2026-05-29 12:45:14 +02:00
tools	feat: extract document_query into _document_query plugin with parser strategy pattern	2026-05-29 12:45:05 +02:00
webui	Tune LiteParse OCR defaults	2026-05-30 19:02:10 +02:00
default_config.yaml	Tune LiteParse OCR defaults	2026-05-30 19:02:10 +02:00
hooks.py	feat(document_query): add liteparse runtime and progressive skill	2026-05-29 12:45:14 +02:00
plugin.yaml	feat: extract document_query into _document_query plugin with parser strategy pattern	2026-05-29 12:45:05 +02:00
README.md	Tune LiteParse OCR defaults	2026-05-30 19:02:10 +02:00

README.md

Document Query Plugin

Load, parse, index, and answer questions over local and remote documents, with configurable timeouts and thread-safe parsers.

Features

Strategy-pattern parsers - MIME-type routing to dedicated parser classes
Centralized fetching - local and HTTP(S) resources are fetched once, size-checked, then passed to parsers
LiteParse-first path - fast local parsing for PDFs and supported document/image formats, with legacy fallbacks
Adaptive OCR - large text-rich PDFs skip OCR automatically to avoid pathological parse times
Bounded parser execution - sync parsers are offloaded to asyncio.to_thread and globally capped across chats
Configurable timeouts - per-document and gather-level timeouts
Expanded format support - PDF, HTML, text, YAML, XML, TOML, JS, TS, images, and catch-all Unstructured
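
The bounded execution and timeout features can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: parse_sync, parse_one, parse_all, and run_demo are hypothetical names, the constants only mirror the documented defaults, and the timeouts are assumed to be in seconds.

```python
import asyncio
import time

PARSER_CONCURRENCY = 1     # mirrors parser_concurrency
PER_DOCUMENT_TIMEOUT = 60  # mirrors per_document_timeout (assumed seconds)
GATHER_TIMEOUT = 120       # mirrors gather_timeout (assumed seconds)


def parse_sync(name: str) -> str:
    """Hypothetical blocking parser; stands in for a real sync parser."""
    time.sleep(0.01)
    return f"parsed:{name}"


async def parse_one(name: str, semaphore: asyncio.Semaphore,
                    timeout: float = PER_DOCUMENT_TIMEOUT) -> str:
    # The semaphore caps how many parser jobs run at the same time.
    async with semaphore:
        # The blocking call runs in a worker thread, so the event loop stays free.
        return await asyncio.wait_for(asyncio.to_thread(parse_sync, name), timeout)


async def parse_all(names, concurrency: int = PARSER_CONCURRENCY,
                    gather_timeout: float = GATHER_TIMEOUT):
    semaphore = asyncio.Semaphore(concurrency)
    jobs = [parse_one(n, semaphore) for n in names]
    # return_exceptions=True keeps one failed document from discarding the rest.
    return await asyncio.wait_for(
        asyncio.gather(*jobs, return_exceptions=True), gather_timeout
    )


def run_demo():
    return asyncio.run(parse_all(["a.pdf", "b.html"]))
```

Two caveats about this sketch: the semaphore here is created per call, whereas a cap that holds across chats would need one shared instance per process; and asyncio.wait_for stops waiting on a timed-out thread but cannot stop the thread itself.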

Configuration

See default_config.yaml for all options. Key settings:

Setting	Default	Description
fetch_timeout	30	HTTP fetch timeout (seconds)
fetch_retries	3	HTTP retry attempts
max_remote_bytes	52428800	Max remote document size (bytes; 50 MiB)
per_document_timeout	60	Max time for a single document parse (seconds)
gather_timeout	120	Max time for all documents combined (seconds)
parser_concurrency	1	Max parser jobs running across all chats in one process
context_intro_chunks	2	Leading chunks included per document for title/abstract grounding
chunk_size	1000	Text splitter chunk size
chunk_overlap	100	Text splitter overlap
search_threshold	0.5	Similarity search threshold
liteparse_enabled	true	Prefer LiteParse before legacy parser fallbacks
liteparse_num_workers	2	Max LiteParse OCR workers per parser job
liteparse_ocr_auto_disable_pages	30	Disable OCR for text-rich PDFs at or above this effective page count
thread_offload	true	Offload sync parsers to thread pool

LiteParse is installed into the Agent Zero framework runtime from hooks.py during plugin install/startup. If installation fails, the plugin logs the error and continues with the legacy parser fallbacks.

LiteParse always runs in a child process so native parser and OCR failures stay isolated from the Web UI process.
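
As an illustration of the isolation idea, a parser can be run in a child process so that a crash or hang only affects that process. The sketch below is an assumption about the general technique, not LiteParse's actual invocation: CHILD_CODE and parse_in_child are hypothetical, and the child only echoes a stub result.

```python
import json
import subprocess
import sys

# Hypothetical worker script; a real worker would call the parser here.
CHILD_CODE = """
import json, sys
path = sys.argv[1]
json.dump({"path": path, "text": "stub text"}, sys.stdout)
"""


def parse_in_child(path: str, timeout: float = 60.0, code: str = CHILD_CODE):
    """Run the worker in a separate interpreter; return None on any failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        return None
    if proc.returncode != 0:
        # A crash in the child shows up as a non-zero exit code in the parent.
        return None
    return json.loads(proc.stdout)
```

With this shape, the caller can treat a None result as the signal to try a fallback parser.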

Parsers

Parser	MIME Types	Backend
LiteParseParser	PDF, Office/OpenDocument, images	LiteParse
PdfParser	application/pdf	PyMuPDF + Tesseract OCR fallback
HtmlParser	text/html	Markdownify transformer
TextParser	text/*, application/json, YAML, XML, TOML, JS, TS, shell	Direct read
ImageParser	image/*	UnstructuredLoader
UnstructuredParser	* (catch-all)	UnstructuredLoader hi-res
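
The MIME-type routing in the table above could look something like the following sketch. The ROUTES list, its ordering, and pick_parser are assumptions made for illustration; the LiteParse-first step is left out, and the plugin's real registry may be organized differently.

```python
from fnmatch import fnmatchcase

# (pattern, parser name) pairs, most specific first; names come from the table.
ROUTES = [
    ("application/pdf", "PdfParser"),
    ("text/html", "HtmlParser"),
    ("text/*", "TextParser"),
    ("application/json", "TextParser"),
    ("image/*", "ImageParser"),
    ("*", "UnstructuredParser"),  # catch-all
]


def pick_parser(mimetype: str, routes=ROUTES) -> str:
    """Return the name of the first parser whose pattern matches the MIME type."""
    normalized = mimetype.strip().lower()
    for pattern, parser_name in routes:
        if fnmatchcase(normalized, pattern):
            return parser_name
    raise LookupError(f"no parser registered for {mimetype!r}")
```

Ordering matters in this sketch: text/html has to come before text/*, and the catch-all has to come last.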

Adding a new parser

Create a new .py module in helpers/parsers/ with a class extending BaseParser
Set mimetypes class attribute
Implement _parse_sync(document, config)
Register in helpers/parsers/__init__.py
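
The steps above might be sketched as follows. Everything here is hypothetical: the BaseParser shown is a stand-in so the example runs on its own, CsvParser and the "encoding" config key are invented for illustration, and the types of document (bytes) and config (dict) are assumptions. Check the real BaseParser in helpers/parsers for the actual interface.

```python
import csv
import io


class BaseParser:
    """Stand-in for the plugin's BaseParser; the real class may differ."""

    mimetypes: tuple = ()

    def _parse_sync(self, document: bytes, config: dict) -> str:
        raise NotImplementedError


class CsvParser(BaseParser):
    """Hypothetical parser that renders CSV rows as pipe-separated text."""

    mimetypes = ("text/csv",)

    def _parse_sync(self, document: bytes, config: dict) -> str:
        # "encoding" is an illustrative config key, not a documented setting.
        text = document.decode(config.get("encoding", "utf-8"))
        rows = csv.reader(io.StringIO(text))
        return "\n".join(" | ".join(row) for row in rows)


# Stand-in for the registration step in helpers/parsers/__init__.py.
PARSERS = [CsvParser]
```

For example, CsvParser()._parse_sync(b"a,b\n1,2\n", {}) yields two lines, "a | b" and "1 | 2".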