From d3f2265ac98e91da80c6afbbf1bd2ce304b45f2c Mon Sep 17 00:00:00 2001
From: Kevin Colten
Date: Sat, 9 May 2026 07:59:33 -0700
Subject: [PATCH] Add Vision Models guide and references

Introduce a new 'Vision Models' configuration guide and wire it into user and provider docs. Adds docs/5-CONFIGURATION/vision-models.md explaining how to configure a default Vision Model, required binaries (pdftoppm, ffmpeg), routing/adaptive-sampling behavior, cost considerations, and troubleshooting. Update Adding Sources and AI Providers pages to reference the new guide and clarify that image/PDF/video visual extraction requires a configured Vision Model (and that pure image files are unsupported without one).
---
 docs/3-USER-GUIDE/adding-sources.md   |   9 +-
 docs/4-AI-PROVIDERS/index.md          |   1 +
 docs/5-CONFIGURATION/index.md         |   6 ++
 docs/5-CONFIGURATION/vision-models.md | 134 ++++++++++++++++++++++++++
 4 files changed, 149 insertions(+), 1 deletion(-)
 create mode 100644 docs/5-CONFIGURATION/vision-models.md

diff --git a/docs/3-USER-GUIDE/adding-sources.md b/docs/3-USER-GUIDE/adding-sources.md
index 020d9e34..6d94f5f3 100644
--- a/docs/3-USER-GUIDE/adding-sources.md
+++ b/docs/3-USER-GUIDE/adding-sources.md
@@ -71,10 +71,17 @@ Sources are the raw materials of your research. This guide covers how to add dif
 
 **Just paste the URL** in "Web Link" section.
 
+### Images (with Vision Model configured)
+- **Images** (.jpg, .png, .webp, etc.) — Described directly by a multimodal LLM
+- **Scanned / image-heavy PDFs** — Pages rendered and analyzed page-by-page (replaces standard OCR)
+- **Videos** — Visual frames analyzed alongside the audio transcript
+
+Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline.
+
 ### What Doesn't Work
 - Paywalled content (WSJ, FT, etc.) — Can't extract
 - Password-protected PDFs — Can't open
-- Pure image files (.jpg, .png) — Except scanned PDFs which have OCR
+- Pure image files (.jpg, .png) — Unsupported unless a [Vision Model](../5-CONFIGURATION/vision-models.md) is configured
 - Very large files (>100MB) — Timeout
 
 ---
diff --git a/docs/4-AI-PROVIDERS/index.md b/docs/4-AI-PROVIDERS/index.md
index 2a5786bf..fdc031b3 100644
--- a/docs/4-AI-PROVIDERS/index.md
+++ b/docs/4-AI-PROVIDERS/index.md
@@ -149,6 +149,7 @@ Open Notebook supports 17+ AI providers. This guide helps you **choose the right
 
 ### I need multimodal (images, audio, video)
 → **Google Gemini** — Best multimodal support
+→ See [Vision Models](../5-CONFIGURATION/vision-models.md) to configure a default vision model for image / PDF page / video frame extraction (also works with OpenAI, Anthropic, OpenRouter, Ollama, …).
 
 ### I want access to many models with one API key
 → **OpenRouter** — 100+ models, unified billing
diff --git a/docs/5-CONFIGURATION/index.md b/docs/5-CONFIGURATION/index.md
index f1198368..2199b52c 100644
--- a/docs/5-CONFIGURATION/index.md
+++ b/docs/5-CONFIGURATION/index.md
@@ -210,6 +210,12 @@ OPEN_NOTEBOOK_ENCRYPTION_KEY=my-secret-key
 - GPU acceleration
 - Docker networking
 
+### [Vision Models](vision-models.md)
+- Configure a default vision model (GPT-4o, Claude 3+, Gemini, …)
+- Image, PDF page, and video frame extraction
+- Adaptive sampling and cost considerations
+- System binaries (`pdftoppm`, `ffmpeg`)
+
 ### [Ollama](ollama.md)
 - Setting up and pointing to an Ollama server
 - Downloading models
diff --git a/docs/5-CONFIGURATION/vision-models.md b/docs/5-CONFIGURATION/vision-models.md
new file mode 100644
index 00000000..5b4f4f4e
--- /dev/null
+++ b/docs/5-CONFIGURATION/vision-models.md
@@ -0,0 +1,134 @@
+# Vision Models
+
+Configure a default **Vision Model** to let Open Notebook extract content from images, PDF pages, and video frames using a multimodal LLM (GPT-4o, Claude 3+, Gemini, etc.).
+
+> **What this unlocks**: image-only files (`.jpg`, `.png`, `.webp`, …) become a supported source type, PDFs get vision-based page analysis (useful for scans, diagrams, complex layouts), and videos get visual context combined with the existing audio transcript.
+
+---
+
+## How It Works
+
+Vision support is delivered by two upstream libraries:
+
+- **[Esperanto](https://github.com/lfnovo/esperanto)** ([PR #191](https://github.com/lfnovo/esperanto/pull/191)) — adds multimodal (image) input to `chat_complete` across all LLM providers. Open Notebook calls vision-capable LLMs through this unified surface.
+- **[content-core](https://github.com/lfnovo/content-core)** ([PR #37](https://github.com/lfnovo/content-core/pull/37)) — adds vision-model–based extractors that render PDF pages with `pdftoppm`, sample video frames with `ffmpeg`, and feed them to the configured vision model.
+
+Open Notebook wires the user's configured Vision Model (and its credential) through to content-core during source ingestion. When no Vision Model is set, the system falls back to the configured **Chat Model** — so any existing install with a multimodal chat model already gets vision support automatically.
+
+---
+
+## What Gets Vision-Processed
+
+When a Vision Model (or fallback chat model) is configured:
+
+| Source type | Default behavior | With vision model |
+|---|---|---|
+| **Images** (`image/*`) | Unsupported | Described directly by the vision model |
+| **PDFs** | `pdfplumber` text extraction | Pages rendered to images and analyzed |
+| **Videos** | Audio transcript only | Audio transcript + visual frame analysis |
+| **Documents** (.docx, .pptx, …) | Unchanged | Unchanged (text extraction only) |
+| **Web links** | Unchanged | Unchanged |
+
+**Routing precedence** (handled by content-core):
+- `document_engine="docling"` always wins.
+- `document_engine="auto"` + Vision Model set → vision route for PDF / image / video MIME types.
+- No Vision Model (or fallback chat model) configured → standard text/audio pipeline (PDFs → pdfplumber, videos → audio-only, images → unsupported).
+
+**Adaptive sampling** (avoids blowing up token cost on large inputs):
+
+- **PDFs**: every page up to 20 → every 2nd up to 100 → every 5th up to 500 → every 10th beyond.
+- **Videos**: 1.0 fps for ≤60 s → 0.5 fps for ≤5 min → 0.2 fps for ≤15 min → 0.1 fps beyond.
+- Pages and frames are analyzed in parallel with a concurrency cap of 5.
+
+---
+
+## Configuration
+
+### 1. Add a vision-capable provider credential
+
+Any multimodal LLM works. Common choices:
+
+| Provider | Recommended models |
+|---|---|
+| **OpenAI** | `gpt-4o`, `gpt-4o-mini` |
+| **Anthropic** | `claude-sonnet-4-5`, `claude-3-5-sonnet`, `claude-3-5-haiku` |
+| **Google** | `gemini-2.0-flash`, `gemini-1.5-pro` |
+| **OpenRouter** | Any of the above by routed name |
+| **Ollama** | `llava`, `llama3.2-vision`, `qwen2.5vl` |
+
+Add the credential the usual way: **Settings → API Keys → Add Credential → Test Connection → Discover Models → Register Models**. See the [AI Providers Configuration Guide](ai-providers.md) for per-provider walkthroughs.
+
+### 2. Set the default Vision Model
+
+1. Go to **Settings → API Keys**.
+2. Scroll to **Default Models**.
+3. Pick a registered language model in the **Vision Model** dropdown.
+4. Save.
+
+The dropdown lists every registered language-type model — Open Notebook does not capability-detect, so make sure the model you pick actually accepts image input. Non-multimodal models will surface the provider's API error verbatim during ingestion.
+
+> **Tip:** if you already use a multimodal model as your **Chat Model** (e.g. `gpt-4o`, `claude-sonnet-4-5`, `gemini-2.0-flash`), you can leave **Vision Model** empty — ingestion will fall back to the chat model.
+
+### 3. (PDFs / videos only) Install system binaries
+
+Vision PDF and video extraction shell out to standard tooling that must be on `PATH` inside the API container or host:
+
+- `pdftoppm` (from **poppler**) — required for PDF page rendering.
+- `ffmpeg` and `ffprobe` — required for video frame extraction.
+
+The official Open Notebook Docker image ships these. If you run from source on macOS:
+
+```bash
+brew install poppler ffmpeg
+```
+
+Debian / Ubuntu:
+
+```bash
+apt-get install -y poppler-utils ffmpeg
+```
+
+Image-only ingestion does **not** need either binary.
+
+---
+
+## Behavior & Failure Modes
+
+- **No vision model (or fallback chat model) configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.
+- **Vision model fails on a frame/page** — the processor returns an `ExtractionOutput` with a placeholder message rather than aborting the whole source.
+- **Video audio extraction fails** — the visual analysis still completes; the transcript portion is simply omitted.
+- **Credential pass-through** — Open Notebook forwards the model's stored credential to content-core via `vision_config` (mirroring how speech-to-text credentials are passed). Models without an attached credential rely on environment-variable defaults at the provider level.
+
+---
+
+## Cost Considerations
+
+Vision input is significantly more expensive than text. A 50-page scanned PDF rendered at full resolution can easily produce **5–15× the tokens** of the equivalent OCR-extracted text. Recommendations:
+
+- Use **`gpt-4o-mini`**, **`claude-3-5-haiku`**, or **`gemini-2.0-flash`** as a default Vision Model for routine ingestion — quality is good enough for most diagrams/scans at a fraction of the cost of flagship models.
+- Reserve flagship vision models (`gpt-4o`, `claude-sonnet-4-5`, `gemini-1.5-pro`) for sources where layout, handwriting, or detailed diagrams matter.
+- For digital (text-native) PDFs, **leave Vision Model unset** or rely on `document_engine="docling"` — pdfplumber is faster and free.
+
+---
+
+## Troubleshooting
+
+**"pdftoppm: command not found" / "ffmpeg: command not found"**
+Install the binaries (see step 3) and restart the API. Image-only ingestion does not need either.
+
+**Images upload but produce empty descriptions**
+The selected Vision Model probably isn't multimodal. Re-select an image-capable model (see the table above) and retry.
+
+**PDF processing times out on huge files**
+Adaptive sampling already caps page rendering, but very large PDFs (500+ pages) at high concurrency can still hit provider rate limits. Use a cheaper / higher-throughput vision model, or split the PDF.
+
+**Video has no visual analysis, only transcript**
+Either no Vision Model is configured or `ffmpeg`/`ffprobe` is missing. Check the API logs for `Failed to retrieve model configuration` or ffmpeg errors.
+
+---
+
+## Related Docs
+
+- [AI Providers Configuration](ai-providers.md) — per-provider credential setup
+- [Adding Sources](../3-USER-GUIDE/adding-sources.md) — how source ingestion works end to end
+- [Local STT](local-stt.md) — companion feature for audio/video transcription
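
The adaptive sampling tiers listed under "What Gets Vision-Processed" can be read as a simple lookup on input size. The sketch below is one possible reading, not content-core's actual implementation: it assumes the PDF stride is chosen from the total page count and the video frame rate from the total duration. The function names `pdf_page_stride`, `video_sample_fps`, and `sampled_pages` are hypothetical and exist only for illustration.

```python
def pdf_page_stride(total_pages: int) -> int:
    """Return N such that every Nth page is analyzed.

    Assumption: the tier is picked from the total page count of the PDF.
    """
    if total_pages <= 20:
        return 1   # every page
    if total_pages <= 100:
        return 2   # every 2nd page
    if total_pages <= 500:
        return 5   # every 5th page
    return 10      # every 10th page


def video_sample_fps(duration_seconds: float) -> float:
    """Return the frame sampling rate for a video of the given length.

    Assumption: the tier is picked from the total duration of the video.
    """
    if duration_seconds <= 60:
        return 1.0
    if duration_seconds <= 5 * 60:
        return 0.5
    if duration_seconds <= 15 * 60:
        return 0.2
    return 0.1


def sampled_pages(total_pages: int) -> list[int]:
    """List the 1-based page numbers selected under the assumed stride."""
    return list(range(1, total_pages + 1, pdf_page_stride(total_pages)))
```

Under this reading, a 50-page PDF has 25 pages sent to the vision model, and a 10-minute video is sampled at 0.2 fps, or roughly 120 frames. If content-core applies the tiers per position rather than per total size, the counts will differ, so treat these numbers as rough cost estimates only.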