# SurfSense Evals
Domain-agnostic eval harness for SurfSense. Each benchmark is a Python subpackage under `suites/<domain>/<benchmark>/` that self-registers with the CLI; `core/` is the shared infrastructure (HTTP clients, arms, parsers, metrics, report writer, registry). The harness talks to SurfSense over HTTP only — it does not import any backend Python module — so it ships in its own venv and never bloats the FastAPI runtime image.
## Benchmarks
| Benchmark | Shape | Vision required? | Default ingest |
|---|---|---|---|
| `medical/medxpertqa` (headline) | Native PDF vs SurfSense head-to-head, MCQ | yes | vision=on, mode=basic |
| `medical/mirage` | SurfSense single-arm, MCQ | no | vision=off, mode=basic |
| `medical/cure` | SurfSense single-arm retrieval (Recall/MRR/nDCG) | no | vision=off, mode=basic |
| `multimodal_doc/mmlongbench` | Native PDF vs SurfSense head-to-head, open-ended | yes | vision=on, mode=basic |
Future domains (legal/, finance/, code/, scientific/) drop into suites/ without touching core/ or the CLI.
## Install + auth
```bash
uv pip install -e ./surfsense_evals
cp surfsense_evals/.env.example surfsense_evals/.env
# Edit .env: SURFSENSE_API_BASE, OPENROUTER_API_KEY, and ONE of:
#   LOCAL  → SURFSENSE_USER_EMAIL + SURFSENSE_USER_PASSWORD
#   GOOGLE → SURFSENSE_JWT (+ optional SURFSENSE_REFRESH_TOKEN)
#            (lift both from browser localStorage after a normal Google login)
```
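For reference, a filled-in `.env` for the LOCAL flow might look like this (all values below are placeholders):

```bash
# surfsense_evals/.env (placeholder values, LOCAL email/password auth)
SURFSENSE_API_BASE=http://localhost:8000
OPENROUTER_API_KEY=sk-or-...
SURFSENSE_USER_EMAIL=eval-user@example.com
SURFSENSE_USER_PASSWORD=change-me
```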
## Step-by-step: run all four benchmarks
The `medical` and `multimodal_doc` suites each get their own SearchSpace and pinned model, so they're independent — run them in any order. Both head-to-head benchmarks (`medxpertqa`, `mmlongbench`) require a vision-capable OpenRouter slug; pinning a text-only one (e.g. `openai/gpt-5.4-mini`) drops images, and the runner emits a warning when it detects this.
Recommended vision slugs (use models list --grep <name> to confirm one): anthropic/claude-sonnet-4.5 (balanced cost), anthropic/claude-opus-4.7 (strongest reasoning), openai/gpt-5 (top-tier vision), google/gemini-2.5-pro (best for long PDFs, 1M-token context).
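For example, to confirm a slug is actually listed before pinning it:

```bash
python -m surfsense_evals models list --grep claude-sonnet
python -m surfsense_evals models list --grep gemini-2.5-pro
```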
```bash
# 0. (optional) discover what's registered
python -m surfsense_evals suites list
python -m surfsense_evals benchmarks list

# 1. MEDICAL SUITE — one SearchSpace, three benchmarks
python -m surfsense_evals setup --suite medical --provider-model anthropic/claude-sonnet-4.5

# 1a. headline head-to-head: Native PDF (vision) vs SurfSense (vision RAG)
#     Downloads dev+test JSONL + images.zip, renders one PDF per question
#     (case + table + images + 5 options), uploads with use_vision_llm=True.
python -m surfsense_evals ingest medical medxpertqa --split test
python -m surfsense_evals run medical medxpertqa --concurrency 4

# 1b. MIRAGE — single-arm SurfSense MCQ accuracy
#     (MMLU-Med / MedQA-US / MedMCQA / PubMedQA / BioASQ)
python -m surfsense_evals ingest medical mirage
python -m surfsense_evals run medical mirage

# 1c. CUREv1 — single-arm SurfSense retrieval (Recall@k / MRR / nDCG@10)
python -m surfsense_evals ingest medical cure --lang en
python -m surfsense_evals run medical cure --lang en

# 1d. write reports/medical/<UTC-ts>/summary.{md,json}
python -m surfsense_evals report --suite medical

# 2. MULTIMODAL_DOC SUITE — long PDFs with embedded images, charts, tables
python -m surfsense_evals setup --suite multimodal_doc --provider-model google/gemini-2.5-pro
python -m surfsense_evals ingest multimodal_doc mmlongbench   # ~660MB, resumable
python -m surfsense_evals run multimodal_doc mmlongbench --concurrency 4
python -m surfsense_evals report --suite multimodal_doc

# 3. CLEANUP — soft-deletes the SearchSpaces; rendered PDFs stay cached
python -m surfsense_evals teardown --suite medical
python -m surfsense_evals teardown --suite multimodal_doc
```
## Asymmetric scenarios — the "vision-extract once, answer cheap" play
The walkthrough above is --scenario head-to-head (default): both arms answer with the same vision-capable slug. SurfSense's actual architectural value-prop is that the ingestion-time vision LLM and the runtime LLM are completely independent — you can pay a vision LLM once, at ingest, to convert every embedded image into text (per-image OCR and semantic description, inlined where the image actually appears in the document — see What --use-vision-llm produces below). Then every query is served by a cheap text-only model that sees that extracted text natively. Two extra scenarios make this explicit:
| `--scenario` | Native arm answers with | SurfSense arm answers with | Question being measured |
|---|---|---|---|
| `head-to-head` | `--provider-model` (vision) | `--provider-model` (vision) | Pure RAG quality at parity. (Default.) |
| `symmetric-cheap` | `--provider-model` (cheap, text-only) | `--provider-model` (same) | Does pre-extracted image context let a non-vision LLM reason over image-heavy docs? |
| `cost-arbitrage` | `--native-arm-model` (vision) | `--provider-model` (cheap) | How close does SurfSense get to a vision-native baseline at a fraction of per-query cost? |
In all three modes the ingest-time vision LLM is set on the SearchSpace's vision_llm_config_id (auto-picked from the strongest registered global OpenRouter vision config — claude-sonnet-4.5 > claude-opus-4.7 > gpt-5 > gemini-2.5-pro, override with --vision-llm <slug>). What changes is which slug the answering models hit per arm.
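If you'd rather pin the ingest-time vision LLM explicitly than rely on the auto-pick, pass the `--vision-llm` override at setup; a minimal sketch (the slug choice here is illustrative):

```bash
# Pin gemini-2.5-pro as the ingest-time vision LLM instead of the auto-picked config
python -m surfsense_evals setup --suite multimodal_doc \
    --scenario symmetric-cheap \
    --provider-model openai/gpt-5.4-mini \
    --vision-llm google/gemini-2.5-pro
```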
### Ingest with vision, evaluate with a non-vision LLM (symmetric-cheap)
This is the answer to "does SurfSense give a non-vision LLM enough context to reason over image-heavy docs?". Both arms hit the same cheap text-only slug. The native arm is structurally blind to images (text-only LLM + raw PDFs). The SurfSense arm reads chunks that already contain the per-image OCR and visual descriptions, written there by the vision LLM at ingest time.
```bash
python -m surfsense_evals setup --suite medical \
    --scenario symmetric-cheap \
    --provider-model openai/gpt-5.4-mini
# vision LLM at ingest = auto-picked (claude-sonnet-4.5 by default)
# answer LLM for BOTH arms = openai/gpt-5.4-mini (text-only)

python -m surfsense_evals ingest medical medxpertqa --split test   # vision=on by default
python -m surfsense_evals run medical medxpertqa --concurrency 4
python -m surfsense_evals report --suite medical
# Δ accuracy on image-required MCQs is the headline number; native arm
# baseline is "what a text-only LLM gets without seeing the images".
```
### Cheap SurfSense vs vision-native baseline (cost-arbitrage)
```bash
python -m surfsense_evals setup --suite medical \
    --scenario cost-arbitrage \
    --provider-model openai/gpt-5.4-mini \
    --native-arm-model anthropic/claude-sonnet-4.5
# vision LLM at ingest = auto-picked claude-sonnet-4.5
# native arm = sonnet (vision); SurfSense arm = gpt-5.4-mini (text-only)

python -m surfsense_evals ingest medical medxpertqa --split test
python -m surfsense_evals run medical medxpertqa --concurrency 4
python -m surfsense_evals report --suite medical

# Report header reads:
#   Scenario: cost-arbitrage — native arm answers with `anthropic/claude-sonnet-4.5`
#   (vision); SurfSense answers with `openai/gpt-5.4-mini` over chunks vision-extracted
#   at ingest by `anthropic/claude-sonnet-4.5`.
```
Notes:
- `cost-arbitrage` requires both `--provider-model` (the cheap SurfSense slug) AND `--native-arm-model <vision slug>`. `--vision-llm <slug>` is optional; if omitted the harness queries `GET /api/v1/global-vision-llm-configs` and auto-picks the strongest registered one. Pass `--no-vision-llm-setup` if you want to keep whatever vision config is already attached to the SearchSpace.
- The runner's "looks text-only" warning is suppressed (or relabelled as informational) for `symmetric-cheap`, so intentional asymmetry doesn't read as a misconfiguration.
- All scenario fields (`scenario`, `provider_model`, `native_arm_model`, `vision_provider_model`) are persisted to `state.json` and recorded in `run_artifact.extra` + the report header — no need to retrace what was set.
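If you want to see which global vision configs the auto-pick is choosing from, the endpoint above can be queried directly; a sketch assuming bearer-token auth with a pre-issued `SURFSENSE_JWT` (the harness handles LOCAL email/password login itself, not shown here):

```bash
# List the registered global OpenRouter vision configs (auth scheme is an assumption)
curl -s "$SURFSENSE_API_BASE/api/v1/global-vision-llm-configs" \
  -H "Authorization: Bearer $SURFSENSE_JWT" | python -m json.tool
```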
## Per-benchmark useful flags
`medical/medxpertqa` (run):

- `--split {test,dev,all}` — pick a subset (default `test`)
- `--task "Diagnosis"` / `--body-system "Cardiovascular"` — slice the report
- `--require-images` — drop rare rows where every image filename failed to resolve
- `--n 100` — quick smoke run
- `--no-mentions` — let SurfSense retrieve unscoped ("did the @-mention matter?")
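For instance, a sliced smoke run combining several of these flags might look like:

```bash
# 100-question smoke run on the dev split, Diagnosis questions only,
# dropping rows whose image files failed to resolve
python -m surfsense_evals run medical medxpertqa \
    --split dev --task "Diagnosis" --require-images --n 100
```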
`multimodal_doc/mmlongbench`:

- `--max-docs N` (ingest) — cap downloads at the first N unique PDFs
- `--format {str,int,float,list,none}` (run) — slice by answer format; `none` = the ~22% intentionally unanswerable hallucination probes
- `--skip-unanswerable` (run) — drop unanswerable questions
- `--docs <a.pdf>,<b.pdf>` (run) — scope to specific docs
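For example, to target the hallucination probes on a couple of specific documents (the filenames below are placeholders):

```bash
# Unanswerable ("none"-format) questions only, scoped to two hypothetical PDFs
python -m surfsense_evals run multimodal_doc mmlongbench \
    --format none --docs report_a.pdf,report_b.pdf
```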
## Ingestion knobs (vision LLM, processing mode, summarize)
The harness exposes POST /api/v1/documents/fileupload's three knobs on every ingest subcommand:
| Flag pair | Effect |
|---|---|
| `--use-vision-llm` / `--no-vision-llm` | Walk every embedded image in the PDF and inline image-derived text at the image's position (see below). |
| `--processing-mode {basic,premium}` | `premium` carries a 10× page multiplier and routes to a stronger ETL (e.g. LlamaCloud). |
| `--should-summarize` / `--no-summarize` | Generate a per-document summary at ingest. |
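For example, a non-default ingest that sets all three knobs explicitly (whether `premium` actually routes to a stronger ETL depends on your backend configuration):

```bash
# Vision extraction + premium ETL routing + per-document summaries at ingest
python -m surfsense_evals ingest medical mirage \
    --use-vision-llm --processing-mode premium --should-summarize
```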
The "Default ingest" column in the benchmarks table is what runs if you don't pass any flag. Whatever was actually used is recorded as a `__settings__` header in the doc map (`data/<suite>/maps/<benchmark>_*_map.jsonl`) and as `extra.ingest_settings` in `run_artifact.json`, then surfaced in the report — no need to hunt through CLI history.
> The backend's `ETL_SERVICE` env var (`DOCLING` | `UNSTRUCTURED` | `LLAMACLOUD`) is not per-upload. Restart the backend with a different `ETL_SERVICE` and re-ingest to compare ETLs (route through `--processing-mode premium` if your backend uses that mode for the stronger ETL).
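A sketch of that loop, assuming the backend runs via docker compose with a service named `backend` and a compose file that forwards `ETL_SERVICE` into the container (adapt to however you actually deploy it):

```bash
# Re-point the backend at LlamaCloud's ETL, restart it, then re-ingest and re-run
ETL_SERVICE=LLAMACLOUD docker compose up -d --force-recreate backend
python -m surfsense_evals ingest medical medxpertqa --processing-mode premium
python -m surfsense_evals run medical medxpertqa --n 100
```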
## What `--use-vision-llm` produces
When vision is on, the backend's ETL pipeline (`app/etl_pipeline/picture_describer.py`) does, per embedded image in the PDF:

1. Extract the raw image bytes via `pypdf` (deduped by sha256, size-capped to match the vision LLM's per-image limit).
2. Per-image OCR — re-feed the image as a standalone upload through the configured ETL service (Docling / Azure DI / LlamaCloud) with `vision_llm=None`, so the ETL's OCR engine extracts the literal text-in-image.
3. Visual description — call the vision LLM on the image with a description-only prompt (it's explicitly told not to transcribe text — that's OCR's job). Steps 2 and 3 run in parallel per image.
4. Splice a horizontal-rule-delimited section at the image's original position in the parser markdown (replacing Docling's `<!-- image -->` placeholder + caption, or the bare `Image: <name>` caption a stripped-image parser leaves behind):

   ```markdown
   ---
   **Embedded image:** `MM-130-a.jpeg`
   **OCR text:** Slice 24 / 60 L R
   **Visual description:**
   - Axial contrast-enhanced CT showing a large cystic mass in the left upper quadrant.
   - Mass effect on the adjacent stomach; left kidney displaced inferiorly.
   ---
   ```
This is what makes --scenario symmetric-cheap and --scenario cost-arbitrage work: a non-vision LLM reading SurfSense's chunks sees the image's text and semantic content as plain markdown, alongside the surrounding case text, in the same retrieved chunk. Without it the cheap LLM would have nothing extra to read.
## A/B testing the same corpus with different settings
SurfSense dedupes uploads by (filename, search_space_id) — not by content hash and not by ingestion settings. Re-uploading the same filename to the same SearchSpace with a different --use-vision-llm flag silently skips re-processing. Give each variant its own SearchSpace:
```bash
# Baseline arm (vision off)
python -m surfsense_evals setup --suite medical --provider-model anthropic/claude-sonnet-4.5
python -m surfsense_evals ingest medical medxpertqa --no-vision-llm
python -m surfsense_evals run medical medxpertqa --n 100
python -m surfsense_evals teardown --suite medical

# Vision arm (the benchmark default)
python -m surfsense_evals setup --suite medical --provider-model anthropic/claude-sonnet-4.5
python -m surfsense_evals ingest medical medxpertqa
python -m surfsense_evals run medical medxpertqa --n 100
python -m surfsense_evals report --suite medical
```
Both runs land in data/medical/runs/<ts>/medxpertqa/ with their settings recorded; rendered PDFs stay cached under data/medical/medxpertqa/pdfs/ so the second ingest is upload-only.
## Environment variables
- `SURFSENSE_API_BASE` (default `http://localhost:8000`)
- `OPENROUTER_API_KEY` — required for the `native_pdf` arm and for `models list`
- One of `SURFSENSE_USER_EMAIL` + `SURFSENSE_USER_PASSWORD` (LOCAL), or `SURFSENSE_JWT` (+ optional `SURFSENSE_REFRESH_TOKEN`) for GOOGLE/pre-issued JWT
- `EVAL_DATA_DIR` (default `<project>/data`) — datasets, rendered PDFs, ingestion id maps, run outputs, `state.json`
- `EVAL_REPORTS_DIR` (default `<project>/reports`)
- `OPENROUTER_BASE_URL` (default `https://openrouter.ai/api/v1`) — only if you proxy OpenRouter
## Adding a new domain suite
1. Create `surfsense_evals/src/surfsense_evals/suites/<domain>/<benchmark>/` with `__init__.py`, `ingest.py`, `runner.py`, optional `prompt.py`.
2. Implement a `Benchmark` subclass (see `core/registry.py`); compose with `core.clients.*`, `core.arms.*`, `core.parse.*`, `core.metrics.*`.
3. Call `register(MyBenchmark())` at the bottom of `<benchmark>/__init__.py`. Auto-discovery picks it up; `setup --suite <domain>` and `ingest`/`run <domain> <benchmark>` work immediately.
Each suite gets its own SearchSpace (`eval-<suite>-<UTC-ts>`), `state.json` slot, data dir, reports dir, and pinned LLM. Suites never share a SearchSpace.
## Out of scope (follow-up PRs)
- Docker service for `docker compose run evals run medical medxpertqa`.
- Multi-model sweeps (one slug per `setup` for now; aggregate reports come later).
- A long-context-stuffing arm (give the model the same retrieved chunks SurfSense saw).
- LLM-judge grader for MMLongBench-Doc (paper uses GPT-4 as judge; we ship a deterministic rule-based grader).
- MedXpertQA-MM accuracy by image modality — dataset doesn't tag modality directly; we slice by `medical_task` and `body_system`.
- A `--slot <name>` flag that decouples the state-slot key from the benchmark registry's `suite` attribute, so parallel SearchSpaces with different ingestion settings can coexist on the same benchmark without `teardown` between A/B arms.
See c:/Users/91882/.cursor/plans/medical_rag_evals_(mirage_+_curev1)_e797a324.plan.md for the full design rationale.