mirror of https://github.com/supermemoryai/supermemory.git synced 2026-05-18 23:36:00 +00:00

Dhravya cba994be3f feat: add eval corpus data generator

7-phase pipeline for generating synthetic multi-file corpora:
1. Scenario Brief (SCENARIO.md) - world-building
2. Fact Registry (facts.json) - consistency source of truth
3. File Manifest (manifest.json) - per-file briefs
4. Clustering - topological sort + fact registry sharding
5. Parallel File Generation - concurrent workers
6. Validation - cross-reference & consistency audit
7. Question Generation - 10 eval questions per corpus

Supports all 13 data points (dp_001 through dp_013, 5 to 10,000 files).
Uses Gemini 2.5 Pro by default. Includes resume support, validation,
and 219 unit tests.

2026-04-28 23:24:42 +00:00


Eval Corpus Data Generator

Generates synthetic multi-file corpora for the SMFS memory eval benchmark. Each corpus simulates a real organization's shared memory — files written by many authors, in many formats, over a specific time period.

Architecture

7-phase pipeline:

1. Scenario Brief: One LLM call creates the "bible" for the corpus (cast, timeline, locked facts, per-file briefs)
2. Fact Registry: Extracts every concrete fact into structured JSON (the single source of truth for consistency)
3. File Manifest: Describes every file to generate (path, format, author, locked facts, cross-references)
4. Clustering: Groups files into clusters of 3-8, topologically sorted so dependencies generate first
5. File Generation: Parallel workers generate files within clusters, passing cross-reference context
6. Validation: Audits token counts, locked facts, name consistency, cross-references
7. Question Generation: Creates 10 eval questions per corpus
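
The clustering phase can be pictured with a small sketch. This is not the generator's actual code: `order_clusters`, the cluster IDs, and the file paths are hypothetical, and it assumes clusters have already been formed so that mutually cross-referencing files share a cluster. It only shows how a cluster-level dependency graph can be sorted so that referenced clusters generate first, using the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def order_clusters(clusters: dict[str, list[str]],
                   refs: dict[str, set[str]]) -> list[str]:
    """Return cluster IDs so that a cluster comes after every cluster it references.

    clusters: cluster ID -> file paths in that cluster (hypothetical shape).
    refs:     file path  -> file paths it cross-references (hypothetical shape).
    """
    cluster_of = {path: cid for cid, paths in clusters.items() for path in paths}
    # Every cluster appears in the graph, even if it has no dependencies.
    cluster_deps: dict[str, set[str]] = {cid: set() for cid in clusters}
    for path, targets in refs.items():
        src = cluster_of.get(path)
        if src is None:
            continue
        for target in targets:
            dst = cluster_of.get(target)
            # References inside one cluster are co-generated, so they add no edge.
            if dst is not None and dst != src:
                cluster_deps[src].add(dst)
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(cluster_deps).static_order())


# Made-up example data, for illustration only.
clusters = {
    "c_reports": ["reports/q1.md", "reports/summary.md"],
    "c_profiles": ["memory/profiles/alice.md"],
    "c_emails": ["emails/kickoff.eml"],
}
refs = {
    "reports/q1.md": {"memory/profiles/alice.md", "reports/summary.md"},
    "reports/summary.md": {"reports/q1.md"},
    "emails/kickoff.eml": {"reports/q1.md"},
}
cluster_order = order_clusters(clusters, refs)
print(cluster_order)
```

A cycle between clusters would make `static_order()` raise `graphlib.CycleError`, which is one reason to keep mutually referencing files in the same cluster.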

Setup

pip install -r requirements.txt

Requires an API key for the LLM provider you use, set as an environment variable. Gemini is the default:

export GEMINI_API_KEY=your-key-here
# or, for a different provider:
export OPENAI_API_KEY=your-key-here
export ANTHROPIC_API_KEY=your-key-here
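
Before a long run it can help to confirm that a key is actually visible to Python. The following is a minimal sketch, not part of the generator: `configured_providers` is a hypothetical helper, and only the three variable names come from the setup instructions above.

```python
import os

# Variable names from the setup instructions; the tuple order is arbitrary.
PROVIDER_KEYS = ("GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_providers(env=None) -> list[str]:
    """Return the provider key variables that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name)]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("Keys found:", ", ".join(found))
    else:
        print("No provider key set; export one of:", ", ".join(PROVIDER_KEYS))
```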

Usage

# Generate a single data point
python generate.py dp_001

# Generate a range
python generate.py dp_001 dp_005

# Resume a failed generation
python generate.py dp_003 --resume

# Generate only questions for an existing corpus
python generate.py dp_001 --questions-only

# Validate an existing corpus
python generate.py dp_002 --validate-only

# Use a specific model
python generate.py dp_001 --model openai/gpt-4o

# Set concurrency for large corpora
python generate.py dp_010 --max-concurrent 20

# Custom output directory
python generate.py dp_001 --output-dir /path/to/output

Data Points

Data point	Files	Scenario
dp_001	5	Two-person consulting kickoff
dp_002	10	Couple's anniversary weekend trip
dp_003	20	Single ER patient case across visits
dp_004	30	Small-claims legal matter
dp_005	50	Two-roommate co-living journal
dp_006	100	Indie open-source project, 6 months
dp_007	200	Grad-student lab, first semester
dp_008	300	Pre-seed startup, first 6 months
dp_009	500	Small therapy practice, 6 months
dp_010	1,000	Growth-stage startup, 6 months
dp_011	2,000	Newsroom investigation, 18 months
dp_012	5,000	Embassy at one posting, 3-year archive
dp_013	10,000	Tech-company CEO, full annual archive

Output Structure

output/dp_NNN/
├── SCENARIO.md           # Deep brief (world-building bible)
├── facts.json            # Structured fact registry
├── manifest.json         # File manifest with per-file briefs
├── data/                 # The actual corpus
│   ├── [domain folders]/
│   └── memory/
│       ├── profiles/
│       └── ...
├── question.json         # 10 eval questions
├── generation_log.json   # Audit trail (model, tokens, retries)
└── validation_report.json # Consistency audit results
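
Because each phase leaves a distinct artifact in this tree, a resumed run can infer what is left to do from the files that exist. The sketch below illustrates that idea only; the phase names, the artifact-to-phase mapping, and `pending_phases` are assumptions for illustration, and the real `--resume` logic may check more than file existence.

```python
import tempfile
from pathlib import Path

# Artifact names come from the output tree above; the mapping to phases is assumed.
PHASE_ARTIFACTS = {
    "scenario_brief": "SCENARIO.md",
    "fact_registry": "facts.json",
    "file_manifest": "manifest.json",
    "file_generation": "data",
    "validation": "validation_report.json",
    "question_generation": "question.json",
}


def pending_phases(corpus_dir: Path) -> list[str]:
    """List phases whose artifact is missing from corpus_dir, in pipeline order."""
    return [
        phase
        for phase, artifact in PHASE_ARTIFACTS.items()
        if not (corpus_dir / artifact).exists()
    ]


# Demo against a throwaway directory that simulates a run interrupted after phase 2.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "dp_001"
    corpus.mkdir()
    (corpus / "SCENARIO.md").write_text("# brief\n")
    (corpus / "facts.json").write_text("{}")
    remaining = pending_phases(corpus)

print(remaining)
```

An existence check alone cannot tell a complete `data/` directory from a partially written one, so a real implementation would likely need finer-grained tracking for the file generation phase.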

Testing

python -m pytest test_planner.py test_clusterer.py test_worker.py test_validator.py test_questions.py -v

Design Decisions

Gemini 2.5 Pro as default model (free tier, large context window)
Fact registry sharding for large corpora: global facts (people, orgs) go to every worker; scoped facts (dates, financials) only go to workers that need them
Topological cluster ordering: files that cross-reference each other are co-generated; dependency clusters generate first
30% overshoot tolerance on token counts: a file may exceed its target token count by up to 30%, since slightly long is better than too short
Resume support: every phase checks for existing output and skips if found
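
Two of these decisions, fact registry sharding and the overshoot tolerance, are easy to sketch. Everything below is illustrative: the fact schema (`id`, `kind`, `files`), the set of kinds treated as global, the function names, and the sample facts are assumptions, not the layout of the real `facts.json`. Only the global versus scoped split and the 30% figure come from the list above.

```python
# Kinds sent to every worker (assumed labels for "people, orgs").
GLOBAL_KINDS = {"person", "org"}


def shard_facts(facts: list[dict], cluster_files: list[str]) -> list[dict]:
    """Keep global facts plus scoped facts attached to a file in this cluster."""
    wanted = set(cluster_files)
    return [
        fact
        for fact in facts
        if fact["kind"] in GLOBAL_KINDS or wanted & set(fact.get("files", []))
    ]


def within_overshoot(actual_tokens: int, target_tokens: int,
                     tolerance: float = 0.30) -> bool:
    """True if the file is no more than `tolerance` over its target length.

    Only the upper bound is checked; the source does not state a lower bound.
    """
    return actual_tokens <= target_tokens * (1 + tolerance)


# Made-up facts, for illustration only.
facts = [
    {"id": "F1", "kind": "person", "value": "lead consultant", "files": []},
    {"id": "F2", "kind": "date", "value": "kickoff date",
     "files": ["emails/kickoff.eml"]},
    {"id": "F3", "kind": "financial", "value": "project budget",
     "files": ["contracts/sow.md"]},
]
shard = shard_facts(facts, ["emails/kickoff.eml"])
print([fact["id"] for fact in shard])
print(within_overshoot(1250, 1000), within_overshoot(1400, 1000))
```

Sharding this way keeps each worker's prompt small on large corpora while still giving every worker the names it needs to stay consistent.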
