Mirror of https://github.com/supermemoryai/supermemory.git (synced 2026-05-18 23:36:00 +00:00)
7-phase pipeline for generating synthetic multi-file corpora:
1. Scenario Brief (SCENARIO.md) - world-building
2. Fact Registry (facts.json) - consistency source of truth
3. File Manifest (manifest.json) - per-file briefs
4. Clustering - topological sort + fact registry sharding
5. Parallel File Generation - concurrent workers
6. Validation - cross-reference & consistency audit
7. Question Generation - 10 eval questions per corpus
Supports all 13 data points (dp_001 through dp_013, 5 to 10,000 files). Uses Gemini 2.5 Pro by default. Includes resume support, validation, and 219 unit tests.
Eval Corpus Data Generator
Generates synthetic multi-file corpora for the SMFS memory eval benchmark. Each corpus simulates a real organization's shared memory — files written by many authors, in many formats, over a specific time period.
Architecture
7-phase pipeline:
1. Scenario Brief: One LLM call creates the "bible" for the corpus (cast, timeline, locked facts, per-file briefs)
2. Fact Registry: Extracts every concrete fact into structured JSON (the single source of truth for consistency)
3. File Manifest: Describes every file to generate (path, format, author, locked facts, cross-references)
4. Clustering: Groups files into clusters of 3-8, topologically sorted so dependencies generate first
5. File Generation: Parallel workers generate files within clusters, passing cross-reference context
6. Validation: Audits token counts, locked facts, name consistency, cross-references
7. Question Generation: Creates 10 eval questions per corpus
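The clustering phase can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `order_clusters` function, the shape of `deps` (file path mapped to the paths it references), and the simple fixed-size chunking are all assumptions, and the example paths are made up.

```python
from graphlib import TopologicalSorter


def order_clusters(deps: dict[str, list[str]], max_size: int = 8) -> list[list[str]]:
    """Return clusters of at most max_size files, dependencies first."""
    # static_order() yields a file only after every file it depends on.
    ordered = list(TopologicalSorter(deps).static_order())
    # Chunking a topological order keeps each dependency in the same
    # cluster as its dependent or in an earlier one.
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]


# Hypothetical manifest excerpt: each file lists the files it references.
deps = {
    "notes/kickoff.md": [],
    "email/followup.md": ["notes/kickoff.md"],
    "memory/profiles/client.md": ["notes/kickoff.md", "email/followup.md"],
}
clusters = order_clusters(deps, max_size=2)
```

`TopologicalSorter` raises `CycleError` on circular references, so a real clusterer would need to merge mutually referencing files into one cluster before sorting. This sketch also does not enforce the minimum cluster size of 3.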
Setup
pip install -r requirements.txt
Requires an API key in the environment for the LLM provider you use (Gemini by default):
export GEMINI_API_KEY=your-key-here
# or, for another provider, one of:
export OPENAI_API_KEY=your-key-here
export ANTHROPIC_API_KEY=your-key-here
Usage
# Generate a single data point
python generate.py dp_001
# Generate a range
python generate.py dp_001 dp_005
# Resume a failed generation
python generate.py dp_003 --resume
# Generate only questions for an existing corpus
python generate.py dp_001 --questions-only
# Validate an existing corpus
python generate.py dp_002 --validate-only
# Use a specific model
python generate.py dp_001 --model openai/gpt-4o
# Set concurrency for large corpora
python generate.py dp_010 --max-concurrent 20
# Custom output directory
python generate.py dp_001 --output-dir /path/to/output
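The `--model openai/gpt-4o` example suggests a `provider/model` string. A pre-flight check that the matching key is set might look like the sketch below. The helper names `required_key` and `key_is_set` are hypothetical, and the `gemini` and `anthropic` prefixes are assumed by analogy with `openai`; only the environment variable names come from the Setup section.

```python
import os

# Environment variable names come from the Setup section. The provider
# prefixes other than "openai" are assumptions made for illustration.
KEY_BY_PROVIDER = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def required_key(model: str) -> str:
    """Return the env var name needed for a 'provider/model' string."""
    provider, sep, _ = model.partition("/")
    if not sep or provider not in KEY_BY_PROVIDER:
        raise ValueError(f"unrecognized model string: {model!r}")
    return KEY_BY_PROVIDER[provider]


def key_is_set(model: str, env=os.environ) -> bool:
    """Report whether the key for the model's provider is present and non-empty."""
    return bool(env.get(required_key(model)))
```

Calling `key_is_set("openai/gpt-4o")` before a long run fails fast instead of partway through generation.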
Data Points
| dp | files | scenario |
|---|---|---|
| dp_001 | 5 | Two-person consulting kickoff |
| dp_002 | 10 | Couple's anniversary weekend trip |
| dp_003 | 20 | Single ER patient case across visits |
| dp_004 | 30 | Small-claims legal matter |
| dp_005 | 50 | Two-roommate co-living journal |
| dp_006 | 100 | Indie open-source project, 6 months |
| dp_007 | 200 | Grad-student lab, first semester |
| dp_008 | 300 | Pre-seed startup, first 6 months |
| dp_009 | 500 | Small therapy practice, 6 months |
| dp_010 | 1,000 | Growth-stage startup, 6 months |
| dp_011 | 2,000 | Newsroom investigation, 18 months |
| dp_012 | 5,000 | Embassy at one posting, 3-year archive |
| dp_013 | 10,000 | Tech-company CEO, full annual archive |
Output Structure
output/dp_NNN/
├── SCENARIO.md              # Deep brief (world-building bible)
├── facts.json               # Structured fact registry
├── manifest.json            # File manifest with per-file briefs
├── data/                    # The actual corpus
│   ├── [domain folders]/
│   └── memory/
│       ├── profiles/
│       └── ...
├── question.json            # 10 eval questions
├── generation_log.json      # Audit trail (model, tokens, retries)
└── validation_report.json   # Consistency audit results
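Because each phase leaves a file behind, `--resume` can in principle decide what to skip by checking which outputs exist. The sketch below illustrates that idea only: `pending_phases` and the phase-to-file mapping are assumptions inferred from the tree above, and the real script may track progress differently (for example, per generated file under `data/`).

```python
from pathlib import Path
import tempfile

# Assumed mapping from phase to the artifact that marks it complete,
# inferred from the output tree; listed in pipeline order.
PHASE_OUTPUTS = [
    ("scenario", "SCENARIO.md"),
    ("facts", "facts.json"),
    ("manifest", "manifest.json"),
    ("validation", "validation_report.json"),
    ("questions", "question.json"),
]


def pending_phases(corpus_dir: Path) -> list[str]:
    """List phases whose output file is missing or empty."""
    pending = []
    for phase, name in PHASE_OUTPUTS:
        path = corpus_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            pending.append(phase)
    return pending


# Simulate a run that stopped after the scenario brief was written.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "dp_001"
    corpus.mkdir()
    (corpus / "SCENARIO.md").write_text("# brief\n")
    remaining = pending_phases(corpus)
```

Treating an empty file as incomplete guards against a phase that crashed after creating its output but before writing to it.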
Testing
python -m pytest test_planner.py test_clusterer.py test_worker.py test_validator.py test_questions.py -v
Design Decisions
- Gemini 2.5 Pro as default model (free tier, large context window)
- Fact registry sharding for large corpora: global facts (people, orgs) go to every worker; scoped facts (dates, financials) only go to workers that need them
- Topological cluster ordering: files that cross-reference each other are co-generated; clusters that other clusters depend on generate first
- 30% overshoot tolerance on token counts: a file may exceed its target length by up to 30%, since slightly long is better than too short
- Resume support: every phase checks for existing output and skips if found
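Two of the decisions above, fact registry sharding and the overshoot tolerance, can be illustrated with a short sketch. Everything here is assumed for illustration: the fact schema (`kind`, `files`), the `GLOBAL_KINDS` labels, and both function names. The actual layout of `facts.json` is not documented in this README.

```python
GLOBAL_KINDS = {"person", "org"}  # assumed labels for facts sent to every worker


def shard_facts(facts: list[dict], cluster_files: set[str]) -> list[dict]:
    """Select the facts one worker needs for its cluster of files."""
    shard = []
    for fact in facts:
        is_global = fact["kind"] in GLOBAL_KINDS
        # A scoped fact is included only if it is locked to a file in this cluster.
        is_needed = bool(cluster_files & set(fact.get("files", [])))
        if is_global or is_needed:
            shard.append(fact)
    return shard


def within_overshoot(actual: int, target: int, overshoot: float = 0.30) -> bool:
    """Check only the upper bound: actual may exceed target by up to 30%."""
    return actual <= target * (1 + overshoot)


# Hypothetical registry entries.
facts = [
    {"id": "f1", "kind": "person", "files": []},
    {"id": "f2", "kind": "date", "files": ["notes/kickoff.md"]},
    {"id": "f3", "kind": "financial", "files": ["finance/invoice.md"]},
]
shard = shard_facts(facts, {"notes/kickoff.md"})
shard_ids = [fact["id"] for fact in shard]
```

Here the worker for `notes/kickoff.md` receives the person fact and the date fact but not the invoice figure, which keeps prompts small for large corpora. How files that fall short of their target are handled is not stated in the README, so `within_overshoot` deliberately leaves the lower bound unchecked.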