supermemory/data-generator
Dhravya 771be5cef8 fix: apply review feedback — fix double data/ prefix, semaphore bug, resume bug, consolidate duplicated code
- Fix worker.py writing to data/data/ instead of data/ (critical path bug)
- Fix semaphore recreation on every call due to checking _value instead of capacity
- Fix questions.py resume returning raw string instead of list[dict]
- Fix prompts/file_gen.py reading 'summary' instead of 'brief' from manifest
- Extract shared unwrap_json_list() and truncate_to_tokens() into utils.py
- Remove redundant validation report writes in generate.py
- Remove unused imports and dependencies
- Fix f-string logger calls to use lazy %s formatting
- Move calendar import to top-level in validator.py
- Use write_text() for atomic writes in repair_files()
- Strengthen test_resume_support to assert return type
2026-04-28 23:49:23 +00:00

Eval Corpus Data Generator

Generates synthetic multi-file corpora for the SMFS memory eval benchmark. Each corpus simulates a real organization's shared memory — files written by many authors, in many formats, over a specific time period.

Architecture

The generator runs a 7-phase pipeline:

  1. Scenario Brief — One LLM call creates the "bible" for the corpus (cast, timeline, locked facts, per-file briefs)
  2. Fact Registry — Extracts every concrete fact into structured JSON (the single source of truth for consistency)
  3. File Manifest — Describes every file to generate (path, format, author, locked facts, cross-references)
  4. Clustering — Groups files into clusters of 3-8, topologically sorted so dependencies generate first
  5. File Generation — Parallel workers generate files within clusters, passing cross-reference context
  6. Validation — Audits token counts, locked facts, name consistency, and cross-references
  7. Question Generation — Creates 10 eval questions per corpus
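
The cluster ordering in phase 4 can be sketched with the standard library's graphlib; the dependency mapping and cluster ids below are illustrative, not the generator's actual data model:

```python
from graphlib import TopologicalSorter

def cluster_order(deps: dict[str, set[str]]) -> list[str]:
    """Order clusters so dependencies generate first (phase 4).

    `deps` maps a cluster id to the cluster ids it cross-references.
    """
    return list(TopologicalSorter(deps).static_order())

# A cluster that references another cluster's files must generate after it:
order = cluster_order({"c2": {"c1"}, "c3": {"c1", "c2"}, "c1": set()})
```

Workers can then generate files cluster by cluster in this order, passing each finished cluster's output as cross-reference context to the next.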

Setup

pip install -r requirements.txt

Requires a Gemini API key (or other LLM provider key) in the environment:

export GEMINI_API_KEY=your-key-here
# or
export OPENAI_API_KEY=your-key-here
export ANTHROPIC_API_KEY=your-key-here

Usage

# Generate a single data point
python generate.py dp_001

# Generate a range
python generate.py dp_001 dp_005

# Resume a failed generation
python generate.py dp_003 --resume

# Generate only questions for an existing corpus
python generate.py dp_001 --questions-only

# Validate an existing corpus
python generate.py dp_002 --validate-only

# Use a specific model
python generate.py dp_001 --model openai/gpt-4o

# Set concurrency for large corpora
python generate.py dp_010 --max-concurrent 20

# Custom output directory
python generate.py dp_001 --output-dir /path/to/output

Data Points

dp        files   scenario
dp_001        5   Two-person consulting kickoff
dp_002       10   Couple's anniversary weekend trip
dp_003       20   Single ER patient case across visits
dp_004       30   Small-claims legal matter
dp_005       50   Two-roommate co-living journal
dp_006      100   Indie open-source project, 6 months
dp_007      200   Grad-student lab, first semester
dp_008      300   Pre-seed startup, first 6 months
dp_009      500   Small therapy practice, 6 months
dp_010    1,000   Growth-stage startup, 6 months
dp_011    2,000   Newsroom investigation, 18 months
dp_012    5,000   Embassy at one posting, 3-year archive
dp_013   10,000   Tech-company CEO, full annual archive

Output Structure

output/dp_NNN/
├── SCENARIO.md           # Deep brief (world-building bible)
├── facts.json            # Structured fact registry
├── manifest.json         # File manifest with per-file briefs
├── data/                 # The actual corpus
│   ├── [domain folders]/
│   └── memory/
│       ├── profiles/
│       └── ...
├── question.json         # 10 eval questions
├── generation_log.json   # Audit trail (model, tokens, retries)
└── validation_report.json # Consistency audit results

Testing

python -m pytest test_planner.py test_clusterer.py test_worker.py test_validator.py test_questions.py -v

Design Decisions

  • Gemini 2.5 Pro as default model (free tier, large context window)
  • Fact registry sharding for large corpora: global facts (people, orgs) go to every worker; scoped facts (dates, financials) only go to workers that need them
  • Topological cluster ordering: files that cross-reference each other are co-generated; dependency clusters generate first
  • 30% overshoot tolerance on token counts: slightly long is better than too short
  • Resume support: every phase checks for existing output and skips if found
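
The resume behavior in the last bullet can be sketched as a small phase runner; `run_phase` and `produce` are illustrative names, not the generator's real API:

```python
import tempfile
from pathlib import Path

def run_phase(out_dir: Path, artifact: str, produce) -> str:
    """Run one pipeline phase, skipping it if its artifact already exists.

    `produce` is any callable returning the artifact's text content.
    """
    target = out_dir / artifact
    if target.exists():            # resume: phase already completed, skip
        return target.read_text()
    text = produce()
    target.write_text(text)        # persist so a later resume can skip
    return text

out = Path(tempfile.mkdtemp())
first = run_phase(out, "facts.json", lambda: "{}")
second = run_phase(out, "facts.json", lambda: "SHOULD NOT RUN")
# the second call finds facts.json on disk and returns the cached "{}"
```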