supermemory/data-generator
Dhravya 771be5cef8 fix: apply review feedback — fix double data/ prefix, semaphore bug, resume bug, consolidate duplicated code
- Fix worker.py writing to data/data/ instead of data/ (critical path bug)
- Fix semaphore recreation on every call due to checking _value instead of capacity
- Fix questions.py resume returning raw string instead of list[dict]
- Fix prompts/file_gen.py reading 'summary' instead of 'brief' from manifest
- Extract shared unwrap_json_list() and truncate_to_tokens() into utils.py
- Remove redundant validation report writes in generate.py
- Remove unused imports and dependencies
- Fix f-string logger calls to use lazy %s formatting
- Move calendar import to top-level in validator.py
- Use write_text() for atomic writes in repair_files()
- Strengthen test_resume_support to assert return type
2026-04-28 23:49:23 +00:00

Eval Corpus Data Generator

Generates synthetic multi-file corpora for the SMFS memory eval benchmark. Each corpus simulates a real organization's shared memory — files written by many authors, in many formats, over a specific time period.

Architecture

The generator runs a 7-phase pipeline:

  1. Scenario Brief — One LLM call creates the "bible" for the corpus (cast, timeline, locked facts, per-file briefs)
  2. Fact Registry — Extracts every concrete fact into structured JSON (the single source of truth for consistency)
  3. File Manifest — Describes every file to generate (path, format, author, locked facts, cross-references)
  4. Clustering — Groups files into clusters of 3-8, topologically sorted so dependencies generate first
  5. File Generation — Parallel workers generate files within clusters, passing cross-reference context
  6. Validation — Audits token counts, locked facts, name consistency, and cross-references
  7. Question Generation — Creates 10 eval questions per corpus
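
The cluster ordering in phase 4 can be sketched with the standard library's graphlib; the dependency mapping and cluster ids below are illustrative, not the generator's actual data model:

```python
from graphlib import TopologicalSorter

def cluster_order(deps: dict[str, set[str]]) -> list[str]:
    """Order clusters so dependencies generate first (phase 4).

    `deps` maps a cluster id to the cluster ids it cross-references.
    """
    return list(TopologicalSorter(deps).static_order())

# A cluster that references another cluster's files must generate after it:
order = cluster_order({"c2": {"c1"}, "c3": {"c1", "c2"}, "c1": set()})
```

Workers can then generate files cluster by cluster in this order, passing each finished cluster's output as cross-reference context to the next.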

Setup

pip install -r requirements.txt

Requires a Gemini API key (or other LLM provider key) in the environment:

export GEMINI_API_KEY=your-key-here
# or
export OPENAI_API_KEY=your-key-here
export ANTHROPIC_API_KEY=your-key-here

Usage

# Generate a single data point
python generate.py dp_001

# Generate a range
python generate.py dp_001 dp_005

# Resume a failed generation
python generate.py dp_003 --resume

# Generate only questions for an existing corpus
python generate.py dp_001 --questions-only

# Validate an existing corpus
python generate.py dp_002 --validate-only

# Use a specific model
python generate.py dp_001 --model openai/gpt-4o

# Set concurrency for large corpora
python generate.py dp_010 --max-concurrent 20

# Custom output directory
python generate.py dp_001 --output-dir /path/to/output

Data Points

dp        files   scenario
dp_001        5   Two-person consulting kickoff
dp_002       10   Couple's anniversary weekend trip
dp_003       20   Single ER patient case across visits
dp_004       30   Small-claims legal matter
dp_005       50   Two-roommate co-living journal
dp_006      100   Indie open-source project, 6 months
dp_007      200   Grad-student lab, first semester
dp_008      300   Pre-seed startup, first 6 months
dp_009      500   Small therapy practice, 6 months
dp_010    1,000   Growth-stage startup, 6 months
dp_011    2,000   Newsroom investigation, 18 months
dp_012    5,000   Embassy at one posting, 3-year archive
dp_013   10,000   Tech-company CEO, full annual archive

Output Structure

output/dp_NNN/
├── SCENARIO.md           # Deep brief (world-building bible)
├── facts.json            # Structured fact registry
├── manifest.json         # File manifest with per-file briefs
├── data/                 # The actual corpus
│   ├── [domain folders]/
│   └── memory/
│       ├── profiles/
│       └── ...
├── question.json         # 10 eval questions
├── generation_log.json   # Audit trail (model, tokens, retries)
└── validation_report.json # Consistency audit results

Testing

python -m pytest test_planner.py test_clusterer.py test_worker.py test_validator.py test_questions.py -v

Design Decisions

  • Gemini 2.5 Pro as default model (free tier, large context window)
  • Fact registry sharding for large corpora: global facts (people, orgs) go to every worker; scoped facts (dates, financials) only go to workers that need them
  • Topological cluster ordering: files that cross-reference each other are co-generated; dependency clusters generate first
  • 30% overshoot tolerance on token counts: slightly long is better than too short
  • Resume support: every phase checks for existing output and skips if found
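
The resume behavior in the last bullet can be sketched as a small phase runner; `run_phase` and `produce` are illustrative names, not the generator's real API:

```python
import tempfile
from pathlib import Path

def run_phase(out_dir: Path, artifact: str, produce) -> str:
    """Run one pipeline phase, skipping it if its artifact already exists.

    `produce` is any callable returning the artifact's text content.
    """
    target = out_dir / artifact
    if target.exists():            # resume: phase already completed, skip
        return target.read_text()
    text = produce()
    target.write_text(text)        # persist so a later resume can skip
    return text

out = Path(tempfile.mkdtemp())
first = run_phase(out, "facts.json", lambda: "{}")
second = run_phase(out, "facts.json", lambda: "SHOULD NOT RUN")
# the second call finds facts.json on disk and returns the cached "{}"
```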