Commit graph

2 commits

Author SHA1 Message Date
Dhravya
771be5cef8 fix: apply review feedback: fix double data/ prefix, semaphore bug, and resume bug; consolidate duplicated code
- Fix worker.py writing to data/data/ instead of data/ (critical path bug)
- Fix semaphore recreation on every call due to checking _value instead of capacity
- Fix questions.py resume returning raw string instead of list[dict]
- Fix prompts/file_gen.py reading 'summary' instead of 'brief' from manifest
- Extract shared unwrap_json_list() and truncate_to_tokens() into utils.py
- Remove redundant validation report writes in generate.py
- Remove unused imports and dependencies
- Fix f-string logger calls to use lazy %s formatting
- Move calendar import to top-level in validator.py
- Use write_text() for atomic writes in repair_files()
- Strengthen test_resume_support to assert return type
2026-04-28 23:49:23 +00:00
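The semaphore fix above can be illustrated with a minimal sketch. It is not the repository's code: `SemaphorePool` and `get()` are hypothetical names, and the sketch assumes the original bug came from comparing the private `_value` counter against the requested limit. Because `_value` drops whenever a permit is held, that comparison fails under load and a fresh semaphore is created, which defeats the concurrency cap. Tracking the configured capacity separately avoids this.

```python
import asyncio
from typing import Optional


class SemaphorePool:
    """Hypothetical helper that caches one semaphore for a given capacity."""

    def __init__(self) -> None:
        self._sem: Optional[asyncio.Semaphore] = None
        self._capacity: Optional[int] = None

    def get(self, capacity: int) -> asyncio.Semaphore:
        # Compare against the configured capacity, not sem._value.
        # _value is the number of permits currently free, so it changes
        # as workers acquire and release the semaphore.
        if self._sem is None or self._capacity != capacity:
            self._sem = asyncio.Semaphore(capacity)
            self._capacity = capacity
        return self._sem


async def demo() -> bool:
    pool = SemaphorePool()
    first = pool.get(4)
    async with first:
        # A permit is held here; the cached semaphore must still be reused.
        second = pool.get(4)
    return first is second


same = asyncio.run(demo())
```

In this sketch the semaphore is only replaced when the requested capacity actually changes, so all callers share the same limit.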
Dhravya
cba994be3f feat: add eval corpus data generator
7-phase pipeline for generating synthetic multi-file corpora:
1. Scenario Brief (SCENARIO.md) - world-building
2. Fact Registry (facts.json) - consistency source of truth
3. File Manifest (manifest.json) - per-file briefs
4. Clustering - topological sort + fact registry sharding
5. Parallel File Generation - concurrent workers
6. Validation - cross-reference & consistency audit
7. Question Generation - 10 eval questions per corpus

Supports all 13 data points (dp_001 through dp_013), ranging from 5 to 10,000 files.
Uses Gemini 2.5 Pro by default. Includes resume support, validation,
and 219 unit tests.
2026-04-28 23:24:42 +00:00
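The topological sort in phase 4 can be sketched as follows, as one possible reading of the commit message rather than the generator's actual implementation. The manifest shape (file name mapped to the files it depends on), the file names, and the `generation_waves` helper are all assumptions made for illustration. The sketch uses the standard-library `graphlib.TopologicalSorter` to group files into waves whose members could be generated in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical manifest: each file maps to the files it depends on.
manifest = {
    "org_chart.md": [],
    "budget.csv": ["org_chart.md"],
    "q3_report.md": ["budget.csv", "org_chart.md"],
    "meeting_notes.md": ["org_chart.md"],
}


def generation_waves(deps: dict) -> list:
    """Group files into waves; files in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are cyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for deterministic output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves


waves = generation_waves(manifest)
# waves == [["org_chart.md"], ["budget.csv", "meeting_notes.md"], ["q3_report.md"]]
```

Each wave could then be handed to the concurrent workers of phase 5, with later waves able to reference the content of earlier ones.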