- Add unit tests for internal/ai/eval package
- Validate configuration, retry logic, and custom SSE parsing
- Enables coverage for eval framework without requiring live Pulse server
Refactor patrol eval runner to use a dual approach:
1. Poll GET /api/ai/patrol/status until Running=false (primary signal)
2. Best-effort SSE stream connection for tool event visibility
Changes:
- Add status polling loop with configurable timeout
- Make SSE stream optional (may not connect in time)
- Add Completed flag to PatrolRunResult
- Improve assertion error messages
- Add new scenarios and assertions
This is more reliable than relying solely on SSE stream which
may timeout waiting for headers during slow patrol initialization.
Add comprehensive patrol evaluation framework:
- patrol.go: Runner for patrol scenarios with streaming support
- patrol_assertions.go: Assertions for tool usage, findings, timing
- patrol_scenarios.go: Scenarios for basic, investigation, finding quality
- eval_test.go: Unit tests for patrol eval runner
Scenarios:
- patrol-basic: Verifies patrol completes with tools and findings
- patrol-investigation: Ensures investigation before reporting
- patrol-finding-quality: Validates finding structure and evidence
Run with: go run ./cmd/eval -scenario patrol
- Add retry logic for transient failures (phantom, stream, empty response)
- Add environment variable overrides for infrastructure naming
- Add JSON report output per scenario
- Expand assertions with new validation types
- Add more comprehensive test scenarios
- Add docs/EVAL.md with usage documentation
The eval harness now better handles flaky AI responses and provides
detailed reports for debugging.