rcourtman
44fecc37c0
feat(eval): enhance AI eval harness with retries and reporting
...
- Add retry logic for transient failures (phantom, stream, empty response)
- Add environment variable overrides for infrastructure naming
- Add JSON report output per scenario
- Expand assertions with new validation types
- Add more comprehensive test scenarios
- Add docs/EVAL.md with usage documentation
The eval harness now better handles flaky AI responses and provides
detailed reports for debugging.
2026-01-28 21:24:12 +00:00
rcourtman
a04d41ce2c
Add end-to-end evaluation framework for AI assistant testing
...
Implement comprehensive eval framework for testing Pulse Assistant:
Core components:
- Runner: Executes scenarios against live API with SSE stream parsing
- Assertions: Reusable checks (tool usage, content, duration, errors)
- Scenarios: Multi-step test workflows with configurable assertions
Basic scenarios:
- QuickSmokeTest: Minimal functionality verification
- ReadOnlyInfrastructure: List, logs, status operations
- RoutingValidation: Command routing to correct targets
- LogTailing: Bounded log commands complete properly
- Discovery: Infrastructure discovery capabilities
Advanced scenarios:
- TroubleshootingScenario: Multi-step investigation workflow
- DeepDiveScenario: Thorough single-service investigation
- ConfigInspectionScenario: Reading configuration files
- ResourceAnalysisScenario: Cross-container resource comparison
- MultiNodeScenario: Operations across Proxmox nodes
- DockerInDockerScenario: Docker containers inside LXCs
- ContextChainScenario: Context retention across turns
Usage: go test ./internal/ai/eval -live -run TestQuickSmokeTest
2026-01-28 16:49:24 +00:00