Commit graph

12 commits

Author SHA1 Message Date
rcourtman
a2cfda0936 fix(test): remove flaky content type test in eval 2026-02-02 19:26:24 +00:00
rcourtman
9b304f8a78 test(ai): comprehensive eval coverage (~71%) including scenarios, overrides, and error cases 2026-02-02 19:18:19 +00:00
rcourtman
abc8900d4c test(ai): add patrol assertions tests, coverage now 53.3% 2026-02-02 19:11:39 +00:00
rcourtman
aa4d728963 test(ai): add patrol quality logic tests, coverage now 42.5% 2026-02-02 19:10:45 +00:00
rcourtman
469c687860 test(ai): improve eval package coverage to 40% 2026-02-02 19:09:13 +00:00
rcourtman
5959cd9d7f test(ai): add unit tests for eval runner
- Add unit tests for internal/ai/eval package
- Validate configuration, retry logic, and custom SSE parsing
- Enables coverage for eval framework without requiring live Pulse server
2026-02-02 14:54:01 +00:00
rcourtman
9b0fb527f5 feat(patrol): implement patrol findings, evaluation, and investigation logic
- Add core Patrol system for automated investigations
- Implement findings management and deduplication logic
- Add evaluation framework (patrol_eval) with quality assertions and scenarios
- Add patrol-specific tools and executor integration
- Add E2E test matrix script
2026-01-31 16:23:08 +00:00
rcourtman
95a0d7a6bd feat(backend): implement AI Patrol, Investigation, and system-wide refactors 2026-01-30 19:02:14 +00:00
rcourtman
0e880f3c89 feat(eval): improve patrol eval with polling-based completion
Refactor patrol eval runner to use a dual approach:
1. Poll GET /api/ai/patrol/status until Running=false (primary signal)
2. Best-effort SSE stream connection for tool event visibility

Changes:
- Add status polling loop with configurable timeout
- Make SSE stream optional (may not connect in time)
- Add Completed flag to PatrolRunResult
- Improve assertion error messages
- Add new scenarios and assertions

This is more reliable than relying solely on the SSE stream, which
may time out waiting for headers during slow patrol initialization.
2026-01-29 08:20:39 +00:00
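
The polling approach described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `statusFunc`, `waitForPatrol` and `simulate` are hypothetical names, and the status source is a fake standing in for the GET /api/ai/patrol/status request, so the sketch runs without a Pulse server.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// statusFunc is a hypothetical stand-in for one GET /api/ai/patrol/status
// request; it reports whether the patrol run is still in progress.
type statusFunc func(ctx context.Context) (running bool, err error)

// waitForPatrol polls fetch once per interval until it reports
// running == false or the timeout expires. The bool result plays the
// role of a Completed flag.
func waitForPatrol(ctx context.Context, fetch statusFunc, interval, timeout time.Duration) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		running, err := fetch(ctx)
		if err != nil {
			return false, fmt.Errorf("patrol status: %w", err)
		}
		if !running {
			return true, nil
		}
		select {
		case <-ctx.Done():
			return false, ctx.Err() // timed out or cancelled
		case <-ticker.C:
			// interval elapsed, poll again
		}
	}
}

// simulate drives waitForPatrol with a fake status source that reports
// running == true for the first runningCalls polls. It returns the
// completed flag and the number of polls made.
func simulate(runningCalls, timeoutMs int) (bool, int) {
	calls := 0
	fake := func(context.Context) (bool, error) {
		calls++
		return calls <= runningCalls, nil
	}
	done, _ := waitForPatrol(context.Background(), fake, time.Millisecond,
		time.Duration(timeoutMs)*time.Millisecond)
	return done, calls
}

func main() {
	done, calls := simulate(3, 1000)
	fmt.Printf("completed=%v after %d polls\n", done, calls)
}
```

Using the status endpoint as the completion signal means a slow or missing SSE connection only costs tool event visibility, not the outcome of the run.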
rcourtman
c409e7a05e feat(eval): add patrol-specific eval scenarios and assertions
Add comprehensive patrol evaluation framework:

- patrol.go: Runner for patrol scenarios with streaming support
- patrol_assertions.go: Assertions for tool usage, findings, timing
- patrol_scenarios.go: Scenarios for basic, investigation, finding quality
- eval_test.go: Unit tests for patrol eval runner

Scenarios:
- patrol-basic: Verifies patrol completes with tools and findings
- patrol-investigation: Ensures investigation before reporting
- patrol-finding-quality: Validates finding structure and evidence

Run with: go run ./cmd/eval -scenario patrol
2026-01-28 23:19:11 +00:00
rcourtman
44fecc37c0 feat(eval): enhance AI eval harness with retries and reporting
- Add retry logic for transient failures (phantom, stream, empty response)
- Add environment variable overrides for infrastructure naming
- Add JSON report output per scenario
- Expand assertions with new validation types
- Add more comprehensive test scenarios
- Add docs/EVAL.md with usage documentation

The eval harness now better handles flaky AI responses and provides
detailed reports for debugging.
2026-01-28 21:24:12 +00:00
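
As a rough illustration of the retry logic mentioned in the commit above, the sketch below retries a step only when its failure is classified as transient. The sentinel errors and the names `isTransient`, `runWithRetry` and `flakyStep` are assumptions made for this example. The commit's "phantom" category is left out because the message does not define it.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical failure categories; the real harness may classify
// failures differently.
var (
	errStream        = errors.New("stream error")
	errEmptyResponse = errors.New("empty response")
	errAssertion     = errors.New("assertion failed")
)

// isTransient reports whether a failure is worth retrying.
func isTransient(err error) bool {
	return errors.Is(err, errStream) || errors.Is(err, errEmptyResponse)
}

// runWithRetry calls step up to maxAttempts times. It stops early on
// success or on a non-transient error, and returns the number of
// attempts made together with the final error.
func runWithRetry(maxAttempts int, step func(attempt int) error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = step(attempt)
		if err == nil || !isTransient(err) {
			return attempt, err
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

// flakyStep returns a step that fails with failWith for the first
// `failures` attempts and succeeds afterwards.
func flakyStep(failures int, failWith error) func(int) error {
	return func(attempt int) error {
		if attempt <= failures {
			return failWith
		}
		return nil
	}
}

func main() {
	attempts, err := runWithRetry(3, flakyStep(2, errEmptyResponse))
	fmt.Println(attempts, err)

	attempts, err = runWithRetry(3, flakyStep(1, errAssertion))
	fmt.Println(attempts, err)
}
```

Keeping assertion failures out of the transient set matters: retrying them would hide genuine regressions behind a flaky-looking pass.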
rcourtman
a04d41ce2c Add end-to-end evaluation framework for AI assistant testing
Implement comprehensive eval framework for testing Pulse Assistant:

Core components:
- Runner: Executes scenarios against live API with SSE stream parsing
- Assertions: Reusable checks (tool usage, content, duration, errors)
- Scenarios: Multi-step test workflows with configurable assertions

Basic scenarios:
- QuickSmokeTest: Minimal functionality verification
- ReadOnlyInfrastructure: List, logs, status operations
- RoutingValidation: Command routing to correct targets
- LogTailing: Bounded log commands complete properly
- Discovery: Infrastructure discovery capabilities

Advanced scenarios:
- TroubleshootingScenario: Multi-step investigation workflow
- DeepDiveScenario: Thorough single-service investigation
- ConfigInspectionScenario: Reading configuration files
- ResourceAnalysisScenario: Cross-container resource comparison
- MultiNodeScenario: Operations across Proxmox nodes
- DockerInDockerScenario: Docker containers inside LXCs
- ContextChainScenario: Context retention across turns

Usage: go test ./internal/ai/eval -live -run TestQuickSmokeTest
2026-01-28 16:49:24 +00:00
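
The SSE stream parsing that the Runner relies on could look something like the sketch below. It follows the general server-sent events line format (`event:` and `data:` fields, comment lines starting with `:`, a blank line ending each event) and is not taken from the repository. The type `sseEvent`, the functions `parseSSE` and `parseSSEString`, and the event name `tool_start` are all hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// sseEvent is one dispatched event. Name is empty when the stream did
// not send an "event:" field for it.
type sseEvent struct {
	Name string
	Data string
}

// parseSSE reads a server-sent events stream and returns the events it
// contains. Multiple data lines are joined with "\n"; an event that is
// not terminated by a blank line is discarded.
func parseSSE(r io.Reader) ([]sseEvent, error) {
	var (
		events []sseEvent
		name   string
		data   []string
	)
	sc := bufio.NewScanner(r) // note: default limit of 64 KiB per line
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			// A blank line dispatches the pending event, if it has data.
			if len(data) > 0 {
				events = append(events, sseEvent{Name: name, Data: strings.Join(data, "\n")})
			}
			name, data = "", nil
		case strings.HasPrefix(line, ":"):
			// Comment line, often used as a keepalive; ignore it.
		default:
			field, value, _ := strings.Cut(line, ":")
			value = strings.TrimPrefix(value, " ")
			switch field {
			case "event":
				name = value
			case "data":
				data = append(data, value)
			}
		}
	}
	return events, sc.Err()
}

// parseSSEString is a convenience wrapper for in-memory input.
func parseSSEString(s string) []sseEvent {
	events, _ := parseSSE(strings.NewReader(s))
	return events
}

func main() {
	stream := "event: tool_start\ndata: {\"name\":\"example\"}\n\n: keepalive\ndata: a\ndata: b\n\n"
	for _, ev := range parseSSEString(stream) {
		fmt.Printf("%q %q\n", ev.Name, ev.Data)
	}
}
```

Because the parser takes an `io.Reader`, the same function can be fed an HTTP response body in a live run or a string in a unit test.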