Adds IncludeAllDeployments option to show all deployments, not just
problem ones (where replicas don't match desired). This provides parity
with the existing --kube-include-all-pods flag.
- Add IncludeAllDeployments to kubernetesagent.Config
- Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var
- Update collectDeployments to respect the new flag
- Add test for IncludeAllDeployments functionality
- Update UNIFIED_AGENT.md documentation
Addresses feedback from PR #855
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
Add detailed API reference and update rollout playbook:
**New: docs/api/SCHEDULER_HEALTH.md**
- Complete endpoint reference for /api/monitoring/scheduler/health
- Request/response structure with field descriptions
- Enhanced "instances" array documentation
- Example responses showing all states (healthy, transient, DLQ)
- Useful jq queries for troubleshooting:
- Find instances with errors
- List DLQ entries
- Show open circuit breakers
- Sort by failure streaks
- Migration guide (legacy → new fields)
- Troubleshooting examples with real scenarios
**Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md**
- Enhanced "Accessing Scheduler Health API" section (§6)
- Added examples using new instances[] array
- Updated queries to use pollStatus, breaker, deadLetter fields
- Practical jq commands for operators
**Key Documentation Features:**
- Complete JSON schema with examples
- All new fields documented with types and descriptions
- Real-world troubleshooting scenarios
- Copy-paste ready jq queries
- Migration path for existing integrations
- Backward compatibility notes
Operators can now:
- Find error messages without log digging
- Understand circuit breaker states
- Track DLQ entries with full context
- Diagnose issues using single API call
Part of Phase 2 follow-up - enhanced observability