Commit graph

9 commits

Author SHA1 Message Date
rcourtman
ee63d438cc docs: standardize markdown syntax and remove deprecated sensor-proxy docs 2026-01-20 09:43:49 +00:00
rcourtman
2b48b0a459 feat: add --kube-include-all-deployments flag for Kubernetes agent
Adds an IncludeAllDeployments option that shows all deployments, not just
problem ones (those whose replica count doesn't match the desired count).
This provides parity with the existing --kube-include-all-pods flag.

- Add IncludeAllDeployments to kubernetesagent.Config
- Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var
- Update collectDeployments to respect the new flag
- Add test for IncludeAllDeployments functionality
- Update UNIFIED_AGENT.md documentation
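
The filtering behavior described in this commit can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the agent's actual code: the `Deployment` type, its field names, and `select_deployments` are hypothetical, and "replicas don't match desired" is assumed here to mean ready replicas differ from desired replicas.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical shape; the commit only says "replicas don't match desired".
    name: str
    desired_replicas: int
    ready_replicas: int


def select_deployments(deployments, include_all=False):
    """Return every deployment when include_all is set; otherwise return
    only those whose ready replica count differs from the desired count."""
    if include_all:
        return list(deployments)
    return [d for d in deployments if d.ready_replicas != d.desired_replicas]


deployments = [
    Deployment("web", desired_replicas=3, ready_replicas=3),
    Deployment("worker", desired_replicas=2, ready_replicas=1),
]

problem_only = select_deployments(deployments)
everything = select_deployments(deployments, include_all=True)
```

With the default, only `worker` is reported; with `include_all=True`, both deployments are returned, which mirrors the intent of the new flag.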

Addresses feedback from PR #855
2025-12-18 20:58:30 +00:00
courtmanr@gmail.com
fd39196166 refactor: finalize documentation overhaul
- Refactor specialized docs for conciseness and clarity
- Rename files to UPPER_CASE.md convention
- Verify accuracy against codebase
- Fix broken links
2025-11-25 00:45:20 +00:00
rcourtman
3c41d3960c docs: add operations runbooks and audit fixes 2025-11-14 01:01:21 +00:00
rcourtman
68ce8e7520 feat: finalize swarm service monitoring (#598) 2025-10-26 09:35:49 +00:00
rcourtman
e0396c1362 docs: update documentation for diagnostics improvements
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.

**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
  Prometheus metrics with alert suggestions

**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
  explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
  using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
  with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
  defaults

These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards

Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
2025-10-21 12:45:19 +00:00
rcourtman
c91b7874ac docs: comprehensive v4.24.0 documentation audit and updates
Complete documentation overhaul for Pulse v4.24.0 release covering all new
features and operational procedures.

Documentation Updates (19 files):

P0 Release-Critical:
- Operations: Rewrote ADAPTIVE_POLLING_ROLLOUT.md as GA operations runbook
- Operations: Updated ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md with DEFERRED status
- Operations: Enhanced audit-log-rotation.md with scheduler health checks
- Security: Updated proxy hardening docs with rate limit defaults
- Docker: Added runtime logging and rollback procedures

P1 Deployment & Integration:
- KUBERNETES.md: Runtime logging config, adaptive polling, post-upgrade verification
- PORT_CONFIGURATION.md: Service naming, change tracking via update history
- REVERSE_PROXY.md: Rate limit headers, error pass-through, v4.24.0 verification
- PROXY_AUTH.md, OIDC.md, WEBHOOKS.md: Runtime logging integration
- TROUBLESHOOTING.md, VM_DISK_MONITORING.md, zfs-monitoring.md: Updated workflows

Features Documented:
- X-RateLimit-* headers for all API responses
- Updates rollback workflow (UI & CLI)
- Scheduler health API with rich metadata
- Runtime logging configuration (no restart required)
- Adaptive polling (GA, enabled by default)
- Enhanced audit logging
- Circuit breakers and dead-letter queue

Supporting Changes:
- Discovery service enhancements
- Config handler updates
- Sensor proxy installer improvements

Total Changes: 1,626 insertions(+), 622 deletions(-)
Files Modified: 24 (19 docs, 5 code)

All documentation is production-ready for v4.24.0 release.
2025-10-20 17:20:13 +00:00
rcourtman
469d11fc7e docs: add comprehensive scheduler health API documentation
Add detailed API reference and update rollout playbook:

**New: docs/api/SCHEDULER_HEALTH.md**
- Complete endpoint reference for /api/monitoring/scheduler/health
- Request/response structure with field descriptions
- Enhanced "instances" array documentation
- Example responses showing all states (healthy, transient, DLQ)
- Useful jq queries for troubleshooting:
  - Find instances with errors
  - List DLQ entries
  - Show open circuit breakers
  - Sort by failure streaks
- Migration guide (legacy → new fields)
- Troubleshooting examples with real scenarios
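
The four troubleshooting queries listed above could look roughly like the following sketch, shown in Python rather than jq. Only the `instances` array and the `pollStatus`, `breaker`, and `deadLetter` field names come from this commit message; every nested key (`key`, `lastError`, `consecutiveFailures`, `state`, `present`) and the sample payload are assumptions made for illustration, so check docs/api/SCHEDULER_HEALTH.md for the real schema.

```python
import json

# Hypothetical payload: all nested keys below are assumed, not documented here.
SAMPLE = json.loads("""
{
  "instances": [
    {"key": "node-a",
     "pollStatus": {"lastError": null, "consecutiveFailures": 0},
     "breaker": {"state": "closed"},
     "deadLetter": {"present": false}},
    {"key": "node-b",
     "pollStatus": {"lastError": "timeout", "consecutiveFailures": 4},
     "breaker": {"state": "open"},
     "deadLetter": {"present": true}}
  ]
}
""")


def instances_with_errors(health):
    """Keys of instances whose last poll recorded an error."""
    return [i["key"] for i in health["instances"]
            if i["pollStatus"].get("lastError")]


def dlq_entries(health):
    """Keys of instances currently in the dead-letter queue."""
    return [i["key"] for i in health["instances"]
            if i["deadLetter"].get("present")]


def open_breakers(health):
    """Keys of instances whose circuit breaker is open."""
    return [i["key"] for i in health["instances"]
            if i["breaker"].get("state") == "open"]


def by_failure_streak(health):
    """Instance keys sorted by failure streak, longest first."""
    ranked = sorted(health["instances"],
                    key=lambda i: i["pollStatus"].get("consecutiveFailures", 0),
                    reverse=True)
    return [i["key"] for i in ranked]
```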

**Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md**
- Enhanced "Accessing Scheduler Health API" section (§6)
- Added examples using new instances[] array
- Updated queries to use pollStatus, breaker, deadLetter fields
- Practical jq commands for operators

**Key Documentation Features:**
- Complete JSON schema with examples
- All new fields documented with types and descriptions
- Real-world troubleshooting scenarios
- Copy-paste ready jq queries
- Migration path for existing integrations
- Backward compatibility notes

Operators can now:
- Find error messages without log digging
- Understand circuit breaker states
- Track DLQ entries with full context
- Diagnose issues using a single API call

Part of Phase 2 follow-up - enhanced observability
2025-10-20 15:13:38 +00:00
rcourtman
cb8be81f1d docs: add adaptive polling production rollout playbook (Phase 2 Task 10)
Add comprehensive operator playbook for production enablement:

**Prerequisites:**
- Test suite validation (unit, integration, soak)
- Monitoring readiness (Grafana dashboards, alerts)
- Configuration management and rollback planning
- Stakeholder sign-off

**Staging Rollout:**
- Feature flag enablement steps
- Verification procedures (scheduler health API)
- 24-48h observation window with success criteria
- Metric checkpoints at 0h, 12h, 24h

**Production Rollout:**
- Gradual strategy (25% nodes every 2 hours)
- Low-traffic maintenance window
- Per-cluster monitoring during rollout
- Success criteria and completion validation
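
As a rough illustration of the gradual strategy above (25% of nodes every 2 hours), the batch schedule could be computed as in the sketch below. The `plan_batches` helper and the node names are hypothetical and are not part of the playbook or of Pulse.

```python
import math
from datetime import datetime, timedelta


def plan_batches(nodes, start, fraction=0.25, interval=timedelta(hours=2)):
    """Split nodes into batches of roughly `fraction` of the fleet and
    assign each batch a start time spaced `interval` apart."""
    size = max(1, math.ceil(len(nodes) * fraction))
    return [
        (start + n * interval, nodes[i:i + size])
        for n, i in enumerate(range(0, len(nodes), size))
    ]


# Hypothetical fleet of eight nodes, rollout starting at 01:00.
nodes = [f"node-{n}" for n in range(1, 9)]
start = datetime(2025, 1, 1, 1, 0)
plan = plan_batches(nodes, start)
```

For eight nodes this yields four batches of two nodes each, with the last batch starting six hours after the first.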

**Grafana/Alert Configuration:**
- Dashboard panels: queue depth, staleness, throughput, breakers/DLQ
- Alert thresholds:
  - Queue depth > 1.5× instances for >10min (Warning)
  - Staleness > 60s for >5min (Critical)
  - DLQ growth (Warning)
  - Stuck breakers >10min (Critical)
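
The alert thresholds above can be expressed as a small evaluation function. This is a sketch under stated assumptions: the `Snapshot` type and its field names are hypothetical, and it assumes the caller already tracks how long each condition has held. In practice these rules would more likely be written as Grafana or Prometheus alert rules.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # All field names are hypothetical; durations are "how long the
    # condition has held so far", tracked by the caller.
    instances: int
    queue_depth: int
    queue_high_minutes: float
    staleness_seconds: float
    stale_minutes: float
    dlq_size: int
    previous_dlq_size: int
    breaker_open_minutes: float


def evaluate(s):
    """Return (severity, reason) pairs for every threshold that is breached."""
    alerts = []
    if s.queue_depth > 1.5 * s.instances and s.queue_high_minutes > 10:
        alerts.append(("warning", "queue depth > 1.5x instances for >10min"))
    if s.staleness_seconds > 60 and s.stale_minutes > 5:
        alerts.append(("critical", "staleness > 60s for >5min"))
    if s.dlq_size > s.previous_dlq_size:
        alerts.append(("warning", "DLQ growth"))
    if s.breaker_open_minutes > 10:
        alerts.append(("critical", "breaker stuck open >10min"))
    return alerts


healthy = Snapshot(instances=10, queue_depth=8, queue_high_minutes=0,
                   staleness_seconds=12, stale_minutes=0,
                   dlq_size=0, previous_dlq_size=0, breaker_open_minutes=0)
degraded = Snapshot(instances=10, queue_depth=20, queue_high_minutes=15,
                    staleness_seconds=90, stale_minutes=6,
                    dlq_size=3, previous_dlq_size=1, breaker_open_minutes=0)
```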

**Rollback Procedure:**
- Clear disable/restart steps
- Verification of rollback success
- Post-rollback actions and incident reporting

**Troubleshooting:**
- Symptom/cause/action table
- Scheduler health API access guide
- Immediate rollback triggers

Operators can now safely enable adaptive polling following this step-by-step playbook.

Part of Phase 2 Task 10 (Documentation)
2025-10-20 15:13:38 +00:00