Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-29 03:50:18 +00:00

Author	SHA1	Message	Date
rcourtman	ee63d438cc	docs: standardize markdown syntax and remove deprecated sensor-proxy docs	2026-01-20 09:43:49 +00:00
rcourtman	2b48b0a459	feat: add --kube-include-all-deployments flag for Kubernetes agent Adds IncludeAllDeployments option to show all deployments, not just problem ones (where replicas don't match desired). This provides parity with the existing --kube-include-all-pods flag. - Add IncludeAllDeployments to kubernetesagent.Config - Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var - Update collectDeployments to respect the new flag - Add test for IncludeAllDeployments functionality - Update UNIFIED_AGENT.md documentation Addresses feedback from PR #855	2025-12-18 20:58:30 +00:00
courtmanr@gmail.com	fd39196166	refactor: finalize documentation overhaul - Refactor specialized docs for conciseness and clarity - Rename files to UPPER_CASE.md convention - Verify accuracy against codebase - Fix broken links	2025-11-25 00:45:20 +00:00
rcourtman	3c41d3960c	docs: add operations runbooks and audit fixes	2025-11-14 01:01:21 +00:00
rcourtman	68ce8e7520	feat: finalize swarm service monitoring (#598 )	2025-10-26 09:35:49 +00:00
rcourtman	e0396c1362	docs: update documentation for diagnostics improvements Add comprehensive operator documentation for the new observability features introduced in the previous commit. New Documentation: - docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new Prometheus metrics with alert suggestions Updated Documentation: - docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers, explain diagnostics endpoint caching behavior - docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs using request IDs - docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists with new per-node and scheduler metrics - docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation defaults These updates ensure operators understand: - How to set up monitoring/alerting for new metrics - How to configure file logging with rotation - How to troubleshoot using request correlation - What metrics are available for dashboards Related to: `495e6c794` (feat: comprehensive diagnostics improvements)	2025-10-21 12:45:19 +00:00
rcourtman	c91b7874ac	docs: comprehensive v4.24.0 documentation audit and updates Complete documentation overhaul for Pulse v4.24.0 release covering all new features and operational procedures. Documentation Updates (19 files): P0 Release-Critical: - Operations: Rewrote ADAPTIVE_POLLING_ROLLOUT.md as GA operations runbook - Operations: Updated ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md with DEFERRED status - Operations: Enhanced audit-log-rotation.md with scheduler health checks - Security: Updated proxy hardening docs with rate limit defaults - Docker: Added runtime logging and rollback procedures P1 Deployment & Integration: - KUBERNETES.md: Runtime logging config, adaptive polling, post-upgrade verification - PORT_CONFIGURATION.md: Service naming, change tracking via update history - REVERSE_PROXY.md: Rate limit headers, error pass-through, v4.24.0 verification - PROXY_AUTH.md, OIDC.md, WEBHOOKS.md: Runtime logging integration - TROUBLESHOOTING.md, VM_DISK_MONITORING.md, zfs-monitoring.md: Updated workflows Features Documented: - X-RateLimit-* headers for all API responses - Updates rollback workflow (UI & CLI) - Scheduler health API with rich metadata - Runtime logging configuration (no restart required) - Adaptive polling (GA, enabled by default) - Enhanced audit logging - Circuit breakers and dead-letter queue Supporting Changes: - Discovery service enhancements - Config handlers updates - Sensor proxy installer improvements Total Changes: 1,626 insertions(+), 622 deletions(-) Files Modified: 24 (19 docs, 5 code) All documentation is production-ready for v4.24.0 release.	2025-10-20 17:20:13 +00:00
rcourtman	469d11fc7e	docs: add comprehensive scheduler health API documentation Add detailed API reference and update rollout playbook: New: docs/api/SCHEDULER_HEALTH.md - Complete endpoint reference for /api/monitoring/scheduler/health - Request/response structure with field descriptions - Enhanced "instances" array documentation - Example responses showing all states (healthy, transient, DLQ) - Useful jq queries for troubleshooting: - Find instances with errors - List DLQ entries - Show open circuit breakers - Sort by failure streaks - Migration guide (legacy → new fields) - Troubleshooting examples with real scenarios Updated: docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Enhanced "Accessing Scheduler Health API" section (§6) - Added examples using new instances[] array - Updated queries to use pollStatus, breaker, deadLetter fields - Practical jq commands for operators Key Documentation Features: - Complete JSON schema with examples - All new fields documented with types and descriptions - Real-world troubleshooting scenarios - Copy-paste ready jq queries - Migration path for existing integrations - Backward compatibility notes Operators can now: - Find error messages without log digging - Understand circuit breaker states - Track DLQ entries with full context - Diagnose issues using single API call Part of Phase 2 follow-up - enhanced observability	2025-10-20 15:13:38 +00:00
rcourtman	cb8be81f1d	docs: add adaptive polling production rollout playbook (Phase 2 Task 10) Add comprehensive operator playbook for production enablement: Prerequisites: - Test suite validation (unit, integration, soak) - Monitoring readiness (Grafana dashboards, alerts) - Configuration management and rollback planning - Stakeholder sign-off Staging Rollout: - Feature flag enablement steps - Verification procedures (scheduler health API) - 24-48h observation window with success criteria - Metric checkpoints at 0h, 12h, 24h Production Rollout: - Gradual strategy (25% nodes every 2 hours) - Low-traffic maintenance window - Per-cluster monitoring during rollout - Success criteria and completion validation Grafana/Alert Configuration: - Dashboard panels: queue depth, staleness, throughput, breakers/DLQ - Alert thresholds: - Queue depth > 1.5× instances for >10min (Warning) - Staleness > 60s for >5min (Critical) - DLQ growth (Warning) - Stuck breakers >10min (Critical) Rollback Procedure: - Clear disable/restart steps - Verification of rollback success - Post-rollback actions and incident reporting Troubleshooting: - Symptom/cause/action table - Scheduler health API access guide - Immediate rollback triggers Operators can now safely enable adaptive polling following this step-by-step playbook. Part of Phase 2 Task 10 (Documentation)	2025-10-20 15:13:38 +00:00

9 commits