- Move all SQLite pragmas from db.Exec() to DSN parameters so every
connection the pool creates gets busy_timeout and other settings.
Previously only the first connection had these applied.
- Set MaxOpenConns(1) on the audit, RBAC, and notification databases
(metrics already had this). Prevents the pool from opening additional
connections that, before the DSN change, lacked busy_timeout.
- Increase busy_timeout from 5s to 30s across all databases to
tolerate disk I/O pressure during backup windows.
- Fix nested queries in GetRoles(), GetUserAssignments(), and
CancelByAlertIDs() that deadlock under MaxOpenConns(1): the outer
rows iteration holds the pool's only connection, so the inner query
blocks forever waiting for a connection that is never released.
- Fix circuit breaker retryInterval not resetting on recovery, which
caused the next trip to start at 5-minute backoff instead of 5s.
Related to #1156
Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health
**New Response Structure:**
Enhanced "instances" array with per-instance details:
- Instance metadata: displayName, type, connection URL
- Poll status: last success/error timestamps, error messages, error category
- Circuit breaker: state, timestamps, failure counts, retry windows
- Dead letter: present flag, reason, attempt history, retry schedule
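A sketch of what one entry in the enhanced "instances" array might look like, assembled from the bullets above. The exact field names and nesting are illustrative assumptions, not the endpoint's confirmed schema:

```json
{
  "instances": [
    {
      "displayName": "pbs-backup-01",
      "type": "pbs",
      "connectionURL": "https://pbs.example.local:8007",
      "pollStatus": {
        "lastSuccess": "2025-01-10T12:00:05Z",
        "lastError": "2025-01-10T12:05:05Z",
        "lastErrorMessage": "401 unauthorized",
        "errorCategory": "auth"
      },
      "circuitBreaker": {
        "state": "open",
        "stateSince": "2025-01-10T12:05:05Z",
        "failureCount": 5,
        "retryAt": "2025-01-10T12:10:05Z"
      },
      "deadLetter": {
        "present": true,
        "reason": "auth failure after max retries",
        "attempts": 5,
        "nextRetry": "2025-01-10T12:10:05Z"
      }
    }
  ]
}
```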
**Implementation:**
Data structures:
- instanceInfo: cache of display names, URLs, types
- pollStatus: tracks successes/errors with timestamps and categories
- dlqInsight: DLQ entry metadata (reason, attempts, schedule)
- circuitBreaker: enhanced with stateSince, lastTransition
Tracking logic:
- buildInstanceInfoCache: populate metadata from config on startup
- recordTaskResult: track poll outcomes, error details, categories
- sendToDeadLetter: capture DLQ insights (reason, timestamps)
- circuitBreaker: record state transitions with timestamps
**Backward Compatible:**
- Existing fields (deadLetter, breakers, staleness) unchanged
- New "instances" array is additive
- Old clients can ignore new fields
**Testing:**
- Unit test: TestSchedulerHealth_EnhancedResponse validates all fields
- Integration tests: still passing (55s)
- All error tracking and breaker history verified
**Operator Benefits:**
- Diagnose issues without log digging
- See error messages directly in API
- Understand breaker states and retry schedules
- Track DLQ entries with full context
- Single API call for complete instance health view
Example: quickly identify a "401 unauthorized" on a specific PBS instance,
see that it landed in the DLQ after 5 retries, and know when the next
retry is scheduled.
Part of Phase 2 follow-up work to improve observability.
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance
New API endpoint:
GET /api/monitoring/scheduler/health (requires authentication)
New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data
Documentation updated with API spec, field descriptions, and usage examples.