Pulse/docs/architecture/pulse-patrol-deep-dive.md
rcourtman fa1b74792e docs: add comprehensive deep-dive documentation for AI subsystems
Adds detailed architecture documentation for Pulse Patrol and Pulse Assistant. Updates AI.md and PULSE_PRO.md. Also includes additional tests.
2026-02-02 10:29:07 +00:00


Pulse Patrol: Technical Deep Dive

This document provides an in-depth look at the engineering behind Pulse Patrol — a context-aware, learning AI analysis system that goes far beyond traditional threshold-based monitoring.


Executive Summary

Pulse Patrol is not a simple alerting system. It's a multi-layered intelligence platform that:

  1. Learns what's normal for your environment
  2. Predicts issues before they become critical
  3. Correlates events across your entire infrastructure
  4. Remembers past incidents and successful remediations
  5. Investigates issues autonomously when configured
  6. Verifies fixes and tracks remediation effectiveness

All of this runs on your own Pulse server and uses BYOK (Bring Your Own Key), so AI requests go only to the provider you choose.


Architecture Overview

   ┌─────────────────────────────────────────────────────────────────────┐
   │                      PULSE PATROL SERVICE                          │
   │                                                                     │
   │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
   │  │   Baseline   │  │   Pattern    │  │ Correlation  │              │
   │  │    Store     │  │  Detector    │  │   Detector   │              │
   │  │  (Learning)  │  │ (Prediction) │  │ (Root Cause) │              │
   │  └──────────────┘  └──────────────┘  └──────────────┘              │
   │          │                 │                 │                      │
   │          └─────────────────┼─────────────────┘                      │
   │                            │                                        │
   │                    ┌───────▼───────┐                               │
   │                    │  Intelligence │                               │
   │                    │  Orchestrator │                               │
   │                    └───────┬───────┘                               │
   │                            │                                        │
   │  ┌──────────────┐  ┌──────▼───────┐  ┌──────────────┐              │
   │  │   Incident   │  │   Forecast   │  │  Knowledge   │              │
   │  │    Store     │  │   Service    │  │    Store     │              │
   │  │  (History)   │  │ (Prediction) │  │  (Learning)  │              │
   │  └──────────────┘  └──────────────┘  └──────────────┘              │
   │                            │                                        │
   │                    ┌───────▼───────┐                               │
   │                    │    Patrol     │                               │
   │                    │   Service     │                               │
   │                    └───────┬───────┘                               │
   │                            │                                        │
   │         ┌──────────────────┼──────────────────┐                    │
   │         │                  │                  │                    │
   │   ┌─────▼─────┐     ┌─────▼─────┐     ┌─────▼─────┐               │
   │   │  Signal   │     │ Agentic   │     │ Evaluation│               │
   │   │ Detection │     │   Loop    │     │   Pass    │               │
   │   │(Determin.)│     │  (LLM)    │     │  (LLM)    │               │
   │   └───────────┘     └───────────┘     └───────────┘               │
   │                            │                                        │
   │                    ┌───────▼───────┐                               │
   │                    │ Investigation │                               │
   │                    │  Orchestrator │                               │
   │                    └───────────────┘                               │
   └─────────────────────────────────────────────────────────────────────┘

1. Baseline Learning Engine

📁 Location: internal/ai/baseline/store.go

What It Does

The baseline engine learns what "normal" looks like for each resource in your environment. Rather than using static thresholds, it builds statistical models from your actual metrics history.

How It Works

type MetricBaseline struct {
    Mean          float64           // Average value
    StdDev        float64           // Standard deviation
    Min           float64           // Minimum observed
    Max           float64           // Maximum observed
    SampleCount   int               // Number of data points
    HourlySamples map[int][]float64 // Samples bucketed by hour (0-23)
}
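As an illustration only, the sketch below shows one way such a baseline could be computed from raw samples. The `sample` type and `buildBaseline` function are hypothetical and are not taken from store.go; the real store may well update its statistics incrementally rather than in a batch.

```go
package main

import (
	"fmt"
	"math"
)

// sample is a hypothetical input type: one metric reading and the hour (0-23) it was taken.
type sample struct {
	Hour  int
	Value float64
}

// MetricBaseline mirrors the struct shown above.
type MetricBaseline struct {
	Mean          float64
	StdDev        float64
	Min           float64
	Max           float64
	SampleCount   int
	HourlySamples map[int][]float64
}

// buildBaseline computes summary statistics and hourly buckets in a first
// pass, then the population standard deviation in a second pass.
func buildBaseline(samples []sample) MetricBaseline {
	b := MetricBaseline{HourlySamples: make(map[int][]float64)}
	if len(samples) == 0 {
		return b
	}
	b.Min, b.Max = samples[0].Value, samples[0].Value
	sum := 0.0
	for _, s := range samples {
		sum += s.Value
		b.Min = math.Min(b.Min, s.Value)
		b.Max = math.Max(b.Max, s.Value)
		b.HourlySamples[s.Hour] = append(b.HourlySamples[s.Hour], s.Value)
	}
	b.SampleCount = len(samples)
	b.Mean = sum / float64(b.SampleCount)

	sqDiff := 0.0
	for _, s := range samples {
		d := s.Value - b.Mean
		sqDiff += d * d
	}
	b.StdDev = math.Sqrt(sqDiff / float64(b.SampleCount))
	return b
}

func main() {
	b := buildBaseline([]sample{
		{Hour: 3, Value: 10}, {Hour: 3, Value: 20},
		{Hour: 15, Value: 60}, {Hour: 15, Value: 70},
	})
	fmt.Printf("mean=%.1f stddev=%.2f hourly buckets=%d\n", b.Mean, b.StdDev, len(b.HourlySamples))
}
```

Keeping the raw per-hour samples, as the struct does, makes it possible to compare a 3pm reading against other 3pm readings instead of against the all-day mean.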

Key Features:

  • Time-of-day awareness: Tracks hourly buckets to understand that 3pm CPU usage differs from 3am
  • Z-score anomaly detection: Flags values more than 2.0 standard deviations away from the baseline, in either direction
  • Anomaly severity classification (by absolute z-score):
    • normal: |z| < 2.0
    • mild: 2.0 ≤ |z| < 2.5
    • moderate: 2.5 ≤ |z| < 3.0
    • severe: 3.0 ≤ |z| < 4.0
    • extreme: |z| ≥ 4.0

// Anomaly detection with hourly context
func (s *Store) IsAnomaly(resourceID, metric string, value float64) (bool, float64) {
    baseline, ok := s.baselines[resourceID]
    if !ok {
        return false, 0 // No baseline learned for this resource yet
    }
    m, ok := baseline.Metrics[metric]
    if !ok || m.StdDev == 0 {
        return false, 0 // Unknown metric, or no variance: z-score is undefined
    }

    // Use hourly mean if we have enough samples for current hour
    mean := m.Mean
    if hourlyMean, usedHourly := m.GetHourlyMean(time.Now().Hour()); usedHourly {
        mean = hourlyMean
    }

    zScore := (value - mean) / m.StdDev
    return math.Abs(zScore) > anomalyThreshold, zScore
}
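The severity bands listed above map directly onto a small classifier. The following is a minimal sketch; the function name `classifySeverity` is hypothetical, and only the band edges come from this document.

```go
package main

import (
	"fmt"
	"math"
)

// classifySeverity maps a z-score to a severity band using its absolute
// value, so large drops are treated the same as large spikes.
func classifySeverity(zScore float64) string {
	z := math.Abs(zScore)
	switch {
	case z >= 4.0:
		return "extreme"
	case z >= 3.0:
		return "severe"
	case z >= 2.5:
		return "moderate"
	case z >= 2.0:
		return "mild"
	default:
		return "normal"
	}
}

func main() {
	for _, z := range []float64{0.4, -2.2, 2.7, 3.5, -4.1} {
		fmt.Printf("z=%+.1f -> %s\n", z, classifySeverity(z))
	}
}
```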

Callback System:

// Event-driven anomaly response
func (s *Store) SetAnomalyCallback(callback AnomalyCallback) {
    // Called when anomaly detected — can trigger targeted patrol
}

2. Pattern Detection & Prediction

📁 Location: internal/ai/patterns/detector.go

What It Does

Tracks historical events and identifies recurring patterns to predict future failures.

Event Types Tracked

const (
    EventHighMemory   = "high_memory"   // Memory exceeded threshold
    EventHighCPU      = "high_cpu"      // CPU exceeded threshold
    EventDiskFull     = "disk_full"     // Disk space critical
    EventOOMKill      = "oom_kill"      // Out-of-memory kill
    EventServiceCrash = "service_crash" // Service crashed
    EventUnresponsive = "unresponsive"  // Resource became unresponsive
    EventBackupFailed = "backup_failed" // Backup job failed
)

Pattern Structure

type Pattern struct {
    ResourceID      string
    EventType       EventType
    Occurrences     int           // How many times this has happened
    AverageInterval time.Duration // Average time between occurrences
    LastOccurrence  time.Time
    NextPredicted   time.Time     // When we expect this to happen again
    Confidence      float64       // 0.0 to 1.0
    AverageDuration time.Duration // How long events typically last
}

Prediction Algorithm

  1. Collect events by resource and type
  2. Calculate inter-arrival times between occurrences
  3. Apply exponential smoothing for prediction
  4. Assign confidence based on consistency and sample size

type FailurePrediction struct {
    ResourceID  string
    EventType   EventType
    PredictedAt time.Time  // When we expect the failure
    DaysUntil   float64    // Days until predicted failure
    Confidence  float64    // How confident we are
    Basis       string     // "Based on 7 occurrences over 30 days"
    Pattern     *Pattern   // The underlying pattern
}
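Steps 2 and 3 of the prediction algorithm can be sketched as follows. This assumes simple exponential smoothing over the inter-arrival gaps; the names `smoothGaps` and `predictNext` and the smoothing factor `alpha` are hypothetical, and detector.go may use a different variant.

```go
package main

import (
	"fmt"
	"time"
)

// smoothGaps applies simple exponential smoothing to inter-arrival gaps
// (in hours, oldest first). alpha in (0, 1] weights recent gaps more heavily.
// ok is false when there are no gaps to smooth.
func smoothGaps(gapsHours []float64, alpha float64) (smoothed float64, ok bool) {
	if len(gapsHours) == 0 {
		return 0, false
	}
	smoothed = gapsHours[0]
	for _, gap := range gapsHours[1:] {
		smoothed = alpha*gap + (1-alpha)*smoothed
	}
	return smoothed, true
}

// predictNext adds the smoothed gap to the time of the last occurrence.
func predictNext(last time.Time, gapsHours []float64, alpha float64) (time.Time, bool) {
	gap, ok := smoothGaps(gapsHours, alpha)
	if !ok {
		return time.Time{}, false
	}
	return last.Add(time.Duration(gap * float64(time.Hour))), true
}

func main() {
	last := time.Date(2026, 1, 6, 0, 0, 0, 0, time.UTC)
	// Two observed gaps between three events: 48h, then 72h.
	next, ok := predictNext(last, []float64{48, 72}, 0.5)
	fmt.Println(next.Format(time.RFC3339), ok)
}
```

With `alpha = 0.5` the two gaps smooth to 60 hours, so the next occurrence is expected 60 hours after the last one. A higher `alpha` reacts faster to a changing cadence at the cost of noisier predictions.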

Example Output:

"VM prod-web-1 has experienced high_memory events 5 times in the past 14 days with an average interval of 2.8 days. Next occurrence predicted in ~1.5 days (confidence: 0.72)."


3. Correlation & Root Cause Analysis

📁 Location: internal/ai/correlation/

Correlation Detector (detector.go)

Tracks when events on one resource are followed by events on another, enabling cascade failure prediction.

type Correlation struct {
    SourceID     string        // Resource that has events first
    SourceName   string
    TargetID     string        // Resource that has events after
    TargetName   string
    EventType    EventType
    Occurrences  int           // How many times this sequence occurred
    AvgDelay     time.Duration // Average time from source to target event
    Confidence   float64
    Description  string        // "When storage-1 has disk_full, web-1 crashes within 5 minutes"
}
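To make the "followed by" idea concrete, here is a minimal sketch that counts how often a target event follows a source event within a time window. The function `followCount`, the use of Unix-second timestamps, and the confidence ratio in `main` are all assumptions for illustration, not the detector's actual logic.

```go
package main

import "fmt"

// followCount reports how many source events were followed by a target event
// within windowSec, and the mean delay in seconds. Both slices hold Unix
// timestamps sorted ascending; each target event is matched at most once.
func followCount(source, target []int64, windowSec int64) (count int, avgDelaySec float64) {
	var total int64
	j := 0
	for _, s := range source {
		// Skip target events that happened before this source event.
		for j < len(target) && target[j] < s {
			j++
		}
		if j < len(target) && target[j]-s <= windowSec {
			total += target[j] - s
			count++
			j++
		}
	}
	if count == 0 {
		return 0, 0
	}
	return count, float64(total) / float64(count)
}

func main() {
	// Hypothetical data: storage-1 disk_full events and web-1 unresponsive events.
	source := []int64{0, 1000, 5000}
	target := []int64{180, 1300, 9000}
	count, avg := followCount(source, target, 600)
	confidence := float64(count) / float64(len(source)) // assumed measure: share of source events that were followed
	fmt.Printf("occurrences=%d avgDelay=%.0fs confidence=%.2f\n", count, avg, confidence)
}
```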

Key Methods:

// Get what depends on this resource
func (d *Detector) GetDependencies(resourceID string) []string

// Get what this resource depends on
func (d *Detector) GetDependsOn(resourceID string) []string

// Predict cascade effects
func (d *Detector) PredictCascade(resourceID string, eventType EventType) []CascadePrediction

Cascade Prediction Example:

"If storage-1 experiences disk_full, expect web-1 to become unresponsive within 3-7 minutes (confidence: 0.85), and database-1 to experience service_crash within 10-15 minutes (confidence: 0.67)."

Root Cause Engine (rootcause.go)

Goes beyond correlation to identify the underlying cause of related issues.

type ResourceRelationship struct {
    SourceID     string
    TargetID     string
    Relationship RelationshipType // runs_on, uses_storage, uses_network, depends_on, etc.
}

type RootCauseAnalysis struct {
    ID            string
    TriggerEvent  RelatedEvent     // What started the incident
    RootCause     *RelatedEvent    // The actual root cause
    RelatedEvents []RelatedEvent   // All related events
    CausalChain   []string         // "storage-1 (disk_full) → db-1 (slow) → web-1 (timeout)"
    Confidence    float64
    Explanation   string           // Human-readable explanation
}

Relationship Types:

const (
    RelationshipRunsOn      = "runs_on"      // VM runs on Node
    RelationshipUsesStorage = "uses_storage" // VM uses Storage pool
    RelationshipUsesNetwork = "uses_network" // Guest uses Network
    RelationshipDependsOn   = "depends_on"   // Generic dependency
    RelationshipHosted      = "hosted"       // Container hosted on Docker
)
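One way to picture how relationships feed a causal chain is a walk upstream from the triggering resource, following only dependencies that also have an active event. The sketch below is deliberately simple (it takes the first matching dependency at each step) and uses hypothetical names; it assumes SourceID is the dependent resource and TargetID is what it depends on, which this document implies but does not state.

```go
package main

import "fmt"

// relationship is a simplified form of ResourceRelationship.
// Assumption: SourceID depends on TargetID (e.g. a VM runs_on a node).
type relationship struct {
	SourceID string
	TargetID string
}

// findCausalChain walks upstream from trigger through dependencies that have
// an active event and returns the chain ordered from root cause to trigger.
func findCausalChain(trigger string, rels []relationship, hasEvent map[string]bool) []string {
	upstream := make(map[string][]string)
	for _, r := range rels {
		upstream[r.SourceID] = append(upstream[r.SourceID], r.TargetID)
	}
	chain := []string{trigger}
	visited := map[string]bool{trigger: true} // guards against dependency cycles
	current := trigger
	for {
		next := ""
		for _, dep := range upstream[current] {
			if hasEvent[dep] && !visited[dep] {
				next = dep
				break
			}
		}
		if next == "" {
			return chain
		}
		visited[next] = true
		chain = append([]string{next}, chain...)
		current = next
	}
}

func main() {
	rels := []relationship{
		{SourceID: "web-1", TargetID: "db-1"},
		{SourceID: "db-1", TargetID: "storage-1"},
	}
	events := map[string]bool{"web-1": true, "db-1": true, "storage-1": true}
	fmt.Println(findCausalChain("web-1", rels, events)) // [storage-1 db-1 web-1]
}
```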

4. Forecast Service

📁 Location: internal/ai/forecast/service.go

What It Does

Extrapolates trends to predict when resources will exhaust capacity.

Trend Analysis

type Trend struct {
    Direction    TrendDirection // stable, increasing, decreasing, volatile
    RatePerHour  float64        // Change per hour
    RatePerDay   float64        // Change per day
    Acceleration float64        // Is the rate changing?
    Seasonality  *Seasonality   // Daily/weekly patterns
}

type Seasonality struct {
    HasDaily  bool
    HasWeekly bool
    PeakHours []int // e.g., [9, 10, 11, 14, 15, 16]
    PeakDays  []int // e.g., [1, 2, 3, 4, 5] (Mon-Fri)
}

Forecasting

type Forecast struct {
    ResourceID      string
    Metric          string         // cpu, memory, disk
    CurrentValue    float64
    PredictedValue  float64        // Value at horizon
    Trend           Trend
    TimeToThreshold *time.Duration // Time until critical threshold
    ThresholdValue  float64
    Description     string         // "Disk will be full in 12 days at current rate"
}
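The TimeToThreshold field can be understood as a linear extrapolation of the current value at the observed rate. The sketch below shows that arithmetic only; `daysToThreshold` is a hypothetical helper, and the real service may also account for acceleration and seasonality.

```go
package main

import "fmt"

// daysToThreshold linearly extrapolates current at ratePerDay and returns the
// number of days until threshold is reached. ok is false when the value is
// below the threshold and not growing, so the threshold is never reached.
func daysToThreshold(current, ratePerDay, threshold float64) (days float64, ok bool) {
	if current >= threshold {
		return 0, true // already at or past the threshold
	}
	if ratePerDay <= 0 {
		return 0, false
	}
	return (threshold - current) / ratePerDay, true
}

func main() {
	// 78% used, growing 2.3 percentage points per day, critical at 95%.
	days, ok := daysToThreshold(78, 2.3, 95)
	fmt.Printf("days=%.1f ok=%v\n", days, ok) // days=7.4 ok=true
}
```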

Example Output:

"Storage pool local-zfs is at 78% and growing +2.3%/day. At this rate, it will reach 95% (critical) in 7.4 days. Weekly pattern detected: higher growth on weekdays. Recommend action by end of week."


5. Incident Memory System

📁 Location: internal/ai/memory/

Incident Store (incidents.go)

Maintains full incident timelines with auditability.

type Incident struct {
    ID           string
    AlertID      string
    AlertType    string
    ResourceID   string
    ResourceName string
    Severity     string
    StartedAt    time.Time
    ResolvedAt   *time.Time
    Duration     time.Duration
    Acknowledged bool
    AckUser      string
    AckTime      *time.Time
    Events       []IncidentEvent  // Full timeline
}

type IncidentEvent struct {
    ID        string
    Type      IncidentEventType  // alert_fired, alert_acknowledged, analysis, command, runbook, note, resolved
    Timestamp time.Time
    Summary   string
    Details   map[string]interface{}
}

Event Types:

  • alert_fired — Initial alert trigger
  • alert_acknowledged — User acknowledged the alert
  • analysis — AI analyzed the issue
  • command — Command was executed
  • runbook — Runbook was triggered
  • note — User added a note
  • resolved — Alert was resolved

Remediation Log (remediation.go)

Tracks every remediation action for learning and rollback.

type RemediationRecord struct {
    ID           string
    Timestamp    time.Time
    ResourceID   string
    ResourceType string
    ResourceName string
    FindingID    string        // Linked to the finding
    Problem      string        // What was wrong
    Action       string        // What was done
    Command      string        // Actual command executed
    Output       string        // Command output
    Outcome      Outcome       // resolved, partial, failed
    Duration     time.Duration
    Automatic    bool          // Was this auto-fix or manual?
    Rollback     *RollbackInfo // Rollback capability
}

type RollbackInfo struct {
    Reversible   bool
    RollbackCmd  string      // Command to undo
    PreState     string      // State before action
    RolledBack   bool
    RolledBackAt *time.Time
    RolledBackBy string
    RollbackID   string
}

Key Capabilities:

// Find remediations that worked for similar problems
func (r *RemediationLog) GetSuccessfulRemediations(problem string, limit int) []RemediationRecord

// Get remediations that can be undone
func (r *RemediationLog) GetRollbackable(limit int) []RemediationRecord

// Mark a remediation as rolled back
func (r *RemediationLog) MarkRolledBack(id, rollbackID, username string) error

6. Knowledge Store

📁 Location: internal/ai/knowledge/store.go

What It Does

Stores persistent, per-resource knowledge that the AI learns over time, encrypted at rest.

Note Categories

const (
    CategoryService    = "service"        // What services run here
    CategoryConfig     = "config"         // Configuration notes
    CategoryLearning   = "learning"       // AI-learned facts
    CategoryHistory    = "history"        // Historical context
    CategoryInfra      = "infrastructure" // Auto-discovered facts
)

Structure

type GuestKnowledge struct {
    GuestID   string
    GuestName string
    GuestType string
    Notes     []Note
    UpdatedAt time.Time
}

type Note struct {
    ID        string
    Category  string
    Title     string
    Content   string
    CreatedAt time.Time
    UpdatedAt time.Time
}

Discovery Context Integration

// Inject discovery context (versions, ports, config paths) into investigations
func (s *Store) SetDiscoveryContextProvider(provider func() string)

// Scoped context for specific resources
func (s *Store) GetInfrastructureContextForResources(resourceIDs []string) string

Example Knowledge:

Service: "Runs Jellyfin media server on port 8096, Caddy reverse proxy on 443"
Config: "Config at /opt/jellyfin/config, database at /var/lib/jellyfin"
Learning: "High memory usage expected during transcoding — threshold 90% is normal"


7. Deterministic Signal Detection

📁 Location: internal/ai/patrol_signals.go

Philosophy

Patrol combines LLM judgment with deterministic detection, so that a problem visible in tool output is still reported even if the LLM overlooks it.

Signal Types

const (
    SignalSMARTFailure = "smart_failure" // SMART health check failed
    SignalHighCPU      = "high_cpu"      // CPU exceeded threshold
    SignalHighMemory   = "high_memory"   // Memory exceeded threshold
    SignalHighDisk     = "high_disk"     // Storage pool filling up
    SignalBackupFailed = "backup_failed" // Backup task failed
    SignalBackupStale  = "backup_stale"  // No backup in 48+ hours
    SignalActiveAlert  = "active_alert"  // Critical/warning alert present
)

Configurable Thresholds

type SignalThresholds struct {
    StorageWarningPercent  float64       // Default: 75%
    StorageCriticalPercent float64       // Default: 95%
    HighCPUPercent         float64       // Default: 70%
    HighMemoryPercent      float64       // Default: 80%
    BackupStaleThreshold   time.Duration // Default: 48 hours
}

// Thresholds can sync with user-configured alert settings
func SignalThresholdsFromPatrol(pt PatrolThresholds) SignalThresholds

Detection Flow

  1. Tool calls complete during patrol
  2. DetectSignals() parses tool outputs for known patterns
  3. UnmatchedSignals() compares against findings the LLM reported
  4. Evaluation pass — if signals were missed, a focused LLM call reviews them
  5. Fallback creation — if still unmatched, deterministic findings are created
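Step 3 of the flow above amounts to a set difference between detected signals and reported findings. The sketch below assumes a signal is covered when a finding exists for the same resource and the same type; the `signal` and `finding` types and that matching rule are hypothetical simplifications.

```go
package main

import "fmt"

// signal is a simplified stand-in for a deterministic detection result.
type signal struct {
	Type       string // e.g. "high_disk"
	ResourceID string
}

// finding is a simplified stand-in for an LLM-reported finding.
type finding struct {
	Category   string
	ResourceID string
}

// unmatchedSignals returns the signals that no finding covers, where
// "covers" is assumed to mean same resource and same type.
func unmatchedSignals(signals []signal, findings []finding) []signal {
	covered := make(map[signal]bool, len(findings))
	for _, f := range findings {
		covered[signal{Type: f.Category, ResourceID: f.ResourceID}] = true
	}
	var missed []signal
	for _, s := range signals {
		if !covered[s] {
			missed = append(missed, s)
		}
	}
	return missed
}

func main() {
	signals := []signal{
		{Type: "high_disk", ResourceID: "local-zfs"},
		{Type: "backup_stale", ResourceID: "vm-101"},
	}
	findings := []finding{{Category: "high_disk", ResourceID: "local-zfs"}}
	// Only the stale backup was missed, so it would go to the evaluation pass.
	fmt.Println(unmatchedSignals(signals, findings))
}
```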

8. Investigation Orchestrator

📁 Location: internal/ai/investigation/

Investigation Session

type InvestigationSession struct {
    ID             string
    FindingID      string
    SessionID      string     // Chat session ID
    Status         Status     // pending, running, completed, failed, needs_attention
    StartedAt      time.Time
    CompletedAt    *time.Time
    TurnCount      int        // Agentic turns used
    Outcome        Outcome
    ProposedFix    *Fix
    ApprovalID     string     // If queued for approval
    ToolsAvailable []string
    ToolsUsed      []string
    EvidenceIDs    []string
    Summary        string
    Error          string
}

Configuration

type InvestigationConfig struct {
    MaxTurns                int           // Default: 15
    Timeout                 time.Duration // Default: 10 minutes
    MaxConcurrent           int           // Default: 3
    MaxAttemptsPerFinding   int           // Default: 3
    CooldownDuration        time.Duration // Default: 1 hour
    TimeoutCooldownDuration time.Duration // Default: 10 minutes (shorter for timeouts)
    VerificationDelay       time.Duration // Default: 30 seconds
}

Fix Structure

type Fix struct {
    ID          string
    Description string
    Commands    []string
    RiskLevel   string   // low, medium, high, critical
    Destructive bool     // Flagged by pattern matching
    TargetHost  string
    Rationale   string
}

Destructive Command Detection

Commands are scanned for dangerous patterns:

  • rm -rf, dd if=, mkfs.
  • shutdown, reboot, poweroff
  • iptables -F, ufw disable
  • Database drops, config wipes
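A pattern scan of this kind could look like the sketch below. The regular expressions cover only the examples listed above and are assumptions, not the patterns the investigation package actually ships with; `isDestructive` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// destructivePatterns is an illustrative subset based on the list above.
var destructivePatterns = []*regexp.Regexp{
	regexp.MustCompile(`\brm\s+-(rf|fr)\b`),
	regexp.MustCompile(`\bdd\s+if=`),
	regexp.MustCompile(`\bmkfs\.`),
	regexp.MustCompile(`\b(shutdown|reboot|poweroff)\b`),
	regexp.MustCompile(`\biptables\s+-F\b`),
	regexp.MustCompile(`\bufw\s+disable\b`),
	regexp.MustCompile(`(?i)\bdrop\s+(database|table)\b`),
}

// isDestructive reports whether a command matches any dangerous pattern.
func isDestructive(command string) bool {
	for _, p := range destructivePatterns {
		if p.MatchString(command) {
			return true
		}
	}
	return false
}

func main() {
	for _, cmd := range []string{
		"systemctl restart nginx",
		"rm -rf /var/tmp/cache",
		"iptables -F",
	} {
		fmt.Printf("%-28q destructive=%v\n", cmd, isDestructive(cmd))
	}
}
```

Matching on substrings is intentionally conservative: a harmless command that merely mentions `reboot` would also be flagged, which is the safer failure mode when the flag gates approval.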

9. Agentic Patrol Loop

📁 Location: internal/ai/patrol_ai.go

Dynamic Turn Budget

const (
    patrolMinTurns          = 20
    patrolMaxTurnsLimit     = 80
    patrolTurnsPer50Devices = 5  // +5 turns per 50 devices
    patrolQuickMinTurns     = 10 // Scoped patrols are faster
    patrolQuickMaxTurns     = 30
)

func computePatrolMaxTurns(resourceCount int, scope *PatrolScope) int {
    // Scales with environment size
}
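The body of computePatrolMaxTurns is not shown here, so the following is only one plausible reading of the constants: start from the minimum, add 5 turns for every full 50 resources, and clamp to the maximum, with the smaller bounds applied to scoped patrols. The function `maxTurns` and its `scoped` flag are hypothetical.

```go
package main

import "fmt"

const (
	patrolMinTurns          = 20
	patrolMaxTurnsLimit     = 80
	patrolTurnsPer50Devices = 5
	patrolQuickMinTurns     = 10
	patrolQuickMaxTurns     = 30
)

// maxTurns scales the turn budget with environment size (assumed formula).
func maxTurns(resourceCount int, scoped bool) int {
	lo, hi := patrolMinTurns, patrolMaxTurnsLimit
	if scoped {
		lo, hi = patrolQuickMinTurns, patrolQuickMaxTurns
	}
	if resourceCount < 0 {
		resourceCount = 0
	}
	turns := lo + (resourceCount/50)*patrolTurnsPer50Devices
	if turns > hi {
		turns = hi
	}
	return turns
}

func main() {
	for _, n := range []int{10, 120, 2000} {
		fmt.Printf("resources=%d full=%d scoped=%d\n", n, maxTurns(n, false), maxTurns(n, true))
	}
}
```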

Patrol Phases

  1. Seed Context Building — Inventory + thresholds + active findings + notes
  2. Streaming Analysis — LLM investigates using MCP tools
  3. Signal Detection — Deterministic check on tool outputs
  4. Evaluation Pass — Focused review of missed signals
  5. Stale Finding Reconciliation — Resolve findings whose issues cleared
  6. Investigation Triggering — Queue findings for deep investigation

Thinking Token Cleanup

Responses from DeepSeek and other models may include internal reasoning markers. Patrol strips these:

func CleanThinkingTokens(content string) string {
    // Removes:
    // - <think>...</think>
    // - <end▁of▁thinking>
    // - <DSML...> (DeepSeek internal format)
    // - Internal reasoning lines ("Now, ", "Let me ", etc.)
}
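As a partial illustration, the first item in that list (paired `<think>` blocks) can be removed with a single non-greedy regular expression. This sketch does not handle the other marker formats, and `stripThinkBlocks` is a hypothetical name rather than the function in patrol_ai.go.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkBlock matches <think>...</think>, spanning newlines, non-greedily so
// that two separate blocks are not merged into one match.
var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThinkBlocks removes reasoning blocks and trims leftover whitespace.
func stripThinkBlocks(content string) string {
	return strings.TrimSpace(thinkBlock.ReplaceAllString(content, ""))
}

func main() {
	raw := "<think>\ncheck disk usage first\n</think>\nStorage local-zfs is at 78%."
	fmt.Println(stripThinkBlocks(raw))
}
```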

10. Seed Context (What Patrol Sees)

Every patrol run builds comprehensive context:

| Component | What's Included |
|---|---|
| Nodes | Status, load, uptime, 24h/7d trends, anomaly flags |
| VMs/LXCs | Full metrics, backup status, OCI images, trend direction |
| Storage | Usage %, growth rate, days-to-full prediction |
| Docker | Container counts, health states, update availability |
| Kubernetes | Nodes, pods, deployments, services, namespaces |
| PBS | Datastore status, job outcomes, verification results |
| PMG | Mail queue depths, spam stats, delivery rates |
| Agents | Connection status, permissions, scope restrictions |
| Ceph | Cluster health, OSD states, PG status |
| Baselines | Per-resource learned normal ranges |
| Patterns | Detected recurring issues and predictions |
| Correlations | Known dependencies and cascade risks |
| Forecasts | Capacity predictions for at-risk resources |
| Active Findings | Existing issues being tracked |
| User Notes | Your annotations explaining expected behavior |
| Suppression Rules | What you've dismissed as not-an-issue |
| Recent Remediations | What worked (or failed) recently |

11. Findings Lifecycle

┌─────────────────────────────────────────────────────────┐
│                    Finding Created                      │
│  (by LLM via patrol_report_finding or deterministic)    │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │    Threshold Validation       │
            │  (Must exceed user thresholds)│
            └───────────────┬───────────────┘
                            │
           Pass ◄───────────┼───────────► Reject (filtered out)
                            │
                            ▼
            ┌───────────────────────────────┐
            │   Semantic Deduplication      │
            │ (Similar finding already open?)│
            └───────────────┬───────────────┘
                            │
           New ◄────────────┼────────────► Merge (bump count)
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Finding Stored          │
            │   (ai_findings.json)          │
            └───────────────┬───────────────┘
                            │
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
         ┌──────┐     ┌──────────┐   ┌────────────┐
         │Active│     │Investigate│   │ Auto-fix   │
         │(idle)│     │(approval) │   │(autonomous)│
         └──────┘     └─────┬─────┘   └─────┬──────┘
                            │               │
                            ▼               ▼
                    ┌───────────────────────────┐
                    │  Fix Proposed / Executed  │
                    └───────────┬───────────────┘
                                │
                                ▼
                    ┌───────────────────────────┐
                    │   Verification Delay      │
                    │      (30 seconds)         │
                    └───────────┬───────────────┘
                                │
                                ▼
                    ┌───────────────────────────┐
                    │   Verification Check      │
                    │  (Is the issue resolved?) │
                    └───────────┬───────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         ▼                      ▼                      ▼
    ┌─────────┐          ┌────────────┐        ┌──────────────┐
    │Resolved │          │  Persists  │        │Needs Attention│
    │(closed) │          │(retry later)│       │ (escalate)    │
    └─────────┘          └────────────┘        └──────────────┘

12. What Makes Patrol Different

| Traditional Alerting | Pulse Patrol |
|---|---|
| Static thresholds | Learned baselines + context |
| Single metric | Cross-system correlation |
| Instant alerts | Trend-aware predictions |
| No memory | Incident history + pattern learning |
| Manual investigation | Autonomous investigation |
| Manual fixes | Verified auto-remediation |
| Alert fatigue | Noise-controlled findings |
| Siloed tools | Unified intelligence |

13. Privacy & Security

  • BYOK: All AI calls use your API keys to your chosen provider
  • On-premises: All processing other than the LLM calls themselves happens on your Pulse server
  • Encrypted storage: Sensitive data (keys, licenses) stored encrypted
  • Minimal context: Only necessary data sent to AI providers
  • No telemetry: No data sent to Pulse by default
  • Audit trail: All actions logged with timestamps and user attribution

Files Reference

| Directory/File | Purpose |
|---|---|
| internal/ai/patrol.go | Core PatrolService definition, interfaces |
| internal/ai/patrol_run.go | Patrol loop, scoped runs, lifecycle |
| internal/ai/patrol_ai.go | LLM integration, agentic loop, context building |
| internal/ai/patrol_signals.go | Deterministic signal detection |
| internal/ai/patrol_findings.go | Finding CRUD, investigation triggers |
| internal/ai/patrol_triggers.go | Event-driven patrol triggers |
| internal/ai/intelligence.go | Unified intelligence orchestrator |
| internal/ai/baseline/ | Baseline learning and anomaly detection |
| internal/ai/patterns/ | Pattern detection and failure prediction |
| internal/ai/correlation/ | Correlation detection and root cause analysis |
| internal/ai/forecast/ | Trend extrapolation and capacity forecasting |
| internal/ai/memory/ | Incidents, remediations, context tracking |
| internal/ai/knowledge/ | Persistent per-resource knowledge store |
| internal/ai/investigation/ | Investigation orchestrator and sessions |
| internal/ai/tools/ | MCP tool implementations (50+ tools) |

Summary

Pulse Patrol represents a comprehensive approach to infrastructure intelligence:

  1. It learns — Baselines, patterns, correlations
  2. It predicts — Forecasts, failure predictions, cascade analysis
  3. It remembers — Incidents, remediations, knowledge
  4. It investigates — Autonomous diagnosis with tool access
  5. It fixes — Verified remediation with rollback capability
  6. It improves — Tracks what works and learns from outcomes

All of this runs on your infrastructure, with your AI keys, and with every action recorded in the audit trail.