ntopng/doc/README.alerts.developer.md
2020-09-04 10:48:17 +02:00

5.8 KiB

Introduction

ntopng alerts are

  • stateful
  • stateless

Stateless alerts

Stateless alerts are stored when an issue is detected. Issues which trigger stateless alerts have no duration associated, that is, they are basically events such as the connection/disconnection of a device, or the change of the status of an SNMP device port.

To store a stateless alert, method alerts_api.store is called.

Stateless alerts lifecycle

  1. alerts_api.store enqueues the alert into the ntopng internal SQLite queue (ntop.pushSqliteAlert) and also into the ntopng recipients queue (ntop.pushAlertNotification).
  2. housekeeping.lua dequeues, every three seconds, the alert from the ntopng internal SQLite queue (ntop.popSqliteAlert) and also from the ntopng recipients queue (alert_utils.processAlertNotifications).
  • Alerts dequeued from the ntopng internal SQLite queue are sent to the C engine for the actual insertion into SQLite (interface.storeAlert).
  • Alerts dequeued from the ntopng recipients queue are enqueued again into per-recipient queues (notification_recipients.dispatchNotification), using the host pool id carried inside the alert to choose the recipients (recipients = pools:get_recipients(message.pool_id)). Alerts are always enqueued also into the builtin SQLite recipient builtin_sqlite_recipient.
  1. housekeeping.lua dequeues, every three seconds, alerts from every per-recipient queue (notification_recipients.processNotifications).
  • Alerts are dequeued from the per-recipient queue with a single call to m.dequeueRecipientAlerts which is also in charge of processing the dequeued alerts (e.g., batch and send them out in email messages).

Stateful alerts

Stateful alerts are triggered the issue is detected, and are released when the issue no longer occurs. Issues which trigger stateful alerts have a duration associated, that is, they are events such as threshold crosses (e.g., the throughput is above 1Mbps) or suspicious activities (e.g., the host is performing a TCP scan).

Methods alerts_api.trigger and alerts_api.release are called to trigger and release stateful alerts, respectively.

Stateful alerts lifecycle

  1. alerts_api.trigger sets into the C core, straight into the entity, the triggered alert (e.g., host.storeTriggeredAlert) and enqueues the alert into the ntopng recipients queue (ntop.pushAlertNotification). No SQLite enqueues/dequeues/insertions comes into play.
  2. alerts_api.release removes from the C core the previously triggered alert (e.g., host.releaseTriggeredAlert), enqueues the alert into ~~the ntopng internal SQLite queue (ntop.pushSqliteAlert) and also into ~~the ntopng recipients queue (ntop.pushAlertNotification).
  3. housekeeping.lua performs the very same operations 2 and 3 described above for stateless alerts.

Alert Queues

Queues are used to decouple the dispatch from the processing of alerts. Currently used queues are:

  • One in-memory queue for SQLite alerts (ntop->getSqliteAlertsQueue())
  • One in-memory queue for the ntopng recipients (ntop->getAlertsNotificationsQueue())
  • Multiple Redis queues for per-recipient queues (get_endpoint_recipient_queue(recipient_id))

Queue messages format

JSON messages are queued/dequeued. The format of the JSON is undocumented and contains variable-fields which depends on the alert type. However, a minimum set of fields is constant and include alert type, entity and severity.

Queue drops

  • When the SQLlite queue is full, alerts are dropped and counted into the system interface dropped alerts (iface->incNumDroppedAlerts(1)).
  • When the ntopng recipients queue is full, alerts are dropped and counted into the system interface dropped alerts (iface->incNumDroppedAlerts(1)) but drops are NOT counted.
  • When any of the per-recipient queues is full, alerts are dropped with a queue trim but drops are NOT counted.

Alert Recipients

Recipients are implemented as plugins, e.g., plugins/{email,webhook}_alert_endpoint/. Recipients are loaded in memory with plugins_utils.getLoadedAlertEndpoints(). A require is used to avoid loading them more than one time per Lua VM. Recipients are not loaded when enqueuing alerts, however, they are loaded when dequeuing alerts in housekeeping.lua which VM is re-used and only recreated once every two minutes.

It would be desirable to migrate current recipients implementation to an OO implementation to ease the monitoring of queue fill levels.

Critical Points

  • Only one in-memory queue is used for SQLite alerts and ntopng recipients.

    • An interface generating many alerts can jeopardize the queue and cause other interfaces alerts to be dropped.
    • An high number of alerts of a certain type can jeopardize the queue and cause alerts of other types to be dropped.
  • When the ntopng recipients queue or any of the per-recipient queues is full, alerts are dropped but drops are NOT counted.

  • housekeeping.lua is assumed to run every three seconds, however, it can be much slower than this, for example when it starts refreshing/downloading blacklists from the web. If the housekeeping gets stuck for a long time, alerts will not be dequeued, queues will grow, and eventually this will cause alert drops.

  • notification_recipients.processNotifications relies on housekeeping.lua running every three seconds ((now % m.EXPORT_FREQUENCY) < periodic_frequency). This is assumption is wrong and can cause alerts to stay in per-recipient queues indefinitely.

  • notification_recipients.processNotification relies on dequeueRecipientAlerts. If dequeueRecipientAlerts is slow, or perform only one operation at time, then alerts will be processed at a much slower rate than the generation rate (e.g., currently, max 1 mail is sent out every minute). It would be ideal to process all recipients in round-robin until there's no more work to do.