nDPI/tests/dga
dga_evaluate.c Improved DGA detection 2021-03-03 19:30:01 +01:00
Makefile.in build: respect environment options more (#1392) 2022-01-18 14:30:14 +01:00
README.md Implement DGA detection performances tracking workflow. (#1064) 2020-11-16 21:17:16 +01:00
test_dga.csv Improved DGA detection 2021-03-03 19:30:01 +01:00
test_non_dga.csv Implement DGA detection performances tracking workflow. (#1064) 2020-11-16 21:17:16 +01:00

DGA detection testing workflow

Overview

nDPI provides a set of threat detection features available through NDPI_RISK detection.

As part of these features, we provide DGA detection.

Domain generation algorithms (DGA) are algorithms seen in various families of malware that are used to periodically generate a large number of domain names that can be used as rendezvous points with their command and control servers.

The DGA detection heuristic is implemented here.

Testing and tracking DGA detection performance lets us automatically detect whether a modification degrades detection quality.

The modification can be a simple threshold change or a future lightweight ML approach.
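To illustrate what a "simple threshold change" can mean in practice, here is a minimal sketch of a threshold-based scorer. It is not nDPI's heuristic: the entropy feature, the `ENTROPY_THRESHOLD` value, and the function names are all hypothetical and chosen only to show how moving one cut-off changes which domains get flagged.

```python
import math
from collections import Counter

# Hypothetical cut-off, chosen only for illustration; not a value used by nDPI.
ENTROPY_THRESHOLD = 3.5


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_dga(domain: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag a domain when the entropy of its leftmost label exceeds `threshold`."""
    label = domain.lower().split(".")[0]
    return shannon_entropy(label) > threshold
```

Lowering the threshold flags more domains (more detections, but also more false positives), which is exactly the kind of trade-off the performance tracking is meant to catch.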

Data used

The original dataset is a balanced collection of legitimate and DGA domains. It can be obtained as follows:

wget https://raw.githubusercontent.com/chrmor/DGA_domains_dataset/master/dga_domains_full.csv

We split the dataset into DGA and non-DGA domains, keeping 10% of each as the test set and 90% as the training set. The split requires pandas and scikit-learn:

python3 -m pip install pandas
python3 -m pip install scikit-learn

Instructions using python3:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV without a header row and name the columns explicitly.
df = pd.read_csv("dga_domains_full.csv", header=None, names=["type", "family", "domain"])
df_dga = df[df["type"] == "dga"]
df_non_dga = df[df["type"] == "legit"]

# Hold out 10% of each class as the test set; the fixed seed keeps the split reproducible.
train_non_dga, test_non_dga = train_test_split(df_non_dga, test_size=0.1, shuffle=True, random_state=27)
train_dga, test_dga = train_test_split(df_dga, test_size=0.1, shuffle=True, random_state=27)

# Write one domain per line, without header or index.
test_dga["domain"].to_csv("test_dga.csv", header=False, index=False)
test_non_dga["domain"].to_csv("test_non_dga.csv", header=False, index=False)
train_dga["domain"].to_csv("train_dga.csv", header=False, index=False)
train_non_dga["domain"].to_csv("train_non_dga.csv", header=False, index=False)

Any detection approach must be built on the training set only. The test set must be kept as unseen cases for testing.
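One way to check that the test set really is unseen is to look for domains that appear in both files. The following is a minimal sketch using only the standard library; the helper names `read_domains` and `find_leaked_domains` are hypothetical and not part of nDPI.

```python
def read_domains(path):
    """Read one domain per line from a header-less CSV such as train_dga.csv."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def _normalize(domain):
    """Compare domains case-insensitively and ignore surrounding whitespace."""
    return domain.strip().lower()


def find_leaked_domains(train_domains, test_domains):
    """Return the sorted list of domains present in both the training and test sets."""
    train = {_normalize(d) for d in train_domains}
    test = {_normalize(d) for d in test_domains}
    return sorted(train & test)
```

For example, `find_leaked_domains(read_domains("train_dga.csv"), read_domains("test_dga.csv"))` should return an empty list. A non-empty result would suggest that the source dataset contains duplicate rows that ended up on both sides of the split.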

dga_evaluate

After compiling nDPI, you can use the dga_evaluate helper to count the detections in an input file.

dga_evaluate <file name>
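Detection counts on the two test files can be turned into standard classification metrics. The sketch below assumes you have a detection count for test_dga.csv (true positives) and one for test_non_dga.csv (false positives), together with the number of domains in each file. The function name and the choice of metrics are assumptions for illustration, not a description of what the nDPI scripts compute.

```python
def evaluation_metrics(dga_detected, dga_total, non_dga_detected, non_dga_total):
    """Compute accuracy, precision, recall and F1 from detection counts.

    dga_detected:     detections among known DGA domains (true positives)
    non_dga_detected: detections among known legitimate domains (false positives)
    """
    tp = dga_detected
    fp = non_dga_detected
    tn = non_dga_total - non_dga_detected
    total = dga_total + non_dga_total

    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / dga_total if dga_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

With 80 detections out of 100 DGA domains and 20 detections out of 100 legitimate domains, all four metrics come out at 0.8 (up to floating-point rounding).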

Before submitting a modification, you can evaluate its performance as follows:

./do-dga.sh

If your modification lowers the baseline performance, the test fails. If not (well done), the test passes and you must update the baseline metrics with the ones you obtained.
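The pass/fail rule above can be sketched as a comparison between two sets of metrics. This is only an illustration of the idea, not the logic of do-dga.sh: the function name, the dictionary layout, and the optional `tolerance` parameter are hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return {metric: (baseline_value, current_value)} for every metric that
    is missing from `current` or dropped below its baseline by more than `tolerance`."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions
```

An empty result means no metric regressed, in which case the baseline would be replaced with the new values. A non-empty result lists the metrics that would make the test fail.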