FQDN (Fully Qualified Domain Name) Classifier

A machine learning classifier that predicts whether an FQDN is benign or malicious.

The project trains a classifier on features extracted from domain analysis (DNS, SSL, HTTP behavior, WHOIS, and lexical properties) and exposes a Flask API for inference.

Data Source: The model can be trained using domain lists from fabriziosalmi/blacklist or any compatible blacklist/whitelist files.

Key Features

Feature Extraction: Extracts ~22 binary/ternary features per domain including DNS record presence, SSL certificate validity, HTTP redirect behavior, HSTS, suspicious keywords, TLD risk, domain length, hyphen/digit counts, and optional WHOIS data.
Multiple Model Types: Supports Gaussian Naive Bayes, Logistic Regression, and Random Forest (default).
Flask API: REST API with input validation and health check endpoint.
CLI Prediction: Classify domains directly from the terminal using predict.py.
Centralized Configuration: All settings managed via settings.py with config.ini override support.
Multithreaded Extraction: Concurrent domain analysis using a configurable thread pool.
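
The multithreaded extraction can be pictured roughly as follows. This is a minimal sketch, not the project's code: analyze_stub is a hypothetical stand-in for the real per-domain analysis (it does no network I/O), and max_workers is an assumed parameter name.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_stub(fqdn: str) -> dict:
    """Hypothetical stand-in for the real per-domain analysis (no network I/O)."""
    return {"fqdn": fqdn, "length": len(fqdn), "hyphens": fqdn.count("-")}


def analyze_many(fqdns, max_workers: int = 8) -> list:
    """Analyze domains concurrently; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_stub, fqdns))


results = analyze_many(["example.com", "my-test-site.org"])
```

A thread pool suits this workload because DNS, HTTP, and WHOIS lookups spend most of their time waiting on the network rather than on the CPU.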

Architecture

The project consists of three main components:

augment.py: The ETL pipeline. Enriches raw FQDN lists with analysis features (DNS, HTTP, SSL, WHOIS) and writes a JSON dataset.
fqdn_classifier.py: The training script. Trains a model on the augmented dataset and serializes it using joblib.
api.py / predict.py: The inference layer. Loads the serialized model to serve predictions via CLI or HTTP API.

Installation

Clone the repository:

git clone https://github.com/fabriziosalmi/fqdn-model.git
cd fqdn-model

Set up environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Usage

1. Training (Optional)

If you want to retrain the model with your own data or fresh data from fabriziosalmi/blacklist:

# 1. Download lists
wget https://raw.githubusercontent.com/fabriziosalmi/blacklist/master/blacklist.txt
wget https://raw.githubusercontent.com/fabriziosalmi/blacklist/master/whitelist.txt

# 2. Augment data (extract features)
python augment.py -i blacklist.txt -o blacklist.json --is_bad Yes
python augment.py -i whitelist.txt -o whitelist.json --is_bad No

# 3. Merge & Train
python merge_datasets.py blacklist.json whitelist.json -o dataset.json
python fqdn_classifier.py dataset.json

2. Prediction (CLI)

Classify domains directly from your terminal:

python predict.py google.com
python predict.py malicious-test-domain.xyz

3. API Serving

Start the server:

python api.py

Health Check:

curl http://localhost:5000/health

Prediction:

curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"fqdn": "example.com"}'
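
The same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the server started with python api.py is listening on localhost:5000 and that the response body is JSON; the response fields are not specified here. build_predict_request and predict are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(fqdn: str, base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build the POST /predict request with a JSON body."""
    body = json.dumps({"fqdn": fqdn}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(fqdn: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_predict_request(fqdn), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling predict("example.com") with the server running should return the decoded response as a dictionary.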

Configuration

The project uses a hierarchical configuration system, listed here from lowest to highest precedence:

Defaults: Defined in settings.py.
Config File: Values in config.ini override defaults.
CLI arguments: Runtime arguments override everything.

See settings.py for all available options.
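
The precedence described above could be implemented along these lines. This is a sketch under assumptions, not the contents of settings.py: the section name classifier, the keys model_type and n_estimators, and the function resolve_settings are all hypothetical.

```python
import configparser

# Hypothetical defaults; the real ones live in settings.py.
DEFAULTS = {"model_type": "random_forest", "n_estimators": "100"}


def resolve_settings(ini_text: str, cli_overrides: dict, section: str = "classifier") -> dict:
    """Merge defaults < config file < CLI arguments (later layers win)."""
    settings = dict(DEFAULTS)
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if parser.has_section(section):
        settings.update(parser[section])
    # Only CLI values that were actually supplied (not None) override.
    settings.update({k: str(v) for k, v in cli_overrides.items() if v is not None})
    return settings


ini = "[classifier]\nn_estimators = 200\n"
resolved = resolve_settings(ini, {"model_type": "gaussian_nb", "n_estimators": None})
```

Here the file overrides n_estimators, the CLI overrides model_type, and a CLI option left unset does not mask the value from the file.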

Testing

Run the test suite:

pytest tests/

Data Format

The training data consists of two text files:

whitelist.txt: Contains a list of benign FQDNs, one per line.
blacklist.txt: Contains a list of malicious FQDNs, one per line.

Each line should contain only the FQDN itself, without extra characters or whitespace.

Example:

whitelist.txt:

google.com
facebook.com
wikipedia.org

blacklist.txt:

malware-domain.xyz
phishing-site.tk
evil-domain.com
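
A list file in this format can be read with a few lines of Python. This is a minimal sketch; parse_fqdns and load_fqdns are hypothetical helpers, not functions from the project, and they tolerate stray whitespace and blank lines instead of rejecting them.

```python
from pathlib import Path


def parse_fqdns(text: str) -> list:
    """Return one FQDN per non-empty line, with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_fqdns(path) -> list:
    """Read a whitelist/blacklist file and parse it line by line."""
    return parse_fqdns(Path(path).read_text(encoding="utf-8"))
```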

Model Details

Supported Models: Gaussian Naive Bayes (gaussian_nb), Logistic Regression (logistic_regression), Random Forest (random_forest, default)
Default Estimators (Random Forest): 100 (configurable via --rf_n_estimators)

Features Extracted

Features are extracted by the analyze_fqdn function in augment.py. Each feature is encoded as a binary or ternary value (0 = benign indicator, 1 = malicious indicator, 2 = unknown/unavailable):

DNS record presence (A, AAAA, MX, TXT, CNAME)
SSL certificate validity
HTTP status code
Final protocol (HTTP vs HTTPS)
HTTP-to-HTTPS redirect
Excessive redirects
HSTS header presence
Suspicious keyword detection in page content
SSL verification failure
Risky TLD
Domain length
Hyphen count
Digit count
IP address embedded in domain
URL shortener detection
Subdomain count
Page title and body presence
WHOIS data (creation date, expiration date, age); optional, requires the --whois flag
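
To illustrate the 0/1/2 encoding, here is a sketch covering a few of the purely lexical features. It is not the analyze_fqdn implementation: the thresholds, the feature key names, the IP pattern, and the helpers flag and lexical_features are assumptions made for the example.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
MAX_LENGTH = 30
MAX_HYPHENS = 2
MAX_DIGITS = 4

UNKNOWN = 2  # value used for unknown/unavailable data


def flag(condition) -> int:
    """Map True -> 1 (malicious indicator), False -> 0 (benign), None -> 2 (unknown)."""
    if condition is None:
        return UNKNOWN
    return 1 if condition else 0


def lexical_features(fqdn: str) -> dict:
    """Encode a few string-only features of a domain name."""
    return {
        "long_domain": flag(len(fqdn) > MAX_LENGTH),
        "many_hyphens": flag(fqdn.count("-") > MAX_HYPHENS),
        "many_digits": flag(sum(ch.isdigit() for ch in fqdn) > MAX_DIGITS),
        # Four 1-3 digit groups separated by dots or hyphens look like an IPv4 address.
        "embedded_ip": flag(re.search(r"\d{1,3}(?:[.-]\d{1,3}){3}", fqdn) is not None),
    }


features = lexical_features("login-192-168-0-1-secure-update.example.com")
```

Lexical features are always computable from the string, so they never take the value 2; that value matters for network-dependent features such as DNS or WHOIS lookups that can fail or time out.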

Model Persistence

Trained models are saved using joblib in the models/ directory and loaded automatically by predict.py and api.py.

Performance Metrics

The fqdn_classifier.py script evaluates the trained model using:

Accuracy: Overall correctness of the model.
ROC AUC: Area under the receiver operating characteristic curve.
Precision: Proportion of true positives among predicted positives.
Recall: Proportion of true positives among actual positives.
F1-Score: Harmonic mean of precision and recall.
Log Loss / Brier Score: Probability calibration metrics.
Confusion Matrix: Counts of true/false positives and negatives.
Feature Importance: Ranking of features by contribution (for Random Forest and Logistic Regression).
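
Several of these metrics follow directly from the confusion matrix counts. The sketch below shows the arithmetic in plain Python, independent of whatever fqdn_classifier.py uses internally; classification_metrics is a hypothetical helper and the counts are made-up example values, not results from this project.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```

With "malicious" as the positive class, a false negative is a malicious domain classified as benign, so recall is often the metric to watch most closely.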

Contributing

Contributions are welcome! Here's how you can contribute:

Fork the repository.
Create a new branch: git checkout -b feature/my-new-feature or git checkout -b fix/my-bug-fix
Make your changes and commit them: git commit -am 'Add some feature'
Push to the branch: git push origin feature/my-new-feature
Create a new Pull Request.

Guidelines:

Follow the existing code style.
Write clear and concise commit messages.
Provide tests for your changes.
Explain the purpose of your changes in the Pull Request description.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Credits

scikit-learn: https://scikit-learn.org/
tldextract: https://github.com/john-kurkowski/tldextract
joblib: https://joblib.readthedocs.io/en/latest/
rich: https://github.com/Textualize/rich
Flask: https://flask.palletsprojects.com/

Contact

If you have any questions or suggestions, feel free to open an issue.
