
CLASSIFICATION OF DGA BASED MALWARE USING

ENSEMBLE AND DEEP HYBRID LEARNING

MSc Thesis Research Proposal

By
BEREKET HAILU BIRU

Advisor: SOLOMON ZEMENE (Ph.D.)

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

(JANUARY 2022)
OUTLINE
1. INTRODUCTION
• Statement of problem
• Significance of the study
• Research questions
• Objective

2. LITERATURE REVIEW

3. METHODOLOGY
• Proposed Methods
• General workflow
• Evaluation Metrics
1. INTRODUCTION

• The number of domain hijackers is increasing day by day, which poses a serious risk.
• Even a network with the utmost security becomes vulnerable to attack once it is connected to the internet.
• The internet relies on DNS, a digital asset that serves as the phone book mapping domain names to IP addresses.
• DNS defines who is on the internet and reveals companies' identities online.
• On June 23, 2020, CSC’s Domain Security Report stated that 83% of
worldwide 2000 organizations are at greater risk of domain name
hijacking.
• Traditional approaches fall short of preventing DNS abuse because attackers use DGA tools, which make reverse engineering difficult.
• A DGA (Domain Generation Algorithm) is a piece of code or an algorithm used by cybercriminals for two main purposes:
i. Generating a massive number of domain names pseudo-randomly.
ii. Creating command and control channels between the malware and the attacker.
• DGAs are normally seed-based and may generate thousands of domain names. The seed is known to both sides; therefore, the same sequence is generated on both the client and the attacker sides without any communication between them.
• According to Netlab 360, there are more than 49 DGA based malware families. Some DGA domains look bizarre (e.g., gegjiimqmlgtdmk.tf or jxbdxeyxttdmcjagi.me), and some are harder to detect (e.g., huoseavas.name or agtisaib.info) because they are pronounceable like human language.
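The seed-based behaviour described above can be illustrated with a minimal sketch. This is a toy generator written for this explanation, not the algorithm of any real malware family; the function name `toy_dga`, the seed string, and the reserved `.example` suffix are all assumptions.

```python
import hashlib
from datetime import date

def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".example") -> list[str]:
    """Derive `count` pseudo-random domain names from a shared seed and a date."""
    domains = []
    for i in range(count):
        # Hash seed + date + counter; each side can recompute this independently.
        digest = hashlib.sha256(f"{seed}|{day.isoformat()}|{i}".encode()).hexdigest()
        # Map the first 12 hex digits to the lowercase letters a..p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains

# Both sides run the same function with the same (hypothetical) inputs.
client_side = toy_dga("shared-secret", date(2022, 1, 1))
attacker_side = toy_dga("shared-secret", date(2022, 1, 1))
print(client_side == attacker_side)  # True: identical lists without any communication
```

The same property helps defenders: once a seed and algorithm are recovered, the upcoming domains can be precomputed and blocked.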
There are two Binary classification classify between two
types of mutually exclusive classes.
The separation or identification of
classification legitimate FQDNs from malicious one is
techniques to answered by binary experiment
Multi class classification is a classification
detect the task with more than two classes.
malicious The multiclass experiment goes beyond
the binary experiment in order to identify
domain [1] not only the legitimate FQDN but also to
[27], those sort malware samples according to their
families.
are:
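The difference between the two experiments comes down to how the target labels are built. The sketch below is illustrative only: the family names `family_a` and `family_b` are placeholders, since the slides do not attribute the example domains to specific families.

```python
# Hypothetical labeled samples: (FQDN, family); "legit" marks a benign domain.
samples = [
    ("example.com", "legit"),
    ("gegjiimqmlgtdmk.tf", "family_a"),
    ("huoseavas.name", "family_b"),
]

def binary_label(family: str) -> int:
    """Binary experiment: 0 = legitimate, 1 = DGA-generated (any family)."""
    return 0 if family == "legit" else 1

def multiclass_index(families: list[str]) -> dict[str, int]:
    """Multiclass experiment: one integer class per distinct label, legit included."""
    return {name: idx for idx, name in enumerate(sorted(set(families)))}

families = [fam for _, fam in samples]
class_index = multiclass_index(families)
y_binary = [binary_label(fam) for fam in families]
y_multi = [class_index[fam] for fam in families]
print(y_binary)  # [0, 1, 1]
print(y_multi)   # [2, 0, 1]
```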
1.1. Statement of Problem
• Malware imitates the pattern of normal domain names by concatenating randomly chosen English dictionary words.
• DGA domains are unpredictable by nature.
• Existing methods are unable to attribute a generated domain to the specific malware that produced it.
• Existing methods perform poorly in detecting zero-day malware.
• There is a lack of adequately organized or carefully reviewed datasets [26][27].
• Learning and classification algorithms are improperly selected.
• Not enough context and features are used to express the domain name [19][33][27].
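To make the point about features concrete, the sketch below computes three simple character-level features of a domain name. It is an illustration under assumptions: these features are not the set used by the cited works, the function names are hypothetical, and the input is assumed to have a non-empty first label.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def domain_features(fqdn: str) -> dict[str, float]:
    """Features of the left-most label only (assumed non-empty)."""
    label = fqdn.lower().split(".")[0]
    vowels = sum(ch in "aeiou" for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowels / len(label),
    }

for name in ["gegjiimqmlgtdmk.tf", "huoseavas.name"]:
    print(name, domain_features(name))
```

Features like these tend to separate random-looking names from ordinary ones, but pronounceable or dictionary-based names can look normal on all three, which is why richer context is needed.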
1.2. Significance of the Study

1. An ensemble and deep hybrid (DH) model for DGA based malware classification using both ML and DL classifiers.
2. More extensive testing of the trained model on domain names generated by new and previously unseen (i.e., untrained) malware.
3. Evaluation of the proposed approach on different datasets with various evaluation metrics, and comparison of the results with other existing methods.
1.3. Research Questions

1. How can ensemble and deep hybrid learning (DHL) techniques be configured and applied for binary and multiclass classification?
2. Can the accuracy and performance of DGA based malware classification be increased by combining DL and ML techniques?
3. How can the proposed approach be evaluated for multiclass classification and used to make predictions on new data?
1.4. Objective

General Objective
• Implementing ensemble and DHL techniques for DGA based malware classification in DNS.

Specific Objectives
• Implement ensemble learning and deep hybrid models in DNS.
• Implement the model for detecting DGA based malware and classifying each malware family.
• Test the performance of the proposed model.
• Compare the model's performance results with other existing methods.
2. LITERATURE REVIEW

Daniel S. Berman (2019)
• Proposed a 1D application of Capsule Networks to DGA based malware detection [25].
• CapsNet, CNN and LSTM algorithms were used.
• All the models performed worse in detecting word-based DGAs.
• The unknowndropper malware remained undetected in the novel DGA experiment (previously unseen DGA).

Mattia Zago et al. (2019)
• Presented the feature discovery process for the detection of DGA based botnets [27].
• The context-free feature family is more than capable of pinpointing DGA based malware without harming users' privacy.
• Both binary and multiclass experiments were run and evaluated on six ML algorithms (RF, NN, SVM, DT, AB and kNN).
• The multiclass experiment performed worse than the binary one because it was unable to distinguish similar malware such as Oakbot and Matsnu.

Jonathan Peck et al. (2019)
• Presented CharBot, a novel DGA that can generate large numbers of unregistered domain names [31].
• Highlighted a dangerous weakness of modern DGA classifiers to extremely simple attacks.
• Tested current models, including FANCI (RF based on human-engineered features) and LSTM.MI (a DL approach), and observed poor performance for real-time detection of the DGAs.
• All models failed to detect CharBot and DeceptionDGA domains successfully.

Ryan R. Curtin et al. (2019)
• Combined a novel recurrent neural network architecture with domain registration side information (WHOIS) to detect DGA domains [26].
• Effective at detecting DGA families with a high smash-word score.
• Less effective for DGA families that do not look like natural domain names.

Yanchen Qiao et al. (2019)
• Proposed a Long Short-Term Memory (LSTM) method with an attention mechanism [24].
• Used the character sequence of the domain name as the feature.
• Effective in classifying DGA domain names, but showed no big improvement.
• A large number of Cryptolocker, Locky, and Necurs domains were classified as Ramnit.

Fangli Ren et al. (2020)
• Presented an integrated attention mechanism and deep neural network to detect and classify domain names [18].
• Achieved better performance on arithmetic-based, part-wordlist-based and wordlist-based DGAs, such as the matsnu and suppobox families.
• However, the Cryptolocker, gameover and locky malware were not properly classified.

Vinayakumar Ravi et al. (2021)
• Proposed a novel technique to detect randomly generated domain names using a DL approach [19].
• The model was tested against three different adversarial attacks: DeepDGA, CharBot, and MaskDGA.
• The method was able to identify DNS homograph attacks and DGAs.
• However, it was not effective for DGAs belonging to a novel family (previously unseen DGA).

| Author & year | Dataset | Algorithm | Results & gap / future work |
|---|---|---|---|
| Ali Soleymani and Fatemeh Arabgo (2021) | Spamhaus website [17]; Alexa rankings | Decision tree, support vector machine, random forest, and logistic regression | RF has the highest classification accuracy. Few DGA families (4) were used. |
| Karunakaran P. (2020) | Public sources as well as real-time environments | CNN, RNN with auto encoder and PCA (Principal Component Analysis); character level embedding structure (CLES) | Low performance. Limited dataset. Limited number of DGA families. |
| Mattia Zago et al. (2019) | Netlab 360, Plohmann 2015, & (Malware Domain List 2009; OSINT); Majestic12 Ltd: The Majestic Million (2018) backlink data | PCA to extract n features; RF, NN, SVM, DT, AB and kNN | Multiclass experiment performed worse than the binary one. Unable to distinguish similar malware like Oakbot or Matsnu. |
| Daniel S. Berman et al. (2019) | Alexa top one million non-DGA domains; DGA feed from Bambenek Consulting | RNN & CNN (CapsNet, CNN and LSTM) | The unknowndropper malware was entirely undetected. All the models performed worse in detecting word-based DGAs. |
| Jonathan Peck et al. (2019) | Alexa; Bambenek Consulting feeds; QNAME | LSTM.MI, B-RF, FANCI method | All models fail to adequately detect CharBot and DeceptionDGA domains. FANCI and LSTM.MI (DL approach) performed worse for real-time detection. |
| Yanchen Qiao et al. (2019) | Bambenek Consulting; Alexa | LSTM with attention mechanism | Effective, but showed no big improvement. A large number of Cryptolocker, Locky, and Necurs domains are classified as Ramnit. |
| Ryan R. Curtin et al. (2019) | DGA from GitHub; non-DGA from Alexa top 1 million; OpenDNS public domain lists | RNN + WHOIS (domain registration side information) | Able to reliably detect difficult DGA families such as matsnu, suppobox, rovnix, and others. Less effective for DGA families that do not look like natural domain names. |

• Various studies have addressed the classification and detection of DGA based malware, but they were unsuccessful in identifying some malware.
• Developing a model capable of detecting all malicious domains is critical, and none of the models tested in [25] manages to do so.
• In this research we will try to find another method to perform the recognition in an automated way.
3. RESEARCH METHODOLOGY

• Classical ML methods have yielded lower performance and accuracy in understanding the dataset and performing feature engineering.
• On the other hand, the final classification stage of a DL model, driven by fully connected neural network layers, may result in over-fitting or may require excessive computational resources and power, which does not happen in classical ML.
• For this reason, the proposed research will adopt an ensemble and hybrid approach (ML and DL) that fuses the two, in order to take the benefits of both, alleviate their drawbacks, increase prediction accuracy, and decrease computational complexity [35][36].
3.2.1. Ensemble Learning
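• Ensemble learning is the process of combining multiple learning algorithms to obtain their collective performance [37]. Three common techniques are:
• Bagging: base learners are trained on bootstrap samples of the data and their predictions are combined by voting or averaging.
• Boosting: base learners are trained sequentially, each one focusing on the errors of its predecessors.
• Stacking: a meta-learner is trained on the predictions of several base learners.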
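Of the three ensemble techniques listed above, bagging is the simplest to sketch without external libraries. The example below is a minimal illustration, not the proposed model: the base learner is a one-feature threshold rule ("stump"), the training values are invented toy data, and all function names are hypothetical. Boosting and stacking are not shown.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold rule that predicts 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagging(points, n_estimators=15, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    return stumps

def predict_bagging(stumps, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]

# Invented toy data: (feature value, label), where 1 = DGA and 0 = legitimate.
train = [(1.8, 0), (2.1, 0), (2.4, 0), (2.6, 0),
         (3.1, 1), (3.3, 1), (3.6, 1), (3.9, 1)]
model = fit_bagging(train)
print(predict_bagging(model, 2.0), predict_bagging(model, 3.8))
```

An odd number of estimators is used so that a two-class vote cannot tie.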
3.2.2. Deep Hybrid Learning
• Deep hybrid learning is the result of a fusion network, obtained by combining Deep Learning and Machine Learning techniques.
• Thus, we can take the benefits of both, reduce their drawbacks, and provide more accurate and less computationally expensive solutions [39].
3.2.3. General workflow
3.2.4. Evaluation Metrics

• Recall or Sensitivity

• False Positive Rate (FPR)

• Accuracy
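The three metrics listed above follow directly from the confusion-matrix counts (TP, TN, FP, FN). The sketch below uses the standard definitions for the binary case, with 1 = DGA and 0 = legitimate; the label vectors are invented for illustration. For the multiclass experiment the same counts would typically be computed per family in a one-vs-rest fashion, which is an assumption here rather than something the slides specify.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def recall(tp, fn):
    """Recall (sensitivity) = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Invented example: 4 DGA domains and 6 legitimate domains.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(recall(tp, fn))               # 0.75
print(false_positive_rate(fp, tn))  # 1/6, about 0.167
print(accuracy(tp, tn, fp, fn))     # 0.8
```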
THANK YOU!
