Classification of DGA-Based Malware Using
By
BEREKET HAILU BIRU
(JANUARY 2022)
OUTLINE
1. INTRODUCTION
• Statement of problem
• Significance of the study
• Research questions
• Objective
2. LITERATURE REVIEW
3. METHODOLOGY
• Proposed Methods
• General workflow
• Evaluation Metrics
1. INTRODUCTION
• The number of domain hijackers is increasing day by day, which poses a serious risk.
• Even a network with strong internal security becomes vulnerable to attack
once it is connected to the internet.
• Inter-networking relies on DNS, a digital asset that acts as the phone book
mapping domain names to IP addresses.
• DNS defines who is on the internet and reveals a company’s identity online.
• On June 23, 2020, CSC’s Domain Security Report stated that 83% of
Global 2000 organizations are at greater risk of domain name
hijacking.
• Traditional approaches fall short of preventing DNS abuse because
attackers use DGA tools, which make reverse engineering
difficult.
DGA (Domain Generation Algorithm) is a piece of code or an algorithm
used by cybercriminals which has two main purposes:
i. Generating a massive number of domain names randomly.
ii. Creating command and control channels between the malware and the
attacker.
• DGA generators are normally seed-based and may generate thousands of
domain names. The seed is known to both sides; therefore, the same
sequence is generated on both the client side and the attacker's side
without any communication between them.
• According to Netlab 360, there are more than 49 DGA-based malware
families. Some DGA domains look bizarre (e.g., gegjiimqmlgtdmk.tf or
jxbdxeyxttdmcjagi.me), while others are harder to detect (e.g.,
huoseavas.name or agtisaib.info) because they are pronounceable,
like words in a human language.
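The seed-based behaviour described above can be illustrated with a toy generator. This is only a sketch under stated assumptions: the function `toy_dga`, the seed string, the hashing scheme, and the reserved `.example` TLD are all hypothetical and do not correspond to any real malware family. It shows that two parties holding the same seed and date derive the same domain list without exchanging any message.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".example") -> List[str]:
    """Derive `count` pseudo-random domain names from a shared seed and a date.

    Hypothetical toy scheme for illustration only, not a real DGA.
    """
    domains = []
    for i in range(count):
        material = f"{seed}:{day.isoformat()}:{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the letters a-p to get an alphabetic label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


# Both sides run the same function with the same (assumed) seed and date.
malware_side = toy_dga("shared-seed", date(2022, 1, 1))
attacker_side = toy_dga("shared-seed", date(2022, 1, 1))
print(malware_side == attacker_side)  # the two lists are identical
```

Because the output depends only on the seed and the date, a defender who does not know the seed cannot enumerate the list in advance, which is why blocklisting individual domains falls short.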
There are two classification techniques to detect malicious domains [1][27]:
• Binary classification distinguishes between two mutually exclusive
classes. The binary experiment addresses the separation of legitimate
FQDNs from malicious ones.
• Multiclass classification is a classification task with more than two
classes. The multiclass experiment goes beyond the binary experiment in
order not only to identify the legitimate FQDNs but also to sort malware
samples according to their families.
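The difference between the two experiments comes down to how the same samples are labelled. The sketch below assumes a hypothetical labelled list; the family names `family_a` and `family_b` are placeholders, since the slides do not say which family produced which example domain.

```python
# Hypothetical labelled samples: (FQDN, source). "legit" marks a benign domain;
# any other value is a placeholder name for the DGA family that produced it.
samples = [
    ("example.com", "legit"),
    ("gegjiimqmlgtdmk.tf", "family_a"),
    ("jxbdxeyxttdmcjagi.me", "family_a"),
    ("huoseavas.name", "family_b"),
]


def binary_target(source):
    """Binary experiment: 0 = legitimate, 1 = malicious (any DGA family)."""
    return 0 if source == "legit" else 1


def multiclass_targets(sources):
    """Multiclass experiment: one integer class per distinct source."""
    classes = sorted(set(sources))
    index = {name: i for i, name in enumerate(classes)}
    return [index[s] for s in sources], classes


sources = [source for _, source in samples]
binary = [binary_target(s) for s in sources]
multi, classes = multiclass_targets(sources)
print(binary)
print(multi, classes)
```

The binary targets collapse every family into a single malicious class, while the multiclass targets keep one class per family plus the legitimate class.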
1.1. Statement of Problem
• Malware imitates the pattern of normal domain names by concatenating
randomly chosen English dictionary words.
• The unpredictable nature of DGA-generated domains.
• Inability to attribute a generated domain to the specific malware that produced it.
• Poor performance in the detection of zero-day malware.
• Lack of adequately organized or carefully reviewed datasets [26][27].
• Improper selection of learning and classification algorithms.
• Not enough context and features are used to represent the domain name [19]
[33][27].
1.2. Significance of the Study
1.3. Research Questions
1. How can ensemble and DHL (Deep Hybrid Learning) techniques be configured
and applied for binary and multiclass classification?
2. Can we increase the accuracy and performance of DGA-based malware
classification by combining DL and ML techniques?
3. How can the proposed approach be evaluated for multiclass classification
and used to make predictions on new data?
1.4. Objective
2. LITERATURE REVIEW
However, the Cryptolocker, Gameover, and Locky malware families were not
properly classified.
Vinayakumar Ravi et al. (2021) proposed a novel technique to detect
randomly generated domain names using a DL approach [19].
Their model was tested against three different adversarial attacks: DeepDGA,
CharBot, and MaskDGA.
The method was able to identify DNS homograph attacks and DGAs.
However, it was not effective for DGAs belonging to a novel (previously unseen) family.
Summary of related work (author & year; dataset; algorithm; results & gap/future work):

Ali Soleymani and Fatemeh Arabgo (2021)
• Dataset: Spamhaus website [17]; Alexa rankings.
• Algorithm: Decision tree, support vector machine, random forest, and logistic regression.
• Results & gap/future work: RF has the highest classification accuracy. Only a few DGA families (4) were used.

Karunakaran P. (2020)
• Dataset: Public sources as well as real-time environments.
• Algorithm: CNN, RNN with autoencoder and PCA (Principal Component Analysis); character level embedding structure (CLES).
• Results & gap/future work: Low performance; limited dataset; limited number of DGA families.

Mattia Zago et al. (2019)
• Dataset: Netlab 360, Plohmann 2015, & (Malware Domain List 2009; OSINT). Majestic12 Ltd: The Majestic Million (2018) backlink data.
• Algorithm: PCA to extract n features; RF, NN, SVM, DT, AB and kNN.
• Results & gap/future work: The multiclass experiment performed worse than the binary one. Unable to distinguish similar malware like Oakbot or Matsnu.

Daniel S. Berman et al. (2019)
• Dataset: Alexa top one million non-DGA domains and DGA feed from Bambenek Consulting.
• Algorithm: RNN & CNN (CapsNet, CNN and LSTM).
• Results & gap/future work: The unknowndropper malware was entirely undetected. The performance of all the models was worse in detecting word-based DGAs.

Jonathan Peck et al. (2019)
• Dataset: Alexa; Bambenek Consulting feeds; QNAME.
• Algorithm: LSTM.MI, B-RF, FANCI method.
• Results & gap/future work: All models fail to adequately detect CharBot and DeceptionDGA domains. FANCI and LSTM.MI (DL approach) performed worse for real-time detection.

Yanchen Qiao et al. (2019)
• Dataset: Bambenek Consulting; Alexa.
• Algorithm: LSTM with attention mechanism.
• Results & gap/future work: Effective, but no big improvement was achieved. A large number of Cryptolocker, Locky, and Necurs samples are classified as Ramnit.

Ryan R. Curtin et al. (2019)
• Dataset: DGA from GitHub. Non-DGA from Alexa top 1 million; Open DNS public domain lists.
• Algorithm: RNN + WHOIS (domain registration side information).
• Results & gap/future work: Able to reliably detect difficult DGA families such as matsnu, suppobox, rovnix, and others. Less effective for those DGA families that do not look like natural domain names.

Various studies have addressed the classification and detection of DGA-based
malware, but they were unsuccessful in identifying some malware families.
In this research, we will try to find another method to perform the recognition
in an automated way.
3. RESEARCH METHODOLOGY
• Boosting
• Stacking
3.2.2. Deep Hybrid Learning
• It is the result of a fusion network, obtained by combining Deep
Learning and Machine Learning techniques.
• Thus, we can take the benefits of both, reduce their drawbacks, and
provide more accurate and less computationally expensive solutions [39].
3.2.3. General workflow
3.2.4. Evaluation Metrics
• Recall or Sensitivity
• Accuracy
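The two metrics listed above can be computed directly from the confusion counts. The following is a minimal sketch, assuming binary labels where 1 marks a DGA (malicious) domain; the helper names and the example label vectors are hypothetical and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def recall(tp, fn):
    """Recall (sensitivity): share of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, tn, fn):
    """Accuracy: share of all samples that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Made-up labels for illustration: 1 = DGA domain, 0 = legitimate domain.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(recall(tp, fn))            # 0.75
print(accuracy(tp, fp, tn, fn))  # 0.625
```

For the multiclass experiment, the same counts can be taken per family by treating each family in turn as the positive class.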
THANK YOU!