
VIRUS DETECTION USING DEEP

LEARNING

By
Saurabh Malusare
Rojan Sudev
Rishabh Nrupnarayan

Under The Guidance of


Prof. Anil M. Bhadgale
INTRODUCTION

A computer virus is a program or piece of code
that, when executed, replicates by reproducing
itself or by infecting other computer programs,
modifying them in the process.
VIRUS DETECTING TECHNIQUES

• Signature Based Detection


• Heuristic Based Detection
• Detection using Bait
LIMITATIONS OF CONVENTIONAL
TECHNIQUES

• Time lag between virus creation and its
detection
• Large signature databases have to be maintained
• New virus patterns cannot be detected
PROBLEM DEFINITION

Using deep learning to classify whether a file is
a virus or legitimate, while overcoming the
existing limitations of conventional techniques.
System Architecture
Important fields of the PE (Portable Executable) header:
Feature Selection
• Extract only features relevant to classification
• Fisher Score algorithm used for feature selection
• Fisher Score assigns each feature a rank
• Ranks lie between 0 and 1
• Higher rank, more relevance
Fisher Score formula:

F(i) = (µi,p − µi,n)^2 / (σi,p^2 + σi,n^2)

• µi,p = mean of positive samples for the ith PE header feature
• µi,n = mean of negative samples for the ith PE header feature
• σi,p = standard deviation of positive samples for the ith PE
header feature
• σi,n = standard deviation of negative samples for the ith PE
header feature
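Under these definitions, the score for the ith feature is (µi,p − µi,n)^2 / (σi,p^2 + σi,n^2). A minimal NumPy sketch (the toy data and function name are illustrative, not from the project):

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: (mu_p - mu_n)^2 / (sigma_p^2 + sigma_n^2).

    X: (n_samples, n_features) real-valued PE-header features.
    y: (n_samples,) labels, 1 = virus (positive), 0 = legitimate (negative).
    """
    pos, neg = X[y == 1], X[y == 0]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    var_p, var_n = pos.var(axis=0), neg.var(axis=0)
    return (mu_p - mu_n) ** 2 / (var_p + var_n + 1e-12)  # epsilon avoids /0

# Toy data: feature 0 separates the classes well, feature 1 does not.
X = np.array([[0.1, 5.0], [0.2, 4.9], [0.9, 5.1], [1.0, 5.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
top = np.argsort(scores)[::-1]  # features ranked by relevance
```

Sorting the scores in descending order and keeping the first 21 features gives the selection described above.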
Feature Extraction
• Extract 21 most relevant features determined
using Fisher Score.
• These features are real values.
• Normalize features using min-max
normalization
• Features are scaled to [0,1]
• Normalized feature values are then converted
to binary values using the condition:

If feature > mean(feature)
feature = 1
else
feature = 0
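The min-max normalization and mean-threshold binarization steps above can be sketched as follows (a minimal illustration; the strict > comparison follows the condition above):

```python
import numpy as np

def binarize_features(X):
    """Min-max normalize each column to [0, 1], then threshold at its mean."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    norm = (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (norm > norm.mean(axis=0)).astype(int)      # 1 if above mean, else 0

# Toy feature matrix (rows = files, columns = PE-header features).
X = np.array([[10.0, 200.0],
              [20.0, 100.0],
              [30.0, 400.0]])
B = binarize_features(X)  # binary matrix fed to the DBN's visible layer
```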
DBN
• A deep belief network is obtained by stacking
several RBMs (Restricted Boltzmann Machines)
on top of each other.
• The hidden layer of the RBM at layer `i`
becomes the input of the RBM at layer `i+1`.
• When used for classification, the DBN is
treated as an MLP, by adding a logistic
regression layer on top.
RBM

Fig. RBM

Fig. Forward phase

Fig. Backward phase


RBM Training
Contrastive Divergence-k(CD-k):
• Take a training sample v, compute the
probabilities of the hidden units and sample a
hidden activation vector h from this
probability distribution.
• Compute the outer product of v and h and call
this the positive gradient.
• From h, sample a reconstruction v1 of the
visible units, then resample the hidden
activations h1 from this.
• Repeat the above step k times to obtain vk and hk;
the outer product of vk and hk is the negative gradient.
• The weight update is the learning rate times the
difference between the positive and negative gradients.
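The steps above, for k = 1 (CD-1), can be sketched for a binary RBM as follows; bias terms are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, rng, lr=0.1):
    """One CD-1 step for a binary RBM (bias terms omitted for brevity).

    W: (n_visible, n_hidden) weights; v0: (n_visible,) binary training sample.
    """
    # Positive phase: hidden probabilities given v0, then sample h0.
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct v1 from h0, then recompute hidden probs.
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)
    # Update: positive gradient minus negative gradient (outer products).
    return W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))
v = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
W = cd1_update(W, v, rng)
```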
Training DBN
• The DBN is trained in a semi-supervised way,
in 2 phases:
1) Unsupervised training phase
2) Supervised training phase
Unsupervised Training
Algorithm:
• 1. Train the first layer as an RBM that models the raw input as its visible
layer.
• 2. Use that first layer to obtain a representation of the input that will be
used as data for the second layer.
• 3. Train the second layer as an RBM, taking the transformed data
(samples or mean activations) as training examples (for the visible layer of that RBM).
• 4. Iterate (2 and 3) for the desired number of layers, each time
propagating upward either samples or mean activations.
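The greedy layer-wise procedure above can be sketched as below; the layer sizes, epoch count, and toy data are illustrative, CD-1 is used for each layer, and mean activations are propagated upward:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1, rng=None):
    """Toy CD-1 training of one RBM layer; returns its weight matrix."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_visible = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(v0 @ W)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            v1 = (rng.random(n_visible) < sigmoid(h0 @ W.T)).astype(float)
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, sigmoid(v1 @ W)))
    return W

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: each RBM models the layer below's output."""
    weights, rep = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(rep, n_hidden)
        weights.append(W)
        rep = sigmoid(rep @ W)  # propagate mean activations upward
    return weights

data = (np.random.default_rng(1).random((20, 8)) > 0.5).astype(float)
weights = pretrain_dbn(data, [5, 3])  # a 2-layer stack of RBMs
```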
Supervised Training
• Uses logistic regression on top of the DBN
• The logistic regression model is trained in a
supervised way, using labelled virus and
legitimate files
• Logistic regression is a probabilistic, linear
classifier parameterized by a weight
matrix W and a bias vector b.
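A minimal sketch of this supervised layer as binary logistic regression trained by gradient descent; the toy features stand in for the DBN's top-layer representation, and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_logistic(X, y, lr=0.5, epochs=200):
    """Binary logistic regression: p(virus | x) = sigmoid(x @ w + b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad = p - y                    # gradient of the cross-entropy loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Separable toy data: rows 0-1 legitimate (label 0), rows 2-3 virus (label 1).
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 1, 1])
w, b = train_logistic(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(int)
```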
Fine Tuning Parameters

• Number of hidden layers


• Number of processing units per hidden layer
• Learning rate
PERFORMANCE
EVALUATION

03/06/17 CS-152 23
SNAPSHOTS
RESULTS
• The feature extractor is capable of extracting
relevant features from the dataset and the input
file.
• The DBN is capable of classifying a given PE
file as virus or legitimate with an
accuracy of 94.5%.
CONCLUSION
