Spam Filtering Email Classification (SFECM) using Gain and Graph Mining Algorithm


M. K. Chae¹, Abeer Alsadoon¹, P.W.C. Prasad¹, A. Elchouemi²
¹ Charles Sturt University Study Centre, Sydney, Australia
² Walden University, USA

Abstract— This paper proposes a hybrid spam email classifier that uses the context-based email classification model as its main algorithm, complemented by an information gain calculation to increase spam classification accuracy. The proposed solution consists of three stages: email pre-processing, feature extraction and email classification. Research has found that the LingerIG spam filter is highly effective at separating spam emails from clusters of homogeneous work emails. Experimental results also showed a spam filtering accuracy of 100%, as recorded by the team of developers at the University of Sydney. The study has shown that implementing the spam filter in the context-based email classification model is feasible. The experiment in this study confirmed that the spam filtering aspect of the context-based classification model can be improved.

Keywords— email classification; graph mining algorithm; spam; email classifier

I. INTRODUCTION
Email is a cost-effective method of communication found in all areas of industry, and the education industry is no exception. The workforce in the education industry spends a fair amount of time in front of a computer following up on emails. This is even more so for jobs that deal with a high volume of emails each day, such as administrators in the education industry. Managing incoming email is critical to many people because emails can announce important meetings, work messages, lunches, industry-related information and upcoming events that many cannot afford to miss.

Email is also a means of transferring important documents in an education agency. These documents often contain international students' private information and scanned copies of applications for admission to education institutions such as universities, TAFEs and private colleges. At present, important work-related emails are still found in the spam folder. There is therefore still a need to improve the accuracy of email classifiers using new and existing algorithms.

One possible way to improve the spam classification algorithm is to use a spam filter named LingerIG, implemented in 2003 in an email classification system named Linger [1]. This spam filter works by calculating information gain. The problem with this solution, however, is its accuracy in classifying non-spam emails into folders. Of the many email learners used by Linger, at best Widrow-Hoff gives unstable accuracy that ranges between 48.50% and 82.40% [1] when classifying emails into folders. A current solution, the context-based email classification model [2], has been developed to better classify emails into homogeneous groups.

This paper aims to combine a spam filter that uses information gain calculation with the context-based email classification model, with the aim of raising spam email classification accuracy to 100%. The proposed solution first uses the spam filter to remove all spam emails from the inbox. The context-based email classification model can then classify the remaining emails into several folders.

This paper is organized as follows: Sections I and II present the introduction and literature review. Section III describes the proposed solution and Section IV presents the results and discussion. The conclusion can be found in Section V.

II. LITERATURE REVIEW
A study of the literature on automated email classification found at least four different types of approach: the traditional approach, the ontology-based approach, the graph-mining approach and the neural-network approach. Among the many solutions proposed by other researchers, Linger and the context-based email classification model were notable.

A. Traditional Approaches to email classification
Text classification algorithms have been adopted in email classification systems [3][4][5]. These include the Naïve Bayes algorithm [4] and the Support Vector Machine [3], which tokenize the email and calculate its similarity to either spam or other useful types of email.

An experiment conducted by Alsmadi and Alhami [3] found that removing stop words from emails improves the accuracy of email classification. Jason D. M. Rennie [4] performed email classification using a Naïve Bayes algorithm in an email classification system named ifile. An email classification method named the Three-Phase Tournament method, devised by Sayed et al. [5], showed very unstable accuracy ranging from 2% to 95%.

B. Ontology-based Approaches to email classification

978-1-5090-4228-9/17/$31.00 ©2017 IEEE

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:51 UTC from IEEE Xplore. Restrictions apply.
C. Graph-mining approaches to email classification
Graph-mining approaches to email classification take advantage of semantic features and structure in emails by converting emails into graphs and matching template graphs with the graphs made from each email [8][9][10]. A typical graph mining algorithm converts emails into graphs. Substructures are then extracted from the graphs and pruned according to parameters, so that only representative substructures remain. Substructures are ranked so that, if an email graph matches more than one representative substructure, the email goes into the folder of the matched representative with the higher rank.

eMailSift is a graph mining algorithm devised by Aery and Chakravarthy [8]. Aery and Chakravarthy reported that email classification accuracy increased from 80% to 95% as the number of input emails increased from 60 to 370 [8]. In contrast, a later work by Chakravarthy et al. [9] named m-InfoSift showed that email classification accuracy decreased as the number of folders increased: accuracy fell from 100% to 91% as the number of folders increased from 2 to 4 [9].

D. Current Best Selected Solution
A graph-mining algorithm named the Context-based Email Classification System was proposed by Wasi et al. [2]. It consists of a graph mining algorithm and an Event Identification System. As shown in Table III, the accuracy of email classification was 80% when 300 emails were used for training. Accuracy rose to 85% when 750 emails were used [2], reached 88% with 1500 emails, and became 93% with 3000 training emails. As this result shows, it took 10 times as many emails to raise accuracy from 80% to 93%. A major flaw of this system is therefore the very large number of emails it would require to reach an accuracy of 100%.

In addition, the context-based email classification model does not have a spam filter, even though the model addresses clustering of homogeneous emails into groups. The proposed work therefore needs to address this gap to improve the model. An email classification system named Linger was developed by Jason Clark, Irena Koprinska and Josiah Poon at the University of Sydney. Linger uses a neural network [1] and a spam filter named LingerIG (Information Gain). As shown in Table III, the result of their experiment showed that when LingerIG was used, Linger achieved 100% accuracy at spam email classification.

III. PROPOSED SOLUTION
The proposed solution is named the Spam Filtered Email Classification Model (SFECM). It is based on the Context-Based Email Classification Model and filters spam before the actual email classification. The activity at each stage of email classification is shown in Figure 1.

Fig. 1. Flowchart of Proposed Solution.

A. SFECM: Components and Implementation
This system consists of three stages: Email Preprocessing, Feature Extraction and Email Classification. In the email preprocessing stage, the proposed system runs a POS tagger on each email to turn the email text into email features. In the feature extraction stage, the proposed system filters spam from the set of input emails. Then, from the filtered emails, sign-off words, greeting words and keywords are extracted to form an email graph. At this stage, template graphs are updated using the new email graphs. In the email classification stage, template graphs are ranked and assigned to represent the relevant folder. Email graphs are then matched to the representative template graphs, and each email is placed in the folder of the representative template graph that its graph matches most closely. A detailed diagram of the proposed work is presented in Figure 2.

Fig. 2. Detailed Diagram of Proposed Solution.
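The matching step of the email classification stage can be illustrated with a small sketch. The paper does not state how an email graph is represented or how its match against a template graph is scored, so everything below is an assumption made for illustration: a graph is reduced to a set of (feature kind, word) pairs, the match score is the Jaccard overlap between two such sets, and the order of the `templates` dictionary stands in for the template ranking. The names `build_email_graph`, `jaccard` and `best_folder` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumption: a graph is simplified to a set of (feature kind, word) pairs.
Edge = Tuple[str, str]


def build_email_graph(greetings: Iterable[str],
                      signoffs: Iterable[str],
                      keywords: Iterable[str]) -> Set[Edge]:
    """Collect greeting words, sign-off words and keywords into one graph."""
    graph: Set[Edge] = set()
    for word in greetings:
        graph.add(("greeting", word.lower()))
    for word in signoffs:
        graph.add(("signoff", word.lower()))
    for word in keywords:
        graph.add(("keyword", word.lower()))
    return graph


def jaccard(a: Set[Edge], b: Set[Edge]) -> float:
    """Overlap score in [0, 1]; an assumed stand-in for graph matching."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_folder(email_graph: Set[Edge],
                templates: Dict[str, Set[Edge]]) -> Optional[str]:
    """Return the folder whose template graph matches the email graph most.

    Ties go to the template listed first, which stands in for the ranking
    of template graphs. Returns None when nothing overlaps at all.
    """
    best, best_score = None, 0.0
    for folder, template in templates.items():
        score = jaccard(email_graph, template)
        if score > best_score:
            best, best_score = folder, score
    return best


templates = {
    "admissions": build_email_graph(["Dear"], ["Regards"], ["application", "offer"]),
    "events": build_email_graph(["Hi"], ["Cheers"], ["seminar", "lunch"]),
}
email = build_email_graph(["Dear"], ["Regards"], ["application", "visa"])
print(best_folder(email, templates))
```

In this example the email shares three of five distinct features with the "admissions" template and none with "events", so it is placed in "admissions". A full implementation would match real graph substructures rather than flat feature sets.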

B. Spam Filter: Algorithm
The proposed solution's spam filtering algorithm is presented below:

Algorithm (1): Proposed Spam Filter
INPUT: Test email samples (E) = {E1, E2, … , EN}
OUTPUT: Classified emails (E') = {E1', E2', … , EN'}
BEGIN
Step 1: USE a set of test emails as sample emails.
Step 2: INPUT a sample email into the proposed solution's spam filter.
Step 3: Start a loop
    FOR each email feature in EN
        IF the email feature matches a feature in the spam feature list, increment the count.
    END FOR
    Keep the counted number of matches as numMatchSpam: the number of email features in the email that match features from the spam filter's list of spam email features.
Step 4: Start another loop
    FOR each email feature in EN
        IF the email feature matches a feature in the non-spam (work) feature list, increment the count.
    END FOR
    Keep the counted number of matches in an integer variable numMatchWork: the number of email features in the email that match features from the spam filter's list of work email features.
Step 5: Calculate Entropy/Impurity
    Calculate the proportion of spam email features in the email by dividing the number of matched spam email features by the number of email features in the email. Call this impuritySpam (impSpam).
        impSpam = numMatchSpam / numberOfFeaturesInEmail
    Calculate the proportion of work email features in the email in the same way. Call this impurityWork (impWork).
        impWork = numMatchWork / numberOfFeaturesInEmail
Step 6: Move the email to the spam folder or keep it in the inbox.
    IF impSpam > Average Information Gain of all spam emails
        Move the email to the spam folder directory.
    IF impSpam > impWork
        Move the email to the spam folder directory.
    IF impWork > Average Information Gain of all non-spam emails
        Keep the email in the inbox.
    IF impWork > impSpam
        Keep the email in the inbox.
Step 7: End of Algorithm

The following diagrams visualize the algorithm.

Fig. 3. Step 1: Training Spam Filter using set of test emails.
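Steps 3 to 6 of Algorithm (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Java implementation. The function names `match_ratio` and `classify_email` are hypothetical, and the two average information gain values are passed in as precomputed thresholds because the paper does not show how they are derived. The algorithm lists four independent conditions in Step 6 without saying which wins when they conflict; the sketch assumes the spam rules are checked first and that an email matching no rule stays in the inbox.

```python
from typing import Iterable, Set


def match_ratio(features: Iterable[str], feature_list: Set[str]) -> float:
    """Steps 3-5: fraction of an email's features found in feature_list."""
    features = list(features)
    if not features:
        return 0.0  # assumption: an email without features matches nothing
    num_match = sum(1 for f in features if f in feature_list)
    return num_match / len(features)


def classify_email(features: Iterable[str],
                   spam_features: Set[str],
                   work_features: Set[str],
                   avg_ig_spam: float,
                   avg_ig_work: float) -> str:
    """Step 6: return "spam" or "inbox" for one email.

    avg_ig_spam and avg_ig_work stand for the average information gain of
    all spam and all non-spam training emails; they are assumed inputs.
    """
    features = list(features)
    imp_spam = match_ratio(features, spam_features)
    imp_work = match_ratio(features, work_features)

    # Assumption: spam rules take precedence over inbox rules.
    if imp_spam > avg_ig_spam or imp_spam > imp_work:
        return "spam"
    if imp_work > avg_ig_work or imp_work > imp_spam:
        return "inbox"
    return "inbox"  # assumption: undecided emails are kept in the inbox


spam_list = {"free", "prize", "click"}
work_list = {"meeting", "invoice"}
print(classify_email(["free", "prize", "meeting", "click"],
                     spam_list, work_list, 0.5, 0.5))  # prints "spam"
```

Here three of the four features match the spam list, so impSpam is 0.75, which exceeds both the assumed threshold of 0.5 and impWork (0.25).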

Fig. 4. Step 1: Make SPAM Feature List and WORK Feature List.

Fig. 5. Steps 2, 3 & 5: Match Extracted Words to SPAM Feature List.

Fig. 6. Steps 4 & 5: Match Extracted Words to WORK Feature List.

Fig. 7. Step 6: Move Email to SPAM Folder.

IV. RESULT AND DISCUSSION
This section tests the implementation of the proposed solution. The system was implemented as a Java project. The project reads each eml file into string objects so that email features can be organized into arrays.

The proposed system depends on the email features extracted from each eml file. Each predefined activity uses the extracted email features, for example to calculate the Average Information Gain of all spam emails and the Average Information Gain of all work (non-spam) emails. Email features are used to populate a list of array lists and to train the spam filter. One possible experiment is to compare the accuracy of the solution with the IG spam filter against the accuracy of the solution without it.

As can be seen in Fig. 8, Fig. 9, Table I and Table II, the experiment used ten groups of emails, each consisting of 10 eml files. Each group has five spam emails and five work emails. The tables show the number of correctly classified spam emails and work emails for both solutions.

The experimental results of the current solution and the proposed solution were compared. The accuracy of each solution at classifying spam and work emails was measured by the number of emails correctly put into the spam folder and the work email folder. For example, 4 correctly classified spam emails out of 5 gives an accuracy of 80%. To show the effect of data size on processing time, the size of each email group is listed in the second column from the left.

As can be seen in Fig. 8 and Table II, a comparison between the proposed solution and the existing solution shows that the proposed solution provides 100% accuracy when classifying spam emails from the sets of emails. This result shows that adopting the spam filter in the existing solution reinforces its capability to accurately separate spam emails from work emails.

Fig. 8 also shows that the accuracy of the proposed solution on work (non-spam) emails is similar across the email sets: every corpus reaches 100% accuracy when classifying work (non-spam) emails from a mixed set of emails.

Fig. 8 shows the improvement in the accuracy of work email classification from adopting the IG spam filter. Fig. 9 shows that the processing times of the existing solution and the proposed solution do not differ significantly. Data size and processing time also do not appear to depend on each other.

TABLE I. RESULTS OF SAMPLE EMAIL SPAM CLASSIFICATION USING INFORMATION GAIN CALCULATION – CURRENT BEST SOLUTION
Implementation Result of First Step in Current Best Solution: Email Classification (IG Spam Classifier)

Tested corpus (eml files)   Spam emails                   Work emails                   Time
Corpus     Data size (MB)   Correct   Total   Acc. (%)    Correct   Total   Acc. (%)    (sec.)
Corpus1    6.49             4         5       80          4         5       80          195.33
Corpus2    1.57             4         5       80          3         5       60          382.56
Corpus3    8.78             4         5       80          3         5       60          680.94
Corpus4    5.33             3         5       60          4         5       80          593.95
Corpus5    3.45             3         5       60          4         5       80          345.30
Corpus6    13.1             4         5       80          3         5       60          154.65
Corpus7    18.8             3         5       60          3         5       60          166.30
Corpus8    0.8              3         5       60          4         5       80          263.40
Corpus9    3.28             4         5       80          3         5       60          433.79
Corpus10   2.49             3         5       60          4         5       80          335.12

Fig. 8. Result of Accuracy – Work Emails.

TABLE II. RESULTS OF SAMPLE EMAIL SPAM CLASSIFICATION USING INFORMATION GAIN CALCULATION – PROPOSED SOLUTION
Implementation Result of First Step in Proposed Solution: Email Classification (IG Spam Classifier)

Tested corpus (eml files)   Spam emails                   Work emails                   Time
Corpus     Data size (MB)   Correct   Total   Acc. (%)    Correct   Total   Acc. (%)    (sec.)
Corpus1    6.49             5         5       100         5         5       100         189.46
Corpus2    1.57             5         5       100         5         5       100         389.22
Corpus3    8.78             5         5       100         5         5       100         675.34
Corpus4    5.33             5         5       100         5         5       100         714.17
Corpus5    3.45             5         5       100         5         5       100         353.93
Corpus6    13.1             5         5       100         5         5       100         197.34
Corpus7    18.8             5         5       100         5         5       100         198.22
Corpus8    0.8              5         5       100         5         5       100         159.64
Corpus9    3.28             5         5       100         5         5       100         121.29
Corpus10   2.49             5         5       100         5         5       100         118.60

Fig. 9. Result of Processing Time.

TABLE III. COMPARATIVE STUDY – LINGER AND CONTEXT-BASED EMAIL CLASSIFICATION MODEL

                                                        Accuracy                                        Processing time
S.N  Method name       Author                           Correctly classified emails        Percentage   Time-        Manual step
                                                                                           accuracy     consuming?   automatable?
1.   Linger            Clark, J., Koprinska, I.,        All 2893 emails in the corpus      100%         -            Y
                       & Poon, J., 2003                 "LingSpam" were classified
                                                        correctly; 481 emails were spam
                                                        and 2412 were legitimate emails.
2.   Context-based     Shaukat Wasi, Syed Imran Jami    240 of 300 emails were             80%          Y            Y
     email classifi-   and Zubair Ahmed Shaikh, 2016    correctly classified.
     cation model

V. CONCLUSION
This paper has identified that 100% accuracy in the spam classification of email systems is still an unmet need. The project has drawn upon two existing email classification systems, the context-based email classification system and Linger, to address this need. The context-based email classification system begins by preprocessing each email with a POS tagger. It then extracts several email features to transform emails into graphs, and matches these graphs to representative graphs so that each email is classified into the folder represented by the best-matching representative graph. Linger implements an information gain classifier for filtering spam and uses a neural network to classify emails into homogeneous clusters. The proposed system adopts the spam filter from Linger to reinforce the accuracy needed to separate spam emails without any mistakes. The proposed solution provides 100% accuracy at filtering spam from a set of mixed emails. As far as the experiment shows, processing time differs insignificantly between using and not using the spam filter. It is nevertheless important to stress the need to reduce the processing time of spam classification, because a processing time of 0.1 second remains an unmet need in this solution.

REFERENCES
[1] J. Clark, I. Koprinska and J. Poon, "Linger - A Smart Personal Assistant for E-Mail Classification", in International Conference on Artificial Neural Networks, 2003, pp. 274–277.
[2] S. Wasi, S. Jami and Z. Shaikh, "Context-based email classification model", Expert Systems, vol. 33, no. 2, pp. 129–144, 2015.
[3] I. Alsmadi and I. Alhami, "Clustering and classification of email contents", Journal of King Saud University - Computer and Information Sciences, vol. 27, no. 1, pp. 46–57, 2015.
[4] J. Rennie, "ifile: An Application of Machine Learning to E-Mail Filtering", in Proceedings of the KDD (Knowledge Discovery in Databases) Workshop on Text Mining, 2000.
[5] S. Sayed, "Three-Phase Tournament-Based Method for Better Email Classification", International Journal of Artificial Intelligence & Applications, vol. 3, no. 6, pp. 49–56, 2012.
[6] M. Fuad, D. Deb and M. Hossain, "A trainable fuzzy spam detection system", in 7th International Conference on Computer and Information Technology, 2004.
[7] S. Youn and D. McLeod, "Spam Email Classification using an Adaptive Ontology", JSW, vol. 2, no. 3, 2007.
[8] M. Aery and S. Chakravarthy, "eMailSift: Email Classification Based on Structure and Content", Data Mining, Fifth IEEE Int. Conf., pp. 18–25, 2005.
[9] S. Chakravarthy, A. Venkatachalam and A. Telang, "A graph-based approach for multi-folder email classification", Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 78–87, 2010.
[10] T. Ayodele, S. Zhou and R. Khusainov, "Email Classification Using Back Propagation Technique", Int. J., vol. 1, no. 1, pp. 3–9, 2010.
[11] D. Patil and Y. Dongre, "A Clustering Technique for Email Content Mining", Int. J. Comput. Sci. Inf. Technol., vol. 7, no. 3, pp. 73–79, 2015.
[12] K. Taghva, J. Borsack, J. Coombs, A. Condit, S. Lumos and T. Nartker, "Ontology-based classification of email", Proc. ITCC 2003. Int. Conf. Inf. Technol. Coding Comput., pp. 194–198, 2003.
