
Email Forensic Analysis Based on k-means Clustering

Arya P Nampoothiri
Department of Computer Science and Engineering
Sree Buddha College of Engineering, Alappuzha, India
e-mail: aryapradeep013@gmail.com

Minu Lalitha Madhavu
Department of Computer Science and Engineering
Sree Buddha College of Engineering, Alappuzha, India
e-mail: minulalitha@gmail.com

Abstract- Computer crime is increasing steadily and poses a great threat to network security.
Because of its simplicity, email is used in many computer crime activities, which makes email
forensics necessary. This paper proposes an email forensic method using k-means clustering. We
collect and analyze the email data of suspicious users. Filtering and clustering are then applied
to obtain the email communication network graph. Finally, we apply spam filtering to remove spam
mails from the network graph and k-means clustering on the email messages to obtain an accurate
communication graph. The algorithm can analyze the core members and the structure of a criminal
organization.

Keywords- Email forensics, betweenness centrality, k-means clustering, spam messages.

I. INTRODUCTION
Some years ago, computers and networks were a blessing for us. Now they have become a curse
because of the rapid increase in criminal activities committed on computer networks. This creates
a huge threat to network security. To investigate criminal cases, mainly cyber crimes,
intelligence departments use computer forensics.

There are several means of communication over a network, and email is one of them. Email has
replaced letters as the daily means of communication. Due to its simplicity and inherently
vulnerable nature, email can be misused and becomes a tool for cyber crimes. During digital
forensics and investigation, however, email communication remains a major source of evidence. In
the initial stages of an investigation, email evidence helps the law enforcement departments.
This is the context in which email forensics becomes important.

In this paper, we mainly study organized crime investigation for email forensics. The members of
a criminal gang communicate with each other. Here, we propose a method to find the core members
of a criminal gang and the relationships between them using traffic analysis and betweenness
centrality clustering. In addition, based on their communication, we group them into several
clusters using k-means text clustering.

II. RELATED WORKS
Computer forensics is mainly used to combat cyber crimes. In 1984, the FBI laboratory in the
United States set up a computer forensics lab. The IOCE (International Organization on Computer
Evidence) was founded in 1995. Several countries then developed laws and regulations related to
computer forensics and cyber crimes, and started government computer forensics departments to
regulate cyber crimes.

Several excellent computer forensic products have been developed by information security
companies and organizations. These products and tools focus mainly on basic functions such as
email data collection, email decoding and simple relationship graph drawing.

Organized crime networks and other social groups exhibit the same structural characteristics.
The only difference is the purpose of their actions. In a social network graph, the group members
are the nodes and their communications are the edges. The same holds in a criminal network graph.
Thus, to determine the purpose of their actions, we use text clustering. To prevent organized
crime, we should not only explore the characteristics and behavior of the criminal, but also
analyze the gang organization and its structure, especially its core members.

There has been a lot of research on cyber forensics. Hisato Tashiro [2] proposed a method for
finding leadership in email networks using betweenness centralities. Linton C. Freeman [3]
described three measures of vertex centrality (in-degree centralities): degree centrality,
closeness centrality and betweenness centrality. All these centralities measure how active a
particular node is. In weighted email networks, the community structure can be detected by
deleting all the boundaries. Haibo Wang [4] proposed such an algorithm. A composite index named
mediumness, which is derived from betweenness centrality, is used to measure how much an edge
acts as a boundary between two communities. Many researchers have studied small groups and their
structure in communication networks, for example [5, 6, 7].

Ulrik Brandes [8] introduced more efficient and faster algorithms based on a new accumulation
technique that integrates with traversal algorithms. Thus betweenness can be calculated exactly
even for fairly large networks. Ali Tizghadam [9] proposes to use the concepts of resistance
distance and betweenness centrality from graph theory to model the behavior of a communication
network. Mohamed Fazeen [10] presents two methods for the classification of different social
network actors: a context-dependent approach and a context-independent approach.

Jennifer Xu [11] developed a criminal network intelligence analysis system called CrimeNet
Explorer. It

978-1-4799-8792-4/15/$31.00 © 2015 IEEE

Authorized licensed use limited to: Jahangirnagar University. Downloaded on September 05,2021 at 04:30:24 UTC from IEEE Xplore. Restrictions apply.
employs several descriptive measures, such as node-level measures and group-level measures. It
uses an animation approach to analyze and visualize criminal network dynamics over time, together
with concept space methods and social network analysis methods. CrimeNet Explorer includes
network creation, network segmentation, structural analysis and network visualization stages.

Based on Bayesian reasoning, Matthew J [12] and his team members developed a tool to build the
network structure of terrorist organizations. It can accurately analyze the scale of an entire
criminal network, its membership, organizational structure, operational capabilities and
subgroups.

Ahmed Abbasi [13] proposed a method for authorship analysis of web content. It evaluates the
linguistic features of web messages by comparing them with writing styles. Corney M [14] uses an
extended set of e-mail document features, including structural characteristics and linguistic
patterns, together with a Support Vector Machine learning algorithm, to mine e-mail content and
perform authorship analysis.

Rachid Hadjidj [15] designed and implemented a comprehensive software toolkit called IEFAF. It
offers several functionalities including email storing, editing, searching, querying and email
account localization, and it achieves static forensics and analysis. Its limitation is that it
cannot perform complex mail network association analysis. John Haggerty [16] proposes a framework
consisting of acquisition, importation, triage, analysis, and presentation stages. It is used for
the investigation of unstructured email messages.

III. EMAIL FORENSICS USING SNA METHOD
Social Network Analysis (SNA) provides a set of norms and methods to analyze the structure and
properties of a social group. It allows more efficient network analysis than manual methods in
terms of manpower and material resources. Several studies have applied SNA to criminal network
analysis.

YanHua Liu [1] proposed an email forensics analysis using the SNA method. The framework includes
email decoding, attribute extraction, searching, email communication network graph generation,
data filtering, email clustering, finding the suspected criminal organization and identifying the
core members of the criminal organization.

A. Email Data Collection
The email data collection stage consists of collecting the email data of suspicious email ids. We
collect email data from several mail clients. As the standardization of mail storage file formats
and protocols becomes common among nations, it becomes simpler to collect email data from several
mail clients. We have to separate out every single email from each email account. Thus we can
collect all the user message data.

B. Attribute identification
From each email, we can extract certain attributes necessary for email forensics. Common
attributes include from address, to address, cc, subject, received, date, etc. From these
attributes, we can identify the relationships between several email accounts. A large amount of
evidence can also be derived from them.

C. Network Graph generation
From the data and attributes that have been collected, we can create the network communication
graph. Nodes in the graph represent email accounts, and the edges between them represent the
relationship or communication between those accounts. The from address and the to address act as
the two endpoints of an edge. The value of an edge is the number of emails between the from
address and the to address. From this email communication network graph, we can further study the
criminal organizations and their core members.

D. Traffic filtering
After creating the network graph, we apply filtering to the resulting graph. For traffic
filtering, we use the weight of each edge. Based on the application and the criminal case, we set
a threshold value. If the value of an edge is higher than the threshold, it indicates that the
relationship between those two accounts is very close. If the value of an edge is less than the
threshold, we delete that edge.
After filtering, some nodes and edges of the initial network graph remain. These may form several
clusters.

E. Betweenness centrality based clustering
From the clusters, we need to delete the redundant edges. To determine the most distant relative
edges, we use centrality analysis. In this paper, we use betweenness centrality, which is a
measure of a vertex's centrality. A large centrality value indicates the high importance of that
particular node. It also shows that more communication traffic passes through this vertex, so
this vertex can be the center vertex of the small group.
The formula for calculating the betweenness centrality of the vertex $i$ is as follows:

\[ BC(i) = \sum_{j<k} \frac{g_{jk}(i)}{g_{jk}} \]

Here, $g_{jk}(i)$ is the number of shortest paths between vertex $j$ and vertex $k$ that pass
through the vertex $i$, and $g_{jk}$ is the number of shortest paths between vertex $j$ and
vertex $k$.
One efficient centrality calculation algorithm is Brandes' algorithm [8]. Let $G = (V, E)$ be a
graph and $s, t$ be a fixed pair of graph nodes. Let $\sigma_{st}$ be the number of shortest paths
between $s$ and $t$, and $\sigma_{st}(v)$ be the number of those shortest paths that pass through
$v$.
The dependency of a source vertex $s$ on a vertex $v$ is defined as:

\[ \delta_s(v) = \sum_{t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \]

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) 815

and the betweenness centrality is

\[ BC(v) = \sum_{s \neq v \in V} \delta_s(v) \]

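As a small worked check of the dependency and betweenness formulas, consider a three-vertex path graph a - b - c. This graph is an illustrative assumption and is not taken from the email data set. Terms in which $t$ coincides with $s$ or $v$ contribute nothing to the dependency sum.

```latex
% Path graph a - b - c: the only shortest path between a and c passes through b.
\begin{align*}
\delta_a(b) &= \frac{\sigma_{ac}(b)}{\sigma_{ac}} = \frac{1}{1} = 1, &
\delta_c(b) &= \frac{\sigma_{ca}(b)}{\sigma_{ca}} = \frac{1}{1} = 1, \\
BC(b) &= \sum_{s \neq b \in V} \delta_s(b) = \delta_a(b) + \delta_c(b) = 2, &
BC(a) &= BC(c) = 0 .
\end{align*}
```

The sum over source vertices counts the pair (a, c) once from each end. For an undirected graph this value is therefore twice the result of the $j<k$ form, which gives $BC(b) = g_{ac}(b)/g_{ac} = 1$; halving the source-vertex sum reconciles the two.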
Pseudo code for calculating betweenness centrality is shown in Figure 1.
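The graph generation, traffic filtering and betweenness steps of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `build_graph`, `traffic_filter` and `betweenness` are hypothetical, edges are treated as undirected, an edge whose weight equals the threshold is kept, and the centrality routine follows the accumulation idea of Brandes [8] for unweighted graphs, halved so that it agrees with the $j<k$ formula.

```python
from collections import Counter, defaultdict, deque


def build_graph(messages):
    """Count emails per unordered account pair: {frozenset({a, b}): weight}."""
    weights = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            weights[frozenset((sender, recipient))] += 1
    return weights


def traffic_filter(weights, threshold):
    """Keep edges whose email count reaches the threshold (inclusive: an assumption)."""
    adjacency = defaultdict(set)
    for pair, count in weights.items():
        if count >= threshold:
            a, b = tuple(pair)
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def betweenness(adjacency):
    """Vertex betweenness for an unweighted, undirected graph (Brandes-style)."""
    bc = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        preds = {v: [] for v in adjacency}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate the dependencies delta_s(v), farthest vertices first.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was counted from both endpoints, so halve the sums.
    return {v: value / 2.0 for v, value in bc.items()}


# Hypothetical traffic: three strong links and one weak link (a, d).
messages = [("a", "b")] * 3 + [("b", "c")] * 3 + [("c", "d")] * 3 + [("a", "d")]
graph = traffic_filter(build_graph(messages), threshold=2)
scores = betweenness(graph)
# The weak a-d edge is removed, leaving the path a-b-c-d:
# b and c each score 2.0, the end points a and d score 0.0.
```

In this toy case the two middle accounts carry all traffic between the ends of the chain, which is the property the section uses to nominate core members.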

Figure 1: Betweenness centrality pseudo code

The disadvantage of the method in [1] is that it cannot distinguish whether the communication is
criminal oriented or not. To improve on that, we propose a new approach.

IV. EMAIL FORENSICS USING K-MEANS CLUSTERING
To solve the e-mail dynamic relationship analysis problem, especially the analysis of email
communication relationships and structure, we propose a new email forensics analysis framework.
The main stages of this framework include data collection, email attribute identification,
network graph creation, traffic filtering, betweenness centrality clustering, spam filtering and
k-means clustering.
The framework of the proposed system is shown in Figure 2.

Figure 2: Proposed system framework

Here, we perform additional spam filtering and k-means clustering on top of the existing email
forensics framework. Also, for network graph generation we use another tool called GraphSharp.
GraphSharp is a graph layout framework, which contains some layout algorithms and a GraphLayout
control for WPF applications. From the original email data set, we apply subject-based and
content-based filtering methods to retrieve the messages related to the cases specified by the
investigator.

A. Spam filtering
The output graph from clustering will contain spam email accounts. We need to eliminate the spam
accounts to obtain the criminal networks. Eliminating spam accounts saves time and also enables
efficient investigation. The input for spam filtering is the network communication graph obtained
after betweenness clustering.
For spam filtering, we use several attributes such as the number of attachments, attachment file
size, attachment type, the difference between the from and reply-to address fields, and so on.

B. Clustering based on k-means text clustering
The main disadvantage of traffic filtering is that friendship can be treated as a criminal
relation, because a large volume of communication takes place between friends. To avoid this
problem we perform text clustering. Figure 2 shows the k-means clustering process.

Figure 2: (a) Scattered documents before clustering (b) Clustered documents after clustering

For that, we use k-means text clustering [17]. k-means is efficient for clustering large data
sets. The objective of k-means clustering is to minimize the total sum of the squared distances
of every point to its corresponding cluster centroid. The process includes selecting seeds,
forming clusters, computing centroids, and rearranging clusters. This process continues for
several iterations and finally arrives at several clusters based on the email messages.
The main steps in k-means document clustering are shown in Figure 3.
Stop word removal is done to make the indexing process more effective. Then the term
frequency-inverse document frequency (TF-IDF) of each document is calculated, thereby converting
it into a vector space model. Among several similarity metrics, cosine similarity is used here.
Cosine similarity can be calculated by equation (1).

\[ \text{similarity} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{1} \]

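The text clustering steps of this section (stop word removal, TF-IDF weighting, cosine similarity as in equation (1), and the k-means assignment loop) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the stop word list, the sample messages, the explicit `seeds` argument and all function names are hypothetical, tokens are split on whitespace, and centroids are taken as the component-wise mean of the member vectors.

```python
import math
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "at", "on", "for", "in"}


def tfidf_vectors(documents):
    """Turn raw texts into sparse TF-IDF dictionaries after stop word removal."""
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors


def cosine_similarity(a, b):
    """Equation (1): dot product divided by the product of the Euclidean norms."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, seeds, max_iter=100):
    """Assign each vector to its most similar centroid until no label changes."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(len(centroids)),
                          key=lambda k: cosine_similarity(vec, centroids[k]))
                      for vec in vectors]
        if new_labels == labels:
            break  # no document changed its cluster
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = centroid(members)
    return labels


# Hypothetical message bodies; documents 0 and 2 are used as the initial seeds.
docs = [
    "transfer the money to the offshore account",
    "send money to account tonight",
    "football match on sunday",
    "sunday football tickets for the match",
]
labels = kmeans(tfidf_vectors(docs), seeds=[0, 2])
# The two money-related messages share one cluster and the two football-related
# messages share the other: labels == [0, 0, 1, 1].
```

Passing the seeds explicitly keeps the sketch deterministic; a real run would need some seed selection strategy, which the text does not specify.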
Based on the similarity value, each document is assigned to a cluster. This process repeats until
no object changes its cluster.

Figure 3: k-means document clustering process

From the clusters thus formed, we can easily find the suspected criminal organization and its
core members. Through further investigation, the role of each criminal can be identified.

V. RESULTS AND DISCUSSIONS
The network graph obtained with the SNA method is not precise since it does not perform any
filtering methods. Graph 1 shows the results obtained in four different trials. Analysis of the
graph shows that the network size decreases by about half.

VI. CONCLUSION
Email forensics is needed to combat organized crime. This paper proposed an effective email
forensics method, which includes email data collection, network graph creation, traffic
filtering, betweenness centrality clustering, spam filtering and k-means clustering. The new
method can analyze a criminal organization's structure and core members.

ACKNOWLEDGMENT
First of all, we thank the Almighty for giving us the strength and courage to take up a relevant
area and to do research on it. We thank everyone who directly or indirectly helped us in doing
the thesis.

REFERENCES

[1] YanHua Liu, GuoLong Chen, and Lili Xie, "An Email Forensics Analysis Method Based on Social
Network Analysis," 2013 International Conference on Cloud Computing and Big Data, IEEE, 2013,
DOI 10.1109/CLOUDCOM-ASIA.2013.38.
[2] Hisato Tashiro, Antonio Lau, Junichi Mori, et al., "Email Network Analysis for Leadership,"
Proceedings of 2011 IEEE IEEM.
[3] Linton C. Freeman, "A Set of Measures of Centrality Based on Betweenness," Sociometry,
Vol. 40, No. 1 (Mar. 1977), pp. 35-41.
[4] Haibo Wang, Ning Zheng, Ming Xu, Yanhua Guo, "Detecting Community Structure in Weighted
Email Network," Proceedings of 1st International Symposium on Computer Network and Multimedia
Technology, Wuhan, China, 2009.
[5] M. E. J. Newman, M. Girvan, "Finding and evaluating community structure in networks,"
Phys. Rev. E 69, 026113, 2004.
[6] Eric D. Kolaczyk, David B. Chu, Marc Barthélemy, "Group betweenness and co-betweenness:
Inter-related notions of coalition centrality," Social Networks 31 (2009) 190-203.
[7] S. Borgatti, M. Everett, "A graph-theoretic perspective on centrality," Social Networks 28,
466-484, 2006.
[8] Ulrik Brandes, "A faster algorithm for betweenness centrality," Journal of Mathematical
Sociology, 2001, 25(2): 163-177.
[9] Ali Tizghadam and Alberto Leon-Garcia, University of Toronto, "Betweenness Centrality and
Resistance Distance in Communication Networks," IEEE Network, 2010, 6(24): 10-16.
[10] Mohamed Fazeen, Ram Dantu, Parthasarathy Guturu, "Identification of leaders, lurkers,
associates and spammers in a social network: context-dependent and context-independent
approaches," Social Network Analysis and Mining, 2011, 1(3): 241-254.
[11] Jennifer Xu, Byron Marshall, Siddharth Kaza, Hsinchun Chen, "Analyzing and Visualizing
Criminal Network Dynamics: A Case Study," Lecture Notes in Computer Science, Springer
Berlin/Heidelberg, 2004: 359-377.
[12] Matthew J. Dombroski, Kathleen M. Carley, "NETEST: Estimating a Terrorist Network's
Structure," Computational & Mathematical Organization Theory, 8, 235-241, 2002.
[13] Ahmed Abbasi, Hsinchun Chen, "Applying authorship analysis to extremist-group Web forum
messages," IEEE Intelligent Systems, 2005, 20(5): 67-75.
[14] O. de Vel, A. Anderson, M. Corney, G. Mohay, "Mining e-mail content for author
identification forensics," SIGMOD Record, 2001, 30(4): 55-64.
[15] Rachid Hadjidj, Mourad Debbabi, Hakim Lounis, Farkhund Iqbal, Adam Szporer, Djamel
Benredjem, "Towards an integrated e-mail forensic analysis framework," Digital Investigation,
2009, 5(3-4), 124-137.
[16] John Haggerty, Alexander J. Karran, David J. Lamb, Mark Taylor, "A Framework for the
Forensic Investigation of Unstructured Email Relationship Data," International Journal of Digital
Crime and Forensics, 2011, 3(3), 1-18.
[17] Manjot Kaur and Navjot Kaur, "Web Document Clustering Approaches Using K-Means Algorithm,"
International Journal of Advanced Research in Computer Science and Software Engineering,
Volume 3, Issue 5, May 2013.
