
Search Model for Searching the Evidence in Digital Forensic Analysis

Sweedle Mascarnes
Department of Computer Engineering
St. Francis Institute of Technology
Mumbai, India
swedle.mascarenhas@gmail.com

Prajyoti Lopes
Department of Information Technology
St. Francis Institute of Technology
Mumbai, India
prajyotilopes@gmail.com

Pratap Sakhare
Department of Information Technology
St. John College of Engineering and Technology
Mumbai, India
prataps@sjcet.co.in

Abstract— Digital forensics investigation involves the examination of digital devices to gather evidence associated with a crime. The amount of data that needs to be processed in a typical forensic investigation today is immense and extremely diverse, so finding evidence relevant to the case is difficult and time-consuming. This research work aims at designing a search model that hunts for pertinent evidence by analysing large amounts of data, enabling faster and more effective processing in criminal investigation. The proposed work allows the investigator to semantically cluster the data collected from digital devices based on keywords specified by the investigator or suggested by the system. Experiments were conducted on a dummy crime dataset to test the accuracy and scalability of the proposed system. Experimental results show that subject suggestion improves accuracy and thus speeds up the process of searching for evidence.

Keywords - digital forensic investigation; crime investigation; clustering; subject suggestion.

I. INTRODUCTION

Digital forensics has become an intrinsic activity in crime investigation. It is the process of inspecting digital media with the aim of generating digital evidence linked to an incident under investigation. According to Carrier et al. [1], digital evidence of an incident is any digital data that supports or refutes a hypothesis about the incident. The six basic steps defined by the Digital Forensics Research Workshop (DFRWS) and generally followed in a forensic investigation are Identification, Preservation, Collection, Examination, Analysis and Presentation [2].

Of all these phases, the Examination phase is the most difficult and time-consuming. In this phase, investigators have to go through hundreds of thousands of unstructured documents gathered from the suspect's digital devices to fetch the relevant evidence. Identifying significant evidence by manual browsing is a time-consuming process, so there is a need for automated search systems. Traditionally, forensic examiners rely on the search engines provided by operating systems or by existing DFI tools such as Guidance EnCase [3], Access Data Forensic Toolkit (FTK) [4] and Sleuth Kit / Autopsy [5] to find evidence. However, there are some problems with today's forensic tools [6]. The search techniques provided by current DFI tools include file name search, text string based search, regular expression search and approximate matching search. Digital forensic tools existing in the market work only on a specific type of data source and are limited to specific tasks such as file-system analysis, network protocol analysis and volatile memory analysis. Also, these tools do not scale to large data sets (i.e. gigabytes and terabytes).

Data mining is a good solution for handling massive volumes of data. Data mining is the process of extracting knowledge from large data sets, and employing data mining techniques to aid DFI has many potential advantages [7]. If applied properly, it meets the three-fold goal of reducing system and human processing time, improving the effectiveness and quality of the data analysis, and reducing cost. Various data mining techniques have been used by law enforcement agencies in the past. Of these, clustering has proved to be one of the most powerful techniques for enabling forensic investigators to explore large databases quickly and efficiently.

However, these tools and clustering algorithms are applied directly to the suspect's dataset without any knowledge of the content within. Hence, the results obtained by these tools and clustering algorithms lead to false positives and false negatives.

For faster and more effective processing of data within criminal investigations, this research work aims at designing a search model that can automatically analyze large quantities of data and allows the investigator to semantically cluster the data set on the suspect's hard drive based on keywords specified by the investigator or suggested by the system. The proposed system identifies the top frequent keywords that commonly appear in the dataset and suggests these keywords to the investigator. The investigator can either select the subjects/keywords suggested by the proposed model or manually enter his own search query. The proposed model uses a novel semantic document clustering algorithm that groups all documents into clusters, where each cluster represents a keyword/subject specified by the investigator or suggested by the system. In this way, subject suggestion helps the investigator to initiate

978-1-4673-7910-6/15/$31.00 © 2015 IEEE 1353
the process of finding the evidence from an unknown dataset. Experiments were conducted on a dummy crime dataset to test the Accuracy and the Scalability of the proposed system. Experimental results show that the subject suggestion model improves the Accuracy and thus enhances the performance of the existing DFI framework.

II. RELATED WORK

Data mining techniques are specifically designed to find and retrieve relevant data amongst voluminous amounts of data [8]. Hence, they are able to support digital investigations. The broad classification of data mining techniques and their applications in the digital forensic field is explained in the following sections.

A. Association Rule Mining:

Association rule mining is one of the vital techniques of data mining and is receiving increasing attention in forensic investigation. Association rules are if/then statements that help to uncover relationships between seemingly unrelated data. Apriori and FP-growth have been among the most famous algorithms used for frequent pattern mining. D. Bansal and L. Bhambhu [9] have used the Apriori algorithm to extract frequent patterns occurring in a dataset. In [10], the authors have presented a unified data mining solution for identifying the author of a document by studying the linguistic and computational characteristics of written documents.

B. Supervised Learning/Classification:

Classification techniques are applied to organize different crime entities according to their common properties. In [11] and [12], a Support Vector Machine (SVM) has been applied to determine the author, and the gender of the author, of an e-mail. S. Yamuna and N. Bhuvaneswari [13] have used a decision tree algorithm to analyze and predict future crime.

In summary, text classification approaches all depend on the availability of training data for the purpose of training their classifiers. Therefore, text classification fails to address the subject/keyword-based clustering problem because of the lack of training data in DFI.

C. Unsupervised Text Clustering:

Clustering is a well-known data mining technique, yet there have been few research articles reporting the use of clustering techniques in the DFI field.

Link analysis is a technique used in the investigation of criminal activity to identify relationships between organizations, people and transactions. Chen et al. [14] have introduced a crime data mining framework that identifies relationships between different crime entities by applying different clustering techniques.

The research presented in [15] has used a Kohonen Self-Organizing Map to cluster search results retrieved during text string search, which helps investigators locate relevant documents more quickly. In [16], SOM-based algorithms have been used for clustering files according to creation dates/times and file extensions, making analysis easier for the forensic investigator once a specific pattern is found.

R. Hadjidj et al. [17] have proposed an email forensic analysis tool which helps to gather evidence related to a crime for use in a court of law. The problem of clustering e-mails for forensic analysis has also been addressed in [18], where a kernel-based variant of K-means was applied. The forensic examiner analyses only the clusters which are relevant to the admitted case, which speeds up the forensic analysis process.

In all the papers discussed above, the number of clusters needs to be specified explicitly before applying the clustering technique. In practice, specifying the number of clusters explicitly is not easy, because the forensic investigator does not know the amount or the nature of the data gathered from the suspect's digital devices. This is one of the biggest limitations of most clustering algorithms.

D. Summary:

To summarize, researchers have developed several data mining techniques to automate the investigation process; however, none of the existing data mining techniques takes advantage of the information the investigator initially provides while accounting for the fact that no training data is available.

The research work presented in [19] focuses on a subject-based semantic document clustering algorithm that allows an investigator to cluster the data gathered from a suspect's digital devices according to the subject initially defined by the investigator (e.g. hacking, child pornography etc.). The limitation of this method is that investigators may often fail to give an appropriate search query, as they are unlikely to have advance knowledge of all criminal events that have already occurred on a suspect's computer.

The presented research work overcomes this limitation by introducing a novel search model integrated with a keyword suggestion feature. The objective of the proposed framework is to help investigators efficiently identify relevant information from an unknown dataset in order to improve the performance of the existing DFI framework.

III. PROPOSED WORK

This section depicts the workflow from the collection of the data from the suspect's digital devices to finding the relevant evidence, shown in Figure 1 and Figure 2. The data from the suspect's digital devices found at the crime scene is collected and brought to the forensic department for further investigation. This data is preprocessed by applying various preprocessing techniques, as shown in Figure 1.

1354 2015 International Conference on Green Computing and Internet of Things (ICGCIoT)
Figure 1 Preprocessing of the data collected from suspect's digital devices

A. Pre-processing:
Pre-processing is an important task and a critical step in information retrieval and text mining. Pre-processing separates the text into individual words. The following pre-processing procedures are applied to the suspect's document set.

1) Stop Words Removal:
Stop words are linguistic words which carry no information. In this phase, pronouns (I, am, he, she, they), prepositions (to, for, a, in, the) and conjunctions (so, and, or, but) are removed, since they have no effect on the text mining process. Similarly, special characters and punctuation are also removed, and upper case characters are converted to lower case.

2) Stemming:
Stemming is the process of removing the affixes (prefixes and suffixes) of each word. For example, consider convert, converts, converted and converting: this set of features is conflated into a single feature by removing the different suffixes -s, -ed and -ing to get the single feature convert.

In the present work, stemming is done with the help of the WordNet stemmer. WordNet [20] is an online dictionary for the English language which gives synonyms, definitions and the several senses of a word. The WordNet dictionary provides the WordNet stemmer class, which is used to stem a word to its root form.

The working of the WordNet stemmer is as follows:
• Morph the word by applying the morphing rules provided by the WordNet stemmer interface, then check whether the morphed word exists in the WordNet dictionary.
• If the morphed word (the root form of the word) exists in the dictionary, return the morphed word.
• If the morphed word does not exist in the dictionary, return the original word.

3) Tokenization:
Tokenization is the process of breaking a sentence into words, phrases, symbols, or other meaningful elements called tokens. The tokenization process separates the stream of text into tokens by using white spaces and punctuation marks as separators. For example, the string "Ram and Sham are best friends" will yield four tokens after stop word removal: "Ram", "Sham", "best" and "friends".

B. Indexing:
The objective of indexing is to compute the weight of all the distinct terms obtained after preprocessing. The weight is assigned based on the frequency of occurrence of the term in the document and the number of documents that contain the term. The two main components that affect the significance of a term in a document are the Term Frequency (TF) factor and the Inverse Document Frequency (IDF) factor [21]. TF-IDF is a very popular weight assignment technique used in the text mining and information retrieval fields.

After pre-processing, the system identifies the top frequent keywords that commonly appear in the dataset, and these keywords are suggested to the investigator. The investigator can either select the keywords suggested by the proposed model or manually enter his own search query. The proposed model uses a novel semantic document clustering algorithm that groups all documents into clusters, where each cluster represents a keyword specified by the investigator. The basic components of the proposed system, shown in Figure 2, are explained in detail in the following subsections.
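As a concrete illustration of the pre-processing and indexing steps described above, the sketch below tokenizes a small document set, removes stop words, and ranks the remaining terms with TF-IDF to produce keyword suggestions. This is a minimal stand-in rather than the paper's implementation: the stop-word list, the sample corpus and the exact scoring details are illustrative assumptions only.

```python
import math
from collections import Counter

# Illustrative stop-word list (the paper removes pronouns, prepositions
# and conjunctions; this set is a small sample, not the full list).
STOP_WORDS = {"i", "am", "he", "she", "they", "to", "for", "a", "in",
              "the", "so", "and", "or", "but", "is", "are", "of"}

def preprocess(text):
    """Tokenize on whitespace, lowercase, strip punctuation, drop stop words."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

def tf_idf_scores(documents):
    """Accumulate a TF-IDF score per term, summed over all documents."""
    docs = [preprocess(d) for d in documents]
    n_docs = len(docs)
    # Document frequency: number of documents containing the term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, freq in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += (freq / len(doc)) * idf
    return scores

def suggest_keywords(documents, k=10):
    """Return the top-k terms by TF-IDF score, highest first."""
    return [term for term, _ in tf_idf_scores(documents).most_common(k)]

# Toy corpus standing in for the suspect's document set.
corpus = [
    "The robbery suspect fled the scene in a stolen car",
    "Police recovered the stolen car after the robbery",
    "The kidnap victim was found safe by the police",
]
print(suggest_keywords(corpus, k=5))
```

The real system would apply the WordNet stemmer before weighting and suggest the top ten keywords, as described in Section III; the ranking shown here only demonstrates the TF-IDF mechanics.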

Figure 2 The proposed Keyword/Subject Suggestion model for DFI

A. Subject Suggestion:

The key contribution of this research work is the keyword/subject suggestion module. The aim of providing keyword suggestions is to help the investigator initiate the process of finding evidence in an unknown dataset. The Keyword/Subject Suggestion module recommends the top frequent keywords that repeatedly appear in the dataset. The steps to identify the top frequent keywords are as follows:

• For each word, determine the frequency of occurrence of the word in the dataset.
• Determine the total number of words in the entire dataset.
• For each word, compute the keyword score by using the TF-IDF formula.
• Compare the keyword scores of all the words and arrange the words in descending order of keyword score.

Once the keyword score of every word in the dataset is computed, the ten keywords whose 'Keyword_score' is highest are highlighted and suggested/recommended by the system. The investigator can either select a subject from the suggested list of keywords or manually enter his/her own search query. Suggesting these subject keywords aids the investigator in formulating the search query, since the investigator is unaware of the contents of the suspect's dataset. Hence the subject suggestion model attempts to improve the Accuracy and thus enhances the performance of the existing DFI framework.

B. Search Query Expansion:

The subject keywords that are suggested by the system or entered manually by the investigator are further expanded by finding their synonyms through WordNet [20]. The proposed framework integrates WordNet to find semantic relations among words and thus attempts to improve the effectiveness of retrieval.

C. Proposed Clustering Algorithm:

The expanded query obtained from the previous step is searched on the indexed dataset. The proposed framework uses a novel semantic document clustering algorithm to cluster the documents. Algorithm 1 provides an overview of the clustering process used to semantically cluster the documents stored on the suspect's hard disk.

The input to this algorithm is the indexed dataset obtained after applying the preprocessing techniques discussed in Section 3.2, and the search query entered by the user. Each searched keyword is stemmed to its root form and its synonyms are fetched from the WordNet dictionary, as illustrated in steps 2 and 3 of the algorithm. Then the cosine similarity [19] is computed between the final query vector, which consists of the search keyword and its synonyms, and each document from the document set. Cosine similarity is used to measure the similarity between two documents. The documents whose similarity_score is more than 0 for the corresponding keyword are clustered together, whereas the remaining documents are grouped into a generic cluster.

Algorithm 1: Document clustering.
Input: Indexed dataset D, search query vector S ≠ Null
1.  for each search keyword si ∈ S do
2.      Root_word r ← find_stem(si)
3.      Synonym[1..N] ← find_synonyms(r)
4.      Final_query_vector qi ← si ∪ Synonym[1..N]
5.      for each document dj ∈ D do
6.          similarity_score ← cosine_similarity(dj, qi)
7.          if (similarity_score > 0) then
8.              Ci ← Ci ∪ dj
9.          else
10.             general_cluster ← general_cluster ∪ dj
11.         end if
12.     end for
13. end for
Output: A set of clusters

In this way, the proposed framework is designed to quickly find relevant evidence in a large dataset found at the crime scene by suggesting the keywords or subjects on which the investigator can search for the evidence.
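Algorithm 1 can be realized in a few lines of code. The sketch below follows its steps with simple stand-ins: a toy suffix-stripping stemmer and a small synonym table replace the WordNet stemmer and dictionary, and documents and queries are compared as raw term-frequency vectors via cosine similarity. All names and data are illustrative, not the paper's implementation; it also simplifies step 10 so that a document falls into the generic cluster only when it matches no keyword at all.

```python
import math
from collections import Counter

def find_stem(word):
    """Toy stand-in for the WordNet stemmer: strip a common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny synonym table standing in for a WordNet lookup.
SYNONYMS = {"rob": ["steal", "theft"], "kidnap": ["abduct", "abduction"]}

def find_synonyms(root):
    return SYNONYMS.get(root, [])

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_documents(documents, search_keywords):
    """Algorithm 1: one cluster per keyword, plus a generic cluster."""
    clusters = {kw: [] for kw in search_keywords}
    general_cluster = []
    for doc in documents:
        tokens = doc.lower().split()
        matched = False
        for kw in search_keywords:
            root = find_stem(kw)                       # step 2
            query = [kw, root] + find_synonyms(root)   # steps 3-4
            if cosine_similarity(tokens, query) > 0:   # steps 6-7
                clusters[kw].append(doc)               # step 8
                matched = True
        if not matched:
            general_cluster.append(doc)                # step 10 (simplified)
    clusters["general"] = general_cluster
    return clusters
```

Note how query expansion does the semantic work: the third document below mentions "steal", not "rob", yet it lands in the rob cluster because the expanded query vector carries the synonym.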

IV. SYSTEM IMPLEMENTATION

A. Datasets:

For validating the proposed system, experiments were conducted on the following datasets. The datasets consist of documents with .txt, .doc, .pdf, .ppt, .pptx, .xls and .xlsx extensions. Below is a brief summary of each dataset's characteristics:

1. Crime Dataset_1:
This dataset is used to test the Accuracy of the system. It comprises a total of 220 documents, of which 100 are crime documents of 4 different classes, namely Kidnap, Robbery, Sexual Assault and Murder, and 120 are other general documents.

2. Crime Dataset_2:
Crime Dataset_1 is further extended to build this larger dataset comprising 1000 documents, of which 700 are crime documents of 14 different crime categories and 300 are other documents. This dataset is used to check the Accuracy as well as the Scalability of the system, so that the correlation between Accuracy and Scalability can be studied.

3. Classic3:
Classic3 is a benchmark dataset used in text mining [22]. It consists of 7095 documents from 4 disjoint classes: 1400 CRAN documents, 1033 MED documents, 1460 CISI documents and 3204 CACM documents. This dataset is used to check the Scalability of the system.

B. Experimental Results:

The performance of the proposed system is evaluated on two parameters: Accuracy and Scalability. This section depicts the experimental results obtained after testing the Accuracy and Scalability of the proposed framework. It also illustrates the correlation between Accuracy and Scalability.

Accuracy:
The Accuracy of the system is based on three parameters, i.e. Precision, Recall and F-measure, as shown in Table I. To check the system Accuracy, experiments were conducted on 'Crime Dataset_1'. The Accuracy of the system was tested for around 50 keywords.

TABLE I. ACCURACY OF THE PROPOSED SEARCH MODEL

Accuracy Parameters | With subject suggestion | Without subject suggestion
Average Precision   | 1                       | 0.93
Average Recall      | 0.93                    | 0.89
Average F-measure   | 0.96                    | 0.90

The Average F-measure with the subject suggestion module is 0.96, whereas the Average F-measure without the subject suggestion module is 0.90. Thus, it is concluded that the subject suggestion module improves the overall Accuracy of the system.

Scalability:
Table II illustrates the Scalability of the proposed system. To test the Scalability, experiments were conducted on the 'Classic3' dataset, going up to 10,000 documents.

TABLE II. SCALABILITY OF THE PROPOSED SEARCH MODEL

Sr. No. | No. of Documents | Preprocessing time (s) | Subject Suggestion time (s) | Searching time (s) | Total time (s)
1       | 10               | 1                      | 0                           | 1                  | 2
2       | 50               | 4                      | 0                           | 1                  | 5
3       | 100              | 6                      | 1                           | 1                  | 7
4       | 500              | 15                     | 1                           | 4                  | 20
5       | 1000             | 26                     | 2                           | 6                  | 34
6       | 5000             | 194                    | 4                           | 46                 | 244
7       | 10,000           | 289                    | 5                           | 55                 | 349

The results show that the pre-processing phase consumes the maximum system runtime, and that runtime is directly proportional to the dataset's size. Since each phase of the algorithm grows linearly with respect to the total number of documents, the experimental results suggest that the system presented in this work is scalable.

Accuracy and Scalability correlation:
To study the correlation between Accuracy and Scalability, experiments were conducted on 'Crime Dataset_2'. The objective of these experiments was to check whether the Accuracy of the system is affected as the dataset size grows. Table III illustrates the relation between the Precision, Recall and F-measure scores and the runtime of each phase.

TABLE III. ACCURACY AND SCALABILITY CORRELATION

No. of    | Preprocessing | Subject Suggestion | Searching | Total    | With Subject Suggestion | Without Subject Suggestion
Documents | Time (s)      | Time (s)           | Time (s)  | Time (s) | AP    AR    AF          | AP    AR    AF
10        | 4             | 0                  | 1         | 5        | 1     1     1           | 1     1     1
100       | 18            | 1                  | 3         | 22       | 1     0.98  0.99        | 1     0.96  0.98
1000      | 63            | 3                  | 13        | 79       | 1     0.95  0.97        | 1     0.91  0.95

Where: AP = Average Precision, AR = Average Recall, AF = Average F-measure
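The Accuracy values in Tables I and III follow the standard precision, recall and F-measure definitions from information retrieval. For a single keyword cluster they can be computed as below; the document IDs are invented for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Standard IR metrics for one keyword cluster.

    retrieved: documents the system placed in the cluster.
    relevant:  documents that truly belong to the subject.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the cluster holds d1..d4, while d1..d3 and d5 are truly relevant.
p, r, f = precision_recall_f1({"d1", "d2", "d3", "d4"},
                              {"d1", "d2", "d3", "d5"})
print(p, r, f)  # 0.75 0.75 0.75
```

As a consistency check against Table I: with subject suggestion, average precision 1 and average recall 0.93 give F = 2(1)(0.93)/(1 + 0.93) ≈ 0.96, matching the reported Average F-measure.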

From Table III it was observed that the average F-measure, with or without the subject suggestion module, does not degrade much as the number of documents increases.

V. CONCLUSIONS AND FUTURE WORK

This research work developed a search model that can automatically analyze large quantities of data and allows the investigator to semantically cluster the data set on a suspect's hard drive based on keywords specified by the investigator or suggested by the system. Experiments were performed on different dummy crime datasets to test the Accuracy and Scalability of the proposed DFI framework. The experimental tests showed that subject recommendations improved the Accuracy (F-measure) of the system. Next, experiments were conducted on the Classic3 dataset to test the Scalability of the system. The objective of this Scalability experiment was to measure the runtime of each phase and thus to ensure that the system works well from 10 documents up to 10,000 documents. The experimental results showed that the model presented in this work is scalable. The correlation between Accuracy and Scalability was tested on Crime Dataset_2, and it was observed that the Accuracy of the system was not affected much as the dataset grew. Thus, the proposed DFI framework speeds up the investigation process and reduces the time taken for a digital investigation.

REFERENCES

[1] E. Casey, Digital Evidence and Computer Crime: Forensic Science, Computers and the Internet with CD-ROM, 1st ed., Academic Press, USA, 2000.
[2] G. Palmer, "A road map for digital forensic research", in Proc. 1st Digital Forensic Research Workshop, 2001, pp. 1-48.
[3] Guidance EnCase Tool, [Online], http://www.guidancesoftware.com/, 2014.
[4] Access Data Forensic Toolkit, [Online], http://www.accessdata.com, 2014.
[5] Sleuth Kit & Autopsy Tool, [Online], http://www.sleuthkit.org, 2014.
[6] S. Dosis, I. Homem and O. Popov, "Semantic Representation and Integration of Digital Evidence", in Proc. 17th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 2013, pp. 1266-1275.
[7] N. Beebe and J. Clark, Advances in Digital Forensics, Springer, 2005.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Elsevier, 2006.
[9] D. Bansal and L. Bhambhu, "Execution of APRIORI Algorithm of Data Mining Directed Towards Tumultuous Crimes Concerning Women", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, Issue 1, pp. 54-62, 2013.
[10] F. Iqbal, H. Binsalleeh, B. Fung and M. Debbabi, "A unified data mining solution for authorship analysis in anonymous textual communications", Information Sciences, Vol. 231, pp. 98-112, 2013.
[11] M. Corney, O. de Vel, A. Anderson and G. Mohay, "Gender-preferential Text Mining of E-mail Discourse", in Proc. 18th Annual Computer Security Applications Conference, 2002, pp. 282-289.
[12] O. de Vel, M. Corney and G. Mohay, "Mining E-Mail Content for Author Identification Forensics", SIGMOD Record, ACM Press, Vol. 30, Issue 4, pp. 55-64, 2001.
[13] S. Yamuna and N. Bhuvaneswari, "Data Mining Techniques to Analyze and Predict Crimes", International Journal of Engineering and Science, Vol. 1, Issue 2, pp. 243-247, 2012.
[14] H. Chen, W. Chung, J. J. Xu, G. Wang, Y. Qin and M. Chau, "Crime data mining: A general framework and some examples", Computer, Vol. 37, Issue 4, pp. 50-56, 2004.
[15] N. L. Beebe and J. G. Clark, "Post-retrieval search hit clustering to improve information retrieval effectiveness: Two digital forensics case studies", Decision Support Systems, Vol. 51, Issue 4, pp. 732-744, 2011.
[16] N. L. Beebe and J. G. Clark, "Digital forensic text string searching: Improving information retrieval effectiveness by thematically clustering search results", Digital Investigation, Elsevier, Vol. 4, pp. 49-54, 2007.
[17] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer and D. Benredjem, "Towards an integrated e-mail forensic analysis framework", Digital Investigation, Elsevier, Vol. 5, Issue 3, pp. 124-137, 2009.
[18] S. Decherchi, S. Tacconi, J. Redi, A. Leoncini, F. Sangiacomo and R. Zunino, "Text clustering for digital forensics analysis", Journal of Information Assurance and Security, Vol. 5, Issue 2, pp. 384-391, 2010.
[19] G. Dagher and B. Fung, "Subject-based semantic document clustering for digital forensic investigations", Data & Knowledge Engineering, Vol. 86, pp. 224-241, 2013.
[20] G. A. Miller, "WordNet: a lexical database for English", Communications of the ACM, Vol. 38, Issue 11, pp. 39-41, 1995.
[21] S. Mascarnes and J. Gomes, "Subject based Clustering for DFI with Subject Suggestion", International Journal of Computer Applications (IJCA), Vol. 102, No. 11, 2014.
[22] Classic3 Dataset, [Online], ftp://ftp.cs.cornell.edu/pub/smart/, 2013.

