All content following this page was uploaded by Ziad A Al-Sharif on 13 October 2018.
Abstract
1. Introduction
1 Internet Crime Complaint Center (IC3), www.ic3.gov
2 European Cybercrime Centre (EC3), www.europol.europa.eu
3 The Collar Bomber’s Explosive Tech Gaffe, www.pcworld.com
Carving and Clustering Files in RAM for Memory Forensics 697
cluster the carved contents and reconstruct the original files. We observe that
data in a carved file segment will probably share some similarity with other
segments that belong to the same file and should eventually “be stuck”
together.
In this paper, we propose a new approach for carving multiple PDF files
based on their internal structure. This is possible because a PDF file basically
consists of several objects, each of which has start and end marks that
distinguish it. Carving a single PDF file based on its structure is not new
[10, 11]. Previously, we investigated the process of identifying and locating
the objects of one PDF file in RAM (when only one PDF file is opened).
However, this paper explores and evaluates the process of acquiring various
objects that belong to different PDF files that are opened and used
simultaneously on the same machine. It employs text clustering techniques to
separate and recombine the various fragmented texts that are recovered from
the RAM of the target machine. Our extended goal is to carve multiple PDF files
that happen to co-exist in the RAM of a perpetrator’s machine. This paper
tries to answer the following question: How is it possible to carve and
separate the remaining textual contents of multiple PDF files in RAM?
these files; one depicts the RAM state immediately after closing these files
and the other depicts the RAM state 10 minutes after closing the same files.
2. Background
This section presents the basic structure of a PDF file. This structure is
used during the carving process. Then, it discusses file carving and how it is
different from file recovery. Finally, it presents some of the text clustering
techniques that are used in this paper.
The popularity of the PDF format is growing. Its portability across a diverse
set of platforms makes it one of the most widely used document file formats4.
Usually, a PDF file contains texts, bitmap images, vector graphics, and
possibly many other multimedia elements5. Figure 1 presents a sample of a
PDF file structure that only contains the “Hello World” text in one textual
object and other formatting objects. The figure shows that a PDF file
structure has four parts, as described below.
4 The 8 most popular document formats on the Web in 2015, http://duff-johnson.com/2015/02/12/
5 The PDF File Format, http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
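To make the four parts concrete, the following is a hedged Python sketch that assembles a minimal “Hello World” PDF like the one in Figure 1; the object contents, stream length, and xref offset are illustrative placeholders rather than byte-accurate values:

```python
# Sketch: assemble the four structural parts of a minimal PDF, mirroring
# the "Hello World" example of Figure 1. The /Length value and the
# startxref offset are illustrative placeholders, not byte-accurate.
header = "%PDF-1.4\n"

body = (
    "1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    "2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n"
    "3 0 obj << /Type /Page /Parent 2 0 R /Contents 4 0 R >> endobj\n"
    "4 0 obj << /Length 44 >> stream\n"
    "BT /F1 12 Tf 72 712 Td (Hello World) Tj ET\n"
    "endstream endobj\n"
)

xref = "xref\n0 5\n"            # cross-reference table (entries omitted)

trailer = (
    "trailer << /Size 5 /Root 1 0 R >>\n"
    "startxref\n9\n%%EOF\n"      # offset of the xref table, then EOF marker
)

pdf = header + body + xref + trailer
print(pdf.startswith("%PDF-") and pdf.rstrip().endswith("%%EOF"))  # True
```

The header, body, xref table, and trailer boundaries are exactly the markers the carving process later relies on.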
2.1.1. Header
2.1.4. Trailer
It allows PDF readers to rapidly locate the xref table and some special
objects such as the root object. Hence, a PDF reader begins the reading
process from the end of the file, which carries the end-of-file marker
%%EOF on its last line. This special marker is preceded by the
offset of the xref table, which should be read next. The trailer section is of
type dictionary. Its contents are specified between two double angle brackets
as key-value pairs. See #4 of Figure 1.
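This end-of-file reading order can be sketched in Python; the helper name and the sample bytes are assumptions for illustration:

```python
def find_xref_offset(pdf_bytes: bytes) -> int:
    """Read a PDF from the end, as a PDF reader does: locate the final
    %%EOF marker, then the startxref value just before it, which gives
    the byte offset of the xref table to be read next."""
    tail = pdf_bytes[-1024:]                  # the trailer lives near the end
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker found")
    sx = tail.rfind(b"startxref", 0, eof)
    if sx == -1:
        raise ValueError("no startxref before %%EOF")
    return int(tail[sx + len(b"startxref"):eof].strip())

sample = b"...objects...\nstartxref\n1234\n%%EOF\n"
print(find_xref_offset(sample))  # 1234
```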
metadata entries is modified; this is called logical deletion [12]. This leaves
the file’s contents unmodified until new content is written over them by
another process [13]. However, whenever the file system is damaged or
deleted, the file format and structure can be utilized to reclaim its contents
[14]. This process is called file carving. The file’s internal structure provides
a set of markers that define its various blocks and the type of their contents.
File carving is a challenging process, especially when it deals with
fragmented files, in which a file’s sequence of blocks is interleaved with
blocks belonging to other file types.
This paper uses the standard structure of PDF files (the physical file
structure) to carve textual objects of multiple PDF files from the RAM of a
criminal’s machine; then it applies an automated text clustering algorithm to
separate these carved objects according to their originating files.
TF-IDF(d, t) = tf(d, t) × log(D / df(t)).   (2)
DE(ta, tb) = √( Σ_{t=1}^{m} (w_{t,a} − w_{t,b})² ).   (3)
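Equations 2 and 3 can be sketched in plain Python; the tiny document set and whitespace tokenization below are illustrative, not the paper’s data:

```python
import math

def tfidf(docs):
    """TF-IDF weights per Equation 2: tf(d, t) * log(D / df(t)),
    where df(t) counts the documents containing term t."""
    D = len(docs)
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    vocab = sorted(df)
    return [[d.count(t) * math.log(D / df[t]) for t in vocab] for d in docs]

def euclidean(wa, wb):
    """Euclidean distance between two weight vectors (Equation 3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(wa, wb)))

docs = ["carve pdf objects".split(),
        "carve pdf text".split(),
        "cluster text instances".split()]
v = tfidf(docs)
print(euclidean(v[0], v[1]) < euclidean(v[0], v[2]))  # True: similar docs are closer
```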
(1) Average: In this linkage, the distance is defined between two clusters
by computing the average of distances between each point in the first cluster
A and each point in the second cluster B, denoted by D(A, B). See Equation
4, where TAB is the sum of all pairwise distances between cluster A and
cluster B, and NA and NB are the sizes of clusters A and B, respectively:

D(A, B) = TAB / (NA × NB).   (4)
(2) Mean: This linkage calculates the mean distance of merged clusters.
Also, it is called group-average agglomerative clustering. In this linkage, the
two clusters are merged such that, after the merge, the average pairwise
distance within the newly formed cluster is minimized. See Equation 5,
where observations i and j are in cluster t, which is formed by merging
clusters A and B:
(3) Ward: This linkage keeps track of the sum of squared distances within
merged clusters. It aims at minimizing the increase of this sum as much as
possible. Initially, the sum of squares for the first merge is zero [18]. This
sum starts to increase according to the Ward rule described in Equation 6,
where mj is the center of cluster j and nj is the number of points in it. Δ is
called the merging cost of combining the clusters A and B:

Δ(A, B) = Σ_{i∈A∪B} ‖xi − m_{A∪B}‖² − Σ_{i∈A} ‖xi − mA‖² − Σ_{i∈B} ‖xi − mB‖²
        = (nA nB / (nA + nB)) ‖mA − mB‖².   (6)
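The two forms of Equation 6, the increase in within-cluster sum of squares and the closed-form expression, can be checked numerically; the points below are illustrative:

```python
# Sketch: verify Ward's merging cost Delta(A, B) both directly (increase
# in within-cluster sum of squares) and via the closed form
# (n_A * n_B / (n_A + n_B)) * ||m_A - m_B||^2 from Equation 6.
def mean(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(pts):
    """Sum of squared distances of points to their centroid."""
    m = mean(pts)
    return sum(sq_dist(x, m) for x in pts)

A = [[0.0, 0.0], [2.0, 0.0]]
B = [[5.0, 1.0], [7.0, 1.0], [6.0, 4.0]]

delta_direct = sse(A + B) - sse(A) - sse(B)
mA, mB = mean(A), mean(B)
delta_closed = (len(A) * len(B) / (len(A) + len(B))) * sq_dist(mA, mB)
print(abs(delta_direct - delta_closed) < 1e-9)  # True: both forms agree
```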
3. Methodology
utilizing their internal structure is valuable and, in many scenarios, the only
option. A PDF file mainly consists of several objects that can be uniquely
identified by surrounding markers, which can be simply used to identify
these objects. However, such objects might have come from several files.
Thus, identifying the original files that these objects come from is
challenging. This paper overcomes this challenge by utilizing clustering
techniques that sort these objects back to their origins. Figure 2 shows our
investigation model that is further illustrated in the following steps.
6 Oracle VirtualBox, www.virtualbox.org
706 Ziad A. Al-Sharif et al.
to carve PDF textual data objects; see Figure 2. PDF objects are carved for
a group of PDF files, which are used by the user and have their footprints in
the RAM of the target machine. The regular format of PDF objects
facilitates the carving process. A pool of all PDF objects is created for further
analysis.
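A minimal sketch of this carving step, assuming objects are delimited by the standard obj/endobj markers; the dump bytes and regular expression are illustrative, not the paper’s implementation:

```python
import re

# Carve PDF objects from a raw memory image by scanning for the
# "N G obj ... endobj" markers that delimit them, regardless of
# which opened file each object originally belonged to.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)

def carve_pdf_objects(dump: bytes):
    """Return (object number, generation, body) for every
    marker-delimited PDF object found in the dump."""
    return [(int(m.group(1)), int(m.group(2)), m.group(3).strip())
            for m in OBJ_RE.finditer(dump)]

dump = (b"\x00garbage\x004 0 obj << /Length 12 >> stream\nHello World\n"
        b"endstream endobj\x00noise\x007 0 obj << /Type /Page >> endobj")
pool = carve_pdf_objects(dump)
print(len(pool))  # 2 objects carved into the pool
```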
3.2. Step 2: Extracting textual data
The plain text obtained from the decoding process may suffer from
incorrect word segmentation, leaving fragments that are not meaningful
words. For example, an encoded sentence such as
“Virtual Machine for Computer Forensics systems” can be decoded as
follows: “V irtual Machi ne for Comput er F orensics s y s t e m s.”
Moreover, some characters may remain encoded and disrupt the spelling
of such words. For example, in (classi”cation), the middle double quotation
symbol (”) is actually the letters (fi). The spellings of words that have these
characters need to be rectified. Therefore, we use the enchant library7 to
check the spelling of each word in the resulting list and to rectify the
incorrect ones.
7 Enchant Library, www.abisource.com/projects/enchant/
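The ligature repair described above can be sketched as follows; the character mapping and the tiny word list are illustrative assumptions (the paper uses the enchant library for the actual spell check):

```python
# Repair ligature characters that survive decoding, as in the
# "classi”cation" example above, before spell checking. This mapping is
# specific to the font encoding at hand and is an illustrative assumption.
LIGATURES = {"”": "fi", "ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff"}

def repair(word: str, known_words) -> str:
    """Return a corrected spelling if substituting a ligature yields a
    known word; otherwise return the word unchanged."""
    for bad, good in LIGATURES.items():
        candidate = word.replace(bad, good)
        if candidate in known_words:
            return candidate
    return word

known = {"classification", "file", "forensics"}
print(repair("classi”cation", known))  # classification
```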
Text clustering techniques are used to group the text instances (textual
data) based on their contents. The basic idea we are following is that text
instances of similar content are likely to originate from the same PDF file.
We investigate various parameters and rules on two of the most popular
clustering algorithms: Simple K-Means and Hierarchical clustering. These
two algorithms are applied with their default distance function (Euclidean).
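As a hedged sketch of the K-Means side (pure Python rather than the toolkit used in the paper), with the Euclidean distance and illustrative 2-D points standing in for TF-IDF vectors:

```python
import math
import random

# Simple K-Means with Euclidean distance: assign each point to its
# nearest center, recompute centers as cluster means, and repeat.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```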
Additionally, in order to validate our results, we use two essential
evaluation metrics that are commonly used for text processing. The first is
the Macro-Average Precision, the average of the per-cluster precision defined
in Equation 7, where Precision is the fraction of retrieved
instances that are relevant to a given set, True Positive (TP) is the set of
instances that are correctly assigned to a given set, and False Positive (FP) is
the set of instances that are incorrectly assigned to that set. The second is
the Macro-Average Recall, the average of the per-cluster recall defined in
Equation 8, where Recall is the fraction of relevant instances that are
retrieved in a given set, and False Negative (FN) is the set of instances that
are incorrectly not assigned to that set:
Precision = TP / (TP + FP),   (7)

Recall = TP / (TP + FN).   (8)
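Equations 7 and 8, macro-averaged over clusters, can be sketched as follows; the per-cluster counts are illustrative:

```python
# Macro-averaged precision and recall (Equations 7 and 8): compute the
# per-cluster value from TP/FP/FN counts, then average across clusters.
def macro_precision(stats):
    return sum(tp / (tp + fp) for tp, fp, _ in stats) / len(stats)

def macro_recall(stats):
    return sum(tp / (tp + fn) for tp, _, fn in stats) / len(stats)

# (TP, FP, FN) for each of three hypothetical clusters
stats = [(8, 2, 1), (9, 1, 3), (7, 3, 2)]
print(round(macro_precision(stats), 3))  # 0.8
print(round(macro_recall(stats), 3))     # 0.806
```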
4. Experiments
Our methodology consists of two parts: being able to extract PDF objects
from RAM and being able to cluster them correctly. In order to validate our
methodology, we design two experiments. The first one is to validate the
clustering part (given a set of objects taken from multiple PDF files, how
effective can we cluster them?), while the second one is to validate the
objects extraction from RAM. More details about these two experiments
follow.
4.1. Experiment #1
This experiment is designed to verify the ability to extract and cluster
objects of multiple files. In order to increase the difficulty level of the
clustering process, all PDF files are chosen so that they share a common
subject. We use 50 PDF files that are randomly obtained from a conference
proceeding8. Often, computer users do not open too many files
simultaneously. In order to evaluate the impact of the objects’ pool size on
the clustering process, these files are divided into five subsets. The first
subset ( S1 ) contains 10 files. Then S2 contains the files in S1 in addition to
10 more files for a total of 20 files. Each one of the remaining sets contains
the same files of the previous set in addition to 10 more files.
For each Si , we extract all objects (Obj x ) from all files in that set.
Among various types of streams, we are interested in textual data streams.
Each extracted text (after processing and decoding shown in Subsections 3.2
and 3.3) is saved into a separate plain text file. Accordingly, S1 produces
140 text files, while S2, S3, S4 and S5 produce 272, 367, 493 and 645 text
files, respectively. See Table 1.
Table 1. The sets of used PDF files in Experiment #1. Set Si +1 has 10 more
files over Si
Set Used PDF files Number of textual data objects
S1 F1 − F10 140
S2 F1 − F20 272
S3 F1 − F30 367
S4 F1 − F40 493
S5 F1 − F50 645
8 IEEE INFOCOM 2009
algorithm to figure out the best clustering settings (linkage rules and feature
vectors). Algorithm 1 summarizes our preparation steps for the clustering
process and Algorithm 2 shows our experimental procedure that is applied on
each Si presented in Table 1. The results are presented in Subsection 5.1.
Algorithm 1. Preprocessing
4.2. Experiment #2
This experiment is intended to utilize the results obtained from
Experiment #1 on a digital forensic application that targets the volatility and
randomness of data in RAM. We intend to verify the ability to carve PDF
objects from a memory dump, extract their textual data, and finally separate
them into clusters, each of which represents a large part of the original PDF
file. A Windows 7 Virtual Machine (VM) with 512 MB of RAM is used. The
VM is created using Oracle’s Virtual Box.
Algorithm 2. First experiment:
1: Input: A Set of PDF files ( Si )
4: for each F j ∈ Si do
• Scenario1: the dump is taken while the five PDF files are opened.
• Scenario2: the dump is taken right after closing the five PDF files.
F2 243 166 17
F3 525 297 16
F4 432 140 9
F5 996 318 18
5. Results
9 Weka, https://sourceforge.net/projects/weka/
their default parameter values. Then, we turn our attention to the algorithm
that produces the best result (the Hierarchical clustering algorithm)
and test different parameter values for it to explore whether we can improve
its accuracy. Specifically, we test the Hierarchical clustering algorithm with
three different linkage rules: Average, Mean, and Ward. Then, three different
Word to Keep Values (WKVs) are evaluated: 100, 500 and 1000.
4: repeat
5: Identify PDF objects ∈ Dumpi ;
Table 3. Results from Experiment #1. It compares the results of the two text
clustering algorithms. The evaluation is based on the Macro-Average
Precision and Macro-Average Recall. (i) Represents the results from
Hierarchical Clustering and (ii) Represents results from the Simple K-Means
- Macro-Average Precision Macro-Average Recall
Set (i) (ii) (i) (ii)
S1 0.988 0.773 0.986 0.801
0.902 for the Mean rule, and from 0.882 to 0.987 for the Ward rule. This
means that the Ward rule outperforms the other two. Furthermore, Table 4
shows the recall of the clustering process for these three linkage rules. The
values range from 0.656 to 0.817, from 0.791 to 0.908, and from 0.847 to
0.986 for the Average, Mean, and Ward rules, respectively. This means the
Ward rule outperforms the other two in terms of recall as well.
Table 4. Results from Experiment #1. It compares the results for the
Average, Mean, and Ward Rules when applied on Hierarchical clustering; it
is clear that the Ward rule outperforms the other two
- Macro-Average Precision Macro-Average Recall
Set Ward Mean Avg. Ward Mean Avg.
S1 0.987 0.902 0.666 0.986 0.908 0.767
Table 6. The results of carving textual objects of five different files that are
obtained from the RAM dump of the used machine
Scenario Number of successfully Macro-Average
6. Discussion
7. Related Work
audios and videos [14, 32]. However, to the best of our knowledge, no
previous work has tackled the problem of file carving in RAM. Additionally,
this paper addresses a more challenging problem where carving multiple files
of the same type is required. In this research, we believe that whenever the
file system is damaged or deleted, the maximum amount of data can still be
recovered from RAM by means of file carving methods and clustering
techniques. We target the simultaneous use of multiple PDF files and we
investigate the recovery of their contents from RAM. Such digital evidence
would help digital investigators prove that the file was actually used and
opened on the target machine.
9. Conclusion
Acknowledgement
The authors thank the anonymous referees for their valuable suggestions
and comments.
References