Automated Behavioral Analysis of Malware: A Case Study of Wannacry Ransomware

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/322669812

Automated Behavioral Analysis of Malware: A Case Study of WannaCry


Ransomware

Conference Paper · December 2017


DOI: 10.1109/ICMLA.2017.0-119

CITATIONS READS
14 1,176

2 authors, including:

Robert Bridges
Oak Ridge National Laboratory
36 PUBLICATIONS   229 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Power consumption analysis for malware detection View project

All content following this page was uploaded by Robert Bridges on 24 July 2018.

The user has requested enhancement of the downloaded file.


Automated Behavioral Analysis of Malware
A Case Study of WannaCry Ransomware
Qian Chen Robert A. Bridges
Department of Electrical and Computer Engineering Computational Sciences and Engineering Division
The University of Texas at San Antonio Oak Ridge National Laboratory
San Antonio, TX 78249 Oak Ridge, TN 37831
guenevereqian.chen@utsa.edu bridgesra@ornl.gov

Abstract—Ransomware, a class of self-propagating malware host’s data, ransomware must perform a series of actions; e.g.,
that uses encryption to hold the victims’ data ransom, has identifying files for encryption/deletion and exchanging en-
emerged in recent years as one of the most dangerous cyber cryption keys with a command and control (CC) server. Hence,
threats, with widespread damage; e.g., zero-day ransomware
WannaCry has caused world-wide catastrophe, from knocking discovery of the malware’s pre-encryption footprint promises
U.K. National Health Service hospitals offline to shutting down accurate, in-time detection and is the focus of ransomware
a Honda Motor Company in Japan [1]. Our close collaboration analysis efforts. Our hypothesis is that data analytics on host
with security operations of large enterprises reveals that defense logs can automate discovery of features that are indicative of
against ransomware relies on tedious analysis from high-volume ransomware’s presence before encryption (and of malware’s
systems logs of the first few infections. Sandbox analysis of freshly
captured malware is also commonplace in operation. executions in general). Such a capability promises automated
We introduce a method to identify and rank the most discrim- pattern-generation and analysis capabilities that are robust
inating ransomware features from a set of ambient (non-attack) to syntactic polymorphism in ransomware and more general
system logs and at least one log stream containing both ambient classes of malware; a critical necessity given the unfortunate
and ransomware behavior. These ranked features reveal a set of success of the ransomware economy.
malware actions that are produced automatically from system
logs, and can help automate tedious manual analysis. We test This work is motivated by our close collaboration with
our approach using WannaCry and two polymorphic samples by cyber operations at large enterprises. Most notably the 2015
producing logs with Cuckoo Sandbox during both ambient, and infection by polymorphic and then novel CryptoWall 3.0 [8, 9]
ambient plus ransomware executions. Our goal is to extract the induced a 160 man-hour forensic effort to manually analyze
features of the malware from the logs with only knowledge that the few infected hosts’ logs in an attempt to produce shareable
malware was present. We compare outputs with a detailed anal-
ysis of WannaCry allowing validation of the algorithm’s feature threat intelligence reports and pre-encryption detection capa-
extraction and provide analysis of the method’s robustness to bilities. This operational scramble begs the question, “How
variations of input data—changing quality/quantity of ambient to automatically extract the sequence of events induced by
data and testing polymorphic ransomware. Most notably, our malware given a large volume of logs from a few hosts
patterns are accurate and unwavering when generated from that were infected with potentially polymorphic malware?”
polymorphic WannaCry copies, on which 63 (of 63 tested) anti-
virus (AV) products fail. Moreover, such facilities regularly receive freshly discovered
malware samples, which are analyzed in sandboxes; hence,
I. I NTRODUCTION an automated pattern-generation tool that is provably (more)
Ransomware is a class of self-propagating malware that uses robust to polymorphism from dynamic analysis is needed.
encryption to hold victim’s data ransom and has emerged as To our knowledge no automated method for extracting
a dominant worldwide threat, crippling personal, industrial, the footprint of malware from the ambient and/or sandbox-
and governmental networked resources [2–5]. Most notably, generated logging data is known. Yet our close collaboration
the recent epidemic of WannaCry [6, 7] was one of the with security analysts reveals that such a method is needed
largest ransomware attacks in history, halting hospital facilities in practice to benefit two primary use cases—(1) to expedite
and infecting large corporations and consumers in over 150 currently timely (hundreds of man hours) manual analysis of
countries. From initial exploit to completing encryption of a logs used to identify malwares actions from ambient system
logs in forensic efforts, (2) to automatically generate behav-
This manuscript has been authored by UT-Battelle, LLC under Contract No. ioral analysis of malware samples from sandbox logging data
DE-AC05-00OR22725 with the U.S. Department of Energy. The United States (which is currently investigated manually). Our contribution
Government retains and the publisher, by accepting the article for publication,
acknowledges that the United States Government retains a non-exclusive, paid- is to present an algorithm for automatically extracting the
up, irrevocable, world-wide license to publish or reproduce the published form features that discriminate malware from ambient logging ac-
of this manuscript, or allow others to do so, for United States Government tivity given collections of logs known to contain malware
purposes. The Department of Energy will provide public access to these results
of federally sponsored research in accordance with the DOE Public Access executions (in addition to ambient logs) and logging data
Plan (http://energy.gov/downloads/doe-public-access-plan) known to contain no malware executions. Further, we present
systematic testing of our method using Cuckoo Sandbox to approaches [14–24]. Therefore, Cuckoo is an established tool
generate WannaCry ransomware logs showing when and how to generate repeatable malware analysis results.
it is robust to changes in input data. UNVEIL [25] is a novel dynamic analysis system for
To this end, we propose an information relative approach detecting ransomware attacks and modeling their behaviors.
using a Term Frequency-Inverse Document Frequency (TF- UNVEIL tracks changes to the analysis systems desktop by
IDF) metric to automatically extract and rank the most dis- calculating dissimilarity scores of desktop screenshots before,
criminating features of the new malware from logging data. during, and after executing the malware samples to identify
The TF-IDF method also preserves human-understandable ransomware. UNVEIL successfully identified previously un-
features, which is necessary for operators to understand their known evasive ransomware that was not detected by the anti-
analytics, e.g., in automatic malware analysis. malware software.
We leverage Cuckoo Sandbox (https://cuckoosandbox.org), UNVEIL focuses on ransomware attack detection while
an automatic malware analysis system, for dynamic analysis our work is a generic approach to extract the most discrim-
of executables including both WannaCry variants and scripts inative features of malware in logging data. We specially
simulating non-malicious user activity. Cuckoo reports activity tested our approach with WannaCry malware. Our approach
relating to files, folders, memory, network traffic, processes, automatically discovers features that are indicative of Wan-
and API calls, thus giving source data for experimentation naCry ransomware’s presence before encrypting targeted files.
with ground truth. This builds results on a well-adopted, open-
source malware analysis tool (Cuckoo), ensures repeatability TABLE I
II. P REREQUISITES WANNAC RY F ILES & F OLDERS
of results, and proves a concept that we believe will be
transferable to more general system logs, e.g., Windows Log- A. WannaCry Name Meaning
b.wnry Bitmap file for Desktop image
ging Service output (https://digirati82.com/wls-information) Ransomware Attack c.wnry Configuration file
that are not shareable outside the organization. r.wnry Q&A file, payment instructions
Our primary goal is s.wnry Tor client
The algorithm’s outputs are validated using a detailed
to extract malware’s ac- t.wrny WANACRY! file with RSA keys
analysis of WannaCry. Our contributions include feature ex- u.wnry @WannaDecryptor@.exe
tivity from a set of
traction techniques from Cuckoo output, and, most notably, Folder containing RTF files with
logs only knowing the \msg payment instructions in 128
a method to automatically extract the most discriminative languages (e.g., korean.wnry)
logs contain malware
ransomware features from event logs given a set of ambient taskse.exe Launches decryption tool
activity and thereby au- taskdl.exe Removes temporary files
(non-attack) logs and logs containing ransomware (potentially
tomating malware anal-
mixed with logs from many normal activities). We present four
ysis and pattern generation. Before testing the capability, we
experiments showing our method (1) can automatically extract
present an overview of WannaCry to be used as ground
features that are indicative of the malware, (2) the method is
truth for validating the malware features extracted from our
robust to the quantity of known, non-malicious logging data
tests. In May 2017, the WannaCry ransomware attack infected
included, (3) the method succeeds when the logs containing
over 300k Windows computers in over 150 countries. The
malware activity also include a majority of non-malicious
dropper of the malware carries two components. One uses
ambient logs, and (4) our method can produce a pattern that
the “EternalBlue” exploit against a vulnerability of Windows’
is robust to polymorphic changes that bypass 63 (of 63 tested)
Server Message Block (SMB) protocol to propagate, and the
anti-virus (AV) detectors.
other is a WannaCry ransomware encryption component [6].
Static analysis of WannaCry has been documented by analysts
A. Related Work and cyber security companies [26]. The analyzed WannaCry
Malware analysts usually adopt static and dynamic analysis files and action sequence are summarized in Tables I and
techniques to determine behavior and risks of a specific mal- II, respectively. More details are available in the technical
ware sample. Static analysis analyzes malware samples before report [27]. This gives ground-truth for evaluating the features
it is executed [10] but struggles to analyze self-modifying identified from Cuckoo logs.
and polymorphic code. Dynamic analysis is more powerful
for malware forensics analysis because it allows analysts to B. Cuckoo Sandbox & Produced Logs
understand malware behavior and activities by executing the Cuckoo Sandbox is an automatic malware analysis system,
malware sample. In this work, we use Cuckoo Sandbox for which provides detailed results of suspicious files’ activities
dynamic analysis. and behaviors by executing the files (e.g., Windows executa-
Cuckoo has been used to identify polymorphic malware bles, document exploits, URLs and HTML files, Java JAR, ZIP
samples [11], trigger malware that detects it is in a sandbox, file, Python files etc.) in a virtualized and isolated environment.
and identifies particular malware actions in different network The suspicious file’s performance, such as changes of files and
profiles [12], find IP addresses, domains and file hashes of folders, memory dumps, network traffic, processes and the API
malware samples to generate network-related indicators of calls are monitored and analyzed by Cuckoo. The Cuckoo
compromise [13], and providing ground-truth training and reporting module elaborates the analysis results and saves
testing data for supervised Intrusion Detection System (IDS) the produced report into a human readable JavaScript Object

2
Notation (JSON) and HTML formats. In our experiments TABLE III
Cuckoo outputs ranged from 1MB to 1GB per analyzed file. E XTRACTED F EATURE F ORMAT FOR L OGS IN F IGURES 1 & 2

Log Examples Feature


TABLE II Fig. 1 “bigram: api=regcreatekeyexw+arguments=software\\wanacrypt0r”
WANNAC RY ACTIONS Fig. 2 “enhanced: object=registry+event=read+data=
(“eid”:1) regkey:activecomputernamecomputername ”
No. Action Fig. 2 “enhanced: object=file+event=write+data=
1 Imports CryptoAPI from advpi32.dll (“eid”:3) file:c:\\docume∼1\\cuckoo\\locals∼1\\temp\\b.wnry ”
2 Unzips itself to files in Table I
3 Generates machine-unique identifier
Pre-Encryption

Creates a registry, HKEY_LOCAL_MACHINE\


4
Software\WanaCrypt0r\wd libraries; and execute files.
Runs ‘attrib +h’, which sets the
5
current directory as a hidden folder See Fig. 2. Additionally, Cuckoo searches VirusTotal.com,
Runs ‘icacls . /grant Everyone:F /T /C
6 /Q’, which grants all users permissions
checking 63 AV vendors for signatures detecting the file under
to the current directory analysis. We leverage this to compare the capabilities of our
Imports public and private RSA AES
7
keys (000.pky, 000.eky) from t.wrny pattern generation for polymorphic samples to that of AV
Creates 00000000.res, a file containing unique user ID,
8
total encrypted file count, and total encrypted file size
vendors. See our technical report [27] for more information
9
Uses SHGetFolderPathW API to scan the file system on Cuckoo.
starting at the desktop folder
10 Finds the target files and generates one AES key per file
Uses the public RSA key to encrypt AES key of target
Encryption

11
file and saves encrypted AES key to the target file
Uses CreateFileW, ReadFile and WriteFile APIs to create
12 encrypted files. The string WANNACRY is written on
the infected files.
Calls taskdl.exe and MoveFileW API to replace
13
.WNCRYT temp files to .WCRY files
Calls taskse.exe C:\DOCUME˜1\cuckoo
\LOCALS˜1\Temp\@WanaDecryptor@.exe
14
to launch the decryption tool and
replace desktop image to “!WannaCryptor!.bmp”
Runs cmd.exe/cregaddHKLM\SOFTWARE\
Microsoft\Windows\CurrentVersion\Run
/v\"thsgvkvtwaipdcd971\"/tREG_SZ/d\
15
"\"C:\DOCUME˜1\cuckoo\LOCALS˜1\
Temp\tasksche.exe\\"\"/f to create
a unique identifier registry key
16 Runs @WanaDecryptor@.txt to create a copy of r.wnry
17 Temporary files with prefix ‘∼SD’ created then deleted

Although Cuckoo reports seven categories of malware Fig. 2. Enhanced logs of a Cuckoo Analysis JSON Report File
analysis outputs, we restrict ourselves to the behavior
III. M ETHOD : F EATURES & TF-IDF
category, as these are
analogous to system The general problem we consider is how to extract the most
logs collected by indicative features of malware from logs of the host on which
security operations the malware was active. Note that this set of logs may contain
from workstations. a majority of logs from non-malicious, ambient user activity.
This category Our approach is to obtain a second set of logs from only
contains the raw non-malicious activity (e.g., by creation in our case, or in the
behavioral logs for operational case of a ransomware outbreak, from system logs
each process running of non-infected hosts), and seek features of the infected logs
by the analyzed files, set that are uncommonly common (are high frequency in a
including logs of the few malicious documents only). Below we describe the feature
complete processes representation from Cuckoo behavior logs and the ranking
tracing, a behavioral method.
summary and a Conceptually, we consider a set of logs (Cuckoo enhanced
process tree. and behavior logs in our experiments) as a “document” and a
The enhanced selected subset of the log entries as “terms” or features. Two
class generates a entries are considered the same term/feature if they agree in all
more extensive high- fields except time and event ID. All enhanced logs are used.
level summary of the Only those behavior (non-enhanced) logs with the fields “cat-
processes and their egory” and “api” taking values registry and RegCreatKeyExW,
activities. Instead of respectively, are included. Altogether, a document (log stream)
reading from raw Fig. 1. Behavioral Log Example is represented as a count of each term/feature (a bag-of-words
behavioral logs, the model). See Table III.
enhanced class helps to interpret and summarize essential For application to ransomware pattern generation, we only
activities performed by the analyzed files, e.g., read, write, consider pre-encryption features; specifically, for WannaCry
and delete from registry, files, and directories; load Windows logs before the creation of the private key 00000000.eky. In

3
TABLE IV
practice, given the initial infection host logs, operators would T HE R ANKING OF F EATURES AND T HEIR TF-IDF W EIGHTS
have to identify a pre-infection cutoff and apply our method TF-IDF
Ranking Feature
to all logs previous. Although for general malware forensics / “enhanced: object=file+event=write+data=file:
Weight
1 299.36
analysis this is not necessary. c:\\docume∼1\\cuckoo\\locals∼1\\temp\\s.wnry”
“enhanced: object=file+event=write+data=file:
2 33.80
Given two sets of documents (in our case, at least one c:\\docume∼1\\cuckoo\\locals∼1\\temp\\b.wnry”
“enhanced: object=file+event=write+data=file:
3 24.14
with logs containing malware activities, the some contain- c:\\docume∼1\\cuckoo\\locals∼1\\temp\\u.wnry ”
“enhanced: object=file+event=read+data=file:
4 9.66
ing ambient activity), we apply Term-Frequency-Inverse- c:\\docume∼1\\cuckoo\\locals∼1\\temp\\t.wnry ”
“enhanced: object=file+event=write+data=file:
Document-Frequency (TF-IDF) an information relative term- 5
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m korean.wnry ”,
“enhanced: object=file+event=write+data=file:
9.66

weighting scheme [28]. Letting f (t, d) denote the frequency c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m vietnamese.wnry”
“enhanced: object=file+event=write+data=file:
of term t in document d, and N the size of the cor- 6
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m chinese (traditional).wnry”,
“enhanced: object=file+event=write+data=file:
8.05
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m japanese.wnry”
pus, the TF-IDF weightPis the product of the Term Fre- ”enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m chinese (simplified).wnry”,
quency, tf(t, d) = ft,d / t0 ∈d ft0 ,d (giving the likelihood of 7 . . . (24 various language features) 4.83
“enhanced: object=file+event=write+data=file:
t in d) and the Inverse Document Frequency, idf(t, D) = c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m turkish.wnry”
log N/1 + |{d ∈ D : t ∈ d}| (giving the Shannon’s informa- “enhanced: object=file+event=read+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\c.wnry ”,
“enhanced: object=file+event=write+data=file:
tion of the a document containing t). Intuitively, given a c:\\docume∼1\\cuckoo\\locals∼1\\temp\\c.wnry”,
“enhanced: object=file+event=write+data=file:
document, those terms that are uncommonly high frequency 8
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\taskdl.exe”,
3.22
“enhanced: object=file+event=write+data=file:
in that document are the only that receive high scores. c:\\docume∼1\\cuckoo\\locals∼1\\temp\\taskdl.exe”,
“enhanced: object=registry+event=read+data=regkey:
Our application is to consider all logs from infected hosts \\ activecomputernamemachineguid”
“enhanced: object=registry+event=read+data=regkey:
as a single document, then regard only the features from this 9 hkey local machine\\software\\microsoft\\cryptography\\defaults\\provider\\ 2.04
microsoft enhanced rsa and aes cryptographic provider (prototype)image path”
“infected” document and apply TF-IDF; hence, highly ranked “bigram: api=regcreatekeyexw+arguments=software\\wanacrypt0r”,
“enhanced: object=dir+event=create+data=file:
features occur often in (and are guaranteed to occur at least c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg”,
“enhanced: object=file+event=execute+data=file:attrib +h . ”,
once in) the “infected” document, but infrequently anywhere 10 “enhanced: object=file+event=execute+data=file:icacls . /grant everyone:f /t /c /q ”, 1.61
“enhanced: object=file+event=write+data=file:
else. c:\\docume∼1\\cuckoo\\locals∼1\\temp\\00000000.pky”,
“enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\r.wnry”
IV. E XPERIMENTS & R ESULTS
A. Analysis of WannaCry & Four Normal Activities 3) The same analysis report of the WannaCry executable
file and 17 normal performance analysis files.
In our first experiment, we first analyze malicious behavior
of the WannaCry executable file by sending it to the Cuckoo Normal activities are analyzed by Cuckoo Sandbox via sub-
Sandbox. Besides obtaining a Cuckoo analysis report of the mitting Python files. From the three experiments we find that
WannaCry sample (i.e., a malicious document), Python scripts the highest 10 weights calculated by TF-IDF and their features
of users’ normal activity (i.e. read, write and delete files, open are the same as the ranking shown in Table IV, regardless of
websites, watch YouTube videos, send and receive emails, the number of normal activity analysis files.
search flight tickets, post and delete tweets on Twitter) are
C. Combining Normal Activities with WannaCry
submitted to and executed by Cuckoo. One WannaCry mali-
cious document and four normal documents (Cuckoo analysis In this experiment, we create a Python file that executes
reports of users’ normal activity) are used to calculate TF- normal activities first and then trigger the WannaCry malware.
IDF weights for 74 pre-encryption WannaCry specific features. The Python file is analyzed by the Cuckoo Sandbox, and the
Note that the normal activity analysis reports contain various analysis report along with various non-malicious log files are
features that may or may not be in the malware analysis report, sent to the TF-IDF method. This experiment aims to validate
and some of the 74 features extracted from the malicious logs that our method can accurately identify specific features of the
may have occurred from ambient (non-malicious) behavior malware when a large majority of the features are indicative
(see Section IV-C where this is known). of non-malicious activity. In operations, this gives evidence
Table IV shows the most important 43 features (top-ten that our method can help IT staff distinguish the malware’s
TF-IDF weights). These highly ranked features are also the footprint from majority ambient logging data.
patterns of WannaCry obtained from the detailed technical To create the mixed normal + malware logs, we use a
analysis of the WannaCry executable file (Section II-A). Python script to search flight ticket information by opening the
website (i.e., www.google.com/flights) and changing
B. Analysis of WannaCry & Varying Normal Activities the date and airport codes of the URL link for various requests.
This experiment aims to validate that the ranking of Wan- After the normal activities, the system executes the WannaCry
naCry features is not influenced by varying the number of executable files. The Python script combining both normal and
normal documents. To validate the hypothesis, we calculate WannaCry activities is submitted to the Cuckoo analyzer, and
the TF-IDF weights in the following three scenarios. the report of the analysis is used to calculate and determine
1) One analysis report of the WannaCry executable file and the most important patterns of the combination scenario.
five normal performance analysis files. We still search for the timeline when the private key was
2) The same analysis report of the WannaCry executable first generated, and consider all features before the timeline
file and six normal performance analysis files. (pre-encryption features only). Since the Python file executes

4
normal activities (i.e., search for flight information) before the TABLE V
WannaCry malware, the number of pre-encryption features has F EATURES R ANKINGS FOR E XPERIMENT IV-C
increased from 74 to 1,085. Feature
Ranking Ranking
(case 1) (case 2)
Two experiments are designed as follows. “enhanced: object=file+event=write+data=file:
1 2
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\s.wnry”
1) One combination analysis of flight-search (normal) ac- “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\b.wnry”
2 7

tivities and WannaCry malware versus four flight-search “enhanced: object=file+event=write+data=file:


c:\\docume∼1\\cuckoo\\locals∼1\\temp\\u.wnry ”
4 13

normal activity analysis reports. “enhanced: object=file+event=read+data=file:


c:\\docume∼1\\cuckoo\\locals∼1\\temp\\t.wnry ”,
2) The same combination analysis report versus 21 normal “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m korean.wnry ”,
7 32

activities (including four normal analysis reports used in “enhanced: object=file+event=write+data=file:


c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m vietnamese.wnry ”
the above experiments). “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m chinese (traditional).wnry”,
8 36
“enhanced: object=file+event=write+data=file:
As we introduced in Section III, the “IDF” term down c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m japanese.wnry”
“enhanced: object=file+event=write+data=file:
weights the features occurring in many documents; hence, c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m chinese (simplified).wnry”,
9 40
“enhanced: object=file+event=write+data=file:
although most of the 1,085 features appearing in the nor- c:\\docume 1\\cuckoo\\locals 1\\temp\\msg\\m romanian.wnry ”
“enhanced: object=file+event=write+data=file:
mal+malware reports are not indicative of WannaCry, the c:\\docume 1\\cuckoo\\locals 1\\temp\\msg\\m bulgarian.wnry ”
. . . (22 various language features) 10 45
malware-specific features are still ranked as top features, “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg\\m turkish.wnry”
but the ranking of some malware features are scaled down. “enhanced: object=file+event=read+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\c.wnry ”,
This is because some normal activities appear frequently in “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\c.wnry”,
the combination analysis report, but relatively infrequently in “enhanced: object=file+event=write+data=file:
12 54
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\taskdl.exe”,
other normal reports, especially in Scenario Two where many “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\taskdl.exe”,
different normal activities are included. The ranking of the “enhanced: object=registry+event=read+data=regkey:
\\ activecomputernamemachineguid”
features for two experiments conducted in this Experiment are “enhanced: object=registry+event=read+data=regkey:
hkey local machine\\software\\microsoft\\cryptography\\defaults\\provider\\ 9 58
shown in Table V. We only list the rankings of the malware microsoft enhanced rsa and aes cryptographic provider (prototype)image path”
“bigram: api=regcreatekeyexw+arguments=software\\wanacrypt0r”,
specific features in Table IV. “enhanced: object=dir+event=create+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\msg”,
For example, "enhanced:_object=file+event= “enhanced: object=file+event=execute+data=file:attrib +h . ”,
“enhanced: object=file+event=execute+data=file:icacls . /grant everyone:f /t /c /q ”, 16 80
write+data=file:c:\\documentsandsettings\ “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\00000000.pky”,
\cuckoo\\applicationdata\\mozilla\ “enhanced: object=file+event=write+data=file:
c:\\docume∼1\\cuckoo\\locals∼1\\temp\\r.wnry”
\firefox\\profiles\\qk4ev1cw.default\
\places.sqlite", is a Firefox activity for reading To create a very similar variant, we modify the HEX code
and writing the “places.sqlite” file to save browsing history, file of the WannaCry malware; specifically, we change the
store bookmarks, annotations etc. As it is a common activity upper-case letters of the message “This program cannot be
in all analysis reports where Firefox applications are executed, run in DOS mode.” to all lower-case letters and the space
it is frequent in the combination analysis file (increasing characters to “-”. The polymorphic WannaCry executable file
TF), but occurs in (only) 14 out of 21 normal performance is then submitted to Cuckoo Sandbox, and none of the 63 virus
analysis (so IDF is not too small). Therefore, in the second databases of the VirusTotal scanner finds matched signatures
scenario, the TF-IDF weight of the feature is 69.1, which is of the polymorphic WannaCry malware. By using the same
higher than many of the WannaCry specific features. technique with the same four normal activity analysis reports
It is easy to mathematically prove, and we have empirically as shown in Experiment One (Section IV-A), the features and
verified that this method will produce false positives if and weights calculated by using the polymorphic malware analysis
only if non-malicious features occurring often in the document report but are the same as Table IV.
containing malware are infrequent elsewhere.
D. Analysis of Polymorphic WannaCry Malware V. C ONCLUSION AND F UTURE W ORK
As introduced in Section III, Cuckoo Sandbox analyzes all In this paper, we present a method to automatically extract
the behavioral activities of the submitted files while searching features of malware from host logs. Our experiments employed
on VirusTotal.com for matching 63 AV vendor’s signatures the relatively new and impactful WannaCry ransomware. For
with the suspicious file. The WannaCry malware is identified empirical validation we employed behavior logs from the
in Experiment IV-A and IV-B since we submitted a single analysis reports generated by Cuckoo Sandbox under various
WannaCry executable file to Cuckoo Sandbox. In Experiment scenarios of normal and malware activities. Our experimental
IV-C, we combined normal activities and the malware exe- results validate that the method can extract distinguishing
cutable file into one Python file. Although the same WannaCry features of the malware from logs containing a majority of
executable is called by the Python script, 0 of the 63 AV non-malicious events, and is robust to polymorphism. Most
vendors alert on it. We conjecture this is because the content of importantly, given a majority of ambient logs with ransomware
the Python file has no malware patterns. Although our method activities also included, accurate extraction of many ran-
can still identify the malware in this case, we design a final somware features are automatically identified.
experiment to validate that our TF-IDF method can identify Furthermore, we have identified and empirically exhibited
more subtle polymorphism. exactly how false indicators of malware could arise from our

5
method—by non-malware features occurring in the malware [6] C. E. R. Team-EU, “Wannacry ransomware
document, but relatively infrequently otherwise. In practice, campaign exploiting smb vulnerability,” May 2017.
for creating patterns from dynamic analysis, this scenario https://cert.europa.eu/static/SecurityAdvisories/2017/
is easily avoided. Testing the malware analysis and pattern CERT-EU-SA2017-012.pdf.
generation capability on ambient logs collected by operations [7] L. Pascu, “Wannacry hits honda factory in
will be next-step research. Our results in Experiment IV-C japan, 55 traffic cameras in australia,” June
indicate that in these adverse scenarios, although some highly 2017. https://businessinsights.bitdefender.com/
ranked features may be spurious, the majority of the ≈40 top- wannacry-ransomware-victims.
ten ranked features are accurate indicators. [8] D. Tsapakidis and A. Passidomo, “Detecting cryptowall
Although presentation of the method and results is outside 3.0 using real time event correlation,” January 2017.
the scope of this paper, the TF-IDF approach gives better [9] Y. Sela, “Anatomy of cryptowall 3.0 virus—a look inside
results for analyzing WannaCry malware than other discrimi- ransomware code and tactics,” February 2015.
nant analysis algorithms based on Fisher’s Linear Discriminant [10] R. A. Awad and K. D. Sayre, “Automatic clustering of
Analysis [29]. Further, the preservation of understandable malware variants,” in Intelligence and Security Informat-
features is a prerequisite for automating malware analysis that ics (ISI), 2016 IEEE Conference on, pp. 298–303, IEEE,
TF-IDF provides that other analysis capabilities do not. 2016.
Future research will consider integration with other detec- [11] A. Provataki et al., “Differential malware forensics,”
tion systems [30, 31] for automatic pattern generation, or Digital Investigation, vol. 10, no. 4, pp. 311 – 322, 2013.
enhancing autonomic security systems [32, 33]. Overall, we [12] J. M. Ceron et al., “Mars: An sdn-based malware analysis
hope this contribution leads to operational implementations solution,” in 2016 IEEE Symposium on Computers and
to expedite manual analysis of logs, malware analysis, and Communication (ISCC), pp. 525–530, June 2016.
to provide accurate pattern generation from both dynamic [13] L. Rudman et al., “Dridex: Analysis of the traffic and au-
analysis tools and host logs. tomatic generation of iocs,” in 2016 Information Security
for South Africa (ISSA), pp. 77–84, Aug 2016.
ACKNOWLEDGEMENTS [14] P. Shijo et al., “Integrated static and dynamic analysis for
Author’s wish to thank Laurence Nichols III, Maria McClel- malware detection,” Procedia Computer Science, vol. 46,
land, Mark Pleszkoch, Mike Iannacone, Sarah Powers and the pp. 804 – 811, 2015. Proceedings of the International
anonymous reviewers for helping us formulate, complete, and Conference on Information and Communication Tech-
polish this work. nologies, ICICT 2014, 3-5 December 2014 at Bolgatty
This material is based on research sponsored by the follow- Palace & Island Resort, Kochi, India.
ing: Laboratory Directed Research and Development Program [15] C. Lim et al., “Mal-ONE: A unified framework for fast
of Oak Ridge National Laboratory, managed by UT-Battelle, and efficient malware detection,” in 2014 2nd Interna-
LLC, for the U. S. Department of Energy, contract DE-AC05- tional Conference on Technology, Informatics, Manage-
00OR22725, DOE IJC3 Cyber R&D Effort, and the National ment, Engineering Environment, pp. 1–6, Aug 2014.
Science Foundation under Grant No.1700391. Any opinions, [16] M. Vasilescu et al., “Practical malware analysis based
findings, and conclusions or recommendations expressed in on sandboxing,” in 2014 RoEduNet Conference 13th
this material are those of the authors and do not necessarily Edition: Networking in Education and Research Joint
reflect the views of the National Science Foundation. Event RENAM 8th Conference, pp. 1–6, Sept 2014.
[17] R. Mosli et al., “Automated malware detection using
R EFERENCES artifacts in forensic memory images,” in 2016 IEEE Sym-
[1] “Honda halts japan car plant after wannacry virus hits posium on Technologies for Homeland Security (HST),
computer network,” June 2017. http://www.reuters.com/ pp. 1–6, May 2016.
article/us-honda-cyberattack-idUSKBN19C0EI. [18] N. Kawaguchi et al., “Malware function classification
[2] “Detecting cryptowall 3.0 using real time event using apis in initial behavior,” in 2015 10th Asia Joint
correlation.” https://digitalguardian.com/blog/ Conference on Information Security, pp. 138–144, May
detecting-cryptowall-3-using-real-time-event-correlation. 2015.
[3] “Crypto ransomware us-cert.” https://www.us-cert.gov/ [19] S. Kumar et al., “Machine learning classification model
ncas/alerts/TA14-295A. for network based intrusion detection system,” in 2016
[4] K. Murnane, “The malwarebytes report: The 11th International Conference for Internet Technology
2016 malware threat landscape,” Jan 2017. and Secured Transactions (ICITST), pp. 242–249, Dec
https://www.forbes.com/sites/kevinmurnane/2017/01/31/ 2016.
the-malwarebytes-report-the-2016-malware-threat-landscape/ [20] Y. Qiao et al., “Analyzing malware by abstracting the
#50be043721ee. frequent itemsets in api call sequences,” in 2013 12th
[5] M. LABS, “2017 state of malware report,” 2017. IEEE International Conference on Trust, Security and
https://www.malwarebytes.com/pdf/white-papers/ Privacy in Computing and Communications, pp. 265–
stateofmalware.pdf. 270, July 2013.

6
[21] S. S. Hansen et al., “An approach for detection and
family classification of malware based on behavioral
analysis,” in 2016 International Conference on Comput-
ing, Networking and Communications (ICNC), pp. 1–5,
Feb 2016.
[22] A. Pekta, , et al., “Runtime-behavior based malware clas-
sification using online machine learning,” in 2015 World
Congress on Internet Security (WorldCIS), pp. 166–171,
Oct 2015.
[23] D. Kirat et al., “Barecloud: Bare-metal analysis-based
evasive malware detection,” in Proceedings of the 23rd
USENIX Conference on Security Symposium, SEC’14,
(Berkeley, CA, USA), pp. 287–301, USENIX Associa-
tion, 2014.
[24] A. Fujino et al., “Discovering similar malware sam-
ples using api call topics,” in 2015 12th Annual IEEE
Consumer Communications and Networking Conference
(CCNC), pp. 140–147, Jan 2015.
[25] A. Kharaz et al., “Unveil: A large-scale, automated
approach to detecting ransomware,” in 25th USENIX
Security Symposium (USENIX Security 16), (Austin, TX),
pp. 757–772, USENIX Association, 2016.
[26] “Wannacry technical analysis,” May 2017.
http://support.threattracksecurity.com/support/solutions/
articles/1000250396-wannacry-technical-analysis.
[27] Q. Chen and R. A. Bridges, “Technical report: Auto-
mated behavioral analysis of malware a case study of
wannacry ransomware,” 2017. https://arxiv.org/abs/1709.
08753[online].
[28] G. Salton and C. Buckley, “Term-weighting approaches
in automatic text retrieval,” Information Processing &
Management, vol. 24, no. 5, pp. 513 – 523, 1988.
[29] M. Welling, “Fisher linear discriminant analysis,” De-
partment of Computer Science, University of Toronto,
vol. 3, no. 1, 2005.
[30] C. R. Harshaw et al., “Graphprints: Towards a graph
analytic method for network anomaly detection,” CISRC
’16, (New York, NY, USA), pp. 15:1–15:4, ACM, 2016.
[31] E. M. Ferragut et al., “A new, principled approach to
anomaly detection,” in 11th International Conference on
Machine Learning and Applications, vol. 2, pp. 210–215,
Dec 2012.
[32] Q. Chen et al., “A model-based validated autonomic ap-
proach to self-protect computing systems,” IEEE Internet
of Things Journal, vol. 1, pp. 446–460, Oct 2014.
[33] Q. Chen et al., “A model-based approach to self-
protection in computing system,” in Proceedings of the
2013 ACM Cloud and Autonomic Computing Conference,
CAC ’13, (New York, NY, USA), pp. 16:1–16:10, ACM,
2013.

View publication stats

You might also like