
Visvesvaraya Technological University

Belgaum, Karnataka-590 014

A Technical Seminar report on


“ROMADROID: A robust and efficient technique for
detecting android app clones”
Submitted in partial fulfilment of the requirement for the award of the degree of
Bachelor of Engineering in
Computer Science and Engineering

Submitted by
SAURABH KUSHWAHA 1SP16CS090

Under the Guidance of


Dr. B. Loganayagi
Professor, Dept of CSE

Department of Computer Science and Engineering


S.E.A. COLLEGE OF ENGINEERING & TECHNOLOGY,
BANGALORE - 49
2019-2020
S.E.A COLLEGE OF ENGINEERING AND TECHNOLOGY
BENGALURU - 560049
(Affiliated to Visvesvaraya Technological University, Belagavi)

Department of Computer Science and Engineering

CERTIFICATE
Certified that the technical seminar entitled “ROMADROID: A robust and
efficient technique for detecting android app clones” was carried out by SAURABH
KUSHWAHA (1SP16CS090), a bona fide student of S.E.A College of Engineering &
Technology, in partial fulfilment of the requirements for the award of Bachelor of Engineering
in Computer Science and Engineering of the Visvesvaraya Technological University, Belgaum,
during the year 2019-2020. It is certified that all corrections/suggestions indicated for
internal assessment have been incorporated in the report deposited in the department library.
The report has been approved as it satisfies the academic requirements in respect of
technical seminar prescribed for the said degree.

Dr. B. LOGANAYAGI Dr. K. SUNDEEP KUMAR Dr. K. SURESH


Signature of the Guide Signature of the HOD Signature of the Principal

EXTERNAL EXAMINER Signature with date:

1.

2.
ABSTRACT

As mobile technology expands and spreads its reach over the entire modern world, the usage of
applications also increases drastically. With Android being the most popular operating system on
smartphones, applications (apps) play a major role in its daily functionality and serve users
in various ways. Since illegal users frequently make a copy of a legitimate Android app and
redistribute the plagiarized app for commercial or malicious purposes, many studies have been
conducted to detect repackaged/cloned apps and make the Android ecosystem safer. A malicious
attacker might apply code obfuscation to avoid app clone detection. Therefore, it is necessary to
consider the effects of code obfuscation when detecting cloned apps.

A tool called RomaDroid is implemented, which can efficiently detect cloned apps based on
features inherent in each app’s AndroidManifest.xml file. The manifest file is an XML structure
defined by tags and attributes, and its XML document can be modeled as an ordered labeled tree.
RomaDroid creates a string from the hierarchical tree structure of tags as well as the class
names of the components related to intent-filter tags in the manifest file, which are robust to code
obfuscation. That is, a string is created from the manifest file of each of the two apps to be
compared, and the similarity between the two strings is measured with the longest common subsequence
(LCS) algorithm. If the measured similarity exceeds a certain threshold, the two apps are
determined to be a clone pair (or similar app pair). To validate RomaDroid, various
experiments are performed with both non-obfuscated apps and their obfuscated versions generated by three
obfuscation tools. The experimental results show that RomaDroid accurately detects cloned
apps even when code obfuscation has been applied.
ACKNOWLEDGEMENT

Firstly, I thank the management and the late Shri A. Krishnappa, Chairman, S E A College of
Engineering and Technology, for providing the necessary infrastructure and creating a functional
environment.

I would like to express my thanks to our college principal Dr. K Suresh for the assistance
and support given by him.

I would like to express my sincere thanks to Dr. Sundeep Kumar, HOD of Computer
Science and Engineering for his encouragement, guidance and motivation.

I have got an opportunity to develop my presentation skills while undertaking the technical
seminar entitled “ROMADROID: A robust and efficient technique for detecting android app
clones”.

I give my sincere thanks to my guide Dr. B. Loganayagi, who has always been a guiding
force. She brought my attention to various cyber security related topics, which helped me gain
awareness about them. She also enlightened me by giving precautionary tips on how to protect the
data stored in our computers.

I also extend my thanks to all faculty members of COMPUTER SCIENCE Dept. who
gave me valuable guidance and help when I needed it since I started learning about the seminar
topic.

My obligations remain due to all those people who have directly or indirectly helped me in
successful completion of the technical seminar. No amount of words written here will suffice for
my sense of gratitude towards them all.

SAURABH KUSHWAHA

(1SP16CS090)
TABLE OF CONTENTS

CHAPTER  CHAPTER NAME

1        INTRODUCTION
1.1      Literature Survey
1.2      Objectives
1.3      Existing System
1.4      Proposed System

2        SYSTEM DESIGN
2.1      Code Obfuscation
2.2      File structure of AndroidManifest.xml
2.3      Proposed Architecture
2.4      Representing XML file as a string
2.5      Handling Intent related data
2.6      How are two apps compared

3        EVALUATION
3.1      Assembling the dataset
3.2      Performance in terms of similarity threshold
3.3      Performance depending on obfuscation tools
3.4      Performance depending on the number of components having intent-filters
3.5      Evaluating obfuscation resilience

4        CONCLUSION

5        REFERENCES
FIGURE INDEX

FIG NO.  FIG NAME

2.1      AndroidManifest.xml file structure
2.3.1    Detection of app clones in RomaDroid
2.4.1    XML file with corresponding feature string
2.5.1    An example of feature information represented as a tree (left) and with class names (right)
2.6.1    Algorithm to generate feature information
3.1      Process to evaluate performance
3.2      Similarity threshold performance of RomaDroid and SimiDroid
TABLE INDEX

TABLE NO.  TABLE NAME

3.3        Detection performance for apps obfuscated by each obfuscation tool
3.4        Accuracy of detection depending on the number of components
3.5        RomaDroid’s performance with different apps


ROMADROID: A ROBUST AND EFFICIENT TECHNIQUE FOR DETECTING ANDROID APP CLONES

Chapter 1
INTRODUCTION

Android has been, and continues to be, by far the most dominant operating system on smartphones. In
this modern age, smartphones have become an essential part of our daily lives, as they can perform several tasks
using applications belonging to various categories such as health and fitness apps, business apps, productivity
apps, games, social media apps, and entertainment apps. Many of these applications contain critical
information, not only the business logic but also the data gathered from users. According to Statista, there
were around 2.56 million Android apps as of the first quarter of 2020, and Android has a massive market share of
88%. Because of this, attackers clone Android applications in popular categories and redistribute them
for malicious purposes, seeking financial gain.
Unsuspecting users may install the modified or cloned version of a genuine app and risk
losing their data and, in more severe cases, suffering a monetary or asset loss as well.
Hence, it becomes essential to eradicate cloned apps and prevent their spread. This is where
RomaDroid comes in: it detects obfuscated as well as non-obfuscated app clones on Android
markets. The RomaDroid tool uses the hierarchical tree structure of each app’s manifest file, together with the
implicit intents and their component information inside the file, to achieve robustness even when the cloned apps
are obfuscated.

[Dept of CSE, SEACET] 2019-2020 Page 1



1.1 Literature Survey

1.1.1 Guilty or not guilty: Using clone metrics to determine open source licensing violations

In this paper [1], the authors have tried to tackle two problems:

• What metrics are appropriate in evaluating the number and size of code clones between two
programs?
• What is the lower bound of code clone measurement needed to conclude that the suspected program
is guilty, and what is the upper bound needed to conclude it isn’t?

To answer the first question, the authors reviewed several potential clone metrics including ones
from www.ccfinder.net, such as the ratio of similarity between another file and coverage. From these, they
selected three clone-based measures: maximum length of clones (MLC), number of clone pairs (NCP), and
clone-based local similarity (LSim), which looks at the percentage of duplication within a suspicious pair.
To answer the second question, the authors first established a framework for defining guilty and not guilty
and then used their metrics to determine the upper and lower bounds. To identify these
bounds experimentally, the authors analysed 1,225 pairs of OSS products for reuse-based clones.

1.1.2 Dissecting Android Malware: Characterization and evolution

This paper [2] focuses on the Android platform and aims to systematize or characterize existing Android
malware. In particular, with more than a year of effort, the authors collected more than 1,200 malware
samples that cover the majority of existing Android malware families, ranging from their debut in August
2010 to recent ones in October 2011. In addition, the samples were systematically characterized from various
aspects, including their installation methods, activation mechanisms, and the nature of the carried
malicious payloads. The characterization and a subsequent evolution-based study of representative families
reveal that they are evolving rapidly to circumvent detection by existing mobile anti-virus software.
In an evaluation with four representative mobile security software products, the
best case detected 79.6% of the samples in the dataset while the worst case detected only 20.2%.


1.1.3 On the robustness of clone detection to code obfuscation

In this paper [3], the authors present a framework for semi-automated code obfuscation. Additionally, they
present a case study that evaluates the robustness of selected clone detectors against such obfuscations,
addressing the problem of code cloning, which may also have legal implications such as license violations.

1.1.4 Adrob: Examining the landscape and impact of android application plagiarism

In this paper [4], the authors investigate application plagiarism on Android markets at a large scale. They
take the first step to characterize plagiarized applications and estimate their impact on the original
application developers. They crawled 265,359 free applications from 17 Android markets around the world
and ran a tool to identify similar applications ("clones"). Based on the data, they examined properties of the
cloned applications, including their distribution across different markets, application categories, and ad
libraries. Next, they examined how cloned applications affect the original developers. They also captured
HTTP advertising traffic generated by mobile applications at a tier-1 US cellular carrier for 12 days. To
associate each Android application with its advertising traffic, the authors extracted a unique advertising
identifier (called the client ID) from both the applications and the network traces. The authors estimate a
lower bound on the advertising revenue that cloned applications siphon from the original developers, and the
user base that cloned applications divert from the original applications.

1.1.5 AnDarwin: Scalable Detection of Android Application Clones Based on Semantics

Here [5], a technique is proposed to detect similar Android apps using their semantic features. A tool
called AnDarwin is implemented and evaluated on 265,359 apps collected from 17 markets
including Google Play and numerous third-party markets. In contrast to earlier approaches, AnDarwin has
four advantages: it avoids comparing apps pairwise, thus greatly improving its scalability; it analyzes only
the app code and does not rely on other information — such as the app’s market, signature, or description —
thus greatly increasing its reliability; it can detect both full and partial app similarity; and it can
automatically detect library code and remove it from the similarity analysis. Two use cases are presented for
AnDarwin: finding similar apps by different developers (“clones”) and similar apps from the same developer
(“rebranded”). In ten hours, AnDarwin detected at least 4,295 apps that are the victims of cloning and
36,106 rebranded apps.

1.1.6 A scalable and accurate two-phase approach to Android app clone detection

This paper [7] proposes WuKong, a two-phase detection approach that includes a coarse-grained detection
phase to identify suspicious apps by comparing light-weight static semantic features, and a fine-grained
phase to compare more detailed features for only those apps found in the first phase. To further improve the
detection speed and accuracy, an automated clustering-based preprocessing step is also introduced to filter
third-party libraries before conducting app clone detection. Experiments on more than 100,000 Android apps
collected from five Android markets demonstrate the effectiveness and scalability of this approach.


1.2 Objectives
• To detect clones of a genuine Android app by studying their respective AndroidManifest.xml
files.
• To implement a robust tool for clone detection.
• To detect both obfuscated and non-obfuscated Android app clones.
• To study the similarity threshold used to label an app as a clone.
• To achieve high accuracy and scalability in detection.

1.3 Existing System

• String-based, token-based, AST-based, and simpler hashing-based approaches are fast, but they are
typically not as accurate.

• Existing detection techniques are not very scalable and become unreliable when scaled.

• Text-based clone detectors are not effective in identifying obfuscated clones.

1.4 Proposed System

In this paper, the authors address the issues above and present a robust and accurate technique
for detecting cloned Android apps transformed by code obfuscation. They devise a new
efficient system, called RomaDroid, to detect obfuscated as well as non-obfuscated app
clones on Android markets. RomaDroid uses the hierarchical tree structure of each app’s manifest
file, together with the implicit intents and their component information inside the manifest file, to
achieve both accuracy and robustness even when illegally cloned apps are obfuscated in
order to conceal their piracy or reuse. RomaDroid is implemented and its effectiveness is verified using
148 original apps and their 620 obfuscated versions generated by three popular Android
obfuscators: ProGuard [8], DexGuard [9], and DashO [10]. The approach shows few false
positives, few false negatives, and low processing overhead. The main characteristics and
contributions of the study are summarized as follows:

• They develop a robust app clone detector and demonstrate its effectiveness on Android. The
detector can handle not only non-obfuscated apps but also obfuscated apps.
• The approach utilizes only simple features in the AndroidManifest.xml file of each Android APK
package. The features consist of the tree structure of the manifest file and the class names of
components associated with each implicit intent in the manifest file, which are mostly resilient to
obfuscation. In contrast, most of the previous studies utilize the information in the DEX
file of each APK package.
• They achieve high accuracy and scalability in detecting cloned apps by using the longest
common subsequence (LCS) algorithm to compare two apps after converting the features in
each manifest file into a string sequence.

1.4.1 Advantages

• Able to detect both obfuscated and non-obfuscated Android app clones.
• Robust methodology to detect app clones.
• Less computing resource overhead.
• High accuracy and scalability in detection.


Chapter 2

SYSTEM DESIGN
Before looking into the details of the methodology used to design RomaDroid, two topics need to be
introduced:

• Code obfuscation
• XML file structure

2.1 Code Obfuscation

Code obfuscation is a technique that can be used to protect the intellectual property of software against
reverse engineering and tampering attacks. Obfuscation transforms the program code and data of an app into
more complex and ambiguous representations while preserving the original behavior of the app.

Even though obfuscation cannot completely prevent mobile apps from being reverse engineered, it can make
reverse engineering harder and more time-consuming.

Therefore, code obfuscation is a common practice recommended in the Android developer guide to protect
security protocols and other application components from reverse engineering attacks.

To improve the health of the overall Android ecosystem, several obfuscation tools are available, ranging
from free, open-source solutions that provide only basic obfuscation features, such as ProGuard [8], to
premium tools with more advanced features, such as DexGuard [9].

• The free ProGuard [8] is integrated with the Android Software Development Kit (SDK) and supported
by the official Android Studio IDE. Additionally, other obfuscation tools inherit most of their
functionality from ProGuard.
• DexGuard [9] is the extended commercial version of ProGuard and utilizes name obfuscation with
the same basic functionality as ProGuard, but with additional features like string and class
encryption. Typically, apps obfuscated by DexGuard are viewed as difficult to reverse engineer
because the class and method names are replaced with non-ASCII characters and strings are
encrypted.
• DashO [10] provides enterprise-grade protection for Java and Android apps, reducing the risk of
intellectual property theft, piracy, and tampering. DashO’s obfuscation-related services
include renaming methods and variables, which makes source code much harder to
understand, encrypting strings in sensitive parts of the target app, and code optimization with
pruning.

2.1.1 Why code obfuscation?

• It prevents theft of intellectual property to a certain extent.
• Reverse engineering code that has been obfuscated using standard tools is a tough task.
• It prevents further financial or business loss that may occur due to cloning of a genuine app.


2.2 File structure of AndroidManifest.xml

The Android manifest file describes essential information about each app to the Android build tools,
the Android operating system, and Google Play. An AndroidManifest.xml file is present in each
Android app and has the structure shown in Fig 2.1. The manifest file is an XML structure
defined by tags, elements, and attributes.

Fig 2.1: AndroidManifest.xml file structure.

A tag is a markup construct that starts with ‘‘<’’ and ends with ‘‘>’’. XML distinguishes
between start-tags and end-tags: an end-tag starts with ‘‘</’’ and ends with ‘‘>’’. An element is a structure
that starts with a start-tag and ends with the matching end-tag. An attribute is a name-value pair that exists
within a start-tag.

As seen in Fig 2.1, the contents of the XML file can be represented as a tree, with the first tag as
the root node and any tag nested within a node as its child node. On the right side of the figure,
an ordered labeled tree is shown in which the <manifest> tag is the root node and the tags <permission>,
<uses-permission> and <application> are child nodes of the root and siblings of one another. The
<application> node is the parent of the <activity> node, and the <intent-filter> node is a child of the
<activity> node. The <action> and <category> nodes are leaf nodes, which have no child nodes.

A component can declare intent filters that describe how the component can be started. Permissions
are declared using the uses-permission tag followed by a common namespace (usually
android.permission.∗ for Google defined permission). Self-declared permissions which other apps can
request are labeled with the permission tag.
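
The tree described for Fig 2.1 can be reproduced with a short Python sketch. The manifest text below is a hypothetical example written to match the description above (it is not taken from the figure), and the helper names `tree_edges` and `leaf_tags` are illustrative only. The sketch assumes a plain-text manifest; the manifest inside an APK is stored in a binary form and would first have to be decoded.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest mirroring the tree described for Fig 2.1.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <permission android:name="com.example.app.MY_PERMISSION"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application>
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
  </application>
</manifest>"""


def tree_edges(manifest_xml):
    """Return the root tag and the (parent, child) tag pairs in document order."""
    root = ET.fromstring(manifest_xml)
    edges = []

    def visit(node):
        for child in node:  # children are iterated in document order
            edges.append((node.tag, child.tag))
            visit(child)

    visit(root)
    return root.tag, edges


def leaf_tags(manifest_xml):
    """Return the tags of all nodes that have no child nodes."""
    root = ET.fromstring(manifest_xml)
    return [node.tag for node in root.iter() if len(node) == 0]


if __name__ == "__main__":
    root_tag, edges = tree_edges(SAMPLE_MANIFEST)
    print("root:", root_tag)
    for parent, child in edges:
        print(f"{parent} -> {child}")
    print("leaves:", leaf_tags(SAMPLE_MANIFEST))
```

Because the children are visited in document order, the edge list preserves the sibling order, which is what makes the tree an ordered labeled tree.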


The Intent is an abstract description of an operation to be performed. It is a unidirectional
message with an arbitrary action string that can be broadcast to all apps or sent to a specific app.
When an app issues an intent to the system, the system locates an app component that can handle the
intent based on the intent filter declarations in a manifest file. An app component can have any number of
intent filters (defined with the <intent-filter> tag). An <intent-filter> tag specifies the types of intents
that an activity, service, or broadcast receiver can respond to. An intent filter can list several actions.

In order to detect cloned apps, several features were extracted from each app’s manifest file such as
the tree structure of tags, intent-filter tags, the parent component names of the intent filter tags, etc.


2.3 Proposed Architecture

A new effective approach is proposed to detect app clones based on a manifest file. It is assumed that
cloned apps are repackaged after being obfuscated. In the AndroidManifest.xml with a hierarchical
tree structure, the intent-filter tags and the app’s component tags associated with them are required to
respond to external requests and cannot be obfuscated. The tree structure and those intents’ parent
components are used as program features. For a single app, a string is constructed that represents the
tree structure of the manifest file and reflects the intent-filter tags and their parent component tags. The
longest common subsequence (LCS) algorithm is used to measure the similarity of two strings. The
process of detection is illustrated in Fig 2.3.1.

Fig 2.3.1: Detection of app clones in RomaDroid.

2.3.1 Why use the AndroidManifest.xml file for detection?

• The manifest file contains some information that cannot be obfuscated, because without it an app
cannot interact with the Android system and other apps.
• The XML tree structure and the class names of the parent components of the <intent-filter> tags are
used as feature information.
• The tags in each AndroidManifest.xml form a tree structure of up to five levels, and each tag must be
specified at a fixed level. Some tags in the manifest file of an app should not be transformed at
random by obfuscation tools, only manually by developers.


2.4 Representing XML file as a string

The tags in an AndroidManifest.xml form a tree data structure. The file uses space characters
before each tag to indicate the level of the tag: lower-level tags are indented further than
upper-level tags. The following structural characteristics of the XML file are used:

• The tags of the XML file form a tree structure, and in AndroidManifest.xml the tree level of each tag is fixed.
The tags are specified in depth-first search order.
• The inclusion relation between a parent and a child node is expressed by declaring the child tag inside the
parent tag. In addition, a parent tag is distinguished from a child tag by whitespace characters, as in the
example shown in Fig 2.4.1.

Fig 2.4.1: XML file with corresponding feature string.

A single string can be constructed from the tree structure, reflecting the number of blanks in front of each
tag. Fig 2.4.1 shows an example feature string, where ‘1B’ means one blank, ‘2B’ means two blanks,
and so on. A tag declared at ‘2B’ is a child node of a tag declared at ‘1B’, and a tag declared at ‘4B’
is a child node of a tag declared at ‘3B’. Reading the file sequentially from the beginning is equivalent to
traversing the tree structure in depth-first search order. A feature string thus represents the tree structure
in this order.
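
As a minimal sketch of this idea, the following Python function walks a manifest depth-first and emits one ‘nB’ marker per tag. The exact string layout of Fig 2.4.1 is not reproduced here, so the output format (`nB` followed by the tag in angle brackets) is an assumption, as is the use of the tree depth in place of the literal number of blanks in the file. The function name `feature_string` is hypothetical.

```python
import xml.etree.ElementTree as ET


def feature_string(manifest_xml):
    """Build a feature string by visiting tags in depth-first order.

    Assumption: the depth of a tag stands in for the number of blanks in
    front of it, with the root at depth 1 ('1B').
    """
    root = ET.fromstring(manifest_xml)
    parts = []

    def visit(node, depth):
        parts.append(f"{depth}B<{node.tag}>")
        for child in node:  # document order == depth-first order
            visit(child, depth + 1)

    visit(root, 1)
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical manifest used only for illustration.
    manifest = (
        '<manifest xmlns:android="http://schemas.android.com/apk/res/android">'
        "<application>"
        '<activity android:name=".Main">'
        "<intent-filter>"
        '<action android:name="android.intent.action.MAIN"/>'
        "</intent-filter>"
        "</activity>"
        "</application>"
        "</manifest>"
    )
    print(feature_string(manifest))
```

Attribute values such as package names are deliberately left out of the string, so renaming them does not change the result.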


2.5 Handling Intent related data

The components that have an <intent-filter> tag and are not declared with android:exported="false" cannot
be obfuscated, as mentioned above. The class names of these components are added to the feature
information. The left side of Fig 2.5.1 is a partial representation of an AndroidManifest.xml tree.

Fig 2.5.1: An example of feature information represented as a tree (left) and with class names (right).

It is assumed that the first level (the root node) begins with one blank, the third level (component nodes)
with three blanks, the fourth level (intent-filter nodes) with four blanks, and the last level (leaf nodes) with
six blanks. In the example, nodes A and B represent Android app components that include intent-filter
tags. If a component node has an intent-filter child and is not declared with android:exported="false", the hash
value of the component’s class name is added to its feature string. A cryptographic hash function such as
SHA-256 is used because hash values have the same length regardless of the lengths of the class names, which
reduces the influence of class names of different lengths. If a component node does not have an
intent-filter child, the feature information remains the same. For a component that has an intent-filter node
as a child, its package name is usually obfuscated, but its class name is not. Therefore, the
feature information is resilient to code obfuscation.
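
The rule above can be sketched in Python as follows. Several details are assumptions made for illustration and are not confirmed by the report: the hash is appended directly after the component's tag marker, the class name is taken as the last dot-separated segment of the `android:name` attribute, and the function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace that ElementTree prepends to android:* attribute names.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def class_name_hash(node):
    """Return the SHA-256 hex digest of a component's class name.

    Returns '' unless the node has an <intent-filter> child and is not
    declared with android:exported="false".
    """
    has_filter = node.find("intent-filter") is not None  # direct children only
    exported_false = node.get(ANDROID_NS + "exported") == "false"
    name = node.get(ANDROID_NS + "name")
    if not has_filter or exported_false or not name:
        return ""
    # Assumption: only the simple class name is hashed, since the package
    # part is what renaming obfuscation usually changes.
    simple_name = name.rsplit(".", 1)[-1]
    return hashlib.sha256(simple_name.encode("utf-8")).hexdigest()


def feature_string_with_hashes(manifest_xml):
    """Depth-first feature string with class-name hashes for qualifying components."""
    root = ET.fromstring(manifest_xml)
    parts = []

    def visit(node, depth):
        parts.append(f"{depth}B<{node.tag}>{class_name_hash(node)}")
        for child in node:
            visit(child, depth + 1)

    visit(root, 1)
    return "".join(parts)
```

Every SHA-256 hex digest is 64 characters long, so a long class name contributes exactly as much to the feature string as a short one.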


2.6 How are two apps compared?

Fig 2.6.1 shows how to extract the tree structure and component information with an intent-filter and
to generate the final feature string. The similarity between two strings can be measured by the
longest common subsequence (LCS) algorithm.

Fig 2.6.1: Algorithm to generate feature information.

Given two strings X and Y, the longest common subsequence of X and Y is the longest sequence Z that
is a subsequence of both X and Y.


For example, let

• X = (ABRACADABRA) and
• Y = (YABBADABBADOO).

Then the longest common subsequence Z is (ABADABA), which has length 7. The similarity ratio of two
strings s1 and s2 is calculated as follows:

Similarity(s1, s2) = |LCS(s1, s2)| / max(|s1|, |s2|)

The denominator is the length of the longer of s1 and s2. For example, given s1 = ‘‘ABDF’’ and
s2 = ‘‘ABDEF’’, the LCS is ‘‘ABDF’’ and the similarity ratio is 4/5 = 0.8. On the right side of Fig 2.5.1, the two
apps differ only in that the bottom app has one more node on the fourth level. The two feature strings therefore
differ by only two bytes, so a high similarity ratio can be expected.
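
The similarity computation can be sketched in Python with the standard dynamic-programming LCS. This is an illustrative implementation rather than the tool's own code: it compares strings character by character, treats two empty strings as identical, and uses a default threshold of 0.9, the value the evaluation chapter settles on. The function names are hypothetical.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(s1, s2):
    """|LCS(s1, s2)| / max(|s1|, |s2|)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return lcs_length(s1, s2) / longest


def is_clone_pair(s1, s2, threshold=0.9):
    """Two apps form a clone pair when the similarity exceeds the threshold."""
    return similarity(s1, s2) > threshold


if __name__ == "__main__":
    print(lcs_length("ABRACADABRA", "YABBADABBADOO"))  # 7, i.e. ABADABA
    print(similarity("ABDF", "ABDEF"))                 # 0.8
```

The table-based algorithm runs in time proportional to |s1| × |s2|, which stays small here because a feature string grows with the number of manifest tags rather than with the size of the app's code.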


Chapter 3

EVALUATION
Fig 3.1 shows the overall process of the experiment to evaluate the performance of RomaDroid. ProGuard [8],
DexGuard [9], and DashO [10] are used to create obfuscated app clones from original (non-obfuscated)
apps. Next, a feature string is generated for each app, and the similarity ratio
between each pair of apps is calculated from the feature strings.

To validate the proposed technique, the following definitions and equations are used:

• True Positive (TP): For an app A, TP refers to the number of samples correctly classified as A,
where an app can be either a non-obfuscated app or an obfuscated app.
• True Negative (TN): For an app non-A, TN refers to the number of samples correctly classified
as non-A.
• False Positive (FP): For an app non-A, FP refers to the number of samples incorrectly classified
as A.
• False Negative (FN): For an app A, FN refers to the number of samples incorrectly classified as
non-A.

If the feature information is too generic, false positives will increase.

Fig 3.1: Process to evaluate performance.


As false positives go up, accuracy, precision, and F1 score go down, while the false positive rate goes up.
These metrics are defined as:

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 Score = (2 ∗ Precision ∗ Recall) / (Precision + Recall)
• FalsePositiveRate = FP / AllComparecase
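
The formulas above translate directly into a small Python helper. One point is an assumption: ‘AllComparecase’ is read here as the total number of compared cases, TP + TN + FP + FN. The function name and the counts in the usage example are made up for illustration and are not results from the experiments.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from the four outcome counts."""
    total = tp + tn + fp + fn  # assumption: this is 'AllComparecase'
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up counts: the second call turns 10 true negatives into false positives.
    few_fp = detection_metrics(tp=8, tn=88, fp=2, fn=2)
    many_fp = detection_metrics(tp=8, tn=78, fp=12, fn=2)
    for name in few_fp:
        print(f"{name}: {few_fp[name]:.3f} -> {many_fp[name]:.3f}")
```

Running the example shows accuracy, precision, and F1 score falling and the false positive rate rising as false positives increase, while recall is unchanged because it does not depend on FP.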


3.1 Assembling the dataset

To construct the dataset for the experiment, 148 original (non-obfuscated) apps were collected from
F-Droid, a repository of free and open-source software for the Android platform. Next, three
obfuscation tools were used to generate obfuscated versions of the original apps:

• ProGuard [8]
• DexGuard [9]
• DashO [10]

Two obfuscation options of each tool were applied to each app: for DexGuard, the renaming obfuscation
option and the string encryption option; for ProGuard, the default option and the default with the optimizing
option; and for DashO, the renaming obfuscation option and the control flow modification option.
The proposed approach is mainly influenced by renaming obfuscation. Since the
obfuscators apply each option independently, the result of renaming obfuscation is almost unchanged
regardless of how many obfuscation options are applied at the same time. Therefore, cloned apps were
made by applying one option at a time.

Theoretically, there should be 888 obfuscated app clones (148 apps ∗ 3 tools ∗ 2 options). However, the total
count was 620 clones because an obfuscator sometimes fails to obfuscate an app. As a result, 768
apps were used for the experiment, consisting of 148 original apps and 620 obfuscated apps.


3.2 Performance in terms of similarity threshold

Fig 3.2 shows the detection performance of RomaDroid and SimiDroid in terms of the similarity
threshold. When the similarity ratio exceeds the threshold value, two apps are considered to be the same. To
determine an appropriate similarity threshold value, experiments were carried out using 14
original apps and 62 obfuscated apps randomly selected from the dataset. The results are shown in
Fig 3.2.

Fig 3.2: Similarity threshold performance of RomaDroid and SimiDroid

In the case of RomaDroid, the overall detection performance improves as the threshold increases up to
90%. However, when the threshold is 99%, the F1 score decreases significantly. With a 99% threshold,
small changes in the feature information cause two apps to be judged different, and thus more false
negatives are generated than with a 90% threshold. Therefore, the similarity threshold of RomaDroid is
set to 90% in the experiments. In the case of SimiDroid, on the other hand, the lower the threshold, the
slightly higher the accuracy and F1 score. The objective of SimiDroid is to identify and explain
similarities and changes among app versions and among repackaged apps. To calculate a similarity
score for two given apps (app1, app2), SimiDroid adopts four similarity metrics: identical methods,
similar methods, new methods, and deleted methods between the two compared apps. SimiDroid
computes the similarity score using a formula based on these four metrics. Fig 3.2 shows that the
performance of SimiDroid is hardly affected by the pre-defined threshold, which appears to be a
consequence of this four-metric formula.
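
The effect of the threshold on false negatives can be illustrated with a minimal sketch. It assumes each compared pair comes with a similarity ratio and a ground-truth label; the function name `evaluate` and the toy pairs are hypothetical and are not taken from the RomaDroid experiments.

```python
def evaluate(pairs, threshold):
    """Return (precision, recall, f1) for a list of (similarity, is_clone) pairs.

    A pair is predicted to be a clone when its similarity exceeds the threshold.
    """
    tp = fp = fn = 0
    for similarity, is_clone in pairs:
        predicted_clone = similarity > threshold
        if predicted_clone and is_clone:
            tp += 1   # clone correctly detected
        elif predicted_clone and not is_clone:
            fp += 1   # different apps judged to be the same
        elif is_clone:
            fn += 1   # clone missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical similarity ratios, for illustration only.
toy_pairs = [(0.995, True), (0.93, True), (0.91, True), (0.92, False), (0.60, False)]

for threshold in (0.90, 0.99):
    precision, recall, f1 = evaluate(toy_pairs, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")
```

On this toy data, the stricter threshold rejects two true clones whose similarity is slightly below 99%, so recall drops. This is the same mechanism the section describes for the real dataset.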


3.3 Performance depending on the obfuscation tools

Experiments on the dataset were performed to find out how the performance varies depending on the
obfuscation tool. Table 3.3 shows the results of detecting apps obfuscated by the three tools. The
#Apps column displays the total number of apps obfuscated by each obfuscation tool.

Table 3.3: Detection performance for apps obfuscated by each obfuscation tool.

DashO sometimes transforms even the names of components with intent-filters, so the probability of
false negatives is higher than for the other two tools. For apps obfuscated by ProGuard and DexGuard,
false negatives can occur when the app is very small and only the main activity has intent-filters. In
that case the feature information consists of a tree structure only and is very short. In addition, this
structure might be transformed during obfuscation or decompilation. As a result, two apps are
mistakenly considered to be different. The false positive rate is highest for apps obfuscated by
ProGuard, and this is closely related to the number of components in an app. For ProGuard, the
proportion of apps with fewer than two components was higher than for DexGuard or DashO. The
smaller the number of components, the shorter the feature information. When a feature information
string is short, the false positive rate increases if two component names happen to be the same. Next,
we explore how the number of components having intent-filters affects detection performance.


3.4 Performance depending on the number of components having intent-filters

The detection performance depending on the number of components that have intent-filters and do not
have the android:exported="false" attribute is shown in Table 3.4. The number of original apps is 148,
the number of obfuscated apps is 620, and the main activity information is not included in the feature
information.
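
A minimal sketch of how such components could be counted from a decoded AndroidManifest.xml follows. It is a simplified illustration, not RomaDroid's actual feature extraction. The function name, the set of component tags, and the sample manifest are assumptions, and the sketch does not treat the main activity separately.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in a decoded manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
# Component tags considered here (an assumption of this sketch).
COMPONENT_TAGS = ("activity", "activity-alias", "service", "receiver")


def components_with_intent_filters(manifest_xml):
    """Return names of components that declare an intent-filter and are
    not marked android:exported="false"."""
    root = ET.fromstring(manifest_xml.strip())
    application = root.find("application")
    if application is None:
        return []
    names = []
    for component in application:
        if component.tag not in COMPONENT_TAGS:
            continue
        has_filter = component.find("intent-filter") is not None
        exported = component.get(f"{{{ANDROID_NS}}}exported")
        if has_filter and exported != "false":
            names.append(component.get(f"{{{ANDROID_NS}}}name"))
    return names


# Hypothetical manifest, for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
  <application>
    <activity android:name=".ShareActivity">
      <intent-filter>
        <action android:name="android.intent.action.SEND"/>
      </intent-filter>
    </activity>
    <service android:name=".SyncService" android:exported="false">
      <intent-filter>
        <action android:name="com.example.demo.SYNC"/>
      </intent-filter>
    </service>
    <activity android:name=".SettingsActivity"/>
  </application>
</manifest>
"""

print(components_with_intent_filters(SAMPLE_MANIFEST))
```

In the sample, only the activity with an intent-filter and no android:exported="false" attribute is counted. The service is skipped because it is explicitly not exported, and the second activity is skipped because it has no intent-filter.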

Table 3.4: Accuracy of detection depending on the number of components.

The “No. of components” column represents the number of components with intent-filters. Of the
620 obfuscated apps, 92 apps have no component with intent-filters, 291 apps have one, 99 apps have
two, 60 apps have three, 26 apps have four, and 52 apps have more than four. For each row,
148 ∗ #Apps pairs were compared and the results were averaged. For example, 148×620 pairs were
compared in row 1 and 148×528 in row 2. The greater the number of components with intent-filters,
the higher the F1 score. The recall is lower than in the other cases when apps have two or three
components with intent-filters. The recall decreases as the number of false negatives increases. In the
dataset, a false negative occurs for an app obfuscated by DashO. This misjudgment is due to a change
in the class names of components with intent-filters and happens for apps with three or fewer
components. For the first two rows, the number of misjudged apps remains the same, but the total
number of sample apps increases.
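
The pair counts quoted for rows 1 and 2 are consistent with each row covering the apps that have at least a given number of components with intent-filters. The sketch below works this out. Only the first two rows are confirmed by the text above; extending the same pattern to the remaining rows is an assumption.

```python
ORIGINAL_APPS = 148

# Obfuscated apps by number of components with intent-filters:
# index 0 = none, 1 = one, ..., 4 = four, 5 = more than four.
apps_by_component_count = [92, 291, 99, 60, 26, 52]

total_obfuscated = sum(apps_by_component_count)

# Apps having at least k components (cumulative from the tail).
apps_with_at_least = [
    sum(apps_by_component_count[k:])
    for k in range(len(apps_by_component_count))
]

# Pairs compared per row: every original app against every app in the row.
pairs_per_row = [ORIGINAL_APPS * n for n in apps_with_at_least]

for k, (n, pairs) in enumerate(zip(apps_with_at_least, pairs_per_row)):
    print(f"at least {k} components: {n} apps, {pairs} pairs")
```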


3.5 Evaluating obfuscation resilience

Table 3.5 presents the experimental results that compare:

• Two apps in the original apps;
• One in the original apps and one in the obfuscated apps;
• Two apps in the obfuscated apps.

Table 3.5: RomaDroid’s performance with different apps.

In the table, ‘‘Original’’ and ‘‘Obfusca’’ denote original and obfuscated apps, respectively. The #Apps
column shows the total number of compared pairs: 148×148, 148×620, and 620×620, respectively. In
terms of F1 score, the result for two original apps is 95.48, the result for an original app and an
obfuscated app is 94.50, and the result for two obfuscated apps is 92.11. The F1 score is best when two
original apps are compared. As shown in Table 3.4, RomaDroid performs better for applications that
have many components with intent-filters. The apps that degrade the performance have something in
common: they have fewer than three components with intent-filters. The obfuscated apps are created
from the 148 original apps, so they have the same number of components as the original apps, and the
obfuscated set therefore contains more performance-degrading apps. As a result, experiments with
obfuscated apps show relatively lower performance. Overall, RomaDroid performs well for obfuscated
apps as well as un-obfuscated apps.


Chapter 4

CONCLUSION
An effective methodology was established to detect Android app clones even when obfuscation is used.
Because Android apps are easy to reverse engineer, they are easy to copy or clone. The methodology
relies on the AndroidManifest.xml file, which is present in all apps and some of whose content is not
affected by obfuscation. To evaluate its accuracy, a dataset of 768 apps was constructed, and many cases
were found to have over 99% accuracy.

The proposed method does not rely on source code, which can be obfuscated and would then pose several
hurdles as a basis for detection. Moreover, most apps do not provide their source code, so clone
detection techniques that use source code are limited. Our approach instead takes advantage of the
information in the deployed executable, so it can be used for almost any app.

Using this technique, any genuine app that is cloned and redistributed can be detected, and this can be
brought to the notice of the developers of the genuine app.

In future, the methodology could be implemented as an Android app that notifies users as soon as they
install an app, which may stop or slow down the spread of fake or cloned Android apps.


Chapter 5

REFERENCES
[1] A. Monden, S. Okahara, Y. Manabe, and K. Matsumoto, ‘‘Guilty or not guilty: Using clone metrics to
determine open source licensing violations,’’ IEEE Softw., vol. 28, no. 2, pp. 42–47, Mar./Apr. 2011.

[2] Y. Zhou and X. Jiang, ‘‘Dissecting Android malware: Characterization and evolution,’’ in Proc. IEEE
Symp. Secur. Privacy, San Francisco, CA, USA, May 2012, pp. 95–109.

[3] S. Schulze and D. Meyer, ‘‘On the robustness of clone detection to code obfuscation,’’ in Proc. 7th Int.
Workshop Softw. Clones (IWSC), San Francisco, CA, USA, May 2013, pp. 62–68.

[4] C. Gibler, R. Stevens, J. Crussell, H. Chen, H. Zang, and H. Choi, ‘‘AdRob: Examining the landscape
and impact of Android application plagiarism,’’ in Proc. 11th Annu. Int. Conf. Mobile Syst., Appl.,
Services, Taipei, Taiwan, Jun. 2013, pp. 459–460.

[5] K. Chen, P. Liu, and Y. Zhang, ‘‘Achieving accuracy and scalability simultaneously in detecting
application clones on Android markets,’’ in Proc. ICSE, Hyderabad, India, Jun. 2014, pp. 175–186.

[6] J. Crussell, C. Gibler, and H. Chen, ‘‘AnDarwin: Scalable detection of Android application clones based
on semantics,’’ IEEE Trans. Mobile Comput., vol. 14, no. 10, Oct. 2015.

[7] H. Wang, Y. Guo, Z. Ma, and Z. Chen, ‘‘WuKong: A scalable and accurate two-phase approach to Android
app clone detection,’’ in Proc. ISSTA, Baltimore, MD, USA, 2015, pp. 71–82.

[8] ProGuard - Mobile App Protection. Accessed: Feb. 2019. [Online]. Available:
https://www.guardsquare.com/en/products/proguard

[9] DexGuard - Mobile App Protection. Accessed: Feb. 2019. [Online]. Available:
https://www.guardsquare.com/en/products/dexguard

[10] DashO (Java Obfuscator & Android Obfuscator). Accessed: Feb. 2019. [Online]. Available:
https://www.preemptive.com/products/dasho/overview
