
Evaluation of Image Similarity Algorithms for Malware Fake-Icon Detection

Jun-Seob Kim, ESTsecurity, Seoul, Republic of Korea, jskim90@estsecurity.com
Wookhyun Jung, Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea, pplan5872@estsecurity.com
Sangwon Kim, Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea, bestksw@estsecurity.com
Shinho Lee, Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea, lee1029ng@estsecurity.com
Eui Tak Kim, Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea, etkim@estsecurity.com

Abstract: Malware creators have steadily used social engineering attacks that induce people to execute malware, either by stealing the icons of well-known programs or by disguising malware as a normal program. Methods that compare icon similarity have therefore been proposed to detect this type of malware. Similarity has been compared through icon hash comparison, machine learning, or image similarity. Among these, image similarity hash algorithms have been used to detect icons used by malware because they make it possible to find a similar image with a simple calculation. However, to respond to malware that uses icons to deceive people, it is necessary to inspect not only malware with an identical image similarity hash value but also the wider range of malware that uses a similar icon. In this paper, we verify the search performance of image similarity hash algorithms on icon information extracted from real malware and present the performance index of each algorithm.

Keywords: image hash, icon similarity, malware

I. INTRODUCTION

Malware creators have steadily used social engineering attacks that induce people to execute malware, either by stealing the icons of well-known programs or by disguising malware as a normal program. Many such cases have recently been discovered in malware such as ransomware. In some cases, malware also carries a self-produced icon to disguise itself as a normal file, since a program without an icon is easily suspected. To detect these types of malware, methods using icon similarity have been suggested [1]. Similarity has been compared through icon hash comparison, machine learning, or image similarity. Among these, image similarity hash algorithms have been used to detect icons used by malware because they make it possible to find a similar image with a simple calculation. However, to respond to malware that uses icons to deceive people, it is necessary to inspect not only malware with an identical image similarity hash value but also the wider range of malware that uses a similar icon.

To do this, we need a method that searches for malware by extending the range of icons treated as similar. In this paper, we verify the icon-search performance of image similarity hash algorithms on icon information extracted from real malware and present the performance index of each algorithm.

II. BACKGROUND

In this chapter, we look at examples of abuse in which malware steals, or disguises itself with, a legitimate icon. We then review existing detection technology based on image similarity and the major image similarity hash algorithms used to compare similar images.

A. Icon Abuse Case Type

Abuse cases in which malware steals, or disguises itself with, a legitimate icon can be divided into 4 types. To detect these types of abuse, we can use methods that judge icon similarity.

TABLE I. ICON ABUSE CASE TYPE AND DETECTION METHOD
Case Type                                                          Detection Method
① Induces execution by stealing a document/audio file icon         Icon similarity to document/audio file icons
② Induces execution by stealing a legitimate application icon      Icon similarity to legitimate application icons
③ Steals a well-known application icon to disguise itself          Icon similarity to well-known application icons
④ Disguised as a normal file by containing a self-produced icon    Icon similarity to previously detected malware icons

Fig. 1. Icons related to each type of abuse case

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00026, ICT Infrastructure protection against intelligent malware threats) in 2020.
978-1-7281-6758-9/20/$31.00 ©2020 IEEE 1638 ICTC 2020
Authorized licensed use limited to: INDIAN INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on December 18,2023 at 18:11:34 UTC from IEEE Xplore. Restrictions apply.
B. Existing Detection Technology based on Image Similarity

Existing detection technology uses machine learning, file-based hash algorithms applied to images such as SSDeep [2], and image similarity-based methods.

TABLE II. DETECTION TECHNOLOGIES USING IMAGE SIMILARITY
            Patent number (US standard)   Summary
Cylance     15/358,009 [3]                Similarity analysis based on machine learning
Avast       14/716,685 [4]                Detection method based on image similarity
Kaspersky   14/072,391 [5]                Detection method based on image similarity
Symantec    12/612,550 [6]                Detection method through image hash comparison

However, the machine learning method is not easy to adopt because a data set is difficult to establish and additional training is needed. A file-based hash algorithm such as SSDeep has the problem that detection performance decreases sharply when the shape, resolution, or color of the image changes.

C. Major Image Similarity Hash Algorithm

To compare image similarity, four image similarity hash algorithms [7] are usually used.

TABLE III. DETECTION TECHNOLOGIES USING IMAGE SIMILARITY
Abbreviation   Name                 Features
aHash          Average hashing      Average-based similarity
pHash          Perceptual hashing   Discrete Cosine Transform (DCT)-based similarity
dHash          Difference hashing   Gradient trace-based similarity
wHash          Wavelet hashing      Discrete Wavelet Transform (DWT)-based similarity

The image similarity hash algorithms in Table III have the property that they output the same hash value under some degree of image variation.

D. Image Similarity Hash Algorithm and Hamming Distance

An image similarity hash algorithm allows similarity to be compared by calculating the Hamming distance between two hash values.

Fig. 2. Hamming distance for hash values represented as 64-bit binary (0 means an identical image; the maximum value is 64)

III. METHODOLOGY

In this chapter, we validate the malware detection method based on icon image similarity and propose a method for searching for similar malware using an image similarity hash algorithm and the Hamming distance [8].

A. Percentage of Collected Malware Containing an Icon

To validate the malware detection method based on icon image similarity, we collected and analyzed 15,400 samples (PE files) from the issue malware in the AlienVault report. As a result, 47.63% of the malware contained icons.

Fig. 3. Percentage of malware containing an icon

We made 2,572 clusters from the 7,312 malware samples containing icons by using SHA256, and 693 samples, 9.5% of the 7,312, reused an identical icon more than twice. In particular, the top 7 malware samples reused identical icons more than 120 times. From this result, we could confirm that the malware detection method based on icon image similarity is valid.

B. Method of Searching for Similar Icons by using Image Similarity Hash Algorithm and Hamming Distance

We can use the Hamming distance between two image hash values to compare similarity and search for similar icons. The larger the allowed Hamming distance, the wider the range of similar icons we can search for. The search result at a given Hamming distance depends on the icon type and on the image similarity hash algorithm.

Fig. 4. The result following the Hamming distance in the pHash algorithm

As presented in Figure 4, when searching for icon 1 with a Hamming distance of 0, only an identical icon is found. With a Hamming distance of 9, two more icons are found. In this way, we can search for similar icons simply by adjusting the Hamming distance to the desired range.

IV. EXPERIMENT

In this chapter, we present the performance index of the image similarity algorithms, verified using samples collected in the wild.

A. Experimental Condition

TABLE IV. EXPERIMENTAL ENVIRONMENT
CPU        Intel Core i7-8700, CPU @ 3.20GHz
RAM        32GB
OS         Microsoft Windows 10 64-Bit
Language   Python 3.7
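
The Hamming-distance search of Section III-B, on which the experiment relies, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: hash values are assumed to be 64-bit values written as 16-character hexadecimal strings, the names `hamming_distance`, `search_similar`, `ICON_HASHES`, and `QUERY_HASH` are hypothetical, and the hash values themselves are made up so that the result mirrors the behavior described for Figure 4. The paper cites ImageHash [7] and Distance [8] for hashing and distance computation; the sketch uses only the standard library.

```python
from typing import Dict, List, Tuple


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hashes given as hex strings."""
    # XOR leaves a 1 in every bit position where the two hashes differ.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def search_similar(
    query_hash: str, icon_hashes: Dict[str, str], max_distance: int
) -> List[Tuple[str, int]]:
    """Return (icon name, distance) pairs within max_distance of the query,
    ordered by distance and then by name."""
    hits = []
    for name, icon_hash in icon_hashes.items():
        distance = hamming_distance(query_hash, icon_hash)
        if distance <= max_distance:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical precomputed 64-bit hashes (illustrative values only).
ICON_HASHES = {
    "icon_1": "ffff0000ffff0000",  # identical to the query
    "icon_2": "ffff0000ffff0001",  # differs from the query in 1 bit
    "icon_3": "ffff0000ffff00ff",  # differs from the query in 8 bits
    "icon_4": "0000ffff0000ffff",  # differs from the query in all 64 bits
}
QUERY_HASH = "ffff0000ffff0000"

if __name__ == "__main__":
    for threshold in (0, 9):
        print(threshold, search_similar(QUERY_HASH, ICON_HASHES, threshold))
```

With these made-up values, a threshold of 0 returns only the identical icon, while a threshold of 9 also returns the two icons at distances 1 and 8, which is the widening effect the text describes.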

To secure the objectivity of the malware sample set, we selected the 7,312 samples containing icons from the 15,400 issue malware samples (PE files) in the AlienVault report and removed identical icons to compose a dataset [9] of 2,572 samples. As for icon size, we extracted the 32x32 icons using extract-icon-py [10]. Where no 32x32 icon was available, we extracted an icon of approximately 32x32 size and resized it to 32x32 using Pillow [11].

B. Experiment Method

Step 1 in Figure 5 calculates the hashes in advance to increase search speed, and step 2 is the icon selection step, for which we selected an Adobe icon. The search was limited to the point where no more similar icons appear.

Fig. 5. Method of searching for similar icons

C. Experiment Result

TABLE V. EXPERIMENT RESULT
            Min. Hamming distance                Max. Hamming distance
Algorithm   Distance   Number of search results   Distance   Number of search results
aHash       7          2                          7          2
dHash       8          2                          13         7
pHash       6          3                          17         15
wHash       9          0                          9          0

In the experiment result, the minimum Hamming distance is the Hamming distance at which more than one icon is found. The maximum Hamming distance is the Hamming distance beyond which no further different types of icons are found. In the case of aHash, the minimum and maximum Hamming distances were both 7, and 2 icons were found at the maximum Hamming distance.

Fig. 6. Similar icons searched at Hamming distance = 7 by aHash

In the case of dHash, the minimum Hamming distance was 8 and the maximum Hamming distance was 13. 2 icons were found at the minimum Hamming distance, and 7 icons were found at the maximum Hamming distance.

Fig. 7. Similar icons searched at Hamming distance = 13 by dHash

The best performance appeared when using pHash. The minimum Hamming distance was 6 and the maximum Hamming distance was 17, which was the largest distance gap. 3 icons were found at the minimum Hamming distance, and 15 icons were found at the maximum Hamming distance, the largest number of icons found by any algorithm.

Fig. 8. Similar icons searched at Hamming distance = 17 by pHash

In the case of wHash, the minimum and maximum Hamming distances were both 9, but no icon could be found at the maximum Hamming distance.

V. CONCLUSION AND FUTURE WORK

In this paper, we verified the icon-search performance of image similarity hash algorithms on icon information extracted from real malware and presented the performance index of each algorithm. We also showed that an image similarity hash algorithm can be used not only to check whether hash values match but also to search effectively for malware with similar icons by extending the search range.

In future work, we will research ways to utilize the similar icon-search method proposed in this paper, such as clustering malware that has similar icons.

ACKNOWLEDGMENT

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00026, ICT Infrastructure protection against intelligent malware threats) in 2020.

REFERENCES

[1] M. Šmarda and P. Šrámek, “Using Image Similarity Algorithms on Application Icons,” in Proc. 24th Virus Bulletin Int. Conf., Seattle, WA, USA, Sep. 2014.
[2] J. Kornblum, “Identifying almost identical files using context triggered piecewise hashing,” Digital Investigation, vol. 3, Sep. 2006, pp. 91–97.
[3] M. Wolff, P. S. N. Neto, X. Zhao, J. Brock and J. Juan, “Icon Based Malware Detection,” U.S. Patent Application 15/358,009, Sep. 19, 2019.
[4] M. Smarda and P. Sramek, “Tunable multi-part perceptual image hashing,” U.S. Patent 14/716,685, Apr. 18, 2017.
[5] I. I. Tatarinov, “System and Method for Detecting Malicious Executable Files Based on Similarity of Their Resources,” U.S. Patent 14/072,391, May 26, 2015.
[6] B. Krishnappa, “Method and system for identifying icons,” U.S. Patent 12/612,550, Aug. 28, 2012.
[7] ImageHash, Accessed: Jul. 30, 2020. [Online]. Available: https://github.com/JohannesBuchner/imagehash
[8] Distance - Utilities for comparing sequences, Accessed: Jul. 30, 2020. [Online]. Available: https://github.com/doukremt/distance
[9] Icon Dataset, Accessed: Jul. 30, 2020. [Online]. Available: https://github.com/jskim90/icon
[10] extract-icon-py, Accessed: Jul. 30, 2020. [Online]. Available: https://github.com/firodj/extract-icon-py
[11] Pillow, Accessed: Jul. 30, 2020. [Online]. Available: https://python-pillow.org/

