
SEMINAR

PHISHING ATTACKS:
TRENDS, DETECTION SYSTEMS
USE OF ML, NLP AND CV

ASSIST. PROF. Dr. Ahmet Selman Bozkır


Dept. of Computer Engineering - Hacettepe University
Today
■ What is phishing?
■ Facts and current trends
■ Types of phishing
■ Examples of attack types
■ Why has the phishing problem not been solved yet?
■ Phishing detection methods in the literature
■ Vision based analysis and various studies, challenges
■ What have we done so far? Our vision
■ Conclusion
What is Phishing?
• Phishing is a criminal mechanism employing both social
engineering and technical subterfuge to steal consumers’
personal identity data and financial account credentials.
• Social engineering schemes prey on unwary victims by
fooling them into believing they are dealing with a trusted,
legitimate party, such as by using deceptive email
addresses and email messages.
(APWG – Anti-Phishing Working Group )
• Phone phreaking + fishing -> «phishing»
Underlying Truth
• In 350 BC, Aristotle noted that “our senses can be trusted
but they can be easily fooled”.
• According to a study by Richard Gregory, only 20% of our visual
perception comes through our eyes; the rest relies on our inferences.
• Actual reason: careless operations and ignorance
Phish or not?
A typical life cycle of a phishing
campaign
Facts and Current Trends

https://docs.apwg.org/reports/apwg_trends_report_q1_2021.pdf
Types of phishing attacks

Typical phishing | Spear phishing | Whaling
More quantity / less profit -> Less quantity / more profit


Example e-mails for “typical phishing”
Example e-mails for “spear phishing”
An example e-mail of “whaling phishing”
Why has the phishing problem not
been solved yet?
• Even Highly Trained Users are Clicking
When reading a hundred emails in the middle of a stressful workday, even the most well-trained
and observant employee will click on a malicious email.

• Phishing Attacks are Increasingly Sophisticated

- Employees are taught to look for typos and poor grammar to identify a text lure, but over
the last year, attackers have improved their spelling and learned to match legitimate
messages.

- More phishing sites are using HTTPS certificates in order to fool users with the green
“secure” icon in the browser that, ironically, users will interpret as “safe”.

- Domain spoofing and domain impersonation are more sophisticated. The attacker can
send from an authentic Microsoft address.
Why has the phishing problem not
been solved yet?
• Phishing Has Become too Targeted for Traditional Spam-Type filters
Broad Spam-like Phishing Attacks are Easily Caught.
Targeted, Customized Phishing Attacks are Hard to Catch and on the Rise: Spear-phishing
attacks, especially business email compromise (BEC), have almost doubled since the
beginning of the year, made easier by the large scale data breaches last year.

• Targeted Attacks Have Become Psychologically More Sophisticated


- Attackers have learned to combine personalized information through profiling.
- Fear, urgency, and curiosity -> entertainment, social and reward recognition.
Combatting Methods against Phishing
• The URL
• The Image/Screenshot
• Domain Knowledge (Web Information)
• The Source Code (DOM)
Classification of Anti-Phishing Methods
Categories: Blacklist | URL | DOM | Visual Similarity
Cited works:
• Google Safe Browsing API
• Sahingoz et al. (2017)
• CANTINA+ (2011)
• Maurer et al. (2013)
• CatchPhish – Rao et al. (2018)
• Rosiello et al. (2007)
• Marchal et al. (2016)
• Verilogo (2015)
• URLNet – Hung et al. (2018)
• Han et al. (2008)
• Buber et al. (2017)
• DeltaPhish (2017)
• PDA – Jain & Gupta (2016)
• PDRCNN – Wang et al. (2019)
• PhishIRIS – Dalgic et al. (2018)
• Jain & Gupta (2018)
• Grambeddings – Bozkir et al. (2022)
Less resource -> Time consuming / More resource / Robust to “zero-hour” attacks

The URL
• The Uniform Resource Locator (URL) is the address of a resource,
in this case a web page, on the World Wide Web.
• Many researchers use this source of information in their studies to
extract key features to identify a phishing web page.
• While some of them propose a solution using hand-crafted
(lexical) features, others choose to apply machine learning based
features.
Some Phishing URLs
http://www.cnhedge.cn/js/index.htm?http://us.battle.net/login/en/?ref=http://spdfozrus.battle.net/d3/en/index
http://www.arvindudyog.com/bright/bright/drake/bright/45886564bea8a9f07a8055347163a4a3/
http://amcnamibia.com/wp-admin/file/files/db/file.dropbox/
http://www.arvindudyog.com/papa/
http://www.iowasaferoutes.org/wp-content/plugins/wpsecone/dhl/
http://www.imanaforums.com/neomodules/accesst/
http://ausbuildblog.com.au/wp-content/heaven/index.php
http://fengshuireview.com/upload/free.mobile.fr/facturtion/finale/free/
http://searchenginetricks.ca/cam/config/webmail/
http://www.i-robot.kiev.ua/self/dropbox/dropbox/dropbox/
http://www.justaskaron.com/octapharma.com.ca/
http://i-robot.kiev.ua/self/dropbox/dropbox/dropbox/index.php
http://kiltonmotor.com/others/m.i.php?n=1774256418&rand.13inboxlight.aspxn.1774256418&rand=13inboxlightaspxn.1774256418&username1=&username
http://www.sindhuratna.com/new2015/document.php
http://justaskaron.com/octapharma.com.ca/index.php
http://www.alexsandroleiloes.com.br/admin/beats/verification-folder.php
http://www.vantaiduccuong.com/soutdoc/es/
http://www.pt-tkbi.com/providernet/provider/provider/webmail/securenow/webnet/
http://www.alhadbaa.org/googledrive/
http://www.parfumwangimurah.com/g9/
http://proseind.cl/new/index.php
http://annstringer.com/storagechecker/domain/ii.php
Lexical URL Features
• #dots
• #special characters
• #suffixes
• Length of URL
• Length of the query string
• Subdomain name
• Suspicious characters / Punycode
• TLD Name and its length
• Domain Name
• The depth of the subdomain
• Having an SSL certificate (HTTPS)
• ….
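A few of the lexical features listed above can be sketched with the Python standard library. This is a minimal illustration, not the feature set of any cited study: the function name `lexical_features`, the `SPECIAL_CHARS` set, and the naive subdomain-depth rule are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Illustrative choice of "special" characters; not taken from the slides.
SPECIAL_CHARS = set("@-_~%=&?")


def lexical_features(url: str) -> dict:
    """Compute a handful of hand-crafted lexical features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    return {
        "num_dots": url.count("."),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "url_length": len(url),
        "query_length": len(parts.query),
        "tld": tld,
        "tld_length": len(tld),
        # Naive depth: labels to the left of "domain.tld". Multi-part
        # suffixes such as .com.au or .kiev.ua are over-counted.
        "subdomain_depth": max(len(labels) - 2, 0),
        # Punycode labels (internationalized domain names) start with "xn--".
        "has_punycode": any(label.startswith("xn--") for label in labels),
        "uses_https": parts.scheme == "https",
    }


# One of the sample phishing URLs from the previous slide.
features = lexical_features("http://www.i-robot.kiev.ua/self/dropbox/dropbox/dropbox/")
```

A production system would resolve the registered domain with a public-suffix list rather than counting labels.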
Most discriminative 4-grams: chi-square
• “%20(” : 99.35901350685741
• “.log” : 155.82961566651434
• “logi” : 1947.7954010788872
• “ogin” : 2096.632706999275
• “secu” : 895.0781029132113
• “/wp-” : 1629.5131963112008
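How such a ranking can be obtained is sketched below. The slides do not say how the scores were computed, so this assumes the standard Pearson chi-square statistic on a 2x2 table (n-gram present/absent vs. phishing/legitimate). The function names and the four toy URLs are hypothetical.

```python
from collections import Counter


def char_ngrams(url: str, n: int = 4) -> set:
    """Return the set of character n-grams occurring in a URL."""
    return {url[i:i + n] for i in range(len(url) - n + 1)}


def chi_square_scores(urls, labels, n=4):
    """Score each n-gram by its 2x2 chi-square statistic.

    labels: 1 = phishing, 0 = legitimate.
    """
    total = len(urls)
    n_phish = sum(labels)
    n_legit = total - n_phish
    phish_df, legit_df = Counter(), Counter()
    for url, label in zip(urls, labels):
        (phish_df if label == 1 else legit_df).update(char_ngrams(url, n))

    scores = {}
    for gram in set(phish_df) | set(legit_df):
        a = phish_df[gram]   # phishing URLs containing the n-gram
        b = legit_df[gram]   # legitimate URLs containing the n-gram
        c = n_phish - a      # phishing URLs without it
        d = n_legit - b      # legitimate URLs without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy data (hypothetical URLs, for illustration only).
urls = [
    "http://a.example/login/secure.php",
    "http://b.example/wp-admin/login/",
    "http://c.example/news/today",
    "http://d.example/about/team",
]
labels = [1, 1, 0, 0]
scores = chi_square_scores(urls, labels)
top = sorted(scores, key=scores.get, reverse=True)[:5]
```

An n-gram that appears in every URL (such as “http”) scores zero, while one that appears only in the phishing class scores highest.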
Source Code
• Consists of HTML DOM, JS and CSS components.
• Used as the main markup directives for layout information
The source code is no longer applicable!

• Thanks to the capabilities of JavaScript and comprehensive libraries such as
React.js and Angular.js, web page implementation is shifting from
static rendering to dynamic rendering.
• Ajax and dynamic content loading
• Misuse of HTML tags
• Numerous ways to mark up the same rendering!
• Thus, HTML, CSS or tag similarity is not guaranteed to be a reliable source of
evidence!
URLNet - 2018
• One of the first published works based
on deep learning methods.

Le, Hung, et al. "URLNet: Learning a URL representation with deep learning for malicious URL detection."
arXiv preprint arXiv:1802.03162 (2018).
Grambeddings - 2022
• A novel deep learning based solution involving 4
information channels, each composed of 5 layers,
for extracting information-rich representations from
different levels of character groups.

• An adjustable n-gram embedding matrix powered
by an efficient and effective pre-processing
scheme.

• Attention mechanism to improve capturing long
and short term co-occurrences of useful n-grams.

• Dataset involving 800K real-world phishing.

• Accuracy 98.24% (SOTA)

Bozkir, Ahmet Selman, Firat Coskun Dalgic, and Murat Aydos. "GramBeddings: A New Neural Network for URL Based Identification of Phishing Web
Pages Through N-gram Embeddings." Computers & Security 124 (2023): 102964.
Grambeddings Dataset
Visual similarity vs
Pure vision based analysis
Visual similarity:
• DOM tree similarity
• Visual features
• CSS similarity
• Layout similarity via VIPS (block and overall layout)
Pure vision based analysis:
• Logo
• Screenshot of whole page
• Image with layout
Why Computer Vision?
• 47%-83% of the newly found phishing pages are added to lists in 12 hours. Zero-day attacks
need proactive solutions!
• Predefined or hand-crafted heuristics are evaded by attackers
• 23% of the users do not even look at the address bar! (Dhamija et al.)
• Substitution of textual HTML elements with <IMG>
• Loading of dynamic / AJAX based content, IFRAME
• Robustness against complex backgrounds or page layouts
• Brand recognition can be done in a holistic manner, not with a single logo!
• Language and source code independence
• Most importantly, vision based solutions are in concordance with human
perception
Challenges related to vision based anti-phishing
• Lack of a well curated dataset
• Vast number of brands
• High intra-class variations among the phishing samples of brands
• Inconsistent layouts
• Unrelated layouts and color schemes
• Data leakage, which biases the results
Phish-Iris Dataset

Publicly available at https://web.cs.hacettepe.edu.tr/~selman/phish-iris-dataset


HOG and MPEG7 like compact visual
descriptors (2016, 2018)
• Based on global image
similarity via descriptors.
• Processes the screenshot of
the whole web page.
• 92% accuracy.

- Bozkir, Ahmet Selman, and Ebru Akcapinar Sezer. "Use of HOG descriptors in phishing detection." 2016 4th International Symposium on Digital Forensic and Security
(ISDFS). IEEE, 2016
- F. C. Dalgic, A. S. Bozkir, and M. Aydos, “Phish-iris: A new approach for vision based brand prediction of phishing web pages via compact visual descriptors,” in
Proceedings of the IEEE International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT),2018
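The idea of comparing screenshots through a compact global descriptor can be illustrated with a heavily simplified, single-cell orientation histogram. This is only a HOG-style sketch: real HOG adds cells, blocks and block normalization, and the cited papers' exact descriptors are not reproduced here. The function names and the 4x4 toy images are assumptions.

```python
import math


def orientation_histogram(img, bins=9):
    """Histogram of unsigned gradient orientations, weighted by magnitude.

    img is a 2-D list of grayscale values; the result is L2-normalized.
    """
    height, width = len(img), len(img[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = img[y][x + 1] - img[y][x - 1]  # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm else hist


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy "screenshots": one vertical edge, one horizontal edge.
vertical_edge = [[0, 0, 255, 255] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [255] * 4, [255] * 4]

v_desc = orientation_histogram(vertical_edge)
h_desc = orientation_histogram(horizontal_edge)
similarity = cosine_similarity(v_desc, h_desc)
```

Two pages with similar dominant edge structure yield descriptors with high cosine similarity, while the two toy images above share no orientation bins.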
White-Net (Phishing Website Detection
by Visual Whitelists)
• Consists of three CNNs
structured as Siamese networks.
• Two-step training stage (81% top-1 match)
• Based on FaceNet.

- Sahar Abdelnabi, Katharina Krombholz and Mario Fritz, WhiteNet: Phishing Website Detection by Visual Whitelists, https://arxiv.org/pdf/1909.00300.pdf, 2019
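FaceNet is known for training embeddings with a triplet loss, so a three-branch network of this kind is commonly trained as sketched below. This is an assumption about what “based on FaceNet” implies, not WhiteNet's actual code. The margin value and the toy 2-D embeddings are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull same-brand pages together, push others apart.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin.
    """
    return max(
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
        0.0,
    )


# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 0.0]          # screenshot of a whitelisted brand page
positive = [0.1, 0.0]        # another page of the same brand
far_negative = [1.0, 0.0]    # page of a clearly different brand
near_negative = [0.2, 0.0]   # different brand embedded too close

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
```

Only the second triplet contributes a non-zero loss, which is why such training schemes usually focus on mining hard examples.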
Verilogo : proactive phishing detection
via logo recognition
• SIFT based keypoint matching over 400/200 px stripes
• Pairwise comparison (not scalable)
• 6 seconds/image
• 352-image dataset

G. Wang et al., Verilogo: Proactive Phishing Detection via Logo Recognition, 2010
LogoSENSE
• Object detection strategy with Max-Margin Loss
SVM and HOG
• About 0.04 seconds to analyze an image on CPU (~1024*1024 px)
• A special dataset covering 15 brands, with 1530
training + 1979 testing images (1000 legitimate
samples)

Bozkir and Aydos, LogoSENSE: A Companion HOG based Logo Detection Scheme for Phishing Web Page and E-mail Brand Recognition , Computers & Security, 2020
Towards Multi-Modal Analysis of
Semantic and Visual Appearance
• We are now investigating how the underlying semantic information can be merged with visual
information to extract discriminative features for phishing web page classification
• In recent years, a few public datasets covering HTML, URLs and screenshots have been published.
However, they are not flawless!
• We collected our own dataset over 2.5 years using our tools and tight scheduling
• Our multi-lingual analysis shows promising results surpassing 90% accuracy
• Combined with visual features, we are targeting a success rate of up to 98%
Conclusion
• The phishing problem is escalating! The arms race is no longer symmetric!
• Advances in image understanding have a direct impact on vision based anti-phishing, which
is gaining popularity
• Visual information should be processed together with NLP features, also considering
layout variations
• As of 2022, we still do not have a standardized dataset like ImageNet
• Accuracy is not enough. Obtaining a low FPR is crucial!
• The solutions are not personalized! Personal attitudes play a key role.
Nonetheless, they are like an “event horizon”.
• We are contributing, and will continue to contribute, to the field. Stay tuned for upcoming papers and resources.
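Why accuracy alone is not enough can be seen from a small worked example. The confusion-matrix counts below are hypothetical, chosen only to mimic a traffic mix in which legitimate pages vastly outnumber phishing pages.

```python
def accuracy_and_fpr(tp, fp, tn, fn):
    """Return (accuracy, false positive rate) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    fpr = fp / (fp + tn)  # share of legitimate pages wrongly flagged
    return accuracy, fpr


# Hypothetical traffic: 1,000,000 legitimate pages and 10,000 phishing pages.
tp, fn = 9_500, 500        # phishing pages caught / missed
tn, fp = 980_000, 20_000   # legitimate pages passed / wrongly blocked

accuracy, fpr = accuracy_and_fpr(tp, fp, tn, fn)
false_alarms_per_catch = fp / tp
```

With these counts the detector is about 98% accurate and its FPR is only 2%, yet it blocks 20,000 legitimate pages, more than twice the number of phishing pages it catches.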
23.11.2022
Thanks for listening
