Phishing Attacks
PHISHING ATTACKS:
TRENDS, DETECTION SYSTEMS
USE OF ML, NLP AND CV
https://docs.apwg.org/reports/apwg_trends_report_q1_2021.pdf
Facts and Current Trends
Types of phishing attacks
• Typical phishing
• Spear phishing
• Whaling
- More phishing sites are using HTTPS certificates to fool users with the green "secure" icon in the browser, which, ironically, users interpret as "safe".
- Domain spoofing and domain impersonation are more sophisticated: the attacker can send from an authentic Microsoft address.
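As a rough illustration of why impersonation is hard to spot, the sketch below flags domains that sit within a small edit distance of a known brand domain. The brand list, the function names and the distance threshold are assumptions made for this example. Note that a check like this only covers lookalike domains; it does nothing against mail sent from an authentic address, and it says nothing about whether an HTTPS page is trustworthy.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]


# Hypothetical brand list, for illustration only.
KNOWN_DOMAINS = ["microsoft.com", "paypal.com", "google.com"]


def closest_brand(domain, max_distance=2):
    """Return (brand_domain, distance) if `domain` is a near miss of a
    known domain, or None if it is an exact match or too far away."""
    domain = domain.lower()
    best = min(KNOWN_DOMAINS, key=lambda d: edit_distance(domain, d))
    dist = edit_distance(domain, best)
    if 0 < dist <= max_distance:
        return best, dist
    return None


if __name__ == "__main__":
    for candidate in ["rnicrosoft.com", "microsoft.com", "example.org"]:
        print(candidate, closest_brand(candidate))
```

The threshold trades missed lookalikes against false alarms on short, unrelated domains, so it would need tuning on real data.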
Why has the phishing problem not been solved yet?
• Phishing has become too targeted for traditional spam-type filters
Broad, spam-like phishing attacks are easily caught.
Targeted, customized phishing attacks are hard to catch and on the rise: spear-phishing
attacks, especially business email compromise (BEC), have almost doubled since the
beginning of the year, made easier by the large-scale data breaches last year.
The Image/Screenshot
Domain Knowledge (Web Information)
The Source Code (DOM)
Classification of Anti-Phishing Methods
Categories: Blacklist, URL, DOM, Visual Similarity
Methods shown: Google Safe Browsing API; Sahingoz et al. (2017); CANTINA+ (2011); Maurer et al. (2013); CatchPhish (Rao et al., 2018); Rosiello et al. (2007); Marchal et al. (2016); Verilogo (2015); Grambeddings (Bozkir et al., 2022)
Le, Hung, et al. "URLNet: Learning a URL representation with deep learning for malicious URL detection." arXiv preprint arXiv:1802.03162 (2018).
Grambeddings - 2022
• A novel deep-learning-based solution involving 4 information channels, each composed of
5 layers, for extracting information-rich representations from different levels of
character groups.
Bozkir, Ahmet Selman, Firat Coskun Dalgic, and Murat Aydos. "GramBeddings: A New Neural Network for URL Based Identification of Phishing Web Pages Through N-gram Embeddings." Computers & Security 124 (2023): 102964.
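To make "different levels of character groups" concrete, here is a minimal sketch of turning a URL into character n-gram ID sequences, one sequence per n. It covers only the input side that an embedding layer could consume. The n values (1 to 4), the vocabulary size and all function names are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(url, n):
    """All overlapping character n-grams of `url`, in order."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def top_ngrams(urls, n, k):
    """The k most frequent n-grams over a corpus of URLs."""
    counts = Counter()
    for u in urls:
        counts.update(char_ngrams(u, n))
    return [gram for gram, _ in counts.most_common(k)]


def encode(url, vocab, n):
    """Map each n-gram of `url` to an integer ID; 0 means out-of-vocabulary."""
    index = {gram: i + 1 for i, gram in enumerate(vocab)}
    return [index.get(gram, 0) for gram in char_ngrams(url, n)]


if __name__ == "__main__":
    # Made-up URLs, for illustration only.
    corpus = ["http://example.com/login", "http://example.org/secure-login"]
    # One ID sequence per assumed n-gram level.
    for n in (1, 2, 3, 4):
        vocab = top_ngrams(corpus, n, k=50)
        print(n, encode("http://example.net/login", vocab, n)[:10])
```

Each ID sequence would then be padded to a fixed length and fed to its own channel of the network.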
Grambeddings Dataset
Visual similarity vs. pure vision-based analysis
Visual similarity:
• DOM tree similarity
• Visual features
• CSS similarity
• Layout similarity via VIPS (block and overall layout)
Pure vision-based analysis:
• Logo
• Screenshot of whole page
• Image with layout
Why Computer Vision?
• 47%-83% of newly found phishing pages are added to blacklists within 12 hours. Zero-day
attacks need proactive solutions!
• Predefined or hand-crafted heuristics are evaded by attackers
• 23% of the users do not even look at the address bar! (Dhamija et al.)
• Substitution of textual HTML elements with <IMG>
• Loading of dynamic / AJAX based content, IFRAME
• Robustness against complex backgrounds or page layouts
• Brand recognition can be done in a holistic manner, not with a single logo!
• Language and source code independence
• Most importantly, vision-based solutions are in concordance with human perception
Challenges related to vision-based anti-phishing
• Lack of a well curated dataset
• Vast number of brands
• High intra-class variations among the phishing samples of brands
• Inconsistent layouts
• Unrelated layouts and color schemes
• Data leakage, which skews evaluation results
Phish-Iris Dataset
- Bozkir, Ahmet Selman, and Ebru Akcapinar Sezer. "Use of HOG descriptors in phishing detection." 2016 4th International Symposium on Digital Forensic and Security (ISDFS). IEEE, 2016.
- F. C. Dalgic, A. S. Bozkir, and M. Aydos, "Phish-iris: A new approach for vision based brand prediction of phishing web pages via compact visual descriptors," in Proceedings of the IEEE International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 2018.
WhiteNet (Phishing Website Detection by Visual Whitelists)
• Consists of three CNNs structured as Siamese networks.
• Two-step training stage (81% top-1 match)
• Based on FaceNet.
- Sahar Abdelnabi, Katharina Krombholz and Mario Fritz, WhiteNet: Phishing Website Detection by Visual Whitelists, https://arxiv.org/pdf/1909.00300.pdf, 2019
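Since the approach is described as based on FaceNet, the core idea can be sketched with a FaceNet-style triplet loss and a nearest-neighbour lookup against whitelist embeddings. The margin value, the toy 2-D embeddings and the function names are assumptions for this example. The CNNs that would produce the embeddings, and how a match is turned into a phishing decision, are left out.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls anchor and positive together and pushes
    the negative at least `margin` further away (squared distances)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def top1_match(query, whitelist):
    """Return (brand, squared_distance) of the closest whitelist embedding.
    `whitelist` maps a brand name to a list of embedding vectors."""
    best_brand, best_dist = None, float("inf")
    for brand, embeddings in whitelist.items():
        for emb in embeddings:
            d = sq_dist(query, emb)
            if d < best_dist:
                best_brand, best_dist = brand, d
    return best_brand, best_dist


if __name__ == "__main__":
    # Toy 2-D embeddings, made up for illustration.
    whitelist = {"brandA": [[1.0, 0.0]], "brandB": [[0.0, 1.0]]}
    print(top1_match([0.9, 0.1], whitelist))
    print(triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0]))
```

The loss is zero once the negative is already farther from the anchor than the positive by at least the margin, so training effort concentrates on the hard triplets.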
Verilogo: proactive phishing detection via logo recognition
• SIFT-based keypoint matching over 400/200 px stripes
• Pairwise comparison (not scalable)
• 6 seconds/image
• 352-image dataset
G. Wang et al., Verilogo: Proactive Phishing Detection via Logo Recognition, 2010
LogoSENSE
• Object detection strategy with Max-Margin Loss SVM and HOG
• About 0.04 seconds to analyze an image of roughly 1024×1024 px on CPU
• A dedicated dataset covering 15 brands with 1530 training + 1979 testing images
(1000 legitimate samples)
Bozkir and Aydos, LogoSENSE: A Companion HOG based Logo Detection Scheme for Phishing Web Page and E-mail Brand Recognition, Computers & Security, 2020
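For readers unfamiliar with HOG, the following is a simplified sketch of the building block behind the descriptor: a histogram of unsigned gradient orientations for a single grayscale cell, weighted by gradient magnitude. The 9-bin setting and the function names are assumptions. Bin interpolation, block grouping and the SVM stage are omitted, so this is not the LogoSENSE pipeline itself.

```python
import math


def cell_histogram(cell, bins=9):
    """Orientation histogram for one cell, given as a list of pixel rows.
    Gradients use central differences, so border pixels are skipped."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


def l2_normalize(vec, eps=1e-6):
    """L2-normalize a descriptor; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vec) + eps ** 2)
    return [v / norm for v in vec]


if __name__ == "__main__":
    # A 4x4 patch containing a vertical edge: all energy lands in bin 0.
    patch = [[0, 0, 10, 10] for _ in range(4)]
    print(cell_histogram(patch))
    print(l2_normalize(cell_histogram(patch)))
```

A full descriptor concatenates many such normalized histograms over a sliding window, and a linear classifier then scores each window.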
Towards Multi-Modal Analysis of Semantic and Visual Appearance
• We are now investigating how the underlying semantic information can be merged with visual
information to extract discriminative features for phishing web page classification
• In recent years, a few public datasets covering HTML, URLs and screenshots were published.
However, they are not flawless!
• We collected our own dataset over 2.5 years using our own tools and tight scheduling
• Our multilingual analysis shows promising results, surpassing 90% accuracy
• Combined with visual features, we are targeting a success rate of up to 98%
Conclusion
• The phishing problem is escalating! The arms race is no longer symmetric!
• Capabilities in image understanding have a direct impact on vision-based anti-phishing,
and such methods are gaining popularity
• Visual information should be processed together with NLP features, while also considering
layout variations
• As of 2022, we still do not have a standardized dataset like ImageNet
• Accuracy is not enough. Obtaining a low FPR is crucial!
• The solutions are not personalized! Personal attitudes play a key role.
Nonetheless, they are like an "event horizon".
• We are contributing, and will continue to contribute, to the field. Stay tuned for upcoming papers and resources.
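The point that accuracy is not enough can be illustrated with a small calculation on hypothetical confusion-matrix counts. The numbers are invented for this example and come from no cited study.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all pages classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp, tn):
    """Fraction of legitimate pages wrongly flagged as phishing."""
    return fp / (fp + tn)


if __name__ == "__main__":
    # Hypothetical test set: 10,000 legitimate pages and 100 phishing pages.
    tp, fn = 90, 10       # phishing pages caught / missed
    fp, tn = 200, 9800    # legitimate pages flagged / passed
    print(f"accuracy = {accuracy(tp, fp, tn, fn):.4f}")
    print(f"FPR      = {false_positive_rate(fp, tn):.4f}")
```

With these counts the detector is about 98% accurate, yet it still blocks 200 legitimate pages. At web scale that volume of false alarms erodes user trust, which is why FPR is worth reporting alongside accuracy.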
23.11.2022
Thanks for listening