Baitalarm: Detecting Phishing Sites Using Similarity in Fundamental Visual Features
Jian Mao (1,2), Pei Li (1), Kun Li (1), Tao Wei (3), and Zhenkai Liang (4)
(1) School of Electronic and Information Engineering, BeiHang University, China
(2) The State Key Laboratory of Integrated Services Networks, Xidian University, China
(3) Institute of Computer Science and Technology, Peking University, China
(4) School of Computing, National University of Singapore, Singapore
Abstract: In this paper, we present a new solution, BaitAlarm, to detect phishing attacks using features that are hard to evade. The intuition behind our approach is that phishing pages need to preserve the visual appearance of their target pages. We present an algorithm to quantify the suspiciousness ratings of web pages based on the similarity of visual appearance between web pages. Since CSS is the standard technique to specify page layout, our solution uses CSS as the basis for detecting visual similarities among web pages. We prototyped our approach as a Google Chrome extension and used it to rate the suspiciousness of web pages. The prototype shows the correctness and accuracy of our approach with a relatively low performance overhead.

I. INTRODUCTION

Phishing is a form of social engineering attack in which an attacker mimics electronic communications to lure users into providing their confidential information. Such communications trick users into visiting phishing web sites, which collect users' private information, such as passwords, credit card numbers, and social security numbers. According to the investigation report from APWG [1], phishing attacks increased by 50% per month, and around 5% of phishing mails attract users to visit the phishing web sites.

A widely used type of solution detects phishing URLs and alerts users before they visit the URLs. For example, the Bayesian anti-phishing toolbar [2], [3] maintains a blacklist database of phishing sites. Special characteristics of web sites hosting phishing pages, such as the lifetime and the registration date of a web site, can also be used to detect phishing attacks [4]–[6]. However, the features that such solutions are based on, such as URL strings, are not fundamental features of phishing pages. As a result, it is not hard for attackers to find ways to evade such defense mechanisms.

Since phishing pages need to lure users by their visual appearance, i.e., page contents and page layouts, they are usually similar to the target pages. Recent solutions [7], [8] check whether the contents of the page being visited are similar to other pages indexed by search engines. However, such solutions can be confused by attackers who embed invisible contents. To capture the similarity in visual appearance, a few solutions [9]–[11] are based on comparing images of the rendered pages. However, these solutions are not efficient, and they can be affected by slight differences caused by different browser rendering engines. Moreover, if the target page cannot be indexed by search engines, such as a page that can be displayed only after a user login, the above solutions cannot be applied.

To robustly detect phishing sites, we aim to use fundamental visual features of a web page's appearance as the basis for detecting page similarities. In this paper, we propose a novel solution, BaitAlarm, to efficiently detect phishing web pages. Note that page layouts and contents are fundamental features of a web page's appearance. Since the standard way to specify page layouts is through the style sheet (CSS), we develop an algorithm to detect similarities in key elements related to CSS.

We implemented BaitAlarm as a Google Chrome extension. Our evaluation on more than 7000 phishing pages confirms our assumptions. BaitAlarm achieved accurate results in detecting hundreds of samples from phishtank.com, a web collection of phishing attack samples.

II. OVERVIEW

A. Page Layout and CSS

The visual appearance of a web page is decided by its page layout and contents. To achieve a consistent appearance across all variants of web browsers, Cascading Style Sheets (CSS) is the standard technology for web pages to specify their visual appearance. When the user opens a web page, the browser captures the CSS structure of the page, which is a series of rules specifying visual properties for page elements.

A CSS rule includes two main components: a selector and one or more declarations. The selector is usually an HTML element, and each declaration consists of a property and a value. The property is the style attribute of the HTML element. Each property has a value [12].

Selectors can be split into several categories, such as tag selectors, id selectors, .class selectors, and other selectors (e.g., some attribute selectors). Properties illustrate the attributes related to the elements that are selected by
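As a rough illustration of the rule structure and selector categories described in this subsection, the following Python sketch splits flat CSS text into (selector, property, value) triples and assigns each selector to one of the four categories. The function names `classify_selector` and `parse_rules` and the regular expressions are assumptions made for this example and are not taken from BaitAlarm. The sketch ignores comments and at-rules such as @media, and it labels every compound or attribute selector as "other".

```python
import re


def classify_selector(selector: str) -> str:
    """Return 'id', 'class', 'tag', or 'other' for a single CSS selector."""
    s = selector.strip()
    if re.fullmatch(r"#[\w-]+", s):
        return "id"
    if re.fullmatch(r"\.[\w-]+", s):
        return "class"
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", s):
        return "tag"
    # Attribute selectors, descendant combinators, pseudo-classes, etc.
    return "other"


def parse_rules(css: str):
    """Yield (selector, property, value) triples from flat CSS text."""
    for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        for selector in selectors.split(","):
            for declaration in body.split(";"):
                if ":" not in declaration:
                    continue  # skip empty trailing pieces
                prop, value = declaration.split(":", 1)
                yield selector.strip(), prop.strip().lower(), value.strip()


SAMPLE_CSS = """
h1 { color: red; }
#login-box { width: 300px; color: red }
.btn, input[type=submit] { color: white; }
"""

for selector, prop, value in parse_rules(SAMPLE_CSS):
    print(classify_selector(selector), selector, prop, value)
```

Grouping the resulting triples by property and value gives one possible starting point for a per-page model of selectors.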
Page similarity detection/computing

Before we illustrate our visual similarity computing algorithm, we first define three notations.

Definition 2: (Complexity Score) The Complexity Score of a web page is a fundamental visual layout metric. Given the comparison-unit of the web page A, CompUnit(A), the complexity of the web page A is

\[
S_A = \sum_{n=1}^{N_A} \sum_{m=1}^{M_n} \left( k_t^{n,m} w_t + k_c^{n,m} w_c + k_i^{n,m} w_i + k_o^{n,m} w_o \right),
\]

where $n$ indexes the $N_A$ properties in CompUnit(A); $M_n$ is the number of the n-th property's optional values $\{Value_m^n\}$; $k_t^{n,m}$, $k_c^{n,m}$, $k_i^{n,m}$ and $k_o^{n,m}$ represent the number of the Tag, Class, ID, and Others selectors with the value $Value_m^n$, respectively; and $w_t$, $w_c$, $w_i$, $w_o$ are the corresponding weight values.

Definition 3: (Match Score) Given the comparison-units of the web pages A and B, the Match Score of A and B, labeled as $S_{match}^{A,B}$, is

\[
S_{match}^{A,B} = \sum_{n=1}^{N_A} \sum_{m=1}^{M_n} \left( e_t^{n,m} w_t + e_c^{n,m} w_c + e_i^{n,m} w_i + e_o^{n,m} w_o \right),
\]

where $e_t^{n,m}$, $e_c^{n,m}$, $e_i^{n,m}$ and $e_o^{n,m}$ represent the number of equal selectors with the value $Value_m^n$ that belong to the Tag, Class, ID, and Others categories, respectively.

Definition 4: (Similarity) Given the comparison-units of the web pages A and B, the Similarity between A and B is

\[
Sim(A, B) = \frac{\mathrm{match\ score}(A, B)}{\min\{\mathrm{score}(A), \mathrm{score}(B)\}} = \frac{S_{match}^{A,B}}{\min\{S_A, S_B\}}.
\]

Based on our analysis of phishing pages, the ID and Class selectors have more influence on visual layout similarity. Generally, different web pages should have different ID and Class selectors, especially when an ID selector has an unusual name.

Summary of our approach. Our visual-layout-similarity-based phishing detection scheme includes three phases.

Phase I: Extracting and normalizing the CSS structure of the suspicious page: Given a suspicious page Sus, we can get the CSS structure of the page, CSS(Sus). Then we convert CSS(Sus) into the normalized model, the comparison-unit of web page Sus, CompUnit(Sus).

Phase II: Computing the similarity between the suspicious page and the victim page: After we obtain the normalized model CompUnit(Sus), we match the two comparison-units of the suspicious page and the victim page, and compute the similarity score of the two pages, Sim(Sus, Vic).

Phase III: Making a decision based on the page similarity and additional features of the web pages: If the similarity Sim(Sus, Vic) is beyond a preset threshold, the suspicious page and the victim page should be the same page. If there is other evidence proving that these two pages are different, for example, the URLs of the two pages have different domains, we conclude that the suspicious page Sus is a phishing page and output our decision.

This is a first step toward our high-level idea of detecting page similarity using fundamental page features. It confirms our assumptions on the role of CSS in detecting phishing attacks.

B. BaitAlarm Architecture

The overall architecture of the BaitAlarm extension is shown in Figure 1. BaitAlarm includes three main components: Pre-Processor, Layout Monitor, and Network Library.

[Figure 1: BaitAlarm Architecture. Recoverable labels: Similarity Checker, Layout Model Builder, <Similarity Score>, <Comparison Unit>, <Decision>, <Page Info.>, <Account Info.-Webpage Mapping Table>.]

The Pre-Processor consists of the Page Filter, the DOM, and the HTML Parser. After a web page is loaded, the Page Filter checks it. If the web page has been loaded before, it does not need further analysis. If the loaded page is new and contains some specific UI (e.g., a login form), the Page Filter triggers the detection process. The HTML Parser and the DOM extract the layout information of the suspicious page. When the user inputs personal information, such as a login ID, the browser holds the page and the Pre-Processor sends the layout information to the Layout Monitor.

The Layout Monitor consists of a Layout Model Builder and a Similarity Checker. When the Layout Monitor gets the layout information of the suspicious page from the Pre-Processor, the Layout Model Builder models it as a "comparison-unit" and sends it to the Similarity Checker, together with additional page features (e.g., page domain). After the Similarity Checker gets the comparison unit of the suspicious page, it searches the Network Library for the victim pages' feature model (comparison unit) indexed
by the same personal information that the user has entered before.

If the Similarity Checker does not find a matched page, it informs the browser to release the page and treats it as a new web site the user is registering with. The Similarity Checker reports the page information and its layout model to the Network Library.

If the Similarity Checker finds the matched page (or pages), it gets its (their) layout model and additional page information. The checker calculates the similarity score of the pages and outputs the decision based on their similarity score and additional page information.

In our scheme, if a page's similarity score is less than the preset threshold, the page is innocent. The browser then releases the page, and the Similarity Checker reports the page information and its layout model to the Network Library. Otherwise, the Similarity Checker checks additional page information to make the decision. For example, if the pages have a relatively high similarity but their URLs have different domains, the suspicious page is regarded as a phishing page. The checker will submit the related information to the Network Library and inform the browser to pop up a warning page.

The Network Library maintains the user's surfing history information (e.g., URL, layout model), a Whitelist/Blacklist, and a "Personal Info-Historical Page Mapping Table". The table is used to search for the victim pages based on users' information captured by the browser.

IV. IMPLEMENTATION AND EVALUATION

We developed BaitAlarm as an extension in the Google Chrome browser and used it to implement real-time phishing detection.

Our evaluation is performed on a computer with an Intel(R) Core(TM)2 Duo CPU (3.00GHz) and 2GB of memory. We used Google Chrome v21.0.1180.15. The phishing pages are collected from Phishtank.com, and the sample data set consists of 7764 phishing sites.

A. Training and Threshold Determining

We analyzed the 7764 phishing samples in our dataset and counted the forging frequency of the specific victim pages. The statistical result is shown in Table I. We can see that Paypal is the most popular target page for phishing attacks (with a forging ratio of 25.48%). The next three most popular phishing target pages are Sulake Corp., AOL, and Blizzard. In total, 46.41% of the phishing samples target these four websites. We use them and their phishing pages for BaitAlarm system training and threshold adjustment.

Table I: Forging frequency of victim pages
Target Page      Number   Ratio
Paypal           1978     25.48%
Sulake Corp.     1029     13.25%
AOL              329      4.24%
Blizzard         267      3.44%
Orkut            207      2.67%
Cielo            167      2.15%
Tibia            162      2.10%
Facebook         109      1.40%
Other            3516     45.28%

Similarity between phishing pages and their target page

First, we analyzed the similarity of the phishing pages and their victim page. We use the Paypal site and the Paypal-phishing pages in the phishtank database as the test samples. There are 1680 phishing Paypal login pages in total in the database. Among them, 784 pages were no longer available online, and 396 pages have a different visual layout from the Paypal site, for which the similarity ratio reported by BaitAlarm ranges from 0 to 0.216.

We analyzed the remaining 510 Paypal-phishing pages in the database and show their similarity ratios measured by BaitAlarm in Table II. We can see that 82.35% of the Paypal-phishing pages got a similarity score of 1. Only 0.59% of the pages have similarity scores less than 0.3. According to our manual analysis, pages with a similarity score less than 0.3 are visually different from the Paypal page, and users can distinguish them easily.

Table II: Similarity ratios of Paypal-phishing pages measured by BaitAlarm
Similarity p       Number   Ratio
0.24 < p < 0.3     3        0.59%
0.3 ≤ p < 0.4      16       3.14%
0.4 ≤ p < 0.6      9        1.76%
0.6 ≤ p < 0.8      20       3.92%
0.8 ≤ p < 1        42       8.24%
p = 1              420      82.35%

For AOL websites, we ran the same experiment on 276 samples selected from phishtank.com that were labeled as AOL-phishing pages. The visual appearance of 242 pages was distinct from AOL. For the remaining 36 phishing pages, BaitAlarm reported a similarity score of 1.

Similarity to Other Web Pages

We ran experiments to study false positives by measuring the similarity between other web pages and some target pages (without loss of generality, we took Paypal as the target page). In this experiment, we randomly chose 302 web pages that include university websites, government homepages, e-business websites, and social network sites. We show the results in Table III. 86.30% of the benign pages have a similarity score less than 0.04, and no benign page has a similarity score beyond 0.18.

We also tested the similarity score between phishing pages and their non-target pages. In this case, we randomly selected 276 phishing pages cited by phishtank.com
Table III: Similarity between benign web pages and the target page (Paypal)
Similarity     Number   Ratio
0-0.04         252      86.30%
0.04-0.08      20       6.85%
0.08-0.12      9        3.08%
0.12-0.18      11       3.77%
0.18-1         0        0
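The similarity scores reported in this evaluation follow Definitions 2 to 4. The Python sketch below shows one way those three quantities could be computed. It rests on several assumptions that are not stated in the text: a comparison-unit is modeled as a dictionary from (property, value) pairs to per-category sets of selector names, "equal selectors" are taken to be selector names present in both pages for the same property value, and the weights in `WEIGHTS` are hypothetical values chosen only so that ID and Class selectors count more than the other categories.

```python
from typing import Dict, Set, Tuple

# Assumed model of a comparison-unit:
# (property, value) -> category -> set of selector names.
CompUnit = Dict[Tuple[str, str], Dict[str, Set[str]]]

# Hypothetical weights w_t, w_c, w_i, w_o; the actual values are not given.
WEIGHTS = {"tag": 1.0, "class": 2.0, "id": 3.0, "other": 1.0}


def complexity_score(unit: CompUnit) -> float:
    """S_A: weighted count of selectors over all property values."""
    return sum(
        WEIGHTS[category] * len(names)
        for categories in unit.values()
        for category, names in categories.items()
    )


def match_score(a: CompUnit, b: CompUnit) -> float:
    """S_match: weighted count of selectors that are equal in A and B."""
    total = 0.0
    for key, categories_a in a.items():
        categories_b = b.get(key, {})
        for category, names_a in categories_a.items():
            shared = names_a & categories_b.get(category, set())
            total += WEIGHTS[category] * len(shared)
    return total


def similarity(a: CompUnit, b: CompUnit) -> float:
    """Sim(A, B) = S_match / min(S_A, S_B); 0.0 if either page is empty."""
    denominator = min(complexity_score(a), complexity_score(b))
    return match_score(a, b) / denominator if denominator else 0.0


target: CompUnit = {
    ("color", "red"): {"tag": {"h1"}, "id": {"#login-box"}},
    ("width", "300px"): {"id": {"#login-box"}},
}
unrelated: CompUnit = {
    ("color", "red"): {"tag": {"h1"}, "class": {".nav"}},
    ("margin", "0"): {"tag": {"body"}},
}

print(similarity(target, target))     # 1.0
print(similarity(target, unrelated))  # 0.25
```

A page that copies the target's style sheet verbatim scores 1 under this model, which matches the behavior reported for most Paypal-phishing pages.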
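The Phase III decision, in which a similarity score is combined with additional page information, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the returned labels, and the threshold of 0.2 are hypothetical (the value is merely chosen to lie between the highest benign score range and the lowest score range of visually similar phishing pages in the tables), and comparing host names is a simplification of comparing domains.

```python
from urllib.parse import urlsplit

# Hypothetical threshold; the preset value used by BaitAlarm is not stated.
THRESHOLD = 0.2


def same_host(url_a: str, url_b: str) -> bool:
    """Compare host names as a simplified stand-in for a domain check."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


def decide(sim: float, suspicious_url: str, victim_url: str,
           threshold: float = THRESHOLD) -> str:
    """Return 'innocent', 'same-site', or 'phishing' for a suspicious page."""
    if sim < threshold:
        # Layout differs enough from the victim page: release the page.
        return "innocent"
    if same_host(suspicious_url, victim_url):
        # Similar layout served from the same host: treat as the same site.
        return "same-site"
    # Similar layout but a different host: warn the user.
    return "phishing"


print(decide(1.0, "http://login.example.net/signin",
             "https://www.paypal.com/signin"))
```

A production check would more likely compare registrable domains, which requires a public suffix list and is outside the scope of this sketch.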
Reputation Scoring: WOT [18] and iTrustPage [4], [5] aim to rate a page on the possibility of phishing using reputation scores, which are either reported by the anti-phishing community or computed from the given web page. Nevertheless, the two approaches listed above are user assisted, and WOT's rating scheme is based on the subjective comments submitted by the users.

Unlike the anti-phishing methods discussed above, BaitAlarm is based on the fundamental display features of web pages. These features are monitored by browsers and are automatically treated as objective metrics in phishing web-page detection. Compared to BaitAlarm, whitelist-based techniques cannot identify newly set-up benign web pages, and a whitelist needs to be updated manually with a learning and verification period, which might cause a high false positive rate.

VI. CONCLUSION

Phishing is a popular social engineering attack used by attackers to collect sensitive information from victim users. This paper introduces a novel antiphishing approach, BaitAlarm, which is based on efficient similarity comparison between the suspicious page and the target page. In particular, BaitAlarm uses CSS and related elements to represent visual features of a web page. Our evaluation using a large number of phishing pages supports the key idea of our approach. In future work, we will work on improving BaitAlarm's resilience to evasion attacks.

Acknowledgment. The authors thank the anonymous reviewers for their insightful comments. This work was supported in part by the Beijing Natural Science Foundation (No. 4132056), the National Key Basic Research Program (NKBRP) (973 Program) (No. 2012CB315905), the Beijing Natural Science Foundation (No. 4122024), and the National Natural Science Foundation of China (No. 61272501, 61173154, 61003214).

REFERENCES

[1] APWG, "Investigation report," http://www.antiphishing.org/reports/apwg_trends_report_h2_2011.pdf, 2011.
[6] I. Fette, N. Sadeh, and A. Tomasic, "Learning to detect phishing emails," in Proceedings of the International World Wide Web Conference (WWW), May 2007.
[7] Y. Zhang, J. Hong, and L. Cranor, "Cantina: A content-based approach to detecting phishing web sites," in Proceedings of the International World Wide Web Conference (WWW), May 2007.
[8] A. Nourian, S. Ishtiaq, and M. Maheswaran, "Castle: A social framework for collaborative anti-phishing databases," ACM Transactions on Internet Technology, 2009.
[9] C. Y., H. W., and Y. Le, "Anti-phishing based on automated individual white-list," in Proceedings of the 4th ACM Workshop on Digital Identity Management, 2008, pp. 51–60.
[10] D. Xiaotie, H. Guanglin, and F. A.Y., "An antiphishing strategy based on visual similarity assessment," Internet Computing, vol. 10, no. 2, pp. 58–65, 2006.
[11] L. Wenyin and D. Xiaotie, "Detecting phishing web pages with visual similarity assessment based on earth mover's distance," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 4, pp. 301–311, 2006.
[12] W3CSchool, "Css tutorial-w3cschool," http://www.w3schools.com/css/.
[13] SpoofStick, "Spoofstick," http://www.corestreet.com/spoofstick/.
[14] E. Medvet, E. Kirda, and C. Kruegel, "Visual-similarity-based phishing detection," in Proceedings of SecureComm 2008. ACM, September 2008.
[15] T.-C. Chen, S. Dick, and J. Miller, "Detecting visually similar web pages: Application to phishing detection," ACM Transactions on Internet Technology, vol. 10, no. 2, pp. 1–38, May 2010.
[16] D. Boneh, "Spoofguard," http://crypto.stanford.edu/SpoofGuard.
[17] L. Wenyin, N. Fang, X. Quan, B. Qiu, and G. Liu, "Discovering phishing target based on semantic link network," Future Generation Computer Systems, no. 26, pp. 381–388, 2010.
[18] WOT, "Web of trust," http://www.antiphishing.org/reports/apwg_trends_report_h2_2011.pdf, 2011.